I was randomly getting hard drive error messages in my kernel log. UNC stands for UNCorrectable error, which is not good. The messages looked like this:
Feb 17 10:40:34 server04 kernel: ata1.00: cmd 60/f8:08:85:fe:dd/00:00:18:00:00/40 tag 1 ncq 126976 in
Feb 17 10:40:34 server04 kernel: res 40/00:10:7d:00:de/00:00:18:00:00/40 Emask 0x1 (device error)
Feb 17 10:40:34 server04 kernel: ata1.00: status: { DRDY }
Feb 17 10:40:34 server04 kernel: ata1.00: cmd 60/08:10:7d:00:de/00:00:18:00:00/40 tag 2 ncq 4096 in
Feb 17 10:40:34 server04 kernel: res 40/00:10:7d:00:de/00:00:18:00:00/40 Emask 0x1 (device error)
Feb 17 10:40:34 server04 kernel: ata1.00: status: { DRDY }
Feb 17 10:40:39 server04 kernel: ata1.00: cmd 60/08:18:7d:ff:dd/00:00:18:00:00/40 tag 3 ncq 4096 in
Feb 17 10:40:39 server04 kernel: res 40/00:10:7d:00:de/00:00:18:00:00/40 Emask 0x1 (device error)
Feb 17 10:40:39 server04 kernel: ata1.00: status: { DRDY }
and
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: BMDMA stat 0x25
ata1.00: cmd c8/00:08:8f:01:3c/00:00:00:00:00/e1 tag 0 dma 4096 in
res 51/40:00:8f:01:3c/00:00:00:00:00/01 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: BMDMA stat 0x25
ata1.00: cmd c8/00:08:8f:01:3c/00:00:00:00:00/e1 tag 0 dma 4096 in
res 51/40:00:8f:01:3c/00:00:00:00:00/01 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
ata1: EH complete
and
[ 6308.750919] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 6308.750930] ata1.00: BMDMA stat 0x65
[ 6308.750939] ata1.00: cmd c8/00:08:27:45:94/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 6308.750942] res 51/40:08:28:45:94/00:00:00:00:00/e0 Emask 0x9 (media error)
[ 6308.750948] ata1.00: status: { DRDY ERR }
[ 6308.750951] ata1.00: error: { UNC }
[ 6308.757809] ata1: nv_mode_filter: 0x3f01f&0x3f01f->0x3f01f, BIOS=0x3f000 (0xc600c0c0) ACPI=0x3f01f (20:600:0x13)
[ 6308.773842] ata1.00: configured for UDMA/100
[ 6308.774057] ata1: EH complete
[ 6315.576605] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 6315.576616] ata1.00: BMDMA stat 0x65
[ 6315.576627] ata1.00: cmd c8/00:08:27:45:94/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 6315.576629] res 51/40:08:28:45:94/00:00:00:00:00/e0 Emask 0x9 (media error)
[ 6315.576635] ata1.00: status: { DRDY ERR }
[ 6315.576638] ata1.00: error: { UNC }
[ 6315.588519] ata1: nv_mode_filter: 0x3f01f&0x3f01f->0x3f01f, BIOS=0x3f000 (0xc600c0c0) ACPI=0x3f01f (20:600:0x13)
[ 6315.608523] ata1.00: configured for UDMA/100
[ 6315.608711] ata1: EH complete
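If you want a quick tally of how often each drive is reporting these errors, something like the following might help. It is only a rough sketch: the function name summarise_ata_errors is made up for this example, and it assumes your kernel prints lines in the same "ataN.NN: error: { ... }" form as the logs above.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints one line
# per device/flag combination in the form "<count> <device> <flags>".
summarise_ata_errors() {
    # Keep only lines like "ata1.00: error: { UNC }" and reduce them
    # to "ata1.00 UNC", then count identical lines.
    sed -n 's/.*\(ata[0-9][0-9]*\.[0-9][0-9]*\): error: { \(.*\) }.*/\1 \2/p' \
        | sort \
        | uniq -c \
        | awk '{ $1 = $1; print }'   # collapse the padding uniq -c adds
}
```

You could feed it the kernel ring buffer with "dmesg | summarise_ata_errors", or pipe in a saved log file instead.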
I installed the smartmontools package and ran its smartctl utility:
apt-get install smartmontools
smartctl -a /dev/sda
(substitute your own hard drive for /dev/sda)
You’ll see output like:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 144 139 021 Pre-fail Always - 3325
4 Start_Stop_Count 0x0032 094 094 040 Old_age Always - 6229
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 253 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 084 084 000 Old_age Always - 11802
10 Spin_Retry_Count 0x0013 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0013 100 100 051 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 094 094 000 Old_age Always - 6203
194 Temperature_Celsius 0x0022 111 253 000 Old_age Always - 39
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0009 200 155 051 Pre-fail Offline - 0
A non-zero RAW_VALUE (the last column) on a retry, error, or pending-sector attribute can mean bad sectors!
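To avoid eyeballing the table every time, a small filter along these lines could flag the worrying rows. Treat it as a sketch only: the function name flag_bad_sector_attrs is invented here, the choice of which attributes to watch is my own assumption (attribute names vary by drive vendor), and it assumes the ten-column layout shown above, with the raw value in the last column.

```shell
# Hypothetical helper: reads "smartctl -A" style attribute rows on stdin
# and prints "<attribute> = <raw value>" for watched attributes whose
# raw value is greater than zero.
flag_bad_sector_attrs() {
    awk '
        # Column 2 is ATTRIBUTE_NAME, column 10 is RAW_VALUE.
        # The attribute list below is an assumption, not an official set.
        $2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
            print $2 " = " $10
        }
    '
}
```

For example, "smartctl -A /dev/sda | flag_bad_sector_attrs" prints nothing for the healthy-looking table above, since all four watched attributes have a raw value of 0.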
Luckily, if you’re running ext4 or another journaling file system, the file system may recover from these errors well enough for you to back up your machine, set up a RAID, and get on with life!
You might run into posts that say that adding a kernel module option like:
"options libata noacpi=1" to /etc/modprobe.d/options
will stop errors of these types from showing up again, but hiding errors is not a good idea. The errors above are symptoms of a failing drive, and it should be replaced immediately.
Even if the routine fsck reports no problems, you should force a full fsck check. See my other post on running fsck.