banjo Posted October 25, 2011

Today, my Mandriva 2010.2 system failed to boot. It has been fine since I installed it last Spring, and today, when I was not at home, it failed to boot with a bunch of ATA error messages. My family did not know what to do, so they just shut it off and told me when I got home.

I just looked at /var/log/messages and found about a dozen of the following sequences of errors over the previous two days:

Oct 24 14:13:56 localhost init: Switching to runlevel: 0
Oct 24 14:14:29 localhost kernel: ata4: lost interrupt (Status 0x50)
Oct 24 14:14:29 localhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 24 14:14:29 localhost kernel: ata4.00: failed command: READ DMA
Oct 24 14:14:29 localhost kernel: ata4.00: cmd c8/00:28:cf:2f:d5/00:00:00:00:00/e2 tag 0 dma 20480 in
Oct 24 14:14:29 localhost kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 24 14:14:29 localhost kernel: ata4.00: status: { DRDY }
Oct 24 14:14:29 localhost kernel: ata4: soft resetting link
Oct 24 14:14:29 localhost kernel: ata4.00: configured for UDMA/33
Oct 24 14:14:29 localhost kernel: ata4.00: device reported invalid CHS sector 0
Oct 24 14:14:29 localhost kernel: ata4: EH complete

I have posted this to hardware because that is what it looks like to me. However, I have Googled some of these errors, and some folks said that they fixed this with a kernel upgrade, so it may be software. Others say that it is indicative of a hard disk failure. Others have said it is due to a faulty ATA cable.

The disk is a Seagate Barracuda ST3500418AS 500GB 7200 RPM SATA hooked into an old Foxconn motherboard whose model number I have forgotten. The Seagate drive was bought new last Spring to replace another drive that failed. The motherboard has been working OK for about 3 years now.

I jiggled the SATA cable on both the motherboard and the disk drive to reseat it, and the computer booted OK.

Does anybody recognize these errors and have any idea what might be the problem? Kernel issue? Failing disk drive? Noisy cable?

Thanks
Banjo (_)=='=~
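To see how often these sequences recur and on which port, the log can be tallied rather than read by eye. The following is only a minimal sketch: the function name `summarize_ata_errors` and the two regular expressions are assumptions written against the exact line shapes quoted in the post above, and other kernels may word these messages differently.

```python
import re
from collections import Counter

# Matches lines such as "... kernel: ata4.00: failed command: READ DMA".
FAILED_COMMAND = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: failed command: (.+)")
# Matches the result line that carries the error class, e.g. "Emask 0x4 (timeout)".
RESULT_REASON = re.compile(r"kernel:\s+res .* Emask 0x[0-9a-f]+ \((.+)\)")


def summarize_ata_errors(lines):
    """Count failed commands per (port, command) and error classes per reason."""
    commands = Counter()
    reasons = Counter()
    for line in lines:
        line = line.rstrip()
        failed = FAILED_COMMAND.search(line)
        if failed:
            commands[(failed.group(1), failed.group(2).strip())] += 1
        reason = RESULT_REASON.search(line)
        if reason:
            reasons[reason.group(1)] += 1
    return commands, reasons


# Three lines copied from the log quoted in the post.
SAMPLE_LOG = [
    "Oct 24 14:14:29 localhost kernel: ata4: lost interrupt (Status 0x50)",
    "Oct 24 14:14:29 localhost kernel: ata4.00: failed command: READ DMA",
    "Oct 24 14:14:29 localhost kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
]

commands, reasons = summarize_ata_errors(SAMPLE_LOG)
print(commands)
print(reasons)
```

On a live system the same function could be fed an open file, for example `summarize_ata_errors(open("/var/log/messages"))`. Errors that all land on one port with the same reason point at that port's drive, cable, or controller rather than at the system as a whole.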
ianw1974 Posted October 25, 2011

I did a search on "failed command: READ DMA".

This thread hints at motherboard failure: http://ubuntuforums.org/showthread.php?t=1497742

and this one gives you a test you can run with smartctl, looking for reallocated sectors to see if it's a drive problem: http://lime-technology.com/forum/index.php?topic=5933.0

Reseating your SATA cable seemed to help, but you could also replace the SATA cable completely if you have a spare, to rule it out of the equation. I suggest changing that first, then running the disk check. Your disk is new, but that doesn't mean it cannot fail. Your motherboard is much older, but that does not necessarily mean the motherboard is at fault.

Also head into your BIOS and load the optimised defaults in case the BIOS is misconfigured, then run the disk test. If there are no reallocated sectors, then it's unlikely to be your disk.
isadora Posted October 25, 2011

Though I see you probably have a different type of Seagate Barracuda, I want to make you aware of the serious trouble I had with two of my Barracudas not so long ago. In the end it turned out to be a firmware problem in the drive, and after a terrible procedure from desk to desk, and from form to form, Seagate ended up sending me a refurbished disk. Extended information concerning that issue is to be found at: http://www.theregist...failure_plague/

I hope it didn't hit you, but otherwise it gives a nice insight into what can happen to your HDD.

Edit: some more information concerning this matter at: http://www.channelregister.co.uk/2009/01/21/seagate_firmware_fix_breaks_barracudas/

Edited October 25, 2011 by isadora
ianw1974 Posted October 25, 2011

Bit extreme, doesn't sound like his problem. They occurred in 2009, he bought in 2010 so I expect Seagate fixed everything by that time :). I've got Seagate 500GB disks with 32MB cache, and I have no problems and I don't intend on upgrading the firmware.

Anyway, don't attempt to upgrade the firmware on your drives unless you feel you really need to. I think your problem is somewhere other than the firmware, considering the drives have been working perfectly fine up until now.
daniewicz Posted October 26, 2011

Since Ian is talking about settings in the BIOS, I would go one step further and clear the CMOS. There should be a jumper on your motherboard that can be used to clear the CMOS in case it has become corrupted. If you can't find this jumper, you could remove the motherboard battery for a spell.
banjo Posted October 26, 2011

Thanks to everybody for their answers. Lots of good ideas to look at.

I booted the computer around 18:30 local time yesterday and left it running all night. It was used all day today, and there were no reported problems with it. It is now about 21:30 local time, so it has been running for more than 24 hours. I looked at the messages log and found no more incidents of DMA errors reported. I think that the SATA cable was rectifying some noise on a connector or something... or was just a bit loose.

I think I am going to leave the computer alone for a while and watch the log. Maybe this weekend I can find some time to check out the BIOS and the disk drive to see if there are any clues. I do not have smartctl installed. I will look for it.

Banjo (_)=='=~
banjo Posted October 26, 2011

OK. That was easy. I installed the smart tools package and ran smartctl against the disk. It passed with one error reported about air flow temperature. I am not sure how to read this stuff so I need to do some studying. Here is the output if anyone is interested.

Banjo (_)=='=~

smartctl 5.39.1 2010-01-28 r3054 [i586-mandriva-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12 family
Device Model:     ST3500418AS
Serial Number:    9VMTJ26F
Firmware Version: CC46
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Oct 25 21:54:48 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x82)
        Offline data collection activity was completed without error.
        Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
        The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 609) seconds.
Offline data collection capabilities: (0x7b)
        SMART execute Offline immediate.
        Auto Offline data collection on/off support.
        Suspend Offline collection upon new command.
        Offline surface scan supported.
        Self-test supported.
        Conveyance Self-test supported.
        Selective Self-test supported.
SMART capabilities: (0x0003)
        Saves SMART data before entering power-saving mode.
        Supports SMART auto save timer.
Error logging capability: (0x01)
        Error logging supported.
        General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 89) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f)
        SCT Status supported.
        SCT Feature Control supported.
        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always   -           128748105
  3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           498
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always   -           15596389
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always   -           2168
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           249
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always   -           0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always   -           340
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   065   044   045    Old_age   Always   In_the_past 35 (0 3 35 20)
194 Temperature_Celsius     0x0022   035   056   000    Old_age   Always   -           35 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   034   023   000    Old_age   Always   -           128748105
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           34922379086533
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline  -           871942731
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline  -           235388775

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
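For anyone else unsure how to read the attribute table, one rough way in is to parse the rows and pull out only the ones worth a second look. The sketch below is an illustration, not an official smartmontools interface: the helper names `parse_attributes` and `needs_attention` are hypothetical, and the choice of attributes 5, 197 and 198 as the ones to watch for a nonzero raw count is an assumption based on common practice. It reads the ten-column layout shown in the output above.

```python
# Attribute IDs whose raw value is a plain sector count (assumed watch list).
WATCH_RAW = {5, 197, 198}  # reallocated, pending, offline-uncorrectable


def parse_attributes(text):
    """Parse smartctl attribute rows into dicts; lines that are not rows are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        })
    return rows


def needs_attention(rows):
    """Names of attributes that failed at some point or report bad sectors."""
    flagged = []
    for row in rows:
        failed = row["when_failed"] != "-"
        bad_sectors = row["id"] in WATCH_RAW and int(row["raw"].split()[0]) > 0
        if failed or bad_sectors:
            flagged.append(row["name"])
    return flagged


# Five rows copied from the output above.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
190 Airflow_Temperature_Cel 0x0022 065 044 045 Old_age Always In_the_past 35 (0 3 35 20)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

flagged = needs_attention(parse_attributes(SAMPLE))
print(flagged)
```

On these rows the sketch picks out the airflow temperature entry that smartctl itself marked `In_the_past`, plus the two attributes reporting a single pending or uncorrectable sector. Whether one such sector matters is a judgment call; running a self-test with `smartctl -t`, as the output suggests, would give more to go on.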
banjo Posted October 26, 2011

Not to beat a dead horse or anything, but on a lark I ran smartctl on my old disk, which I replaced last Spring because it was acting up so badly the computer was fairly useless.

Interestingly, the result (posted below) passes that old disk as well, but with "ATA Error Count: 29633". I wonder if I replaced the disk when the trouble was really in the cable. The disk is now in a USB external case and mounts OK.

Banjo (_)=='=~

smartctl 5.39.1 2010-01-28 r3054 [i586-mandriva-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3250410AS
Serial Number:    6RY6W554
Firmware Version: 3.AAF
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Oct 25 22:24:46 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82)
        Offline data collection activity was completed without error.
        Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0)
        The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 430) seconds.
Offline data collection capabilities: (0x5b)
        SMART execute Offline immediate.
        Auto Offline data collection on/off support.
        Suspend Offline collection upon new command.
        Offline surface scan supported.
        Self-test supported.
        No Conveyance Self-test supported.
        Selective Self-test supported.
SMART capabilities: (0x0003)
        Saves SMART data before entering power-saving mode.
        Supports SMART auto save timer.
Error logging capability: (0x01)
        Error logging supported.
        General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 64) minutes.
SCT capabilities: (0x0001)
        SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always   -           1108
  5 Reallocated_Sector_Ct   0x0033   073   073   036    Pre-fail  Always   -           1111
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always   -           94882296
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always   -           13294
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always   -           1113
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -           103
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   075   051   045    Old_age   Always   -           25 (Lifetime Min/Max 19/49)
194 Temperature_Celsius     0x0022   025   049   000    Old_age   Always   -           25 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   073   046   000    Old_age   Always   -           35281440
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   087   000    Old_age   Always   -           29446
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline  -           0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always   -           0

SMART Error Log Version: 1
ATA Error Count: 29633 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 29633 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 67 79 36 42 e1  Error: ICRC, ABRT 103 sectors at LBA = 0x01423679 = 21116537

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 78 68 36 42 e1 00      00:21:06.493  READ DMA
  c8 00 08 78 42 42 e1 00      00:21:06.493  READ DMA
  c8 00 58 d8 42 42 e1 00      00:21:06.493  READ DMA
  c8 00 08 60 4c 42 e1 00      00:21:06.492  READ DMA
  c8 00 08 70 42 42 e1 00      00:21:06.471  READ DMA

Error 29632 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 6f 31 51 42 e1  Error: ICRC, ABRT 111 sectors at LBA = 0x01425131 = 21123377

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 a0 00 51 42 e1 00      00:21:04.375  READ DMA
  ec 00 00 00 00 00 a0 00      00:21:05.813  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:21:05.810  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00      00:21:05.807  IDENTIFY DEVICE
  00 00 a0 00 00 00 00 04      00:21:05.804  NOP [Abort queued commands]

Error 29631 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 5f 41 51 42 e1  Error: ICRC, ABRT 95 sectors at LBA = 0x01425141 = 21123393

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 a0 00 51 42 e1 00      00:21:04.375  READ DMA
  ec 00 00 00 00 00 a0 00      00:21:03.869  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:21:03.866  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00      00:21:03.863  IDENTIFY DEVICE
  00 00 a0 00 00 00 00 04      00:21:03.860  NOP [Abort queued commands]

Error 29630 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 6f 31 51 42 e1  Error: ICRC, ABRT 111 sectors at LBA = 0x01425131 = 21123377

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 a0 00 51 42 e1 00      00:21:04.375  READ DMA
  ec 00 00 00 00 00 a0 00      00:21:03.869  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:21:03.866  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00      00:21:03.863  IDENTIFY DEVICE
  00 00 a0 00 00 00 00 04      00:21:03.860  NOP [Abort queued commands]

Error 29629 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 1f 81 51 42 e1  Error: ICRC, ABRT 31 sectors at LBA = 0x01425181 = 21123457

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 a0 00 51 42 e1 00      00:21:02.523  READ DMA
  ec 00 00 00 00 00 a0 00      00:21:03.869  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:21:03.866  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00      00:21:03.863  IDENTIFY DEVICE
  00 00 a0 00 00 00 00 04      00:21:03.860  NOP [Abort queued commands]

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
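The register dumps in the error log can be checked by hand, which helps when deciding whether the errors are transfer errors or media errors. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the register layout is the 28-bit LBA scheme used by the READ DMA (0xc8) command, and the bit table covers only four error-register bits rather than the full set.

```python
# Partial map of ATA error-register bits; only these four are decoded here.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error_register(er):
    """Return the names of the known bits that are set in the ER byte."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


def lba28_from_registers(sn, cl, ch, dh):
    """Combine task-file registers into a 28-bit LBA.

    SN, CL and CH hold LBA bits 0-7, 8-15 and 16-23. Only the low nibble
    of DH holds LBA bits 24-27; its high nibble carries mode and device flags.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers from "Error 29633" above: ER=84 SC=67 SN=79 CL=36 CH=42 DH=e1
flags = decode_error_register(0x84)
sectors = 0x67
lba = lba28_from_registers(0x79, 0x36, 0x42, 0xE1)
days, hours = divmod(13290, 24)  # power-on lifetime in hours

print(flags, sectors, hex(lba), lba, days, hours)
```

The decoded values agree with what smartctl printed for that entry: ICRC and ABRT, 103 sectors, LBA 0x01423679 = 21116537, and 553 days + 18 hours. ICRC marks a CRC failure on the interface during a UDMA transfer, so these five entries, together with the UDMA_CRC_Error_Count raw value of 29446, are the kind of evidence that points toward the cable or connection. They say nothing either way about the separate reallocated-sector count.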
ianw1974 Posted October 26, 2011

You do have a problem with that old disk:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   073   073   036    Pre-fail  Always   -           1111
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always   -           94882296
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -           103

so you were right to replace it. But also, considering the reallocated sectors and the reported uncorrectable errors, I'd not use that disk for anything important.
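As a rough guide to reading the VALUE, WORST and THRESH columns: the normalized value generally counts down toward the threshold as an attribute degrades, and smartctl reports an attribute as failed once the value is at or below its threshold. The sketch below illustrates that comparison only. The `Attribute` tuple and helper names are hypothetical, and how each vendor derives the normalized value from the raw count is not something this sketch tries to model.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    name: str
    value: int   # current normalized value
    worst: int   # lowest normalized value recorded so far
    thresh: int  # vendor failure threshold
    raw: int     # vendor-specific raw counter


def headroom(attr):
    """Distance between the current normalized value and the threshold."""
    return attr.value - attr.thresh


def failing_now(attr):
    return attr.value <= attr.thresh


def failed_in_past(attr):
    return attr.worst <= attr.thresh


# Rows quoted above for the old disk, plus the airflow row from the new disk.
old_disk = [
    Attribute("Reallocated_Sector_Ct", 73, 73, 36, 1111),
    Attribute("Seek_Error_Rate", 79, 60, 30, 94882296),
    Attribute("Reported_Uncorrect", 1, 1, 0, 103),
]
airflow = Attribute("Airflow_Temperature_Cel", 65, 44, 45, 35)

for attr in old_disk:
    print(attr.name, "headroom:", headroom(attr), "raw:", attr.raw)
print(airflow.name, failing_now(airflow), failed_in_past(airflow))
```

This is why the old disk can still show PASSED: none of its normalized values has reached its threshold, even though the raw counts of 1111 reallocated sectors and 103 reported uncorrectable errors are far from healthy. It also accounts for the `In_the_past` flag on the new disk, whose worst airflow value of 044 dipped below the threshold of 045 at some point.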
banjo Posted October 26, 2011

Wow. Thanks for pointing that out. I did not notice it because I don't know yet how to read the numbers.

Banjo (_)=='=~