banjo

Failure to Boot


Today my Mandy 2010.2 system failed to boot. It has been fine since I installed it last spring, but today, while I was not at home, it stopped during boot with a bunch of ATA error messages. My family did not know what to do, so they just shut it off and told me when I got home. :angry:

 

I just looked at /var/log/messages and found about a dozen of the following sequences of errors over the previous two days:

 

Oct 24 14:13:56 localhost init: Switching to runlevel: 0
Oct 24 14:14:29 localhost kernel: ata4: lost interrupt (Status 0x50)
Oct 24 14:14:29 localhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 24 14:14:29 localhost kernel: ata4.00: failed command: READ DMA
Oct 24 14:14:29 localhost kernel: ata4.00: cmd c8/00:28:cf:2f:d5/00:00:00:00:00/e2 tag 0 dma 20480 in
Oct 24 14:14:29 localhost kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 24 14:14:29 localhost kernel: ata4.00: status: { DRDY }
Oct 24 14:14:29 localhost kernel: ata4: soft resetting link
Oct 24 14:14:29 localhost kernel: ata4.00: configured for UDMA/33
Oct 24 14:14:29 localhost kernel: ata4.00: device reported invalid CHS sector 0
Oct 24 14:14:29 localhost kernel: ata4: EH complete
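
For anyone trying to read the `cmd` line in that excerpt, here is a rough Python sketch of how it can be decoded. It assumes the usual libata field order (command, feature, sector count, three LBA bytes, five high-order bytes, device register) and a 512-byte sector; `decode_cmd` and the small command table are hypothetical names made up for illustration.

```python
# Assumed field order of the libata "cmd" line (28-bit LBA commands):
#   CMD/FEATURE:NSECT:LBAL:LBAM:LBAH/<five "hob" bytes for 48-bit commands>/DEVICE
ATA_COMMANDS = {0xC8: "READ DMA", 0xCA: "WRITE DMA", 0xEC: "IDENTIFY DEVICE"}
SECTOR_BYTES = 512  # assumed logical sector size


def decode_cmd(field):
    """Decode a token such as 'c8/00:28:cf:2f:d5/00:00:00:00:00/e2'."""
    command, low, _hob, device = field.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    command = int(command, 16)
    device = int(device, 16)
    # 28-bit LBA: the low nibble of the device register holds bits 24-27.
    lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    sectors = nsect or 256  # a count of 0 means 256 sectors for 28-bit commands
    return {
        "command": ATA_COMMANDS.get(command, hex(command)),
        "sectors": sectors,
        "bytes": sectors * SECTOR_BYTES,
        "lba": lba,
    }


info = decode_cmd("c8/00:28:cf:2f:d5/00:00:00:00:00/e2")
print(info["command"], info["sectors"], "sectors,",
      info["bytes"], "bytes at LBA", hex(info["lba"]))
```

The sector count 0x28 is 40 sectors, and 40 x 512 = 20480 bytes, which agrees with the `dma 20480 in` at the end of the same log line. So the entry describes a 20 KB read that timed out, not a read that returned bad data.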

 

I have posted this to hardware because that is what it looks like to me. However, I have Googled some of these errors, and some folks said that they fixed this with a kernel upgrade, so it may be software. Others say that it is indicative of a hard disk failure. Still others say it is due to a faulty ATA cable.

 

The disk is a Seagate Barracuda ST3500418AS 500GB 7200 RPM SATA drive hooked into an old Foxconn MOBO whose model number I have forgotten. The Seagate drive was bought new last spring to replace another drive that failed. The MOBO has been working OK for about 3 years now.

 

I jiggled the SATA cable on both the MOBO and the disk drive to reseat it, and the computer booted OK.

 

Does anybody recognize these errors and have any idea what might be the problem? Kernel issue? Failing disk drive? Noisy cable?

 

Thanks

Banjo

(_)=='=~


I did a search on "failed command: READ DMA".

 

This hints at motherboard failure:

 

http://ubuntuforums.org/showthread.php?t=1497742

 

and this one gives you a test you can run with smartctl, where you look for reallocated sectors to see if it's a drive problem:

 

http://lime-technology.com/forum/index.php?topic=5933.0

 

Reseating your SATA cable seemed to help, but if you have a spare you could also replace the cable completely to rule it out of the equation. I suggest changing that first, then running the disk check. Your disk is new, but that doesn't mean it cannot fail. Your motherboard is much older, but age alone doesn't make it the culprit either.

 

Also head into your BIOS and load the optimised defaults in case the BIOS is misconfigured, then run the disk test. If there are no reallocated sectors, the disk is probably not the cause.


Though I see you probably have a different model of Seagate Barracuda, I want to make you aware of the serious trouble I had with two of my Barracudas not so long ago.

In the end it turned out to be a firmware problem in the drive, and after a terrible procedure from desk to desk and from form to form, Seagate ended up sending me a refurbished disk.

More information about that issue can be found at:

http://www.theregist...failure_plague/

 

I hope it didn't hit you, but otherwise it gives a nice insight into what can happen to your HDD.

 

Edit: some more information concerning this matter at:

http://www.channelregister.co.uk/2009/01/21/seagate_firmware_fix_breaks_barracudas/

Edited by isadora


Bit extreme, and it doesn't sound like his problem. Those firmware failures occurred in 2009 and he bought his drive in 2010, so I expect Seagate had fixed everything by that time :). I've got Seagate 500GB disks with 32MB cache, I have no problems, and I don't intend to upgrade the firmware.

 

Anyway, don't attempt to upgrade the firmware on your drives unless you feel you really need to. I think your problem is somewhere other than the firmware, considering the drive has been working perfectly fine up until now.


Since Ian is talking about settings in the BIOS, I would go one step further and clear the CMOS. There should be a jumper on your motherboard that can be used to clear the CMOS in case it has become corrupted. If you can't find this jumper, you could remove the motherboard battery for a spell.


Thanks to everybody for their answers. Lots of good ideas to look at.

 

I booted the computer around 18:30 local time yesterday and left it running all night. It was used all day today, and there were no reported problems with it. It is now about 21:30 local time, so it has been running for more than 24 hours. I looked at the messages log and found no more incidents of DMA errors reported.

 

I think that the SATA cable was rectifying some noise on a connector or something... or was just a bit loose.

 

I think I am going to leave the computer alone for a while and watch the log. Maybe this weekend I can find some time to check out the BIOS and the disk drive to see if there are any clues.
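
Watching the log by eye gets tedious, so here is a minimal Python sketch of how the failed commands could be tallied per day and per port. It assumes the syslog line layout shown in the excerpt from my first post; the regular expression and the name `count_failed_commands` are made up for illustration and may need adjusting on other systems.

```python
import re
from collections import Counter

# Assumed layout: "<Mon> <day> <hh:mm:ss> <host> kernel: ataN.NN: failed command: <NAME>"
FAILED = re.compile(
    r"^(\w{3}\s+\d+) [\d:]+ \S+ kernel: (ata\d+(?:\.\d+)?): failed command: (.+)$"
)


def count_failed_commands(lines):
    """Return a Counter keyed by (day, port, command) for 'failed command' lines."""
    counts = Counter()
    for line in lines:
        match = FAILED.match(line)
        if match:
            day, port, command = match.groups()
            counts[(day, port, command.strip())] += 1
    return counts


# Sample lines copied from the log excerpt earlier in this thread.
sample = [
    "Oct 24 14:14:29 localhost kernel: ata4: lost interrupt (Status 0x50)",
    "Oct 24 14:14:29 localhost kernel: ata4.00: failed command: READ DMA",
    "Oct 24 14:14:29 localhost kernel: ata4: soft resetting link",
]
counts = count_failed_commands(sample)
for (day, port, command), n in sorted(counts.items()):
    print(day, port, command, n)
```

To run it against the real log, pass an open file instead of the sample list, for example `count_failed_commands(open("/var/log/messages"))`, which normally needs root to read.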

 

I do not have smartctl installed. I will look for it.

 

Banjo

(_)=='=~


OK. That was easy. I installed the smart tools package and ran smartctl against the disk. It passed with one error reported about air flow temperature. I am not sure how to read this stuff so I need to do some studying. Here is the output if anyone is interested.

 

Banjo

(_)=='=~

 

smartctl 5.39.1 2010-01-28 r3054 [i586-mandriva-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12 family
Device Model:     ST3500418AS
Serial Number:    9VMTJ26F
Firmware Version: CC46
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Oct 25 21:54:48 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                       without error or no self-test has ever 
                                       been run.
Total time to complete Offline 
data collection:                 ( 609) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  89) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       128748105
 3 Spin_Up_Time            0x0003   099   097   000    Pre-fail  Always       -       0
 4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       498
 5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       15596389
 9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2168
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       249
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       340
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   044   045    Old_age   Always   In_the_past 35 (0 3 35 20)
194 Temperature_Celsius     0x0022   035   056   000    Old_age   Always       -       35 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   034   023   000    Old_age   Always       -       128748105
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       34922379086533
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       871942731
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       235388775

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
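
As far as I understand it, the `In_the_past` flag on the airflow temperature row comes from comparing the normalized columns: an attribute counts as failing now when VALUE is at or below THRESH, and as failed in the past when only WORST is at or below THRESH. A rough Python sketch of that comparison, using rows copied from the output above (`parse_row` and `threshold_state` are hypothetical helper names):

```python
def parse_row(row):
    """Split one row of the smartctl attribute table into named fields."""
    parts = row.split()
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "when_failed": parts[8],
        "raw": " ".join(parts[9:]),
    }


def threshold_state(attr):
    """Classify an attribute by its normalized VALUE/WORST against THRESH."""
    if attr["thresh"] == 0:
        return "no threshold"  # a threshold of 0 can never be reached
    if attr["value"] <= attr["thresh"]:
        return "failing now"
    if attr["worst"] <= attr["thresh"]:
        return "failed in the past"
    return "ok"


rows = [
    "  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0",
    "190 Airflow_Temperature_Cel 0x0022   065   044   045    Old_age   Always   In_the_past 35 (0 3 35 20)",
    "194 Temperature_Celsius     0x0022   035   056   000    Old_age   Always       -       35 (0 18 0 0)",
]
states = {a["name"]: threshold_state(a) for a in map(parse_row, rows)}
for name, state in states.items():
    print(name, "->", state)
```

On this drive the normalized airflow value looks like 100 minus the temperature (065 now at 35 C), so a WORST of 044 would correspond to about 56 C, which lines up with the WORST of 056 on the Temperature_Celsius row. That is an inference from this one output, not something I have confirmed for the drive family.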


Not to beat a dead horse or anything, but on a lark I ran smartctl on my old disk, which I replaced last spring because it was acting up so badly that the computer was fairly useless. The interesting result (posted below) passes that old disk as well, but with "ATA Error Count: 29633". I wonder if I replaced the disk when the trouble was really in the cable. The disk is now in a USB external case and mounts OK.

 

Banjo

(_)=='=~

 

smartctl 5.39.1 2010-01-28 r3054 [i586-mandriva-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3250410AS
Serial Number:    6RY6W554
Firmware Version: 3.AAF
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Oct 25 22:24:46 2011 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                       without error or no self-test has ever 
                                       been run.
Total time to complete Offline 
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                       Auto Offline data collection on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       No Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  64) minutes.
SCT capabilities:              (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
 4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1108
 5 Reallocated_Sector_Ct   0x0033   073   073   036    Pre-fail  Always       -       1111
 7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       94882296
 9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13294
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1113
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       103
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   051   045    Old_age   Always       -       25 (Lifetime Min/Max 19/49)
194 Temperature_Celsius     0x0022   025   049   000    Old_age   Always       -       25 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   073   046   000    Old_age   Always       -       35281440
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   087   000    Old_age   Always       -       29446
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 29633 (device log contains only the most recent five errors)
       CR = Command Register [HEX]
       FR = Features Register [HEX]
       SC = Sector Count Register [HEX]
       SN = Sector Number Register [HEX]
       CL = Cylinder Low Register [HEX]
       CH = Cylinder High Register [HEX]
       DH = Device/Head Register [HEX]
       DC = Device Command Register [HEX]
       ER = Error register [HEX]
       ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 29633 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 67 79 36 42 e1  Error: ICRC, ABRT 103 sectors at LBA = 0x01423679 = 21116537

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 78 68 36 42 e1 00      00:21:06.493  READ DMA
 c8 00 08 78 42 42 e1 00      00:21:06.493  READ DMA
 c8 00 58 d8 42 42 e1 00      00:21:06.493  READ DMA
 c8 00 08 60 4c 42 e1 00      00:21:06.492  READ DMA
 c8 00 08 70 42 42 e1 00      00:21:06.471  READ DMA

Error 29632 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 6f 31 51 42 e1  Error: ICRC, ABRT 111 sectors at LBA = 0x01425131 = 21123377

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 a0 00 51 42 e1 00      00:21:04.375  READ DMA
 ec 00 00 00 00 00 a0 00      00:21:05.813  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:21:05.810  SET FEATURES [set transfer mode]
 ec 00 00 00 00 00 a0 00      00:21:05.807  IDENTIFY DEVICE
 00 00 a0 00 00 00 00 04      00:21:05.804  NOP [Abort queued commands]

Error 29631 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 5f 41 51 42 e1  Error: ICRC, ABRT 95 sectors at LBA = 0x01425141 = 21123393

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 a0 00 51 42 e1 00      00:21:04.375  READ DMA
 ec 00 00 00 00 00 a0 00      00:21:03.869  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:21:03.866  SET FEATURES [set transfer mode]
 ec 00 00 00 00 00 a0 00      00:21:03.863  IDENTIFY DEVICE
 00 00 a0 00 00 00 00 04      00:21:03.860  NOP [Abort queued commands]

Error 29630 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 6f 31 51 42 e1  Error: ICRC, ABRT 111 sectors at LBA = 0x01425131 = 21123377

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 a0 00 51 42 e1 00      00:21:04.375  READ DMA
 ec 00 00 00 00 00 a0 00      00:21:03.869  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:21:03.866  SET FEATURES [set transfer mode]
 ec 00 00 00 00 00 a0 00      00:21:03.863  IDENTIFY DEVICE
 00 00 a0 00 00 00 00 04      00:21:03.860  NOP [Abort queued commands]

Error 29629 occurred at disk power-on lifetime: 13290 hours (553 days + 18 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 1f 81 51 42 e1  Error: ICRC, ABRT 31 sectors at LBA = 0x01425181 = 21123457

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c8 00 a0 00 51 42 e1 00      00:21:02.523  READ DMA
 ec 00 00 00 00 00 a0 00      00:21:03.869  IDENTIFY DEVICE
 ef 03 42 00 00 00 a0 00      00:21:03.866  SET FEATURES [set transfer mode]
 ec 00 00 00 00 00 a0 00      00:21:03.863  IDENTIFY DEVICE
 00 00 a0 00 00 00 00 04      00:21:03.860  NOP [Abort queued commands]

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
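
The register rows in that error log can be decoded by hand. Below is a rough Python sketch, assuming the standard ATA error-register bit meanings and 28-bit LBA addressing; the bit table is deliberately partial and `decode_error_row` is a hypothetical name.

```python
# Partial table of ATA error-register bits (assumed standard meanings).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error_row(row):
    """Decode 'ER ST SC SN CL CH DH' from one smartctl error-log register row."""
    er, _st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split()[:7])
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    # 28-bit LBA: low nibble of DH, then CH, CL, SN from high to low byte.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, sc, lba


# Register row copied from "Error 29633" in the output above.
flags, sectors, lba = decode_error_row("84 51 67 79 36 42 e1")
print(flags, sectors, "sectors at LBA", lba)
```

For error 29633 this gives ICRC plus ABRT, 103 sectors, and LBA 21116537, the same as the text smartctl prints next to the row. ICRC is an interface CRC error, meaning the data was damaged on the way between the drive and the controller, which is usually read as a hint toward the cable or connector.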


You do have a problem with that old disk:

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 5 Reallocated_Sector_Ct   0x0033   073   073   036    Pre-fail  Always       -       1111
 7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       94882296
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       103

 

so you were right to replace it. Also, considering the reallocated sectors and the reported uncorrectable errors, I would not use that disk for anything important.
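
If it helps with reading the numbers, here is a minimal Python sketch that pulls the raw counters out of a few rows of your old disk's output and sorts them into "media" and "link" hints. The watch list and its labels are my own assumption about which counters are worth a look, not an official rule, and `nonzero_counters` is a hypothetical name.

```python
# Assumed watch list: raw counters that are normally 0 on a healthy drive.
WATCH = {
    "Reallocated_Sector_Ct": "media",
    "Current_Pending_Sector": "media",
    "Reported_Uncorrect": "media",
    "UDMA_CRC_Error_Count": "link (cable or connector)",
}


def nonzero_counters(rows):
    """Return {attribute: (hint, raw)} for watched attributes with a raw count above 0."""
    found = {}
    for row in rows:
        parts = row.split()
        name, raw = parts[1], int(parts[9])  # column 10 is RAW_VALUE
        if name in WATCH and raw > 0:
            found[name] = (WATCH[name], raw)
    return found


# Rows copied from the old ST3250410AS output posted in this thread.
old_disk = [
    "  5 Reallocated_Sector_Ct   0x0033   073   073   036    Pre-fail  Always       -       1111",
    "187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       103",
    "197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0",
    "199 UDMA_CRC_Error_Count    0x003e   200   087   000    Old_age   Always       -       29446",
]
found = nonzero_counters(old_disk)
for name, (hint, raw) in found.items():
    print(name, raw, "->", hint)
```

On those rows both kinds show up: 1111 reallocated sectors and 103 reported uncorrectable errors point at the platters, while 29446 CRC errors point at the link. So the old disk was genuinely bad, and the cable may well have been causing trouble at the same time.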


Wow. Thanks for pointing that out. I did not notice it because I don't know yet how to read the numbers.

 

Banjo

(_)=='=~
