Server Crash, help interpreting log

arnoldob

« on: January 27, 2007, 06:10:51 PM »

I've been having recurrent server crashes. When it crashes I can't log in locally, and there's no web, no mail, etc. I was poking around in the messages log and saw this:

Quote

Jan 27 00:00:49 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:50 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:50 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:51 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:51 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:51 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:52 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:52 spanky kernel: hdd: ATAPI reset complete
Jan 27 00:00:52 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:52 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:52 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:53 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:53 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:53 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:53 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:54 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:54 spanky kernel: hdd: ATAPI reset complete
Jan 27 00:00:54 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:54 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:55 spanky kernel: hdc: read_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 27 00:00:55 spanky kernel: hdc: read_intr: error=0x04 { DriveStatusError }
Jan 27 00:00:55 spanky kernel: ide: failed opcode was: unknown
Jan 27 00:00:56 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:56 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:56 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:56 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:57 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:57 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 00:00:57 spanky kernel: hdd: status error: status=0x00 { }
Jan 27 00:00:58 spanky kernel: hdd: status error: error=0x04Aborted Command
Jan 27 06:02:03 spanky syslogd 1.4.1: restart.

Note that there was no log activity until 06:02:03, when I restarted the server. While it was booting it prompted me to scan the disks; I selected yes and saw no error indication from the scan. Am I correct in assuming that the messages log lines indicating hdd errors mean that there's a hardware issue with my secondary slave drive? Is there a more complete drive scanning tool that would be more helpful in confirming a hardware failure? I'd rather not just swap out hardware until I'm sure about what's going on here.
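For reading a log like the one quoted above, here is a minimal Python sketch that tallies the kernel IDE messages per device and per status/error value. The function name `tally_ide_errors` and the regular expression are illustrative assumptions, written only against the line format shown in this post; other kernels may word these messages differently.

```python
import re
from collections import Counter

# Matches kernel IDE lines such as:
#   Jan 27 00:00:55 spanky kernel: hdc: read_intr: error=0x04 { DriveStatusError }
# Captures the device name, the field (status or error), and the hex value.
LINE_RE = re.compile(r"kernel: (hd[a-z]): .*?(status|error)=(0x[0-9a-fA-F]{2})")


def tally_ide_errors(lines):
    """Count (device, field, value) triples found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Small demo using lines copied from the log above.
demo = [
    "Jan 27 00:00:49 spanky kernel: hdd: status error: error=0x04Aborted Command",
    "Jan 27 00:00:50 spanky kernel: hdd: status error: status=0x00 { }",
    "Jan 27 00:00:52 spanky kernel: hdd: ATAPI reset complete",
    "Jan 27 00:00:55 spanky kernel: hdc: read_intr: status=0x51 { DriveReady SeekComplete Error }",
]
for (device, field, value), n in sorted(tally_ide_errors(demo).items()):
    print(f"{device} {field}={value}: {n}")
```

Passing an open file object for the messages log instead of the demo list should work the same way, since the function only iterates over lines. A tally like this makes it easier to see whether hdc, hdd, or both are involved.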

Thanks!

Tampa, FL USA

byte

« Reply #1 on: January 27, 2007, 06:14:32 PM »

I'd say it could be the mainboard, as you have problems with both hdc and hdd. Or maybe the cable?

arnoldob

« Reply #2 on: January 27, 2007, 07:57:10 PM »

I saw a page here:
http://www.captain.at/howto-linux-driveready-seekcomplete-error-drivestatuserror.php

Quote

kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
kernel: hda: drive_cmd: error=0x04 { DriveStatusError }

This says that there was an error on the harddrive, and that the command was aborted ("Command aborted"). As far as I have found, such errors once in a while mean nothing serious. As long as there are no "Uncorrectable ECC error"s or other grave errors like checksum error, bad address mark etc. it should be nothing to worry about. Such "aborted commands" occur e.g. when an unknown sector is requested, that is not present on the harddisk, buggy drivers (-> the driver sent a command that was not understood by the drive).

Another evidence that the DriveStatusError (command aborted) is harmless is that the SmartMonTools (Linux Harddisk Monitoring with SmartMonTools (smartctl)) don't report any non-zero RAW_VALUES for Reallocated_Sector_Ct, Seek_Error_Rate, Reallocated_Event_Count, Offline_Uncorrectable, UDMA_CRC_Error_Count, Multi_Zone_Error_Rate or Hardware_ECC_Recovered etc., so there were not serious errors on the harddrive, but the command was just not executed or understood by the disk.
Maybe the firmware on the harddrive is buggy.

That would seem to indicate that the messages log errors aren't worrisome. What is worrisome is that the server stops responding and needs a power-cycle reboot to run again.
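To make the hex values in those kernel lines easier to read, here is a small Python sketch that decodes the status and error bytes into flag names. The status names are the ones the kernel itself prints in the log lines above, extended with the remaining bits of the classic ATA status register. The error-register names are the abbreviations from the older ATA specifications as far as I recall them; treat both tables as assumptions and check them against the specification before relying on them.

```python
# ATA status register bits, using the names the old Linux IDE driver prints.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# ATA error register bits (abbreviations from older ATA specs; an assumption).
# The kernel log above reports ABRT (0x04) as "DriveStatusError".
ERROR_BITS = [
    (0x80, "ICRC/BBK"),
    (0x40, "UNC"),
    (0x20, "MC"),
    (0x10, "IDNF"),
    (0x08, "MCR"),
    (0x04, "ABRT"),
    (0x02, "TK0NF"),
    (0x01, "AMNF"),
]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table if value & bit]


print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x04, ERROR_BITS))   # ['ABRT']
```

Decoding 0x51 gives the same three flags the kernel printed for hdc, and 0x04 is the "command aborted" bit that the quoted page discusses.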

I played with smartctl a bit but saw no reported errors there.
I'll swap out the secondary controller cable and see if that helps. That's certainly cheaper than another 160GB HD or motherboard.

Quote

[root@spanky ~]# smartctl -l selftest /dev/hdc
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 6960 -

[root@spanky ~]# smartctl -l selftest /dev/hda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 4299 -
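
The self-test log above only covers the short offline test. The attributes named on the quoted page live in the attribute table, which `smartctl -A` prints. Below is a minimal Python sketch that scans that output for non-zero raw values in those attributes. The function name `nonzero_watched` is hypothetical, and the sample rows are made up purely to show the column layout; they are not output from these drives.

```python
# Attributes listed on the quoted page as signs of a real drive problem.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Seek_Error_Rate",
    "Reallocated_Event_Count",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Multi_Zone_Error_Rate",
    "Hardware_ECC_Recovered",
}


def nonzero_watched(smartctl_a_output):
    """Return {attribute_name: raw_value} for watched attributes with raw != 0."""
    found = {}
    for line in smartctl_a_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in WATCHED:
            raw = parts[9]
            if raw.isdigit() and int(raw) != 0:
                found[parts[1]] = int(raw)
    return found


# Made-up sample rows, only to illustrate the layout of `smartctl -A` output.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6960
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
print(nonzero_watched(sample))  # {'Reallocated_Sector_Ct': 12}
```

One caveat: on some drives the raw values of rate-type attributes such as Seek_Error_Rate are vendor-encoded and can be large on a healthy disk, so a non-zero value there is a prompt to look closer rather than proof of a fault.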

ClaudioG

« Reply #3 on: January 29, 2007, 11:50:47 AM »

We have had this error many times:

It is a Parallel ATA hard disk hardware error.

Replace the hard disk (check it first with the vendor's utilities, e.g. the Maxtor test tool or similar).

Controller or cable problems usually produce different errors (kernel panic, filesystem remounted read-only, etc.).

ClaudioG

Cruthik

« Reply #4 on: January 29, 2007, 12:00:26 PM »

What are the pin settings of your HD?

Check your pin settings... Make sure the jumper is on the right pins (e.g. Master or Slave).

Just my 2 cents.

-Cruthik

arnoldob

« Reply #5 on: February 14, 2007, 09:27:27 PM »

I'm still messing around with this. The SME box locks up totally once or twice a day.

The jumper settings are CS (cable select) for both drives. One drive is by itself as the primary master, the other is the secondary master with a CD-ROM slave. I disconnected the CD-ROM and replaced the IDE cables. I'm still getting complete lockups on the server: can't log in from the server keyboard, no mail, web or file services.

In frustration I pulled the drives and put them in a different box, but still get lockups. I thought it might be some kind of issue with excessive activity from an attack on a hosted website or e-mail spam, but it did the same thing with nothing plugged into the WAN interface.

This brings me back to thinking it's a drive failure of some sort, but the smartctl selftest looks OK. I tried Seagate Tools, as both drives are 160GB Seagate Barracudas, but that seems to have a problem with Linux-formatted drives. I'd love to RMA a drive but I'd prefer to make sure I replace the right one. Any suggestions?
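
One filesystem-independent way to narrow down which drive is at fault is a read-only pass over each whole device, similar in spirit to what `badblocks` does in its read-only mode. The Python sketch below is only an illustration of that idea, not a replacement for a vendor diagnostic. The function name `scan_readable` and the 1 MiB chunk size are assumptions; it would need to run as root against a device node such as /dev/hdc, and it never writes to the disk.

```python
import os


def scan_readable(path, chunk_size=1024 * 1024):
    """Read `path` from start to end without writing anything.

    Returns (bytes_read_ok, bad_offsets), where bad_offsets lists the starting
    offset of every chunk whose read raised an I/O error.
    """
    bad_offsets = []
    ok_bytes = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size of a file or block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Record the failing chunk and move on to the next one.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of data
            ok_bytes += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return ok_bytes, bad_offsets
```

Calling `scan_readable("/dev/hdc")` and then `scan_readable("/dev/hda")` and comparing the two `bad_offsets` lists, together with any new kernel messages logged during each pass, would point at the drive that actually fails to read.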

okepc

« Reply #6 on: February 15, 2007, 09:36:08 AM »

Quote

Jan 27 00:00:55 spanky kernel: hdc: read_intr: status=0x51 { DriveReady SeekComplete Error }
Jan 27 00:00:55 spanky kernel: hdc: read_intr: error=0x04 { DriveStatusError }

This is a hardware error.
All the cases I have come across had either bad clusters on the hard drive or a fault on the drive's electronics board.

Swap hdc with a new drive and see how that goes.

Dirk
