Need help with Raid disk errors

eastend99

« on: December 12, 2007, 07:51:37 PM »

Hi,

Yesterday I received two mdadm monitoring emails: the first at 9:32 for /dev/md2, the second at 23:32 for /dev/md1.
The current output of mdadm --detail for /dev/md1 and /dev/md2 shows:

Code: [Select]

/dev/md1:
        Version : 00.90.01
  Creation Time : Sun Feb 11 13:38:58 2007
     Raid Level : raid1
     Array Size : 104320 (101.89 MiB 106.82 MB)
    Device Size : 104320 (101.89 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Tue Dec 11 23:31:11 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 4c021324:7d0b66b0:59685a06:10306ca3
         Events : 0.3613

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        -      removed

       2       8       17        -      faulty   /dev/sdb1

/dev/md2:
        Version : 00.90.01
  Creation Time : Sun Feb 11 13:38:58 2007
     Raid Level : raid1
     Array Size : 195253888 (186.21 GiB 199.94 GB)
    Device Size : 195253888 (186.21 GiB 199.94 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Wed Dec 12 15:11:44 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 06c12353:d61ac8e5:2de7f6e9:7fdcfb10
         Events : 0.6595950

When I looked into the messages log file I found some errors as well:

Code: [Select]

Dec 11 09:28:30 smesrv-01 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Dec 11 09:28:30 smesrv-01 kernel: ata2.00: cmd 35/00:08:02:dd:49/00:00:17:00:00/e0 tag 0 cdb 0x0 data 4096 out
Dec 11 09:28:30 smesrv-01 kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 11 09:28:37 smesrv-01 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Dec 11 09:29:00 smesrv-01 kernel: ata2: port failed to respond (30 secs, Status 0xd0)
Dec 11 09:29:00 smesrv-01 kernel: ata2: soft resetting port
Dec 11 09:29:00 smesrv-01 kernel: ATA: abnormal status 0xD0 on port 0xE407
Dec 11 09:29:00 smesrv-01 last message repeated 6 times
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: qc timeout (cmd 0xec)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: revalidation failed (errno=-5)
Dec 11 09:31:41 smesrv-01 kernel: ata2: failed to recover some devices, retrying in 5 secs
Dec 11 09:31:41 smesrv-01 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: port failed to respond (30 secs, Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: soft resetting port
Dec 11 09:31:41 smesrv-01 kernel: ATA: abnormal status 0xD0 on port 0xE407
Dec 11 09:31:41 smesrv-01 su(pam_unix)[32741]: session opened for user qmailr by (uid=0)
Dec 11 09:31:41 smesrv-01 last message repeated 6 times
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: qc timeout (cmd 0xec)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: revalidation failed (errno=-5)
Dec 11 09:31:41 smesrv-01 kernel: ata2: failed to recover some devices, retrying in 5 secs
Dec 11 09:31:41 smesrv-01 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: port failed to respond (30 secs, Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: soft resetting port
Dec 11 09:31:41 smesrv-01 kernel: ATA: abnormal status 0xD0 on port 0xE407
Dec 11 09:31:41 smesrv-01 last message repeated 6 times
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: qc timeout (cmd 0xec)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: revalidation failed (errno=-5)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: disabled
Dec 11 09:31:41 smesrv-01 kernel: ata2: EH complete
Dec 11 09:31:41 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:41 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 390716674
Dec 11 09:31:41 smesrv-01 kernel: md: write_disk_sb failed for device sdb2
Dec 11 09:31:41 smesrv-01 kernel: md: errors occurred during superblock update, repeating
Dec 11 09:31:41 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000

<snip> these errors repeated a few hundred times in three seconds <snip>

Dec 11 09:31:44 smesrv-01 kernel: md: excessive errors occurred during superblock update, exiting
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 271522
Dec 11 09:31:44 smesrv-01 kernel: raid1: Disk failure on sdb2, disabling device. 
Dec 11 09:31:44 smesrv-01 kernel: 	Operation continuing on 1 devices
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 271530
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 271538
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 108462770
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 147370394
Dec 11 09:31:44 smesrv-01 kernel: RAID1 conf printout:
Dec 11 09:31:44 smesrv-01 kernel:  --- wd:1 rd:2
Dec 11 09:31:44 smesrv-01 kernel:  disk 0, wo:0, o:1, dev:sda2
Dec 11 09:31:44 smesrv-01 kernel:  disk 1, wo:1, o:0, dev:sdb2
Dec 11 09:31:44 smesrv-01 kernel: RAID1 conf printout:
Dec 11 09:31:44 smesrv-01 kernel:  --- wd:1 rd:2
Dec 11 09:31:44 smesrv-01 kernel:  disk 0, wo:0, o:1, dev:sda2

Note the ATA errors before the actual write errors. I don't know what they mean.
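For a quick tally instead of scrolling through the log, something like the sketch below counts the "end_request: I/O error" lines per device. The function name io_errors_by_dev is made up for illustration, and /var/log/messages is assumed to be where syslog writes on this system.

```shell
# Count "end_request: I/O error" lines per device.
# Reads syslog-style text on stdin, prints "<device> <count>" per device.
io_errors_by_dev() {
  awk '/end_request: I\/O error, dev / {
         for (i = 1; i < NF; i++)
           if ($i == "dev") { d = $(i + 1); sub(/,$/, "", d); n[d]++ }
       }
       END { for (d in n) print d, n[d] }' | sort
}

# Usage (log path is an assumption, adjust as needed):
#   io_errors_by_dev < /var/log/messages
```

If only one device shows up in the tally, that points at that disk, its cable or its port rather than at the arrays themselves.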

This happened again at 23:30 (while the backup was running):

Code: [Select]

Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 208641
Dec 11 23:30:40 smesrv-01 kernel: md: write_disk_sb failed for device sdb1

<snip> again a few hundred messages in three seconds <snip>

Dec 11 23:30:40 smesrv-01 kernel: md: errors occurred during superblock update, repeating
Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 208641
Dec 11 23:30:40 smesrv-01 kernel: md: write_disk_sb failed for device sdb1
Dec 11 23:30:40 smesrv-01 kernel: md: errors occurred during superblock update, repeating
Dec 11 23:30:40 smesrv-01 kernel: md: excessive errors occurred during superblock update, exiting
Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 1171
Dec 11 23:30:40 smesrv-01 kernel: raid1: Disk failure on sdb1, disabling device. 
Dec 11 23:30:40 smesrv-01 kernel: 	Operation continuing on 1 devices
Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 1173
Dec 11 23:30:40 smesrv-01 kernel: RAID1 conf printout:
Dec 11 23:30:40 smesrv-01 kernel:  --- wd:1 rd:2
Dec 11 23:30:40 smesrv-01 kernel:  disk 0, wo:0, o:1, dev:sda1
Dec 11 23:30:40 smesrv-01 kernel:  disk 1, wo:1, o:0, dev:sdb1
Dec 11 23:30:40 smesrv-01 kernel: RAID1 conf printout:
Dec 11 23:30:40 smesrv-01 kernel:  --- wd:1 rd:2
Dec 11 23:30:40 smesrv-01 kernel:  disk 0, wo:0, o:1, dev:sda1

Note there were no more ATA errors.

The machine currently runs an up-to-date SME 7.2 and has run perfectly for about 18 months.

I am unsure what action to take. Both RAID arrays reported errors independently, which I have never seen before. I think a malfunctioning disk is the likely cause. My questions:

Can the ATA errors in the first part of the messages log cause the arrays to degrade?

Can I reactivate the array in this situation? It differs from the previous posts in the forum on this topic because the mdadm output lists a failed device, which is not the case in any of the other examples I found. If I can, what are the correct commands for this situation? (I hate messing around with production arrays, partitions and disks; I am not that good at Linux.)

Is there a way to read SMART status of the disk device sdb from the command line? I would like some confirmation at device level before I start asking for money to buy a new disk...

Regards,

Marcel

CharlieBrady

Re: Need help with Raid disk errors

« Reply #1 on: December 13, 2007, 04:24:19 AM »

Quote from: eastend99 on December 12, 2007, 07:51:37 PM

Is there a way to read SMART status of the disk device sdb from the command line?

Code: [Select]

 /usr/sbin/smartctl --health /dev/sdb

Read "man smartctl" for more detail, and see, for instance, http://www.linuxjournal.com/article/6983. And always remember that Google is your friend.
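The overall health verdict is fairly coarse, so it can help to look at a few raw counters in the attribute table as well. A minimal sketch, assuming the usual ten-column layout of "smartctl -A" output with the raw value in the last column; the function name smart_nonzero_counters is made up for illustration, and the three attribute names are common ones that not every drive reports.

```shell
# Print sector-related SMART attributes whose raw value is above zero.
# Reads "smartctl -A" output on stdin; column 2 is the attribute name,
# column 10 the raw value (assumed standard attribute table layout).
smart_nonzero_counters() {
  awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
       ($10 + 0) > 0 { print $2, $10 }'
}

# Usage:
#   /usr/sbin/smartctl -A /dev/sdb | smart_nonzero_counters
```

No output means none of these counters is above zero. That is not proof of a healthy drive, since a disk that stops answering on the bus may never log anything in SMART.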

Quote

I would like some confirmation at device level before I start asking for money to buy a new disk...

Disk is usually cheaper than data. And perhaps your drive is still under warranty.

Reinhold

Re: Need help with Raid disk errors

« Reply #2 on: December 13, 2007, 11:54:36 AM »

eastend

The ATA error is in plain English:
...
Dec 11 09:28:37 smesrv-01 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Dec 11 09:29:00 smesrv-01 kernel: ata2: port failed to respond (30 secs, Status 0xd0)

When a device (disk) fails to respond, it is thrown out of the RAID array, in other words "removed", and your array is degraded.
This is the situation you observe.
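To see at a glance which arrays are affected, /proc/mdstat marks a missing member with an underscore in the status brackets, e.g. [U_]. A small sketch that lists only the degraded arrays; the function name degraded_arrays is made up for illustration.

```shell
# List md arrays with at least one missing member.
# Reads /proc/mdstat-style text on stdin; an array counts as degraded
# when its status brackets (e.g. [U_]) contain an underscore.
degraded_arrays() {
  awk '/^md[0-9]+ :/   { name = $1 }
       /\[U*_[U_]*\]/  { print name }'
}

# Usage:
#   degraded_arrays < /proc/mdstat
```

An array that is still rebuilding also shows an underscore until the resync finishes, so it is listed as well.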

Unless you have a faulty cable (did you fiddle a lot in that server recently?)
or your controller suddenly went dead (unlikely, since those are mostly either "alive" or "dead", but check dmesg for news),
your hard disk has trouble either spinning its platters or reading from them.

SMART will tell you basically the same thing in other words.

In my experience any disk that has failed to respond once will not recover "for good".
It may limp on for some time if you use the manufacturer's tools to correct it (block out the bad area),
but it will fail catastrophically when it hurts most. (Note that on ATA systems the above would have completely blocked a channel ;-/ )

Get a new drive asap.
Like Charlie said: 160 GB drives have a good chance of still being under warranty (Seagate: 5 years).
You can check the warranty status online for most manufacturers.

Regards
Reinhold

eastend99

Re: Need help with Raid disk errors

« Reply #3 on: December 14, 2007, 10:12:06 AM »

Thanks for your replies.

For the record: the smartctl command returns SMART status OK.
The command (via Google)

Code: [Select]

hdparm -I /dev/sd[ab]

returns drive info correctly for sda, but for sdb it returns "HDIO_DRIVE_CMD(identify) failed: Input/output error".
I just bought a new hard drive. You're right about the 160 GB. €50 is a much better choice than data loss.

regards,
Marcel

bs_bay

Re: Need help with Raid disk errors

« Reply #4 on: December 14, 2007, 11:20:35 PM »

eastend99,

Watch for a couple of gotchas:
- Check for the latest BIOS update.
- Check for the latest controller update, ESPECIALLY if it is SCSI based.

Backup, then update.

FYI - I just got burned by a SCSI SATA controller/disk communication error. There were some known (but unknown to me) updates required for this particular manufacturer that had not been installed. The result was a loss of communication between the drives and the controller, and now I may be looking at a costly data recovery process. Although my system had run just fine for 2 years, the support team told me they could not explain why this particular failure happened when it did or how to fix it! Be careful!!

Be sure to get a good backup and consider a different drive manufacturer.

Bill

Reinhold

Re: Need help with Raid disk errors

« Reply #5 on: December 17, 2007, 03:42:19 PM »

Bill

SCSI and SATA are two quite different interfaces.
(That the Linux kernel reports a SCSI error should not confuse you: the "S" in SATA stands for Serial ATA!)

Your problem description is reminiscent of SATA1 and SATA2 interface problems...

Please note: SATA2 drives are fully backward compatible with SATA1 controllers.
The only thing you would have to change when using a SATA2 drive is the hard disk jumpers, to configure it for SATA1
(or apply the patch(es) you are talking about ...). You can find the exact specs of your drive and the jumper settings, including a compatibility warning, on the manufacturer's website.

In case you have already had problems, set everything to SATA1 and try again.
Data that is lost this way cannot (in all likelihood) be recovered at all ...
What's defective is defective on the disk! It was garbled on its way through the interface and is not data you can or should use ...

Regards
Reinhold
