Koozali.org: home of the SME Server

RAID's hdb

Sean

RAID's hdb
« on: September 17, 2003, 10:09:45 AM »
I need some help to verify my thinking.

I am having a problem with a box that has a RAID array on it. Darrel May's Raidmonitor contrib reports:
----8<--------
ALARM! RAID configuration problem

Current configuration is:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[0] hda1[1] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[0] hda5[1] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]
unused devices: <none>

Last known good configuration was:

Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda1[1] 264960 blocks [2/1] [_U]
md0 : active raid1 hda5[1] 15936 blocks [2/1] [_U]
md1 : active raid1 hda6[1] 38796864 blocks [2/1] [_U]
unused devices: <none>
----8<--------

... this came after a very long re-syncing process, which suggests to me that hdb6[2] is still down. The following is from the syslog ...

----8<--------
Sep 16 18:35:43 gateway kernel: md: syncing RAID array md1
Sep 16 18:35:43 gateway kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec.
Sep 16 18:35:43 gateway kernel: md: using maximum available idle IO bandwith for reconstruction.
Sep 16 18:35:43 gateway kernel: md: using 128k window.
Sep 16 18:59:04 gateway sshd(pam_unix)[1897]: session closed for user root
Sep 16 19:09:04 gateway kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Sep 16 19:09:04 gateway kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=5047737, sector=4485399
Sep 16 19:09:04 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:09:04 gateway kernel: interrupting MD-thread pid 6
Sep 16 19:09:04 gateway kernel: raid1: only one disk left and IO error.
Sep 16 19:09:04 gateway kernel: raid1: md1: rescheduling block 560674
Sep 16 19:09:04 gateway kernel: dirty sb detected, updating.
Sep 16 19:09:04 gateway kernel: md: updating md1 RAID superblock on device
Sep 16 19:09:04 gateway kernel: hdb6 [events: 0000006d](write) hdb6's sb offset: 38796864
Sep 16 19:09:04 gateway kernel: (skipping faulty hda6 )
Sep 16 19:09:04 gateway kernel: .
Sep 16 19:09:04 gateway kernel: raid1: md1: unrecoverable I/O read error for block 560674
Sep 16 19:09:08 gateway kernel: md1: read error while reconstructing, at block 560672(4096).
Sep 16 19:09:15 gateway kernel: nr_blocks changed to 32 (blocksize 4096, j 560672, max_blocks 9699216)
Sep 16 19:09:23 gateway kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Sep 16 19:09:30 gateway kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=5047737, sector=4485399
Sep 16 19:09:38 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:09:45 gateway kernel: interrupting MD-thread pid 6
Sep 16 19:09:53 gateway kernel: raid1: only one disk left and IO error.
Sep 16 19:10:01 gateway kernel: raid1: md1: rescheduling block 560674
Sep 16 19:10:08 gateway kernel: raid1: md1: unrecoverable I/O read error for block 560674
Sep 16 19:10:16 gateway kernel: md1: read error while reconstructing, at block 560672(4096).
Sep 16 19:10:24 gateway kernel: nr_blocks changed to 32 (blocksize 4096, j 560672, max_blocks 9699216)
Sep 16 19:10:31 gateway kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Sep 16 19:10:38 gateway kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=5047737, sector=4485399
Sep 16 19:10:46 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:10:53 gateway kernel: interrupting MD-thread pid 6
Sep 16 19:10:57 gateway kernel: raid1: only one disk left and IO error.
Sep 16 19:11:05 gateway kernel: raid1: md1: rescheduling block 560674
----8<--------

Note the I/O error on hda, which interrupts the syncing of md1 at block 560674.
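
For anyone reading along, the repeated failures in the excerpt above can be tallied with a short script. This is only a sketch and not something from the original post: it assumes the 2.4-kernel `end_request` message format shown above, and the names `IO_ERROR`, `count_io_errors` and `SAMPLE_LOG` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   end_request: I/O error, dev 03:06 (hda), sector 4485399
# (format assumed from the syslog excerpt in this post)
IO_ERROR = re.compile(
    r"end_request: I/O error, dev (?P<devnum>[0-9a-f]{2}:[0-9a-f]{2}) "
    r"\((?P<disk>\w+)\), sector (?P<sector>\d+)"
)


def count_io_errors(log_text):
    """Count I/O errors per (device number, disk, sector)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            key = (
                match.group("devnum"),
                match.group("disk"),
                int(match.group("sector")),
            )
            counts[key] += 1
    return counts


# Three of the lines from the syslog excerpt above.
SAMPLE_LOG = """\
Sep 16 19:09:04 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:09:38 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:10:46 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
"""

if __name__ == "__main__":
    for (devnum, disk, sector), hits in count_io_errors(SAMPLE_LOG).items():
        print(f"dev {devnum} ({disk}) sector {sector}: {hits} error(s)")
```

All three errors in the excerpt land on the same sector of device 03:06 (major 3, minor 6, which in Linux IDE numbering is hda6), so the drive appears to be hitting one unreadable spot over and over rather than failing all across the partition.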

Now, cat /root/raidmonitor/mdstat reveals:
----8<--------
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[0] hda1[1] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[0] hda5[1] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]
unused devices: <none>
----8<--------

... all of which suggests to me that the copy of md1 that is up on hda6[1] is actually corrupted, and so will never sync across to hdb6[2] properly.
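
The reading of the mdstat output above can also be checked mechanically. The following is a rough sketch, not part of the original post: it assumes each array sits on a single line exactly as pasted here (real /proc/mdstat output may wrap differently), and `parse_mdstat`, `MD_LINE`, `MEMBER` and `SAMPLE_MDSTAT` are illustrative names only.

```python
import re

# One array per line, in the form pasted in this post, e.g.
#   md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]
MD_LINE = re.compile(
    r"^(?P<name>md\d+) : (?P<state>\w+) (?P<level>\w+) (?P<members>.*?) "
    r"(?P<blocks>\d+) blocks \[(?P<total>\d+)/(?P<active>\d+)\] "
    r"\[(?P<flags>[U_]+)\]"
)
# A member looks like hda6[1] or hda6[1](F); (F) marks a faulty device.
MEMBER = re.compile(r"(?P<dev>\w+)\[(?P<slot>\d+)\](?P<failed>\(F\))?")


def parse_mdstat(text):
    """Return one dict per md array found in the given mdstat text."""
    arrays = []
    for line in text.splitlines():
        match = MD_LINE.match(line.strip())
        if not match:
            continue  # skip "Personalities", "read_ahead", etc.
        members = [
            {
                "dev": m.group("dev"),
                "slot": int(m.group("slot")),
                "failed": m.group("failed") is not None,
            }
            for m in MEMBER.finditer(match.group("members"))
        ]
        total = int(match.group("total"))
        active = int(match.group("active"))
        arrays.append(
            {
                "name": match.group("name"),
                "members": members,
                "total": total,
                "active": active,
                # Degraded: fewer active mirrors than configured,
                # or at least one member flagged (F).
                "degraded": active < total or any(m["failed"] for m in members),
            }
        )
    return arrays


SAMPLE_MDSTAT = """\
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[0] hda1[1] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[0] hda5[1] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]
"""

if __name__ == "__main__":
    for array in parse_mdstat(SAMPLE_MDSTAT):
        failed = [m["dev"] for m in array["members"] if m["failed"]]
        status = "DEGRADED" if array["degraded"] else "ok"
        print(array["name"], status, "failed members:", failed)
```

On the sample above, only md1 is flagged, with hda6 as the member marked faulty, which matches the syslog errors on hda.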

I think I should be blowing this whole thing away and restoring from tape, but first I want to test the drives, especially the volumes hd?6, for validity.

Any thoughts on what tools I can use to validate the physical wellness (preferably non-destructively) of these two drives?

Sean

Kelvin

Re: RAID's hdb
« Reply #1 on: September 17, 2003, 11:48:19 AM »
You could check your drive manufacturer's web site for a diagnostic utility. Seagate has one for its drives and, although not infallible, it does detect some types of problems before they become really big ones.

Kelvin

John Crisp

Re: RAID's hdb
« Reply #2 on: September 17, 2003, 04:07:27 PM »
Looks like hda is a goner, but as per Kelvin's reply it would be worth running the manufacturer's drive tester on it just in case. If it's under warranty you will need to do that anyway.

It is also worth noting that if they are IDE drives, it is better to run them as two masters (hda & hdc), each on its own cable, rather than as master / slave on the same cable.

Best regards,

John