I need some help to verify my thinking.
I am having a problem with a box that has software RAID on it. Using Darrel May's Raidmonitor contrib, I get:
----8<--------
ALARM! RAID configuration problem
Current configuration is:
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[0] hda1[1] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[0] hda5[1] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]
unused devices: <none>
Last known good configuration was:
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda1[1] 264960 blocks [2/1] [_U]
md0 : active raid1 hda5[1] 15936 blocks [2/1] [_U]
md1 : active raid1 hda6[1] 38796864 blocks [2/1] [_U]
unused devices: <none>
----8<--------
... after a very long re-syncing process, which suggests to me that hdb6[2] is still down. I also get the following from the syslog:
----8<--------
Sep 16 18:35:43 gateway kernel: md: syncing RAID array md1
Sep 16 18:35:43 gateway kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec.
Sep 16 18:35:43 gateway kernel: md: using maximum available idle IO bandwith for reconstruction.
Sep 16 18:35:43 gateway kernel: md: using 128k window.
Sep 16 18:59:04 gateway sshd(pam_unix)[1897]: session closed for user root
Sep 16 19:09:04 gateway kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Sep 16 19:09:04 gateway kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=5047737, sector=4485399
Sep 16 19:09:04 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:09:04 gateway kernel: interrupting MD-thread pid 6
Sep 16 19:09:04 gateway kernel: raid1: only one disk left and IO error.
Sep 16 19:09:04 gateway kernel: raid1: md1: rescheduling block 560674
Sep 16 19:09:04 gateway kernel: dirty sb detected, updating.
Sep 16 19:09:04 gateway kernel: md: updating md1 RAID superblock on device
Sep 16 19:09:04 gateway kernel: hdb6 [events: 0000006d](write) hdb6's sb offset: 38796864
Sep 16 19:09:04 gateway kernel: (skipping faulty hda6 )
Sep 16 19:09:04 gateway kernel: .
Sep 16 19:09:04 gateway kernel: raid1: md1: unrecoverable I/O read error for block 560674
Sep 16 19:09:08 gateway kernel: md1: read error while reconstructing, at block 560672(4096).
Sep 16 19:09:15 gateway kernel: nr_blocks changed to 32 (blocksize 4096, j 560672, max_blocks 9699216)
Sep 16 19:09:23 gateway kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Sep 16 19:09:30 gateway kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=5047737, sector=4485399
Sep 16 19:09:38 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:09:45 gateway kernel: interrupting MD-thread pid 6
Sep 16 19:09:53 gateway kernel: raid1: only one disk left and IO error.
Sep 16 19:10:01 gateway kernel: raid1: md1: rescheduling block 560674
Sep 16 19:10:08 gateway kernel: raid1: md1: unrecoverable I/O read error for block 560674
Sep 16 19:10:16 gateway kernel: md1: read error while reconstructing, at block 560672(4096).
Sep 16 19:10:24 gateway kernel: nr_blocks changed to 32 (blocksize 4096, j 560672, max_blocks 9699216)
Sep 16 19:10:31 gateway kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Sep 16 19:10:38 gateway kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=5047737, sector=4485399
Sep 16 19:10:46 gateway kernel: end_request: I/O error, dev 03:06 (hda), sector 4485399
Sep 16 19:10:53 gateway kernel: interrupting MD-thread pid 6
Sep 16 19:10:57 gateway kernel: raid1: only one disk left and IO error.
Sep 16 19:11:05 gateway kernel: raid1: md1: rescheduling block 560674
----8<--------
Note the uncorrectable I/O error on hda at sector 4485399, which repeatedly interrupts the syncing of md1 at block 560674.
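As a sanity check that the sector number and the block number in the log point at the same spot, here is a rough sketch of the arithmetic. It assumes 512-byte sectors and that the "sector" value in the end_request line is relative to the start of hda6; neither is stated in the log. The helper name sector_to_block is just for illustration.

```python
# Assumptions (not stated in the log): 512-byte sectors, and the "sector"
# value reported by end_request is relative to the start of hda6.
SECTOR_SIZE = 512
BLOCK_SIZE = 4096  # "blocksize 4096" from the nr_blocks line


def sector_to_block(sector, sector_size=SECTOR_SIZE, block_size=BLOCK_SIZE):
    """Return the index of the block that contains the given sector."""
    return (sector * sector_size) // block_size


failing_sector = 4485399  # from "end_request: I/O error ... sector 4485399"
print(sector_to_block(failing_sector))  # 560674, the block md1 keeps rescheduling

# max_blocks (in 4096-byte blocks) converted to 1K blocks matches the md1 size.
print(9699216 * BLOCK_SIZE // 1024)  # 38796864
```

If those assumptions hold, the bad sector is the last 512-byte sector of block 560674, so the raid1 messages and the hda messages describe the same read failure.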
Now, cat /root/raidmonitor/mdstat reveals:
----8<--------
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hdb1[0] hda1[1] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[0] hda5[1] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]
unused devices: <none>
----8<--------
... all of which suggests to me that the copy of md1 on hda6[1] is actually corrupted, and so will never sync across to hdb6[2] properly.
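For anyone wanting to check the same thing mechanically, this is a minimal sketch that picks the degraded arrays and their faulty members out of the mdstat text. It assumes the one-line-per-array layout exactly as pasted above (a live /proc/mdstat may wrap each array over several lines), and the function name degraded_arrays is my own, not part of Raidmonitor.

```python
import re

# Matches the one-line-per-array layout pasted above, e.g.
# "md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]"
ARRAY_RE = re.compile(
    r"^(md\d+) : active \S+ (.*?) \d+ blocks \[(\d+)/(\d+)\] \[([U_]+)\]"
)


def degraded_arrays(mdstat_text):
    """Map each degraded array to the list of members flagged (F)."""
    result = {}
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line.strip())
        if not m:
            continue
        name, members, total, active, _status = m.groups()
        if int(active) < int(total):
            # Strip the "[n]" slot suffix, keep only members marked faulty.
            faulty = [d.split("[")[0] for d in members.split() if d.endswith("(F)")]
            result[name] = faulty
    return result


SAMPLE = """\
md2 : active raid1 hdb1[0] hda1[1] 264960 blocks [2/2] [UU]
md0 : active raid1 hdb5[0] hda5[1] 15936 blocks [2/2] [UU]
md1 : active raid1 hdb6[2] hda6[1](F) 38796864 blocks [2/1] [_U]
"""
print(degraded_arrays(SAMPLE))  # {'md1': ['hda6']}
```

On the output above it reports only md1 as degraded, with hda6 as the member marked faulty.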
I think I should blow this whole thing away and restore from tape, but first I want to test the drives, especially the hd?6 partitions, for validity.
Any thoughts on what tools I can use to check the physical health of these two drives, preferably non-destructively?
Sean