RAID1 recovery suggestion? This is different

christian



« on: October 17, 2008, 08:12:23 AM »

Well I had a drive (sdb2) kicked out of my raid array this evening.

To recover it I used:

mdadm --add /dev/md2 /dev/sdb2

After checking /proc/mdstat, I noticed that the recovery keeps restarting.
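For anyone following along, here is a minimal sketch of how the rebuild progress can be pulled out of /proc/mdstat. The function name `mdstat_progress` is made up for illustration, and the optional file argument exists only so the function can be tried against a saved copy of the status file.

```shell
# Print the recovery/resync progress line(s) from an mdstat-format file.
# Defaults to /proc/mdstat; pass a path to read a saved copy instead.
# mdstat_progress is a hypothetical helper name, not part of mdadm.
mdstat_progress() {
    grep -E '(recovery|resync) *=' "${1:-/proc/mdstat}"
}

# Usage on the live system:
#   mdstat_progress
#   mdadm --detail /dev/md2    # per-member state of the array
```

If the percentage on that line keeps dropping back to the start, the rebuild is restarting rather than progressing.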

So checking dmesg I find:

Code: [Select]

SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0x3ffffff SErr 0x0 action 0x0
ata1.00: (irq_stat 0x40000008)
ata1.00: cmd 60/c8:78:05:42:03/00:00:00:00:00/40 tag 15 cdb 0x0 data 102400 in
         res 41/40:00:05:42:03/da:00:00:00:00/40 Emask 0x9 (media error)
ata1.00: configured for UDMA/133
SCSI error : <0 0 0 0> return code = 0x8000002
Info fld=0x4000000 (nonstd), Invalid sda: sense = 72 11
end_request: I/O error, dev sda, sector 213509
ata1: EH complete
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
SCSI device sda: drive cache: write back
raid1: sda: unrecoverable I/O read error for block 4608
md: md2: sync done.

I think block 4608 is where the recovery keeps restarting.
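A rough sketch of how the failing sectors could be collected from the kernel log, assuming the messages look like the `end_request: I/O error, dev sda, sector ...` line above. The helper name `failed_sectors` is hypothetical.

```shell
# Read kernel log text on stdin and print each distinct failing sector
# reported for one disk, in numeric order.
# Matches lines such as: end_request: I/O error, dev sda, sector 213509
# failed_sectors is an illustrative name, not an existing tool.
failed_sectors() {
    dev=$1
    sed -n "s|.*I/O error, dev $dev, sector \([0-9][0-9]*\).*|\1|p" | sort -n -u
}

# Usage:
#   dmesg | failed_sectors sda
```

A short, stable list of sectors suggests a localized surface defect; a list that grows between runs suggests the disk is deteriorating.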

So this looks like a conundrum: sdb2 failed and was kicked out, and the array is unable to rebuild from sda. Though I should note the system does boot! It is just very, very busy trying to re-add sdb2.

Any ideas on how I might be able to recover this?

Should I run fsck (or even e2fsck) on sda? If so what would be the proper command line to perform a safe recovery?

Is there a better way?

Needless to say, I will be replacing them with new hard drives.

Christian


SME since 2003

christian


Re: RAID1 recovery suggestion? This is different

« Reply #1 on: October 18, 2008, 01:32:03 AM »

So after some research, this is what I'll do.

Note: smartctl tells me I have surface errors on both drives.
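As a sketch of what to look for in the SMART data, the filter below picks out the attributes that usually reflect surface damage from `smartctl -A` output. It assumes the usual attribute table layout (name in column 2, raw value in column 10); attribute names vary by drive vendor, and `surface_attrs` is a made-up helper name.

```shell
# Read `smartctl -A` output on stdin and print the name and raw value of
# the attributes that usually indicate surface damage.
# Assumes the attribute name is field 2 and the raw value is field 10.
# surface_attrs is a hypothetical helper, not part of smartmontools.
surface_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

# Usage:
#   smartctl -A /dev/sda | surface_attrs
#   smartctl -A /dev/sdb | surface_attrs
```

Non-zero raw values for pending or reallocated sectors are consistent with the media errors in the dmesg output.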

I probably trashed sdb when I tried to re-add it to the array, so I'm treating it as a new disk for now.
I've booted into SME rescue mode.
I'm making a clone of my sort-of-working sda using:

Code: [Select]

dd if=/dev/sda of=/dev/sdb bs=512 conv=noerror,sync

I'll then verify that the new image boots as well as or better than the original sda.
I'll then try to see if fsck will clean up sda.
If not, then I'm hoping the dd procedure cleaned up the bad blocks on the copy (in theory the drive firmware should).
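The dd step above can be sketched as a small wrapper so it can be rehearsed on ordinary files before touching real disks. The name `clone_with_dd` is hypothetical; the dd options are the same ones used in the command above.

```shell
# Copy src to dst in 512-byte blocks, continuing past read errors.
# conv=noerror keeps going after a failed read; conv=sync pads each short
# or failed block with zeros so later data stays at the same offset.
# clone_with_dd is an illustrative wrapper name.
clone_with_dd() {
    dd if="$1" of="$2" bs=512 conv=noerror,sync
}

# Rehearsal on plain files:
#   clone_with_dd source.img copy.img && cmp source.img copy.img
# Against real disks (overwrites the target, so double-check device names):
#   clone_with_dd /dev/sda /dev/sdb
```

With bs=512 an unreadable sector costs only that one sector in the copy, at the price of a slower transfer. The copy still contains zero-filled gaps wherever the source could not be read, so a filesystem check on the copy is likely to be needed.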

If either goes well, then I'll use one of the disks to sync up the new drives I just got (same type).
If not then I will at least have my data and begin a new build of the server. I just hope it doesn't come to that!

I'll let you know how it goes in case someone else comes across this.

And of course if you have any input, let me know.

BTW, it looks to me like block 4608 maps to inode 8 which I think may be the journal for sda1.
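For reference, a sketch of the arithmetic for turning a disk sector from the kernel log into a filesystem block number. The partition start sector and filesystem block size below are illustrative assumptions, not values taken from this system, and `sector_to_fsblock` is a made-up helper name.

```shell
# Convert an absolute disk sector (512-byte units) into a filesystem block
# number inside a partition.
#   $1 = sector reported by the kernel
#   $2 = first sector of the partition (assumed value, check with fdisk -lu)
#   $3 = filesystem block size in bytes (assumed value)
# sector_to_fsblock is a hypothetical helper name.
sector_to_fsblock() {
    sector=$1
    part_start=$2
    block_size=$3
    echo $(( (sector - part_start) * 512 / block_size ))
}

# Usage with assumed partition start 63 and 4096-byte blocks:
#   sector_to_fsblock 213509 63 4096
```

Once a filesystem block number is known, `debugfs -R "icheck <block>" <device>` reports which inode owns it.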

Christian


christian


Re: RAID1 recovery suggestion? This is different

« Reply #2 on: October 18, 2008, 03:53:13 PM »

Quote from: christian on October 18, 2008, 01:32:03 AM

If not then I will at least have my data and begin a new build of the server. I just hope it doesn't come to that!

Well, it came to that. The dd procedure resulted in a non-bootable disk, which is not a complete surprise given where the bad blocks were. So I took the safer course of action: I rebuilt the server and used the procedure at http://wiki.contribs.org/UpgradeDisk#Copying_from_7.x_to_7.x to recover my data.

I feel better about it anyway as the bad blocks were so early on the disk (sda1) that I couldn't be sure that I wouldn't ultimately be propagating some new corruption.

Plus I get to go back and retest a bunch of our How-Tos and contribs now. Yippee!

Christian
