Koozali.org: home of the SME Server

RAID drive failure, diagnosing

Dan Brown

RAID drive failure, diagnosing
« on: July 30, 2003, 07:01:30 PM »
Early this morning, one of the drives on my software RAID 1 apparently had an accident, as reported by raidmonitor:

Personalities : [raid1]
read_ahead 1024 sectors
md1 : active raid1 hda6[1] hdc6[0](F)
      80132096 blocks [2/1] [_U]
     
md0 : active raid1 hda5[1] hdc5[0]
      15936 blocks [2/2] [UU]
     
md2 : active raid1 hda1[1] hdc1[0]
      264960 blocks [2/2] [UU]
     
unused devices: <none>

Taking a look through the log, I get:

[root@e-smith dan]# grep hdc6 /var/log/messages
Jul 30 00:34:56 e-smith kernel: raid1: Disk failure on hdc6, disabling device.
Jul 30 00:34:56 e-smith kernel: raid1: hdc6: rescheduling block 53120
Jul 30 00:34:56 e-smith kernel: md: (skipping faulty hdc6 )

[root@e-smith dan]# grep hda6 /var/log/messages
Jul 30 00:34:56 e-smith kernel: md: hda6 [events: 00000036]<6>(write) hda6's sb offset: 80132096
Jul 30 00:34:56 e-smith kernel: raid1: hda6: redirecting sector 53120 to another mirror

From the above, it looks like hdc is the drive with the failure.  The only thing I'm uncertain about is the "hda6: redirecting sector" message.  A quick Google search for information about Linux software RAID didn't turn up much about interpreting /proc/mdstat (in particular, how to determine which drive has failed).  Can anybody confirm my understanding, and/or point me to some better resources on this?  Thanks!
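
If I'm reading it right, the (F) after hdc6[0] in the mdstat output marks the member the kernel has kicked out of md1, and [2/1] [_U] means only one of the two mirrors is still active (the underscore is the failed slot, the U the good one).  Assuming the stock raidtools on this box (I haven't double-checked the command names here), the failed partition could be dropped from the array cleanly before the drive comes out with something like:

[root@e-smith dan]# raidhotremove /dev/md1 /dev/hdc6

hdc5 and hdc1 are still listed as active in md0 and md2, so if I recall the raidtools commands correctly they would need raidsetfaulty first, then raidhotremove, before the drive is physically pulled.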

Michael Smith

Re: RAID drive failure, diagnosing
« Reply #1 on: July 31, 2003, 07:50:07 AM »
Just looking at the messages, isn't it the secondary master (hdc) that has failed?  What happens if you disconnect it & boot?

Dan Brown

Re: RAID drive failure, diagnosing
« Reply #2 on: July 31, 2003, 08:55:13 AM »
Yes, it appears that hdc is the one that failed.  The only reason I'm at all uncertain about that is the "hda6: redirecting sector" message.  I'm guessing that's supposed to mean it can't mirror that sector to hdc6 (it's the same sector number as the block that was "rescheduled" on hdc6).  I haven't tried removing the drive yet.

Michael Smith

Re: RAID drive failure, diagnosing
« Reply #3 on: July 31, 2003, 07:38:45 PM »
I'm afraid I can't help you there.  I thought perhaps this might've been one of those situations where you (not specifically YOU, Dan, but the generic "you") had gone over & over & over the problem so many times that it became a "can't see the forest for the trees" type of problem.  "Been there, done that."

Dan Brown

Re: RAID drive failure, diagnosing
« Reply #4 on: July 31, 2003, 07:56:46 PM »
I'll try taking the drive out--it's going to need to come out to go back to IBM anyway...  Thanks for the input!

Dan Brown

Re: RAID drive failure
« Reply #5 on: August 02, 2003, 08:00:53 AM »
Figuring that hdc was the bad drive, I downed the server this evening and took it out.  Then I tried to boot from just hda.  Didn't get very far--LILO only printed "L" and then stopped.  According to http://www.numenor.demon.co.uk/ccfaq/troubleshooting.htm, this translates to:

L -  The first step has been loaded and started but the second step (/boot/boot.b) could not be loaded. This normally points to a physical error on the boot device or a faulty disk geometry.

So, I reattached hdc, booted the machine (no problems, though it complained about hdc6 not being available for the array), took a look at lilo.conf (it does say boot=/dev/hda, which looks right), and ran /sbin/lilo (which ran without errors).

I then powered down, detached hdc, and tried to boot again, with the same results.

OK, out comes the SME boot floppy, created during installation.  I stuck that in the drive and tried to boot from that, accepting the defaults (just pressed Enter at the boot prompt).  This time, it went to:

Uncompressing Linux...  Ok, booting the kernel

then froze.  Entering "mitel root=/dev/md1" and "mitel root=/dev/hda6" gave the same result.

Next attempt was to boot from a Knoppix CD.  I was able to access hda6 and view all the files there without any difficulties.

I obviously need to remove the drive to ship it to IBM, but leaving the server down for a couple of weeks until it comes back isn't exactly an attractive option--and even if it were acceptable, I expect I'd have the same problem with a blank drive installed as hdc.  Any pointers on where to go from here would be greatly appreciated!
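
Since Knoppix can see hda6 without trouble, one thing I may try next is re-running lilo from a chroot under Knoppix instead of from the installed system.  A rough sketch, assuming /dev/hda6 really is the root filesystem (as the root= attempts above suggest) and guessing that hda5 is the /boot partition (adjust to the real layout):

mount /dev/hda6 /mnt
mount /dev/hda5 /mnt/boot   # guess: the small md0 member looks like /boot
chroot /mnt /sbin/lilo -v
umount /mnt/boot /mnt

No idea yet whether that would turn out any different from running lilo on the installed system, though.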

John Alamo

Re: RAID drive failure
« Reply #6 on: August 02, 2003, 11:57:37 AM »
Just a few thoughts ...

Are you sure you're pulling the faulty drive?

Could there be a hardware issue (i.e. the jumper on the drive is set to cable select, or to master-with-slave / slave-with-master), where the system is reassigning the drive's location and the device paths become invalid?

Did you manually change the boot partition in lilo.conf? From my understanding of software RAID, this shouldn't be necessary, as the "superblock" (md2?) will determine a valid online drive. On my e-smith box, I am showing boot=/dev/md0 & root=/dev/md1.

The Software RAID HOWTO has some tips & pointers that might be of help (http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html). In particular, it explains that if the RAID superblocks get out of sync, you will not be able to boot off a RAID device and will have to run mkraid --force (section 6.1).
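
One way to sanity-check whether the superblocks have drifted apart, assuming the mdadm tool happens to be installed (it may well not be on an older SME box), is to dump the superblock from each member of md1 and compare the event counters:

mdadm --examine /dev/hda6
mdadm --examine /dev/hdc6

If the "Events" counts differ wildly, that out-of-sync warning from the HOWTO would apply.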

Boris

Re: RAID drive failure
« Reply #7 on: August 03, 2003, 03:25:41 PM »
>> (it does say boot=/dev/hda, which looks right)
If you have a mirror, this should read boot=/dev/md0.
All the device references in lilo.conf should point to /dev/mdX, not to /dev/hdX.
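
For illustration, a stripped-down lilo.conf along those lines might look something like the following; the kernel file name and label are only placeholders, and the file an SME box actually generates will have more in it:

boot=/dev/md0
prompt
timeout=50
default=linux

# placeholder kernel file name; use whatever actually sits in /boot
image=/boot/vmlinuz-2.4.xx
    label=linux
    root=/dev/md1
    read-only

After editing, /sbin/lilo has to be re-run so the boot record actually gets rewritten.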

Dan Brown

Re: RAID drive failure
« Reply #8 on: August 06, 2003, 05:38:59 AM »
Boris, that was it--don't know why it was set to /dev/hda, but changing it to /dev/md0 and re-running lilo allowed it to boot from only hda, without hdc installed.  Thanks!
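
For the record, the plan for when the replacement drive comes back from IBM (taken from the Software RAID HOWTO, so treat it as a sketch rather than gospel) is to copy the partition table over from hda, hot-add each partition back into its array, and then watch /proc/mdstat while the mirrors resync:

[root@e-smith dan]# sfdisk -d /dev/hda | sfdisk /dev/hdc
[root@e-smith dan]# raidhotadd /dev/md2 /dev/hdc1
[root@e-smith dan]# raidhotadd /dev/md0 /dev/hdc5
[root@e-smith dan]# raidhotadd /dev/md1 /dev/hdc6
[root@e-smith dan]# cat /proc/mdstat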

Boris

Re: RAID drive failure
« Reply #9 on: August 06, 2003, 11:02:44 PM »
You are welcome.
I am glad that I was able to (partially) return the many thanks I owe you for your how-tos and advice ;-)