Koozali.org: home of the SME Server

Very odd RAID1 problem

Des Dougan

Very odd RAID1 problem
« on: January 18, 2004, 05:42:12 AM »
I recently set up a new IBM server for a client, and installed 5.6, followed by the most recent updates. Subsequent to this, the second disk in the software RAID (2 x 40 GB drives, different models, primary master and secondary slave) became inaccessible.

It appeared that the second drive had failed, so I had it replaced with another Maxtor drive. In trying to rebuild the RAID array, I used Darrell May's how-to, and found that I had no idea of the drive geometry to use in setting the partitions. I therefore backed up to desktop and re-installed 5.6. Before I applied the updates, I checked the RAID status and saw that it was sync'ing OK. I then installed the latest SME Server upgrades, and following the reboot, I found that the second drive was no longer active.

dmesg output shows:

md: Autodetecting RAID arrays.
 [events: 00000008]
 [events: 00000008]
 [events: 00000008]
md: autorun ...
md: considering hda1 ...
md:  adding hda1 ...
md: created md0
md: bind
md: running:
md: hda1's event counter: 00000008
md0: former device [dev 16:41] is unavailable, removing from array!
md: RAID level 1 does not need chunksize! Continuing anyway.
md0: max total readahead window set to 508k
md0: 1 data-disks, max readahead per data-disk: 508k
raid1: device hda1 operational as mirror 0
raid1: md0, not all disks are operational -- trying to recover array
raid1: raid set md0 active with 1 out of 2 mirrors
md: updating md0 RAID superblock on device
md: hda1 [events: 00000009]<6>(write) hda1's sb offset: 104320
md: recovery thread got woken up ...
md0: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
md: considering hda2 ...
md:  adding hda2 ...
md: created md1
md: bind
md: running:
md: hda2's event counter: 00000008
md1: former device [dev 16:42] is unavailable, removing from array!
md: md1: raid array is not clean -- starting background reconstruction
md: RAID level 1 does not need chunksize! Continuing anyway.
md1: max total readahead window set to 508k
md1: 1 data-disks, max readahead per data-disk: 508k
raid1: device hda2 operational as mirror 0
raid1: md1, not all disks are operational -- trying to recover array
raid1: raid set md1 active with 1 out of 2 mirrors
md: updating md1 RAID superblock on device
md: hda2 [events: 00000009]<6>(write) hda2's sb offset: 38708544
md: recovery thread got woken up ...
md1: no spare disk to reconstruct array! -- continuing in degraded mode
md0: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
md: considering hda3 ...
md:  adding hda3 ...
md: created md2
md: bind
md: running:
md: hda3's event counter: 00000008
md2: former device [dev 16:43] is unavailable, removing from array!
md: RAID level 1 does not need chunksize! Continuing anyway.
md2: max total readahead window set to 508k
md2: 1 data-disks, max readahead per data-disk: 508k
raid1: device hda3 operational as mirror 0
raid1: md2, not all disks are operational -- trying to recover array
raid1: raid set md2 active with 1 out of 2 mirrors
md: updating md2 RAID superblock on device
md: hda3 [events: 00000009]<6>(write) hda3's sb offset: 264960
md: recovery thread got woken up ...
md2: no spare disk to reconstruct array! -- continuing in degraded mode
md1: no spare disk to reconstruct array! -- continuing in degraded mode
md0: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
md: ... autorun DONE.
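
For anyone reading the log above: the "[dev 16:41]" style entries identify the missing members. A minimal sketch of decoding them, assuming the 2.4 kernel prints the major and minor numbers in hexadecimal and that only the two standard IDE channels are involved (major 3 for ide0, major 22 for ide1, minors 0-63 for the master and 64-127 for the slave). The function name decode_md_dev is made up for this illustration.

```shell
#!/bin/sh
# Sketch: turn a "[dev MM:mm]" pair from a 2.4-kernel md message into an
# IDE device name. Assumes both numbers are hexadecimal and that only the
# two standard IDE channels (major 3 = ide0, major 22 = ide1) are in play.
# decode_md_dev is a hypothetical helper name, not part of any tool.
decode_md_dev() {
    major=$((0x${1%%:*}))        # text before the colon, e.g. 16 -> 22
    minor=$((0x${1##*:}))        # text after the colon,  e.g. 41 -> 65
    unit=$((minor / 64))         # 0 = master, 1 = slave on that channel
    part=$((minor % 64))         # 0 = whole disk, otherwise partition number
    case "$major:$unit" in
        3:0)  disk=hda ;;
        3:1)  disk=hdb ;;
        22:0) disk=hdc ;;
        22:1) disk=hdd ;;
        *)    echo "unknown"; return 1 ;;
    esac
    if [ "$part" -eq 0 ]; then
        echo "$disk"
    else
        echo "$disk$part"
    fi
}
```

Under those assumptions, "decode_md_dev 16:41" prints hdd1, and 16:42 and 16:43 come out as hdd2 and hdd3, which suggests the missing half of every mirror is the drive on the secondary slave.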

The only other change I made was to replace the shipped tg3.0 Broadcom Ethernet driver with IBM's own driver.

I am very suspicious that what initially appeared to be a hardware fault is actually somehow related to the 5.6U5 upgrade.

I'd appreciate any thoughts that would assist in resolving this.

Thanks,

Des Dougan

Nick Ramsay

Re: Very odd RAID1 problem
« Reply #1 on: January 18, 2004, 07:11:34 PM »
Hmm, I've never tried doing an update before the RAID has finished the initial sync, but I wouldn't have expected it to break the mirror like this.


What happens if you try to manually add in the broken mirror disk?

/sbin/raidhotadd /dev/md2 /dev/hdd1
/sbin/raidhotadd /dev/md0 /dev/hdd5
/sbin/raidhotadd /dev/md1 /dev/hdb6

That last one is the biggie!

If the partition table isn't corrupted, that should reconstruct the mirror.
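
Before and after a hot-add it helps to know which arrays are actually degraded. A rough sketch, assuming /proc/mdstat reports each array's member count as "[total/active]" (for example "[2/1]" for a mirror running on one disk); list_degraded is a hypothetical helper name, not part of raidtools.

```shell
#!/bin/sh
# Sketch: list md arrays that are running with fewer active members than
# configured. Reads /proc/mdstat-style text on stdin and relies on the
# "[total/active]" counter (e.g. "[2/1]"); list_degraded is a hypothetical
# helper name.
list_degraded() {
    awk '
        /^md[0-9]+ :/ { array = $1 }            # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {       # find the [total/active] field
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print array
        }
    '
}

# Typical use on a live system:
#   list_degraded < /proc/mdstat
```

An empty result would mean every array reports its full set of members; any name printed is still running on a reduced mirror.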

Can I also ask why you are not using update 6?

Des Dougan

Re: Very odd RAID1 problem
« Reply #2 on: January 18, 2004, 09:08:10 PM »
Nick Ramsay wrote:

> Hmm, I've never tried doing an update before the RAID has
> finished the initial sync, but I wouldn't have expected it to
> break the mirror like this.
>
>
> What happens if you try to manually add in the broken mirror
> disk?
>
> /sbin/raidhotadd /dev/md2 /dev/hdd1
> /sbin/raidhotadd /dev/md0 /dev/hdd5
> /sbin/raidhotadd /dev/md1 /dev/hdb6
>
> That last one is the biggie!
>
> If the partion table isn't corrupted, that should reconstruct
> the mirror.

/sbin/raidhotadd /dev/md0 /dev/hdd1
/dev/md0: can not hot-add disk: invalid argument.

It looks like the disk is not being seen, so perhaps it is a physical problem with the controller? (Note, for whatever reason, the filesystems have been created on hda1, hda2 and hda3 on the first drive).
 
> Can I also ask why you are not using update 6?

Probably because I wasn't paying attention - I didn't realize U6 had been issued. The ibiblio mirror has a Dec. 8 timestamp on it, but I don't recall seeing it announced.

Thanks for your reply.


Des Dougan

Nick Ramsay

Re: Very odd RAID1 problem
« Reply #3 on: January 19, 2004, 10:40:42 PM »
Des Dougan wrote:


> > /sbin/raidhotadd /dev/md2 /dev/hdd1
> > /sbin/raidhotadd /dev/md0 /dev/hdd5
> > /sbin/raidhotadd /dev/md1 /dev/hdb6
> >
>
> /sbin/raidhotadd /dev/md0 /dev/hdd1
> /dev/md0: can not hot-add disk: invalid argument.
>

Is this what you actually typed? If so, it's telling you that md0 isn't linked to /dev/hdd1 - it should be /dev/hdd5 as I said above.

> It looks like the disk is not being seen, so perhaps it is a
> physical problem with the controller? (Note, for whatever
> reason, the filesystems have been created on hda1, hda2 and
> hda3 on the first drive).
>  

What does fdisk -l /dev/hdd tell you?

What does smartctl -a /dev/hdd report?

> > Can I also ask why you are not using update 6?
>
> Probably because I wasn't paying attention - I didn't realize
> U6 had been issued. The ibiblio mirror has a Dec. 8 timestamp
> on it, but I don't recall seeing it announced.
>

It's actually been out for a while now (at least 3 months, IIRC).

Des Dougan

Re: Very odd RAID1 problem
« Reply #4 on: January 20, 2004, 07:31:43 AM »
Nick,

[root@whiteley root]# fdisk -l /dev/hdd
[root@whiteley root]# fdisk -l /dev/hda

Disk /dev/hda: 255 heads, 63 sectors, 4865 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/hda1   *         1        13    104391   fd  Linux raid autodetect
/dev/hda2            14      4832  38708617+  fd  Linux raid autodetect
/dev/hda3          4833      4865    265072+  fd  Linux raid autodetect
[root@whiteley root]#

As can be seen above, /dev/hdd doesn't seem to be recognized at all. As far as the partitions go, I noted in my last message that for some reason, I have hda1, 2 and 3 rather than 1, 5 and 6 (which I've got on my own server), but that is how SME set it up.
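
The Blocks column in the fdisk listing can be reproduced from the geometry it prints, which is useful when a replacement disk has to be partitioned to match. A minimal sketch, assuming 512-byte sectors, 1 KiB blocks, and that the partition starting at cylinder 1 skips the first track; part_blocks is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: reproduce fdisk's "Blocks" column from a cylinder range.
# Assumes the geometry printed above (255 heads, 63 sectors per track, so
# 16065 sectors per cylinder), 512-byte sectors and 1 KiB blocks, and that
# a partition starting at cylinder 1 skips the first track (63 sectors).
# part_blocks is a hypothetical helper name.
part_blocks() {
    start=$1
    end=$2
    sectors=$(( (end - start + 1) * 16065 ))
    if [ "$start" -eq 1 ]; then
        sectors=$((sectors - 63))      # first track holds the partition table
    fi
    echo $((sectors / 2))              # two 512-byte sectors per 1 KiB block
}
```

With the ranges above, "part_blocks 1 13", "part_blocks 14 4832" and "part_blocks 4833 4865" give 104391, 38708617 and 265072. The trailing + in fdisk's output marks an odd sector count, i.e. the half block that integer division drops here.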

I found a link on the forum to a Hitachi/IBM non-destructive disk utility, which I will run against both disks and see what it gives me.

It's unlikely that a second disk has failed immediately, so it may be a controller issue on the motherboard.

Thanks, again,

Des

NickR

Re: Very odd RAID1 problem
« Reply #5 on: January 20, 2004, 09:21:41 AM »
Quote from: "Des Dougan"
> Nick,
>
> [root@whiteley root]# fdisk -l /dev/hdd
> [root@whiteley root]# fdisk -l /dev/hda
>
> Disk /dev/hda: 255 heads, 63 sectors, 4865 cylinders
> Units = cylinders of 16065 * 512 bytes
>
>    Device Boot    Start       End    Blocks   Id  System
> /dev/hda1   *         1        13    104391   fd  Linux raid autodetect
> /dev/hda2            14      4832  38708617+  fd  Linux raid autodetect
> /dev/hda3          4833      4865    265072+  fd  Linux raid autodetect
> [root@whiteley root]#
>
> As can be seen above, /dev/hdd doesn't seem to be recognized at all. As far as the partitions go, I noted in my last message that for some reason, I have hda1, 2 and 3 rather than 1, 5 and 6 (which I've got on my own server), but that is how SME set it up.

Ah, yes, there was a change in the partitioning scheme between 5.1.2 and 5.5 - sorry about the confusion there. My SME 6 server is running on SCSI mirrored disks, so I used my 5.1.2 server as my example to avoid confusion, and failed miserably!

Quote from: "Des Dougan"
> I found a link on the forum to a Hitachi/IBM non-destructive disk utility, which I will run against both disks and see what it gives me.
>
> It's unlikely that a second disk has failed immediately, so it may be a controller issue on the motherboard.
>
> Thanks, again,
>
> Des

Yes, that disk tester is good, but I'm not sure how well it reports a controller problem.
Just a final thought - you do actually have a secondary master, don't you?  My next step would be to make the problem disk the secondary master and see if the behaviour changes.
--
Nick......

Kelvin

Very odd RAID1 problem
« Reply #6 on: January 20, 2004, 09:48:07 AM »
Hi Des,

What does dmesg say about the hard disks (before the md parts)?

Does it even show that the 2 hard disks are detected? What else have you got connected to the IDE chain?

Your initial post suggests:
Pri - Master : HDD
Pri - Slave : ?
Sec - Master : ?
Sec - Slave : HDD

What is on Sec - Master? HDDs don't always like being made slaves to non-hard-disk devices (depending on the devices).

Kelvin

ddougan

Very odd RAID1 problem
« Reply #7 on: January 21, 2004, 05:45:24 AM »
Nick and Kelvin,

Thanks for your replies.

The secondary master is the CD drive, with nothing on the primary slave.

dmesg shows:

PCI_IDE: unknown IDE controller on PCI bus 00 device f9, VID=8086, DID=24cb
PCI_IDE: chipset revision 2
PCI_IDE: not 100% native mode: will probe irqs later
    ide0: BM-DMA at 0xa480-0xa487, BIOS settings: hda:DMA, hdb:pio
    ide1: BM-DMA at 0xa488-0xa48f, BIOS settings: hdc:DMA, hdd:DMA
hda: IC35L060AVV207-0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: 78156288 sectors (40016 MB) w/1821KiB Cache, CHS=77536/255/63
ide-floppy driver 0.99.newide
Partition check:
 hda: hda1 hda2 hda3
Floppy drive(s): fd0 is 1.44M
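
A quick way to see which IDE positions the kernel actually found is to look for the per-drive announcement lines. A sketch, assuming 2.4-style boot messages of the form "hda: MODEL, ATA DISK drive" and the usual naming (hda/hdb on the primary channel, hdc/hdd on the secondary); ide_report is a hypothetical name.

```shell
#!/bin/sh
# Sketch: report which of the four standard IDE positions announced a
# drive in boot-log text read from stdin. Assumes 2.4-style lines such as
# "hda: IC35L060AVV207-0, ATA DISK drive"; ide_report is a hypothetical name.
ide_report() {
    log=$(cat)
    for entry in "hda:primary master" "hdb:primary slave" \
                 "hdc:secondary master" "hdd:secondary slave"; do
        dev=${entry%%:*}         # device name before the colon
        pos=${entry#*:}          # channel position after the colon
        if printf '%s\n' "$log" | grep -q "^$dev: .*drive"; then
            echo "$dev ($pos): detected"
        else
            echo "$dev ($pos): not detected"
        fi
    done
}

# Typical use on a live system:
#   dmesg | ide_report
```

Fed the excerpt above, it reports only hda as detected.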

I will change the master/slave setting on Thursday when I visit the client, and report back.

Thanks for all the suggestions.

Des
Des Dougan

ddougan

Very odd RAID1 problem
« Reply #8 on: January 23, 2004, 09:45:43 AM »
Quote from: "ddougan"
> Nick and Kelvin,
>
> I will change the master/slave setting on Thursday when I visit the client, and report back.

I swapped the settings on the secondary IDE, and the second drive is now recognized. Something in 5.6U5 must have impacted this. I have posted a bug, suggesting that the docs/FAQs be updated to identify the need to have both disks on master.

Thanks for all the help.

Des
Des Dougan

NickR

Very odd RAID1 problem
« Reply #9 on: January 23, 2004, 09:55:08 AM »
Quote from: "ddougan"
> I swapped the settings on the secondary IDE, and the second drive is now recognized. Something in 5.6U5 must have impacted this. I have posted a bug, suggesting that the docs/FAQs be updated to identify the need to have both disks on master.
>
> Thanks for all the help.
>
> Des

Hmm, I wonder if it's not really a bug, and you just got lucky on the initial install.  Personally, I always set disks to be master.

Anyway, glad you got it resolved.
--
Nick......