Koozali.org: home of the SME Server

Obsolete Releases => SME Server 7.x => Topic started by: jahlewis on October 17, 2006, 03:23:44 AM

Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 03:23:44 AM
Not sure what is going on here, or what to do.  Can any of you RAID gurus interpret this?

Code: [Select]
Current RAID status:

Personalities : [raid1]
md1 : active raid1 hdb2[1]
      155918784 blocks [2/1] [_U]

md2 : active raid1 hdb3[1] hda3[0]
      264960 blocks [2/2] [UU]

md0 : active raid1 hdb1[1] hda1[0]
      104320 blocks [2/2] [UU]

unused devices: <none>

There should be two RAID devices, not 3
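For reference: in /proc/mdstat notation, [2/1] means two members are expected but only one is active, and the underscore in [_U] marks the missing slot (slot 0, which here is the hda partition, since only hdb2[1] is listed). Two read-only commands confirm which device dropped out; a sketch, nothing here modifies the array:
Code: [Select]
cat /proc/mdstat           # quick overview of all md devices
mdadm --detail /dev/md1    # per-array view; the missing member shows as "removed"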

-----------------------------------------------------------------------------------
I did get an email on reboot from mdadm monitoring:
Code: [Select]
Subject: DegradedArray event on /dev/md1:gluon.arachnerd.org
This is an automatically generated mail message from mdadm running on gluon.arachnerd.org.

A DegradedArray event has been detected on md device /dev/md1.
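For reference, these mails come from mdadm's monitor mode. The alert address normally lives in /etc/mdadm.conf, and a one-off test run confirms alerts are being generated and mailed; a sketch only, and SME may manage this file via a template, so check before editing it by hand:
Code: [Select]
# /etc/mdadm.conf (usual location)
MAILADDR admin@example.com        # example address only

# one-off test: send a TestMessage alert for every array, check once, then exit
mdadm --monitor --scan --oneshot --test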

Here is my current filesystem setup:
Code: [Select]
[root@gluon]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              147G  8.9G  131G   7% /
/dev/md0               99M   32M   63M  34% /boot
none                  315M     0  315M   0% /dev/shm
/dev/hdd1             230G   63G  156G  29% /mnt/bigdisk

And here are some details on the RAID settings for md0 and md1 (md2 is just like md0):
Code: [Select]
[root@gluon]# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Thu Jan 12 19:26:31 2006
     Raid Level : raid1
     Array Size : 104320 (101.88 MiB 106.82 MB)
    Device Size : 104320 (101.88 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Oct 16 18:38:10 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0


    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1       3       65        1      active sync   /dev/hdb1
           UUID : 5139bc2e:39939d3e:5abd791c:3ce0a6ef
         Events : 0.3834


Code: [Select]
[root@gluon]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Thu Jan 12 19:21:55 2006
     Raid Level : raid1
     Array Size : 155918784 (148.70 GiB 159.66 GB)
    Device Size : 155918784 (148.70 GiB 159.66 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 16 18:27:38 2006
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0


    Number   Major   Minor   RaidDevice State
       0       0        0       -1      removed
       1       3       66        1      active sync   /dev/hdb2
           UUID : 0a968a22:d1b0d2bd:ab248bae:ec482cc1
         Events : 0.12532934
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 03:52:27 AM
OK, I'm reading like crazy here...

As I interpret this, /dev/md1 is broken, with /dev/hda2 not being mirrored.

However, if I try to add hda2 back to md1, I get an invalid argument error:

Code: [Select]
[root@gluon]# mdadm -a /dev/md1 /dev/hda2
mdadm: hot add failed for /dev/hda2: Invalid argument


So... I tried removing the partition first:
Code: [Select]
[root@gluon]# mdadm /dev/md1 -r /dev/hda2 -a /dev/hda2
mdadm: hot remove failed for /dev/hda2: No such device or address


So now what?  Is the / partition on hda hosed?  How do I rebuild that?  I'm quickly getting out of my depth here...
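For what it's worth, when mdadm reports a bare "Invalid argument" on a hot add, the kernel log usually carries the real reason. A quick, read-only way to look (log locations are the usual defaults):
Code: [Select]
dmesg | tail -20                          # most recent kernel messages
grep 'md:' /var/log/messages | tail -20   # md-related lines from syslog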
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 04:22:48 AM
FWIW
Code: [Select]
[root@gluon init.d]# mdadm -E /dev/hdb2
/dev/hdb2:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0a968a22:d1b0d2bd:ab248bae:ec482cc1
  Creation Time : Thu Jan 12 19:21:55 2006
     Raid Level : raid1
    Device Size : 155918784 (148.70 GiB 159.66 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

    Update Time : Mon Oct 16 18:27:38 2006
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
       Checksum : b0a4fa9a - correct
         Events : 0.12532934


      Number   Major   Minor   RaidDevice State
this     1       3       66        1      active sync   /dev/hdb2
   0     0       0        0        0      removed
   1     1       3       66        1      active sync   /dev/hdb2


Code: [Select]
[root@gluon init.d]# mdadm -E /dev/hda2
/dev/hda2:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0a968a22:d1b0d2bd:ab248bae:ec482cc1
  Creation Time : Thu Jan 12 19:21:55 2006
     Raid Level : raid1
    Device Size : 155918784 (148.70 GiB 159.66 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1

    Update Time : Sun Oct 15 21:07:07 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : b0a3ce33 - correct
         Events : 0.12532928


      Number   Major   Minor   RaidDevice State
this     0       3        2        0      active sync   /dev/hda2
   0     0       3        2        0      active sync   /dev/hda2
   1     1       3       66        1      active sync   /dev/hdb2
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 04:40:40 AM
Last post tonight...

Reading this http://www.linuxquestions.org/questions/showthread.php?t=429857

It suggests running mdadm -C if all else fails... So I did:
Code: [Select]
[root@gluon init.d]# mdadm -C /dev/md1 -l1 -n2 /dev/hda2 /dev/hdb2
mdadm: /dev/hda2 appears to contain an ext2fs file system
    size=155918784K  mtime=Mon Oct 16 18:27:39 2006
mdadm: /dev/hda2 appears to be part of a raid array:
    level=1 devices=2 ctime=Thu Jan 12 19:21:55 2006
mdadm: /dev/hdb2 appears to contain an ext2fs file system
    size=155918784K  mtime=Sun Oct 15 20:33:12 2006
mdadm: /dev/hdb2 appears to be part of a raid array:
    level=1 devices=2 ctime=Thu Jan 12 19:21:55 2006
Continue creating array?


And I chickened out, afraid of wiping the contents of the surviving partition.  Does anyone know what would happen if I chose to continue?

Thanks for your patience
Title: RAID issue - need help recovering degraded array
Post by: crazybob on October 17, 2006, 02:17:33 PM
When I had a drive with a failed section in the RAID, I removed the problem drive and ran a program called HDD Regenerator (http://www.dposoft.net/) on it. When I put the drive back in, it was detected as a new drive, and the RAID was rebuilt without issue. You could run the program with the drive in place, depending on how long you can afford to go without the server being available. HDD Regenerator can take quite a while depending on drive size.
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 10:26:29 PM
Couple of things...

The drives are OK, since the other partitions on hda are working, so it is just a bad partition (hda2) that is attached to the md1 mirror set.

I have no idea which is hda and which is hdb in my system, so wouldn't know which to unplug.

Is the best course to stop the mirroring, make the hdb disk the primary, reformat hda, then add it back to the mirror?  If this is the case, can anyone point me in the right direction?

Thanks.
Title: RAID issue - need help recovering degraded array
Post by: ldkeen on October 17, 2006, 11:02:14 PM
jahlewis,
Can you post the partition info from /dev/hda using
Code: [Select]
# fdisk /dev/hda
followed by "p" to print the partition info.
Quote from: "jahlewis"
I have no idea which is hda

Both your hard drives are on the same cable (which is highly discouraged). Most of the time hda is the drive at the end of the cable and hdb the one in the middle, but if you're unsure, check the jumper settings on both drives to make sure.
Code: [Select]
#mdadm -a /dev/md1 /dev/hda2
That should have done the trick. I'm trying to work out why you have 3 raid devices instead of 2. Are you running version 7.0? It looks like /dev/md2 must be your swap.
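If the straight hot add keeps failing, a commonly used sequence is to make sure the stale member is fully out of the array, wipe its old RAID superblock, and add it back. A sketch only, and only worth running once you're satisfied the disk itself is healthy:
Code: [Select]
mdadm /dev/md1 --fail /dev/hda2 --remove /dev/hda2   # may report "no such device" if it is already out
mdadm --zero-superblock /dev/hda2                    # clear the stale RAID metadata on the dropped partition
mdadm --add /dev/md1 /dev/hda2                       # re-add it; a resync from hdb2 should start
watch cat /proc/mdstat                               # follow the rebuild progress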
Lloyd
Title: RAID issue - need help recovering degraded array
Post by: raem on October 17, 2006, 11:35:52 PM
ldkeen & jahlewis

>  I'm trying to work out why you have 3 raid devices instead of 2.
> Are you running version 7.0?

Assuming SME 7 (as this is posted in the SME 7 forum), it looks like the server was upgraded from SME 6.x. The 3-partition format has been retained, as the upgrade process did not convert it.
It will NOT be possible to simply remove and replace a drive and have the system automatically rebuild the array using the admin console menu. That only works for new SME 7 installs (or new SME 7 installs plus a restore from 6.x), where there are 2 partitions.

You will have to rebuild the array manually; search the forums, as there have been a few good posts on this topic recently.
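For a whole-drive replacement on this 3-partition layout, the manual rebuild usually looks roughly like the following. A sketch only, assuming hdb is the surviving disk and hda is the blank replacement; double-check device names against your own mdstat and fdisk output before running anything:
Code: [Select]
sfdisk -d /dev/hdb | sfdisk /dev/hda   # copy the partition table from the good disk to the new one
mdadm --add /dev/md0 /dev/hda1         # /boot mirror
mdadm --add /dev/md1 /dev/hda2         # / (root) mirror
mdadm --add /dev/md2 /dev/hda3         # swap mirror
cat /proc/mdstat                       # watch the resync progress

# the bootloader may also need reinstalling on the replacement disk, e.g. from the grub shell:
#   grub> root (hd0,0)
#   grub> setup (hd0)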
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 11:48:33 PM
I'm pretty sure this was a clean install during the 7.0pre or beta series, then upgraded since.  I think they are on the same IDE cable, so thanks for that info, Ray.  Is hda usually the master, and hdb the slave? I did copy over a lot of stuff from a 6.0 server, so that may be where this info is from?

My question is (and I guess I'll have to look): how do I break the mirroring/RAID while specifying that hdb should be the master?

Yes, md0 is boot, md2 is swap and md1 is /

Code: [Select]
[root@gluon ~]# fdisk /dev/hda

The number of cylinders for this disk is set to 19457.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/hda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   fd  Linux raid autodetect
/dev/hda2              14       19424   155918857+  fd  Linux raid autodetect
/dev/hda3           19425       19457      265072+  fd  Linux raid autodetect


Also, FWIW, here is what the logs say during a boot:
Code: [Select]
Oct 17 06:32:35 gluon kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
Oct 17 06:32:35 gluon kernel: md: raid1 personality registered as nr 3
Oct 17 06:32:35 gluon kernel: md: Autodetecting RAID arrays.
Oct 17 06:32:35 gluon kernel: md: could not bd_claim hda2.
Oct 17 06:32:35 gluon kernel: md: autorun ...
Oct 17 06:32:35 gluon kernel: md: considering hdb3 ...
Oct 17 06:32:35 gluon kernel: md:  adding hdb3 ...
Oct 17 06:32:35 gluon kernel: md: hdb2 has different UUID to hdb3
Oct 17 06:32:35 gluon kernel: md: hdb1 has different UUID to hdb3
Oct 17 06:32:35 gluon kernel: md:  adding hda3 ...
Oct 17 06:32:35 gluon kernel: md: hda1 has different UUID to hdb3
Oct 17 06:32:35 gluon kernel: md: created md2
Oct 17 06:32:35 gluon kernel: md: bind<hda3>
Oct 17 06:32:35 gluon kernel: md: bind<hdb3>
Oct 17 06:32:35 gluon kernel: md: running: <hdb3><hda3>
Oct 17 06:32:35 gluon kernel: raid1: raid set md2 active with 2 out of 2 mirrors
Oct 17 06:32:35 gluon kernel: md: considering hdb2 ...
Oct 17 06:32:35 gluon kernel: md:  adding hdb2 ...
Oct 17 06:32:35 gluon kernel: md: hdb1 has different UUID to hdb2
Oct 17 06:32:35 gluon kernel: md: hda1 has different UUID to hdb2
Oct 17 06:32:35 gluon kernel: md: created md1
Oct 17 06:32:35 gluon kernel: md: bind<hdb2>
Oct 17 06:32:35 gluon kernel: md: running: <hdb2>
Oct 17 06:32:35 gluon kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 17 06:32:35 gluon kernel: md: considering hdb1 ...
Oct 17 06:32:35 gluon kernel: md:  adding hdb1 ...
Oct 17 06:32:35 gluon kernel: md:  adding hda1 ...
Oct 17 06:32:35 gluon kernel: md: created md0
Oct 17 06:32:35 gluon kernel: md: bind<hda1>
Oct 17 06:32:36 gluon kernel: md: bind<hdb1>
Oct 17 06:32:36 gluon kernel: md: running: <hdb1><hda1>
Oct 17 06:32:36 gluon kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Oct 17 06:32:36 gluon kernel: md: ... autorun DONE.
Oct 17 06:32:36 gluon kernel: EXT3 FS on md0, internal journal
Oct 17 06:32:36 gluon kernel: Adding 264952k swap on /dev/md2.  Priority:-1 extents:1


Thanks guys...
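For reference, the interesting line in that log is "md: could not bd_claim hda2.": the kernel declined to grab the partition because something else appeared to be holding it, which is why md1 assembled with hdb2 alone. A few read-only checks that can narrow down what is holding it (a sketch; output will vary):
Code: [Select]
cat /proc/mdstat            # is hda2 already bound to another md device?
mount | grep hda2           # is it mounted directly somewhere?
swapon -s                   # is it in use as swap?
mdadm --examine /dev/hda2   # what does its own RAID superblock say?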
Title: RAID issue - need help recovering degraded array
Post by: raem on October 18, 2006, 12:21:44 AM
jahlewis,

> Lloyd wrote:
> mdadm -a /dev/md1 /dev/hda2
> That should have done the trick.

That looks appropriate, also see
man mdadm


Here's a good thread, see the post by Stefano
http://forums.contribs.org/index.php?topic=32572.msg138217#msg138217
Title: RAID issue - need help recovering degraded array
Post by: cheezeweeze on November 19, 2006, 05:19:13 PM
> Lloyd wrote:
> mdadm -a /dev/md1 /dev/hda2
> That should have done the trick.

Try this:
mdadm --add /dev/md1 /dev/hda2
Title: RAID issue - need help recovering degraded array
Post by: CharlieBrady on November 19, 2006, 07:10:31 PM
Quote from: "cheezeweeze"
> Lloyd wrote:
> mdadm -a /dev/md1 /dev/hda2
> That should have done the trick.

Try this:
mdadm --add /dev/md1 /dev/hda2


You should only do that if you are certain that the drive is good (and if so, why was it tossed out of the RAID array?) or if you don't care all that much about your data.
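One practical way to act on that advice is to check the disk's SMART status before re-adding it; a sketch, assuming smartmontools is available (it may need to be installed first):
Code: [Select]
smartctl -H /dev/hda            # overall health verdict
smartctl -l error /dev/hda      # any logged ATA errors
smartctl -t short /dev/hda      # start a short self-test, then a few minutes later:
smartctl -l selftest /dev/hda   # read the self-test results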
Title: RAID issue - need help recovering degraded array
Post by: mike_mattos on November 23, 2006, 11:12:40 PM
Given that vendors have problems deciding if the first drive is 0, 1, or A,
and that sometimes C may be the original drive and D the one added later
(even if D is Primary on Primary )

Is there a way to poll SME for the drive serial number?

Really helps when using Ghost to see the drive info!

Mike
Title: RAID issue - need help recovering degraded array
Post by: CharlieBrady on November 23, 2006, 11:33:45 PM
Quote from: "mike_mattos"
Given that vendors have problems deciding if the first drive is 0, 1, or A,
and that sometimes C may be the original drive and D the one added later
(even if D is Primary on Primary )


Linux doesn't use drive letters A, C, or D, and drives are identified unambiguously by primary/secondary/master/slave. Ask Google for details.
Title: RAID issue - need help recovering degraded array
Post by: mike_mattos on November 27, 2006, 08:37:33 PM
SCSI and SATA drives are harder to identify: imagine 7 identical drives on a cable where the only difference is a hidden jumper, or 6 red SATA cables neatly bundled with cable ties!

Having the drive serial number allows a printout of diagnostics and after-the-fact confirmation that the drive being replaced is actually the drive you intended, and that a brain cramp didn't lead to tracing the wrong cable or enumerating the ID jumpers in the wrong direction!

So I ask again, can you query the drive serial number on SME?
Title: RAID issue - need help recovering degraded array
Post by: CharlieBrady on November 27, 2006, 09:18:27 PM
Quote from: "mike_mattos"

So I ask again, can you query the drive serial number on SME?


The question is really "can Linux query the drive serial number?" Yes - use the sdparm command (SATA and SCSI) or the hdparm command (ATA). Note however that sdparm is not installed by default. You can find a suitable RPM here:

http://dries.ulyssis.org/rpm/packages/sdparm/info.html
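A couple of one-liners along those lines; the hdparm form matches the output shown in the next post, while the exact sdparm invocation is an assumption to verify against man sdparm once the RPM is installed:
Code: [Select]
hdparm -I /dev/hda | grep -i serial     # ATA/IDE disks
sdparm --inquiry --page=sn /dev/sda     # SCSI/SATA disks: unit serial number VPD page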
Title: RAID issue - need help recovering degraded array
Post by: mike_mattos on November 29, 2006, 06:47:17 PM
Charlie, sdparm didn't work (not found error) on either of my SME test boxes, 6 or 7, and I wasn't sure which Red Hat version to download.

As you said, hdparm won't work on the SME6 box (SCSI rather than ATA).

However, on SME7: voila!  Thanks.

Now if only it would automatically decode the RAID virtual drive!
You wouldn't happen to know if SMART is working on SME, would you?
( As in, will I get an email for a SMART error BEFORE the raid crashes?)

Code: [Select]
# hdparm -I /dev/sda1

/dev/sda1:

ATA device, with non-removable media
        Model Number:       WDC WD800JD-22MSA1
        Serial Number:      WD-WMAM9Z678073
        Firmware Revision:  10.01E01
Standards:
        Supported: 7 6 5 4
        Likely used: 7
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  156301488
        LBA48  user addressable sectors:  156301488
        device size with M = 1024*1024:       76319 MBytes
        device size with M = 1000*1000:       80026 MBytes (80 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    NOP cmd
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
           *    SMART feature set
           *    FLUSH CACHE EXT command
           *    Mandatory FLUSH CACHE command
           *    Device Configuration Overlay feature set
           *    48-bit Address feature set
                Automatic Acoustic Management feature set
                SET MAX security extension
           *    DOWNLOAD MICROCODE cmd
           *    General Purpose Logging feature set
           *    SMART self-test
           *    SMART error logging
Checksum: correct
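On the SMART question above: smartmontools' smartd can watch drives and mail on impending failure, though whether SME 7 ships or enables it out of the box is something to verify. A minimal configuration sketch (standard smartmontools syntax; example address only):
Code: [Select]
# /etc/smartd.conf
/dev/sda -a -m admin@example.com -M test   # monitor all attributes, mail alerts, send one test mail at startup
                                           # older libata setups may additionally need '-d ata'

service smartd restart                     # pick up the new config (assumes the smartd init script is present)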
Title: RAID issue - need help recovering degraded array
Post by: Boris on December 28, 2006, 07:49:58 PM
Quote from: "RayMitchell"
ldkeen & jahlewis

>  I'm trying to work out why you have 3 raid devices instead of 2.
> Are you running version 7.0?

Assuming SME 7 (as this is posted in the SME 7 forum), it looks like the server was upgraded from SME 6.x. The 3-partition format has been retained, as the upgrade process did not convert it.
It will NOT be possible to simply remove and replace a drive and have the system automatically rebuild the array using the admin console menu. That only works for new SME 7 installs (or new SME 7 installs plus a restore from 6.x), where there are 2 partitions.

You will have to rebuild the array manually; search the forums, as there have been a few good posts on this topic recently.


I just ran into exactly the same problem while upgrading from 6.0.1 to SME7.

Any suggestions on rebuilding the RAID?
Title: RAID issue - need help recovering degraded array
Post by: kruhm on December 31, 2006, 02:53:20 PM
Hi Boris,

I had to do a fresh install, then use the copyfromdisk command. Allow 2 hours of downtime.
Title: RAID issue - need help recovering degraded array
Post by: Boris on January 02, 2007, 07:19:53 PM
That's what I figured would be my best bet as well.
Fortunately this server is not critical for operations and doesn't have vital data, so it's easy to reinstall (now to version 7.1).

Thanks.
Boris.