Koozali.org: home of the SME Server

Obsolete Releases => SME Server 7.x => Topic started by: jahlewis on October 17, 2006, 03:23:44 AM

Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 03:23:44 AM
Not sure what is going on here, or what to do.  Can any of you RAID gurus interpret this?

Code: [Select]
Current RAID status:

Personalities : [raid1]
md1 : active raid1 hdb2[1]
      155918784 blocks [2/1] [_U]

md2 : active raid1 hdb3[1] hda3[0]
      264960 blocks [2/2] [UU]

md0 : active raid1 hdb1[1] hda1[0]
      104320 blocks [2/2] [UU]

unused devices: <none>

There should be two RAID devices, not 3
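For reference: in /proc/mdstat notation, [2/1] means two members are expected but only one is active, and the underscore in [_U] marks the missing slot (slot 0, which here is the hda partition, since only hdb2[1] is listed). Two read-only commands confirm which device dropped out; a sketch, nothing here modifies the array:
Code: [Select]
cat /proc/mdstat           # quick overview of all md devices
mdadm --detail /dev/md1    # per-array view; the missing member shows as "removed"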

-----------------------------------------------------------------------------------
I did get an email on reboot from mdadm monitoring:
Code: [Select]
Subject: DegradedArray event on /dev/md1:gluon.arachnerd.org
This is an automatically generated mail message from mdadm running on gluon.arachnerd.org.

A DegradedArray event has been detected on md device /dev/md1.
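For reference, these mails come from mdadm's monitor mode. The alert address normally lives in /etc/mdadm.conf, and a one-off test run confirms alerts are being generated and mailed; a sketch only, and SME may manage this file via a template, so check before editing it by hand:
Code: [Select]
# /etc/mdadm.conf (usual location)
MAILADDR admin@example.com        # example address only

# one-off test: send a TestMessage alert for every array, check once, then exit
mdadm --monitor --scan --oneshot --test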

Here is my current filesystem setup:
Code: [Select]
[root@gluon]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md1              147G  8.9G  131G   7% /
/dev/md0               99M   32M   63M  34% /boot
none                  315M     0  315M   0% /dev/shm
/dev/hdd1             230G   63G  156G  29% /mnt/bigdisk

And here are some details on the RAID settings for md0 and md1 (md2 is just like md0):
Code: [Select]
[root@gluon]# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Thu Jan 12 19:26:31 2006
     Raid Level : raid1
     Array Size : 104320 (101.88 MiB 106.82 MB)
    Device Size : 104320 (101.88 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Oct 16 18:38:10 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0


    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1       3       65        1      active sync   /dev/hdb1
           UUID : 5139bc2e:39939d3e:5abd791c:3ce0a6ef
         Events : 0.3834


Code: [Select]
[root@gluon]# mdadm -D /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Thu Jan 12 19:21:55 2006
     Raid Level : raid1
     Array Size : 155918784 (148.70 GiB 159.66 GB)
    Device Size : 155918784 (148.70 GiB 159.66 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct 16 18:27:38 2006
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0


    Number   Major   Minor   RaidDevice State
       0       0        0       -1      removed
       1       3       66        1      active sync   /dev/hdb2
           UUID : 0a968a22:d1b0d2bd:ab248bae:ec482cc1
         Events : 0.12532934
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 03:52:27 AM
OK, I'm reading like crazy here...

As I interpret this, /dev/md1 is broken, with /dev/hda2 not being mirrored.

However, if I try to add hda2 back to md1, I get an invalid argument error:

Code: [Select]
[root@gluon]# mdadm -a /dev/md1 /dev/hda2
mdadm: hot add failed for /dev/hda2: Invalid argument


So... I tried removing the partition first:
Code: [Select]
[root@gluon]# mdadm /dev/md1 -r /dev/hda2 -a /dev/hda2
mdadm: hot remove failed for /dev/hda2: No such device or address


So now what?  Is the / partition on hda hosed?  How do I rebuild that?  I'm quickly getting out of my depth here...
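For what it's worth, when mdadm reports a bare "Invalid argument" on a hot add, the kernel log usually carries the real reason. A quick, read-only way to look (log locations are the usual defaults):
Code: [Select]
dmesg | tail -20                          # most recent kernel messages
grep 'md:' /var/log/messages | tail -20   # md-related lines from syslog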
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 04:22:48 AM
FWIW
Code: [Select]
[root@gluon init.d]# mdadm -E /dev/hdb2
/dev/hdb2:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0a968a22:d1b0d2bd:ab248bae:ec482cc1
  Creation Time : Thu Jan 12 19:21:55 2006
     Raid Level : raid1
    Device Size : 155918784 (148.70 GiB 159.66 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

    Update Time : Mon Oct 16 18:27:38 2006
          State : clean
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
       Checksum : b0a4fa9a - correct
         Events : 0.12532934


      Number   Major   Minor   RaidDevice State
this     1       3       66        1      active sync   /dev/hdb2
   0     0       0        0        0      removed
   1     1       3       66        1      active sync   /dev/hdb2


Code: [Select]
[root@gluon init.d]# mdadm -E /dev/hda2
/dev/hda2:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 0a968a22:d1b0d2bd:ab248bae:ec482cc1
  Creation Time : Thu Jan 12 19:21:55 2006
     Raid Level : raid1
    Device Size : 155918784 (148.70 GiB 159.66 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1

    Update Time : Sun Oct 15 21:07:07 2006
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : b0a3ce33 - correct
         Events : 0.12532928


      Number   Major   Minor   RaidDevice State
this     0       3        2        0      active sync   /dev/hda2
   0     0       3        2        0      active sync   /dev/hda2
   1     1       3       66        1      active sync   /dev/hdb2
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 04:40:40 AM
Last post tonight...

Reading this http://www.linuxquestions.org/questions/showthread.php?t=429857

It suggests running mdadm -C if all else fails... So I did:
Code: [Select]
[root@gluon init.d]# mdadm -C /dev/md1 -l1 -n2 /dev/hda2 /dev/hdb2
mdadm: /dev/hda2 appears to contain an ext2fs file system
    size=155918784K  mtime=Mon Oct 16 18:27:39 2006
mdadm: /dev/hda2 appears to be part of a raid array:
    level=1 devices=2 ctime=Thu Jan 12 19:21:55 2006
mdadm: /dev/hdb2 appears to contain an ext2fs file system
    size=155918784K  mtime=Sun Oct 15 20:33:12 2006
mdadm: /dev/hdb2 appears to be part of a raid array:
    level=1 devices=2 ctime=Thu Jan 12 19:21:55 2006
Continue creating array?


And I chickened out, afraid of wiping the contents of the surviving partition.  Does anyone know what would happen if I chose to continue?

Thanks for your patience
Title: RAID issue - need help recovering degraded array
Post by: crazybob on October 17, 2006, 02:17:33 PM
When I had a drive with a failed section in the RAID, I removed the problem drive and ran a program called HDD Regenerator (http://www.dposoft.net/) on it. When I put the drive back in, it was detected as a new drive, and the RAID was rebuilt without issue. You could run the program with the drive in place, depending on how long you can afford to go without the server being available. HDD Regenerator can take quite a while depending on drive size.
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 10:26:29 PM
Couple of things...

The drives are OK, since the other partitions on hda are working, so it is just a bad partition (hda2) that is attached to the md1 mirror set.

I have no idea which is hda and which is hdb in my system, so wouldn't know which to unplug.

Is the best course to stop the mirroring, make the hdb disk the primary, reformat hda, then add it back to the mirror?  If this is the case, can anyone point me in the right direction?

Thanks.
Title: RAID issue - need help recovering degraded array
Post by: ldkeen on October 17, 2006, 11:02:14 PM
jahlewis,
Can you post the partition info from /dev/hda using
Code: [Select]
# fdisk /dev/hda
followed by "p" to print the partition info.
Quote from: "jahlewis"
I have no idea which is hda

Both your hard drives are on the same cable (which is highly discouraged). Most of the time hda is the drive at the end of the cable and hdb the one in the middle, but if you're unsure, check the jumper settings on both drives to make sure.
Code: [Select]
#mdadm -a /dev/md1 /dev/hda2
That should have done the trick. I'm trying to work out why you have 3 raid devices instead of 2. Are you running version 7.0? It looks like /dev/md2 must be your swap.
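If the straight hot add keeps failing, a commonly used sequence is to make sure the stale member is fully out of the array, wipe its old RAID superblock, and add it back. A sketch only, and only worth running once you're satisfied the disk itself is healthy:
Code: [Select]
mdadm /dev/md1 --fail /dev/hda2 --remove /dev/hda2   # may report "no such device" if it is already out
mdadm --zero-superblock /dev/hda2                    # clear the stale RAID metadata on the dropped partition
mdadm --add /dev/md1 /dev/hda2                       # re-add it; a resync from hdb2 should start
watch cat /proc/mdstat                               # follow the rebuild progress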
Lloyd
Title: RAID issue - need help recovering degraded array
Post by: raem on October 17, 2006, 11:35:52 PM
ldkeen & jahlewis

>  I'm trying to work out why you have 3 raid devices instead of 2.
> Are you running version 7.0?

Assuming SME 7 (as this is posted in the SME 7 forum), it looks like the server was upgraded from SME 6.x. The 3-partition format has been retained, as the upgrade process did not convert it.
It will NOT be possible to simply remove and replace a drive and have the system automatically rebuild the array using the admin console menu. That only works for new SME 7 installs (or new SME 7 installs plus a restore from 6.x), where there are 2 partitions.

You will have to rebuild the array manually; search the forums, as there have been a few good posts on this topic recently.
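For a whole-drive replacement on this 3-partition layout, the manual rebuild usually looks roughly like the following. A sketch only, assuming hdb is the surviving disk and hda is the blank replacement; double-check device names against your own mdstat and fdisk output before running anything:
Code: [Select]
sfdisk -d /dev/hdb | sfdisk /dev/hda   # copy the partition table from the good disk to the new one
mdadm --add /dev/md0 /dev/hda1         # /boot mirror
mdadm --add /dev/md1 /dev/hda2         # / (root) mirror
mdadm --add /dev/md2 /dev/hda3         # swap mirror
cat /proc/mdstat                       # watch the resync progress

# the bootloader may also need reinstalling on the replacement disk, e.g. from the grub shell:
#   grub> root (hd0,0)
#   grub> setup (hd0)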
Title: RAID issue - need help recovering degraded array
Post by: jahlewis on October 17, 2006, 11:48:33 PM
I'm pretty sure this was a clean install during the 7.0pre or beta series, then upgraded since.  I think they are on the same IDE cable, so thanks for that info, Ray.  Is hda usually the master, and hdb the slave? I did copy over a lot of stuff from a 6.0 server, so that may be where this info is from?

My question is (and I guess I'll have to look): how do I break the mirroring/RAID while specifying that hdb should be the master?

Yes, md0 is boot, md2 is swap and md1 is /

Code: [Select]
[root@gluon ~]# fdisk /dev/hda

The number of cylinders for this disk is set to 19457.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/hda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   fd  Linux raid autodetect
/dev/hda2              14       19424   155918857+  fd  Linux raid autodetect
/dev/hda3           19425       19457      265072+  fd  Linux raid autodetect


Also, FWIW, here is what the logs say during a boot:
Code: [Select]
Oct 17 06:32:35 gluon kernel: md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
Oct 17 06:32:35 gluon kernel: md: raid1 personality registered as nr 3
Oct 17 06:32:35 gluon kernel: md: Autodetecting RAID arrays.
Oct 17 06:32:35 gluon kernel: md: could not bd_claim hda2.
Oct 17 06:32:35 gluon kernel: md: autorun ...
Oct 17 06:32:35 gluon kernel: md: considering hdb3 ...
Oct 17 06:32:35 gluon kernel: md:  adding hdb3 ...
Oct 17 06:32:35 gluon kernel: md: hdb2 has different UUID to hdb3
Oct 17 06:32:35 gluon kernel: md: hdb1 has different UUID to hdb3
Oct 17 06:32:35 gluon kernel: md:  adding hda3 ...
Oct 17 06:32:35 gluon kernel: md: hda1 has different UUID to hdb3
Oct 17 06:32:35 gluon kernel: md: created md2
Oct 17 06:32:35 gluon kernel: md: bind<hda3>
Oct 17 06:32:35 gluon kernel: md: bind<hdb3>
Oct 17 06:32:35 gluon kernel: md: running: <hdb3><hda3>
Oct 17 06:32:35 gluon kernel: raid1: raid set md2 active with 2 out of 2 mirrors
Oct 17 06:32:35 gluon kernel: md: considering hdb2 ...
Oct 17 06:32:35 gluon kernel: md:  adding hdb2 ...
Oct 17 06:32:35 gluon kernel: md: hdb1 has different UUID to hdb2
Oct 17 06:32:35 gluon kernel: md: hda1 has different UUID to hdb2
Oct 17 06:32:35 gluon kernel: md: created md1
Oct 17 06:32:35 gluon kernel: md: bind<hdb2>
Oct 17 06:32:35 gluon kernel: md: running: <hdb2>
Oct 17 06:32:35 gluon kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 17 06:32:35 gluon kernel: md: considering hdb1 ...
Oct 17 06:32:35 gluon kernel: md:  adding hdb1 ...
Oct 17 06:32:35 gluon kernel: md:  adding hda1 ...
Oct 17 06:32:35 gluon kernel: md: created md0
Oct 17 06:32:35 gluon kernel: md: bind<hda1>
Oct 17 06:32:36 gluon kernel: md: bind<hdb1>
Oct 17 06:32:36 gluon kernel: md: running: <hdb1><hda1>
Oct 17 06:32:36 gluon kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Oct 17 06:32:36 gluon kernel: md: ... autorun DONE.
Oct 17 06:32:36 gluon kernel: EXT3 FS on md0, internal journal
Oct 17 06:32:36 gluon kernel: Adding 264952k swap on /dev/md2.  Priority:-1 extents:1


Thanks guys...
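For reference, the interesting line in that log is "md: could not bd_claim hda2.": the kernel declined to grab the partition because something else appeared to be holding it, which is why md1 assembled with hdb2 alone. A few read-only checks that can narrow down what is holding it (a sketch; output will vary):
Code: [Select]
cat /proc/mdstat            # is hda2 already bound to another md device?
mount | grep hda2           # is it mounted directly somewhere?
swapon -s                   # is it in use as swap?
mdadm --examine /dev/hda2   # what does its own RAID superblock say?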
Title: RAID issue - need help recovering degraded array
Post by: raem on October 18, 2006, 12:21:44 AM
jahlewis,

> Lloyd wrote:
> mdadm -a /dev/md1 /dev/hda2
> That should have done the trick.

That looks appropriate, also see
man mdadm


Here's a good thread, see the post by Stefano
http://forums.contribs.org/index.php?topic=32572.msg138217#msg138217
Title: RAID issue - need help recovering degraded array
Post by: cheezeweeze on November 19, 2006, 05:19:13 PM
> Lloyd wrote:
> mdadm -a /dev/md1 /dev/hda2
> That should have done the trick.

Try this:
mdadm --add /dev/md1 /dev/hda2
Title: RAID issue - need help recovering degraded array
Post by: CharlieBrady on November 19, 2006, 07:10:31 PM
Quote from: "cheezeweeze"
> Lloyd wrote:
> mdadm -a /dev/md1 /dev/hda2
> That should have done the trick.

Try this:
mdadm --add /dev/md1 /dev/hda2


You should only do that if you are certain that the drive is good (and if so, why was it tossed out of the RAID array?) or if you don't care all that much about your data.
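One practical way to act on that advice is to check the disk's SMART status before re-adding it; a sketch, assuming smartmontools is available (it may need to be installed first):
Code: [Select]
smartctl -H /dev/hda            # overall health verdict
smartctl -l error /dev/hda      # any logged ATA errors
smartctl -t short /dev/hda      # start a short self-test, then a few minutes later:
smartctl -l selftest /dev/hda   # read the self-test results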
Title: RAID issue - need help recovering degraded array
Post by: mike_mattos on November 23, 2006, 11:12:40 PM
Given that vendors have problems deciding if the first drive is 0, 1, or A,
and that sometimes C may be the original drive and D the one added later
(even if D is Primary on Primary )

Is there a way to poll SME for the drive serial number?

Really helps when using Ghost to see the drive info!

Mike
Title: RAID issue - need help recovering degraded array
Post by: CharlieBrady on November 23, 2006, 11:33:45 PM
Quote from: "mike_mattos"
Given that vendors have problems deciding if the first drive is 0, 1, or A,
and that sometimes C may be the original drive and D the one added later
(even if D is Primary on Primary )


Linux doesn't use drive letters A, C, or D, and drives are identified unambiguously by primary/secondary/master/slave. Ask Google for details.
Title: RAID issue - need help recovering degraded array
Post by: mike_mattos on November 27, 2006, 08:37:33 PM
SCSI and SATA drives are harder to identify: imagine 7 identical drives on a cable where the only difference is a hidden jumper, or 6 red SATA cables neatly bundled with cable ties!

Having the drive serial number allows a printout of diagnostics and after-the-fact confirmation that the drive being replaced is actually the drive you intended, and that a brain cramp didn't lead to tracing the wrong cable or enumerating the ID jumpers in the wrong direction!

So I ask again, can you query the drive serial number on SME?
Title: RAID issue - need help recovering degraded array
Post by: CharlieBrady on November 27, 2006, 09:18:27 PM
Quote from: "mike_mattos"

So I ask again, can you query the drive serial number on SME?


The question is really "can Linux query the drive serial number?" Yes - use the sdparm command (SATA and SCSI) or the hdparm command (ATA). Note however that sdparm is not installed by default. You can find a suitable RPM here:

http://dries.ulyssis.org/rpm/packages/sdparm/info.html
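A couple of one-liners along those lines; the hdparm form matches the output shown in the next post, while the exact sdparm invocation is an assumption to verify against man sdparm once the RPM is installed:
Code: [Select]
hdparm -I /dev/hda | grep -i serial     # ATA/IDE disks
sdparm --inquiry --page=sn /dev/sda     # SCSI/SATA disks: unit serial number VPD page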
Title: RAID issue - need help recovering degraded array
Post by: mike_mattos on November 29, 2006, 06:47:17 PM
Charlie, sdparm didn't work (not found error) on either of my SME test boxes, 6 or 7, and I wasn't sure which Red Hat version to download.

As you said, hdparm won't work on the SME6 box (SCSI rather than ATA).

However, on SME7: voila!  Thanks.

Now if only it would automatically decode the RAID virtual drive!
You wouldn't happen to know if SMART is working on SME, would you?
( As in, will I get an email for a SMART error BEFORE the raid crashes?)

Code: [Select]
# hdparm -I /dev/sda1

/dev/sda1:

ATA device, with non-removable media
        Model Number:       WDC WD800JD-22MSA1
        Serial Number:      WD-WMAM9Z678073
        Firmware Revision:  10.01E01
Standards:
        Supported: 7 6 5 4
        Likely used: 7
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  156301488
        LBA48  user addressable sectors:  156301488
        device size with M = 1024*1024:       76319 MBytes
        device size with M = 1000*1000:       80026 MBytes (80 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    NOP cmd
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
           *    SMART feature set
           *    FLUSH CACHE EXT command
           *    Mandatory FLUSH CACHE command
           *    Device Configuration Overlay feature set
           *    48-bit Address feature set
                Automatic Acoustic Management feature set
                SET MAX security extension
           *    DOWNLOAD MICROCODE cmd
           *    General Purpose Logging feature set
           *    SMART self-test
           *    SMART error logging
Checksum: correct
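On the SMART question above: smartmontools' smartd can watch drives and mail on impending failure, though whether SME 7 ships or enables it out of the box is something to verify. A minimal configuration sketch (standard smartmontools syntax; example address only):
Code: [Select]
# /etc/smartd.conf
/dev/sda -a -m admin@example.com -M test   # monitor all attributes, mail alerts, send one test mail at startup
                                           # older libata setups may additionally need '-d ata'

service smartd restart                     # pick up the new config (assumes the smartd init script is present)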
Title: RAID issue - need help recovering degraded array
Post by: Boris on December 28, 2006, 07:49:58 PM
Quote from: "RayMitchell"
ldkeen & jahlewis

>  I'm trying to work out why you have 3 raid devices instead of 2.
> Are you running version 7.0?

Assuming SME 7 (as this is posted in the SME 7 forum), it looks like the server was upgraded from SME 6.x. The 3-partition format has been retained, as the upgrade process did not convert it.
It will NOT be possible to simply remove and replace a drive and have the system automatically rebuild the array using the admin console menu. That only works for new SME 7 installs (or new SME 7 installs plus a restore from 6.x), where there are 2 partitions.

You will have to rebuild the array manually; search the forums, as there have been a few good posts on this topic recently.


I just ran into exactly the same problem while upgrading from 6.0.1 to SME7.

Any suggestions on rebuilding the RAID?
Title: RAID issue - need help recovering degraded array
Post by: kruhm on December 31, 2006, 02:53:20 PM
Hi Boris,

I had to do a fresh install, then use the copyfromdisk command. Allow 2 hours of downtime.
Title: RAID issue - need help recovering degraded array
Post by: Boris on January 02, 2007, 07:19:53 PM
That's what I figured would be my best bet as well.
Fortunately this server is not critical for operations and doesn't have vital data, so it's easy to reinstall (now to version 7.1).

Thanks.
Boris.