Koozali.org: home of the SME Server

RAID error appearing in email...

Offline ashpenaz

RAID error appearing in email...
« on: October 07, 2007, 08:36:25 PM »
The following message appears in my email only when I run the upgrades listed in Server-Manager:
A DegradedArray event has been detected on md device /dev/md2

I have been searching the forums and bug tracker, and though I have seen some similar situations, I am still at a loss as to how to proceed. I have two Maxtor 160 GB drives (ATA100), one on the primary IDE channel and one on the secondary. I am running SME 7.2.

Based on what I read in the forums, I ran the following commands and got the output shown below:

mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90.01
  Creation Time : Wed Jun 27 14:03:34 2007

     Raid Level : raid1
     Array Size : 156183808 (148.95 GiB 159.93 GB)
    Device Size : 156183808 (148.95 GiB 159.93 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Sun Oct  7 12:34:49 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 556a3837:b463294d:3d9ce593:24ae7fcc
         Events : 0.3245340

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1      22        2        1      active sync   /dev/hdc2
mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Wed Jun 27 14:03:34 2007
     Raid Level : raid1
     Array Size : 104320 (101.89 MiB 106.82 MB)
    Device Size : 104320 (101.89 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sun Oct  7 12:01:28 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : f7fac4c3:0f05ec1d:2eb557c0:2cb5965e
         Events : 0.1560

    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1      22        1        1      active sync   /dev/hdc1
fdisk -l

Disk /dev/hda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   fd  Linux raid autodetect
/dev/hda2              14       19457   156183930   fd  Linux raid autodetect

Disk /dev/hdc: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           1          13      104391   fd  Linux raid autodetect
/dev/hdc2              14       19457   156183930   fd  Linux raid autodetect

Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md2: 159.9 GB, 159932219392 bytes
2 heads, 4 sectors/track, 39045952 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md2 doesn't contain a valid partition table

Disk /dev/dm-0: 158.3 GB, 158309810176 bytes
2 heads, 4 sectors/track, 38649856 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/dm-1: 1577 MB, 1577058304 bytes
2 heads, 4 sectors/track, 385024 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/dm-1 doesn't contain a valid partition table

cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdc2[1]
      156183808 blocks [2/1] [_U]

md1 : active raid1 hda1[0] hdc1[1]
      104320 blocks [2/2] [UU]

unused devices: <none>

Server-Manager "Manage Disk Redundancy" shows the following:
Current RAID Status

Personalities : [raid1]
md2 : active raid1 hdc2[1]
      156183808 blocks [2/1] [_U]

md1 : active raid1 hda1[0] hdc1[1]
      104320 blocks [2/2] [UU]
unused devices: <none>
Only some of the RAID devices are unclean.
Manual intervention may be required.

Any help is appreciated. Thanks in advance.
SME 7.4
RAID 1

Offline NickR

Re: RAID error appearing in email...
« Reply #1 on: October 07, 2007, 11:01:24 PM »
You need to add the removed partition /dev/hda2 back into md2:


#mdadm /dev/md2 -a /dev/hda2

Then running

#mdadm --detail --verbose /dev/md2

should show two active disks and a re-mirror in operation.
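
If you want to keep an eye on the re-mirror while it runs, something like the following should show the progress (just a suggestion, using the same device names as above):

#cat /proc/mdstat
#mdadm --detail /dev/md2 | grep -E 'State|Rebuild'

While the rebuild is running, /proc/mdstat shows a progress bar and mdadm reports a "Rebuild Status" percentage; once it completes, the state returns to clean and both members show as active sync.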
--
Nick......

Offline ashpenaz

Re: RAID error appearing in email...
« Reply #2 on: October 07, 2007, 11:50:32 PM »
Thank you Nick. That worked.

I had tried several things I had seen in the forums, but none of them worked. The array is now rebuilding.
SME 7.4
RAID 1

Offline pfloor

Re: RAID error appearing in email...
« Reply #3 on: October 08, 2007, 07:12:41 AM »
You may also want to keep a very close eye on hda.  If it falls out of sync again, replace the disk ASAP!!!
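
One way to keep that eye on it - a rough sketch only, and smartmontools may need to be installed first - is to ask the drive for its SMART status:

#smartctl -H /dev/hda
#smartctl -A /dev/hda

A FAILED overall health result, or climbing Reallocated_Sector_Ct / Current_Pending_Sector counts in the attribute list, is good supporting evidence that the drive itself is on its way out.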
In life, you must either "Push, Pull or Get out of the way!"

Offline NickR

Re: RAID error appearing in email...
« Reply #4 on: October 08, 2007, 12:29:10 PM »
@ashpenaz:

Glad to be of help.

@pfloor:

IME, it's rarely the disk itself that causes this problem, it's the controller.  More accurately, it's putting the disks on different channels.  I (now) always put the disks on the primary controller as master & slave and although it doesn't always work, it does seem to reduce the number of times that spurious RAID problems occur.  Moving to SATA disks seems to be a good move for RAID stability.
--
Nick......

Offline warren

Re: RAID error appearing in email...
« Reply #5 on: October 08, 2007, 01:27:27 PM »
Quote from: NickR
@ashpenaz:

Glad to be of help.

@pfloor:

IME, it's rarely the disk itself that causes this problem, it's the controller.  More accurately, it's putting the disks on different channels.  I (now) always put the disks on the primary controller as master & slave and although it doesn't always work, it does seem to reduce the number of times that spurious RAID problems occur.  Moving to SATA disks seems to be a good move for RAID stability.

NickR,
Surely putting the RAID1 disks on the same controller channel (primary: IDE master = disk 1, IDE slave = disk 2) is more risky! If the controller goes belly up you're left with NO RAID. The advice that's been given before (http://wiki.contribs.org/Raid) regarding RAID1 is to have one disk on the primary master and the other disk on the secondary master, isn't it?

Offline NickR

Re: RAID error appearing in email...
« Reply #6 on: October 08, 2007, 02:51:26 PM »
Quote from: warren
NickR,
Surely putting the RAID1 disks on the same controller channel (primary: IDE master = disk 1, IDE slave = disk 2) is more risky! If the controller goes belly up you're left with NO RAID. The advice that's been given before (http://wiki.contribs.org/Raid) regarding RAID1 is to have one disk on the primary master and the other disk on the secondary master, isn't it?

Notwithstanding that advice, I have installed many tens of SME / E-Smith machines over the years (all of them RAID1 and mostly using IDE drives) and I have never experienced a controller failure - maybe I'm just incredibly lucky!  I've even got one machine that is 12 years old, running Smoothwall on a disk that is at least 10 years old - it's been up for 3 years.  I have only had one (genuine) disk failure but have seen many RAID sync problems.  All I can report from my own experience is that:

a) there is no real-world performance difference between using the same channel & separate channels
b) some chipsets exhibit RAID sync problems when the disks are on separate channels
c) the examples in (b) have often (but not always) been cured by putting the disks on the same channel.

I am merely reporting my own experiences (hence the IME preface to my comment).  Others can ignore me if they wish, but an alternative view can sometimes be valuable.  All I was trying to say is that I wouldn't suspect the disk first without supporting evidence (like seek errors in the messages log).
--
Nick......

Offline Elliott

Re: RAID error appearing in email...
« Reply #7 on: October 08, 2007, 10:18:20 PM »
Quote from: NickR
You need to add the removed partition /dev/hda2 back into md2.

Could you possibly explain what led you to this conclusion? I get the mdadm errors on every reboot of one of my servers, and when I look at the output of all of the commands above I can't see how you worked out what's wrong. I have a similar setup with similar errors, but I don't see what conclusively tells me which device is the problematic one.

My messages are:


Email
This is an automatically generated mail message from mdadm running on mail.dynamictrend.com.

A DegradedArray event has been detected on md device /dev/md2.

Email2
This is an automatically generated mail message from mdadm running on mail.dynamictrend.com.

A DegradedArray event has been detected on md device /dev/md1.



[root@mail ~]# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Wed Jan  3 10:55:40 2007
     Raid Level : raid1
     Array Size : 104320 (101.89 MiB 106.82 MB)
    Device Size : 104320 (101.89 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Oct  8 14:11:12 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : f554e523:e5d6c5df:d65d57fc:d732c08e
         Events : 0.2528

    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1       0        0        -      removed


[root@mail ~]# mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90.01
  Creation Time : Wed Jan  3 10:54:59 2007
     Raid Level : raid1
     Array Size : 78043648 (74.43 GiB 79.92 GB)
    Device Size : 78043648 (74.43 GiB 79.92 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct  8 15:56:20 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : fdd8ea67:b5564767:410e5042:29d225ba
         Events : 0.3819448

    Number   Major   Minor   RaidDevice State
       0       3        2        0      active sync   /dev/hda2
       1       0        0        -      removed

[root@mail ~]# fdisk -l

Disk /dev/hda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   fd  Linux raid autodetect
/dev/hda2              14        9729    78043770   fd  Linux raid autodetect

Disk /dev/hdc: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           1          13      104384+  fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/hdc2              13        9729    78043807   fd  Linux raid autodetect

Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md2: 79.9 GB, 79916695552 bytes
2 heads, 4 sectors/track, 19510912 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md2 doesn't contain a valid partition table

Disk /dev/dm-0: 77.7 GB, 77779173376 bytes
2 heads, 4 sectors/track, 18989056 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/dm-1: 2080 MB, 2080374784 bytes
2 heads, 4 sectors/track, 507904 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/dm-1 doesn't contain a valid partition table


[root@mail ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0]
      78043648 blocks [2/1] [U_]

md1 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]

unused devices: <none>
Elliott

Offline NickR

Re: RAID error appearing in email...
« Reply #8 on: October 09, 2007, 12:17:08 AM »
Quote from: Elliott
Could you possibly explain what led you to this conclusion? I get the mdadm errors on every reboot of one of my servers, and when I look at the output of all of the commands above I can't see how you worked out what's wrong. I have a similar setup with similar errors, but I don't see what conclusively tells me which device is the problematic one.

I'll do my best  8)

I'll cut your post down to the salient parts for this exercise:

Quote
/dev/md1:
    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1       0        0        -      removed


/dev/md2:
    Number   Major   Minor   RaidDevice State
       0       3        2        0      active sync   /dev/hda2
       1       0        0        -      removed

From the fdisk output below, we know that there are two IDE disks present: /dev/hda and /dev/hdc.

You can see above that only the /dev/hda partitions are still listed as active, and that the other member of each array has been removed - in other words, /dev/hdc1 and /dev/hdc2 have both dropped out of md1 and md2.  That tells us that the problem lies with /dev/hdc.
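
If you want a quick way to pull out just those member-state lines for each array, something along these lines (a sketch only, using the device names above) will do it:

#mdadm --detail /dev/md1 | grep -E 'State :|removed|active sync'
#mdadm --detail /dev/md2 | grep -E 'State :|removed|active sync'

Any array that prints a "removed" line is missing a member, and whichever partitions are not listed as "active sync" point at the suspect disk.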

Quote

[root@mail ~]# fdisk -l

Disk /dev/hda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   fd  Linux raid autodetect
/dev/hda2              14        9729    78043770   fd  Linux raid autodetect

Disk /dev/hdc: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           1          13      104384+  fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/hdc2              13        9729    78043807   fd  Linux raid autodetect

Although the disks both have identical sizes and the heads, sectors & cylinders match, the second (/dev/hdc) partition table has a problem - it should exactly match that of /dev/hda, but the block counts differ (104384+ against 104391 for the first partition, and fdisk also warns that /dev/hdc1 does not end on a cylinder boundary).  This is probably why the disk was removed from the array, as one of the prime requirements is an identical number of blocks on the mirrored partitions.

Quote

[root@mail ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0]
      78043648 blocks [2/1] [U_]

md1 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]

unused devices: <none>


This is just confirming what mdadm is telling us: namely, that /dev/hdc1 & 2 are missing from the array.
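
For completeness, the /proc/mdstat notation above says the same thing once you know how to read it:

md1 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]

Here [2/1] means two members are configured but only one is active, and [U_] means slot 0 (hda1) is up while slot 1 (where hdc1 should be) is missing.  A healthy array reads [2/2] [UU].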

To fix this particular case, you will need to blow away the partition table on /dev/hdc and then re-create it to mirror that on /dev/hda exactly.  Once that has been done, you will be able to manually add /dev/hdc partitions back into the arrays as described earlier in this thread.
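
Roughly speaking, that repair could look something like the sketch below - treat it as an outline only, double-check the device names first, the temporary file name is arbitrary, and bear in mind that writing a new partition table destroys anything still on /dev/hdc:

#sfdisk -d /dev/hda > /tmp/hda-table
#sfdisk /dev/hdc < /tmp/hda-table
#mdadm /dev/md1 -a /dev/hdc1
#mdadm /dev/md2 -a /dev/hdc2
#cat /proc/mdstat

The first command dumps hda's partition table to a file and the second writes an identical table onto hdc (older versions of sfdisk may need --force if they object to the existing layout).  If mdadm complains about a stale superblock on the re-created partitions, running mdadm --zero-superblock /dev/hdc1 (and likewise for /dev/hdc2) before the -a commands should clear it.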

HTH
--
Nick......