Koozali.org: home of the SME Server
Obsolete Releases => SME Server 7.x => Topic started by: wjhobbs on February 05, 2007, 08:19:35 PM
-
Almost exactly a month ago I received the following email message to admin:
This is an automatically generated mail message from mdadm running on chryxus.primary.chryxus.ca.
A DegradedArray event has been detected on md device /dev/md2.
I presumed the drive had gone bad and replaced it.
On the drive that I removed, I performed a full write-read test on the entire extent of the drive -- with no errors. That drive seems fine and is now in use on a test server.
Now, today, I have received exactly the same message, again flagging /dev/md2 (which now contains the brand-new drive) as the source of the problem.
I am beginning to suspect that the physical drive may not be the issue.
Does anyone have any comments/suggestions?
Thanks.
John
-
I have a comment:
A degraded array can be caused by a server crash, a power loss, or a reset. This happens when one drive falls behind; the RAID then rebuilds the array and you're set.
It happened to me when I had to hard-reboot my server after a kernel panic.
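If the rebuild doesn't start on its own you can kick it off by hand; a minimal sketch, assuming the mirror is md2 and the dropped partition is hdc2 (substitute your own device names):

mdadm /dev/md2 --add /dev/hdc2 (re-adds the dropped partition; the resync starts automatically)
cat /proc/mdstat (shows the rebuild progress)

If I remember right, the SME 7 admin console also has a "Manage disk redundancy" option that does the same thing.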
-
Maybe I can help; I recovered 800 GB of my irreplaceable data after a RAID 5 failure with two drives down.
Log on to the console and give me the output of:
cat /proc/mdstat
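For reference, a degraded two-disk mirror shows up in /proc/mdstat roughly like this (device names and block counts illustrative, not your actual output):

Personalities : [raid1]
md2 : active raid1 hda2[0]
      292945152 blocks [2/1] [U_]
unused devices: <none>

The [2/1] and [U_] mean only one of the two members is active; a healthy mirror shows [2/2] and [UU].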
-
Gert,
Thanks for your response.
Too late for anything useful. I am in the process of rebuilding the array and mdstat just shows the resync in progress.
What I am wondering is if anyone suspects a potential problem elsewhere, like the second IDE controller or something else.
John
-
Doesn't matter if it is rebuilding; I wanted to see what your RAID configuration looks like. It is definitely possible that it is the controller. In my case it was, and as a result it took out both drives connected to my primary controller. But the array only became degraded after a reboot, and then it would not boot, so I could not recreate the array. Even the rescue option from the SME CD did not detect the installation. I had to use RIP and Knoppix to save my data. I ended up with a corrupt ext3 file system within a logical volume, within a lost volume group, within a degraded (failed) RAID 5 array.
Are you using RAID 1 or RAID 5? IDE, SCSI, or SATA?
-
"like the second IDE controller"
OK, IDE. Which device failed?
"second IDE controller"
I suppose it would then be hdc or hde.
While it is rebuilding, do:
mdadm --examine /dev/hdX2 (where X is the failed drive)
and see if the checksum fails. If it does, your secondary controller is faulty and your array will become degraded again after you reboot.
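For comparison, a bad superblock prints something like this on the checksum line (values illustrative):

Checksum : 54631d6f - expected 1a2b3c4d

instead of "- correct".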
-
Thanks Gert,
Sorry I didn't get to your post until the rebuild had completed. The results of --examine are:
[root@chryxus ~]# mdadm --examine /dev/hdc2
/dev/hdc2:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : a7d745b4:eba3a1a5:78728991:024fc0de
  Creation Time : Sun Jan  7 12:07:23 2007
     Raid Level : raid1
    Device Size : 292945152 (279.37 GiB 299.98 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2

    Update Time : Tue Feb  6 17:47:38 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 54631d6f - correct
         Events : 0.866385

      Number   Major   Minor   RaidDevice State
this     1      22        2        1      active sync   /dev/hdc2

   0     0        3        2        0      active sync   /dev/hda2
   1     1       22        2        1      active sync   /dev/hdc2
Checksum is OK at this point.
John
-
OK, if your drive fails again, take a look at your checksum.
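In the meantime it would not hurt to watch for IDE errors on that channel; a rough sketch, assuming the suspect disk is hdc and the smartmontools package is installed (adjust the names to your setup):

grep -i hdc /var/log/messages (look for DMA timeouts or bus resets on the secondary channel)
smartctl -a /dev/hdc (the drive's own SMART error counters)

If the log shows repeated DMA errors on hdc while SMART stays clean, that points at the controller or cable rather than the disk.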