Drive failure - bad drive (or not)?

wjhobbs

« on: February 05, 2007, 08:19:35 PM »

Almost exactly a month ago I received the following email message to admin:

Quote

This is an automatically generated mail message from mdadm running on chryxus.primary.chryxus.ca.

A DegradedArray event has been detected on md device /dev/md2.

I presumed the drive had gone bad and replaced the physical drive.

On the drive that I removed, I performed a full write-read test on the entire extent of the drive -- with no errors. That drive seems fine and is now in use on a test server.

Now, today, I have received exactly the same message, again naming /dev/md2 (the array that now contains the brand-new drive) as the source of the problem.

I am beginning to suspect that the physical drive may not be the issue.

Does anyone have any comments/suggestions?

Thanks.

John

bpivk

« Reply #1 on: February 05, 2007, 08:33:42 PM »

I have a comment:
A degraded array can be caused by a server crash, power loss or a reset. This happens if one member of the array falls behind the other. RAID then rebuilds the array and you're set.

It happened to me when I had to hard-reboot my server after a kernel panic.

Gert

« Reply #2 on: February 05, 2007, 08:50:46 PM »

Maybe I can help. I had to recover 800 GB of irreplaceable data after a RAID 5 failure with two drives down.

Log on to the console and give me the output of:

cat /proc/mdstat
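
What to look for in that output: each array has a status like [2/2] [UU] when both members are present, and something like [2/1] [U_] when one is missing. If you would rather check this from a script, here is a minimal Python sketch. The function name and the sample text are made up for illustration (the sample is not output from this server), so treat it as a starting point only.

```python
import re

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line like "md2 : active raid1 ...".
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
            continue
        # The status line carries "[expected/active] [U_...]".
        s = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if s and current:
            if "_" in s.group(3):  # "_" marks a missing member
                degraded.append(current)
            current = None
    return degraded

# Invented sample in the usual /proc/mdstat layout (assumption, not real data).
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 hda1[0] hdc1[1]
      104320 blocks [2/2] [UU]

md2 : active raid1 hda2[0]
      292945152 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live system you would pass in the real file instead of the sample, for example degraded_arrays(open("/proc/mdstat").read()).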

wjhobbs

« Reply #3 on: February 05, 2007, 09:09:33 PM »

Gert,

Thanks for your response.

Too late for anything useful. I am in the process of rebuilding the array and mdstat just shows the resync in progress.

What I am wondering is if anyone suspects a potential problem elsewhere, like the second IDE controller or something else.

John

Gert

« Reply #4 on: February 05, 2007, 09:26:47 PM »

It doesn't matter if it is rebuilding; I wanted to see what your RAID configuration looks like. It is definitely possible that it is the controller. In my case it was, and as a result it took out both drives connected to my primary controller. But the array only became degraded after a reboot, and then it would not boot, so I could not recreate the array. Even the rescue option from the SME CD did not detect the installation. I had to use RIP and Knoppix to save my data. I ended up with a corrupt ext3 file system within a logical volume, within a lost logical volume group, within a degraded (failed) RAID 5 array.

Are you using raid 1 or 5? IDE, SCSI or SATA?

Gert

« Reply #5 on: February 05, 2007, 09:53:17 PM »

Quote

like the second IDE controller

OK, IDE. Which device failed?

Quote

second IDE controller

I suppose it would then be hdc or hde.

While it is rebuilding, run:

mdadm --examine /dev/hdX2 (where X is the failed drive)

and see if the checksum fails. If it does, your secondary controller is faulty and your array will be degraded again after you reboot.
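
The line to look for in the --examine output starts with "Checksum :" and, for a good superblock, ends with "- correct". If you want to check it automatically, here is a rough Python sketch that pulls that line out of saved output. The function names and the sample values are invented for illustration, and the "expected" wording for a bad checksum is my assumption about what mdadm prints, so verify it against your own output.

```python
import re

def checksum_status(examine_text):
    """Return (hex_value, verdict) from the 'Checksum :' line, or None if absent."""
    m = re.search(
        r"^\s*Checksum\s*:\s*([0-9a-fA-F]+)\s*-\s*(\S.*?)\s*$",
        examine_text,
        re.MULTILINE,
    )
    if m is None:
        return None
    return m.group(1), m.group(2)

def checksum_ok(examine_text):
    """True only when the verdict after the dash is exactly 'correct'."""
    status = checksum_status(examine_text)
    return status is not None and status[1] == "correct"

if __name__ == "__main__":
    # Invented sample lines (assumption), shaped like mdadm --examine output.
    good = "       Checksum : 1a2b3c4d - correct\n"
    bad = "       Checksum : 1a2b3c4d - expected 1a2b3c4e\n"
    print(checksum_ok(good), checksum_ok(bad))
```

To use it, save the output first (for example mdadm --examine /dev/hdX2 > examine.txt) and pass the file contents to checksum_ok.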

wjhobbs

« Reply #6 on: February 06, 2007, 11:54:43 PM »

Thanks Gert,

Sorry I didn't get to your post until the rebuild had completed. The results of --examine are:

[root@chryxus ~]# mdadm --examine /dev/hdc2
/dev/hdc2:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : a7d745b4:eba3a1a5:78728991:024fc0de
  Creation Time : Sun Jan  7 12:07:23 2007
     Raid Level : raid1
    Device Size : 292945152 (279.37 GiB 299.98 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2

    Update Time : Tue Feb  6 17:47:38 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 54631d6f - correct
         Events : 0.866385


      Number   Major   Minor   RaidDevice State
this     1      22        2        1      active sync   /dev/hdc2
   0     0       3        2        0      active sync   /dev/hda2
   1     1      22        2        1      active sync   /dev/hdc2

Checksum is OK at this point.
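
For anyone who wants to run the same sanity check after each rebuild, here is a minimal Python sketch that reads the "name : value" lines from the output above and confirms the state is clean, no devices have failed, and every RAID device is active. The function names are my own, and the health test is only my reading of these fields, not an official mdadm check.

```python
def parse_examine(text):
    """Collect 'Name : value' lines from mdadm --examine output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first " : " so times like 12:07:23 stay intact.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def looks_healthy(fields):
    """Assumed criteria: clean state, no failed devices, all members active."""
    return (
        fields.get("State") == "clean"
        and fields.get("Failed Devices") == "0"
        and fields.get("Active Devices") is not None
        and fields.get("Active Devices") == fields.get("Raid Devices")
    )

# A subset of the output posted above.
POSTED = """\
/dev/hdc2:
     Raid Level : raid1
   Raid Devices : 2
  Total Devices : 2
    Update Time : Tue Feb  6 17:47:38 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
       Checksum : 54631d6f - correct
"""

if __name__ == "__main__":
    info = parse_examine(POSTED)
    print(info["Raid Level"], looks_healthy(info))
```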

John

Gert

« Reply #7 on: February 07, 2007, 06:42:32 AM »

OK, if your drive drops out of the array again, take a look at the checksum.
