Koozali.org: home of the SME Server

Obsolete Releases => SME Server 7.x => Topic started by: mike_mattos on November 04, 2010, 05:26:15 PM

Title: raid issue
Post by: mike_mattos on November 04, 2010, 05:26:15 PM: today I received this in an admin email

"This is an automatically generated mail message from mdadm running on xxx.xxx.com.

A Fail event has been detected on md device /dev/md2."

So I ran cat /proc/mdstat

Personalities : [raid1]
md2 : active raid1 sda2[2](F) sdb2[1]
488279488 blocks [2/1] [_U]

md1 : active raid1 sda1[0] sdb1[1]
104320 blocks [2/2] [UU]

It appears sda2[2] has failed. BUT smartctl says both drives PASSED and auto rebuild isn't running.
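
For anyone reading this output later: in /proc/mdstat, [2/1] means the array expects 2 members and has 1 active, the underscore in [_U] shows which slot is missing, and (F) marks a member the md layer has flagged as failed. Below is a minimal sketch of a check for degraded arrays. The function name degraded_arrays is made up for illustration, and the parsing assumes the usual mdstat layout shown above.

```shell
# Print the name of every md array whose member-status field (e.g. [_U])
# contains an underscore, meaning at least one member is missing or failed.
# Reads /proc/mdstat-style text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember which array is being described
        /blocks/ && $NF ~ /_/ { print name }   # a status such as [_U] means degraded
    '
}

# Typical use on a live system:
#   degraded_arrays < /proc/mdstat
```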

The raid log shows no other messages.

How do I force an auto rebuild?
Title: Re: raid issue
Post by: Stefano on November 04, 2010, 05:39:30 PM: please search the forums, your question has been answered many times
thank you
Title: Re: raid issue
Post by: mike_mattos on November 04, 2010, 06:15:41 PM: stefano, the search 'auto rebuild raid' failed to yield anything useful. I've never had to do a manual rebuild on SME7, in any case the manual rebuild does not seem appropriate til I understand why the automatic repair failed to work. And how to restart the automatic rebuild.

The manual pages were not much help. I did find a four year old bug saying in effect remove & re-install the drive, a bit extreme I thought
Title: Re: raid issue
Post by: Stefano on November 04, 2010, 06:49:26 PM: try this:
Code: [Select]
mdadm -r /dev/md2 /dev/sda2
mdadm -a /dev/md2 /dev/sda2
and watch log for errors
if rebuild process doesn't complete, trash your hd and replace it with a new one
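
Once the partition has been added back, the progress of the rebuild shows up in /proc/mdstat as a "recovery = ...%" line with an estimated finish time. Here is a small sketch for pulling out just that line, assuming the standard mdstat wording (the helper name rebuild_progress is hypothetical):

```shell
# Print the progress line(s) of any rebuild or resync currently running.
# Reads /proc/mdstat-style text on stdin; prints nothing when all arrays are idle.
rebuild_progress() {
    grep -E '(recovery|resync) *='
}

# Typical use on a live system:
#   rebuild_progress < /proc/mdstat
```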
Title: Re: raid issue
Post by: mike_mattos on November 04, 2010, 08:55:42 PM: apart from the estimated 2525 minutes to finish, it seems to be rebuilding properly!

But to the main concern, how do I tell why the rebuild did not start by itself? I've seen several raid events over the last 12 months that were related to network errors, but in every other case, the rebuild was well under way before I even read the admin email notice. And even those incidents beg the question, why would a bad network node trigger a raid event? But it was repeatable !

Anyway, thanks for the help, Stefano
Title: Re: raid issue
Post by: janet on November 04, 2010, 10:18:56 PM: mike_mattos

Please learn to read available documentation as much work has been done by kind hearted souls to make information available that answers questions like yours.

Please familiarize yourself with the Manual, Contribs, Howtos & FAQ pages so you at least have a minimal understanding of where to look and what is available. Thank you.

To answer your specific issue, please see the Howto page and read the RAID article, Howto link at top of forums.
Finding any and all information starts at the main Wiki page, linked at top of forum.

By the way, I am not aware of any RAID array auto rebuild functionality that you refer to (that repairs arrays where a drive partition has been kicked out), that is a figment of your imagination. There is a manual RAID rebuild function in the admin console to add a new (or clean used) drive. An array that has lost synchronisation (eg due to power failures) is automatically resynchronised. Synchronisation and rebuilding are different things.
Partitions that have been kicked out of an array need to be manually re-added. Refer to the RAID Howto on contribs.org. You should do a full test on your hard drives using the drive manufacturer's diagnostic software or smartctl.
Title: Re: raid issue
Post by: mike_mattos on November 05, 2010, 01:13:13 AM: http://wiki.contribs.org/Raid#Resynchronising_a_Failed_RAID DOES NOT MATCH the suggestion of Stefano, which does seem to have resolved the problem

I'm also not imagining that drives resync automatically, or that mdadm can run a script upon detecting an error
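
For reference, mdadm's monitor mode can call a program on each event (a PROGRAM line in mdadm.conf, or --program on the command line), passing the event name, the md device and, where relevant, the component device. Below is a minimal sketch of such a handler. It only reports the event and deliberately does not re-add anything. The function name and message wording are illustrative, and nothing here assumes that a stock SME 7 install has such a hook configured.

```shell
# Hypothetical handler for mdadm --monitor events.
# mdadm passes: $1 = event name, $2 = md device, $3 = component device (may be absent).
handle_md_event() {
    event=${1:-}
    md=${2:-}
    part=${3:-}
    msg="$event on $md"
    if [ -n "$part" ]; then
        msg="$msg (member $part)"
    fi
    case "$event" in
        # Events that mean the array has lost, or is running without, a member.
        Fail|FailSpare|DegradedArray) echo "ATTENTION: $msg" ;;
        # Everything else (RebuildStarted, RebuildFinished, ...) is informational.
        *)                            echo "info: $msg" ;;
    esac
}
```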

From SME documentation, replacing a drive is an automatic rebuild!

Adding another Hard Drive Later

ENSURE THAT THE NEW DRIVE IS THE SAME SIZE OR LARGER AS THE CURRENT DRIVE(S)

* Shut down the machine
* Install drive as master on the second IDE channel (hdc)
* Boot up
* Log on as admin to get to the admin console
* Go to #5 Manage disk redundancy

It should tell you there if the drives are syncing up. Don't turn off the server until the sync is complete or it will start from the beginning again. When it is done syncing it will show a good working raid1.
Title: Re: raid issue
Post by: janet on November 05, 2010, 01:26:54 AM: mike_mattos

Quote
I'm also not imagining that drives resync automatically, or that mdadm can run a script upon detecting an error
From SME documentation, replacing a drive is an automatic rebuild!
* Log on as admin to get to the admin console
* Go to #5 Manage disk redundancy

The resynchronisation process may happen unattended (automatically as the book calls it) after the appropriate commands have been issued by the menu selection process, but it will not start automatically as it does need manual instigation.
That's not automatic in my books !

There are no rebuild scripts that run automatically upon detection of a RAID array error, there is only a warning message issued that a RAID array has failed.
The "rebuild percentage complete" messages only happen after manual instigation of the rebuild process.
Title: Re: raid issue
Post by: janet on November 05, 2010, 01:42:08 AM: mike

Stefano's code
Code: [Select]
mdadm -r /dev/md2 /dev/sda2
mdadm -a /dev/md2 /dev/sda2
Wiki code
Code: [Select]
mdadm --remove /dev/md2 /dev/hda2
mdadm --add /dev/md2 /dev/hda2
They are the same, one uses abbreviated syntax, & of course there are different drive type/locations ie sda2 vs hda2

You state you are concerned about fixing the problem without understanding why the array failed.
Did you run full (long) diagnostic tests on BOTH drives? If one is failing, or showing signs of failure, by being kicked out of an array, then the other drive may also be approaching end of life.
It is possible just to be a random timing or other issue without either drive being faulty, so smartctl (long test) on each drive will tell you.
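
A sketch of the long self-test mentioned above, assuming smartmontools is installed and the drives are /dev/sda and /dev/sdb as in this thread. The helper only prints the commands so they can be reviewed before being run as root; its name is made up for illustration.

```shell
# Print (not run) the smartctl commands for an extended self-test on each drive given.
# "-t long" starts the test in the background on the drive itself;
# "-l selftest" shows the self-test log, to be read once the test has finished.
long_test_commands() {
    for dev in "$@"; do
        echo "smartctl -t long $dev"
        echo "smartctl -l selftest $dev"
    done
}

long_test_commands /dev/sda /dev/sdb
```

The long test can take hours on large drives, and the drive stays usable while it runs.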
Title: Re: raid issue
Post by: p-jones on November 05, 2010, 10:59:00 AM: Quote
The "rebuild percentage complete" messages only happen after manual instigation of the rebuild process.

This may be the case when an existing member of a RAID set is removed and replaced by a new drive; however, I have had a number of occasions where there has been a power failure or some klever klog has done a dirty shutdown where an auto rebuild of the existing array has occurred without manual intervention.
Title: Re: raid issue
Post by: Stefano on November 05, 2010, 11:03:14 AM: Quote from: p-jones on November 05, 2010, 10:59:00 AM
I have had a number of occasions where there has been a power failure or some klever klog has done a dirty shutdown where an auto rebuild of the existing array has occurred without manual intervention.

this is normal..any time there's a sync failure, there's an auto sync

but if a hd is kicked out of the array, there's no auto re-sync.. manual intervention is needed
Title: Re: raid issue
Post by: janet on November 05, 2010, 12:05:12 PM: Thanks Stefano

Quote
any time there's a sync failure, there's an auto sync
but if a hd is kicked out of the array, there's no auto re-sync.. manual intervention is needed

This thread is talking about a drive partition that has been kicked out of an array & the sme system does not automatically rebuild that, which is what my comments were referring to.
Mike is mixing up different situations and documentation ie automatic resynchronisation and manual rebuilding are different things.
Title: Re: raid issue
Post by: mike_mattos on November 05, 2010, 02:24:14 PM: the manual refers to a drive with an "F" flag ( md2 : active raid1 hda2[0](F) hdb2[1] <-- Shows current active partition - with one FAILED (F) ), my drive did not show this flag, thus I didn't associate my failure with the instructions, and wondered why automatic resync had failed.

Also, I recently built a 2 drive sata 7.4 system, and replaced a drive with a clean drive, it automatically installed! I'm now putting a 3 drive cage in machines, 2 enabled, 3rd a hot swap spare, and it looks like a full unattended repair is now feasible, the customer simply turns off the flashing (failed) drive and turns ON the spare.

As to the extended test, if the drive has failed, there should be a record ON THE DRIVE which smartctl would report ( based on my Windows experience. )

In any event, I hope this thread has enough key words to be searchable.
Title: Re: raid issue
Post by: CharlieBrady on November 05, 2010, 03:45:14 PM: Quote from: mike_mattos on November 04, 2010, 06:15:41 PM
I've never had to do a manual rebuild on SME7,

In that case you have never had a drive thrown out of a RAID set (i.e. marked as failed by the software raid layer). The others are correct in saying that there is no magic auto-re-add of drives considered 'failed'.

Quote
... in any case the manual rebuild does not seem appropriate til I understand why the automatic repair failed to work.

I think your time may be better spent investigating when and why it was thrown out of the array in the first place.

Quote
Also, I recently built a 2 drive sata 7.4 system, and replaced a drive with a clean drive, it automatically installed! I'm now putting a 3 drive cage in machines, 2 enabled, 3rd a hot swap spare, and it looks like a full unattended repair is now feasible, the customer simply turns off the flashing (failed) drive and turns ON the spare.

This sounds like a hardware RAID installation to me - one which will be unseen by SME server software RAID, including the monitor.