Koozali.org: home of the SME Server

Obsolete Releases => SME Server 7.x => Topic started by: DFLiddle on May 10, 2012, 11:59:31 AM

Title: Boot | disk check problem after hardware RAID failure
Post by: DFLiddle on May 10, 2012, 11:59:31 AM
I'm assisting a remote colleague with the recovery of an SME server. One of two disks in a hardware-managed RAID 1 set failed in a Dell PowerEdge T100. The RAID controller indicated that the disk was "degraded", and the system failed to boot. The disk was replaced under warranty, and the RAID controller dutifully synchronized the disks after the new one was installed.

SME, when booting, indicates that the system was not shut down cleanly and that it ought to be checked. Due to the apparent severity of the issue, it has not been possible for my colleague to bypass the disk check -- it is forced. However, the check never appears to make progress, even when allowed to run overnight.

Using an installation CD, it is possible to enter Rescue Mode and mount the SME file system. My colleague is going to ensure that his data and configuration are backed up, assuming that his USB backup drive is accessible from Rescue Mode.

After running further hardware diagnostics today, it's possible that the other hard drive is also experiencing errors, which casts doubt (in my mind) on the state of the file system as synchronized with the replacement drive. (When booted alone in the server, this new drive is also incapable of making progress on or completing a file system check.)

What ideas do you have on this situation? What further information could we provide to shed more light on the matter?
Title: Re: Boot | disk check problem after hardware RAID failure
Post by: Stefano on May 10, 2012, 12:08:45 PM
In my experience, if you are going to use RAID 1, you should not use the hardware one but let SME use its own software RAID 1.

That way you gain more control over what's going on.

I would back up the data and do a full reinstall without hardware RAID (btw, could you please tell us the exact server model and RAID controller? tia)

good luck
Title: Re: Boot | disk check problem after hardware RAID failure
Post by: DFLiddle on May 10, 2012, 01:06:48 PM
All points granted, Stefano.

The best information I have on the controller card, since Dell system configuration lists are often vague, is that it is an LSI MegaRAID SAS controller card. I will need to confirm its identity with my colleague when he gets back to me again.
Title: Re: Boot | disk check problem after hardware RAID failure
Post by: janet on May 10, 2012, 06:13:52 PM
DFLiddle

Quote
Using an installation CD, it is possible to enter Rescue Mode and mount the SME file system. My colleague is going to ensure that his data and configuration are backed up, assuming that his USB backup drive is accessible from Rescue Mode.

Yes, and that's the first thing you should be doing; refer to the Wiki & Howtos.
Once the data is secure, proceed to rebuild the server, preferably in software RAID 1 with new drives, then restore from backup or copy the data back to the new server.
If the standard backup routines do not achieve this (e.g. due to backup failure), then read this article for an overview of alternative backup & restore options:
http://wiki.contribs.org/Backup_server_config
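
Because both drives are suspect, it may be worth confirming that the copied data actually matches what was read from the server before wiping anything. The following is only a minimal sketch in Python 3, not an SME Server tool: it assumes the source tree and the backup copy are both mounted on a machine that has Python 3, and the function names (`file_digest`, `compare_trees`) are made up for illustration.

```python
import hashlib
import os
import sys


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, backup_root):
    """Compare regular files under source_root with copies under backup_root.

    Returns three sorted lists of paths relative to source_root:
    files missing from the backup, files whose contents differ,
    and files that could not be read (for example, on a failing disk).
    """
    missing, mismatched, unreadable = [], [], []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            source_path = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are compared.
            if os.path.islink(source_path) or not os.path.isfile(source_path):
                continue
            relative = os.path.relpath(source_path, source_root)
            backup_path = os.path.join(backup_root, relative)
            if not os.path.isfile(backup_path):
                missing.append(relative)
                continue
            try:
                if file_digest(source_path) != file_digest(backup_path):
                    mismatched.append(relative)
            except OSError:
                # A read error is recorded rather than aborting the whole run.
                unreadable.append(relative)
    return sorted(missing), sorted(mismatched), sorted(unreadable)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python3 verify_backup.py SOURCE_DIR BACKUP_DIR
    results = compare_trees(sys.argv[1], sys.argv[2])
    for label, paths in zip(("missing", "mismatched", "unreadable"), results):
        print(f"{label}: {len(paths)}")
        for path in paths:
            print(f"  {path}")
```

Saved under a hypothetical name such as `verify_backup.py`, it would be run with the source directory and the backup directory as its two arguments (both paths depend on where the file systems happen to be mounted). Any entry reported as unreadable or mismatched is a hint that the file was damaged before or during the copy.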

You can rebuild a single-drive SME Server machine for testing, i.e. to thoroughly test those "faulty" drives. Refer to the Howtos.