Koozali.org: home of the SME Server

Boot | disk check problem after hardware RAID failure

DFLiddle
« on: May 10, 2012, 11:59:31 AM »
I'm assisting a remote colleague with the recovery of an SME server. One of two disks in a hardware-managed RAID 1 set failed in a Dell PowerEdge T100. The RAID controller indicated that the disk was "degraded", and the system failed to boot. The disk was replaced under warranty, and the RAID controller dutifully synchronized the disks after the new one was installed.

SME, when booting, indicates that the system was not shut down cleanly and that it ought to be checked. Due to the apparent severity of the issue, it has not been possible for my colleague to bypass the disk check -- it is forced. However, the check never appears to make progress, even when allowed to run overnight.

Using an installation CD, it is possible to enter Rescue Mode and mount the SME file system. My colleague is going to ensure that his data and configuration are backed up, assuming that his USB backup drive is accessible from Rescue Mode.

After running further hardware diagnostics today, it's possible that the other hard drive is also experiencing errors, which casts doubt (in my mind) on the state of the file system as synchronized with the replacement drive. (When booted alone in the server, this new drive is also incapable of making progress on or completing a file system check.)

What ideas do you have on this situation? What further information could we provide to shed more light on the matter?

Stefano
« Reply #1 on: May 10, 2012, 12:08:45 PM »
In my experience, if you are going to use RAID 1, you should not use the hardware RAID but let SME manage its own software RAID 1.

That way you gain more control over what's going on.

I would back up the data and do a full reinstall without hardware RAID. (By the way, could you please tell us the exact server model and RAID controller? Thanks in advance.)

Good luck.

DFLiddle
« Reply #2 on: May 10, 2012, 01:06:48 PM »
All points granted, Stefano.

The best information I have on the controller card, since Dell system configuration lists are often vague, is that it is an LSI MegaRAID SAS controller card. I will need to confirm its identity with my colleague when he gets back to me.

janet
« Reply #3 on: May 10, 2012, 06:13:52 PM »
DFLiddle

Quote
Using an installation CD, it is possible to enter Rescue Mode and mount the SME file system. My colleague is going to ensure that his data and configuration are backed up, assuming that his USB backup drive is accessible from Rescue Mode.

Yes, and that's the first thing you should be doing; refer to the Wiki and Howtos.
Once the data is secure, proceed to rebuild the server, preferably with software RAID 1 and new drives, then restore from backup or copy the data back to the new server.
If the standard backup routines do not achieve this (e.g. due to backup failure), then read this article for an overview of alternative backup and restore options:
http://wiki.contribs.org/Backup_server_config
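
Since the source drives are suspect, it may be worth confirming that whatever was copied off actually matches what is on the server before wiping anything. The following is only a minimal sketch of one way to do that, assuming Python is available on a machine that can see both the mounted SME file system and the backup copy. The function names are illustrative and are not part of SME Server or its backup tools.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def compare_trees(source_root, backup_root):
    """Compare regular files under source_root with their copies in backup_root.

    Returns three lists of paths relative to source_root:
    files missing from the backup, files whose contents differ,
    and files that could not be read (for example due to disk errors).
    """
    missing, mismatched, unreadable = [], [], []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.islink(src) or not os.path.isfile(src):
                continue
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(backup_root, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
                continue
            try:
                if file_digest(src) != file_digest(dst):
                    mismatched.append(rel)
            except OSError:
                # A read error on either side is reported, not fatal.
                unreadable.append(rel)
    return missing, mismatched, unreadable
```

Calling `compare_trees("/mnt/sysimage/home/e-smith/files", "/mnt/usb/files")` would be one way to use it. Both paths are hypothetical and depend on where the rescue environment and the USB drive are actually mounted. A non-empty `unreadable` list would support the suspicion that the surviving drive is also failing. Note that `os.walk` silently skips directories it cannot list, so an empty result is not proof of a complete backup.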

You can rebuild a single-drive SME Server machine for testing, i.e. to thoroughly test those "faulty" drives. Refer to the Howtos.