Koozali.org: home of the SME Server

Partition damaged (after power cut) - system won't start - how to repair?

Hi all,

I'm having some trouble with my server after a power cut.
It's an SME 8b6 (for some strange reason 7.* won't install on this machine, I've never understood why, but at the time I managed to install version 8, an early beta that I've updated from time to time since). There's only one SATA hard drive in it.

So, after that power cut, the system won't restart and throws errors about one of the partitions. I've seen these before, and on a non-RAID system I'd (probably) be able to repair them, but I know nothing about RAID... It all ends with a failed recovery and a kernel panic. Here is some of what I managed to copy down (I've only got one screen for two computers, and besides, it goes fast...):

recovery required
exception Emask (followed by numbers)
I/O error, /dev/sda, sector xxxxxxx
failed to read block at offset...
JBD: recovery failed

Then the kernel panic.

I've tried using the rescue system on the installation CD; it claims I don't have any Linux partitions on this machine.

There's nothing critical on that drive, but it's still stuff I'd like to recover. This is only a personal/test server and I can temporarily survive without it, so all I really want to do is back up the data on /dev/sda2; I'll reinstall when I get the time, that's the easy part. And before anyone asks why I didn't make backups earlier: I simply didn't have the room. I recently bought a new HD to replace the one in my workstation, so now I have room to copy the data. Only I can't, as long as I can't mount the partition.

So far all I've managed to do from the recovery CD is this:

Code: [Select]
mdadm --examine --scan /dev/sda1 /dev/sda2 >> /etc/mdadm.conf

This outputs a config file with two lines, one for md1 and one for md2. Strangely, each of them says it has two devices, while there are in fact only two in total.
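For reference, the appended lines typically look something like this (UUIDs invented here). The num-devices=2 on a one-drive box is expected: a stock SME install builds its RAID1 mirrors degraded, with the second member missing, so another disk can be added later.

```
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=...
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=...
```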

Code: [Select]
mdadm -A -s
cat /proc/mdstat

The RAID appears to be started (but each array with one disk out of two); the output of the cat command is two blocks of text showing md1 and md2 as active.
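To make the degraded state concrete, here is a minimal sketch (sample text invented to match a typical single-drive SME layout) of how /proc/mdstat reports it: "[2/1]" means two members defined but only one present, and "[U_]" marks the missing half.

```shell
# Sample /proc/mdstat text for two mirrors each running on one disk out of two:
mdstat_sample='md2 : active raid1 sda2[0]
      487731648 blocks [2/1] [U_]
md1 : active raid1 sda1[0]
      104320 blocks [2/1] [U_]'
# Count the arrays that are running degraded (fewer members present than defined):
degraded=$(printf '%s\n' "$mdstat_sample" | grep -c '\[2/1\]')
echo "degraded arrays: $degraded"   # prints: degraded arrays: 2
```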

I can then mount /dev/md1, which appears to be the boot partition. /dev/md2 won't mount and I can't run fsck on it (some superblock error). I also have lots of devices from /dev/md3 up to 10 or even more; I guessed I should ignore them.
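A hedged note on why fsck fails on /dev/md2: on a stock SME layout, md2 holds an LVM physical volume rather than an ext3 file system, so there is no ext3 superblock for fsck to find. An LVM2 PV announces itself with the label "LABELONE" in its second sector; simulated below on a scratch file instead of a real device.

```shell
# Build a 4-sector scratch file and stamp the LVM2 label into sector 1 (offset 512):
dd if=/dev/zero of=/tmp/fake_pv.img bs=512 count=4 2>/dev/null
printf 'LABELONE' | dd of=/tmp/fake_pv.img bs=1 seek=512 conv=notrunc 2>/dev/null
# Read the 8 label bytes back, the way a tool probing for a PV would:
sig=$(dd if=/tmp/fake_pv.img bs=1 skip=512 count=8 2>/dev/null)
echo "$sig"   # prints: LABELONE
rm -f /tmp/fake_pv.img
```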

All suggestions are welcome!

Thanks in advance.

Seb.
« Last Edit: June 29, 2011, 06:29:08 PM by Old Lodge Skins »
"How high does the sycamore grows? If you cut it down, you'll never know!" - Vanessa Williams, Pocahontas.

Offline CharlieBrady

This article might be useful to you:

http://wiki.contribs.org/Recovering_SME_Server_with_lvm_drives

You need to understand how the structure is layered onto the physical hard drive: first there is RAID1, then LVM on top of that, then swap and the ext3 file systems within the logical volumes.
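As a sketch of that layering on a stock one-drive SME install (LV names assumed from the /dev/main/root path used elsewhere in this thread):

```
/dev/sda
 |- sda1 -> md1 (RAID1, degraded) -> ext3 -> /boot
 '- sda2 -> md2 (RAID1, degraded) -> LVM PV -> VG "main"
                                                |- LV root -> ext3 -> /
                                                '- LV swap -> swap
```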

Show here what you see when you do:

Code: [Select]
cat /proc/mdstat
pvscan
lvscan

Offline CharlieBrady

It's an SME 8b6 (for some strange reason 7.* won't install on this machine, I've never understood why, ...

Some newer hardware is not supported by SME7.

Hi,

It's an oldie ;) With an E2200 (if memory serves me right) and DDR1 memory... But the reason SME7 won't support it doesn't matter right now, and anyway SME8 was doing just fine for what I need it to do.

I suspected the LVM structure, but the installation CD doesn't seem to have the tools for that... In fact, I've been following this tutorial: http://www.linuxjournal.com/article/8874 - but got stuck at the part that deals with LVM.
I'll read your link later today and see if it helps. Thanks.

Seb.

Alright, since SME's install disk doesn't seem to have the LVM tools, I'm now downloading the SystemRescueCd distribution and I'll give it a shot. I believe it has both the RAID and LVM tools.

Online Stefano

Alright, since SME's install disk doesn't seem to have the LVM tools, I'm now downloading the SystemRescueCd distribution and I'll give it a shot. I believe it has both the RAID and LVM tools.

SME's install disk has LVM tools for sure.. it's able to detect an SME installation and mount it.. :-)

Well, at the command line I couldn't find them...

Never mind, there's hope: using this rescue CD and the instructions in Charlie's link (the same one I already had), I've been able to access the LVM. e2fsck says it's clean (???) and I could mount it and see my files.
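For the record, the sequence from a rescue shell is roughly the following (a sketch, not verbatim from this thread: the VG name "main" matches a stock SME install, and the guards only keep the script from aborting where a tool or device is absent):

```shell
mdadm --assemble --scan 2>/dev/null || true   # start the (degraded) RAID arrays
vgscan 2>/dev/null || true                    # look for volume groups on the PVs
vgchange -ay main 2>/dev/null || true         # activate the logical volumes in "main"
mkdir -p /mnt/root 2>/dev/null || true
if mount /dev/main/root /mnt/root 2>/dev/null; then
  status=mounted                              # data is now readable under /mnt/root
else
  status="not mounted"                        # expected anywhere but the rescue environment
fi
echo "$status"
```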

Now, the server still won't start. There are no more exception masks and such; it just says it can't find the volume group for /dev/main/root, and then of course kernel panic... So something must be broken in SME's LVM configuration. I'll try the install CD's repair procedure once more; maybe it'll work this time. At least I believe I've passed an important first step.

Edit: OK, SME's install CD did see and mount the volume this time. Only it leaves me at a console telling me where the volume is mounted and how to make it the active file system... which I did... Now I'm guessing its LVM configuration must be broken one way or another and I need to restore it. I don't know yet how to do that, so I'll start searching, but in the meantime, if anyone has an idea...
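One possibility worth checking (an assumption, not something confirmed in this thread): the initrd may no longer assemble the RAID/LVM stack at boot, in which case rebuilding it from the install CD's rescue mode, after chroot /mnt/sysimage, can help. A sketch only; the fallback kernel version below is purely illustrative, pick the real one from /boot:

```shell
# Find the installed kernel version from /boot (falls back to an invented example):
kver=$(ls /boot 2>/dev/null | sed -n 's/^vmlinuz-//p' | head -n 1)
kver=${kver:-2.6.18-8.el5}
cmd="mkinitrd -f /boot/initrd-${kver}.img ${kver}"
echo "$cmd"   # echoed rather than executed here: run the printed command inside the chroot
```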

Seb.
« Last Edit: June 30, 2011, 01:20:43 PM by Old Lodge Skins »

Offline purvis

Hmm, I would check the RAM on that computer before going further. If you have another system just like the one you are using and its RAM checks out, you might want to use it. The computer that took the power failure might be damaged, which would add to your troubles.

I don't have any other box with DDR1, unfortunately... Besides, this isn't the first power cut I've had (on this machine or others); I've had partitions damaged and then repaired in the past, but never a problem with my RAM. That said, I could still check it with memtest, but I believe the other systems I've used while trying to repair that partition (a Knoppix, the install CD and the System Rescue CD) would have given some warnings, as live CDs run entirely from RAM. None of them failed or gave any warning.

I haven't had time to work on restoring the LVM yet, but since I can read my data, I guess the worst is behind me.

Thanks.

Seb.

Offline purvis

If you have had a lot of problems losing access to your files, then I hope you are doing something different than I am. I depend on the stability Linux RAID1 is supposed to have; if it isn't stable through a few power outages, then I feel I'm in for trouble down the road.
We use UPSes, but even those lose power after a while. Personally, I don't use or work on anybody's computer without one.
Again, are you doing anything nonstandard to lose your file system?

Offline CharlieBrady

Again, are you doing anything nonstandard to lose your file system?

Old Lodge Skins has only one hard drive. RAID1 provides no protection in that case.

I haven't said I've had a LOT of problems accessing my files, either...
The power went down while the drive was busy, which damaged the partition... That's all. It's not something you'd want to happen, but in the end it's rather common, and it wouldn't have happened if my UPS weren't dead. Also, as Charlie said, I'm working with only one drive here.
Other than that, this machine has been working just fine for over a year now ;) I've had other power failures in the past, and this is the first time something like this has happened on my server. A similar problem happened on a workstation some time ago, but it was easily solved with e2fsck.
Actually, my only real problem was that I knew nothing about RAID and LVM. But I'm learning ;)

PS: One of the points of a UPS is to let you shut down your system properly. So when a power failure seems to be lasting a little too long, you should play it safe and stop the system until everything comes back to normal. At least I know I would.
« Last Edit: July 01, 2011, 08:13:02 PM by Old Lodge Skins »

Well... I've been searching, but I still don't get it.

The recovery mode of the installation CD does mount my system just fine. I can see my files. But when trying to boot the installed system, I still get this error:

First I see this:
Code: [Select]
raid set md1 active with 1 of 2 mirrors

From this I assume the RAID's fine.

Then:
Code: [Select]
Activating logical volumes
Reading all physical volumes. This may take a while...

So far all is normal, right?

Code: [Select]
Activating logical volumes
Volume group "main" not found

This is where it all goes south... Of course, from there there's no data access, and it ends with a kernel panic.

I don't understand why it doesn't find the volume group or how to restore it... Any ideas?

Thanks.

Seb.

Offline CharlieBrady

The recovery mode of the installation CD does mount my system just fine. I can see my files.

What is mounted when you are in rescue mode? Is it the logical volume, or the RAID device (e.g. /dev/md2)? If the latter, then go into edit mode in grub and change 'root=/dev/main/root' to 'root=/dev/md2'.
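For illustration, if the RAID device were the one mounted, the edited grub.conf stanza would look roughly like this (title and kernel version invented; the layout follows a typical SME 8 / CentOS 5 install):

```
title SME Server
        root (hd0,0)
        kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/md2
        initrd /initrd-2.6.18-8.el5.img
```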

If you have a spare computer, I would use it as a backup server. Boot the damaged server in rescue mode, and copy your files from it over the network.

In /etc/mtab I can see /dev/main/root mounted as / and /dev/md1 as /boot.
In the installed system's grub.conf, all three entries have root=/dev/main/root. I'll try changing it.

Edit: well, no result... I still get the same error about /dev/main/root...

I could simply move the drive to my workstation and copy the files; in fact, I'm probably going to do that, and if I wanted I could then safely do a clean install. But I'd still like to understand what happened anyway.
« Last Edit: July 04, 2011, 06:25:57 PM by Old Lodge Skins »