Koozali.org: home of the SME Server

Obsolete Releases => SME Server 7.x => Topic started by: twijtzes on October 22, 2010, 02:50:06 PM

Title: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 22, 2010, 02:50:06 PM
Dear All,

We had a serious server crash. Apparently everything came together. Dar refused to make backups for two consecutive weeks, but overwrote all previous backups (STUPID ME :-( I should have checked).

One of the disks in the raid array (HP 5i) died (72 GB SCSI). I replaced it and let the raid controller rebuild it.

I cannot access the SME server anymore. Does anyone have a trick or a tip for how I can rescue the information in the ibays? I saw a rescue option on the installation CD. That has been running for more than a week now. However, I am still looking at the cciss driver loading message. I see the lights of the SCSI disks flashing. Is this supposed to take so long? There was about 25 GB of data on the server.


Please please please help me
Taco

ps. A good bottle of wine/beer/whiskey/fruit juice will be sent to you if you are able to help me (I promise!)
Title: Re: Hardware failure SME server not accessible anymore
Post by: Stefano on October 22, 2010, 04:37:42 PM
twijtzes:

does the server have a monitor? :-)

if so, please sit in front of it and tell us what you see
is there any error message?
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 22, 2010, 06:15:07 PM
It does,

there are no error messages, I see a blue screen

on the top left SME server 7.5

In the center of the screen
Loading SCSI driver
loading cciss driver

at the bottom press space bar
F12


and a lot of disk activity
Title: Re: Hardware failure SME server not accessible anymore
Post by: Stefano on October 22, 2010, 06:26:14 PM
ok.. can you tell me which hp server are you using? did you check here (http://h18004.www1.hp.com/products/servers/linux/hplinuxcert.html) if your hw is certified for redhat 4.x?
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 22, 2010, 08:05:14 PM
It has been running SME server 7.5 flawlessly for some time now; I guess it is Red Hat supported. It is an HP DL380 G3.
Title: Re: Hardware failure SME server not accessible anymore
Post by: Stefano on October 22, 2010, 08:33:07 PM
Quote
It has been running SME server 7.5 flawlessly for some time now; I guess it is Red Hat supported. It is an HP DL380 G3.

then open a bug in bugzilla..
I suspect an hw failure
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 22, 2010, 08:38:14 PM
In what way is it a bug then ?

Can I access the system through a CD-like install to see what's left of the information on the drives. Could a reinstall work without modifying the data, a repair maybe; what are my options ?
Title: Re: Hardware failure SME server not accessible anymore
Post by: Stefano on October 22, 2010, 09:15:07 PM
Quote
In what way is it a bug then ?

it could be a bug or an hw failure

Quote
Can I access the system through a CD-like install to see what's left of the information on the drives. Could a reinstall work without modifying the data, a repair maybe; what are my options ?

try booting from cd with
Code:
sme rescue

at boot prompt..

I would check raid status from bios too
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 22, 2010, 09:31:27 PM
I am a little bit afraid that if I stop what the system is doing now, the whole system may get corrupted.

Before I started the current rescue, the Raid Status of all disks was ok

What do you think, could the current status not be a rescue action from the software itself or is it hanging ?

I guess it is running the rescue option; it's been a week you know... is this holdup normal during rescue ?

Taco
Title: Re: Hardware failure SME server not accessible anymore
Post by: Stefano on October 22, 2010, 09:34:19 PM
Quote
I guess it is running the rescue option; it's been a week you know... is this holdup normal during rescue ?

boot should take 1, 2 minutes, not ages :-)

reboot, check raid status, boot from cd in rescue mode, let us know (try to boot from SME8 cd too)
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 22, 2010, 09:48:38 PM
i'll try tomorrow morning;

thanks so far

is version 8 a good idea ?

i'll download it right now
Title: Re: Hardware failure SME server not accessible anymore
Post by: Stefano on October 22, 2010, 09:53:17 PM
Quote
is version 8 a good idea ?

it is based on CentOS 5.x, so it's newer, maybe with better hw support
let us know
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 22, 2010, 10:25:15 PM
The offer still stands, tomorrow i'll give it a go

Taco
Title: Re: Hardware failure SME server not accessible anymore
Post by: janet on October 23, 2010, 05:32:17 AM
twijtzes

A golden rule of data recovery is to stop using the disks that the data is on.
As you are running RAID rebuild and other "unknown to us" rescue/repair options, then who knows what is now happening with your disks and the data on them, if indeed there is any data still on them.

You should shut down the machine, remove the drives, rebuild a new machine using different drives or build another test server using a single standard drive, and then mount the old drives to see if there is any data on them.

You can do a similar thing using the SME server install CD and boot up from that in rescue mode, and that way see what is on the drives, but you need to take great care that you do not inadvertently delete any remaining data.

All the above is "critical" as you do not have a current backup, tch tch. You should always have multiple backup disks (rotated from day to day) so that you have a backup of the backup, so to speak.
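
One way to keep an interrupted backup job from destroying the previous backup is to write the new archive under a temporary name and rename it only after the job has finished. The following is a minimal sketch of that idea, not a description of how DAR or the SME backup tools work: it uses tar in place of DAR, and the function name safe_backup, the .partial suffix and the example paths are all assumptions made for illustration.

```shell
#!/bin/bash
# Sketch only: tar stands in for DAR, and every name here is hypothetical.
# The new archive is written as <name>.tar.gz.partial and is renamed over
# the previous <name>.tar.gz only after tar has exited successfully.
safe_backup() {
    local src_dir=$1     # directory to back up, e.g. /home/e-smith
    local dest_dir=$2    # backup target, e.g. a mounted USB disk
    local name=$3        # archive base name, e.g. the weekday

    local tmp="$dest_dir/$name.tar.gz.partial"
    local final="$dest_dir/$name.tar.gz"

    if tar -czf "$tmp" -C "$src_dir" . ; then
        mv -f "$tmp" "$final"    # the previous backup is replaced only now
    else
        rm -f "$tmp"             # failed run: the previous backup is untouched
        echo "backup of $src_dir failed, kept previous $final" >&2
        return 1
    fi
}
```

Called as, for example, safe_backup /home/e-smith /mnt/usbbackup monday. If the machine reboots in the middle of a run, only the .partial file is incomplete and last week's archive is still there.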
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 23, 2010, 06:34:09 AM
Hi Mary,

You are very right. I have no clue when it comes to servers, I am merely a user. Found out that SME server did the trick for us and went for it. I had DAR running and ran all updates. However each time DAR made a (rotating) backup (each day, one week period), it overwrote the backup of the week before. During the backup job the system rebooted, so the backups were never finished. I should, of course, have checked the backup on a daily basis; however, as it had been making the backups for more than three years, I thought there was no reason to do so. I was wrong.

I haven't done anything but put in a SME install disk in the CD drive and choose rescue.
 
Can you please explain how I could delete remaining data when using the CD, as I have done that....

Kindest regards,
Taco
Title: Re: Hardware failure SME server not accessible anymore
Post by: janet on October 23, 2010, 06:43:16 AM
twijtzes

Quote
I haven't done anything but put in a SME install disk in the CD drive and choose rescue.
Can you please explain how I could delete remaining data when using the CD, as I have done that...

Do you really mean to ask that question? It does not make sense to me.
Please explain better what it is that you want to do.
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 23, 2010, 06:59:40 AM
So far,

I downloaded the latest 7.5 ISO and burned it to a CD. Booted the server with this new CD and chose rescue. I did this one week ago. It has ever since been displaying the text loading SCSI driver and something with cciss driver loading. There has been a huge amount of disk activity, but the boot seems to have stopped.

I have a replacement HP DL380 G4 for the old server which is also running SME server 7.5. And this new one is running like sunshine.

I would like to recover my old data, don't care about the old server (would be nice if it survived). I have backups of all the databases, as these were/are stored in two different locations, one in a DAR backup, the other in a MySQL backup (run on a client computer) on a different USB drive. These were restored again.

I could, as you suggest, take out the SCSI drives from the old server and connect them externally to the new SME box. However, I have the feeling that this would permanently kill the old server, or am I wrong?

What would be the best way to take the disks out and mount them externally ?


Title: Re: Hardware failure SME server not accessible anymore
Post by: janet on October 23, 2010, 09:09:31 AM
twijtzes

Your situation is difficult to accurately diagnose remotely, especially with the very little information you have provided.

If your server will not boot to the SME CD, then I assume there must be a hardware issue that is stopping it from booting up. Maybe the motherboard is faulty, maybe the drive controller card is faulty, maybe something else ? Perhaps you should take the system to a technician to determine if there is a hardware fault if you are unable to determine that yourself.

Again I'm guessing a possibility is that maybe when you put the replacement drive in, and started it rebuilding the array, perhaps the system wiped out any data, ie it resynced to a blank drive. It's hard to tell what has happened at this stage.

Are the drives using hardware RAID1, ie not software RAID1? If so then you need to keep the drives connected to that specific (functional) drive controller for the system to work correctly.
If you are using software RAID1 then you should be able to remove the drives and connect them to a machine with similar CPU type and if the drives are OK and contain data, then the server should boot up OK.

As I have said already, anything you do with the original drives adds the high risk that something will go wrong and destroy whatever data still remains on them. Typically you would do a bare metal disk clone copy using the dd command or similar, or compatible cloning software. That way you can do testing on the copied drives to see what data still exists, without tampering with the original drives.
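
A disk clone with dd could look roughly like the sketch below. The function name clone_disk and the device and image names in the comments are placeholders, not something taken from your system, and the image must be written to a different, healthy disk with enough free space.

```shell
#!/bin/bash
# Sketch only: device and image names are placeholders, adjust to your system.
# conv=noerror,sync tells dd to keep going past unreadable blocks and to pad
# short or failed reads with zeros, so offsets in the image still line up
# with the original disk. A final partial block is padded as well, so the
# image can be slightly larger than the source.
clone_disk() {
    local src=$1      # source to read from, e.g. /dev/sdc (hypothetical)
    local image=$2    # image file to write, e.g. /mnt/spare/olddrive.img

    if [ ! -r "$src" ]; then
        echo "cannot read $src" >&2
        return 1
    fi
    dd if="$src" of="$image" bs=64k conv=noerror,sync
}
```

Double check which name is the source and which is the destination before running it, because dd overwrites the destination without asking. Any testing is then done on the image or on a drive restored from it, not on the original.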

If you want to play "slightly" dangerously, and taking great care, then you can put one of the original drives into the good server and mount it eg as /dev/sdc
Then you can use the working system to interrogate the drive. You can then attach another known good blank drive and copy the valuable data to it.

I think the viability of doing these steps will depend on whether the drive(s) are only readable when connected to their proprietary hardware disk controller.

See the various Howtos about disks which will assist you with testing, using, mounting, rebuilding etc, some are applicable, some will have ideas & techniques you can use, eg
http://wiki.contribs.org/AddExtraHardDisk
http://wiki.contribs.org/AddExtraHardDisk_-_SCSI
http://wiki.contribs.org/Disk_Manager
http://wiki.contribs.org/Booting
http://wiki.contribs.org/Monitor_Disk_Health
http://wiki.contribs.org/Raid
http://wiki.contribs.org/Raid:LSI_Monitoring
http://wiki.contribs.org/Raid:Manual_Rebuild
http://wiki.contribs.org/Recovering_SME_Server_with_lvm_drives
http://wiki.contribs.org/USBDisks

If one of your drives is readable in another (cleanly installed OS) machine, then you could try using this Howto to recover everything
http://wiki.contribs.org/UpgradeDisk

Honestly if you are unsure about how to do all the above, then take the equipment to a Linux expert and get him/her to see if your data is recoverable.

Quote
I could, as you suggest, take out the SCSI drives from the old server and connect them externally to the new SME box. However, I have the feeling that this would permanently kill the old server, or am I wrong? What would be the best way to take the disks out and mount them externally ?

I do not think doing that will "kill" the drives. It may break the RAID array, but the data should still be intact on each disk. The real issue may be that the drive is not readable, as it needs to be connected to the specific controller card (if it is hardware RAID). I rarely deal with hardware RAID so cannot comment more.


Try connecting an original drive from the faulty server to sdc (or the next spare port) on the other good server. This assumes sda & sdb are already being used for the RAID1 software array, so adjust accordingly.

Verify which drives are connected with
fdisk -l |more
Verify details of the old drive
fdisk -l /dev/sdc
make a mount point eg
mkdir -p /mnt/olddrive
mount the drive
mount /dev/sdc1 /mnt/olddrive
For your drive it might be sdc2 or sdc3 depending on the drive setup

Use Linux commands or mc (midnight commander) to read the old drive and see what the contents are eg
ls -al /mnt/olddrive
If OK you should see the directory listing of a typical sme server
Copy files and data etc, see this howto for what is in a normal backup
http://wiki.contribs.org/Backup_server_config#Standard_backup_.26_restore_inclusions

Then unmount the drive
umount /mnt/olddrive

Then if all went well and depending what and where you copied data etc, do on the new server
signal-event post-upgrade
reboot

All the above is generic advice, your specific situation may require the advice to be adjusted or differ to suit.
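
Since the aim is only to read the old drive, the mount step can also be done read-only with the ro mount option. A minimal sketch follows; the function name mount_readonly and the DRY_RUN switch are made up for illustration, and /dev/sdc1 and /mnt/olddrive are just the example names used above.

```shell
#!/bin/bash
# Sketch only: mounts a partition read-only so nothing gets written to it.
# With DRY_RUN=1 the commands are printed instead of executed, which lets
# you check the device name before touching the drive.
mount_readonly() {
    local dev=$1    # partition on the old drive, e.g. /dev/sdc1
    local mnt=$2    # mount point, e.g. /mnt/olddrive
    local run=""

    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    $run mkdir -p "$mnt" &&
    $run mount -o ro "$dev" "$mnt"
}
```

For example, DRY_RUN=1 mount_readonly /dev/sdc1 /mnt/olddrive prints the two commands, and the same call without DRY_RUN (as root) runs them.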
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 23, 2010, 09:45:10 AM
First thing to do then is to stop the old server; I'll do that right away.

As the disks are hot swap hardware raid; i need to figure out if there is a hardware malfunction; if so i will repair the old server

What about the 8.0 beta from CD option ?
Title: Re: Hardware failure SME server not accessible anymore
Post by: janet on October 23, 2010, 09:57:51 AM
twijtzes

Quote
What about the 8.0 beta from CD option ?

What about it ? You are only introducing another variable if you try using the SME 8 CD.

Supposedly your system was running OK with SME7.x so it should run the 7.5 CD OK.

Better that you test your SME7.5 CD on another server box, to see if the system will boot up to that disk OK (ie the same CD you tried to boot to for the last week).
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 23, 2010, 10:59:20 AM
I missed your last post, sorry

Booted with the SME 8 disk and guess what, everything is still there !!!!!! Thank god, and you guys. Where can I send the wine/beer/whiskey; please send me an e-mail. My e-mail can be found under my name

How do I get the data from the box on to a USB disk ?

Taco
Title: Re: Hardware failure SME server not accessible anymore
Post by: janet on October 23, 2010, 11:13:05 AM
twijtzes

Quote
How do I get the data from the box on to a USB disk ?

There are various instructions and links to instructions on my earlier posts to steer you in the right direction re how to do that.

In short
connect USB drive
mount drive
copy data
unmount drive
Then use USB to restore data to another server.

Actually it is probably best/easiest if you do an actual backup.
Boot to CD
at command prompt type
console
select the backup option
connect USB when asked
wait for backup to finish (can take from an hour to quite a few hours depending on how much data is on your system, say 8hrs for 350Gb)

On the good server restore from the USB
Normally done to a clean install of the OS on first boot, when you are given the option to restore from a backup
Otherwise if your system is clean (ie no data or users etc) then follow these instructions to reset the backup on first boot switch
http://wiki.contribs.org/Backup_server_config#Restore_on_initial_reboot_after_fresh_OS_install_-_How_to_Reset_option
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 25, 2010, 10:25:38 AM
Hello All,

Thanks so far, You may have noticed that I don't know anything about LINUX. So I'm stuck again!

I've started up using the boot CD of version 8 in rescue mode.

When it's there, starting up MC doesn't work. Maybe because I'm running version 8 on a version 7 installed system. The system gives a segmentation fault.

I've tried to run console in rescue mode; however it says command not found

I'm trying to mount a USB disk in the version 8 rescue mode, however so far unsuccessfully. The USB disk is ext3 and is tested ok on another server. I followed the contrib on wiki.contribs.org/usbdisks, the part that deals with SME version 8, but when I want to run the command line

Code:
hal-find-by-property --key volume.fsusage --string filesystem

I also get a command not found

Now, I really have no clue where to start. My preferred solution would be to mount a USB disk and copy the data in /mnt/sysimage/home/e-smith to the USB disk. What should I do ?
Title: Re: Hardware failure SME server not accessible anymore
Post by: janet on October 25, 2010, 12:39:52 PM
twijtzes

A search on
segmentation fault
shows many results, one is
http://forums.contribs.org/index.php/topic,31381.msg131933.html#msg131933
which advises you have either a hardware problem (especially bad RAM), a software bug or disk corruption.
Have you tested your drives one at a time with the manufacturer's disk diagnostic utility ?

All the tasks you tried are not working, so this is suggestive of a common problem

I think you should test your system more thoroughly and you should really stop using it until you can prove that the various system components are working correctly. By continuing to use it you may actually end up erasing data that is still on the drives.

You have been given suggestions in earlier posts re mounting the disks on another known working server, try that.
Title: Re: Hardware failure SME server not accessible anymore
Post by: twijtzes on October 25, 2010, 01:18:28 PM
Hi Mary,

I've installed SME server 7 and 8 on four different computers now. After a clean, successful setup I boot from the CD in linux rescue mode or SME rescue mode (for SME 7.5.1). Midnight Commander never starts in rescue mode. Neither in version 8.0 nor in 7.5.1 nor 7.5.

The cp command will do the trick anyway, so there is no real problem to be anticipated.

I would like to know how I can mount a USB disk in commandline mode under SME server 8.0 beta while running in rescue mode. I've read that this is not the same as in version 7.5.1. I don't succeed in 7.5.1 either (in rescue mode); /etc/fstab is always missing. The mount command shows a usb disk, however mounting it is a completely different chapter.

Title: Re: Hardware failure SME server not accessible anymore
Post by: Stefano on October 25, 2010, 02:05:16 PM
boot your server with SME's cd in rescue mode with external usb disk connected..

you should see it after boot process end.

then mount it with
Code:
mount /dev/sdX1 /mnt/yourmountpoint

where sdX and yourmountpoint should be self-explanatory

then you can copy your data with cp -a or rsync

NOTE: not tested, should work
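
The copy step could be sketched like this, assuming cp and diff are both available in the rescue shell (not checked here). The function name copy_and_verify and the USB mount point in the usage note are assumptions; /mnt/sysimage/home/e-smith is the data location mentioned earlier in the thread.

```shell
#!/bin/bash
# Sketch only: copies a directory tree with cp -a (keeps owners, permissions
# and timestamps where the target filesystem allows it) and then compares
# source and copy with diff -r. Returns non-zero if the copy or the
# comparison fails.
copy_and_verify() {
    local src=$1    # e.g. /mnt/sysimage/home/e-smith
    local dst=$2    # e.g. /mnt/yourmountpoint/e-smith on the USB disk

    mkdir -p "$dst" || return 1
    # "$src/." copies the contents of src, hidden files included
    cp -a "$src/." "$dst/" || return 1
    diff -r "$src" "$dst" >/dev/null
}
```

For example: copy_and_verify /mnt/sysimage/home/e-smith /mnt/yourmountpoint/e-smith && echo "copy verified". Unmount the USB disk with umount before unplugging it.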
Title: Re: Hardware failure SME server not accessible anymore
Post by: janet on October 25, 2010, 02:21:30 PM
twijtzes

Quote
I would like to know how I can mount a USB disk in commandline mode under SME server 8.0 beta while running in rescue mode. I've read that this is not the same as in version 7.5.1. I don't succeed in 7.5.1 either (in rescue mode); /etc/fstab is always missing. The mount command shows a usb disk, however mounting it is a completely different chapter.

Commands already provided in post #17 of this thread. Please read the answers you are given.