Koozali.org: home of the SME Server

Obsolete Releases => SME Server 8.x => Topic started by: Arnaud on January 15, 2014, 09:27:42 PM

Title: sme8 crashed and no not restart // directory /var damaged?
Post by: Arnaud on January 15, 2014, 09:27:42 PM
Good evening everybody  :P
After spending a long time in "read-only" mode on this forum, I now have to switch to "read+write" mode.

This is my first topic and first post here, so let me introduce myself briefly: I'm "Arnaud", a French guy living abroad, and I have been "playing" with sme8 for domestic use only for 2 or 3 years now. My knowledge and experience in IT are not very high, certainly quite low for this forum, but I think not so bad for a hobby home administrator. :lol:

After 2 years without any problem (other than ones I caused myself), I had a crash yesterday: I was restoring the /home of my Ubuntu client from the sme with "affa --full-restore" (via ssh), and it was never ending for only 5 GB. It was not possible to log on to the sme from another PC with ssh, nor to log in to the server-manager via https. The sme was blocked: hard drives running and continuously working with the ethernet device. So I used the reset button of the sme and now it refuses to restart.

During the boot procedure, everything seems OK until "waiting for driver initialization".
udev and the clock start OK.
/dev/md1 is clean.
Local file systems are mounted OK.
The problems start shortly after this, and before "activation /etc/fstab swaps" reports OK.
In between, I got:
Code:
/etc/rc.d/rc.sysinit: line 873 can't find /var/log/utmp
can't access /var/log/utmp
can' access /var/log/wtmp
can't access var/run/utmp
...
can't accedd //var/log/dmesg
can't access /var/run getkey_done
..
no file /var/lock/subsys/microcode_ctl

If necessary, I can try to post some screenshots (taken with a camera! 8)) of the boot procedure.

What I tried today:
- disconnecting one hard drive and then the other: nothing changed (except that the boot procedure reports only 1 disk present)
- starting with a rescue CD:
mounting md1 and md3 works without problems.
I then wanted to have a look at the logs (there should be something about yesterday's crash) in /var/log/messages, but it was not possible to get the file: the directory /var/log seems not to be present on the disk! :shock:
In the /var directory I only have "account", "affa", "lib" and "mail"!! Now I know why the various /var/... files cannot be reached during the boot procedure... 8)
On my other backup sme (the "affa --rise" of my crashed sme) I can find 26 directories in total (have a look at yours!)!
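
For anyone wanting to make the same comparison without counting by eye, here is a minimal sketch. It assumes both /var trees are reachable as ordinary directories (for example the damaged one mounted from the rescue CD and the healthy one copied or mounted from the backup server); the function name and the paths in the usage line are hypothetical.

```shell
#!/bin/sh
# List top-level directories that exist in a reference tree but are
# missing from a damaged one. The function name is illustrative only.
missing_dirs() {
    ref=$1      # healthy tree, e.g. /var from the backup server
    damaged=$2  # damaged tree, e.g. /var mounted from the rescue CD
    for d in "$ref"/*/; do
        [ -d "$d" ] || continue          # skip if the glob matched nothing
        name=$(basename "$d")
        [ -d "$damaged/$name" ] || echo "$name"
    done
}

# Usage (both paths are assumptions):
# missing_dirs /mnt/backup/var /mnt/md1/var
```

This only reports which directory names are absent; it says nothing about whether the directories that are still present are intact.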

--> How is this possible?? What has happened?? How can I avoid this in the future??
--> Could copying the missing directories from the backup sme help to make the crashed sme start and work again?? In the best case, only if everything that is still present is undamaged and has no missing parts...
--> What should I do to avoid a stupid re-installation?

I have up-to-date backups, so the data on the drives is not a problem and can be recovered.

My machine: sme8, in server+gateway mode, no LVM, RAID1 without any message about a degraded RAID.

Thanks in advance.
Salut
Arnaud
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: CharlieBrady on January 15, 2014, 10:59:21 PM
Quote
It was not possible to log on to the sme from another PC with ssh, nor to log in to the server-manager via https. The sme was blocked: hard drives running and continuously working with the ethernet device. So I used the reset button of the sme and now it refuses to restart.

That was a mistake - you should have rebooted via Ctrl-Alt-Del.

I think it was also a mistake to use affa:

http://wiki.contribs.org/Affa

"This code is deprecated and unmaintained".
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: Arnaud on January 16, 2014, 02:43:48 PM
Hello Charlie
and thank you for this information, even if I had hoped to read something else... :-(

OK, I will be more careful about using the "reset" button in the future. In my defence: my sme runs without a screen and without a keyboard as long as it can be reached via ssh --> I will keep a keyboard close to the machine...

About the comments on affa: I was not aware of this because the warning is not present in the French version of the documentation! This is not good for other French-speaking users. Could "someone" add it?

It is quite a pity that this exciting and powerful tool is unmaintained. The --rise function is brilliant and is what allows me to write here and to access my emails and files using the backup machine!
Which application can you recommend to replace affa for backups, if possible with the same philosophy??

Have a nice day.
Arnaud
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: janet on January 16, 2014, 03:44:32 PM
Arnaud

rdiff-backup may suit your needs,
or even the maintained v3 affa which is now a generic Linux application, (rather than the deprecated v2 affa which has sme server specific functionality).
There is no rise feature in v3 affa.


 
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: stephdl on January 16, 2014, 07:38:36 PM
Quote
About the comments on affa: I was not aware of this because the warning is not present in the French version of the documentation! This is not good for other French-speaking users. Could "someone" add it?

Ola, I'm French too. I think you can ask for the right to edit the wiki and then do it yourself... My philosophy of life.
See this page to create a wiki account: http://wiki.contribs.org/Help:Contents
@+
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: janet on January 16, 2014, 11:17:31 PM
Arnaud

Quote
I was restoring the /home of my Ubuntu client from the sme with "affa --full-restore" (via ssh), and it was never ending for only 5 GB. It was not possible to log on to the sme from another PC with ssh, nor to log in to the server-manager via https. The sme was blocked: hard drives running and continuously working with the ethernet device. So I used the reset button of the sme and now it refuses to restart.

Restarting a server with a running restore procedure, is highly likely to cause problems.
You should really have waited for it to complete.

In Linux a lack of access does not mean it has stopped, it is more likely a sign of being busy & has queued the various requests made of the server.
You do not say how long it had taken before you killed the power etc, also you do not give us CPU, RAM, NIC etc details, so we cannot gauge the time it should have taken.
Restoring 5G of data is significant & may take a while across a network.
No details were given about the full restore you were doing, it is certainly not clear to me exactly what you were doing or restoring.

It vaguely sounds like you were inadvertently restoring your sme server (& interrupted that), as you say there are missing folders & now your sme server will not boot etc.
If you were restoring some data files to another server & killed the power, it should not really corrupt your sme server, in most cases the sme server journaling file system will tolerate a forced power down.

I think troubleshooting your existing sme server will be difficult & time consuming, the easiest solution is to reinstall a clean OS & restore from your backups.

You could try doing an upgrade install from sme 8 CD, & if that runs OK, then do a yum update to get the latest packages.
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: Arnaud on January 18, 2014, 10:44:13 PM
Good evening :-P
Thank you for your advice, and please accept my apologies for answering so late: I have been fighting against this recalcitrant SME  :-?

@ stephdl: yes, I can try to do it. But first of all I have to get the situation under control here.

Quote
the maintained v3 affa

Yes, this could be a good idea. With some luck, it may be possible to reuse the backups from v2. In addition, I'm already used to dealing with affa.

What about BackupPC? The wiki gives the impression that it is an attractive program too.

Quote
You do not say how long it had taken before you killed the power etc, also you do not give us CPU, RAM, NIC etc details, so we cannot gauge the time it should have taken.
Restoring 5G of data is significant & may take a while across a network.

The mainboard is an Intel D525 (2x 1.8 GHz) + 2 GB RAM. The transfer rate is ~3.5 Mb/s, so from experience it takes less than 30 minutes to restore the 5 GB. I reset after 2.5 hours.
I know that the sme is slower during the job (of course!), but it normally accepts a login via ssh or via the server-manager (https) during the file transfer.
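
As a rough sanity check of those numbers, the arithmetic can be sketched as below. It assumes that "3.5 Mb/s" means about 3.5 megabytes per second (the only reading consistent with "less than 30 minutes") and that 5 GB is 5 x 1024 MB; the function name is made up for the example.

```shell
#!/bin/sh
# Rough transfer-time estimate using integer arithmetic only.
# size_mb:  amount of data in MB
# rate_x10: throughput in tenths of MB/s (35 means 3.5 MB/s)
estimate_minutes() {
    size_mb=$1
    rate_x10=$2
    seconds=$(( size_mb * 10 / rate_x10 ))
    echo $(( seconds / 60 ))
}

# 5 GB at an assumed 3.5 MB/s
estimate_minutes 5120 35
```

Under those assumptions the estimate comes out at roughly 24 minutes, so a restore still running after 2.5 hours is far outside the expected range.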


Quote
No details were given about the full restore you were doing, it is certainly not clear to me exactly what you were doing or restoring.

I was restoring the /home of my Ubuntu client. The files were going from the sme to the Ubuntu machine: the sme was sending the backup files via ssh to the /home of the Ubuntu machine.

Therefore I cannot explain the missing directories on the sme.

Like you, I had the impression that the restore process was working in the wrong direction, but I had a look at the "job" today and it is correct: "affa --run" works from the Ubuntu client to the sme, so "affa --full-restore" must run the opposite way.

Quote
I think troubleshooting your existing sme server will be difficult & time consuming, the easiest solution is to reinstall a clean OS & restore from your backups.
You could try doing an upgrade install from sme 8 CD, & if that runs OK, then do a yum update to get the latest packages.

This is exactly my opinion too, and I have been working on this (in reality I'm fighting...) for 2 days!
Last night, I restored the production sme from the backup sme.
Of course, as explained in the wiki (http://wiki.contribs.org/Affa#Users_can_not_login_to_server_-_Important), the users could not log in! Solved!
But I still can't get my calendar synchronized with Thunderbird Lightning.
In addition, I can't reach any web site! I can only reach search engines (google or ixquick). But pinging is OK. --> Is squid or dansguardian causing trouble?

My plan: I will try to restore again, but from another backup, via nfs and not ssh.

I will let you know.

Good night!
Arnaud
   

Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: janet on January 19, 2014, 12:11:46 AM
Arnaud

Quote
I have been fighting against this recalcitrant SME  :-?

To be fair, it is not SME server that is recalcitrant, but an unsupported & buggy (?) contrib, perhaps used incorrectly.

Quote
What's about BackupPC?

You are free to use any backup system you wish per your needs & preferences.

Why do you feel the need to depart from the standard SME server manager "backup to workstation" (to a local USB or to a share) ? It is maintained as part of the base OS & works OK, you can do a full backup then daily incrementals.
Far better to be using a reliable backup system.

You can run a backup on & from your Ubuntu or other clients, to SME server.
Then the scheduled normal SME server backup will include the data from the various client backups.
You can do partial restores of the client data if necessary, in order to restore a client when needed.

My only suggestion re how you may discover what happened, is to recreate the setup on a test system, do the Ubuntu backup & restore again & see what happens. Use exactly the same commands etc.

There is the possibility of user error, or it may be that you have found a bug !
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: Arnaud on January 19, 2014, 11:35:27 AM
Good morning,
things are not going much better here:
I have made a --full-restore from another source via nfs: the problem with user authentication is not present (maybe because I had solved it before??)
But the following is still not solved:
Quote
In addition, I can't reach any web site! I can only reach search engines (google or ixquick). But pinging is OK. --> Is squid or dansguardian causing trouble?

I had a look at the error messages:
Code:
Jan 19 10:55:00 sme-intel squid[6506]: auth_param basic program /usr/lib/squid/pam_auth: (2) No such file or directory

After a quick look: I don't have any /usr/lib/squid directory on the newly installed (with the current V8.0 CD, md5 sum OK, CD check OK) and restored server!

A look in the affa archive:
Code:
[root@sme8-virtuelle-kcn affa]# ls clonage/scheduled.0
clonage-setup.pl  etc  home  root

Am I correct in understanding that affa has no influence on the /usr directory (because it is not present in the archive)?
In that case, my restore problem would not come from affa?

I will now reinstall again, not from the latest V8.0 CD but from my old V8 beta2 CD, as I used before: I know that this CD is OK.

@Janet: thanks again for the advice: once I have some peace here, I will reconsider my backup system and strategy for the next "rescue mode".

Regards,
Arnaud

PS: In 99% of cases, IT problems sit between the chair and the screen! In my case the percentage is higher!! 8)
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: janet on January 19, 2014, 05:12:22 PM
Arnaud

Just to be sure you are using the correct procedure:

1) Before doing a full restore you should install a clean version of the SME OS from CD, then make any minimal affa configuration to be able to carry out the restore ie install affa & setup the identical job. You should NOT restore to an existing configured SME server OS.

2) After the restore completes you should issue the commands
signal-event post-upgrade
signal-event reboot

3) After doing a full restore, you then need to reinstall any contribs & other non standard modifications you made (as these are not part of the affa backup), except for custom templates (as these should have been included in the restore). Not sure if you have pam auth configured on your system, you may need to redo these modifications.

Re CDs being OK or not, if not tested, you should run the media check at the beginning of the SME server install process (I think it checks the CD md5 sum is OK).
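
Separately from the installer's media check, the downloaded iso can be checked before burning. The following is only a sketch: it assumes the md5sum tool is available on the machine doing the download, and the file name and checksum in the usage line are placeholders, not real values.

```shell
#!/bin/sh
# Compare a file's MD5 checksum with the published value.
# Returns 0 on a match and 1 on a mismatch.
verify_md5() {
    file=$1
    expected=$2
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)"
        return 1
    fi
}

# Usage (placeholders):
# verify_md5 downloaded.iso checksum-from-the-download-page
```

A matching checksum only proves the downloaded file is intact; the burned disc still needs the media check.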

The other backup contrib that is similar to affa (ie uses rsync over ssh locally or remote) is rdiff-backup
SME server professional support personnel use it
 
yum install smeserver-rdiff-backup --enablerepo=smecontribs

For more info re configuration & usage see
http://wiki.contribs.org/Rdiff-backup
& read the forum link provided
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: Arnaud on January 21, 2014, 09:29:33 PM
Hello world!   :D

yesterday afternoon was a good time for me:
- I reinstalled sme from the old CD + yum update --enablerepo=smecontribs
- The contribs have been reinstalled too
- I restored the data from the backup with affa --full-restore via nfs and the sme was OK again. What was the problem? Why did it take such a long time????

@ Janet: are you sure about the normal order of operations? First restore from backup, then install the contribs? Is it not a problem if parameters are set for things that are not present??

- Last night the RAID was rebuilt (I reinstalled with only 1 disk in order not to lose the data present on the second disk)
- This morning I reinstalled grub on the second disk. I will try to boot from each disk separately to check that grub is really present and correctly installed on both disks.
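
Before pulling a disk for such a boot test, it can help to confirm that the rebuild has really finished. A minimal sketch, assuming the usual /proc/mdstat layout in which a healthy two-disk RAID1 shows [UU] and an underscore marks a missing member; the function name is hypothetical:

```shell
#!/bin/sh
# Print the names of md arrays that look degraded in mdstat-style input.
# Reads from standard input so it can be tried on saved output as well.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/        { name = $1 }     # remember the current array
        /\[[U_]*_[U_]*\]/    { print name }    # status field contains "_"
    '
}

# On a live system:
# degraded_arrays < /proc/mdstat
```

No output means no array was reported as degraded. This does not check that grub is installed on each disk, which still needs the separate boot test.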

And of course I tried to restore the /home of the Ubuntu client from the sme via affa --full-restore, as I did at the time of the crash, but not from the same archive (explanation below): after 2 minutes the job was finished without any problem.

On this occasion, I noticed that about half of the /var/affa directory was missing (I made a backup of this directory with the rescue CD after the crash and before the reinstallation). Therefore, only the "daily" archives were available. At the time of the crash, I was restoring from the archive "scheduled.0".

Some comments:
- I still have no explanation for the missing files and directories in /usr/... (see previous posts): the backup job saves the /home of the client to /var/affa/backup_job on the sme. I could understand data in /var/affa/backup_job being destroyed if the job had run in the opposite direction to the one intended during the restore. But the /usr/... directory of the sme has nothing directly to do with the backup job, because it is not present in the parameters of the job. The same goes for the data of other backup jobs in /var/affa, which were destroyed by the crash.

- I don't know the exact reason for the problems that occurred 2 days ago with the installation from the latest V8.0 CD: I downloaded the iso and checked it with md5sum before burning. At the beginning of the installation, I let sme check the CD: OK! I guess that a read failure of the CD occurred during the installation.
In any case it makes me very angry to lose 1 complete day because of it (installation + contribs + restoring the data).

- The other important stress factor was that all my backups have been made and restored by affa V2 and I didn't know if the problem was a faulty disk, a faulty SME, a faulty archive or a faulty program!

- I was still online thanks to the affa --rise on my backup SME! This function is simply extraordinary! Without it I would have been in a complete black-out!

To do list:
- drink a fresh beer!  :pint:
- affa --undo-rise on the backup sme, hoping that none of the problems described in the wiki occur
- restart the backup system with affa V2 as a first step
- update/comment the French version of the wiki about affa
- think about a new backup concept for my installation with more redundancy (no longer only via affa (and sme??)) and with the possibility of being quickly online again in case of problems.


Salut
Arnaud

PS: Janet, I thank you for your support and all your indications!  :cool:
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: janet on January 22, 2014, 12:47:36 AM
Arnaud

Quote
I reinstalled sme from the old CD + yum update --enablerepo=smecontribs

Not really the best practice recommended procedure.
Run a standard yum update, but do not include contribs from smecontribs at this stage.
It is also wise to reinstall contribs one at a time to ensure they each work correctly before installing more contribs.
If you install say 5 or 6 contribs  at the same time (as well as regular updates), then it can be much more difficult to troubleshoot your server, in the event of errors or problems ie which contrib or update pkg to blame etc ???

Correct approach is
Install fresh OS
yum update
restore from backup

reinstall contribs one at a time using
yum install contribname --enablerepo=smecontribs
issue appropriate signal-event (as advised by contrib install instructions)
check contrib works & no server errors etc
repeat above 2 lines for each other contrib

The order of restore may vary if you restore from  USB on first reboot after installing the clean OS from CD, if so, then follow that with a general purpose
yum update

Quote
I restored the datas from the backup with affa --full-restore via nfs and the sme was ok again.

Quote
are you sure for the normal order of the operations? First restore from backup then install the contribs? Is it not a problem if parameters are set for not present things??

Refer to http://wiki.contribs.org/Backup_server_config#Backup_and_Restore_concepts.2C_issues_and_other_information
and if you still prefer not to believe me or that article, then search these forums on restore backup & posts by CharlieBrady, as he has mentioned this correct procedure order many many times over the years.

AFAIUI, the  contrib install process will use the existing restored configuration (conf) files instead of creating new default conf files. The restored db settings are invoked after appropriate signal events following contrib installation with yum or rpm command.

Quote
In any case it makes me very angry to lose 1 complete day because of it (installation + contribs + restoring the data).

Being angry does not help, computer equipment fails, that is a fact of life, you have to get in & fix it, that is a fact of life too.

Did you ever practice or rehearse your restore procedures on a test system, before assuming they would all work correctly when critically needed ?
Part of a good backup system is to prove that restores (ie restore procedures) will work correctly. Assuming they will work correctly is not good enough.

Quote
The other important stress factor was that all my backups have been made and recovered by affa V2 and I didn't know if the problem was a faulty disk, a faulty SME, a faulty archive or a faulty program.

Or a faulty operator, or a combination of some or all of those reasons.
Combine perhaps with not having fully tested restore procedures & proven to yourself they work correctly when critically needed.

Also have you run full surface scans on those hard disks using the drive manufacturer's diagnostic test software (google UBCD), and run the long smartctl tests ?
Also do memory tests on your server's memory, there are instructions here somewhere (FAQ maybe), & I think also on the install CD booted up in Rescue mode.
If not done, you should do so ASAP !
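
For the long smartctl tests, the commands could look roughly like the sketch below. It only prints the commands rather than running them, and the device names /dev/sda and /dev/sdb are assumptions about a two-disk server; check your own device names first.

```shell
#!/bin/sh
# Print (not run) the smartctl commands for a long self-test on each disk.
# The helper name is illustrative; run the printed commands by hand as root.
smart_long_test_plan() {
    for disk in "$@"; do
        echo "smartctl -t long $disk"       # start the extended self-test
        echo "smartctl -l selftest $disk"   # later: read the self-test log
    done
}

# Assumed device names for a two-disk RAID1
smart_long_test_plan /dev/sda /dev/sdb
```

The long test runs in the background on the drive and can take hours, so the self-test log is only meaningful once it has finished.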

Having multiple backups created using different methods is good backup policy.
Doing a test restore using each backup method & proving the system & data is fully restored, is also part of a good backup policy.

Read some other posts elsewhere in these forums, some people claim I speak my mind too freely, & then they verbally abuse me for it, not nice considering I thought I was being quite polite, so thank you for thanking me & appreciating my efforts.
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: stephdl on January 22, 2014, 12:49:07 AM
Quote
- update/comment the french version of the wiki about affa

If you have time, you can take a look at the French documentation I have translated; it probably needs two more eyes to correct the errors made by my 11th finger... http://wiki.contribs.org/SME_Server:Documentation/fr

thanks in advance
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: Arnaud on January 22, 2014, 06:52:58 PM
Good evening,

Install / reinstall procedure: Janet, your detailed explanations are clear.

Now I can only hope that my yum update --enablerepo=smecontribs will not cause trouble in the future... :oops:

I installed the contribs one at a time (as you wrote), with "signal-event post-upgrade; signal-event reboot" and the complete procedure each time. This is one of the reasons why so much time was necessary.

Quote
Refer to http://wiki.contribs.org/Backup_server_config#Backup_and_Restore_concepts.2C_issues_and_other_information
and if you still prefer not to believe me

I just wanted to be sure I had understood you correctly; I was not doubting your advice.
The link you gave is fantastic: everything I wanted to know about backups of sme :wink:
I knew this page, but I had never read this paragraph...

Quote
Did you ever practice or rehearse your restore procedures on a test system, before assuming they would all work correctly when critically needed ?
Part of a good backup system is to prove that restores (ie restore procedures) will work correctly. Assuming they will work correctly is not good enough.

I agree (in fact, it is not possible to disagree!), and in the past I have already had my own "experiences" with backup CDs that could not be read when needed!!

This is not the first time that I have had to restore the system, but it is the first time that I have had so much trouble doing it!

I made some "emergency trials" in the past, but basically trials either always run well or... become a real situation because of losing control!

I try to learn something from every bad experience and to think about strategies to prevent it happening again. In this case: if the only backup program I use causes trouble, I still don't know how to proceed! Therefore I have to learn that, or solve it with redundancy.

The previous crashes taught me:
- LVM causes problems with the rescue CD by automatically changing the md numbers --> I have used "nolvm" since then

- Backups stored on the running disks of an sme are in danger if the sme crashes (because of formatting during installation) --> mount a disk that is not part of the RAID for backups

- Check that the sme can start from both disks before it is needed

- Always have a free USB disk as big as the disks of the sme

- Repair and restore with only 1 disk: the data on the second one could still be needed. Rebuild the RAID only after everything is running well again

- Diversity is a good way to eliminate problems without solving them: my backups are available via ssh, nfs or a USB disk

For all possible tests, I have a virtual sme which is a clone of the production sme but without data (procedure as described in the wiki mentioned above)

 
For the disks I have Monitor-Disk-Health (http://wiki.contribs.org/Monitor_Disk_Health) installed and I'm patiently waiting for an email... (certainly not the best attitude).

And I must confess I have never done any RAM test...

 
Quote
I thought I was being quite polite

You certainly were! I have learned a lot during this pleasant discussion.  ;-)
Many teachers could/should take your explanations as an example...

Thanks & salut
Arnaud
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: janet on January 22, 2014, 11:16:12 PM
Arnaud

It sounds like you are following good procedure or attempting to as best as is practically possible.

I would still say that doing updates using the command
yum update --enablerepo=smecontribs
is not good practice.
It will update your server to latest released packages from CentOS & smeserver modified packages, but also update contribs you have installed from smecontribs.
So if there were say 4 contribs installed on your sme server, for which newer package versions had been released, then these 4 contribs would also be updated.
That is not a bad outcome generally speaking, but if there are any problems after the upgrade, then you have to look at regular updates as well as the new versions of contribs, for possible sources of the problem. So the more packages you update in one go, then the more places to look.

It is better practice to update installed contribs one at a time, & as a separate task to general system updates.
So instead, do:
yum update
signal-event post-upgrade
signal-event reboot

Then follow with
yum update contrib1packagename --enablerepo=smecontribs
signal-event post-upgrade
signal-event reboot

yum update contrib2packagename --enablerepo=smecontribs
signal-event post-upgrade
signal-event reboot

& so on.

It is slower to do it this way, but safer.
You do not need to update contribs too often, so it does not really add a big ongoing workload to update contribs the "slower but safer" way.
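
The sequence above can also be written out as a checklist generator. This is only a sketch: it prints the commands instead of running them (a reboot would end any running script anyway), the function name is made up, and the contrib names are the same placeholders used above.

```shell
#!/bin/sh
# Print the "slower but safer" update sequence: base system first,
# then each contrib on its own, with a post-upgrade and reboot after each.
update_plan() {
    echo "yum update"
    echo "signal-event post-upgrade"
    echo "signal-event reboot"
    for contrib in "$@"; do
        echo "yum update $contrib --enablerepo=smecontribs"
        echo "signal-event post-upgrade"
        echo "signal-event reboot"
    done
}

# Placeholder contrib names, as in the post above
update_plan contrib1packagename contrib2packagename
```

Run each printed step by hand and check the server for errors before moving on to the next contrib.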
Title: Re: sme8 crashed and no not restart // directory /var damaged?
Post by: Arnaud on January 28, 2014, 09:47:46 PM
Quote
... I think you can ask for the right to edit the wiki and then do it yourself...

Done!  :-P

Thank you for the warning on the top of the wiki  :wink:

@+
Arnaud