Koozali.org: home of the SME Server

Failure to reboot after power failure - NUT enabled

Offline wjhobbs

« on: February 10, 2007, 06:52:40 PM »
I would like to know why a system failed to reboot when power was restored after a power failure. The facts are these:

-  The system is a ‘HP ProLiant ML110 G3’. The BIOS has been set to boot on power restore after power failure.

-  NUT is installed and enabled on the SME 7.1 machine. The UPS unit is an ‘APC Back-UPS XS 1200’.

-  The logs indicate that a power failure occurred at 11:20pm on Feb. 9, when the UPS unit reported that it went on battery. About an hour later, at 00:21 on Feb. 10, a ‘battery low’ condition was reported and NUT initiated a shutdown.

-  From other sources we know that about two hours later (approx. 02:30am Feb. 10) power was restored.

-  However, the SME Server did not boot.

-  Several hours later the condition was noticed and someone went to the site and powered up the server manually, at 10:59am on Feb. 10. This was more than 8 hours after power had been restored. Less than a minute into the logs, communication with the UPS unit was re-established. The UPS batteries were still being charged.
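For reference, the gaps between the events listed above can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the year 2007 is taken from the thread date, and the power-restored time is the approximate figure quoted above, not a logged value.

```python
from datetime import datetime

# Timestamps taken from the post; the year comes from the thread date.
# The power-restored time is approximate ("approx. 02:30am").
events = [
    ("UPS on battery", datetime(2007, 2, 9, 23, 20)),
    ("battery low, NUT shutdown", datetime(2007, 2, 10, 0, 21)),
    ("power restored (approx.)", datetime(2007, 2, 10, 2, 30)),
    ("manual power-up", datetime(2007, 2, 10, 10, 59)),
]

def gaps(evts):
    """Return (from, to, timedelta) for each consecutive pair of events."""
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(evts, evts[1:])]

for start, end, delta in gaps(events):
    minutes = int(delta.total_seconds() // 60)
    print(f"{start} -> {end}: {minutes // 60} h {minutes % 60:02d} min")
```

With these figures the server ran on battery for about 1 h 01 min and sat powered off for roughly 8 h 29 min after mains returned.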

My understanding was that after power was restored the UPS unit would take a few minutes to establish a basic battery charge level and then turn power on to the protected outlet(s), at which point the server should begin a boot process. Clearly these outlets were powered on or the manual boot would not have succeeded.

My question is why the automatic boot was not initiated when power was restored to the machine.

Any suggestions for fixing this would be much appreciated.

Thanks.

John
...

Offline pfloor

« Reply #1 on: February 11, 2007, 06:18:36 AM »
You have experienced a situation that will cause this problem.

1-Power goes out.
2-Battery gets low.
3-Server auto shuts down.
4-UPS still has battery power, never goes completely dead and never shuts down.
5-Power comes back on.
6-Server never sees a power off-on cycle (i.e. no power failure, because the battery never actually went dead), so the server didn't restart.
In life, you must either "Push, Pull or Get out of the way!"

Offline wjhobbs

« Reply #2 on: February 11, 2007, 03:33:09 PM »
Thanks, Paul.

I guess this is the one situation where NUT protects the equipment at the expense of short term availability.

At least I have some assurance that my settings are probably OK.

John
...

Offline pfloor

« Reply #3 on: February 11, 2007, 06:54:14 PM »
John, I have done some further investigating and found I may be incorrect with my previous answer.  NUT is supposed to do what you want.

A-Power goes out.
B-Battery gets low.
C-NUT issues a shutdown.
D-During shutdown, NUT issues a command to the UPS to turn itself off after the shutdown.
E-Mains come back on.
F-After the batteries have charged themselves back up, the UPS turns the power back on to the server.
G-The server reboots.
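
The difference between this sequence and the one in Reply #1 comes down to whether the server ever sees its outlet go off and come back on. Here is a toy Python model of that idea. It is a sketch only: the function, the policy names and the sampled outlet states are made up for illustration and are not real BIOS or NUT settings, and it assumes the halt left the machine powered off.

```python
def server_boots(ups_output, bios_policy="always_on"):
    """Toy model: decide whether the server powers up by itself.

    ups_output  -- sequence of booleans, the UPS outlet state over time
                   (True = outlet powered), sampled after the OS has halted.
    bios_policy -- "always_on": boot whenever AC returns after a loss;
                   "last_state": return to the state before the loss.
    Both names are illustrative, not real BIOS or NUT settings.
    """
    was_running = False   # assumption: the clean halt left the box off
    prev = True           # the outlet was live at the moment of the halt
    for powered in ups_output:
        ac_returned = powered and not prev   # off -> on edge at the outlet
        if ac_returned:
            if bios_policy == "always_on":
                return True
            if bios_policy == "last_state" and was_running:
                return True
        prev = powered
    return False

# Reply #1 scenario: the UPS never cuts its outlet, so there is no edge.
no_cycle = [True, True, True, True]
# Steps D-F: the UPS switches off after the halt, then back on later.
with_cycle = [True, False, False, True]

print(server_boots(no_cycle))                   # False
print(server_boots(with_cycle))                 # True
print(server_boots(with_cycle, "last_state"))   # False
```

In this model the server only restarts when the UPS really cycles its outlet and the BIOS is set to power up on AC return, which is why both items are worth checking.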

Common reasons why it may not have worked:

1-Your motherboard/BIOS may not be configured properly.  Are you sure you have it set to "power up after power outage" and NOT "return to previous state"?

2-The UPS doesn't support it, e.g. it uses a *dumb* cable or does not know how to kill itself.

3-Some UPSs get themselves into a "race" condition.  If this is the case then you need to do some extra configuration.

4-NUT is not configured/working properly and you may have found a bug.

Because of the large number of UPSs out there, it is impossible to configure NUT for them all.  The devs used the most common configuration and it may not be correct for your UPS.
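
One thing that can be checked is whether step D is configured at all. In stock NUT, upsmon creates the file named by the POWERDOWNFLAG directive in upsmon.conf just before shutting down, and the shutdown scripts test for that file before running "upsdrvctl shutdown". The Python sketch below only looks up a directive in the text of a config file. The sample fragment is hypothetical (the password and SHUTDOWNCMD are invented, and /etc/killpower is just the path used in the NUT documentation), so the real file on an SME Server may look different.

```python
def find_directive(conf_text, name):
    """Return the value of the first uncommented `name` directive, or None.

    Simplification: everything after a '#' is treated as a comment, so a
    quoted value that contains '#' would be cut short.
    """
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        if parts[0] == name:
            return parts[1].strip() if len(parts) > 1 else ""
    return None

# Hypothetical sample; the real file on an SME Server may differ.
sample = """
# upsmon.conf (illustrative fragment)
MONITOR UPS@localhost 1 upsmaster secret master
SHUTDOWNCMD "/sbin/shutdown -h +0"
POWERDOWNFLAG /etc/killpower
"""

print(find_directive(sample, "POWERDOWNFLAG"))   # /etc/killpower
print(find_directive(sample, "NOTIFYCMD"))       # None
```

If the directive is missing or commented out, the UPS would probably never be told to switch itself off, which would match the behaviour described in Reply #1.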

Please go to http://www.networkupstools.org and read the FAQ and documentation.  About 2/3 of the way down the FAQ page it talks about your problem.

If you think it is NUT or the way it is configured, please raise a bug in the bug tracker.

Offline wjhobbs

« Reply #4 on: February 11, 2007, 07:52:48 PM »
Quote from: "pfloor"

2-The UPS doesn't support it, e.g. it uses a *dumb* cable or does not know how to kill itself.

If you look at this snip from the log:
Feb 10 00:21:12 server2 upsmon[3002]: UPS UPS@localhost battery is low
Feb 10 00:21:12 server2 upsd[2995]: Client upsmaster@127.0.0.1 set FSD on UPS [UPS]
Feb 10 00:21:12 server2 upsmon[3002]: Executing automatic power-fail shutdown
Feb 10 00:21:12 server2 wall[24546]: wall: user nut broadcasted 1 lines (34 chars)
Feb 10 00:21:12 server2 wall[24547]: wall: user nut broadcasted 2 lines (43 chars)
Feb 10 00:21:12 server2 upsmon[3002]: Auto logout and shutdown proceeding
Feb 10 00:21:12 server2 wall[24551]: wall: user nut broadcasted 1 lines (37 chars)
Feb 10 00:21:17 server2 upsd[2995]: Host 127.0.0.1 disconnected (read failure)
Feb 10 00:21:17 server2 esmith::event[24553]: Processing event: halt  
Feb 10 00:21:18 server2 esmith::event[24553]: Running event handler: /etc/e-smith/events/halt/S70halt
Feb 10 00:21:18 server2 shutdown: shutting down for system halt
Feb 10 00:21:18 server2 init: Switching to runlevel: 0
Feb 10 00:21:18 server2 esmith::event[24553]: S70halt=action|Event|halt|Action|S70halt|Start|1171084877 989809|End|1171084878 755205|Elapsed|0.765396
Feb 10 00:21:19 server2 haldaemon: haldaemon -TERM succeeded
Feb 10 00:21:19 server2 messagebus: messagebus -TERM succeeded
Feb 10 00:21:20 server2 atalk: papd shutdown succeeded
Feb 10 00:21:20 server2 atalk:   Unregistering server2:Workstation: succeeded
Feb 10 00:21:20 server2 atalk:   Unregistering server2:netatalk: succeeded
Feb 10 00:21:20 server2 atalkd[4189]: done
Feb 10 00:21:20 server2 atalk: atalkd shutdown succeeded
Feb 10 00:21:20 server2 afpd[4587]: shutting down on signal 15
Feb 10 00:21:20 server2 atalk: afpd shutdown succeeded
Feb 10 00:21:20 server2 atalk: cnid_metad shutdown succeeded
Feb 10 00:21:21 server2 acpid: acpid shutdown succeeded
Feb 10 00:21:21 server2 crond: crond shutdown succeeded
Feb 10 00:21:21 server2 ups: upsmon shutdown failed
Feb 10 00:21:21 server2 upsd[2995]: Signal 15: exiting
Feb 10 00:21:21 server2 ups: upsd shutdown succeeded
Feb 10 00:21:21 server2 newhidups[2991]: Signal 15: exiting
Feb 10 00:21:21 server2 irqbalance: irqbalance shutdown succeeded
Feb 10 00:21:21 server2 kernel: Kernel logging (proc) stopped.
Feb 10 00:21:21 server2 kernel: Kernel log daemon terminating.
Feb 10 00:21:22 server2 syslog: klogd shutdown succeeded
Feb 10 00:21:22 server2 exiting on signal 15

You will see a message (9 lines up from the bottom): "ups: upsmon shutdown failed". Would that indicate that the UPS unit did not shut itself down, or is it something else?
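
To make the log easier to read, the lines can be split by the program tag that wrote them. The Python sketch below uses a handful of lines copied from the excerpt above; the regular expression is my own guess at the syslog layout, not anything taken from NUT. The lines tagged "ups:" follow the same "... shutdown succeeded/failed" pattern as the crond and acpid lines, which suggests they report on stopping a service process rather than on the UPS hardware, but that is an inference from the log and not a confirmed answer.

```python
import re

# A few lines copied from the log excerpt above.
LOG = """\
Feb 10 00:21:21 server2 crond: crond shutdown succeeded
Feb 10 00:21:21 server2 ups: upsmon shutdown failed
Feb 10 00:21:21 server2 upsd[2995]: Signal 15: exiting
Feb 10 00:21:21 server2 ups: upsd shutdown succeeded
Feb 10 00:21:21 server2 newhidups[2991]: Signal 15: exiting
"""

# Assumed layout: "<Mon> <day> <hh:mm:ss> <host> <tag>[<pid>]: <message>"
LINE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) "
    r"(?P<tag>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)

def parse(text):
    """Yield one dict per log line that matches the assumed layout."""
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            yield m.groupdict()

def by_tag(text, tag):
    """Return the messages logged under a given program tag."""
    return [e["msg"] for e in parse(text) if e["tag"] == tag]

print(by_tag(LOG, "ups"))        # ['upsmon shutdown failed', 'upsd shutdown succeeded']
print(by_tag(LOG, "newhidups"))  # ['Signal 15: exiting']
```

Nothing in the excerpt shows a message from the driver (newhidups) about commanding the UPS to power off, so the log as posted does not settle whether step D ever ran.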

John
...