Koozali.org: home of the SME Server

Yikes: loose contact with UPS => instant shutdown

Offline judgej

« on: November 11, 2006, 05:31:06 PM »
I just had to travel into the office to restart a server that had just shut down.

Basically, it lost the USB connection, for whatever reason, on one of its polls. The reaction was to perform an instant shutdown. This seems a bit harsh. A default action that would make more sense to me would be to either:

a) e-mail the administrator, to inform them that the UPS may be playing up; or
b) put the server into standby mode, so that it can at least be restarted remotely (by faxing into it, or through an IP ping)

Any ideas how to go about configuring this? I'm using the hidups driver.

-- Jason

Offline judgej

« Reply #1 on: November 11, 2006, 05:39:26 PM »
The event log was something like this (and if the UPS was really on battery, it happened only for a brief moment):

Nov 11 15:08:53 sme upsmon[2598]: UPS UPS@localhost on battery
Nov 11 15:09:00 sme kernel: usb 1-1: USB disconnect, address 2
Nov 11 15:09:00 sme kernel: usb 1-1: new low speed USB device using address 3
Nov 11 15:09:01 sme kernel: hiddev96: USB HID v1.11 Device [Liebert Liebert PSA1000 FW:09] on usb-0000:00:07.2-1
Nov 11 15:09:01 sme hidups[2589]: read: Input/output error
Nov 11 15:09:02 sme upsd[2593]: UPS [UPS] disconnected - check driver
Nov 11 15:09:02 sme upsd[2593]: Data for UPS [UPS] is stale - check driver
Nov 11 15:09:02 sme upsd[2593]: Can't connect to UPS [UPS] (hidups-hiddev0): No such file or directory
Nov 11 15:09:03 sme upsmon[2598]: Poll UPS [UPS@localhost] failed - Driver not connected
Nov 11 15:09:03 sme upsmon[2598]: Communications with UPS UPS@localhost lost
Nov 11 15:09:08 sme upsmon[2598]: Poll UPS [UPS@localhost] failed - Driver not connected
Nov 11 15:09:18 sme last message repeated 2 times
Nov 11 15:09:18 sme upsmon[2598]: Giving up on the master for UPS [UPS@localhost]
Nov 11 15:09:18 sme upsmon[2598]: Executing automatic power-fail shutdown
Nov 11 15:09:18 sme wall[12872]: wall: user nut broadcasted 2 lines (43 chars)
Nov 11 15:09:18 sme upsmon[2598]: Auto logout and shutdown proceeding
Nov 11 15:09:18 sme wall[12875]: wall: user nut broadcasted 1 lines (37 chars)
Nov 11 15:09:23 sme upsd[2593]: Host 127.0.0.1 disconnected (read failure)
Nov 11 15:09:23 sme esmith::event[12877]: Processing event: halt
Nov 11 15:09:23 sme esmith::event[12877]: Running event handler: /etc/e-smith/events/halt/S70halt
Nov 11 15:09:23 sme shutdown: shutting down for system halt
Nov 11 15:09:23 sme init: Switching to runlevel: 0
Nov 11 15:09:23 sme FaxGetty[4001]: CAUGHT SIGNAL 15
Nov 11 15:09:23 sme FaxGetty[4001]: CLOSE /dev/ttyS0
Nov 11 15:09:23 sme esmith::event[12877]: S70halt=action|Event|halt|Action|S70halt|Start|1163257763 540757|End|1163257763 865294|Elapsed|0.324537
Nov 11 15:09:24 sme haldaemon: haldaemon -TERM succeeded
Nov 11 15:09:25 sme messagebus: messagebus -TERM succeeded
Nov 11 15:09:25 sme FaxQueuer[3871]: QUIT
Nov 11 15:09:25 sme hylafax: Shutting down HylaFAX queue manager (faxq):  succeeded
Nov 11 15:09:25 sme hylafax: hfaxd shutdown succeeded
Nov 11 15:09:25 sme atalk: papd shutdown succeeded
Nov 11 15:09:26 sme atalk:   Unregistering sme:Workstation: succeeded
Nov 11 15:09:26 sme atalk:   Unregistering sme:netatalk: succeeded
Nov 11 15:09:26 sme atalkd[3858]: done
Nov 11 15:09:26 sme atalk: atalkd shutdown succeeded
Nov 11 15:09:26 sme afpd[4090]: shutting down on signal 15
Nov 11 15:09:26 sme atalk: afpd shutdown succeeded
Nov 11 15:09:26 sme atalk: cnid_metad shutdown succeeded
Nov 11 15:09:26 sme acpid: acpid shutdown succeeded
Nov 11 15:09:26 sme crond: crond shutdown succeeded
Nov 11 15:09:27 sme ups: upsmon shutdown failed
Nov 11 15:09:27 sme upsd[2593]: Signal 15: exiting
Nov 11 15:09:27 sme ups: upsd shutdown succeeded
Nov 11 15:09:27 sme ups: hidups shutdown failed
Nov 11 15:09:27 sme irqbalance: irqbalance shutdown failed
Nov 11 15:09:27 sme kernel: Kernel logging (proc) stopped.
Nov 11 15:09:27 sme kernel: Kernel log daemon terminating.
Nov 11 15:09:28 sme syslog: klogd shutdown succeeded
Nov 11 15:09:28 sme exiting on signal 15

A momentary flicker in the mains and - bang - thirty seconds later the server needs me to press its little 'on' button.

Could this be a problem with SME not reconnecting the USB correctly, after it goes down for a fraction of a second?
-- Jason
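
The log above shows how short the gap is between upsmon losing contact and the shutdown starting. A minimal sketch for measuring that gap from a syslog extract might look like this. `ups_shutdown_delay` is a hypothetical helper, not part of NUT; it assumes the syslog timestamp layout seen above and that both events fall on the same day.

```shell
#!/bin/sh
# Sketch: print the seconds between upsmon losing contact with the UPS
# and upsmon starting the power-fail shutdown. Reads syslog lines on stdin.
# Assumes lines like "Nov 11 15:09:03 host upsmon[pid]: message".
ups_shutdown_delay() {
    awk '
        # Convert an HH:MM:SS string to seconds since midnight.
        function secs(t,   p) {
            split(t, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        # Remember the first "communications lost" message only.
        /upsmon\[[0-9]+\]: Communications with UPS .* lost/ && lost == "" {
            lost = secs($3)
        }
        # Report the gap at the first shutdown message after that.
        /upsmon\[[0-9]+\]: Executing automatic power-fail shutdown/ && lost != "" {
            print secs($3) - lost
            exit
        }
    '
}
```

Fed the excerpt above (for example with `ups_shutdown_delay < /var/log/messages`, where the log path is an assumption), it prints 15: contact was lost at 15:09:03 and the shutdown began at 15:09:18.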

Offline judgej

« Reply #2 on: November 11, 2006, 05:53:44 PM »
Reading the manual (http://www.die.net/doc/linux/man/man5/upsmon.conf.5.html), it seems that the two things happening in quick succession were significant:

- The UPS went onto battery
- The USB connection was lost

The server just assumes the UPS has remained on battery, and is starting to die, so an orderly (and quick) shutdown is required to be safe. What I think really happened was:

- The UPS went onto battery (a small power glitch)
- The USB connection was lost (perhaps related to the glitch)
- The UPS came off battery
- The server could not reestablish the USB connection to the UPS, so it decided to shut down

One thing though: nut should have sent me a couple of notify messages during this process, but I did not get any of them.
-- Jason
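
One possible reason for the missing notify messages (an assumption, not something confirmed in this thread) is that upsmon only runs NOTIFYCMD for notify types whose NOTIFYFLAG line includes EXEC. A rough sketch for checking that in a given upsmon.conf follows. `notify_exec_types` is a hypothetical helper, and the caller supplies the config path, since its location on SME is not stated here.

```shell
#!/bin/sh
# Sketch: list the notify types for which upsmon would run NOTIFYCMD.
# Usage: notify_exec_types /path/to/upsmon.conf
notify_exec_types() {
    conf=$1
    # Without a NOTIFYCMD line there is nothing for EXEC to run.
    if ! grep -q '^[[:space:]]*NOTIFYCMD[[:space:]]' "$conf"; then
        echo "no NOTIFYCMD set in $conf" >&2
        return 1
    fi
    # NOTIFYFLAG lines look like: NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
    # Commented-out lines are skipped because their first field is not NOTIFYFLAG.
    awk '$1 == "NOTIFYFLAG" && $3 ~ /(^|\+)EXEC($|\+)/ { print $2 }' "$conf"
}
```

If a type such as COMMBAD or NOCOMM is missing from the output, the notify script would not be called when communications are lost.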

Offline byte

« Reply #3 on: November 11, 2006, 05:57:09 PM »
Quote

Nov 11 15:09:00 sme kernel: usb 1-1: USB disconnect, address 2
Nov 11 15:09:00 sme kernel: usb 1-1: new low speed USB device using address 3


I wonder why it was using address 2 and then flicked to address 3. Maybe that's why the UPS shut down automatically?!


Quote from: "judgej"
Could this be a problem with SME not reconnecting the USB correctly, after it goes down for a fraction of a second?


Could be a problem with USB. I have a serial APC, and when we had a power cut last week at night for 30 minutes, the UPS kept our server running on battery and went back to mains when the power returned.

You should report this to the Bug Tracker
--[byte]--

Have you filed a Bug Report over @ http://bugs.contribs.org ? Please don't wait to be told. This way you help us to help you and others - Thanks!

Offline judgej

« Reply #4 on: November 11, 2006, 06:11:22 PM »
Quote from: "byte"
Quote

Nov 11 15:09:00 sme kernel: usb 1-1: USB disconnect, address 2
Nov 11 15:09:00 sme kernel: usb 1-1: new low speed USB device using address 3


I wonder why it was using address 2 and then flicked to address 3. Maybe that's why the UPS shut down automatically?!


I didn't spot that. Thanks!

I either have a gremlin swapping cables around when I'm not looking, or a faulty/dodgy USB port. Perhaps the BIOS took over and reallocated the IRQs or something (okay, I'm guessing now).

I'll follow this up with the nut project too, because it *ought* to be able to cope with that, as USB connections are designed for plugging into arbitrary sockets.

I'll try a different socket, maybe try a separate USB card.

BTW, it was the server that shut down, not the UPS. The UPS seemed to have carried on happily. I have successfully carried out power failure tests with this server/UPS combination. It keeps the server up for thirty minutes, and then shuts down gracefully.

-- JJ
-- Jason

Offline judgej

« Reply #5 on: November 22, 2006, 11:16:48 PM »
It happened again this evening:

UPS goes onto battery -> USB comms go down -> USB comms come back up -> USB reconnects under a different port -> NUT immediately assumes the server must be shut down

I think there are several faults here, all of which contribute to this problem:

- The UPS goes onto battery every now and then. Most UPSs do that, in order to handle sudden peaks or lows in the mains.

- The USB loses comms: I assume this to be a UPS fault. It is only down for a second or so.

- The USB comes back up on a different virtual port: it is the OS doing this. It probably does not realise the new device it has detected is the same one that just went bad.

- NUT does not see the new USB connection: possibly a fault with NUT? If the USB has just disappeared from view, then perhaps it needs to rescan the USB ports to find the UPS again, instead of just assuming it has gone for good. Maybe it does rescan, but does so too soon? Perhaps a longer delay somewhere would fix this.
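
The "longer delay" idea in the last point can be illustrated outside NUT with a small retry loop. This is only a sketch of the idea, not how NUT behaves internally; `wait_for_ups` is a hypothetical helper, and the check command is passed in by the caller.

```shell
#!/bin/sh
# Sketch: retry a UPS check several times before treating the UPS as lost.
# Usage: wait_for_ups ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_for_ups() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Stop retrying as soon as the check command succeeds.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Pause only when another attempt will follow.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `wait_for_ups 6 10 upsc UPS@localhost` would allow about a minute for the driver to reconnect before reporting failure. The UPS name is taken from the log earlier in the thread, and the attempt count and delay are arbitrary illustrative values.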


Anyway - since I had an external modem set up with Hylafax, I configured the serial port to wake the server. About 40 seconds after I call the fax line, the server has rebooted, saving a visit to the office. That's my handy hint for the day :-)

-- JJ
-- Jason

Offline byte

« Reply #6 on: November 22, 2006, 11:55:33 PM »
Quote from: "judgej"

- The UPS goes onto battery every now and then. Most UPSs do that, in order to handle sudden peaks or lows in the mains.


Also, the UPS goes onto battery to test the battery status (my APC does this).

Quote

- NUT does not see the new USB connection: possibly a fault with NUT? If the USB has just disappeared from view, then perhaps it needs to rescan the USB ports to find the UPS again, instead of just assuming it has gone for good. Maybe it does rescan, but does so too soon? Perhaps a longer delay somewhere would fix this.


I would report this to our bug tracker to see if it's something more underlying.
--[byte]--

Have you filed a Bug Report over @ http://bugs.contribs.org ? Please don't wait to be told. This way you help us to help you and others - Thanks!

Offline mike_mattos

« Reply #7 on: November 23, 2006, 11:01:07 PM »
Just a thought: in the server BIOS, how is the system set up, PLUG&PRAY or BIOS control for devices?

I've seen Windows create a new copy of a USB printer every time it was re-connected, so this may not be a Linux issue.
...

Offline judgej

« Reply #8 on: November 24, 2006, 10:53:11 AM »
Quote from: "mike_mattos"
Just a thought: in the server BIOS, how is the system set up, PLUG&PRAY or BIOS control for devices?

I've seen Windows create a new copy of a USB printer every time it was re-connected, so this may not be a Linux issue.


I'm not sure. I can check at lunchtime, when I can reboot the server. I assume it should be set to "Plug and Play"? Does that setting only apply to IRQs though, rather than quite high-level USB protocols?

-- JJ
-- Jason

spanna

« Reply #9 on: November 24, 2006, 11:43:01 AM »
Quote
I've seen Windows create a new copy of a USB printer every time it was re-connected, so this may not be a Linux issue.

Aye, I've seen this happen before. Was with an external hard disk - every time it was reconnected Windows felt the need to do the whole driver scan thing.

Didn't check that disk under Linux, I just got the hard drive out and binned the rest :)

USB has always been a bit wobbly under every operating system; it's down to the USB specification. It's fairly vague to begin with, and not every hardware vendor keeps to the specification.

It could also be a PnP or ACPI problem though - it looks to me like your UPS is being re-enumerated for whatever reason.

Does your UPS have a serial port at all? This might be a more reliable way of doing it...

Offline judgej

« Reply #10 on: November 24, 2006, 07:08:05 PM »
Quote from: "spanna"


It could also be a PnP or ACPI problem though - it looks to me like your UPS is being re-enumerated for whatever reason.

Does your UPS have a serial port at all? This might be a more reliable way of doing it...


"re-enumerated" - I'm with the lingo now ;-)

It does have a serial port, but it's a bit more difficult to set up. There are no working drivers (so far as I can find) for this model of UPS that read its status. I expect I would need to use the generic serial driver, and create my own rules based on the contact closure specs. It just seemed like a lot more hassle than USB (heh).

-- JJ
-- Jason

Offline mike_mattos

« Reply #11 on: November 27, 2006, 07:57:05 PM »
If the BIOS has options for enumeration control, I'd try the other one!
...

Offline Stefano

« Reply #12 on: November 30, 2006, 03:34:39 PM »
Hi all.

I'm in the same situation with 2 customers and... SME 6.0.1!

I was hoping that SME 7 would solve this problem, but it seems that it is not an SME problem.

BTW, I'm testing this solution:


more /sbin/e-smith/nutUPS.notify
#!/bin/sh
# UPS notify script: mail every notification to admin, and restart
# the hid module and nut when the UPS is reported as unavailable.

/bin/mail -s "$*" admin < /dev/null

# The notification message arrives as the first argument.
case "$1" in
*unavailable*)
    /sbin/service nut stop
    # Reload the hid module so that the UPS is detected again.
    /sbin/rmmod hid
    /sbin/modprobe hid
    /sbin/service nut start
    # Mail the current UPS status ("Ripristino nut" = "nut restored").
    /usr/bin/upsc localhost | /bin/mail -s "Ripristino nut" admin
    ;;
esac

HTH

ciao

Stefano
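
The script above decides what to do by searching the message text for "unavailable". An alternative sketch, assuming upsmon exports the NOTIFYTYPE environment variable to the notify command, is to branch on the notify type instead, so that a reworded message cannot break the match. `notify_action` is a hypothetical helper that only prints a label, and the mapping of types to actions is illustrative.

```shell
#!/bin/sh
# Sketch: choose an action from the upsmon notify type rather than
# from the wording of the notification message.
notify_action() {
    case "$1" in
        # Communication problems: candidates for restarting the driver.
        COMMBAD|NOCOMM) echo "restart-driver" ;;
        # Power events: just tell the administrator.
        ONBATT|LOWBATT) echo "mail-only" ;;
        # Everything else is left alone in this sketch.
        *)              echo "ignore" ;;
    esac
}
```

Inside a notify script this could be used as `action=$(notify_action "$NOTIFYTYPE")`, with the restart commands run only when the result is `restart-driver`.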

Offline judgej

« Reply #13 on: December 22, 2006, 04:55:06 PM »
Quote from: "nenonano"
Hi all.

I'm in the same situation with 2 customers and... SME 6.0.1!

I was hoping that SME 7 would solve this problem, but it seems that it is not an SME problem.

BTW, I'm testing this solution:


more /sbin/e-smith/nutUPS.notify
#!/bin/sh
# UPS notify script: mail every notification to admin, and restart
# the hid module and nut when the UPS is reported as unavailable.

/bin/mail -s "$*" admin < /dev/null

# The notification message arrives as the first argument.
case "$1" in
*unavailable*)
    /sbin/service nut stop
    # Reload the hid module so that the UPS is detected again.
    /sbin/rmmod hid
    /sbin/modprobe hid
    /sbin/service nut start
    # Mail the current UPS status ("Ripristino nut" = "nut restored").
    /usr/bin/upsc localhost | /bin/mail -s "Ripristino nut" admin
    ;;
esac


Stefano,

Have you had any luck with this? I've entered the code as you listed it, and am waiting for the USB to go down again.

Just in case you are interested, I use this at the top of the file to provide a little more detail in the e-mails:

Code:
/bin/mail -s "UPS '$UPSNAME': $NOTIFYTYPE" admin <<END
$*
END


-- Jason