Koozali.org: home of the SME Server
Obsolete Releases => SME Server 7.x => Topic started by: AJB on July 11, 2008, 01:45:51 PM
-
Hi all,
I'm having a problem with my Compaq ProLiant 1600 box running SME Server 7.3, fully updated. It keeps rebooting every now and then for no apparent reason. The interval between reboots varies from about 20 minutes to a couple of hours. As far as I know, the reboots do not coincide with any of the cron jobs. I have made no changes to the configuration, nor did I install, change or remove any contribs around the time this behaviour started. In short: I don't have the slightest idea where to search for a solution.
Of course I am more than happy to provide logs upon request, or other data that might help.
Thanks in advance for helping me figure this one out.
-
Frequent reboots are usually attributable to either RAM problems or overheating. Check your BIOS settings, the CPU and system temperatures, and the fan responses. If you have a CentOS live CD, boot from it and type memtest at the prompt to run a full RAM test. It seems more of a hardware than a software issue.
-
Thanks for the response. I'll check the bios and hardware as you suggested. One question, though: the box has ECC RAM, would that make a difference? I mean, even if error correction does not work properly, would one not expect some log entry to be written if a memory error occurs? I'm asking because it's a production server, and while it is not business critical, running a reliable memtest takes quite some time. If possible, I would therefore like to pinpoint the problem as exactly as possible before taking the box down altogether.
Thanks.
-
OK, I've checked the hardware for errors, mechanical or otherwise. All the fans that should be running are running, temps are normal, and I ran 4 consecutive passes of the HP Server Diagnostics disk; no errors. I'm running memtest right now for more extensive memory testing, but I am starting to get fairly certain that this is no hardware issue.
So, once again, I'm fresh out of options. Any help or suggestion is greatly appreciated.
-
If you have some spare RAM, you could replace the RAM in the dodgy server and see whether the server runs more stably. You can then also test the original RAM in another system that you do not need at the moment (all supposing you have spare parts and systems).
-
Don't have spare parts (as in: the same hardware). (If I did, I would have switched to other hardware by now ;). I do have a backup server running Affa, but that box is considerably less well spec'd, so I am a bit reluctant to make that my production server.)
But even so, it really doesn't seem to be hardware related. Still running memtest, still no errors. I am in GMT +1, so it's almost the end of the day (and week for that matter), which means I can keep running memtest for a little while longer. Still, I am leaning more and more towards a software issue.
-
I find a software issue hard to believe, as there should be clues pointing to it. If you cannot find entries in your log files at or around the time of the reboot, I tend to say it is not software related. I have very rarely, more like never, seen software issues cause frequent and unexpected reboots.
I am in GMT +1 as well and am already enjoying my weekend ;-)
-
Hmm, up until this morning I was imagining myself enjoying a well-earned beer by now, but apparently fate decided otherwise today... ;).
As far as the logs are concerned: it's not that I can positively say there aren't any clues, it's just that I have no idea where to look for them. I checked the messages log which showed nothing that I consider to be out of the ordinary. I am, however, no expert in log file analysis.
That being said, I totally understand why you are reluctant to believe that the problem is not hardware-related. That's also why I spent the afternoon testing it, as it would be the most likely source of the problem. When testing the hardware doesn't result in any errors, however, troubleshooting becomes quite difficult.
-
Time to report a bug? I checked and there doesn't appear to be anything like this.
-
Disable all USB ports on your motherboard via the BIOS, then watch the server for 2-3 hours. I have seen such behaviour with some USB hardware.
-
The server has no USB ports, and I disabled the USB support during bootup a long time ago: the thing would hang on shutdown with USB support enabled. (Don't ask, I spent several days to figure that one out last year, and as it is I know it works but still have no idea why :?.)
-
Stupid question time
Perhaps your AC or power supply is wonky. Is it plugged into the wall securely? In your BIOS, do you have (I forget the exact term) "Reboot on restore of AC" activated? You may want to turn this feature off as an experiment.
-
Hmm, I don't think the problem is with the power cables and/or outlets; a ProLiant has three PSUs and tolerates one of them failing. I will turn the BIOS setting off, though, for that would reveal any problems in the power supply further upstream. Thanks for the suggestion, and I'll keep you all posted.
-
Try swapping the LAN cards used for internal and external traffic.
I remember having such problems when, for some reason, my external LAN card hung and caused the server to lock up and reboot.
-
Hi,
Thanks for the suggestion. I'm starting to think, however, that the problem has somehow resolved itself. I brought the server back up last Friday after running all sorts of hardware tests, and it has been up ever since.
Not too sure if I'm entirely happy with this, as I am still at a loss as to what caused these reboots, but as long as it keeps running it is neither possible nor necessary to troubleshoot the issue.
I think we can call this one resolved for now. Thanks for all the responses, guys!
-
Is the system on a UPS backup?
If not then set the bios to not restart on a power failure.
Then if the power fails the server won't restart.
Then you know it was a power glitch.
I get very short power glitches here every day.
18 kVA UPS goes online at least once a day.
UPS log shows very short glitches, < 100 ms.
Power here, in a word, "sucks", thus the UPS to power everything.
The power company has been out here too many times and they are clueless: they say it's the wiring inside, they say it's tree branches touching the lines.
I have checked every connection in this place, and I monitor power at various points 24/7.
"Tree branches touching lines" is not likely to be the cause; those types of glitches last well in excess of 1000 ms (1 sec).
They usually take 10-15 sec for the reset to occur and are very rare, maybe one a year.
Everyone would be complaining, and they're not.
Less than 100 ms and nobody complains; it's just normal for them. Not for me: systems dump.
I say they're clueless.
One thing is for sure, the 18 kVA UPS fixes the problem they can't fix.
Doesn't take much of a glitch to dump the ProLiant... that I know; I have one, plus 2 HP Netservers.
Power-hungry monsters they are; anything longer than 10 ms and they dump.
3.6 A running, 0.35 A in standby.
They tend to be expensive to feed these days.
I run them in the winter to heat the place, dual purpose, server heaters...!!
3.6 A server heaters... almost break even dollar-wise.
-
What about checking the /var/log/ directory? I guess there will be some info there about why it reboots.
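A rough sketch of how that check could be automated in Python, assuming the classic syslog layout of /var/log/messages (month, day, time, host, message). The names `find_boots` and `parse_stamp`, the boot marker and the default year are my own choices for illustration, not anything shipped with SME Server. It pairs every real boot (a `syslogd ...: restart.` entry followed by the kernel's `Linux version` banner) with the last line logged before it, which is where a clue would be if there is one.

```python
import re
from datetime import datetime

# Classic syslog prefix, e.g. "Sep 8 07:03:38 anthem-mail kernel: ..."
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")
# syslogd logs this whenever it starts; log rotation triggers it too,
# so on its own it is not proof of a reboot.
SYSLOGD_START = re.compile(r"syslogd [\d.]+: restart")
# Assumption: the kernel banner only shows up after a real boot.
BOOT_MARKER = "kernel: Linux version"


def parse_stamp(line, year=2008):
    """Return the timestamp of a syslog line (syslog omits the year)."""
    m = STAMP.match(line)
    if m is None:
        return None
    month, day, clock = m.groups()
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def find_boots(lines):
    """Return (boot_time, last_line_before_boot) for every real boot."""
    boots = []
    start_idx = None  # index of the most recent syslogd start line
    for i, line in enumerate(lines):
        if SYSLOGD_START.search(line):
            start_idx = i
        elif BOOT_MARKER in line and start_idx is not None:
            before = lines[start_idx - 1] if start_idx > 0 else None
            boots.append((parse_stamp(lines[start_idx]), before))
            start_idx = None
    return boots
```

To try it, read /var/log/messages into a list of lines and pass that list to `find_boots`. If the last entry before a boot is ordinary traffic rather than a shutdown message, the box most likely went down without the OS knowing about it, which points towards power or hardware rather than software.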
-
Hi All,
I am really sorry for posting my query here, but I am facing the same frequent unexpected restarts (4-5 times a week) with SME 7.2 (server mode) on HP ProLiant ML 350 hardware, so I thought it would be better to post here than to start a new thread. Please help me sort out this issue as soon as possible, as it is a live server.
Please find the log file below:
Sep 8 01:12:04 anthem-mail syslogd 1.4.1: restart.
Sep 8 01:12:04 anthem-mail syslog: syslogd startup succeeded
Sep 8 01:12:04 anthem-mail esmith::event[15639]: Starting system logger: [ OK ]
Sep 8 01:12:04 anthem-mail kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 8 01:12:04 anthem-mail kernel: Inspecting /boot/System.map-2.6.9-67.0.7.ELsmp
Sep 8 01:12:04 anthem-mail syslog: klogd startup succeeded
Sep 8 01:12:04 anthem-mail esmith::event[15639]: Starting kernel logger: [ OK ]
Sep 8 01:12:04 anthem-mail esmith::event[15639]: adjusting supervised httpd-admin (sigusr1)
Sep 8 01:12:04 anthem-mail esmith::event[15639]: adjusting supervised httpd-admin (up)
Sep 8 01:12:04 anthem-mail esmith::event[15639]: adjusting supervised httpd-e-smith (sigusr1)
Sep 8 01:12:04 anthem-mail esmith::event[15639]: adjusting supervised httpd-e-smith (up)
Sep 8 01:12:04 anthem-mail esmith::event[15639]: adjust-services=action|Event|logrotate|Action|adjust-services|Start|1220816522 824220|End|1220816524 409932|Elapsed|1.585712
Sep 8 01:12:04 anthem-mail kernel: Loaded 24774 symbols from /boot/System.map-2.6.9-67.0.7.ELsmp.
Sep 8 01:12:04 anthem-mail kernel: Symbols match kernel version 2.6.9.
Sep 8 01:12:04 anthem-mail kernel: No module symbols loaded - kernel modules not enabled.
Sep 8 01:12:04 anthem-mail syslog: syslogd shutdown succeeded
Sep 8 07:03:38 anthem-mail syslogd 1.4.1: restart.
Sep 8 07:03:38 anthem-mail syslog: syslogd startup succeeded
Sep 8 07:03:38 anthem-mail kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 8 07:03:38 anthem-mail kernel: Inspecting /boot/System.map-2.6.9-67.0.7.ELsmp
Sep 8 07:03:38 anthem-mail syslog: klogd startup succeeded
Sep 8 07:03:38 anthem-mail kernel: Loaded 24774 symbols from /boot/System.map-2.6.9-67.0.7.ELsmp.
Sep 8 07:03:38 anthem-mail kernel: Symbols match kernel version 2.6.9.
Sep 8 07:03:38 anthem-mail kernel: No module symbols loaded - kernel modules not enabled.
Sep 8 07:03:38 anthem-mail kernel: Linux version 2.6.9-67.0.7.ELsmp (mockbuild@builder6.centos.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-9)) #1 SMP Sat Mar 15 06:54:55 EDT 2008
Sep 8 07:03:38 anthem-mail kernel: BIOS-provided physical RAM map:
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 0000000000100000 - 000000003ffc8000 (usable)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 000000003ffc8000 - 000000003ffd0000 (ACPI data)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 000000003ffd0000 - 0000000040000000 (reserved)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 00000000fec00000 - 00000000fed00000 (reserved)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
Sep 8 07:03:38 anthem-mail kernel: BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
Thanks in Advance
Avinash
-
> Sep 8 01:12:04 anthem-mail kernel: Loaded 24774 symbols from /boot/System.map-2.6.9-67.0.7.ELsmp.
> Sep 8 01:12:04 anthem-mail kernel: Symbols match kernel version 2.6.9.
> Sep 8 01:12:04 anthem-mail kernel: No module symbols loaded - kernel modules not enabled.
> Sep 8 01:12:04 anthem-mail syslog: syslogd shutdown succeeded
Are you perhaps using hardware that is not supported out of the box, or loading custom drivers or kernel modules?
-
Thanks for the reply. I just confirmed with my team that it is an out-of-the-box server running SME 7.2. It holds around 20 GB of data (I hope this is not a problem). I do not want it to crash due to the frequent restarts, because restoring the data takes a lot of time (I am just worried about the downtime of the live server). Please help me solve this issue as soon as possible.
Thanks & Regards,
avinash
-
Does it always hang at the same time? Do all the hangups look the same in the log files?
-
> Does it always hang at the same time?
Usually between 10 PM and 7 AM (4-5 times a week).
> Do all the hangups look the same in the log files?
Yes, all the hangups look the same in the log files.
-
manegar,
How many Ethernet cards do you have in your box? Can you describe the chipset of your Ethernet cards?
My SME is unhappy if I put in two identical Ethernet cards with a certain chip (Realtek gigabit Ethernet).
It locks up or restarts unexpectedly every two or three days.
thomas
-
> How many Ethernet cards do you have in your box?
Two cards (only one is used: eth0).
> Can you describe the chipset of your Ethernet cards?
Both are "Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)" NICs.
-
> Two cards (only one is used: eth0)
Since only one is used (eth0), try removing the other one.
If it's onboard, try disabling it in the BIOS to see whether it is the culprit.
-
> My SME is unhappy if I put in two identical Ethernet cards with a certain chip (Realtek gigabit Ethernet).
Are you sure that they are supported? I thought no gigabit NICs were supported in CentOS 4 out of the box, and therefore also not in SME Server 7.
-
> Two cards (only one is used: eth0)
> Both are "Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)" NICs
And you are sure they are supported with no additional drivers? I asked you earlier about out-of-the-box support because I already suspected something like this.
-
> Are you sure that they are supported? I thought no gigabit NICs were supported in CentOS 4 out of the box, and therefore also not in SME Server 7.
My server has been running for the past year with the above-mentioned NICs without any issues.
-
> Usually at 10 PM to 7AM (4-5 times a week)
> Yes, all the hangups look the same in the log files.
Is there a certain task that might be running at such a time? Backup? Check for updates?
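One way to answer that from the logs is to compare the hour of each reboot with the hours at which scheduled jobs run. A minimal sketch, assuming you already have the reboot times as `datetime` objects; `in_window` and `reboots_by_hour` are hypothetical helper names, and the 22:00 to 07:00 window is simply the range reported above.

```python
from collections import Counter
from datetime import datetime


def in_window(hour, start=22, end=7):
    """True if `hour` lies in [start, end), allowing a wrap past midnight."""
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end


def reboots_by_hour(stamps):
    """Count reboots per hour of day (0-23)."""
    return Counter(ts.hour for ts in stamps)


# Example: the boot recorded at 07:03:38 on Sep 8 in the posted log.
counts = reboots_by_hour([datetime(2008, 9, 8, 7, 3, 38)])
```

If the counts cluster around the hour of a scheduled job (backup, update check, log rotation), that job deserves a closer look. If they are spread out, or fall outside the window, a scheduled task becomes a less likely cause.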
-
> My server has been running for the past year with the above-mentioned NICs without any issues.
And you left it running without any actions like updates or configuration changes? I am trying to get as much information as possible, because at the moment you have not given us much to go on; you have only shown that your system seems to lock up at some point.
Be as precise as you can in describing your server history and in trying to pinpoint what you see and when.
-
Well, the server is updated. As far as configuration is concerned, I upgraded from 7.1 to 7.2 last month; other than that, nothing has been changed.
Please let me know if you need more info.
Thanks & Regards,
Avinash
-
Are you sure you updated from 7.1 to 7.2 and not from 7.1 to 7.3 or 7.2 to 7.3? The current version is 7.3.
If you upgraded from 7.1/7.2, are you sure you read and followed this: http://wiki.contribs.org/Updating_to_SME_7.2
Which repositories are enabled on your system? Are they in line with these instructions: http://wiki.contribs.org/SME_Server:Documentation:FAQ#Which_repositories_should_be_enabled
-
Dear C,
I followed each and every step before upgrading from SME 7.1 to 7.2, and the repositories are enabled as per the instructions. The server is also fully updated.
FYI
Hardware: HP ProLiant ML 350 G5
RAM: 2 GB
HDD: 146 GB single disk
Please let me know if you require any other info.
-
The server restarted again today; please see the log below.
Sep 10 11:12:30 anthem-mail slapd[4191]: conn=1101 op=0 BIND dn="" method=128
Sep 10 11:12:30 anthem-mail slapd[4191]: conn=1101 op=0 RESULT tag=97 err=0 text=
Sep 10 11:12:30 anthem-mail slapd[4191]: conn=1101 op=1 UNBIND
Sep 10 11:12:30 anthem-mail slapd[4191]: conn=1101 fd=7 closed
Sep 10 11:17:51 anthem-mail syslogd 1.4.1: restart.
Sep 10 11:17:52 anthem-mail syslog: syslogd startup succeeded
Sep 10 11:17:52 anthem-mail kernel: klogd 1.4.1, log source = /proc/kmsg started.
Sep 10 11:17:52 anthem-mail kernel: Inspecting /boot/System.map-2.6.9-67.0.7.ELsmp
Sep 10 11:17:52 anthem-mail syslog: klogd startup succeeded
Sep 10 11:17:52 anthem-mail kernel: Loaded 24774 symbols from /boot/System.map-2.6.9-67.0.7.ELsmp.
Sep 10 11:17:52 anthem-mail kernel: Symbols match kernel version 2.6.9.
Sep 10 11:17:52 anthem-mail kernel: No module symbols loaded - kernel modules not enabled.
Sep 10 11:17:52 anthem-mail kernel: Linux version 2.6.9-67.0.7.ELsmp (mockbuild@builder6.centos.org) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-9)) #1 SMP Sat Mar 15 06:54:55 EDT 2008
Sep 10 11:17:52 anthem-mail kernel: BIOS-provided physical RAM map:
Sep 10 11:17:52 anthem-mail kernel: BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
-
I think it is time to raise a bug and include as much information as you can about the server history, installed packages and contribs, in the hope that the devs can help you out.
-
You didn't mention whether you started SME with acpi=off. If you haven't already, try it.
HP Netserver and Compaq ProLiant servers in most cases won't even load Linux/SME with ACPI on.
HP and Compaq have a long history of proprietary hardware monitoring systems that aren't supported.
Hit enter at the SME startup splash screen and edit your selected kernel to include acpi=off.
1. Highlight the kernel you want to boot and press the ‘e’ key.
2. Highlight the line that begins with “kernel” (the one containing the root= setting) and press the ‘e’ key.
3. At the end of the line, add acpi=off, then press Enter and ‘b’ to boot.
4. Test the server.
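Once the server is up again, it is worth confirming that the option really reached the kernel. A minimal sketch in Python, assuming a Linux system where the boot parameters can be read from /proc/cmdline; `has_kernel_param` and `read_cmdline` are hypothetical helper names of mine.

```python
def has_kernel_param(cmdline, param="acpi=off"):
    """True if `param` appears as a separate word on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read().strip()
```

Calling `has_kernel_param(read_cmdline())` on the running server should return True after a boot with the edited entry. Keep in mind that an edit made at the boot menu only applies to that one boot; to keep it, the same option has to be added to the kernel line in the boot loader configuration.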
You might also want to check for the latest BIOS firmware for the ML350. Not that I think it would help with the ACPI issue, but it doesn't hurt.
You also didn't mention whether you have a UPS.