Koozali.org: home of the SME Server

Server Crash?

Cyrus Bharda

Server Crash?
« on: October 01, 2003, 07:40:52 AM »
Howdy everyone,

Well I suffered my very first SME crash, first one in a year, not too shabby.

Anyway, to try to find out what caused it, I had a look in my messages log and saw hundreds of these entries leading up to when I eventually hit the reset button, as the server was totally inaccessible from the network.

Oct  1 12:14:31 Tyr kernel: XD: Loaded as a module.
Oct  1 12:14:31 Tyr kernel: Trying to free nonexistent resource <00000320-00000323>
Oct  1 12:14:31 Tyr insmod: /lib/modules/2.4.18-5smp/kernel/drivers/block/xd.o: init_module: Operation not permitted
Oct  1 12:14:31 Tyr insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters
Oct  1 12:14:31 Tyr insmod: /lib/modules/2.4.18-5smp/kernel/drivers/block/xd.o: insmod block-major-13 failed
Oct  1 12:14:31 Tyr kernel: XD: Loaded as a module.

Now there were hundreds of them, so what do they mean?

Thanks,

Cyrus Bharda

Cyrus Bharda

Re: Server Crash?
« Reply #1 on: October 01, 2003, 07:41:20 AM »
Forgot to click the little "email replies" button.

Cyrus Bharda

Re: Server Crash?
« Reply #2 on: October 01, 2003, 09:30:01 AM »
OK, well, I just had another crash, which is quite concerning! Two in one day, after 2 weeks of being up and running fine? This time I had this in my messages log just before the crash:

Oct  1 13:52:14 Tyr kernel: Out of Memory: Killed process 1639 (httpd).
Oct  1 13:52:58 Tyr kernel: Out of Memory: Killed process 1640 (httpd).
Oct  1 13:53:07 Tyr kernel: Out of Memory: Killed process 1641 (httpd).
Oct  1 13:53:12 Tyr kernel: Out of Memory: Killed process 1642 (httpd).
Oct  1 13:53:19 Tyr kernel: Out of Memory: Killed process 1643 (httpd).
Oct  1 13:53:31 Tyr kernel: Out of Memory: Killed process 1644 (httpd).
Oct  1 13:53:50 Tyr kernel: Out of Memory: Killed process 1645 (httpd).
Oct  1 13:54:27 Tyr kernel: Out of Memory: Killed process 1646 (httpd).
Oct  1 13:54:38 Tyr kernel: Out of Memory: Killed process 3082 (httpd).
Oct  1 13:54:46 Tyr kernel: Out of Memory: Killed process 3170 (httpd).

If this continues then I will definitely look at going back to 5.5, as I never once had it crash on me, never!

Cyrus Bharda

Cyrus Bharda

Re: Server Crash?
« Reply #3 on: October 01, 2003, 09:34:17 AM »
Incidentally, there are a heap of these types of entries too. Can anyone make any sense of them?

Oct  1 13:52:10 Tyr kernel: IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:80:c8:d7:f6:1f:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=346 TOS=0x00 PREC=0x00 TTL=128 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=326

Thanks,

Cyrus Bharda

Ray Mitchell

Re: Server Crash?
« Reply #4 on: October 01, 2003, 10:35:25 AM »
Cyrus

> Oct  1 13:52:14 Tyr kernel: Out of Memory: Killed process .....

Very suggestive of not enough memory, don't you think?

Perhaps you and your users are making more use of the server than before and you are now running into memory limitations when lots of processes are running.
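
To see how often the kernel is killing processes and which ones it picks, the OOM lines can be tallied per process name. The following is only a rough sketch: the regular expression is written against the log lines quoted in this thread, and the function name `tally_oom_kills` is made up for illustration.

```python
import re
from collections import Counter

# Pattern written against the OOM-killer lines quoted in this thread.
OOM_RE = re.compile(r"Out of Memory: Killed process (\d+) \(([^)]+)\)")


def tally_oom_kills(lines):
    """Count OOM kills per process name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    "Oct  1 13:52:14 Tyr kernel: Out of Memory: Killed process 1639 (httpd).",
    "Oct  1 13:52:58 Tyr kernel: Out of Memory: Killed process 1640 (httpd).",
    "Oct  1 13:54:38 Tyr kernel: Out of Memory: Killed process 3082 (httpd).",
]

for name, count in tally_oom_kills(sample).items():
    print(name, count)
```

On a live server the same function could be fed the lines of the messages log instead of the sample list. If nearly all kills hit one service, that service is the first place to look for runaway memory use.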

> If this continues then I will definitely look at going back to
> 5.5 as I never once had it crash on me, never!

I had a similar thing happen when I updated to v5.6. I added more memory (i.e. from 128 to 256 MB) and had no more problems.

I think v5.6 likes more memory!!

Regards
Ray

Cyrus Bharda

Re: Server Crash?
« Reply #5 on: October 01, 2003, 10:38:26 AM »
Ray,

Yes, that's what I thought, but it has 512MB of ECC RAM, and looking at the sysmon stats the memory usage does not spike and everything looks alright. I can put the pics somewhere if that would help you put together a better picture of what is going on, but it's still friggin weird!

Cyrus Bharda

Ray Mitchell

Re: Server Crash?
« Reply #6 on: October 01, 2003, 10:44:42 AM »
Try running memtest (Charlie Brady's) or similar for a prolonged period (at least overnight) and look for any errors.

Ray

Cyrus Bharda

Re: Server Crash?
« Reply #7 on: October 01, 2003, 10:50:26 AM »
Ray,

Way ahead of you. I ran it and posted some weird results to the developers list by accident; I meant to send them only to Charlie, but sent them to both. Here is a copy:

Charlie,

Just wondering if you could shed some light on some results?

When I run memtester, before all the tests get run, the log fills up with
hundreds of these types of lines:

@400000003f7a5a4e2a552e5c memtest v. 2.93.1
@400000003f7a5a4e2a55556c (C) 2000 Charles Cazabon

@400000003f7a5a4e2a5568f4 Original v.1 (C) 1999 Simon Kirby

@400000003f7a5a4e2a558064
@400000003f7a5a4e2a5a04a4 Current limits:
@400000003f7a5a4e2a5a182c   RLIMIT_RSS  0xffffffff
@400000003f7a5a4e2a5a23e4   RLIMIT_VMEM 0xffffffff
@400000003f7a5a4e2a5a3384 Raising limits...
@400000003f7a5a4e2d612254 Unable to malloc 458227712 bytes.
@400000003f7a5a4e2d61b2dc Unable to malloc 457179136 bytes.
@400000003f7a5a4e2d61c664 Unable to malloc 456130560 bytes.
@400000003f7a5a4e2d624f1c Unable to malloc 455081984 bytes.
@400000003f7a5a4e2d6262a4 Unable to malloc 454033408 bytes.
@400000003f7a5a4e2d62f32c Unable to malloc 452984832 bytes.
@400000003f7a5a4e2d6302cc Unable to malloc 451936256 bytes.
@400000003f7a5a4e2d639354 Unable to malloc 450887680 bytes.
@400000003f7a5a4e2d63a2f4 Unable to malloc 449839104 bytes.
@400000003f7a5a4e2d64337c Unable to malloc 448790528 bytes.
@400000003f7a5a4e2d64c7ec Allocated 447741952 bytes...trying mlock...failed: insufficient resources.
@400000003f7a5a4e2d6a6954 Allocated 446689280 bytes...trying mlock...failed: insufficient resources.
@400000003f7a5a4e2d6f89d4 Allocated 445636608 bytes...trying mlock...failed: insufficient resources.
@400000003f7a5a4e2d7467ec Allocated 444583936 bytes...trying mlock...failed: insufficient resources.
@400000003f7a5a4e2d793e34 Allocated 443531264 bytes...trying mlock...failed: insufficient resources.
@400000003f7a5a4e2d7e33bc Allocated 442478592 bytes...trying mlock...failed: insufficient resources.
@400000003f7a5a4e2d832944 Allocated 441425920 bytes...trying mlock...failed: insufficient resources.
@400000003f7a5a4e2d87ff8c Allocated 440373248 bytes...trying mlock...failed: insufficient resources.

But after that all the tests run fine?

@400000003f7a5ef42440828c 1 runs completed.  0 errors detected.

So is there a problem or not? I read the readme.tests and it said:

There is also a test (Stuck Address) which is run first.  It determines if the
memory locations the program attempts to access are addressed properly or not.
If this test reports errors, there is almost certainly a problem somewhere in
the memory subsystem.  Results from the rest of the tests cannot be considered
accurate if this test fails:
  Stuck Address

But this test does not fail. Still, all those errors do not look good!

Help please!

Cyrus Bharda

Bertrand CHERRIER

Re: Server Crash?
« Reply #8 on: October 01, 2003, 11:02:51 AM »
Greetings,

I had exactly the same thing a few months ago...
I replaced the memory and everything has been perfect (so far), well, for the memory at least!

Good Luck

Cyrus Bharda wrote:
>
> OK well just had another crash, which is quite concerning! 2
> in one day, after 2 weeks of being up and running fine? This
> time I had this in my messages log just before the crash:
>
> Oct  1 13:52:14 Tyr kernel: Out of Memory: Killed process 1639 (httpd).
> Oct  1 13:52:58 Tyr kernel: Out of Memory: Killed process 1640 (httpd).
> Oct  1 13:53:07 Tyr kernel: Out of Memory: Killed process 1641 (httpd).
> Oct  1 13:53:12 Tyr kernel: Out of Memory: Killed process 1642 (httpd).
> Oct  1 13:53:19 Tyr kernel: Out of Memory: Killed process 1643 (httpd).
> Oct  1 13:53:31 Tyr kernel: Out of Memory: Killed process 1644 (httpd).
> Oct  1 13:53:50 Tyr kernel: Out of Memory: Killed process 1645 (httpd).
> Oct  1 13:54:27 Tyr kernel: Out of Memory: Killed process 1646 (httpd).
> Oct  1 13:54:38 Tyr kernel: Out of Memory: Killed process 3082 (httpd).
> Oct  1 13:54:46 Tyr kernel: Out of Memory: Killed process 3170 (httpd).
>
> If this continues then I will definitely look at going back to
> 5.5 as I never once had it crash on me, never!
>
> Cyrus Bharda

Cyrus Bharda

Re: Server Crash?
« Reply #9 on: October 01, 2003, 11:09:21 AM »
Bertrand,

Great. This is a kind of old board and I don't know what type of RAM to get for it. I have tried a number of different types of 168-pin RAM (PC66, PC100, PC133) in the 4 different slots and nothing worked, not even the spare ECC I had, so getting RAM for this will be a bugger!

Oh well, the hunt begins!

Cyrus Bharda

elspike

Re: Server Crash?
« Reply #10 on: October 01, 2003, 11:21:22 AM »
I had a similar problem after transporting my server to a new building.

I removed the RAM and reseated it (it must have moved during the relocation) and all is good.

cheers
elSpike out.

Ray Mitchell

Re: Server Crash?
« Reply #11 on: October 01, 2003, 11:30:21 AM »
Cyrus

> @400000003f7a5a4e2a5a04a4 Current limits:
> @400000003f7a5a4e2a5a182c   RLIMIT_RSS  0xffffffff
> @400000003f7a5a4e2a5a23e4   RLIMIT_VMEM 0xffffffff
> @400000003f7a5a4e2a5a3384 Raising limits...

I have not studied memtest in great depth, but I gather it is probing how much memory it can allocate and lock, so you get the earlier errors, which I gather are OK, as that is part of the test gradually stepping the request size down until an allocation succeeds.
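
The numbers in the posted log can be checked with a little arithmetic. The sketch below only does sums on the quoted log lines and makes no claim about how memtest works internally; the helper name `summarize_allocation` is invented for this example.

```python
import re

MALLOC_FAIL_RE = re.compile(r"Unable to malloc (\d+) bytes")
ALLOC_OK_RE = re.compile(r"Allocated (\d+) bytes")
MIB = 1024 * 1024


def summarize_allocation(lines):
    """Return (failed malloc sizes, successfully allocated sizes) in bytes."""
    failed, allocated = [], []
    for line in lines:
        fail = MALLOC_FAIL_RE.search(line)
        ok = ALLOC_OK_RE.search(line)
        if fail:
            failed.append(int(fail.group(1)))
        elif ok:
            allocated.append(int(ok.group(1)))
    return failed, allocated


# A few lines copied from the log posted earlier in the thread.
sample = [
    "@400000003f7a5a4e2d612254 Unable to malloc 458227712 bytes.",
    "@400000003f7a5a4e2d61b2dc Unable to malloc 457179136 bytes.",
    "@400000003f7a5a4e2d64337c Unable to malloc 448790528 bytes.",
    "@400000003f7a5a4e2d64c7ec Allocated 447741952 bytes...trying mlock...failed:",
]

failed, allocated = summarize_allocation(sample)
first_step = failed[0] - failed[1]
print(failed[0] // MIB, "MiB requested first")
print(first_step // MIB, "MiB between the first two attempts")
print(allocated[0] // MIB, "MiB in the first malloc that succeeded")
```

So the request starts at 437 MiB and drops 1 MiB at a time until malloc succeeds at 427 MiB, which fits the picture of a size probe rather than a memory fault.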

> @400000003f7a5ef42440828c 1 runs completed.  0 errors detected.

Did you run the test all night long?
You need more than 1 completed run to give your memory a good workout.
You will see a time count, e.g. 45500 seconds, at the end (approx 12.6 hours).

If all the tests show passed then that's OK,
i.e. Test 1, Test 2, Test 3 etc.

As for your next post about the type of memory for your motherboard, I think therein lies the problem. You MUST use memory that is supported by your m/b, and it must be the right speed etc.

I gather v5.6 is fussier about memory, and the errors you are seeing generally in your logs are probably just showing up the issues with incorrect memory type (or even slightly flawed or faulty memory).

I would get onto the Internet and track down the motherboard manufacturer's site to determine the correct memory type and speed.

That's a dangerous server you have!!!

Regs
Ray

Graham

Re: Server Crash?
« Reply #12 on: October 01, 2003, 11:41:10 AM »
Here in NW UK you can buy a mobo for £45,
256MB RAM for £28 & an AMD Duron 1.3 for £25

Save yourself all the hassle and upgrade !

(Especially if it's for a customer)

The performance will be better too :)

(Computer fact of life :- It'll be half the price in 9 months
OR the price will be about the same and the spec doubled :)

Cyrus Bharda

Re: Server Crash?
« Reply #13 on: October 01, 2003, 11:51:38 AM »
Everyone,

Well, I am going to leave memtest running overnight tonight, and hopefully the server does not crash overnight, but here's what I am going to do tomorrow:

1. Remove the RAM and reseat it, just to make sure it is in properly.
2. It is an Intel server board, so tracking down info on it shouldn't be too hard. Find out what specs of RAM it needs and source some.
3. If RAM is too expensive, look at buying a cheap box with mobo, CPU and RAM. That will still be hard to explain to the boss; all they think about is the fact that we have to spend more money, not what it will provide us with!

Anyway, thanks to all of you for your help, muchly appreciated, and hopefully this RAM can hold out until I can do something about it!

Cyrus Bharda

Jaco Bongers

Re: Server Crash?
« Reply #14 on: October 01, 2003, 04:28:38 PM »
Cyrus Bharda wrote:
>
> Everyone,
>
> Well I am going to leave memtest running overnight tonight,
> and hopefully it does not crash overnight, but here's what I
> am going to do tomorrow:
>
> 1. Remove the ram and reseat it, just to make sure it is in
> properly.
> 2. It is an Intel server board so tracking info on it shouldn't
> be too hard, and track down what specs of ram it needs and
> source some.
> 3. If ram is too expensive, look at buying a cheap box with
> mobo cpu and ram, still will be hard to explain to boss, all
> they think about is the fact that we have to spend more
> money, not look at what it will provide us with!
>
I had a similar problem on a Red Hat box. It turned out that the fan on the CPU no longer worked as expected. Summer is coming in the Southern Hemisphere :)  A new fan on the CPU seems to have fixed the memory errors for me.

Jaco

Daniel Oliver

Re: Server Crash?
« Reply #15 on: October 02, 2003, 02:20:43 AM »
The network packet logs are DHCP requests being broadcast from another machine.  Is there a rogue machine on your network pounding Apache, DHCP and possibly other services?
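
To identify the sender, the MAC field of the firewall log line is the useful part: for Ethernet, the netfilter LOG output packs the destination MAC (6 bytes), the source MAC (6 bytes) and the EtherType (2 bytes) into one colon-separated string. Here is a minimal sketch that splits it up and flags DHCP client traffic (UDP from port 68 to port 67); the function name `describe_packet` is just an illustration.

```python
import re

# key=value tokens as they appear in the firewall log line; values may be empty.
FIELD_RE = re.compile(r"(\w+)=(\S*)")


def describe_packet(line):
    """Pull the MAC addresses and a DHCP flag out of one firewall log line."""
    fields = dict(FIELD_RE.findall(line))
    octets = fields["MAC"].split(":")
    return {
        "dst_mac": ":".join(octets[0:6]),
        "src_mac": ":".join(octets[6:12]),
        "ethertype": "".join(octets[12:14]),
        # DHCP clients send from UDP port 68 to UDP port 67.
        "is_dhcp_client_request": (
            fields.get("PROTO") == "UDP"
            and fields.get("SPT") == "68"
            and fields.get("DPT") == "67"
        ),
    }


sample = (
    "Oct  1 13:52:10 Tyr kernel: IN=eth0 OUT= "
    "MAC=ff:ff:ff:ff:ff:ff:00:80:c8:d7:f6:1f:08:00 SRC=0.0.0.0 "
    "DST=255.255.255.255 LEN=346 TOS=0x00 PREC=0x00 TTL=128 ID=0 "
    "PROTO=UDP SPT=68 DPT=67 LEN=326"
)

info = describe_packet(sample)
print(info["src_mac"], info["is_dhcp_client_request"])
```

For the line posted earlier, the source MAC comes out as 00:80:c8:d7:f6:1f, which can then be matched against the network cards on the LAN. A single machine asking for an address is normal; it is only a concern if the requests arrive in a flood.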

Cyrus Bharda

Re: Server Crash?
« Reply #16 on: October 02, 2003, 05:57:57 AM »
Daniel,

How would I find out if there was a computer doing this? Our SME does not provide DHCP to the network, only acts as an internet gateway/firewall and email server.

Ray,

Here's the results from running memtester overnight:

37 runs completed.  0 errors detected.  Total runtime:  71201 seconds.

So I guess that it's working fine... I might try reseating it and sourcing some new RAM for it just in case, though.
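
For a sense of scale, the summary line can be turned into hours and minutes per run. This is a small sketch: the regular expression is written against the summary line quoted above, and the helper name `summarize_run` is made up.

```python
import re

SUMMARY_RE = re.compile(
    r"(\d+) runs completed\.\s+(\d+) errors detected\.\s+"
    r"Total runtime:\s+(\d+) seconds"
)


def summarize_run(line):
    """Convert a memtester summary line into runs, errors and durations."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a memtester summary line")
    runs, errors, seconds = (int(group) for group in match.groups())
    return {
        "runs": runs,
        "errors": errors,
        "hours": seconds / 3600,
        "minutes_per_run": seconds / runs / 60,
    }


result = summarize_run(
    "37 runs completed.  0 errors detected.  Total runtime:  71201 seconds."
)
print(f"{result['hours']:.1f} hours, {result['minutes_per_run']:.1f} min per run")
```

That works out to just under 20 hours in total, or roughly half an hour per pass over the memory.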

Thanks to all for the help, muchly appreciated!

Cyrus Bharda

Jon Blakely

Re: Server Crash?
« Reply #17 on: October 02, 2003, 01:24:52 PM »
Cyrus,

Module xd.o as mentioned in the messages log in your first post is the disk controller module.

I can create the exact same logs if I do a sfdisk -l.

Read http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=52013

Jon

cc_skavenger

Server Crash?
« Reply #18 on: August 26, 2004, 08:27:25 AM »
Cyrus Bharda,
Did you ever find out what was causing the crash?
Did you use the lat tools to set quotas on users or did you manually set quotas?  I am having the same problem and it seems to be related to quotas.

I used the lat tools to globally set quotas.  I used the command:
lat-quota -c "* |9M |10M"

After this, the server crashes continually on useradd or when copying in MC.

Any thoughts?

Michiel

Server Crash?
« Reply #19 on: September 08, 2004, 02:46:35 PM »
Hi Skavenger,

Quote
I used the lat tools to globally set quotas. I used the command:
lat-quota -c "* |9M |10M"

After this, server crashes continually on useradd or copying in MC.


I only just came across the above posting. Did you work out what was causing your problem? I tried the lat-quota command with the same syntax on several machines and it always worked as expected.

Did your server really crash (i.e. reboot)?

BTW, you should NEVER use useradd (or userdel) on an SME machine. It will break all kinds of user settings.

regards,
Michiel