Koozali.org: home of the SME Server

Obsolete Releases => SME Server 7.x => Topic started by: judgej on August 19, 2007, 02:26:15 PM

Title: Kernel errors and crashing server
Post by: judgej on August 19, 2007, 02:26:15 PM
My server has suddenly started crashing and producing errors such as these:

Aug 19 12:59:09 sme kernel:  [<e0889c72>] ext3_write_inode+0x22/0x3f [ext3]
Aug 19 12:59:09 sme kernel:  [write_inode+48/55] write_inode+0x30/0x37
Aug 19 12:59:09 sme kernel:  [<c0177b56>] write_inode+0x30/0x37
Aug 19 12:59:09 sme kernel:  [__sync_single_inode+112/443] __sync_single_inode+0x70/0x1bb
Aug 19 12:59:09 sme kernel:  [<c0177bcd>] __sync_single_inode+0x70/0x1bb
Aug 19 12:59:09 sme kernel:  [sync_sb_inodes+423/628] sync_sb_inodes+0x1a7/0x274
Aug 19 12:59:09 sme kernel:  [<c0177f79>] sync_sb_inodes+0x1a7/0x274
Aug 19 12:59:09 sme kernel:  [writeback_inodes+145/222] writeback_inodes+0x91/0xde
Aug 19 12:59:09 sme kernel:  [<c01780d7>] writeback_inodes+0x91/0xde
Aug 19 12:59:09 sme kernel:  [balance_dirty_pages+124/284] balance_dirty_pages+0x7c/0x11c
Aug 19 12:59:09 sme kernel:  [<c01451b8>] balance_dirty_pages+0x7c/0x11c
Aug 19 12:59:09 sme kernel:  [<e0887f7d>] ext3_ordered_commit_write+0xb6/0xc5 [ext3]

My guess is that this is a hardware fault, but any ideas where it is likely to be? Hard disks or (motherboard) controller?

I have two 320G disks in an active array (no errors reported there), each a master drive on a separate IDE channel, 512M RAM, a 1GHz Athlon (perhaps underpowered now) and an aging motherboard. RAM has tested out okay, but I have not yet done a low-level test of the hard drives. I just don't know what those errors mean, but I do know they have started appearing in the last few days, and the server has been crashing: processes suddenly stop, and a hard reset is the only way to get out of it.

-- Jason
Title: Re: Kernel errors and crashing server
Post by: judgej on August 19, 2007, 06:44:09 PM
Are those messages actually just page faults? i.e. does the SME Server kernel have some kind of debug turned on, giving me scary-looking messages for a pretty ordinary event (i.e. the server needing a bit more memory than it has got, perhaps to service a big bunch of spam messages that have just come in)? I am getting around 16,000 spam messages over any 7 day period at the moment, about 99% caught by the server.

-- JJ

Edit: server crashed again - four times in 48 hours. I've placed an order for a new Dell SC440. Just can't be doing with this.
Title: Re: Kernel errors and crashing server
Post by: william_syd on August 20, 2007, 03:14:02 PM
Did you check the drive with SMART?

smartctl -t short /dev/hda

smartctl -l selftest /dev/hda

smartctl -a /dev/hda

Perhaps MasterSleepy can help you out. I saw a similar post from him on a French board.
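
For anyone following along with two disks, as in this thread (hda and hdc), here is a rough sketch of how the attribute table from smartctl -A could be boiled down to the few counters that usually matter for a disk-versus-controller question. The choice of attributes and the smart_summary name are my own assumptions, not part of smartmontools, and attribute names can differ between drive models, so treat it as a starting point only.

```shell
#!/bin/sh
# Sketch only: filter `smartctl -A` output (read from stdin) down to a few
# attributes. The attribute list is an assumption; adjust it for your drives.
# In the standard attribute table, column 2 is the attribute name and
# column 10 is the raw value.
smart_summary() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
        print $2 "=" $10
    }'
}

# Usage (as root), for the two disks mentioned in this thread:
#   for d in /dev/hda /dev/hdc; do
#       echo "== $d"
#       smartctl -A "$d" | smart_summary
#   done
```

As a rule of thumb, a rising UDMA_CRC_Error_Count tends to point at the cable or controller, while reallocated or pending sectors tend to point at the disk itself.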


Title: Re: Kernel errors and crashing server
Post by: judgej on August 20, 2007, 11:46:34 PM
Did you check the drive with SMART?

Nice tip - thanks :-)

Over 5500 hours of trouble-free usage, 39 power cycles, no raw errors, no retries (spin-up, calibration, seek or otherwise), on both disks hda and hdc, according to SMART (SMART is on).

I have a hunch the errors and the crashes are related, but are probably both being caused by a common underlying problem, rather than one causing the other. I have noticed the server no longer starts from a cold boot - it hangs before checking memory during POST - but a warm reset gets it going again. I'll replace the BIOS cell tomorrow and see if that makes a difference.

-- JJ