Kernel errors and crashing server

judgej

« on: August 19, 2007, 02:26:15 PM »

My server has suddenly started crashing and producing errors such as these:

Aug 19 12:59:09 sme kernel: [<e0889c72>] ext3_write_inode+0x22/0x3f [ext3]
Aug 19 12:59:09 sme kernel: [write_inode+48/55] write_inode+0x30/0x37
Aug 19 12:59:09 sme kernel: [<c0177b56>] write_inode+0x30/0x37
Aug 19 12:59:09 sme kernel: [__sync_single_inode+112/443] __sync_single_inode+0x70/0x1bb
Aug 19 12:59:09 sme kernel: [<c0177bcd>] __sync_single_inode+0x70/0x1bb
Aug 19 12:59:09 sme kernel: [sync_sb_inodes+423/628] sync_sb_inodes+0x1a7/0x274
Aug 19 12:59:09 sme kernel: [<c0177f79>] sync_sb_inodes+0x1a7/0x274
Aug 19 12:59:09 sme kernel: [writeback_inodes+145/222] writeback_inodes+0x91/0xde
Aug 19 12:59:09 sme kernel: [<c01780d7>] writeback_inodes+0x91/0xde
Aug 19 12:59:09 sme kernel: [balance_dirty_pages+124/284] balance_dirty_pages+0x7c/0x11c
Aug 19 12:59:09 sme kernel: [<c01451b8>] balance_dirty_pages+0x7c/0x11c
Aug 19 12:59:09 sme kernel: [<e0887f7d>] ext3_ordered_commit_write+0xb6/0xc5 [ext3]

My guess is that this is a hardware fault, but any ideas where it is likely to be? Hard disks or (motherboard) controller?

I have two 320 GB disks in an active array (no errors reported there), each the master drive on a separate IDE channel, plus 512 MB RAM, a 1 GHz Athlon (perhaps underpowered now) and an aging motherboard. The RAM has tested okay, but I have not yet done a low-level test of the hard drives. I just don't know what those errors mean, but I do know they started appearing in the last few days, and the server has been crashing: processes suddenly stop, and a hard reset is the only way out.
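
A note on reading a trace like the one above: it is a call trace through the ext3 write-back path, innermost call first, and most frames appear twice. The lines with a bracketed hex address (`[<c0177b56>]`) carry the raw address, and the lines with decimal offsets (`[write_inode+48/55]`) repeat the same frame, most likely as re-resolved by the logging daemon (0x30/0x37 is 48/55). The header line that says what triggered the dump is not part of the excerpt. A minimal sketch that keeps one line per frame; the function name `trace_frames` is made up for this example:

```shell
#!/bin/sh
# trace_frames (hypothetical helper): read syslog lines on stdin and print
# one line per stack frame. Only lines carrying a raw address such as
# [<c0177b56>] are kept; the decimal-offset duplicates
# (e.g. [write_inode+48/55]) are dropped.
trace_frames() {
    sed -n 's/^.*kernel: \[<[0-9a-f]*>\] *//p'
}
```

Usage would be something like `trace_frames < /var/log/messages`, assuming that is where kernel messages are logged on this box.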

-- Jason

judgej

Re: Kernel errors and crashing server

« Reply #1 on: August 19, 2007, 06:44:09 PM »

Are those messages actually just page faults? That is, does the SME Server kernel have some kind of debug output turned on, giving me scary-looking messages for a pretty ordinary event, such as the server needing a bit more memory than it has, perhaps to process a big batch of spam that has just come in? I am getting around 16,000 spam messages in any 7-day period at the moment, about 99% of them caught by the server.

-- JJ

Edit: server crashed again - four times in 48 hours. I've placed an order for a new Dell SC440. Just can't be doing with this.

« Last Edit: August 20, 2007, 09:43:35 AM by judgej »

william_syd

Re: Kernel errors and crashing server

« Reply #2 on: August 20, 2007, 03:14:02 PM »

Did you check the drive with SMART?

smartctl -t short /dev/hda

smartctl -l selftest /dev/hda

smartctl -a /dev/hda
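
Since the server has two drives, the same checks can be run on each of them. A rough sketch, with the assumptions labeled: `check_drives`, `SMARTCTL` and `WAIT` are names made up for this example, and the 180-second pause is only a guess at how long a short self-test takes (smartctl prints its own estimate when the test starts). The pause matters because the self-test log only shows the new result once the test has finished.

```shell
#!/bin/sh
# check_drives (hypothetical helper): start a short SMART self-test on each
# drive given as an argument, wait, then print the self-test log and the
# full report for each. SMARTCTL and WAIT are made-up overrides for a dry run.
check_drives() {
    cmd=${SMARTCTL:-smartctl}
    for dev in "$@"; do
        $cmd -t short "$dev"        # start the short self-test
    done
    sleep "${WAIT:-180}"            # assumed duration; adjust to smartctl's estimate
    for dev in "$@"; do
        $cmd -l selftest "$dev"     # self-test results
        $cmd -a "$dev"              # all SMART information
    done
}
```

Run as root, for example `check_drives /dev/hda /dev/hdc`. Setting `SMARTCTL="echo smartctl"` and `WAIT=0` beforehand just prints the commands without touching the drives.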

Perhaps MasterSleepy can help you out. I saw a similar post from him on a French board.

Regards,
William

judgej

Re: Kernel errors and crashing server

« Reply #3 on: August 20, 2007, 11:46:34 PM »

Quote from: william_syd on August 20, 2007, 03:14:02 PM

Did you check the drive with SMART?

Nice tip, thanks.

Over 5500 hours of trouble-free usage, 39 power cycles, no raw errors, no retries (spin-up, calibration, seek or otherwise), on both disks hda and hdc, according to SMART (SMART is on).
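
Figures like these come from the attribute table in the SMART report. A small sketch for pulling just those rows out of `smartctl -A` output; it assumes the usual ten-column table with the attribute name in column 2 and the raw value in column 10, and that the drive reports these particular attribute names (they vary by vendor). `smart_summary` is a made-up name.

```shell
#!/bin/sh
# smart_summary (hypothetical helper): read `smartctl -A` output on stdin
# and print "attribute raw_value" for the hours, power-cycle, raw-error
# and retry attributes. Assumes name in column 2, raw value in column 10.
smart_summary() {
    awk '$2 ~ /^(Power_On_Hours|Power_Cycle_Count|Raw_Read_Error_Rate|Spin_Retry_Count|Calibration_Retry_Count|Seek_Error_Rate)$/ { print $2, $10 }'
}
```

Usage would be `smartctl -A /dev/hda | smart_summary`, and the same again for `/dev/hdc`.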

I have a hunch that the errors and the crashes are related, but that both are caused by a common underlying problem rather than one causing the other. I have also noticed that the server no longer starts from a cold boot: it hangs before the memory check during POST, but a warm reset gets it going again. I'll replace the BIOS cell tomorrow and see if that makes a difference.

-- JJ

« Last Edit: August 20, 2007, 11:49:32 PM by judgej »
