Koozali.org: home of the SME Server

[solved] ext2 inconsistency error (3 in a month) raid 1 UU hd smart ok

Offline lightman

Hello
I am a little lost here. This is the third time it has happened: I get an ext2 inconsistency error and the boot process asks me to run fsck manually.

I ran fsck to repair the damaged filesystem and got a lot of errors, but the RAID seems intact: it always reports UU.

I checked both hard disks and SMART is OK: no current pending sectors, no reallocated sectors, no UDMA errors. I did a full check of both disks with HDAT2 and found no errors, and there are no previously logged errors in the SMART log area.

The failures are about one week apart. When one starts, users tell me the server becomes VERY slow with some ibays and normal with the others, then gets slower and slower until nothing can be accessed.

When they try to power it off, it doesn't respond, so they have to do a hard power-off. After that I get the ext2 inconsistency error telling me to run fsck manually.

I checked the messages log but cannot find anything related to hard disk or filesystem errors.

The server is a mini-ITX Intel motherboard with a dual-core Atom, 2 GB of RAM (checked with memtest86+, no errors) and two WD10EADS drives in software RAID 1. It runs SME 7.5 in server-only mode with no contribs, had been working fine for a year until this started happening, and is connected to an APC UPS.

Any ideas what I can look for? I don't even know where to begin.
Thank you.
Sorry for the LONG post, but I didn't want to leave out any data that could be important.
Light.

Edit: added [solved] to subject line
« Last Edit: September 02, 2011, 08:57:18 AM by cactus »

Offline Jáder

Do you have notes on exactly when the problem starts?

If not, start taking them, and look for something in crontab or cron.d.

Once a week seems like too much of a coincidence to me. Also verify that you have enough free space on your disk :)
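
A minimal sketch of that check, assuming the usual cron directories under /etc. The function name `list_cron_jobs` is made up for illustration and is not an SME Server command. It prints every entry it finds, including symlinks:

```shell
#!/bin/sh
# Hypothetical helper: print every entry in the standard cron directories.
# An optional first argument is a root prefix (empty means the live system).
list_cron_jobs() {
    root=${1:-}
    for d in cron.d cron.hourly cron.daily cron.weekly cron.monthly; do
        dir="$root/etc/$d"
        [ -d "$dir" ] || continue
        for f in "$dir"/*; do
            # -L also catches symlinks whose target is missing
            if [ -e "$f" ] || [ -L "$f" ]; then
                echo "$f"
            fi
        done
    done
    return 0
}
```

On the server itself you would run `list_cron_jobs`, then `cat /etc/crontab` and `crontab -l` for the remaining schedules, and `df -h` for the free space.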

...

Offline lightman

Hi
Thank you for the reply.
It appears to happen on Sundays (2 out of 3 times). I have a couple of backup scripts that compress the entire content of one ibay into a tar.gz file; they are in cron.weekly (soft-linked).

The hard drives are 1 TB each and there is only 100 GB of data, so there is plenty of free space.

I will remove the cron backups that run on the weekend, but I don't think they are the problem, because they worked fine, unchanged, for almost a year.

thank you
Light

Offline CharlieBrady

Quote from: lightman
I ran fsck to repair the damaged FS and i have a lot of errors!, but the RAID seems intact, it reports UU always.

RAID device issues should not create FS damage, and FS damage should not create RAID device issues, so what you report here is not surprising.

It sounds to me that your initial problem is the system getting slower and slower. The FS corruption may be just due to you power cycling the system without proper shutdown (although that usually won't cause FS corruption). FS corruption otherwise is due to kernel bugs or hardware problems.

Offline lightman

Hi CharlieBrady
As usual, you are right on the money :)
I just saw the error as it was happening for the first time, and I was able to take a picture of it.

I get about one of these per second, and then the final error when the server finally dies:

Kernel Panic - not syncing: out of memory and no killable processes...

The server has 2 GB of RAM and it was doing nothing! Should I install it again from scratch? :( I'm really lost. As usual, there is nothing related to it in /var/log/messages; it just stops logging.
« Last Edit: August 23, 2011, 02:31:06 AM by lightman »

Offline gregswallow

Do those errors show up right away after you reboot, or after it's been running for a while?  You could try running htop and see if there is a process that is using an ever increasing amount of memory, or too much cpu.

Do you have all the standard updates installed, i.e. does "yum update" have nothing to do?  If so, you could also try the kernel in smeupdates-testing:
yum --enablerepo=smeupdates-testing upgrade kernel*
(and then reconfigure/reboot and the latest kernel should be selected automatically as the default)

Offline CharlieBrady

Quote from: lightman
now the server has 2GB of ram...

That should be plenty. What do you have installed which is non-standard? What was the last thing you installed/upgraded/reconfigured before the problems started?

Offline lightman

The last thing I installed was the updates via the web panel, but that was about 5 months ago; nothing has changed since.
The only non-standard thing I have is the LCDproc daemon driving a 4x20 display on the parallel port, but that has been there since the very first day I installed the server about a year ago. Anyway, I disabled the daemon after the first failure, just in case.

It now takes anywhere from 30 minutes to 1 day to happen.

One thing I decided to try was running the server with the complete RAID 1, and then with each disk on its own.

So:
raid1 complete: failed after 1 hour.
sda only: failed after 30 min.
sdb only: so far hasn't failed (3 hours online and counting)
(sda/sdb are just a way to say which physical drive I mean :) )

It has got to be hardware. I mean, if I haven't changed a thing, software cannot just decide to start collecting garbage in memory from one day to the next for no reason, right?

Oh, I forgot to mention: in both cases there was no CPU activity. htop shows no process with CPU usage above 5% or MEM above about 1%.

EDIT: It seems to be one of the disk drives, because since I removed the first one (originally sda) the problem is GONE.
But shouldn't RAID 1 handle exactly that kind of issue? It seems to me that software RAID 1 is more trouble than help.
« Last Edit: August 24, 2011, 03:41:09 AM by lightman »

Offline lightman

I retract some of what I said earlier: it failed again with the other disk after almost 2 days of working.
I have no idea what is going on and my client is getting very anxious (as am I). I will reinstall everything from scratch and see what happens, because I have no idea why this freezing followed by severe ext3 corruption is happening. I have never seen anything like this before.

Offline Jáder

Lightman

If you reinstall, all the information will be lost and we won't be able to help you...
Remember, your "severe ext3 corruption" is the symptom, not the cause.
MAYBE your HDD, or something related to it, is the cause.

Do you have sysmon and/or sme7admin installed? (Search for them in the contribs.)
They can show you (in graphs, so it's easy to see!) whether something keeps getting higher or lower after a reboot.

I like to install one of them to see a trend line for all the indicators on SME.
It's also very useful for showing management why you need a memory upgrade (a big swap usage line on a server with too little memory versus no swap at all on a server with enough memory).

Do you use your server as a file server only, or as a print server as well?
How do you use your server? What do you think uses the most resources on it? (I see it's a mini-ITX, so it's not for a big group of users.)
Tell me more so I can guess at things to look at.

Regards

Jáder
« Last Edit: August 25, 2011, 03:02:15 AM by jader »
...

Offline lightman

Hello
Thank you for the reply. I know; I would really like to find out what the hell is causing this, because it could happen again to me or to someone else. What is worse is that at this point I don't even know whether it is a software or a hardware issue!

Every part has been tested separately without issues. Both disks were tested very heavily outside the RAID, and they work perfectly.

If it is hardware, the closest thing I have seen that is even remotely like this was an HDD controller failure.

I don't know what you think (especially the developers, who understand the inner workings of SME very well; I know I don't): if the HDD controller freezes in a WAIT or BUSY state, could the kernel keep caching to RAM until there is no more RAM available and then panic?
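
One way to test that hypothesis, as a rough sketch only: watch the Dirty and Writeback counters in /proc/meminfo (memory waiting to be written, or currently being written, to disk) and see whether they keep growing while the server slows down. The helper name `pending_write_kb` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: sum the Dirty and Writeback counters (in kB)
# from /proc/meminfo-style text read on stdin.
pending_write_kb() {
    awk '/^(Dirty|Writeback):/ { total += $2 } END { print total + 0 }'
}

# Example loop for a live system (not run automatically):
# while true; do
#     echo "$(date '+%H:%M:%S') $(pending_write_kb < /proc/meminfo) kB pending"
#     sleep 10
# done
```

If the number stays small while the slowdown is in progress, stalled disk writes are probably not what is filling the memory.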

I will try to give my client a temporary server so I can work on this one to find out what is going on, but my time is running out and my client wants a solution ASAP.
I never knew about the tools you mention here! I will try to get as much info as I can with them and post it here.
Thank you.

EDIT: I forgot to answer some of your questions. It's for a SOHO with only 6 workstations. The network card is what causes the most CPU usage, but that is here in my lab at gigabit speed; my client has a 100 Mb switch, so that is not his case. I have 2 GB of RAM (DDR2-800), which is a lot for a file server. The motherboard is an Intel with a dual-core, dual-threaded Atom at 1.6 GHz. I don't use it as a print server or anything else, only for file serving: about 90 GB of data on 1 TB disks in software RAID 1.
« Last Edit: August 25, 2011, 03:24:29 AM by lightman »

Offline slords

Quote from: lightman
I have about 1 of these per second then the final error when the server finally dies:

Kernel Panic - not syncing: out of memory and no killable processes...

now the server has 2GB of ram, and it was doing nothing!, should I install it again from scratch :( I'm really lost, as usual, there is nothing in /var/log/messages related to it, it just stops logging.

It doesn't matter how much memory you have.  When the system gets into an OOM condition it will start killing processes.  Most of the time this will result in a hung system.  If you let it get to the point of hanging, then most of the time you will find you also have corrupted (or, more likely, missing) information on the hard drive.

What you need to do is determine what is eating all your memory.  This is almost never a hardware issue.  You may not have changed anything on the system, but that doesn't mean that the data on the system hasn't changed.  I've seen things as simple as a corrupt zip file causing the virus scanner to eat all the memory and crash the system.  It could also be that the tar backup you are taking is all of a sudden using more memory because of additional data in an ibay.

Whatever it is, you need to identify what is taking the memory to solve this issue.  Depending on how aggressive the program eating the memory is, you might be able to watch it grow over a few hours, or it might go from 0 to 110% of memory in a matter of seconds.

Once you get into OOM you will most likely not get any additional messages in /var/log/messages.  This is especially true if the flush daemon or the syslog daemon gets killed.
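
Since the normal logs go quiet, one option is to append a small snapshot to a file every minute and sync it to disk, so the last entries survive a hard power-off. A rough sketch, assuming procps `ps` and a Linux /proc; `top_rss`, `log_mem_snapshot` and the log path are made-up names for illustration:

```shell
#!/bin/sh
# Hypothetical helper: keep the N largest processes by resident memory.
# Reads "pid rss_kb command" lines on stdin (default N is 5).
top_rss() {
    sort -k2,2nr | head -n "${1:-5}"
}

# Append one timestamped snapshot to the given file and flush it to disk.
log_mem_snapshot() {
    logfile=$1
    {
        date '+%Y-%m-%d %H:%M:%S'
        # Slab and LowFree are kernel-side counters, in case no single
        # process is to blame; lines that do not exist are simply skipped.
        grep -E '^(MemFree|LowFree|SwapFree|Slab):' /proc/meminfo
        ps -eo pid=,rss=,comm= | top_rss 5
        echo
    } >> "$logfile"
    sync
}

# Example loop (not run automatically): one snapshot per minute.
# while true; do log_mem_snapshot /root/memwatch.log; sleep 60; done
```

After the next hang, the tail of the log file shows what memory looked like in the last minute before logging stopped.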

Offline janet

lightman

Use
top -i
or
htop

to see what is happening, before the lock up occurs

Offline gregswallow

Quote from: slords
Whatever it is you need to identify what is taking the memory to solve this issue.  Depending on how aggressive the program is that is eating all the memory you might be able to identify it growing over a few hours, or it might go from 0 to 110% of memory in a matter of seconds.

Can you suggest a tool to determine the problem better than top or htop?

I was just reading about atop, which seems to have logging built in. There is a good description of what it can do here:
http://www.atoptool.nl/download/man_atop.pdf
http://www.atoptool.nl/download/case_leakage.pdf

There are RPMs in RPMForge or EPEL for SME7 or SME8.

Offline lightman

Hello all
Thank you for taking the time to answer.
I am starting to get desperate.
It is not an SME issue, it is hardware.
I removed one of the disks from the RAID, formatted it and installed a trial copy of Windows Server 2003. The server hung about 10 hours after the install, except that this time there was no data corruption at all. A power cycle brought everything back to normal for another 10 to 20 hours; I'm not sure exactly when, but then it happened again.

I replaced the motherboard and memory (bought a new one).
I replaced the power supply.
I did the Windows 2003 test with the disk I wasn't using (let's call it sda); previously the hang with SME was happening with sdb (with both, actually, but for the last test I ran each disk separately).

I'm lost. At this point I know it is not an SME issue, because it also happened with Windows Server 2003, and more often than with SME. The only good thing with Server 2003 is that I get no data corruption when it happens.

Now I have the old motherboard at home with a clean SME install and so far everything seems OK. I will leave it running overnight to see what happens.
I don't know what else to test. Everything failed, but everything works when I test it separately outside.
thank you
light