Koozali.org: home of the SME Server

[solved] ext2 inconsistency error (3 in a month) raid 1 UU hd smart ok

Offline lightman

Hello
I am a little lost here. This is the third time it has happened: I get an ext2 inconsistency error and fsck asks to be run manually.

I ran fsck to repair the damaged filesystem and got a lot of errors, but the RAID seems intact; it always reports UU.

I checked both hard disks and SMART is OK: no current pending sectors, reallocated sectors or UDMA errors. I did a full check of both disks with HDAT2 and found no errors, and there are no previously logged errors in the SMART log area.

The failures happen about one week apart. When one starts, they tell me the server becomes VERY slow with some ibays and normal with the others, then gets slower and slower until nothing can be accessed.

When they try to power it off, it doesn't respond and they have to do a hard power-off instead. After that, I get the ext2 inconsistency error telling me to run fsck manually.

I checked the messages log but cannot find anything related to hard disk or filesystem errors.

The server is a mini-ITX Intel motherboard with a dual-core Atom, 2 GB of RAM (checked with memtest86+, no errors) and two WD10EADS drives in software RAID 1. It runs SME 7.5 in server-only mode with no contribs, it worked fine for a year until this started happening, and it is connected to an APC UPS.

Any ideas what I can look for? I don't even know where to begin.
Thank you, and sorry for the LONG post, but I didn't want to leave out any data that could be important.
Light

Edit: added [solved] to subject line
« Last Edit: September 02, 2011, 08:57:18 AM by cactus »

Offline Jáder

Do you have notes on the exact time your problem starts?

If not, start taking them, and look for something in crontab or cron.d.

Once a week seems like too much of a coincidence to me. Also verify that you have enough free space on your disk :)
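
A minimal sketch of those two checks, assuming the usual cron directories under /etc and a `df` that supports the POSIX `-P` flag. The helper names `list_cron_jobs` and `full_filesystems` and the 90% threshold are made up for illustration; they are not part of SME Server.

```shell
#!/bin/sh
# List scheduled jobs that could coincide with the weekly slowdown.
# Directories that do not exist are skipped silently.
list_cron_jobs() {
    for dir in /etc/cron.d /etc/cron.daily /etc/cron.weekly; do
        [ -d "$dir" ] && ls -l "$dir"
    done
    return 0
}

# Read `df -P` output on stdin and print the mount point and usage of
# every filesystem at or above the given percentage (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, $5
    }'
}
```

Typical use would be `list_cron_jobs` followed by `df -P | full_filesystems 90`; no output from the second command means no filesystem is above the threshold.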


Offline lightman

Hi
Thank you for the reply.
It appears to happen on Sundays (2 out of 3 times). I have a couple of backup scripts that compress the entire content of one ibay into a tar.gz file; they are in cron.weekly (soft linked).

The hard drives are 1 TB each and there is only 100 GB of data, so there is plenty of free space.

I will remove the cron backups that run on the weekend, but I don't think they are the problem, because they had been working unchanged for almost a year.

Thank you
Light

Offline CharlieBrady

Quote
I ran fsck to repair the damaged filesystem and got a lot of errors, but the RAID seems intact; it always reports UU.

RAID device issues should not create FS damage, and FS damage should not create RAID device issues, so what you report here is not surprising.

It sounds to me that your initial problem is the system getting slower and slower. The FS corruption may be just due to you power cycling the system without proper shutdown (although that usually won't cause FS corruption). FS corruption otherwise is due to kernel bugs or hardware problems.
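
Since the RAID state and the filesystem state are independent, each can be checked on its own. As a rough sketch of the RAID half (the function name `degraded_arrays` is hypothetical), the following reads text in /proc/mdstat format and prints any array whose status string contains an underscore, which marks a missing member:

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin.
# Remember the most recent "mdN :" header, then report that array if
# the status field at the end of a following line (e.g. [U_]) contains "_".
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        $NF ~ /^\[[U_]*_[U_]*\]$/ { print name, $NF }
    '
}
```

Running `degraded_arrays < /proc/mdstat` prints nothing while every array reports all U, which matches the "UU always" observation in the first post and says nothing either way about the filesystem on top.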

Offline lightman

Hi CharlieBrady
As usual, you are right on the money :)
I just saw the error as it happened for the first time, and I was able to take a picture of it.

I get about one of these per second, then the final error when the server finally dies:

Kernel Panic - not syncing: out of memory and no killable processes...

The server has 2 GB of RAM and it was doing nothing! Should I install it again from scratch? :( I'm really lost. As usual, there is nothing in /var/log/messages related to it; it just stops logging.
« Last Edit: August 23, 2011, 02:31:06 AM by lightman »

Offline gregswallow

Do those errors show up right away after you reboot, or after it's been running for a while?  You could try running htop and see if there is a process that is using an ever increasing amount of memory, or too much cpu.

Do you have all the standard updates installed, i.e. "yum update" has nothing to do?  If so, you could also try the kernel in smeupdates-testing:
yum --enablerepo=smeupdates-testing upgrade kernel*
(and then reconfigure/reboot and the latest kernel should be selected automatically as the default)
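
One way to catch a slowly growing process without sitting in front of htop is to log the biggest memory users at intervals. This is only a sketch: `rank_by_rss` and `log_mem` are hypothetical helper names, the log path is whatever you pass in, and it assumes a procps `ps` that accepts `-eo rss,pid,comm`.

```shell
#!/bin/sh
# Read `ps -eo rss,pid,comm` output on stdin, drop the header line and
# print the N entries with the largest resident set size (default 5).
rank_by_rss() {
    awk 'NR > 1' | sort -rn | head -n "${1:-5}"
}

# Append a timestamped snapshot to LOGFILE every INTERVAL seconds
# (default 60). `sync` asks the kernel to flush dirty data to disk, so
# the latest snapshots have a better chance of surviving a hard power-off.
log_mem() {
    logfile=$1
    interval=${2:-60}
    while :; do
        { date; ps -eo rss,pid,comm | rank_by_rss 5; echo; } >> "$logfile"
        sync
        sleep "$interval"
    done
}
```

Started in the background, for example `log_mem /root/memwatch.log 60 &`, it leaves a record that can be read after the next hang to see whether any process was growing beforehand.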

Offline CharlieBrady

Quote
now the server has 2GB of ram...

That should be plenty. What do you have installed which is non-standard? What was the last thing you installed/upgraded/reconfigured before the problems started?

Offline lightman

The last thing I installed was the updates via the web panel, but that was about 5 months ago; nothing has changed since.
The only non-standard thing I have is the LCDproc daemon driving a 4x20 display on the parallel port, but that has been there since the very first day I installed the server about a year ago. I disabled the daemon after the first failure anyway, just in case.

It now takes anything from 30 minutes to 1 day to happen.

One thing I decided to try was running the server with the complete RAID 1, and then with each disk on its own.

So:
raid1 complete: failed after 1 hour.
sda only: failed after 30 min.
sdb only: so far hasn't failed (3 hours online and counting)
(sda/sdb are just a way to say which physical drive I mean :) )

It has got to be hardware. I mean, if I haven't changed a thing, software cannot just decide to start collecting garbage in memory from one day to the next for no reason, right?

Oh, I forgot to mention: in both cases there was no CPU activity. htop shows no process with CPU usage above 5% or MEM above about 1%.

EDIT: it seems to be one of the disk drives, because since I removed the first one (originally sda) the problem is GONE.
But shouldn't RAID 1 protect against exactly that kind of issue? It seems to me that software RAID 1 is more trouble than help.
« Last Edit: August 24, 2011, 03:41:09 AM by lightman »
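
htop only shows memory charged to processes. When no process stands out, it may be worth also recording the system-wide counters in /proc/meminfo, which include memory held by the kernel itself (for example the Slab line). A small sketch with a hypothetical helper name, `meminfo_fields`; which fields exist depends on the kernel, and names that are not present are simply skipped:

```shell
#!/bin/sh
# Read /proc/meminfo-style text on stdin and print only the fields
# named in the comma-separated list given as the first argument.
meminfo_fields() {
    awk -v want="$1" '
        BEGIN {
            n = split(want, names, ",")
            for (i = 1; i <= n; i++) keep[names[i] ":"] = 1
        }
        ($1 in keep) { printf "%s %s kB\n", $1, $2 }
    '
}
```

For instance, `meminfo_fields MemFree,LowFree,Slab,SwapFree < /proc/meminfo` run every few minutes would show whether free memory is shrinking even though no process is growing. This is a way to gather evidence, not a diagnosis of this particular server.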

Offline lightman

I retract some of what I said earlier: it failed again with the other disk after almost 2 days of running.
I have no idea what is going on, and my client is getting very anxious (as am I). I will reinstall everything from scratch and see what happens, because I have no idea why this freezing followed by severe ext3 corruption is happening. I have never seen anything like it before.

Offline Jáder

Lightman

If you reinstall, all the information will be lost and we won't be able to help you...
Remember that your "severe ext3 corruption" is the symptom, not the cause.
MAYBE your HDD, or something related to it, is the cause.

Do you have sysmon and/or sme7admin installed? (Search for them in the contribs.)
They can show you (in graphs, so it's easy to see!) whether something is climbing or dropping in the time after a reboot.

I like to install one of them to see a trend line for every indicator on SME.
It's also very useful for showing management why you need a memory upgrade (a big swap usage line on a server with too little memory, versus no swap at all on a server with enough memory).

Do you use your server as a file server only, or as a print server as well?
How do you use your server? What do you think uses the most resources on it? (I see it's a mini-ITX, so it's not for a big group of users.)
Tell me more so I can guess at things to look at.

Regards

Jáder
« Last Edit: August 25, 2011, 03:02:15 AM by jader »

Offline lightman

Hello
Thank you for the reply. I would really like to know what the hell is causing this, because it could happen again to me or to someone else. What is worse is that at this point I don't know whether it is a software or a hardware issue!

Every part has been tested separately without issues. Both disks were tested very heavily outside the RAID, and they work perfectly.

If it is hardware, the closest thing I have seen that is even remotely like this was an HDD controller failure.

I'd like to know what you think (especially the developers, who understand the inner workings of SME far better than I do): if the HDD controller freezes in a WAIT or BUSY state, could the kernel keep caching to RAM until there is no more RAM available and then panic?

I will try to give my client a temporary server so I can work on this one and find out what is going on, but my time is running out and my client wants a solution ASAP.
I never knew about the tools you mention! I will try to get as much info as I can with them and post it here.
Thank you.

EDIT: I forgot to answer some of your questions. It's for a SOHO with only 6 workstations. The network card is what causes the most CPU usage, but that is here in my lab at gigabit speed; my client has a 100 Mb switch, so it is not his case. I also have 2 GB of RAM (DDR2-800), which is a lot for a file server. The motherboard is an Intel dual-core, dual-threaded Atom at 1.6 GHz. I don't use it as a print server or anything else, only for file serving: about 90 GB of data on 1 TB disks in software RAID 1.
« Last Edit: August 25, 2011, 03:24:29 AM by lightman »

Offline slords

Quote
I get about one of these per second, then the final error when the server finally dies:

Kernel Panic - not syncing: out of memory and no killable processes...

The server has 2 GB of RAM and it was doing nothing! Should I install it again from scratch? :( I'm really lost. As usual, there is nothing in /var/log/messages related to it; it just stops logging.

It doesn't matter how much memory you have.  When the system gets into an OOM condition it will start killing processes.  Most of the time this will result in a hung system.  If you let it get to the point of hanging, then most of the time you will find you also have corrupt (more likely missing) information on the hard drive.

What you need to do is determine what is eating all your memory.  This is almost never a hardware issue.  You may not have changed anything on the system, but that doesn't mean that the data on the system hasn't changed.  I've seen things as simple as a corrupt zip file causing the virus scanner to eat all the memory and crash the system.  It could also be that the tar backup you are taking is all of a sudden using more memory because of additional data in an ibay.

Whatever it is, you need to identify what is taking the memory to solve this issue.  Depending on how aggressive the program eating the memory is, you might be able to watch it grow over a few hours, or it might go from 0 to 110% of memory in a matter of seconds.

Once you get into OOM you will most likely not get any additional messages in /var/log/messages.  This is especially true if the flush daemon or syslog daemon gets killed.

Offline janet

lightman

Use
top -i
or
htop

to see what is happening, before the lock up occurs

Offline gregswallow

Quote
Whatever it is, you need to identify what is taking the memory to solve this issue.  Depending on how aggressive the program eating the memory is, you might be able to watch it grow over a few hours, or it might go from 0 to 110% of memory in a matter of seconds.

Can you suggest a tool to determine the problem better than top or htop?

I was just reading about atop, which seems to do logging built in...good description of what it can do here:
http://www.atoptool.nl/download/man_atop.pdf
http://www.atoptool.nl/download/case_leakage.pdf

There are RPMs in RPMForge or EPEL for SME7 or SME8.

Offline lightman

Hello all
Thank you for taking the time to answer.
I am starting to get desperate.
It is not an SME issue; it is hardware.
I removed one of the disks from the RAID, formatted it and installed a trial copy of Windows Server 2003. The server hung about 10 hours after the install, though with no data corruption at all. After a power cycle everything went back to normal for another 10 to 20 hours; I am not sure exactly when, but it hung again.

I replaced the motherboard and memory (bought new ones).
I replaced the power supply.
I did the Windows 2003 test with the disk I wasn't using (call it sda); previously the hang with SME was happening with sdb (with both, actually, but in the last test I ran each disk separately).

I'm lost. At this point I know it is not an SME issue, because it also happened with Windows Server 2003, and more often than with SME. The only good thing about Server 2003 is that I get no data corruption when it happens.

I now have the old motherboard at home with a clean SME install, and so far everything seems OK. I will leave it running overnight to see what happens.
I don't know what else to test. Everything failed, yet everything works when I test it separately outside the server.
Thank you
Light

Offline Jáder

I have seen this just once; it was an undersized PSU.
My tip: try your hardware with a REALLY oversized power supply. Borrow one if you need to.
Or unplug anything that isn't necessary (leave just the motherboard and one HDD) and try that.

BTW: did you install sysmon or sme7admin?
Both of them generate nice graphs, so you can see how your server behaves over time.
Don't be scared if memory is all in use... that's the Linux way.

Offline purvis

I had some video memory problems lately that were hard to detect. Most memory also has gold contacts. Do not mix two different types of metal surface where your memory connects to the motherboard.

Offline lightman

Hello
First of all: thank you all for your help.
I finally decided to go radical. The motherboard/processor/memory were OK, since they didn't fail at all at home after 2 days, so I bought a brand new power supply, case and one 1 TB hard drive, and it has been working perfectly so far with SME 7.5.

So it has something to do with the power supply or the disks. I will dedicate this weekend to finding out. My client is now happy with his new working server (well... I didn't tell him that the motherboard/memory was the same one :D, he just saw a new enclosure and figured out the rest by himself, hehe), so there is no rush to figure out what is going on.

I did change the PSU a couple of days ago for an ANTEC BASQ 450W, which I think is a very good PSU. Remember that the motherboard is dual-core Atom based with 2 WD Green drives, so power consumption is FAR below 100 watts. It failed after 8 hours in the same way (with only one drive connected), which is why I ruled it out. (That's good info BTW, I never knew that an undersized PSU could cause hangs like this, thank you.)

I kept the new motherboard (Intel D525MW with 2 GB DDR3), so I will use it to run tests and post the results here later :) (My client wants RAID 1, but I will not take any chances: I will test both drives very thoroughly with SME here before they go into the production server again.)
Thank you all
Light
« Last Edit: September 01, 2011, 03:12:02 AM by lightman »

Offline Jáder

Quote
Hello
First of all: thank you all for your help.
I finally decided to go radical. The motherboard/processor/memory were OK, since they didn't fail at all at home after 2 days, so I bought a brand new power supply, case and one 1 TB hard drive, and it has been working perfectly so far with SME 7.5.

Glad to know you fixed the issue.
You're welcome.
But now please change the subject to include [Solved].
And if you can, donate some money to this project... everyone will thank you.

Quote
So it has something to do with the power supply or the disks. I will dedicate this weekend to finding out. My client is now happy with his new working server (well... I didn't tell him that the motherboard/memory was the same one :D, he just saw a new enclosure and figured out the rest by himself, hehe), so there is no rush to figure out what is going on.

I did change the PSU a couple of days ago for an ANTEC BASQ 450W, which I think is a very good PSU. Remember that the motherboard is dual-core Atom based with 2 WD Green drives, so power consumption is FAR below 100 watts. It failed after 8 hours in the same way (with only one drive connected), which is why I ruled it out. (That's good info BTW, I never knew that an undersized PSU could cause hangs like this, thank you.)

You're welcome.
I had this kind of problem several years ago... it was a pain...

Jáder

Offline lightman

Hello
It happened AGAIN!
How is that possible? It is a new machine: new disk, new enclosure, new power supply.
Now it is worse: I cannot fix the / partition.

I booted with SME RESCUE and tried to e2fsck /dev/sda2 with no success (sda1 is fine).
It says: e2fsck: Bad magic number in super-block while trying to open /dev/sda2

I tried -b 8193 and some other values I found in posts here, with no success. I have a backup, but it is 2 days old, and losing 2 days of work because of this would be a problem.

I read here: http://wiki.contribs.org/Recovering_SME_Server_with_lvm_drives

that I need to enable RAID and LVM in order to fix the filesystem. Is this correct?

I did mdadm -AR /dev/md8 /dev/sda2 and now I have md8 created,
but in RESCUE mode I don't have any LVM utilities; when I try vgs or vgchange, the commands don't exist.

I'm lost again. What can I do? I need to recover those files but I have no idea how, and I don't have another SME server where I could mount the disk as a secondary drive.

Any advice will be much appreciated.
Thank you
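
A note on the `-b 8193` attempt: that value is the first backup superblock only on filesystems with 1 KiB blocks. The sketch below works out where the backups normally sit for a given block size, assuming default ext2/ext3 parameters (blocks per group equal to 8 times the block size, and the sparse_super layout that keeps backups in group 1 and in groups that are powers of 3, 5 and 7). The function name `backup_superblocks` is made up; `mke2fs -n` or `dumpe2fs` on the real device remain the authoritative way to list them.

```shell
#!/bin/sh
# Print the first COUNT backup superblock locations (default 5) for an
# ext2/ext3 filesystem with the given block size in bytes (default 4096).
backup_superblocks() {
    awk -v bs="${1:-4096}" -v count="${2:-5}" '
        # True when n is 1 or an exact power of base.
        function is_power(n, base) {
            while (n > 1 && n % base == 0) n /= base
            return n == 1
        }
        BEGIN {
            per_group = bs * 8                 # one bitmap block covers a group
            first = (bs == 1024) ? 1 : 0       # first data block
            found = 0
            for (g = 1; found < count; g++) {
                if (is_power(g, 3) || is_power(g, 5) || is_power(g, 7)) {
                    print first + g * per_group
                    found++
                }
            }
        }
    '
}
```

With 4 KiB blocks the first candidate is 32768, used as `e2fsck -b 32768 -B 4096` on the device that actually holds the filesystem. None of this helps if the device being checked is a RAID member or LVM physical volume rather than a plain ext3 partition, which is what the wiki page linked above suggests for /dev/sda2.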

Offline Jáder

If you had installed sysmon or sme7admin and kept an eye on your server graphs, you could have seen it coming.

Now it's a scary scenario: I would start over with all new parts... ALL NEW this time... and be sure to put the server on a UPS (a good one, please!).

Good luck

Jáder

Offline lightman

Hello
The UPS is a brand new backup UPS from APC.
I have now put the new motherboard back in, so it's all new equipment, nothing used.

I was able to revive the partition using another live CD (SME RESCUE was USELESS) so I could start LVM and run the filesystem repair from there. It worked, but I had TONS of errors.
Now the server starts, but I cannot access it over CIFS (I can still get in over FTP). When I try to modify, add or delete any user, ibay or group, it gives me TONS of errors, like: failed to initialize group mapping, adding entry for group user failed no rid or sid specified.
I have no idea what this is.
It seems that the only way to extract the data will be via FTP :( That will take HOURS, since there are hundreds of thousands of very small files.
But well, at least I have the data back :)
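
For anyone landing here with the same "Bad magic number" message, the order of operations described above (assemble the RAID member, activate LVM, then check the logical volume rather than the raw partition) can be sketched as follows. The function only prints the commands so they can be reviewed first. /dev/md8 and /dev/sda2 come from the earlier post; the logical volume path /dev/main/root is an assumption about the default SME layout and should be confirmed with the `lvm lvs` output. `recovery_plan` is a hypothetical name.

```shell
#!/bin/sh
# Print (not run) a recovery sequence for a filesystem that lives on
# LVM on top of a software RAID member.
#   $1 = md device to assemble        (default /dev/md8, as in the thread)
#   $2 = RAID member partition        (default /dev/sda2, as in the thread)
#   $3 = logical volume to check      (default /dev/main/root, an assumption)
recovery_plan() {
    md=${1:-/dev/md8}
    part=${2:-/dev/sda2}
    lv=${3:-/dev/main/root}
    cat <<EOF
mdadm --assemble --run $md $part
lvm vgscan
lvm vgchange -ay
lvm lvs
e2fsck -f $lv
EOF
}
```

The commands need a rescue or live environment that ships both mdadm and the LVM tools, which, as noted above, the SME rescue mode used here did not.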

Offline Jáder

Good to know you got your data back.
Search for AFFA in the contribs and put it to work... backups every day, offsite, over the web!
Please do install sysmon and/or sme7admin on this server, so you can watch graphs of the server status.

Keep us informed.
Jáder