Koozali.org: home of the SME Server

Raid array question

Offline bigpoppa

« on: February 23, 2008, 05:14:00 AM »
Howdy folks, I've got a strange thing happening to my RAID array and thought I would see what you guys think.  If this has already been covered somewhere else, please forgive me, but as far as I know, I haven't read anything quite like this on the forums.

I'm running SME 7.3 on a RAID 5 disk array (Adaptec Ultra 2 Wide RAID controller) with three 18 GB drives: sda1, sda2 and sda3.  Since I built this machine, every so often, and not in any pattern that I can tell, sda2 gets a failure message and is removed from the RAID array.

Of course, the raid array goes into degraded mode, but continues to function.

Here's the strange part.

EVERY time it does this, I test the drive that it says has failed, and the drive functions perfectly!!

All I have to do is run the mdadm -a command again for it to rebuild the RAID using the same disk it says is bad, and it goes back to normal until, out of the blue, it decides that the drive has failed and removes it again.
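
For anyone following along, here is a minimal sketch of how the degraded state could be spotted before re-adding the disk. It assumes the array is a Linux software (md) array, as the use of mdadm suggests, and that its status is readable from /proc/mdstat. The sample text, device names and sizes below are made up for illustration and are not taken from this machine.

```python
import re

# Made-up sample in the layout the Linux md driver uses for /proc/mdstat.
# Device names and block counts are illustrative only.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md2 : active raid5 sdc2[2] sdb2[1](F) sda2[0]
      35840000 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

md1 : active raid1 sdc1[2] sdb1[1] sda1[0]
      104320 blocks [3/3] [UUU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, failed_members)} for degraded arrays."""
    result = {}
    name = None
    failed = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : (.*)$", line)
        if header:
            name = header.group(1)
            # Members the kernel has marked faulty carry an "(F)" suffix.
            failed = re.findall(r"(\w+)\[\d+\]\(F\)", header.group(2))
            continue
        # The status line ends with "[expected/active]", e.g. "[3/2]".
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if name and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[name] = (expected, active, failed)
            name = None
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real box the same function could be fed the contents of /proc/mdstat instead of the sample string. It only reports what the kernel already shows there; it does not say why the member was dropped.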

It has done this a total of 4 times in the 1.5 years I've had this machine running.

Anyone have a clue as to what might be behind this?

I'll post whatever logs you request; just make sure you include how to get to them, because I don't know where to find everything.
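
As a starting point for gathering logs, here is a rough sketch that pulls RAID-related lines out of a syslog-style file. The keyword list is an assumption chosen for illustration, and the path /var/log/messages in the usage note is the usual syslog location on CentOS-based systems rather than something confirmed in this thread.

```python
import re

# Keywords are an assumption for illustration; widen or narrow as needed.
RAID_PATTERN = re.compile(r"\b(md\d*|raid\d*|mdadm)\b", re.IGNORECASE)


def raid_lines(lines, device=None):
    """Yield log lines that mention md/RAID, optionally also a device name."""
    for line in lines:
        if RAID_PATTERN.search(line) and (device is None or device in line):
            yield line.rstrip("\n")
```

Typical use would be to open the log file (for example /var/log/messages, if that is where the system logs go) and print `raid_lines(f, device="sda2")`, then paste the result into a reply.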

Thanks guys!!


SME ROCKS by the way!!!

BigPoppa
« Last Edit: February 23, 2008, 05:19:28 AM by bigpoppa »

Offline raem

« Reply #1 on: February 23, 2008, 05:30:31 AM »
bigpoppa

Did you ever think that just possibly your drive is actually faulty, but the fault is intermittent?

You should test that drive thoroughly with, e.g., the IBM Drive Fitness Test program or a similar tool for your brand of drive.

I'd run the full test many times to try to get the fault to show up.

It may be best to replace the drive.
...

Offline bigpoppa

« Reply #2 on: February 23, 2008, 06:32:18 AM »
Well, that's the thing.  I've run multiple tests on the drive: the one you can run from the RAID card, the manufacturer's, and some others.  All come back clean.  I'm not saying that there isn't something wrong with the drive; I just thought I would see if anyone else has experienced a similar problem that turned out to be something software related, or fixed by new firmware for their RAID card or HDD.  It's not really a big issue for me. When the drive does ACTUALLY bite the dust, I've got one just like it waiting to be put in.  I was just curious.

 :-)

BigPoppa

Offline christian

    • http://www.szpilfogel.com
« Reply #3 on: February 23, 2008, 05:00:28 PM »
Another thought: the drives may be running too hot.

I had a similar issue and suspected that the drives couldn't dissipate heat fast enough, so I spread them out a bit more and added an extra fan. Part of my issue is that the case is in a spot that doesn't have much air movement, so it was sucking in warm air.

Christian
SME since 2003

Offline pfloor

« Reply #4 on: February 23, 2008, 06:06:33 PM »
I had the same problem years ago with some IBM "Deathstar" drives in a S/W RAID1 array.  One of the drives would go out of sync, I would test it and not find anything wrong.  I would add it back to the array and it would be fine for a few months and then fail again.  It was always the same drive so I RMA'd the drive and all was OK.  I built a second server and it did the same thing with brand new IBM drives.

About 3 years ago I replaced all those IBM drives with Seagate drives on the exact same hardware and have never had the same problem.  Needless to say, I threw the IBM drives in the trash and have never bought another IBM (and now Hitachi) drive since then.  IBM drives have left a bad taste in my mouth :-)

Disclaimer...This is my "unproven" and "non-scientific" conclusion:

Some drives must have a tendency to occasionally mis-write and then recover from the mistake, and therefore never fail any individual drive test.  The drive recovers, there is really nothing wrong with it, and it will perform fine in a non-RAID situation, but this tiny glitch causes the RAID array to go out of sync.  Even so, these types of drives are not suitable for a RAID array.  JMHO!
In life, you must either "Push, Pull or Get out of the way!"

Offline bigpoppa

« Reply #5 on: February 23, 2008, 06:37:35 PM »
Hmm, I guess it's quite possible that heat is the culprit in this instance.  The drives are full-height Seagate Cheetah 10K rpm.  They get so hot, even with fans blowing on them, that you can hardly touch them.  It doesn't matter how much cooling I put on them, they still get really hot.  I guess it's just the nature of this particular drive.  I also totally agree with pfloor's "non-scientific" conclusion; I ran across a similar situation on a Windows server box a while ago.  Thanks for the input guys, I really appreciate it.


Thanks again,


BigPoppa  8-)