Koozali.org: home of the SME Server

Help troubleshooting a failing array

jptechnical
« on: August 10, 2009, 06:39:29 PM »
I have an SME7.4 machine with a 500gig SATA mirror.  The mdadm rebuild event has happened 3 or 4 times in as many months, and more recently twice in a 3 week period. I have run:

smartctl -t short, long, and conveyance(?) tests on sdc and sdd, and they all passed.
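For anyone following along, the self-tests mentioned above can be started and read back roughly as follows. This is only a sketch: the function names are made up for illustration, the device names come from this post, and smartctl has to be run as root.

```shell
#!/bin/sh
# Sketch only: wrappers around smartctl (smartmontools).
# start_selftest and smart_report are hypothetical helper names.

# Start a self-test on a drive.
#   $1 = device (e.g. /dev/sdc), $2 = test type: short, long or conveyance
start_selftest() {
    smartctl -t "$2" "$1"
}

# Print what SMART currently reports for a drive.
smart_report() {
    smartctl -H "$1"            # overall health self-assessment
    smartctl -A "$1"            # vendor attribute table
    smartctl -l selftest "$1"   # log of previous self-tests
    smartctl -l error "$1"      # drive error log
}
```

Usage would be something like `start_selftest /dev/sdc long`, then `smart_report /dev/sdc` once the test has had time to finish. In the attribute table, Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count are worth a look, since a passed self-test does not rule out a cabling or controller problem.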

Both drives are around 2 years old, the same model but with significantly different serial numbers.

What should I be looking for?

To add insult to injury, I am leaving for 10 days at the end of the week, so I am nervous that (Murphy's law) it will die the day I leave.

Also, I am looking for someone in the Tacoma, WA area who is an experienced Linux admin (with Windows Small Business Server and 2000/2003 domains), preferably with SME experience, who can take over some of my work as a subcontractor, though it will be 6-8 months before I need to take on more help.

Thanks.

CharlieBrady
« Reply #1 on: August 11, 2009, 01:03:08 AM »
Quote from: jptechnical
I have an SME7.4 machine with a 500gig SATA mirror.  The mdadm rebuild event has happened 3 or 4 times in as many months, and more recently twice in a 3 week period.

I don't know exactly what you mean when you say "mdadm rebuild event has happened". I would guess you will get better suggestions if you provide more detail about what you have done and what you have seen.


jptechnical
« Reply #2 on: August 11, 2009, 01:18:40 AM »
A Rebuild20 event has been detected on md device /dev/md2.
A Rebuild40 event has been detected on md device /dev/md2.
A Rebuild60 event has been detected on md device /dev/md2.
A Rebuild80 event has been detected on md device /dev/md2.
A RebuildFinished event has been detected on md device /dev/md2.

By "mdadm rebuild event" I mean that the RAID device md2 was degraded and started an automatic rebuild.
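The current state of the array can also be checked directly rather than waiting for the emails. A minimal sketch, assuming the usual /proc/mdstat layout where a healthy two-disk mirror shows [UU] and a missing member shows up as an underscore; the function name is hypothetical:

```shell
#!/bin/sh
# Sketch only: list md arrays whose member status contains a "_",
# e.g. [U_] or [_U] instead of [UU]. degraded_arrays is a made-up name.
#   $1 = mdstat-style file to read (defaults to /proc/mdstat)
degraded_arrays() {
    awk '
        # Remember the array name from lines like "md2 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The status line follows; report the array if a member is missing
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}
```

Running `degraded_arrays` with no argument reads /proc/mdstat and prints nothing when every array is complete. For more detail on one array, `mdadm --detail /dev/md2` lists each member and whether it is active, faulty or rebuilding.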

CharlieBrady
« Reply #3 on: August 11, 2009, 05:52:53 AM »
Quote from: CharlieBrady
I would guess you will get better suggestions if you provide more detail about what you have done and what you have seen.

This is still true. In your last post, you haven't described anything that you have done or anything you have seen (although I guess you are quoting some subset of things you have seen in logs or emails).

I am not familiar with md RAID1 spontaneously starting auto-rebuild. Please tell us more.

jptechnical
« Reply #4 on: August 11, 2009, 07:38:21 AM »
Sorry, I don't know what else to say. I have not heard of spontaneous array rebuilds either, unless it is a Promise fake-RAID card :-D. I have just been seeing RAID rebuilds, several within a fairly short period and with no obvious cause. The server had been happily chugging along for over a year without even a hiccup. There was a power outage around the time of the last rebuild, but I have had hard resets before that didn't result in array rebuilds, so I don't want to assume the outage is the whole explanation.

I searched the forums for what to look for as far as triage during/after a disk failure, but the SMART scans are coming back clean. I know what to do in the case of a failed drive (for the most part, I don't really like proving this particular skill), whether I catch it as it is dying or after the fact. I am hesitant to break the array so I can fsck the drives if there isn't a good reason to do it.

Is there any kind of scheduled verify task that could be failing and resulting in a rebuild? What kind of IO errors should I look for in the message log? I guess I am looking for advice on proactively monitoring drive health, and on whether SMART is a reliable gauge. Perhaps I am not using the correct search terms; what should I be searching for? What else is out there, other than smartmontools, that I can use to check the health of a live system?

Stefano
« Reply #5 on: August 11, 2009, 08:54:41 AM »
Hi,
I would start by grepping /var/log/messages for sd[a|b|c] to find out why one of your hard drives is being kicked out of the RAID array.
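Something along these lines, as a rough sketch. The pattern is an assumption on my part: it covers sda through sdd (the drives in question were reported as sdc and sdd), md and raid1 messages, and libata port messages such as "ata3:". The function name is made up.

```shell
#!/bin/sh
# Sketch only: show log lines that mention the member disks, the md
# arrays, the raid1 driver or a libata port. raid_log_lines is a
# hypothetical name; adjust the drive letters to match the system.
#   $1 = log file to search (defaults to /var/log/messages)
raid_log_lines() {
    grep -E 'sd[a-d]|md[0-9]+|raid1|ata[0-9]+' "${1:-/var/log/messages}"
}
```

`raid_log_lines` searches the current log, and `raid_log_lines /var/log/messages.1` a rotated one, if the system keeps rotated logs under that name. The lines just before the first rebuild message are the interesting ones.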

ciao
Stefano