Koozali.org: home of the SME Server

RAID1 starts rebuilding for no reason?

DanB35
« on: August 11, 2014, 03:58:12 PM »
I've been running SME 9.0 for a few weeks without significant issues. However, in the last week or so, it has twice started rebuilding the array for no readily apparent reason. I didn't keep the admin emails from the first time this happened, but the most recent instance started at 1:00 am yesterday, and the fact that it was exactly on the hour makes me a little suspicious.

Some history, in case it's relevant: I had been running SME 8.1 on this machine, on mirrored disks. When I did the upgrade, I removed one of the disks, installed SME 9 and restored from my backup onto the other disk, and made sure it was working. Once it appeared to be working OK, I reinstalled the previously removed disk and used "manage redundancy" from the console menu to set up the mirror. Curiously, I did not receive any emails about the rebuild status as the disk was synced, but I was able to monitor it with /proc/mdstat and it finished without errors. Currently, mdstat indicates both disks are online:

Code:
[dan@e-smith ~]$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      255936 blocks super 1.0 [2/2] [UU]
     
md1 : active raid1 sdb2[2] sda2[0]
      1953126208 blocks super 1.1 [2/2] [UU]
      bitmap: 5/15 pages [20KB], 65536KB chunk

unused devices: <none>

Where should I start looking to see what triggered the rebuild?

stephdl
Re: RAID1 starts rebuilding for no reason?
« Reply #1 on: August 11, 2014, 05:21:20 PM »
If the message arrives every Sunday at 1:00 AM, then it is not a real RAID warning.

Please take a look at this bug report; the package is awaiting release.

http://bugs.contribs.org/show_bug.cgi?id=7748
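
As a quick sanity check on the "every Sunday at 1:00 AM" pattern, here is a minimal Python sketch. The function name and the five-minute tolerance are hypothetical, and the example date (August 10, 2014, the day before the first post) is inferred from "1:00 am yesterday", so treat it as an assumption rather than a confirmed log entry.

```python
from datetime import datetime


def matches_weekly_raid_check(timestamp, tolerance_minutes=5):
    """Return True if the timestamp falls on a Sunday shortly after 1:00 AM.

    The Sunday 1:00 AM schedule is taken from this thread; the tolerance
    is an arbitrary allowance for the delay before an email is sent.
    """
    is_sunday = timestamp.weekday() == 6  # Monday is 0, Sunday is 6
    in_window = timestamp.hour == 1 and timestamp.minute < tolerance_minutes
    return is_sunday and in_window


# Assumed time of the most recent "rebuild" email (day before the first post).
print(matches_weekly_raid_check(datetime(2014, 8, 10, 1, 0)))
```

If the emails line up with that schedule, the weekly check is the likely source rather than a disk problem.
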
See http://wiki.contribs.org/Koozali_Foundation
irc : Freenode #sme_server #sme-fr

!!! Please write your knowledge to the Wiki !!!

DanB35
Re: RAID1 starts rebuilding for no reason?
« Reply #2 on: August 11, 2014, 06:10:51 PM »
That bug is described as sending emails on routine checks, but it looks like the raid-check actually forces a rebuild of the array:

Code:
[root@e-smith log]# /usr/sbin/raid-check
^Z
[1]+  Stopped                 /usr/sbin/raid-check
[root@e-smith log]# bg
[1]+ /usr/sbin/raid-check &
[root@e-smith log]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      255936 blocks super 1.0 [2/2] [UU]
     
md1 : active raid1 sdb2[2] sda2[0]
      1953126208 blocks super 1.1 [2/2] [UU]
      [>....................]  check =  0.0% (1019072/1953126208) finish=638.5min speed=50953K/sec
      bitmap: 10/15 pages [40KB], 65536KB chunk

unused devices: <none>

Is this the same thing, or something different?

stephdl
Re: RAID1 starts rebuilding for no reason?
« Reply #3 on: August 11, 2014, 08:40:00 PM »
First, if you think it is a bug, you are welcome to report it in Bugzilla.

Now I would like to explain what is occurring (of course, I could be wrong).

Code:
[root@sme9 ~]# rpm -qf /usr/sbin/raid-check
mdadm-3.2.6-7.el6_5.2.i686

This is where the package comes from:

Code:
[root@sme9 ~]# yum info mdadm
Loaded plugins: fastestmirror, smeserver
Loading mirror speeds from cached hostfile
 * base: ftp.rezopole.net
 * smeaddons: mirror.hakkers.com
 * smeextras: mirror.hakkers.com
 * smeos: mirror.hakkers.com
 * smeupdates: mirror.hakkers.com
 * updates: ftp.rezopole.net
Installed Packages
Name        : mdadm
Arch        : i686
Version     : 3.2.6
Release     : 7.el6_5.2
Size        : 884 k
Repo        : installed
From repo   : updates
Summary     : The mdadm program controls Linux md devices (software RAID arrays)
URL         : http://www.kernel.org/pub/linux/utils/raid/mdadm/
License     : GPLv2+
Description : The mdadm program is used to create, manage, and monitor Linux MD (software
            : RAID) devices.  As such, it provides similar functionality to the raidtools
            : package.  However, mdadm is a single program, and it can perform
            : almost all functions without a configuration file, though a configuration
            : file can be used to help with some common tasks.

The script that is launched every Sunday at 1:00 AM is not ours (it comes from mdadm). It is stock CentOS code, and we cannot modify it if we want to stay CentOS-compatible.

Then we have an event handler, '/sbin/e-smith/mdevent', which is in charge of watching the events raised by mdadm. We need to patch that handler so that it does not send email while the /usr/sbin/raid-check script is running.

It is important to verify the state of your RAID regularly, and that is what you are seeing in your /proc/mdstat.
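
Besides /proc/mdstat, the kernel md driver also reports the current operation of each array in sysfs. Here is a minimal Python sketch, assuming the usual /sys/block/<device>/md/sync_action layout; the function name and the sys_block parameter are hypothetical and only there so the path can be overridden.

```python
from pathlib import Path


def read_sync_action(device, sys_block="/sys/block"):
    """Return the md sync action for a device (e.g. "idle" or "check").

    Returns None if the sysfs file cannot be read, for example when the
    device is not an md array or the path layout differs on this system.
    """
    path = Path(sys_block) / device / "md" / "sync_action"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # md0 and md1 are the array names shown earlier in this thread.
    for name in ("md0", "md1"):
        print(name, read_sync_action(name))
```

A value of "check" would mean a routine verification is running, while "recover" or "resync" would point to an actual rebuild.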


DanB35
Re: RAID1 starts rebuilding for no reason?
« Reply #5 on: August 13, 2014, 05:17:51 PM »
Certainly it's important to periodically verify the state of the array, and looking more closely at the mdstat output I posted, it does look like it's checking, rather than rebuilding, the array.  Should the patch instead check the mdstat output to determine if it's a "check" event vs. a "rebuild" event?
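
To illustrate the idea (this is only a rough Python sketch, not the proposed patch), the progress line in /proc/mdstat names the operation in progress, so it can be classified with a regular expression. The function name and the sample constant are hypothetical; the sample text is the mdstat output posted earlier in this thread, and the keyword list (check, resync, recovery, reshape) is what I understand the kernel to print.

```python
import re

# Sample taken from the mdstat output posted earlier in this thread.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      255936 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb2[2] sda2[0]
      1953126208 blocks super 1.1 [2/2] [UU]
      [>....................]  check =  0.0% (1019072/1953126208) finish=638.5min speed=50953K/sec
      bitmap: 10/15 pages [40KB], 65536KB chunk

unused devices: <none>
"""

DEVICE_RE = re.compile(r"^(md\d+)\s*:")
# The progress bar ends with "]" and is followed by the operation name.
PROGRESS_RE = re.compile(r"\]\s+(check|resync|recovery|reshape)\s*=")


def sync_actions(mdstat_text):
    """Map each md device to its current operation, or "idle" if none."""
    actions = {}
    current = None
    for line in mdstat_text.splitlines():
        device = DEVICE_RE.match(line)
        if device:
            current = device.group(1)
            actions[current] = "idle"
            continue
        progress = PROGRESS_RE.search(line)
        if progress and current is not None:
            actions[current] = progress.group(1)
    return actions


print(sync_actions(SAMPLE_MDSTAT))
```

With something like this, an email would only be warranted when the operation is "recovery" or "resync", not "check".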

stephdl
Re: RAID1 starts rebuilding for no reason?
« Reply #6 on: August 13, 2014, 05:34:48 PM »
Code:
print "Event: $event, Device: $device, Member: $member\n";

+if ($event =~ m#^Rebuild# && system( "ps -C raid-check" ) == 0 ) {
+    exit 0;
+}
+
 if ($event =~ m#^Rebuild|^Fail|^Degraded|^SpareActive#) {
     my $domain = $conf->get_value("DomainName") || 'localhost';
     my $user = "admin_raidreport\@$domain";


http://bugs.contribs.org/attachment.cgi?id=4664&action=diff

If '$event' starts with Rebuild AND the raid-check process is running, then mdevent exits without sending anything.

Otherwise, if '$event' starts with Rebuild OR Fail OR Degraded OR SpareActive, then mdevent sends an email to the sysadmin.
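
The same decision logic, restated as a small Python sketch so the two conditions can be tried in isolation. This is not the mdevent code itself; the function name is hypothetical, and the raid_check_running flag stands in for the "ps -C raid-check" test in the patch. The event names in the examples (RebuildStarted, Fail, NewArray) are ones I believe mdadm reports in monitor mode.

```python
import re

# Same patterns as in the patch above.
SUPPRESS_RE = re.compile(r"^Rebuild")
NOTIFY_RE = re.compile(r"^Rebuild|^Fail|^Degraded|^SpareActive")


def should_notify(event, raid_check_running):
    """Return True if an email should be sent for this mdadm event."""
    # Rebuild events raised while raid-check runs are routine: stay silent.
    if SUPPRESS_RE.search(event) and raid_check_running:
        return False
    # Everything else follows the original rule.
    return bool(NOTIFY_RE.search(event))


print(should_notify("RebuildStarted", raid_check_running=True))   # suppressed
print(should_notify("RebuildStarted", raid_check_running=False))  # reported
print(should_notify("Fail", raid_check_running=True))             # reported
```

Note that a Fail or Degraded event is still reported even while raid-check is running, since only Rebuild events are suppressed.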
« Last Edit: August 13, 2014, 05:36:19 PM by stephdl »