Koozali.org: home of the SME Server

RAID resynchronization does not end

Offline curdegn

RAID resynchronization does not end
« on: October 26, 2009, 08:28:18 AM »
Hi,

To test the RAID1 functionality I removed one of the two HDs while the SME Server was running. All works fine. To reinsert the HD I switched the SME Server off. After switching it back on, the resynchronization started automatically, but it never seems to finish. I do not know how close it gets to 100%, but it always starts again from the beginning. One resynchronization takes about 2 hours, so over the last 3 days the SME Server has already run through many, many resynchronizations...

Does anybody know what is going on and what to do?
Many thanks

SME Version 7.3, all updates included
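The open question in this post, how far each resynchronization gets before it restarts, can usually be answered from /proc/mdstat, which shows a progress line while an array is syncing. Below is a minimal Python sketch that pulls the percentage out of that text. The sample content is made up for illustration (only the array name md2 and the block count come from this thread), and the function name is hypothetical.

```python
import re

# Illustrative /proc/mdstat text for a RAID1 array that is resyncing.
# Only "md2" and the block count come from the thread; the rest is invented.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md2 : active raid1 sdb2[1] sda2[0]
      488279488 blocks [2/2] [UU]
      [=================>...]  resync = 89.5% (437010142/488279488) finish=12.3min speed=69000K/sec
unused devices: <none>
"""

ARRAY_HEADER = re.compile(r"^(md\d+)\s*:")
PROGRESS = re.compile(r"(resync|recovery)\s*=\s*([0-9.]+)%")


def resync_progress(mdstat_text):
    """Return {array_name: percent} for arrays that are resyncing or recovering."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_HEADER.match(line)
        if header:
            # Remember which array the following indented lines belong to.
            current = header.group(1)
            continue
        found = PROGRESS.search(line)
        if found and current:
            progress[current] = float(found.group(2))
    return progress


if __name__ == "__main__":
    print(resync_progress(SAMPLE_MDSTAT))
```

On a live server the same function could be fed the contents of /proc/mdstat; logging the value every few minutes would show at which percentage the resync restarts.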



Offline Stefano

Re: RAID resynchronization does not end
« Reply #1 on: October 26, 2009, 08:35:27 AM »
hi

you should:
- check your /var/log/messages for errors... you'll find the reason
- upgrade your SME to 7.4

Offline curdegn

Re: RAID resynchronization does not end
« Reply #2 on: October 26, 2009, 08:50:04 AM »
Thanks for the advice.

Here is an extract from /var/log/messages that could be of interest:
Oct 26 06:58:04 server2 kernel: ata1: EH complete
Oct 26 06:58:04 server2 kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 26 06:58:04 server2 kernel: ata1.00: (BMDMA stat 0x24)
Oct 26 06:58:04 server2 kernel: ata1.00: cmd 25/00:20:2d:82:22/00:00:34:00:00/e0 tag 0 cdb 0x0 data 16384 in
Oct 26 06:58:04 server2 kernel:          res 51/40:00:31:82:22/40:00:34:00:00/e0 Emask 0x9 (media error)
Oct 26 06:58:04 server2 kernel: ata1.00: configured for UDMA/133
Oct 26 06:58:04 server2 kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Oct 26 06:58:04 server2 kernel: Info fld=0x4000000 (nonstd), Invalid sda: sense = 72 11
Oct 26 06:58:04 server2 kernel: end_request: I/O error, dev sda, sector 874676781
Oct 26 06:58:04 server2 kernel: ata1: EH complete
Oct 26 06:58:04 server2 kernel: raid1: sda: unrecoverable I/O read error for block 874467840
Oct 26 06:58:04 server2 kernel: SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
Oct 26 06:58:04 server2 kernel: SCSI device sda: drive cache: write back
Oct 26 06:58:04 server2 kernel: SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
Oct 26 06:58:04 server2 kernel: SCSI device sda: drive cache: write back
Oct 26 06:58:04 server2 kernel: md: syncing RAID array md2

"raid1: sda: unrecoverable I/O read error for block 874467840", could this be the reason? Can it be fixed?

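When reading an extract like this, it helps to note which device name the kernel actually reports. The following Python sketch, using three lines copied from the log above, counts error lines per device. The regular expressions are written only against the message formats visible in this extract, so treat them as assumptions rather than a general parser.

```python
import re
from collections import Counter

# Lines copied from the log extract in this post.
LOG_LINES = [
    "Oct 26 06:58:04 server2 kernel: end_request: I/O error, dev sda, sector 874676781",
    "Oct 26 06:58:04 server2 kernel: raid1: sda: unrecoverable I/O read error for block 874467840",
    "Oct 26 06:58:04 server2 kernel: md: syncing RAID array md2",
]

# Patterns matching the two error formats seen above (assumed, not exhaustive).
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")
RAID_ERROR = re.compile(r"raid1: (\w+): unrecoverable I/O read error for block (\d+)")


def failing_devices(lines):
    """Count error lines per device name, e.g. {'sda': 2}."""
    counts = Counter()
    for line in lines:
        for pattern in (IO_ERROR, RAID_ERROR):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for device, count in failing_devices(LOG_LINES).items():
        print("%s: %d error line(s)" % (device, count))
```

In this extract every error line names sda.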
Offline Stefano

Re: RAID resynchronization does not end
« Reply #3 on: October 26, 2009, 09:06:45 AM »
Quote from curdegn:
"raid1: sda: unrecoverable I/O read error for block 874467840", could this be the reason? Can it be fixed?

Yes, this is the reason, and the only fix is to replace that HD as soon as possible.

Offline chris burnat

Re: RAID resynchronization does not end
« Reply #4 on: October 26, 2009, 09:51:21 AM »
Quote from curdegn:
I removed one of the two HDs while the SME Server was running. All works fine.

I do not think that "all works fine". Disconnecting a drive from a running server is asking for trouble unless you have a caddy with a mechanism that cuts the power supply to the drive before it is disconnected. And even then, things may go wrong... Do not repeat this operation; just switch off the server, then remove the drive. It is as good a test of RAID functionality as any.

As regards repairing the drive, you could try to fix it; it is tedious but worth a try, I guess... To quote Mary in a recent post: "get the manufacturer's test software, e.g. for Seagate drives use SeaTools. You can also get a lot of test software on the Ultimate Boot CD (UBCD), google for it. http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=201271"

- chris
If it does not work out of the box, please fill in a Bug Report @ Bugzilla (http://bugs.contribs.org)  - check: http://wiki.contribs.org/Bugzilla_Help .  Thanks.

Offline curdegn

Re: RAID resynchronization does not end
« Reply #5 on: October 26, 2009, 10:14:56 AM »
Thanks for all replies.

I have already ordered a new one, but I will also try to fix the old one to learn something.
It is such a nice server case, equipped with those nice hot-swap HD bays... I couldn't resist trying it out...


Offline janet

Re: RAID resynchronization does not end
« Reply #6 on: October 26, 2009, 01:22:52 PM »
curdegn

sme 7.4 kernel does not support hot swapping of drives.

The drive may not be faulty, but the data on it probably is.
Test the drive with software referred to earlier. Also before you reconnect a used drive to an array you should delete the existing partitions. Do this with a dd command, see the Raid Howto, or download delpart.exe and boot to a floppy or CD and delete partitions that way.
Please search before asking, an answer may already exist.
The Search & other links to useful information are at top of Forum.

Offline curdegn

Re: RAID resynchronization does not end
« Reply #7 on: October 27, 2009, 10:31:37 PM »
I have bad news to report:
First I deleted all partitions of the faulty HD (sdb) with the dd command, but the problem still remains. As soon as the resynchronization process comes close to 100% it starts from the beginning again...

Then I replaced the faulty HD (sdb) with a new one. Still the same problem:
ata1: EH complete
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x24)
ata1.00: cmd 25/00:f0:dd:7f:22/00:02:34:00:00/e0 tag 0 cdb 0x0 data 385024 in
         res 51/40:00:31:82:22/40:00:34:00:00/e0 Emask 0x9 (media error)
ata1.00: configured for UDMA/133
SCSI error : <0 0 0 0> return code = 0x8000002
Info fld=0x4000000 (nonstd), Invalid sda: sense = 72 11
end_request: I/O error, dev sda, sector 874676189
ata1: EH complete
raid1: sda: unrecoverable I/O read error for block 874467328
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
SCSI device sda: drive cache: write back
md: syncing RAID array md2
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
md: using 128k window, over a total of 488279488 blocks.

Do I understand this message correctly? Is sda the problem? I never touched sda.
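A quick piece of arithmetic on the numbers in the two log extracts shows where on sda the read errors sit. The Python sketch below divides the failing sector numbers by the total sector count the kernel reports for sda. It assumes the sector numbers in the "end_request" lines are absolute positions on the disk.

```python
# Values taken from the kernel messages quoted in this thread.
TOTAL_SECTORS = 976773168                # "SCSI device sda: 976773168 512-byte hdwr sectors"
FAILED_SECTORS = [874676781, 874676189]  # from the two "end_request: I/O error" lines


def position_percent(sector, total_sectors):
    """Return how far into the disk a sector lies, as a percentage."""
    return 100.0 * sector / total_sectors


if __name__ == "__main__":
    for sector in FAILED_SECTORS:
        print("sector %d lies %.1f%% of the way through sda"
              % (sector, position_percent(sector, TOTAL_SECTORS)))
```

Both sectors fall just under 90% of the way through the disk, which would be consistent with a resync that runs for a long time and then restarts late in the pass.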

PS: I am using SME-Server 7.4, not 7.3 as mentioned above
« Last Edit: October 27, 2009, 10:41:07 PM by curdegn »

Offline chris burnat

Re: RAID resynchronization does not end
« Reply #8 on: October 27, 2009, 11:50:13 PM »
From your latest report, I understand the following:
- /dev/sda is the drive with data from which you boot
- /dev/sdb was the drive you disconnected and subsequently replaced, suspected faulty.

This is a different scenario. If correct, your logs suggest that /dev/sda has [also] been corrupted. If that is the case, there is no possibility of re-establishing a mirror. The first thing would be to check /dev/sda on its own and see if it can be fixed. Then worry about re-establishing the mirror.

Please proceed with caution - have you got a backup of your system/data?
If not, you should try to backup your data before doing anything more with this box.
- chris

Offline curdegn

Re: RAID resynchronization does not end
« Reply #9 on: October 29, 2009, 01:06:21 PM »
The data is fine. It is updated 3 times a day by affa and still is working fine.

Since I only touched "sdb", I was kind of blind and did not see that actually the untouched "sda" is the problem....

I plan now to do the following:
1. Try to fix "sda" as mentioned above by Chris. Can I use fsck? Any other tools?
2. If 1 does not work, set up a new SME server using the affa backups according to the affa howto.
Any other advice is more than welcome.

Many thanks

Offline chris burnat

Re: RAID resynchronization does not end
« Reply #10 on: October 29, 2009, 06:32:58 PM »
How you proceed from this point very much depends on how important the data is, and your level of skills.

I would play it safe, and go the other way around.  Store /dev/sda away for a little while and verify whether the backup is valid.  I imagine you do not have a lot of experience testing/fixing drives - and even if you had, there is no room for mistakes.

Given that you have been down for a few days now, you can spend another few hours trying to rebuild your server by reinstalling 7.4 from CD and proceed with a restore from Affa.  You have a new drive, use it for this purpose. If you have the resources, go and buy yourself a second identical drive and start with a mirror, and since drives are cheap, consider using 3 drives, one being a spare - SME supports this out of the box - nice.  If you do not have the resources, just start with one drive, and add the mirror later - if you do this, keep in mind that there is no way of adding a spare later on.

An Affa restore is pretty straightforward and well documented. Just take it one step at a time: clean your new drive just in case, reinstall, and ensure that the server is up and running (check basic functionality, LAN and WAN access, and the logs). Do not make any modifications to your system, such as installing contribs, before the restore. Restore if all is OK. If it works, all you have to do is reinstall your contribs, which is not a big deal. Then you are safe, and can even experiment with the old sda drive for future reference.

One more thing: after restoring from Affa onto the new system, check your data carefully for integrity. I assume that you will restore from the latest backup, and this data could be corrupt in part. For one thing, we do not know when the file system became faulty on /dev/sda. Affa may have performed backups after that point. If that is the case, you can try to recover files from previous backups, which is tedious but feasible.

If your backup fails for some reason, you have your old drive to fall back onto.  I would first of all reinstall this drive in the server and try to suck out the data - you can do this by performing a backup to USB, or transferring whatever you can access onto a workstation.  Or both. Then worry about fixing the old drive if it can be done.

Hope it helps, good luck.

- chris

Offline curdegn

Re: RAID resynchronization does not end [SOLVED]
« Reply #11 on: October 31, 2009, 07:02:44 PM »
Thank you very much for taking the time to explain your suggestions in such detail. The system is now up and running again. Here is what I did:

1. Wait until nobody is in the office any more (on Friday evening :-))
2. Make sure the SME server is up to date, with all updates installed
3. Let Affa do a last backup to the USB drive
4. Shut down the server and take out sda. Sda is the HD that causes the problems with the RAID resynchronization, but it is also the disk that contains the running system including all data. I stored it in a safe place, so that if something went wrong I could reinstall it on Sunday evening and work on Monday would be assured.
5. Install two fresh HDs. I used the "dd" command as explained in the Raid Howto to clear the HDs.
6. Install SME Server 7.4 from CD with the same parameters as the original server
7. Perform all software updates
8. Install the Affa backup system http://wiki.contribs.org/Affa
9. Set up the identical affa job as on the original server. The affa job setup scripts can be found in the archive directories on the backup USB drive
10. Let affa do a full restore: #affa --full-restore <JOB>
11. As soon as the restore is complete the server reboots automatically. Afterwards I got my server back with all configurations, as if nothing had happened, great! Only the contribs are missing, but reinstalling them is no big deal.

Remark:
The Affa how-to mentioned above recommends using the --rise option for such a case; see the section "Restore from USB drive on new server", near the bottom of the page. This did not work for me. After I gave the --rise command, affa stopped immediately with an error message saying that one cannot use the "--rise" option to rise from an archive that was made from the very same server...

Thanks again for all the support, was a great help.
« Last Edit: November 02, 2009, 08:32:24 AM by curdegn »

Offline chris burnat

Re: RAID resynchronization does not end
« Reply #12 on: October 31, 2009, 10:39:12 PM »
Good to hear, I just found myself in the same situation last week with some 140GB of essential data, had a backup, so it was fresh in my mind.

"Remark:
Above mentioned affa how-to recommends to use the --rise option for such a case, see section "Restore from USB drive on new server", quite at the bottom of the page. This way did not work for me. After giving the --rise command affa stops immediately with the error message that one cannot use the "--rise" option to rise from an archive that was made from the very same server..."

My reading suggests that this is a special case. This routine does not apply to your situation and should not be interpreted as a general recommendation when restoring a full system onto a new server from a USB drive. Actually, the title of this entry in the wiki [Restore from USB drive on new server] is not descriptive enough. If you have time, raise a bug against the Affa contrib and point this out.

Finally, you may wish to add [SOLVED] in the subject line of this post.

- chris