Koozali.org: home of the SME Server

RAID1 failure; data missing

Offline Brenno

  • ****
  • 208
  • +0/-0
RAID1 failure; data missing
« on: November 18, 2008, 05:01:06 AM »
Ok, so let's set aside for a minute that I'm still using a 6.x release in one of my installations... it's been running just peachy as an email server (until now) and we were actually just about to replace the machine entirely with new hardware and the latest version of SME.  Anyway...

I have 4 SCSI drives in this box, 0 and 1 were in RAID1, 2 and 3 were connected but not used.  1 failed, and then 4 shortly thereafter.  So, I followed a howto on here to partition 3 to match 0 and then rebuild the raid.  Seemed to work, but upon reboot, 3 is not a member of the raid!

More perplexingly, when I was able to boot into degraded mode, I mysteriously lost almost 2 months of data... nothing between September 24th and today exists in user folders.  I thought RAID1 was supposed to mirror the drives, so upon failure of one, the other had a complete copy... this appears to not be the case!  I was able to do a 911 backup, but suspect that this, too, will be missing the data.

I really just need to do two things: 1) have this box run for another 10 days until the new hardware is in place, and 2) recover the missing data.

Hoping the forum experts can provide some assistance (and go easy on the chastising for the old version still in service issue...)

Appreciated in advance.

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #1 on: November 18, 2008, 02:40:37 PM »
Now I am even more confused as it appears that this server has gone back in time to September 24th:

- user accounts added since then have disappeared
- user accounts removed since then have reappeared with data intact
- domains added since then have disappeared
- passwords changed since then have reverted back to what they were
- deleted emails are restored

Is it possible that the server was storing all new/changed data since 09/24 on the failed drive?

My game plan now is to continue to use the server as-is in degraded mode until it can be replaced next week.  If anyone has any better ideas, I'm excited to hear them.

Thanks.

Offline David Harper

  • *
  • 653
  • +0/-0
  • Watch this space
    • Workgroup Technology Solutions
Re: RAID1 failure; data missing
« Reply #2 on: November 19, 2008, 02:53:31 AM »
Sounds like rip-n-replace is the best option you have right now. Were you using a tape backup that might have the missing data on it?

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #3 on: November 19, 2008, 03:09:28 AM »
I do have a valid and complete backup thanks to dmay's backup2ws contrib.

As for the time warp, turns out the mirroring of the drives must have broken on 09/24 without us realizing it.  When the mirroring broke, the server picked one of the two drives to continue on with, and sadly it was the drive it chose that ultimately failed.  The other drive simply wasn't being updated, so it's "stuck in time" back on 09/24.

I'm curious, as I know I have rebooted that server several times between 09/24 and now, but never once did I see any indication that there was an issue with the RAID (and I religiously watch the messages stream by on startup).
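For anyone wanting to catch a silently broken mirror sooner, here is a minimal sketch of a check that could run from cron. It assumes Linux software RAID (md), where /proc/mdstat marks a missing member with an underscore in the status field (for example [U_]); a hardware controller would not show up there at all. The function name and the sample text are made up for illustration.

```python
import re

# Sample /proc/mdstat content (illustrative only): md1 is healthy,
# md2 has lost one member of its mirror.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      104320 blocks [2/2] [UU]

md2 : active raid1 sda3[0]
      78043648 blocks [2/1] [U_]

unused devices: <none>
"""


def find_degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member status looks like [UU] (healthy) or [U_] (degraded).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    # On a real box you would pass open("/proc/mdstat").read() instead.
    print(find_degraded_arrays(SAMPLE_MDSTAT))
```

If the returned list is not empty, the script could mail the admin, which would have flagged a mirror that broke back in September long before the surviving drive died.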

Offline David Harper

  • *
  • 653
  • +0/-0
  • Watch this space
    • Workgroup Technology Solutions
Re: RAID1 failure; data missing
« Reply #4 on: November 19, 2008, 07:00:39 AM »
I'm not sure you'd see much at startup, especially if the array is hardware-based. In SME 7.3, there's a console panel that lets you manage RAID arrays.

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #5 on: November 20, 2008, 10:37:08 PM »
Here are my thoughts on data recovery for this... going to set up a fresh install of 7.3 (with all updates) on a new box.  I have three data sets, all containing various stages that I'd ideally like to combine.  Note that these backups are user files only (i.e. /home/e-smith/files/users and below):

Set1 = all files up to 9/24 and after 11/17 (current 6.x box, crippled as it may be)
Set2 = all files up to 11/16 (partial backup; not all users included)
Set3 = all files up to 11/9 (last complete backup)

So, I'm thinking that I could use rsync to first "prime" the 7.3 box with Set3. Then, I'll run rsync against Set2 to get what exists there but not in Set3, and finally, I run rsync on Set1 to get what exists there but is not in Set3 or Set2.  When all this is done, I will need to correct the permissions and whatnot.

I believe that if I run rsync with the correct options, I should be able to end up with nearly all the data (save for a few users whose 11/9 to 11/16 data was not included in Set2).
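The layering described above (Set3 first, then only what is missing from Set2, then only what is missing from Set1) can be sketched in Python as below. This is only an illustration of the idea, not a tested migration tool: the function name and the paths in the usage comment are hypothetical. With rsync itself, the closest equivalent is probably running it once per set with the --ignore-existing option.

```python
import os
import shutil


def merge_layers(sources, dest):
    """Copy files from each source tree into dest without overwriting.

    Sources are applied in order, so a file found in an earlier source
    is never replaced by a same-named file from a later one.
    Returns the relative paths of the files that were copied.
    """
    copied = []
    for src in sources:
        for root, _dirs, files in os.walk(src):
            rel_dir = os.path.relpath(root, src)
            target_dir = os.path.normpath(os.path.join(dest, rel_dir))
            os.makedirs(target_dir, exist_ok=True)
            for name in files:
                target = os.path.join(target_dir, name)
                if os.path.exists(target):
                    continue  # an earlier set already supplied this file
                # copy2 keeps timestamps; ownership still has to be fixed afterwards
                shutil.copy2(os.path.join(root, name), target)
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return copied


# Hypothetical usage, most trusted set first:
# merge_layers(["/mnt/set3", "/mnt/set2", "/mnt/set1"], "/home/e-smith/files/users")
```

Matching is by path and file name only, so a Maildir message whose flags changed between backups (and therefore has a different file name) would be copied a second time.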

Does anyone here see any flaws in this logic?  Am I missing something?

Offline pfloor

  • ****
  • 889
  • +1/-0
Re: RAID1 failure; data missing
« Reply #6 on: November 21, 2008, 02:40:25 AM »
There is a lot more than just home directories to consider when trying to restore a system. Going from 6.x to 7.x is complicated and the upgrade process is quite extensive under the covers.  It's not just "load and copy".

-How many users are on the old system and how do you intend to create them on the new one?
-Is there a lot of email?
In life, you must either "Push, Pull or Get out of the way!"

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #7 on: November 21, 2008, 02:58:46 PM »
We have <60 users on this system, all running in IMAP accounts, with about 12GB of data in total.  We aren't storing anything in the home folders, just the Maildir at this point.  My plan is to recreate the users manually as we will do a password change as the new accounts are created.

I realize that, due to the way the mail files are altered and suffixed by doing things like marking for deletion and forwarding, I am likely to get some duplication of emails in different states, but I think the users can purge these duplicates and would be happier to have them than no email at all.

My plan B is just to use rsync to prime the data with Set3 and have users move/copy any post 11/17 emails from Set1 manually to their new account (running both servers concurrently for a short period).

Offline David Harper

  • *
  • 653
  • +0/-0
  • Watch this space
    • Workgroup Technology Solutions
Re: RAID1 failure; data missing
« Reply #8 on: November 21, 2008, 10:24:01 PM »
You can use lat-users (Lazy Admin Tools) to batch add your 60 users from a CSV file, which should save you some heartache.

Offline janet

  • ****
  • 4,812
  • +0/-0
Re: RAID1 failure; data missing
« Reply #9 on: November 22, 2008, 09:09:21 AM »
Brenno

It will save you a good bit of manual work if you upgrade the sme6 server to sme7 and then manually move the Maildirs etc. to the new sme7 server.
Quite a few routines get run during the 6>7 upgrade process to rename folders etc to suit sme 7.
Please search before asking, an answer may already exist.
The Search & other links to useful information are at top of Forum.

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #10 on: November 24, 2008, 03:05:15 PM »
My level of confidence in the 6.x box surviving the upgrade to 7.x is not high.  Perhaps I should go the other way around and start the new hardware on 6.x, migrate the users and data, then perform an upgrade to 7.x?

I suppose I could restore the good backup (Set3) using the backup2ws contrib, then take care of the rest of the files manually.

Offline janet

  • ****
  • 4,812
  • +0/-0
Re: RAID1 failure; data missing
« Reply #11 on: November 25, 2008, 01:24:50 AM »
Brenno

Quote
My level of confidence in the 6.x box surviving the upgrade to 7.x is not high.

If you follow the instructions here (exactly too)
http://wiki.contribs.org/UpgradeDisk

then you retain your fully working sme6 disk, and can easily go back to sme6, in the event that you run into unresolvable problems during the upgrade to sme7.

sme6 to sme 7 upgrades are supported, so it should go OK as long as you remove incompatible contribs and custom templates and user custom templates (as per instructions).

The only area I'm aware of that seemed to be a problem in some cases was workstation machine accounts on sme6 not being recognised under sme7, necessitating the removal of a Windows workstation from a (sme) domain (controller) and the rejoining of it to the domain.
« Last Edit: November 25, 2008, 01:46:30 AM by mary »

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #12 on: November 25, 2008, 04:23:29 AM »
My confidence (or lack thereof) is more indicative of the current state of the hardware, not the software...  two failed drives and a bad power supply fan means that I've been nursing this box for the past week and don't want to introduce any more "stress" than I have to :)

I think I'll run with my latest plan (start with 6.x, restore, upgrade) as this will be easier given that I won't have any extra contribs or custom templates present when I start with the clean install of 6.x

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #13 on: November 25, 2008, 05:15:18 PM »
Quote
You can use lat-users (Lazy Admin Tools) to batch add your 60 users from a CSV file, which should save you some heartache.

Of course, everything so far has gone wrong, so I shouldn't be surprised that both my 6.0b3 and 6.0 Final install disks failed to load on the new hardware.  Had to install 7.3 directly, so my plan has been scrapped and is now being written on the fly :(

I did use the lat-dump to get info from the 6.x box I was running and manually modified the files to add the additional users which had been lost in the failed mirror.  Now, can I use these scripts to run the lat-restore on the 7.3 box, or are the differences in the two systems too great for this?

Offline janet

  • ****
  • 4,812
  • +0/-0
Re: RAID1 failure; data missing
« Reply #14 on: November 26, 2008, 03:17:18 AM »
Brenno

Quote
Had to install 7.3 directly, so my plan has been scrapped and is now being written on the fly

So just restore directly from the mounted sme6 disk to a fresh install of sme7 (as per previous link provided) and make life easy for yourself.

Offline Brenno

  • ****
  • 208
  • +0/-0
Re: RAID1 failure; data missing
« Reply #15 on: December 01, 2008, 04:41:01 PM »
In the end, what I did was use WinRAR on a local workstation to extract and merge together the four partial backups I had.  This resulted in capturing about 99% of the email.  Then, due to the change in IMAP folder delimiters from ";" to ".", I used a freeware program called RenameMaster to replace all instances of ; with . in my users' folders.  Once this was complete, I copied all the data to an ibay and finally moved it to the correct location on the 7.3 server (since updated to 7.4).  After that, all I needed to do was remove the subscriptions file from the Maildir and use chown -R to correctly set the ownership of each user's files.
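The delimiter rename could also be done directly on the server rather than with a Windows tool. Below is a rough sketch, assuming the IMAP folders are plain directories under each user's Maildir and that only directory names contain the ";" character. The function name and the path in the usage comment are hypothetical, and it is worth trying on a copy first.

```python
import os


def rename_folder_delimiters(maildir, old=";", new="."):
    """Rename every directory under maildir whose name contains `old`.

    Returns a list of (old_path, new_path) pairs for the renames done.
    """
    renamed = []
    # topdown=False visits the deepest directories first, so renaming a
    # parent never invalidates the path of a child still to be visited.
    for root, dirs, _files in os.walk(maildir, topdown=False):
        for name in dirs:
            if old not in name:
                continue
            src = os.path.join(root, name)
            dst = os.path.join(root, name.replace(old, new))
            if os.path.exists(dst):
                continue  # name collision: leave it for manual review
            os.rename(src, dst)
            renamed.append((src, dst))
    return renamed


# Hypothetical usage for one user:
# rename_folder_delimiters("/home/e-smith/files/users/someuser/Maildir")
```

Removing the subscriptions file and running chown -R would still be needed afterwards, as described above.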

Since my last backup set was 11/23 and the new hardware wasn't prepared until 11/25, I used Thunderbird to search for any emails dated after 11/23 and manually copied them from the old accounts to the new accounts as they were set up on the desktop.  This process took about 5 - 10 minutes per account.

So far everything seems to be working well, though I am still looking to fill some of the gaps I have due to 6.x contribs not being available for 7.x (most notably the mailblocking contrib).

Yes, this was a pain in the.... but it worked and took less than a day for 60 users and 12+ GB of email.

Offline David Harper

  • *
  • 653
  • +0/-0
  • Watch this space
    • Workgroup Technology Solutions
Re: RAID1 failure; data missing
« Reply #16 on: December 03, 2008, 06:29:10 AM »
Wow, sounds like you had quite a bit of excitement there 8-)

From what I recall, the mail blocking contrib was written by Dungog Networks. You might want to take a look at their commercial email contrib.