Koozali.org: home of the SME Server

Problem replacing new disk in RAID

Offline judgej

  • *
  • 375
  • +0/-0
Problem replacing new disk in RAID
« on: April 05, 2011, 04:39:41 PM »
I am in the process of upgrading my disks from 500GB to 1TB. I have removed the second disk from the RAID and replaced it with a brand new 1TB disk. When I go into the disk redundancy screen I get the following:
 
Code: [Select]
               ┌──────Disk redundancy status as of Tuesday April  5, 2011 15:31:24────────┐
               │ Current RAID status:                                                     │
               │                                                                          │
               │ Personalities : [raid1]                                                  │
               │ md2 : active raid1 sda2[0]                                               │
               │       488279488 blocks [2/1] [U_]                                        │
               │ md1 : active raid1 sda1[0]                                               │
               │       104320 blocks [2/1] [U_]                                           │
               │ unused devices: <none>                                                   │
               │                                                                          │
               │                                                                          │
               │ The free disk count must equal one.                                      │
               │                                                                          │
               │ Manual intervention may be required.                                     │
               │                                                                          │
               │ Current disk status:                                                     │
               │                                                                          │
               │ Installed disks: sdc sda sdb                                             │
               │ Used disks: sda                                                          │
               └──────────────────────────────────────────────────────────────────────────┘

I am not sure what the "manual intervention" is, and there is no reference to it in the wiki. I am also not sure why it lists three disks as installed, because there are only two at the moment (SATA0 - the original 500GB disk, SATA2 - the new 1TB disk; SATA1 and SATA3 have no disks and are disabled in the BIOS).

Any idea what I need to do to get the new disk into the RAID?
-- Jason

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #1 on: April 05, 2011, 04:49:11 PM »
I don't know if this is any use:

Code: [Select]
[root@sme ~]# ls /dev/sd*
/dev/sda  /dev/sda1  /dev/sda2  /dev/sdb  /dev/sdc  /dev/sdc1

The original RAID used disks sda and sdb - I'm not sure how they map onto the SATA channels. What sdc is here, I really don't know.

-- Jason

Offline CharlieBrady

  • *
  • 6,918
  • +3/-0
Re: Problem replacing new disk in RAID
« Reply #2 on: April 05, 2011, 06:55:56 PM »
What sdc is here, I really don't know.

Well, you'll have to find out. That's what is preventing the console tool from doing its job.

Try:

/sbin/hdparm -I /dev/sdc

Offline jumba

  • *****
  • 291
  • +0/-0
  • Donations: July 2007 - $ 20.00
    • Smeserver på svenska!
Re: Problem replacing new disk in RAID
« Reply #3 on: April 05, 2011, 07:56:07 PM »
A connected USB-disk maybe?

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #4 on: April 05, 2011, 09:15:18 PM »
A connected USB-disk maybe?

Ah yes, there is a USB disk connected for backups. Not sure whether this is it or not though:

Code: [Select]
[root@sme ~]# /sbin/hdparm -I /dev/sdc
/dev/sdc:
 HDIO_DRIVE_CMD(identify) failed: Invalid argument

Running this for drives sda and sdb gives me a couple of pages of reasonable-looking data on the drives.
« Last Edit: April 05, 2011, 09:17:47 PM by judgej »
-- Jason

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #5 on: April 05, 2011, 11:09:07 PM »
Digging a bit deeper, yes, I believe /dev/sdc *is* the USB drive.
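For the record, a generic way to double-check which physical device sdc is (nothing SME-specific; this assumes the udev by-id symlinks are present on this kernel):

Code: [Select]
# USB drives normally show up with a "usb-" prefix in the by-id links
ls -l /dev/disk/by-id/ | grep sdc

# The kernel log also shows on which bus the device was registered
dmesg | grep -i sdc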

Now, is there a way to move it out of the way programmatically (i.e. remotely) allowing me to set the RAID to build overnight, or is the only option to physically unplug it (a job for tomorrow, when I'm back in the office)?

Alternatively, perhaps there is a way to add a disk to the RAID array without going through the "manage disk redundancy" screen, which seems to have a problem with there being more than two disks in the system? I am guessing that could be a bit fiddly and - dare I say it - risky. It does make me wonder what will happen when I plug in the hot standby...?
« Last Edit: April 05, 2011, 11:17:51 PM by judgej »
-- Jason

Offline CharlieBrady

  • *
  • 6,918
  • +3/-0
Re: Problem replacing new disk in RAID
« Reply #6 on: April 06, 2011, 04:47:09 AM »
Alternatively, perhaps there is a way to add a disk to the RAID array without going through the "manage disk redundancy" screen, ...

Yes, it's possible to manually partition the new drive, and manually add the partitions to the RAID mirrors.
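Roughly like this (a sketch only - it assumes the new disk really is /dev/sdb and that the existing layout on /dev/sda matches the /proc/mdstat output above, so double-check device names before running anything destructive):

Code: [Select]
# Copy the partition table from the good disk to the new one
# (destructive to /dev/sdb - be sure sdb is the new, empty disk)
sfdisk -d /dev/sda | sfdisk /dev/sdb

# Add the new partitions to the existing mirrors
mdadm /dev/md1 --add /dev/sdb1
mdadm /dev/md2 --add /dev/sdb2

# Watch the rebuild progress
cat /proc/mdstat

# Once the sync has finished, make the new disk bootable
grub --batch <<EOF
device (hd1) /dev/sdb
root (hd1,0)
setup (hd1)
quit
EOF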

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #7 on: April 06, 2011, 11:50:04 AM »
Okay, I managed to unplug the backup disk, and the "manage disk redundancy" screen immediately started working as expected (a "there is a spare disk, do you want to use it for RAID" kind of message).

I'm not sure whether this should be considered a bug. The disks that form the RAID array did not change between the screen refusing to set up the RAID and then allowing it. I know it is not a situation that happens a lot, but being able to install all the new disks and then manage the switchover to the new RAID remotely would be a real boost. As it is now, at least three site visits are needed to swap disks at the end of each stage, at least when using the standard SME admin screens.

Thanks for your help.

-- Jason
« Last Edit: April 06, 2011, 12:05:21 PM by judgej »

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #8 on: April 07, 2011, 10:53:05 AM »
Well, as the new disk is synchronising itself into the array, the sync process is finding the odd unrecoverable read error on the original disk. I guess this is turning out to be a good time to replace those disks anyway (Hitachi HDs from 2008).

It does make me wonder, though - taking a disk out to replace it with a new one is probably not the best method. Synchronising a new disk into the array is IMO best done while *both* original disks are in place, so that reading the data they contain is a bit more resilient. The unrecoverable sector errors I am getting now represent lost data - probably stuff I will never find again, and may not miss, but who knows? If both original RAID disks were still in the machine, then a read error on one would still give the OS a chance to read the other disk to try to recover the sector. With one disk there is no chance.

I am guessing these read errors are in sectors of the disk that have not been read in a long time (only read now that I am synchronising), otherwise the data would already have been moved to good sectors.
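As an aside, this is the kind of latent error that a periodic array check would have flushed out earlier. A minimal sketch, assuming the kernel exposes the md sysfs interface (I haven't checked whether the stock SME kernel does):

Code: [Select]
# Ask md to read and compare every sector of the mirror in the background
echo check > /sys/block/md2/md/sync_action

# Progress appears in /proc/mdstat; read errors end up in the kernel log
cat /proc/mdstat
cat /sys/block/md2/md/mismatch_cnt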
« Last Edit: April 07, 2011, 11:17:44 AM by judgej »
-- Jason

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #9 on: April 07, 2011, 01:09:19 PM »
My hunch was correct: this method of upgrading a disk is *not* such a good idea.

The synchronisation gets about 98.4% of the way through, then finds some read errors on the original disk. It hangs everything for about three minutes, then simply jumps back to the beginning, synchronising from 0% again. I guess this is now in an endless loop until I can do something about the bad sectors at the end of the disk.

What do I do next? Should I put the old second RAID disk back in for now and try to recover all the data (that won't risk jumping back two days by putting an out-of-date disk back in, will it)? Should I take the server down and scan the single disk with the data on it to find and fix hardware errors? Is there something else I should do now?

Any help appreciated.

Edit: in case it means anything, this error gets repeated four times as the server appears to hang:

Code: [Select]
Apr  7 11:55:56 sme kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr  7 11:55:56 sme kernel: ata1.00: (BMDMA stat 0x65)
Apr  7 11:55:56 sme kernel: ata1.00: cmd 25/00:00:cd:2b:5b/00:04:39:00:00/e0 tag 0 cdb 0x0 data 524288 in
Apr  7 11:55:56 sme kernel:          res 51/40:ce:ff:2d:5b/40:01:39:00:00/e9 Emask 0x9 (media error)
Apr  7 11:55:56 sme kernel: ata1.00: configured for UDMA/133
Apr  7 11:55:56 sme kernel: ata1.01: configured for UDMA/133
Apr  7 11:55:56 sme kernel: ata1: EH complete

Then a bunch of these for a while:

Code: [Select]
Apr  7 12:15:25 sme kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Apr  7 12:15:25 sme kernel: Info fld=0x4000000 (nonstd), Invalid sda: sense = 72 11
Apr  7 12:15:26 sme kernel: end_request: I/O error, dev sda, sector 962538605
Apr  7 12:15:26 sme kernel: ata1: EH complete
Apr  7 12:15:26 sme kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr  7 12:15:27 sme kernel: ata1.00: (BMDMA stat 0x65)
Apr  7 12:15:31 sme kernel: ata1.00: cmd 25/00:d8:75:2c:5f/00:02:39:00:00/e0 tag 0 cdb 0x0 data 372736 in
Apr  7 12:15:31 sme kernel:          res 51/40:66:e7:2d:5f/40:01:39:00:00/e9 Emask 0x9 (media error)
Apr  7 12:15:32 sme kernel: ata1.00: configured for UDMA/133
Apr  7 12:15:33 sme kernel: ata1.01: configured for UDMA/133

I am guessing this is not appropriate to raise as an SME bug, so I am hoping someone here has had some similar experiences and can offer some possible solutions.

Am I right in thinking I am going to have to take the server out of service, and copy the one remaining hard drive to a new hard drive using some other bootable distro? If so, any hints? I've tried copying SME disks before and never managed to produce working disks.
« Last Edit: April 07, 2011, 01:48:27 PM by judgej »
-- Jason

Offline BarryO

  • *
  • 17
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #10 on: April 07, 2011, 02:38:28 PM »
Whoa, you are bringing back all sorts of bad memories.

I tried the recommended procedure, with pretty much the same results you had.  One thing I didn't have was the time you seem not to mind losing.  I ended up doing an AFFA restore.  I was up and running in under an hour with my upsized HDDs.

I just use a little spare desktop unit as a dedicated AFFA backup server.  It chugs along unnoticed in the background and I sleep a whole lot better.  An alternative approach would be to AFFA to a USB drive.  That would work too...

Didn't mean to go off topic.  I know my response was not on point to your post.  Sorry man.

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #11 on: April 07, 2011, 02:43:59 PM »
Thanks - that is very useful, and an approach I just may take. I have a spare server - identical to the live server - that I could install a fresh SME Server on, with the new disks, and then synchronize with the live server. Once they are all backed up, I can just swap over the disks ("just" - lol - never works out that way).

I also don't have spare time to do all this - just stuck with it, and also stuck with maintaining up-time, since nothing can get done here in the office if the server is down.

Obviously I am now running on a single disk that has got problems, so the time bomb is ticking...

Ta,

-- Jason

Edit: popped home, picked up the spare server, and now back to give AFFA a go.
« Last Edit: April 07, 2011, 03:20:05 PM by judgej »

Offline BarryO

  • *
  • 17
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #12 on: April 07, 2011, 02:59:22 PM »
With AFFA, you do not need to go offline for more than a few minutes.  Set up the AFFA machine, do a backup of the original server, do an AFFA rise.  You are only offline for the rise operation.  The backup operation takes some time, but the server is not down during that process.  Since AFFA uses rsync, incremental backups are really quick. 
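In outline it is only a couple of commands. A rough sketch, assuming a job called prodserver - the exact option names should be confirmed against the Affa documentation:

Code: [Select]
# On the Affa box: take a backup of the production server
# (re-running it later only transfers the increments)
affa --run prodserver

# When ready to switch over, promote the Affa box to become
# the production server (the "rise" operation)
affa --rise prodserver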

AFFA is very well documented for this procedure.  You'll figure it out.  I did (which says a lot)

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #13 on: April 07, 2011, 03:26:36 PM »
Just a quick question: does it make sense to "rise" the backup instance to production and use it indefinitely, or to use the backup to create a full restore of the production server onto fresh disks? If I can do it in one step:

Production Server 1 -> Affa Backup Server 2 -> swap disks over to Server 1, then "rise"

then that would be easiest. But I'm not sure whether that would be storing up problems for the future? Perhaps better would be:

Production Server 1 -> Affa Backup Server 2 -> Production Server 1 with new disks and fresh SME install
« Last Edit: April 07, 2011, 03:30:24 PM by judgej »
-- Jason

Offline BarryO

  • *
  • 17
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #14 on: April 07, 2011, 03:54:01 PM »
I don't have the luxury of having a fully capable backup server.  Mine is just a retired desktop box with minimal specs.  Still, in my lightly used environment, it fills the role nicely for a day or two.

Set up a backup server and do an AFFA test.  If you have a heavily modded server (lots of contribs, etc.), you may have some issues.  The first time I did a rise, I was really surprised.  It does exactly what the documentation says: it comes up as a clone.  Everything was there and, as the saying goes, it just worked.  The unrise command is equally amazing.

I am not sure what roles your server fulfills for you, but I can easily lock mine down to preserve data integrity while I do the "final" backup and rise operation.  My nightly AFFA backups generally take 10 minutes and the rise operation another 10 minutes or so.  We can live for that long offline, so that turns out to be a solution for me.

I have completely hijacked your thread.  My apologies.


Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #15 on: April 07, 2011, 05:25:16 PM »
...I have completely hijacked your thread.  My apologies.

No, no - this is all good stuff. I just need to get a production server up and running with larger hard drives as quickly and reliably as I can, and this looks like a way to do it. I'm already half-way through the install process of a new server now.

This server acts as a mail server, firewall, file share, and backup server for our web-hosted sites (I run some custom scripts and rsync via cron for that). There is nothing particularly custom about it, although there are a few contribs installed to monitor the system. It is a shame that none of them was a SMART warning system, something I am surprised SME Server does not do out of the box.
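For what it's worth, plain smartmontools can send those warnings. A generic sketch, not SME-specific, with a placeholder email address:

Code: [Select]
# /etc/smartd.conf - monitor both RAID members, email warnings,
# and run a long self-test every Sunday at 02:00
/dev/sda -a -m admin@example.com -s L/../../7/02
/dev/sdb -a -m admin@example.com -s L/../../7/02

# make sure the daemon is running and starts at boot
service smartd start
chkconfig smartd on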

Anyway - we are well off the original topic, but still well in line to solving my problem :-)
-- Jason

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #16 on: April 08, 2011, 11:06:28 AM »
Okay, I am finding out the hard way that affa is simply not going to work.

On our server, we have backups from our production web servers. These backups are taken each day using rsync, and then a snapshot is taken using rsync and hard links every couple of days. This works in a similar way to affa in that files that do not change between each backup job are not duplicated, but stored as a single inode.
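For context, the snapshot side of that is essentially the usual rsync hard-link trick; a rough sketch with made-up paths:

Code: [Select]
# Daily mirror of the remote web server into the backup area
rsync -a --delete webhost:/var/www/ /backups/webhost/current/

# Periodic snapshot: unchanged files become hard links to the
# previous snapshot instead of fresh copies
rsync -a --delete --link-dest=/backups/webhost/snap-2011-04-05/ \
      /backups/webhost/current/ /backups/webhost/snap-2011-04-07/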

Now, the problem here is that affa is not transferring these hard links across. One "index.php" file on our office server, which is hard-linked into twelve monthly backup snapshots, is being copied across to the backup server as twelve individual files. So 40GB of backups with monthly snapshots kept over a year on our main server becomes a terabyte of files on the affa backup server, which will obviously not do.

I guess I am going to have to work out how to tell affa to exclude certain ibays, or at least certain 'snapshot' folders within certain ibays, when backing up.

A word of warning to anyone else who may end up going through this pain: NEVER EVER take out a RAID disk to replace with another, unless you are ABSOLUTELY CERTAIN that the remaining disk has NO read errors ANYWHERE on it. It basically means you cannot consider replacing disks or rebuilding a failed RAID unless you are prepared to take the server completely offline while you do this. The risks and pain involved in trying to get the disk array rebuilt are just too high. I've lost several days of work now, and am not happy.
-- Jason

Offline judgej

  • *
  • 375
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #17 on: April 08, 2011, 11:45:23 AM »
Adding "exclude" folders in the affa job is easy enough. The next time the backup job is run, it will remove the excluded folders from the backup entirely (don't expect them to simply not back up any more - they will go).

If you run the backup job from the command line, it will do the deletions before it sets the backup going as a background job, so it can take a little time to return from "affa --run prodserver".
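For reference, something along these lines is what I mean - the property name and paths here are illustrative only, so check the Affa documentation for the exact names on your version:

Code: [Select]
# Hypothetical example: exclude a snapshot folder from the affa job
db affa setprop prodserver 'Exclude[0]' \
   /home/e-smith/files/ibays/webbackups/files/snapshots

# Re-run the job; the excluded folders are removed from the existing
# backup before the rsync proper starts, so this can take a while
affa --run prodserver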
-- Jason

Offline axessit

  • *****
  • 213
  • +0/-0
Re: Problem replacing new disk in RAID
« Reply #18 on: April 14, 2011, 11:41:26 PM »
I went through this process (adding a larger drive) and thought I was OK. But after the RAID synced up, I tested it by pulling one of the drives out and trying to boot, only to find all sorts of problems (the server wouldn't boot, I got grub errors, then the server would crash trying to mount the LVM, etc.).

Word of warning: if you're thinking of cloning a disk with g4u, Ghost or whatever, these won't work, as the Linux RAID reads drive serial numbers as part of the process and the cloning tools can't change that. That may have had something to do with my problems too. It also had my server out of action for a couple of hours while the disks were cloning, but I did that because I was paranoid about stuffing up my RAID, as I thought only one drive had data.

In the end, I resorted to working through http://wiki.contribs.org/AddExtraHardDisk and http://wiki.contribs.org/Raid:Manual_Rebuild to add the larger disk: create the partitions on the new disk exactly as on the old disks (i.e. a 500GB layout on the 1TB drive), reinstall grub, add it into the RAID and let it rebuild, then remove the original 500GB drive and add the second new 1TB drive in the same way, giving a proper 500GB RAID on two 1TB drives, and finally grow the LVM partition. Apart from the cloning, I did all of this on the fly. There were a few reboots, but each was only a few minutes until the system came back up.
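For anyone following the same route, the final growing step is roughly the following (a sketch only: it assumes the partitions have already been enlarged to fill the 1TB disks, and that the volume group is called main with a logical volume called root, which may differ on your install; the wiki pages above remain the authoritative reference):

Code: [Select]
# Grow the mirror to use the full (enlarged) partitions
mdadm --grow /dev/md2 --size=max

# Grow the LVM physical volume, the root logical volume and the filesystem
pvresize /dev/md2
lvresize -l +100%FREE /dev/main/root
resize2fs /dev/main/root          # ext2online on older SME/CentOS kernels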

I found a good tutorial somewhere on the net about creating a RAID on the fly using mdadm - sorry, I can't put my finger on it now - that complemented the above how-tos quite well.