Problem replacing new disk in RAID

judgej

« on: April 05, 2011, 04:39:41 PM »

I am in the process of upgrading my disks from 500 GB to 1 TB. I have removed the second disk from the RAID and replaced it with a brand new 1 TB disk. When I go into the disk redundancy screen I get the following:

Code:

               ┌──────Disk redundancy status as of Tuesday April  5, 2011 15:31:24────────┐
               │ Current RAID status:                                                     │
               │                                                                          │
               │ Personalities : [raid1]                                                  │
               │ md2 : active raid1 sda2[0]                                               │
               │       488279488 blocks [2/1] [U_]                                        │
               │ md1 : active raid1 sda1[0]                                               │
               │       104320 blocks [2/1] [U_]                                           │
               │ unused devices: <none>                                                   │
               │                                                                          │
               │                                                                          │
               │ The free disk count must equal one.                                      │
               │                                                                          │
               │ Manual intervention may be required.                                     │
               │                                                                          │
               │ Current disk status:                                                     │
               │                                                                          │
               │ Installed disks: sdc sda sdb                                             │
               │ Used disks: sda                                                          │
               └──────────────────────────────────────────────────────────────────────────┘

I am not sure what the "manual intervention" is, and there is no reference to it in the wiki. I am also not sure why it lists three disks as being installed, because there are only two at the moment (SATA0 - 500 GB original disk, SATA2 - 1 TB new disk, SATA1 and SATA3 - no disks and disabled in the BIOS).

Any idea what I need to do to get the new disk into the RAID?
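
For reference, the `[2/1] [U_]` fields in the status above mean that each mirror is configured for two members but only one is active. A minimal shell sketch for listing degraded arrays follows; it assumes the plain `/proc/mdstat` layout shown in the screen (without the box characters), and `degraded_arrays` is a hypothetical helper name, not an SME Server tool.

```shell
# List md arrays running with fewer active members than configured.
# Reads /proc/mdstat-style text on stdin, prints one array name per line.
degraded_arrays() {
    awk '
        # Array header, e.g. "md2 : active raid1 sda2[0]"
        /^md[0-9]+ :/ { name = $1 }
        # Status line carries "[configured/active]", e.g. "[2/1]"
        name != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
            name = ""
        }
    '
}
```

On a live system this would typically be called as `degraded_arrays < /proc/mdstat`.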

-- Jason

judgej

Re: Problem replacing new disk in RAID

« Reply #1 on: April 05, 2011, 04:49:11 PM »

I don't know if this is any use:

Code:

[root@sme ~]# ls /dev/sd*
/dev/sda  /dev/sda1  /dev/sda2  /dev/sdb  /dev/sdc  /dev/sdc1

The original RAID used disks sda and sdb - I'm not sure how they map onto the SATA channels. What sdc is here, I really don't know.

-- Jason

CharlieBrady

Re: Problem replacing new disk in RAID

« Reply #2 on: April 05, 2011, 06:55:56 PM »

Quote from: judgej on April 05, 2011, 04:49:11 PM

What sdc is here, I really don't know.

Well, you'll have to find out. That's what is preventing the console tool from doing its job.

Try:

/sbin/hdparm -I /dev/sdc

jumba

Re: Problem replacing new disk in RAID

« Reply #3 on: April 05, 2011, 07:56:07 PM »

A connected USB-disk maybe?

judgej

Re: Problem replacing new disk in RAID

« Reply #4 on: April 05, 2011, 09:15:18 PM »

Quote from: jumba on April 05, 2011, 07:56:07 PM

A connected USB-disk maybe?

Ah yes, there is a USB disk connected for backups. Not sure whether this is it or not though:

Code:

[root@sme ~]# /sbin/hdparm -I /dev/sdc
/dev/sdc:
 HDIO_DRIVE_CMD(identify) failed: Invalid argument

Running this for sda and sdb gives me a couple of pages of reasonable-looking data on the drives.
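
Since `hdparm -I` fails on this device, another way to check whether a disk hangs off USB is to look at where its sysfs entry resolves. The sketch below is only an illustration: it assumes the resolved path of `/sys/block/<disk>/device` contains a `usbN` component for USB-attached disks, and both function names are hypothetical.

```shell
# Print "usb" if a resolved sysfs device path runs through a USB bus,
# otherwise "other". The argument is a path string such as the output of
# `readlink -f /sys/block/sdc/device`.
classify_sysfs_path() {
    case "$1" in
        */usb[0-9]*/*) echo usb ;;
        *) echo other ;;
    esac
}

# Classify a disk by kernel name (e.g. "sdc") on a live system.
disk_transport() {
    classify_sysfs_path "$(readlink -f "/sys/block/$1/device")"
}
```

Running `disk_transport sdc` on the server would then be expected to print `usb` if sdc is the backup drive.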

« Last Edit: April 05, 2011, 09:17:47 PM by judgej »

-- Jason

judgej

Re: Problem replacing new disk in RAID

« Reply #5 on: April 05, 2011, 11:09:07 PM »

Digging a bit deeper, yes, I believe /dev/sdc *is* the USB drive.

Now, is there a way to move it out of the way programmatically (i.e. remotely) allowing me to set the RAID to build overnight, or is the only option to physically unplug it (a job for tomorrow, when I'm back in the office)?

Alternatively, perhaps there is a way to add a disk to the RAID array without going through the "manage disk redundancy" screen, which seems to have a problem with there being more than two disks on the system? I am guessing that could be a bit fiddly and - dare I say it - risky. It does make me wonder what will happen when I plug in the hot standby..?

« Last Edit: April 05, 2011, 11:17:51 PM by judgej »

-- Jason

CharlieBrady

Re: Problem replacing new disk in RAID

« Reply #6 on: April 06, 2011, 04:47:09 AM »

Quote from: judgej on April 05, 2011, 11:09:07 PM

Alternatively, perhaps there is a way to add a disk to the RAID array without going through the "manage disk redundancy" screen, ...

Yes, it's possible to manually partition the new drive, and manually add the partitions to the RAID mirrors.
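
As a rough sketch of what that could look like (not a tested SME Server procedure), the helper below only prints the commands rather than running them. It assumes the layout from the status screen earlier in the thread: md1 on partition 1 and md2 on partition 2, with sda as the surviving member and sdb as the blank replacement. `plan_raid_add` is a hypothetical name, and the plan does not cover installing the boot loader on the new disk.

```shell
# Print (do not run) the commands that would copy the partition layout from
# the surviving disk to the new disk and add the new partitions to the mirrors.
# Assumption: md1 uses partition 1 and md2 uses partition 2.
plan_raid_add() {
    good=$1   # surviving RAID member, e.g. sda
    new=$2    # blank replacement disk, e.g. sdb
    echo "sfdisk -d /dev/$good | sfdisk /dev/$new"
    echo "mdadm /dev/md1 --add /dev/${new}1"
    echo "mdadm /dev/md2 --add /dev/${new}2"
    echo "cat /proc/mdstat"
}
```

`plan_raid_add sda sdb` prints the plan for review. Copying the partition table this way gives the new disk partitions of the old size, so the extra space on a larger disk stays unused until the arrays are grown later.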

judgej

Re: Problem replacing new disk in RAID

« Reply #7 on: April 06, 2011, 11:50:04 AM »

Okay, I managed to unplug the backup disk, and the "manage disk redundancy" screen immediately started working as expected (a "there is a spare disk, do you want to use it for RAID" kind of message).

I'm not sure whether this should be considered a bug. The disks that form the RAID array did not change between the screen refusing to set up the RAID and then allowing it. I know it is not a situation that happens a lot, but being able to install all the new disks and then manage the switchover to the new RAID remotely would be a real boost. As it is now, at least three site visits are needed to swap disks at the end of each stage, at least when using the standard SME admin screens.

Thanks for your help.

-- Jason

« Last Edit: April 06, 2011, 12:05:21 PM by judgej »

judgej

Re: Problem replacing new disk in RAID

« Reply #8 on: April 07, 2011, 10:53:05 AM »

Well, as the new disk is synchronising itself into the array, the sync process is finding the odd unrecoverable read error on the original disk. I guess this is turning out to be a good time to replace those disks anyway (Hitachi HDs from 2008).

It does make me wonder though - taking a disk out to replace it with a new one is probably not the best method. Synchronising a new disk into the array is IMO best done while *both* original disks are in place, so that reading the data they contain is a bit more resilient. The unrecoverable sector errors I am getting now represent lost data - probably stuff I will never find again, and may not miss, but who knows? If both original RAID disks were still in the machine, then a read error on one would still give the OS a chance to read the other disk to try to recover the sector. With one disk there is no chance.

I am guessing these read errors are in sectors of the disk that have not been read in a long time (only read now that I am synchronising), otherwise the data would already have been moved to good sectors.

« Last Edit: April 07, 2011, 11:17:44 AM by judgej »

-- Jason

judgej

Re: Problem replacing new disk in RAID

« Reply #9 on: April 07, 2011, 01:09:19 PM »

My hunch was correct: this method of upgrading a disk is *not* such a good idea.

The synchronisation gets about 98.4% of the way through, then finds some read errors on the original disk. It hangs everything for about three minutes, then simply jumps back to the beginning, synchronising from 0% again. I guess this is now an endless loop until I can do something about the bad sectors at the end of the disk.

What do I do next? Should I put the old second RAID disk back in for now, and try to recover all data (that won't risk jumping back two days by putting an out-of-date disk back in, will it)? Should I take the server down and scan the single disk with the data on to find and fix hardware errors? Is there something else I should do now?

Any help appreciated.

Edit: in case it means anything, this error gets repeated four times as the server appears to hang:

Code:

Apr  7 11:55:56 sme kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr  7 11:55:56 sme kernel: ata1.00: (BMDMA stat 0x65)
Apr  7 11:55:56 sme kernel: ata1.00: cmd 25/00:00:cd:2b:5b/00:04:39:00:00/e0 tag 0 cdb 0x0 data 524288 in
Apr  7 11:55:56 sme kernel:          res 51/40:ce:ff:2d:5b/40:01:39:00:00/e9 Emask 0x9 (media error)
Apr  7 11:55:56 sme kernel: ata1.00: configured for UDMA/133
Apr  7 11:55:56 sme kernel: ata1.01: configured for UDMA/133
Apr  7 11:55:56 sme kernel: ata1: EH complete

Then a bunch of these for a while:

Code:

Apr  7 12:15:25 sme kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Apr  7 12:15:25 sme kernel: Info fld=0x4000000 (nonstd), Invalid sda: sense = 72 11
Apr  7 12:15:26 sme kernel: end_request: I/O error, dev sda, sector 962538605
Apr  7 12:15:26 sme kernel: ata1: EH complete
Apr  7 12:15:26 sme kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr  7 12:15:27 sme kernel: ata1.00: (BMDMA stat 0x65)
Apr  7 12:15:31 sme kernel: ata1.00: cmd 25/00:d8:75:2c:5f/00:02:39:00:00/e0 tag 0 cdb 0x0 data 372736 in
Apr  7 12:15:31 sme kernel:          res 51/40:66:e7:2d:5f/40:01:39:00:00/e9 Emask 0x9 (media error)
Apr  7 12:15:32 sme kernel: ata1.00: configured for UDMA/133
Apr  7 12:15:33 sme kernel: ata1.01: configured for UDMA/133

I am guessing this is not appropriate to raise as an SME bug, so I am hoping someone here has had some similar experiences and can offer some possible solutions.
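
To see how many distinct sectors are actually failing, the `end_request: I/O error` lines can be pulled out of the log. This is a minimal sketch that assumes the message format shown above; `failed_sectors` is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and print "device sector" for every
# "end_request: I/O error" entry, de-duplicated and sorted by device, then
# numerically by sector.
failed_sectors() {
    sed -n 's/.*end_request: I\/O error, dev \([a-z0-9]*\), sector \([0-9]*\).*/\1 \2/p' |
        sort -u -k1,1 -k2,2n
}
```

For example, `failed_sectors < /var/log/messages` would list each failing sector once, which shows whether the errors are confined to one small region of the disk.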

Am I right in thinking I am going to have to take the server out of service, and copy the one remaining hard drive to a new hard drive using some other bootable distro? If so, any hints? I've tried copying SME disks before and never managed to produce working disks.

« Last Edit: April 07, 2011, 01:48:27 PM by judgej »

-- Jason

BarryO

Re: Problem replacing new disk in RAID

« Reply #10 on: April 07, 2011, 02:38:28 PM »

Whoa, you are bringing back some bad memories.

I tried the recommended procedure, with pretty much the same results you had. One thing I didn't have was the time you seem not to mind losing. I ended up doing an AFFA restore. I was up and running in under an hour with my upsized HDDs.

I just use a little spare desktop unit as a dedicated AFFA backup server. It chugs along unnoticed in the background and I sleep a whole lot better. An alternative approach would be to AFFA to a USB drive. That would work too....

Didn't mean to go off topic. I know my response was not on point to your post. Sorry man.

judgej

Re: Problem replacing new disk in RAID

« Reply #11 on: April 07, 2011, 02:43:59 PM »

Thanks - that is very useful, and an approach I just may take. I have a spare server - identical to the live server - that I could install a fresh SME Server on, with the new disks, and then synchronize with the live server. Once they are all backed up, I can just swap over the disks ("just" - lol - never works out that way).

I also don't have spare time to do all this - just stuck with it, and also stuck with maintaining up-time, since nothing can get done here in the office if the server is down.

Obviously I am now running on one disk that has clearly got problems, so the time-bomb is ticking...

Ta,

-- Jason

Edit: popped home, picked up the spare server, and now back to give AFFA a go.

« Last Edit: April 07, 2011, 03:20:05 PM by judgej »

BarryO

Re: Problem replacing new disk in RAID

« Reply #12 on: April 07, 2011, 02:59:22 PM »

With AFFA, you do not need to go offline for more than a few minutes. Set up the AFFA machine, do a backup of the original server, do an AFFA rise. You are only offline for the rise operation. The backup operation takes some time, but the server is not down during that process. Since AFFA uses rsync, incremental backups are really quick.

AFFA is very well documented for this procedure. You'll figure it out. I did (which says a lot).

judgej

Re: Problem replacing new disk in RAID

« Reply #13 on: April 07, 2011, 03:26:36 PM »

Just a quick question: does it make sense to "rise" the backup instance to production and use it indefinitely, or to use the backup to create a full restore of the production server onto fresh disks? If I can do it in one step:

Production Server 1 -> Affa Backup Server 2 -> swap disks over to Server 1 then "rise"

then that would be easiest. But I'm not sure whether that would be storing up problems for the future. Perhaps better would be:

Production Server 1 -> Affa Backup Server 2 -> Production Server 1 with new disks and fresh SME install

« Last Edit: April 07, 2011, 03:30:24 PM by judgej »

-- Jason

BarryO

Re: Problem replacing new disk in RAID

« Reply #14 on: April 07, 2011, 03:54:01 PM »

I don't have the luxury of having a fully capable backup server. Mine is just a retired desktop box with minimal specs. Still, in my lightly used environment, it fills the role nicely for a day or two.

Set up a backup server and do an AFFA test. If you have a highly modded server (lots of contribs, etc.), you may have some issues. The first time I did a rise, I was really surprised. It does exactly what the documentation says: it comes up as a clone. Everything was there and, as the saying goes, it just worked. The unrise command is equally amazing.

I am not sure what roles your server fulfills for you, but I can easily lock mine down to preserve data integrity while I do the "final" backup and rise operation. My nightly AFFA backups generally take 10 minutes and the rise operation another 10 minutes or so. We can live for that long offline, so that turns out to be a solution for me.

I have completely hijacked your thread. My apologies.
