RAID1 Problem (again)

SchulzStefan

« on: February 10, 2014, 07:26:44 PM »

Running a software RAID1 on SME8 with the latest updates installed.

A few weeks ago, one of the disks in the server failed. I installed an identical disk in the machine and added it to the RAID, and the server was up and running again. No errors were reported. On Saturday night the machine ended up in a loop and stopped responding to anything. I could read on the console that there was no write access to /var/log/httpd/access_log. I tried to shut down the server from the console, but no luck.

I did a hard reset. The machine booted, but told me to run a filesystem check because of an inconsistent file system and dropped me to the maintenance console. I did not perform the check. I shut down the server and swapped the disks on the board, leaving the old sda disconnected. The server came up again and everything seemed to be fine. I shut down the machine again, connected the second drive to the board, booted again, and added the (new) disk to the RAID.

Syncing started, but after a few hours I got an email:

This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

and:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](S) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>

To me this means that the RAID is still broken. The reason must be the unreadable sectors on sda. Interesting, because this is the disk I was able to boot from. Users are reporting no errors while working. I checked the sdb drive with the manufacturer's disk tool. No errors were reported.
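For reading the output above: `[2/1] [U_]` says md2 expects two members but only one is active, and the `(S)` after `sdb2[2]` marks that partition as a spare rather than a synced member. A minimal sketch of how that check could be scripted follows; the function name `degraded_arrays` is made up for this example, and it only looks at the status string, not at the cause.

```shell
#!/bin/sh
# List md arrays whose member-status string (e.g. [U_]) contains an
# underscore, i.e. at least one member is missing or not in sync.
# Reads /proc/mdstat-formatted text on stdin.
# "degraded_arrays" is a hypothetical helper name, not an mdadm command.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the current array name
        /blocks/ && /\[[U_]+\]/ {          # the size/status line of that array
            if ($NF ~ /_/) print name      # last field is the [UU] / [U_] string
        }
    '
}

# On the live system:
#   degraded_arrays < /proc/mdstat
```

Fed the output shown above, this would print `md2` and stay silent about md1.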

Am I right in assuming that with an unreadable sector on sda I will never be able to rebuild the RAID? And if so, what would be the easiest way (shortest downtime) to clone/copy/mirror sda to another disk? Remember, the server is up and running without errors.

Any help/hints would be great. Thanks in advance,
stefan

And then one day you find ten years have got behind you.

Time, 1973
(Mason, Waters, Wright, Gilmour)

SchulzStefan

Re: RAID1 Problem (again)

« Reply #1 on: February 10, 2014, 09:59:45 PM »

From /var/log/messages:

Feb 10 18:45:12 saturn kernel: sd 0:0:0:0: Unhandled sense code
Feb 10 18:45:12 saturn kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Feb 10 18:45:12 saturn kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Feb 10 18:45:12 saturn kernel: sda: Current [descriptor]: sense key: Medium Error
Feb 10 18:45:12 saturn kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 10 18:45:12 saturn kernel:
Feb 10 18:45:12 saturn kernel: Descriptor sense data with sense descriptors (in hex):
Feb 10 18:45:12 saturn kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 10 18:45:12 saturn kernel: 74 70 53 30
Feb 10 18:45:12 saturn kernel: ata1: EH complete
Feb 10 18:45:12 saturn kernel: raid1: sda: unrecoverable I/O read error for block 1953309440
Feb 10 18:45:12 saturn kernel: SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
Feb 10 18:45:12 saturn kernel: sda: Write Protect is off
Feb 10 18:45:12 saturn kernel: sda: Mode Sense: 00 3a 00 00
Feb 10 18:45:12 saturn kernel: SCSI device sda: drive cache: write back
Feb 10 18:45:12 saturn kernel: SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
Feb 10 18:45:12 saturn kernel: sda: Write Protect is off
Feb 10 18:45:12 saturn kernel: sda: Mode Sense: 00 3a 00 00
Feb 10 18:45:12 saturn kernel: SCSI device sda: drive cache: write back
Feb 10 18:45:12 saturn kernel: RAID1 conf printout:
Feb 10 18:45:12 saturn kernel: --- wd:1 rd:2
Feb 10 18:45:12 saturn kernel: disk 0, wo:0, o:1, dev:sda2
Feb 10 18:45:12 saturn kernel: disk 1, wo:1, o:1, dev:sdb2
Feb 10 18:45:13 saturn kernel: RAID1 conf printout:
Feb 10 18:45:13 saturn kernel: --- wd:1 rd:2
Feb 10 18:45:13 saturn kernel: disk 0, wo:0, o:1, dev:sda2
Feb 10 19:01:09 saturn smartd[2582]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
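A rough way to locate the failing block from the log above, as a sketch only. It assumes that the "block" number in the raid1 message is a 512-byte sector offset into md2; the log does not state the unit, so treat that as an assumption. The md2 size is the 976655488 figure from /proc/mdstat, which is in 1 KiB blocks.

```shell
#!/bin/sh
md2_kib=976655488        # md2 size from /proc/mdstat, in 1 KiB blocks
bad_sector=1953309440    # "block" from the raid1 kernel message
                         # ASSUMPTION: a 512-byte sector offset into md2

md2_sectors=$((md2_kib * 2))              # 1 KiB = two 512-byte sectors
from_end=$((md2_sectors - bad_sector))    # distance from the end of md2

echo "md2 length:      $md2_sectors sectors"
echo "distance to end: $from_end sectors ($((from_end / 2)) KiB)"
```

If the assumption holds, the unreadable sector sits in the last 768 KiB of md2. Whether the file system actually uses that area cannot be told from these numbers alone.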

I assume that sda is corrupted. Did some googling: https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID

Quote

Hardware error

If the reading fails for a harddisk, you need to copy that harddisk to a new harddisk. Do that using GNU ddrescue. ddrescue can read forwards (fast) and backwards (slow). This is useful since you can sometimes only read a sector if you read it from "the other side". By giving ddrescue a log-file it will skip the parts that have already been copied successfully. Thereby it is OK to reboot your system, if the copying makes the system hang: The copying will continue where it left off.

ddrescue -r 3 /dev/old /dev/new my_log
ddrescue -R -r 3 /dev/old /dev/new my_log

where /dev/old is the harddisk with errors and /dev/new is the new empty harddisk.

Re-test that you can now read all sectors from /dev/new using 'dd', and remove /dev/old from the system. Then recompute $DEVICES to include the /dev/new:

UUID=$(mdadm -E /dev/sdj1|perl -ne '/Array UUID : (\S+)/ and print $1')
DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | parallel --colsep '\t' echo /dev/{1})

and: http://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

Quote

3 Using ddrescue safely

Ddrescue is like any other power tool. You need to understand what it does, and you need to understand some things about the machines it does those things to, in order to use it safely.

Always use a logfile unless you know you won't need it. Without a logfile, ddrescue can't resume a rescue, only reinitiate it.

Never try to rescue a r/w mounted partition. The resulting copy may be useless.

Never try to repair a file system on a drive with I/O errors; you will probably lose even more data.

If you use a device or a partition as destination, any data stored there will be overwritten.

Some systems may change device names on reboot (eg. udev enabled systems). If you reboot, check the device names before restarting ddrescue.

If you interrupt the rescue and then reboot, any partially copied partitions should be hidden before allowing them to be touched by any operating system that tries to mount and "fix" the partitions it sees.

As I understand it, the procedure should be:

1. Remove sdb from the RAID (set it to faulty and then remove it).
2. Boot the machine, with both disks plugged in, from a CD or USB stick that ddrescue can be run from.
3. Mount sda (the disk with the errors) read-only. Do I have to mount the disk? Or does ddrescue take control itself?
4. Mount sdb read/write. Same question: do I have to mount the disk? Or does ddrescue take control itself?
5. Use ddrescue to copy the data from sda to sdb.

After copying, pull the corrupted disk out of the machine. Plug the good disk (the copy of sda) into the board and boot from it.

I have never done this before. I'm a bit more than a user, but surely no expert in these things. So my question is: does anybody have experience with this? Will it work this way? Did I miss something?
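On the mounting question in steps 3 and 4: ddrescue, like dd, reads and writes the block device directly, so neither disk has to be mounted, and the ddrescue manual quoted above warns against rescuing a partition that is mounted read/write. The sketch below shows the same idea on a small scale. It uses plain dd on two scratch image files as a stand-in for ddrescue and the real disks; the file names are made up for the example.

```shell
#!/bin/sh
# Block-level copy of an unmounted "disk", then a read-back comparison.
# dd on scratch files stands in for ddrescue on /dev/sda and /dev/sdb.
work=$(mktemp -d)
src="$work/old.img"    # hypothetical stand-in for the failing disk
dst="$work/new.img"    # hypothetical stand-in for the new disk

# Create 1 MiB of test data to play the role of the old disk's contents.
dd if=/dev/urandom of="$src" bs=1024 count=1024 2>/dev/null

# Raw copy; nothing is mounted at any point.
dd if="$src" of="$dst" bs=64k conv=noerror,sync 2>/dev/null

# Re-read both sides and compare them byte for byte.
if cmp -s "$src" "$dst"; then result=identical; else result=different; fi
echo "copy is $result"

rm -rf "$work"
```

Unlike dd, ddrescue keeps a log of unreadable areas and can retry them later, which is why the wiki recommends it for a disk with read errors.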

TerryF

Re: RAID1 Problem (again)

« Reply #2 on: February 10, 2014, 11:00:07 PM »

Are you saying you were able to reboot with a single disk, in this case sdb?

If that is the case, why not remove the sda HD, reboot the machine on the single drive, power it down, and simply install a new disk of the same size as sdb? Then reboot and let the system resync the RAID, as it should do when a new HD is introduced to a single-drive setup.

See http://wiki.contribs.org/Raid#Replacing_and_Upgrading_Hard_Drive_after_HD_fail and leave out the grow section.

--
qui scribit bis legit

SchulzStefan

Re: RAID1 Problem (again)

« Reply #3 on: February 11, 2014, 08:43:09 AM »

TerryF, thank you for your reply.

Because the original disk on the sda port ended up in the maintenance console due to an inconsistent file system, I booted the machine from the sdb disk. To do that, I moved the sdb disk to the sda port on the board. The server booted. Then I tried to add the other disk back to the RAID. Syncing started but ended with:

Quote

This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

and:

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](S) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>

As I understand it, the disk is not fully mirrored and therefore not usable. But I haven't tried that yet. Is it possible to boot from an incompletely mirrored disk? I don't know. Does it make sense to try this?

But all this tells me that I'm not able to add a physically good disk to a RAID1 while the other disk has an I/O error. The ridiculous thing is that the server is running on the disk which has I/O errors on bad sectors.

Is there a way to force the mirroring or to bypass the bad sectors while building the RAID? They are obviously (or hopefully) empty. Well, as I said, this is beyond my knowledge...

TerryF

Re: RAID1 Problem (again)

« Reply #4 on: February 11, 2014, 09:18:55 AM »

Remove the faulty drive and boot off the good sdb; you end up with a single-disk RAID 1.

Procure a new HD the same size as sdb. Use an old one if that's all you have, but make sure to "zero" it; buying a new one is a better idea.

Power down the machine, put in the new HD, reboot the system, go to the console, and check that the system has identified that a new HD has been added. Let it sync up; it will take some time.

You are back with a two-drive RAID 1 system.

SchulzStefan

Re: RAID1 Problem (again)

« Reply #5 on: February 11, 2014, 09:22:40 AM »

How can I remove the faulty drive when it is the disk the server is running on?

TerryF

Re: RAID1 Problem (again)

« Reply #6 on: February 11, 2014, 09:30:34 AM »

Quote from: SchulzStefan on February 11, 2014, 08:43:09 AM

I booted the machine from the sdb disk. Therefore I changed the sdb to sda on the board. The server booted.

Sorry, I don't get your last comment. According to what you said above, the machine was booted up on the old sdb; is that not correct?

Throw the faulty drive in the rubbish, buy a new drive of the same size, put it in the machine, and power it up.

Check from the console that the new drive is found and has started to sync.

What's not to do?

« Last Edit: February 11, 2014, 09:32:11 AM by TerryF »

SchulzStefan

Re: RAID1 Problem (again)

« Reply #7 on: February 11, 2014, 09:49:56 AM »

Quote

the machine was booted up on the old sdb, is that not correct?

Yes, the machine booted up with the sdb disk.

Quote

Throw the faulty drive in the rubbish

The faulty drive is running the server.

Quote

Quote
Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

Quote

buy a new drive of the same size, put it in the machine..power it up.

Check that the new drive is found and has started to sync from the console.

That's what I did.

IMHO syncing will not finish because of an I/O error on the sda disk, which is the disk the server is running on.

SchulzStefan

Re: RAID1 Problem (again)

« Reply #8 on: February 11, 2014, 10:06:12 AM »

Meanwhile I installed ddrescue from the EPEL repo.

1. Disable all services.

SVC='qpsmtpd sqpsmtpd crond imap pop3 imaps pop3s ftp httpd-e-smith atalk smb qmail'
for s in $SVC; do service $s stop; done

HylaFAX, Zarafa, and Firebird are running as well, plus the bridge contrib; of course they have to be stopped too.

2. Stop the RAID

mdadm --stop /dev/md1
mdadm --stop /dev/md2

3. Copy the data from sda to sdb

ddrescue -r 3 /dev/sda /dev/sdb my_log
ddrescue -R -r 3 /dev/sda /dev/sdb my_log

Will it work this way? Has anybody done this before?

« Last Edit: February 11, 2014, 10:09:02 AM by SchulzStefan »

TerryF

Re: RAID1 Problem (again)

« Reply #9 on: February 11, 2014, 11:54:17 AM »

Sorry, I am totally confused.

If the sdb drive booted, why is the sda HD still in the server? Why keep trying to use a faulty drive?

If the system starts with sdb as the only disk present, then you can put in another and sync it up.

SchulzStefan

Re: RAID1 Problem (again)

« Reply #10 on: February 11, 2014, 01:42:13 PM »

TerryF, thank you for your patience. Maybe I didn't explain it correctly (English is not my mother tongue).

Step-by-step:

1. Server was dead. sda ended up in the maintenance console because of an inconsistent file system.
2. Pulled sda out of the server.
3. Moved the sdb disk to the sda port.
4. The server booted and is still running. No user has reported any error so far.
5. Examined the disk I had pulled out with the WD tool. No errors reported, on either the short test or the long test.
6. Put the examined disk back in the server.
7. Added the disk back to the RAID with the server admin menu.
8. Mirroring started.
9. Finally got an email that the following warning/error was logged by the smartd daemon: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
10. # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2](S) sda2[0]
976655488 blocks [2/1] [U_]

unused devices: <none>

11. sdb is not clean in the RAID.
12. I assume that the mirroring will not finish because of an unreadable sector on *sda* (see above).
13. While the server is running without errors, I would like to know:
a) Is there a way to force mdadm to mirror? And if not,
b) does it make sense to copy all data from sda to sdb with ddrescue in the way described?

EDITED: Server is running with an error:

SMART error (CurrentPendingSector) detected on host: saturn
This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

As no errors have been reported by the users, I assume that this unreadable sector does not harm any program.

« Last Edit: February 11, 2014, 01:49:51 PM by SchulzStefan »

SchulzStefan

Re: RAID1 Problem (again)

« Reply #11 on: February 11, 2014, 08:56:02 PM »

I couldn't stop md2.

Quote

2. Stop the RAID

mdadm --stop /dev/md1
mdadm --stop /dev/md2

Couldn't find any hint in the forum on how to unmount /dev/md2.

Decided to boot the machine from a USB stick. The OS on the stick is Puppy Linux, Slacko 5.6. I had installed ddrescue on it beforehand.

Puppy recognized both disks. I did not mount the disks and ran:

ddrescue --force -r 3 /dev/sda /dev/sdb log.txt

Copying of data is still in progress. The average rate is 133 MB/s. Right now I have an error size of 0 B and 0 errors. Rescued data is 480000 MB, almost half of the TB. I'll report when it's finished. It's getting late now...
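A back-of-the-envelope estimate of the remaining copy time from the figures above, as a sketch. The total size is the 1000205 MB the kernel log reports for sda. The calculation assumes the 133 MB/s average holds to the end, which is optimistic, since throughput usually drops towards the end of a disk and around unreadable sectors.

```shell
#!/bin/sh
total_mb=1000205     # disk size from the kernel log (sda, 1000205 MB)
rescued_mb=480000    # "rescued" figure reported in the post
rate_mb_s=133        # average rate reported in the post
                     # ASSUMPTION: the rate stays constant

remaining_mb=$((total_mb - rescued_mb))
eta_min=$((remaining_mb / rate_mb_s / 60))    # integer division, rounds down

echo "remaining: $remaining_mb MB, about $eta_min minutes at $rate_mb_s MB/s"
```

So a little over an hour of copying was left at that point, provided no bad areas slow things down.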

TerryF

Re: RAID1 Problem (again)

« Reply #12 on: February 11, 2014, 09:41:54 PM »

Quote from: SchulzStefan on February 11, 2014, 08:56:02 PM

I couldn't stop md2.

Couldn't find any hint in the forum how to unmount /dev/md2.

Decided to boot the machine from an USB-Stick. OS on the stick is a puppy linux, slacko 5.6. I installed ddrescue before.

Puppy recognized both disks. I did not mount the disks and performed

ddrescue --force -r 3 /dev/sda /dev/sdb log.txt

Copying of data is still in progress. Average rate is 133 MB/s. Right now I got 0 B errorsize and 0 errors. Rescued data is 480000 MB. On the way to the half of the TB. I'll report when it's finished. It's getting late now...

Am I just not seeing something? Are you using the built-in RAID support that is part of SME8, or are you using a third-party RAID system?

Good Luck

« Last Edit: February 11, 2014, 09:47:07 PM by TerryF »

SchulzStefan

Re: RAID1 Problem (again)

« Reply #13 on: February 12, 2014, 08:19:54 AM »

Quote

Decided to boot the machine from an USB-Stick. OS on the stick is a puppy linux, slacko 5.6. I installed ddrescue before.

Puppy recognized both disks. I did not mount the disks and performed

ddrescue --force -r 3 /dev/sda /dev/sdb log.txt

Copying of data is still in progress. Average rate is 133 MB/s. Right now I got 0 B errorsize and 0 errors. Rescued data is 480000 MB. On the way to the half of the TB. I'll report when it's finished. It's getting late now...

That worked. I pulled out the disk with the I/O error. Server is running without errors so far.

« Last Edit: February 12, 2014, 09:00:36 PM by SchulzStefan »

SchulzStefan

Re: RAID1 Problem (again)

« Reply #14 on: February 12, 2014, 09:14:53 PM »

Exchanged the disk with the I/O error for a new one at my dealer today. It was still under warranty.

Put the new disk back in the server as sdb. Added the disk to the array.

Got an email which tells me:

This email was generated by the smartd daemon running on:

host name: saturn
DNS domain: ivb.local
NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another email message will be sent in 1 days if the problem persists

I'm totally confused now. I don't know what to do with this.

Remember: I tested the running sda disk in the server with the manufacturer's disk test tool (yes, I did the long test) and no errors were reported. With ddrescue I copied all data from the old sda, the one with the I/O error, to the new disk, which is now sda. Until I added sdb to the RAID, the server was running without errors.

How can it be that there is an I/O error again? Did ddrescue copy the I/O error to this disk? Does this make sense?

Right now there are two brand new disks in the server, and I'm not able to build a RAID?
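One thing that can be checked directly: the pending-sector count in the smartd email is SMART attribute 197 (Current_Pending_Sector), which the drive itself keeps. ddrescue copies data, not that counter, so the value read from a disk describes that physical disk. The smartd email itself points to smartctl; below is a minimal sketch for pulling the raw value out of `smartctl -A` output. The helper name `pending_sectors` and the sample line are made up for illustration and are not taken from this server.

```shell
#!/bin/sh
# Print the raw value (last column) of SMART attribute 197,
# Current_Pending_Sector, from `smartctl -A` output on stdin.
# "pending_sectors" is a hypothetical helper name.
pending_sectors() {
    awk '$1 == 197 { print $NF }'
}

# On the server (needs root), for each disk in turn:
#   smartctl -A /dev/sda | pending_sectors
#   smartctl -A /dev/sdb | pending_sectors

# Illustrative input line only; NOT real output from this machine.
sample='197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1'
printf '%s\n' "$sample" | pending_sectors
```

Comparing the value with the serial number from `smartctl -i` would also confirm which physical disk is currently /dev/sda, since the ddrescue manual quoted earlier notes that device names can change between boots.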
