From /var/log/messages:
Feb 10 18:45:12 saturn kernel: sd 0:0:0:0: Unhandled sense code
Feb 10 18:45:12 saturn kernel: sd 0:0:0:0: SCSI error: return code = 0x08000002
Feb 10 18:45:12 saturn kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Feb 10 18:45:12 saturn kernel: sda: Current [descriptor]: sense key: Medium Error
Feb 10 18:45:12 saturn kernel: Add. Sense: Unrecovered read error - auto reallocate failed
Feb 10 18:45:12 saturn kernel:
Feb 10 18:45:12 saturn kernel: Descriptor sense data with sense descriptors (in hex):
Feb 10 18:45:12 saturn kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 10 18:45:12 saturn kernel: 74 70 53 30
Feb 10 18:45:12 saturn kernel: ata1: EH complete
Feb 10 18:45:12 saturn kernel: raid1: sda: unrecoverable I/O read error for block 1953309440
Feb 10 18:45:12 saturn kernel: SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
Feb 10 18:45:12 saturn kernel: sda: Write Protect is off
Feb 10 18:45:12 saturn kernel: sda: Mode Sense: 00 3a 00 00
Feb 10 18:45:12 saturn kernel: SCSI device sda: drive cache: write back
Feb 10 18:45:12 saturn kernel: SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
Feb 10 18:45:12 saturn kernel: sda: Write Protect is off
Feb 10 18:45:12 saturn kernel: sda: Mode Sense: 00 3a 00 00
Feb 10 18:45:12 saturn kernel: SCSI device sda: drive cache: write back
Feb 10 18:45:12 saturn kernel: RAID1 conf printout:
Feb 10 18:45:12 saturn kernel: --- wd:1 rd:2
Feb 10 18:45:12 saturn kernel: disk 0, wo:0, o:1, dev:sda2
Feb 10 18:45:12 saturn kernel: disk 1, wo:1, o:1, dev:sdb2
Feb 10 18:45:13 saturn kernel: RAID1 conf printout:
Feb 10 18:45:13 saturn kernel: --- wd:1 rd:2
Feb 10 18:45:13 saturn kernel: disk 0, wo:0, o:1, dev:sda2
Feb 10 19:01:09 saturn smartd[2582]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
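The raw sense bytes in the log can be decoded to find the exact sector that failed. The sketch below is bash and assumes the descriptor-format layout from the SCSI Primary Commands (SPC) spec: response code 0x72, sense key in the low nibble of byte 1, ASC/ASCQ in bytes 2 and 3, additional length in byte 7, descriptors from byte 8 on, and an Information descriptor of type 0x00 whose 8-byte big-endian value starts at offset 4. The function name `decode_sense` is hypothetical.

```shell
#!/usr/bin/env bash
# Decode descriptor-format SCSI sense data as printed by the kernel.
# Sketch only: expects the hex bytes as separate arguments, in log order.
decode_sense() {
    local -a b=("$@")
    local resp=$(( 16#${b[0]} & 0x7f ))
    # 0x72/0x73 = descriptor format (current/deferred error).
    if (( resp != 0x72 && resp != 0x73 )); then
        echo "not descriptor-format sense data" >&2
        return 1
    fi
    printf 'sense key: 0x%x\n' $(( 16#${b[1]} & 0x0f ))
    printf 'ASC/ASCQ: 0x%02x/0x%02x\n' $(( 16#${b[2]} )) $(( 16#${b[3]} ))
    # Byte 7 holds the number of descriptor bytes that follow byte 7.
    local i=8 end=$(( 8 + 16#${b[7]} ))
    while (( i < end )); do
        local dtype=$(( 16#${b[i]} )) dlen=$(( 16#${b[i+1]} ))
        if (( dtype == 0 )); then
            # Information descriptor: 8-byte big-endian value at offset 4.
            local lba=0 j
            for (( j = i + 4; j < i + 12; j++ )); do
                lba=$(( lba * 256 + 16#${b[j]} ))
            done
            echo "information (LBA): $lba"
        fi
        # Each descriptor is 2 header bytes plus its additional length.
        i=$(( i + 2 + dlen ))
    done
}
```

For the bytes above, `decode_sense 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 74 70 53 30` prints sense key 0x3 (Medium Error), ASC/ASCQ 0x11/0x04 (the "Unrecovered read error - auto reallocate failed" text in the log) and LBA 1953518384, which is 6784 sectors before the end of the 1953525168-sector disk.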
I assume that sda is damaged. I did some googling:
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID, section "Hardware error":
If reading fails on a hard disk, you need to copy that hard disk to a new hard disk. Do that using GNU ddrescue. ddrescue can read forwards (fast) and backwards (slow). This is useful because sometimes a sector can only be read when it is approached "from the other side". If you give ddrescue a log file, it will skip the parts that have already been copied successfully. This means it is OK to reboot your system if the copying makes it hang: the copying will continue where it left off.
ddrescue -f -r 3 /dev/old /dev/new my_log
ddrescue -f -R -r 3 /dev/old /dev/new my_log
where /dev/old is the hard disk with errors and /dev/new is the new, empty hard disk.
Re-test that you can now read all sectors from /dev/new using 'dd', then remove /dev/old from the system. Then recompute $DEVICES to include /dev/new:
UUID=$(mdadm -E /dev/sdj1|perl -ne '/Array UUID : (\S+)/ and print $1')
DEVICES=$(cat /proc/partitions | parallel --tagstring {5} --colsep ' +' mdadm -E /dev/{5} |grep $UUID | parallel --colsep '\t' echo /dev/{1})
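The copy-and-verify steps quoted above can be sketched as one bash function. This is a sketch, not the wiki's own script: OLD, NEW and MAP are placeholders whose hypothetical defaults mirror the quote, and the `run`/`DRY_RUN` wrapper is an addition so the commands can be printed before anything is written. `-f` is included because GNU ddrescue refuses to overwrite a device or partition without `--force`; newer ddrescue manuals call the log file a "mapfile".

```shell
#!/usr/bin/env bash
# Placeholders (hypothetical): override OLD/NEW/MAP in the environment.
OLD=${OLD:-/dev/old}
NEW=${NEW:-/dev/new}
MAP=${MAP:-my_log}

# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then "$@"; fi
}

rescue_copy() {
    # Forward pass, 3 retry passes over bad areas.
    run ddrescue -f -r 3 "$OLD" "$NEW" "$MAP" || return
    # Reverse pass (-R); the map/log file makes ddrescue skip the areas
    # that the forward pass already copied.
    run ddrescue -f -R -r 3 "$OLD" "$NEW" "$MAP" || return
    # Re-test: read every sector of the copy; dd exits non-zero on a read error.
    run dd if="$NEW" of=/dev/null bs=1M
}
```

Running `DRY_RUN=1 rescue_copy` first shows exactly which devices will be read and overwritten, which matters given the warning below about device names changing between reboots.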
and:
http://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html, section 3 "Using ddrescue safely":
Ddrescue is like any other power tool. You need to understand what it does, and you need to understand some things about the machines it does those things to, in order to use it safely.
Always use a logfile unless you know you won't need it. Without a logfile, ddrescue can't resume a rescue, only reinitiate it.
Never try to rescue a r/w mounted partition. The resulting copy may be useless.
Never try to repair a file system on a drive with I/O errors; you will probably lose even more data.
If you use a device or a partition as destination, any data stored there will be overwritten.
Some systems may change device names on reboot (e.g. udev-enabled systems). If you reboot, check the device names before restarting ddrescue.
If you interrupt the rescue and then reboot, any partially copied partitions should be hidden before allowing them to be touched by any operating system that tries to mount and "fix" the partitions it sees.
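The rule against rescuing a mounted partition can be checked before starting. A minimal bash sketch, assuming the usual /proc/mounts format in which the first field is the mounted device; `is_mounted` is a hypothetical helper, and the mounts file is a parameter only so the check can be tried against a sample file. It does not detect a partition that is only in use as an md array member (such as sda2 here), because /proc/mounts lists the md device instead; /proc/mdstat shows that.

```shell
#!/usr/bin/env bash
# Return 0 if DEV, or any partition on it, appears in the mount table.
is_mounted() {
    local dev=$1 mounts=${2:-/proc/mounts} src rest
    while read -r src rest; do
        case $src in
            # Whole disk, sdX1-style partitions, or nvme/mmcblk pN partitions.
            "$dev" | "$dev"[0-9]* | "$dev"p[0-9]*) return 0 ;;
        esac
    done < "$mounts"
    return 1
}
```

Typical use before a rescue: `is_mounted /dev/sda && { echo "sda is mounted, aborting" >&2; exit 1; }`.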
As I understand it, the procedure should be:
1. Remove sdb from the RAID (mark it as faulty, then remove it).
2. Boot the machine, with both disks connected, from a CD or USB stick that can run ddrescue.
3. Mount sda (the disk with the errors) read-only. Do I have to mount the disk at all, or does ddrescue access it directly?
4. Mount sdb read/write. Same question: do I have to mount the disk, or does ddrescue access it directly?
5. Use ddrescue to copy the data from sda to sdb.
After copying, remove the damaged disk from the machine and connect the good disk (the copy of sda) to the board to boot from it.
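Steps 1 and 5 can be sketched in bash under these assumptions: the array is /dev/md1 (hypothetical; /proc/mdstat shows the real name), sda is the source and sdb the destination, and the member partitions are sda2/sdb2 as in the log. The second RAID1 conf printout in the log lists only sda2, so sdb2 may already be out of the array. In this sketch both disks are used as raw block devices; nothing is mounted. The `fits` size check and the `run`/`DRY_RUN` wrapper are additions, not part of any quoted procedure.

```shell
#!/usr/bin/env bash
MD=${MD:-/dev/md1}     # hypothetical array name
SRC=${SRC:-/dev/sda}   # disk with the read error
DST=${DST:-/dev/sdb}   # disk that receives the copy

# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then "$@"; fi
}

# A whole-disk copy only fits if the destination is at least as large.
fits() { (( $2 >= $1 )); }

step1_remove_dst() {
    # Mark the destination's member partition faulty, then remove it.
    run mdadm "$MD" --fail "${DST}2" --remove "${DST}2"
}

step5_copy() {
    local src_bytes dst_bytes
    src_bytes=$(blockdev --getsize64 "$SRC") || return
    dst_bytes=$(blockdev --getsize64 "$DST") || return
    fits "$src_bytes" "$dst_bytes" || { echo "$DST is smaller than $SRC" >&2; return 1; }
    run ddrescue -f -r 3 "$SRC" "$DST" rescue.map
}
```

Because this copies the whole disk, sdb also receives sda's partition table and md superblocks, so afterwards both disks carry the same array identity.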
I have never done this before; I'm a bit more than a user, but certainly no expert in these things. So my question is: does anybody have experience with this? Will it work this way? Did I miss something?