Hi,
Yesterday I received two mdadm monitoring emails: the first at 9:32 for /dev/md2, the second at 23:32 for /dev/md1.
The current output of mdadm --detail for /dev/md1 and /dev/md2 is:
/dev/md1:
Version : 00.90.01
Creation Time : Sun Feb 11 13:38:58 2007
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Device Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Tue Dec 11 23:31:11 2007
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
UUID : 4c021324:7d0b66b0:59685a06:10306ca3
Events : 0.3613
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 0 0 - removed
2 8 17 - faulty /dev/sdb1
/dev/md2:
Version : 00.90.01
Creation Time : Sun Feb 11 13:38:58 2007
Raid Level : raid1
Array Size : 195253888 (186.21 GiB 199.94 GB)
Device Size : 195253888 (186.21 GiB 199.94 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Wed Dec 12 15:11:44 2007
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
UUID : 06c12353:d61ac8e5:2de7f6e9:7fdcfb10
Events : 0.6595950
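To make the relevant part of that output easier to spot, here is a minimal sketch that filters `mdadm --detail` output down to the members marked faulty. The function name `faulty_members` is my own invention, and it assumes the device table looks like the md1 listing above (state in the row, device path as the last field).

```shell
# Hypothetical helper: read `mdadm --detail` output on stdin and print
# the device path of every member whose row contains "faulty".
# Assumes the device path (/dev/...) is the last field of the row.
faulty_members() {
    awk '/faulty/ && $NF ~ /^\/dev\// { print $NF }'
}
```

It would be used as `mdadm --detail /dev/md1 | faulty_members` (as root); for the md1 listing above it should report /dev/sdb1.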
When I looked into the messages log file I found some errors as well:
Dec 11 09:28:30 smesrv-01 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Dec 11 09:28:30 smesrv-01 kernel: ata2.00: cmd 35/00:08:02:dd:49/00:00:17:00:00/e0 tag 0 cdb 0x0 data 4096 out
Dec 11 09:28:30 smesrv-01 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 11 09:28:37 smesrv-01 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Dec 11 09:29:00 smesrv-01 kernel: ata2: port failed to respond (30 secs, Status 0xd0)
Dec 11 09:29:00 smesrv-01 kernel: ata2: soft resetting port
Dec 11 09:29:00 smesrv-01 kernel: ATA: abnormal status 0xD0 on port 0xE407
Dec 11 09:29:00 smesrv-01 last message repeated 6 times
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: qc timeout (cmd 0xec)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: revalidation failed (errno=-5)
Dec 11 09:31:41 smesrv-01 kernel: ata2: failed to recover some devices, retrying in 5 secs
Dec 11 09:31:41 smesrv-01 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: port failed to respond (30 secs, Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: soft resetting port
Dec 11 09:31:41 smesrv-01 kernel: ATA: abnormal status 0xD0 on port 0xE407
Dec 11 09:31:41 smesrv-01 su(pam_unix)[32741]: session opened for user qmailr by (uid=0)
Dec 11 09:31:41 smesrv-01 last message repeated 6 times
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: qc timeout (cmd 0xec)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: revalidation failed (errno=-5)
Dec 11 09:31:41 smesrv-01 kernel: ata2: failed to recover some devices, retrying in 5 secs
Dec 11 09:31:41 smesrv-01 kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: port failed to respond (30 secs, Status 0xd0)
Dec 11 09:31:41 smesrv-01 kernel: ata2: soft resetting port
Dec 11 09:31:41 smesrv-01 kernel: ATA: abnormal status 0xD0 on port 0xE407
Dec 11 09:31:41 smesrv-01 last message repeated 6 times
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: qc timeout (cmd 0xec)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: revalidation failed (errno=-5)
Dec 11 09:31:41 smesrv-01 kernel: ata2.00: disabled
Dec 11 09:31:41 smesrv-01 kernel: ata2: EH complete
Dec 11 09:31:41 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:41 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 390716674
Dec 11 09:31:41 smesrv-01 kernel: md: write_disk_sb failed for device sdb2
Dec 11 09:31:41 smesrv-01 kernel: md: errors occurred during superblock update, repeating
Dec 11 09:31:41 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
<snip> these errors repeated a few hundred times in three seconds <snip>
Dec 11 09:31:44 smesrv-01 kernel: md: excessive errors occurred during superblock update, exiting
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 271522
Dec 11 09:31:44 smesrv-01 kernel: raid1: Disk failure on sdb2, disabling device.
Dec 11 09:31:44 smesrv-01 kernel: Operation continuing on 1 devices
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 271530
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 271538
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 108462770
Dec 11 09:31:44 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 09:31:44 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 147370394
Dec 11 09:31:44 smesrv-01 kernel: RAID1 conf printout:
Dec 11 09:31:44 smesrv-01 kernel: --- wd:1 rd:2
Dec 11 09:31:44 smesrv-01 kernel: disk 0, wo:0, o:1, dev:sda2
Dec 11 09:31:44 smesrv-01 kernel: disk 1, wo:1, o:0, dev:sdb2
Dec 11 09:31:44 smesrv-01 kernel: RAID1 conf printout:
Dec 11 09:31:44 smesrv-01 kernel: --- wd:1 rd:2
Dec 11 09:31:44 smesrv-01 kernel: disk 0, wo:0, o:1, dev:sda2
Note the ATA errors before the actual write errors. I don't know what they mean.
This happened again at 23:30 (during the backup run):
Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 208641
Dec 11 23:30:40 smesrv-01 kernel: md: write_disk_sb failed for device sdb1
<snip> again a few hundred messages in three seconds <snip>
Dec 11 23:30:40 smesrv-01 kernel: md: errors occurred during superblock update, repeating
Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 208641
Dec 11 23:30:40 smesrv-01 kernel: md: write_disk_sb failed for device sdb1
Dec 11 23:30:40 smesrv-01 kernel: md: errors occurred during superblock update, repeating
Dec 11 23:30:40 smesrv-01 kernel: md: excessive errors occurred during superblock update, exiting
Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 1171
Dec 11 23:30:40 smesrv-01 kernel: raid1: Disk failure on sdb1, disabling device.
Dec 11 23:30:40 smesrv-01 kernel: Operation continuing on 1 devices
Dec 11 23:30:40 smesrv-01 kernel: SCSI error : <1 0 0 0> return code = 0x40000
Dec 11 23:30:40 smesrv-01 kernel: end_request: I/O error, dev sdb, sector 1173
Dec 11 23:30:40 smesrv-01 kernel: RAID1 conf printout:
Dec 11 23:30:40 smesrv-01 kernel: --- wd:1 rd:2
Dec 11 23:30:40 smesrv-01 kernel: disk 0, wo:0, o:1, dev:sda1
Dec 11 23:30:40 smesrv-01 kernel: disk 1, wo:1, o:0, dev:sdb1
Dec 11 23:30:40 smesrv-01 kernel: RAID1 conf printout:
Dec 11 23:30:40 smesrv-01 kernel: --- wd:1 rd:2
Dec 11 23:30:40 smesrv-01 kernel: disk 0, wo:0, o:1, dev:sda1
Note that there were no more ATA errors this time.
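To line up the two incidents, a small sketch like the following can pull just the moments where md dropped a member out of the messages log. The function name `disabled_members` is hypothetical, and it assumes the syslog line layout shown in the excerpts above (month, day, time, host, then the kernel message).

```shell
# Hypothetical helper: read syslog lines on stdin and print the timestamp
# and partition for each "raid1: Disk failure on <dev>, disabling device."
# Field 10 is the device name followed by a comma in the layout above.
disabled_members() {
    awk '/raid1: Disk failure on/ {
        dev = $10
        sub(/,$/, "", dev)      # strip the trailing comma
        print $1, $2, $3, dev
    }'
}
```

Run as `disabled_members < /var/log/messages` (the log path is an assumption for this system), it prints one line per failure, which makes it easy to see that both events hit partitions of the same disk, sdb.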
The machine runs SME 7.2, fully up to date, and it has run perfectly for about 18 months.
I am unsure which actions to take. Both RAID arrays reported errors independently, which I have never seen before. I think a malfunctioning disk is the likely cause. My questions:
Can the ATA errors in the first part of the messages log cause the arrays to degrade?
Can I reactivate the array in this situation? It differs from the previous posts in the forum on this topic because the mdadm output explicitly lists the failed device; that is not the case in any of the other examples I found. If I can, what are the correct commands for this situation? (I hate messing around with production arrays, partitions and disks; I am not that good at Linux.)
Is there a way to read the SMART status of the disk device sdb from the command line? I would like some confirmation at device level before I start asking for money to buy a new disk...
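The closest thing I have found so far is `smartctl` from the smartmontools package. Below is a minimal sketch of what I think the check would look like, assuming that package is installed on this box and that the disk is reachable as /dev/sdb. The wrapper name `smart_health` is my own, and I have not confirmed this on SME 7.2.

```shell
# Hypothetical wrapper: print the overall SMART health verdict for one disk.
# Assumes smartctl (smartmontools) is available; returns 127 if it is not.
smart_health() {
    dev="$1"
    if ! command -v smartctl >/dev/null 2>&1; then
        echo "smartctl not found (is smartmontools installed?)" >&2
        return 127
    fi
    # -H asks the drive for its overall health self-assessment.
    smartctl -H "$dev"
}
```

I would call it as `smart_health /dev/sdb` (as root). If the drive is not recognised, `smartctl -d ata -H /dev/sdb` may be needed for SATA disks on older kernels, and `smartctl -a /dev/sdb` should show the full attribute table; please correct me if that is the wrong approach.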
Regards,
Marcel