Koozali.org: home of the SME Server

sme 7.3, 4 disks, Manual intervention may be required

Offline bhamail

sme 7.3, 4 disks, Manual intervention may be required
« on: October 25, 2008, 05:16:59 PM »
I recently got a couple of emails about a degraded RAID array, but I'm not really sure how to repair it. Here's the info I saw listed in other posts. Thanks for any guidance!
(This server has 4 SATA disks.)

Code: [Select]
# fdisk -l

Disk /dev/sda: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   fd  Linux raid autodetect
/dev/sda2              14       48641   390604410   fd  Linux raid autodetect

Disk /dev/sdb: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          13      104391   fd  Linux raid autodetect
/dev/sdb2              14       48641   390604410   fd  Linux raid autodetect

Disk /dev/sdc: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdc doesn't contain a valid partition table

Disk /dev/sdd: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn't contain a valid partition table

Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md2: 399.9 GB, 399978790912 bytes
2 heads, 4 sectors/track, 97651072 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md2 doesn't contain a valid partition table

Disk /dev/dm-0: 397.7 GB, 397787791360 bytes
2 heads, 4 sectors/track, 97116160 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/dm-1: 2080 MB, 2080374784 bytes
2 heads, 4 sectors/track, 507904 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/dm-1 doesn't contain a valid partition table

Code: [Select]
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb2[1]
      390604288 blocks [2/1] [_U]
     
md1 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]
     
unused devices: <none>
Code: [Select]
# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Wed May 23 23:45:49 2007
     Raid Level : raid1
     Array Size : 104320 (101.89 MiB 106.82 MB)
    Device Size : 104320 (101.89 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Oct 25 10:45:35 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : b842337c:566cdf32:932646ab:ef088413
         Events : 0.13108

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8       17        1      active sync   /dev/sdb1
Code: [Select]
# mdadm --detail /dev/md2
/dev/md2:
        Version : 00.90.01
  Creation Time : Wed May 23 23:45:49 2007
     Raid Level : raid1
     Array Size : 390604288 (372.51 GiB 399.98 GB)
    Device Size : 390604288 (372.51 GiB 399.98 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Sat Oct 25 11:13:33 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : d4a287fb:892c031a:b69a18db:d2fe5e9c
         Events : 0.18878483

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8       18        1      active sync   /dev/sdb2

I'm thinking of RTFM'ing. Is there a "noobie" RAID article you could recommend?

Thanks!

Offline janet

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #1 on: October 25, 2008, 06:24:23 PM »
bhamail

There is a link at the top of the Forums to Howtos; look for the one on RAID, it may be helpful.
Please search before asking, an answer may already exist.
The Search & other links to useful information are at top of Forum.

Offline christian

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #2 on: October 25, 2008, 06:38:26 PM »
I would also encourage you to verify the health of your disks and try to understand why the array broke before you add the failed disk back in. I've now modified the RAID entry to indicate this.

See also:
http://wiki.contribs.org/Monitor_Disk_Health
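
If it helps, the checks there boil down to something like this (just a sketch; it assumes smartmontools is installed, which your system should already have for RAID monitoring):

Code: [Select]
# overall health verdict for each array member
smartctl -H /dev/sda
smartctl -H /dev/sdb
# run a short self-test, then read the log a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda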
« Last Edit: October 25, 2008, 06:45:27 PM by christian »
SME since 2003

Offline bhamail

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #3 on: October 25, 2008, 08:15:59 PM »
Great info in those HowTos (I can't believe I never noticed these). Thanks!

While running the recommended checks, I only saw one report that seemed problematic:

Code: [Select]
# smartctl -a /dev/sda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD4000YR-01PLB0
Serial Number:    WD-WMAMY1624489
Firmware Version: 01.06A01
User Capacity:    400,088,457,216 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  Not recognized. Minor revision code: 0x1d
Local Time is:    Sat Oct 25 14:09:16 2008 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (10530) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 152) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   001   051    Pre-fail  Always   In_the_past 0
  3 Spin_Up_Time            0x0007   225   224   021    Pre-fail  Always       -       5791
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       133
  5 Reallocated_Sector_Ct   0x0033   165   165   140    Pre-fail  Always       -       560
  7 Seek_Error_Rate         0x000b   200   001   051    Pre-fail  Always   In_the_past 0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       15061
 10 Spin_Retry_Count        0x0013   065   063   051    Pre-fail  Always       -       128
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
194 Temperature_Celsius     0x0022   117   101   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       279
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15061         -
# 2  Short offline       Completed without error       00%      9954         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Specifically, these bits jumped out at me:
Code: [Select]
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   001   051    Pre-fail  Always   In_the_past 0
...
  7 Seek_Error_Rate         0x000b   200   001   051    Pre-fail  Always   In_the_past 0
...

Are those "WHEN_FAILED" -> "In_the_past" entries something to worry about? Only /dev/sda had such entries; none of the other drives showed anything failing.

Thanks again!

Offline bhamail

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #4 on: October 25, 2008, 08:45:47 PM »
Hmm...I'm having some trouble understanding the HowTo.

I have four drives, so I'd expect RAID5 +1 from the howto that says:

4-6 Drives - Software RAID 5 + 1 Hot-spare

But mdstat shows only sdb in use. OK, so a couple of drives got kicked out of the array, but I'm not sure which drives to add back where (and I was expecting mdstat to show an md device as RAID5).
Code: [Select]
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb2[1]
      390604288 blocks [2/1] [_U]
     
md1 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]
     
unused devices: <none>

I started down the path of adding back sda (not sure this is correct) via:

# mdadm --add /dev/md1 /dev/sda1

then

# mdadm --add /dev/md2 /dev/sda2

but wouldn't that leave /dev/sdc and /dev/sdd unused?
I don't really know how the drives were set up initially (before the recent problems occurred), so I'm a little worried about rebuilding the arrays incorrectly. Thanks for your patience and help!

Offline christian

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #5 on: October 25, 2008, 09:40:50 PM »
Quote
Code: [Select]
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   001   051    Pre-fail  Always   In_the_past 0
...
  7 Seek_Error_Rate         0x000b   200   001   051    Pre-fail  Always   In_the_past 0
...

Are those "WHEN_FAILED" -> "In_the_past" entries something to worry about? Only /dev/sda had such entries; none of the other drives showed anything failing.

Doesn't look good to me, especially since it is also the disk that got kicked out.

See also: http://en.wikipedia.org/wiki/S.M.A.R.T. for a good description of these errors.

Also note parameter 196 indicates a number of sectors have been remapped. The disk is still alive, but I think it is on its way out. Assuming sdb is clean, you can still add sda back in, as it has recovered from its failures, but I would look at replacing sda as it is aging.

If sdb is showing issues then adding sda back in may fail.
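
If sda does check out and you want to try, the add itself is short (a sketch; double-check the device names against your own setup first):

Code: [Select]
# re-add the dropped partitions to their mirrors
mdadm /dev/md1 -a /dev/sda1
mdadm /dev/md2 -a /dev/sda2
# then watch the rebuild progress
watch cat /proc/mdstat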
« Last Edit: October 25, 2008, 09:49:48 PM by christian »
SME since 2003

Offline christian

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #6 on: October 25, 2008, 09:46:35 PM »
Quote
I have four drives, so I'd expect RAID5 +1 from the howto that says:

4-6 Drives - Software RAID 5 + 1 Hot-spare

Your mdstat says you have RAID1. The fact that it is still running on one disk says that too.

If you started with one or two disks and then added the others, those disks will be unused unless you did something to enable them.
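
You can confirm whether sdc and sdd were ever array members by looking for md superblocks on them (a sketch; on a disk that was never used, this simply reports that no superblock was found):

Code: [Select]
# look for RAID superblocks on the unused disks
mdadm --examine /dev/sdc
mdadm --examine /dev/sdd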
« Last Edit: October 25, 2008, 09:48:31 PM by christian »
SME since 2003

Offline bhamail

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #7 on: October 25, 2008, 10:15:23 PM »
When I built this machine it had all 4 disks, and I installed SME 7.2 (and later upgraded to 7.3), so maybe that's why it is only RAID1? If so, it still seems odd that the md devices don't start from 0 (md0, md1 instead of md1, md2).

Anyway, I'm now having trouble adding sda back:
Code: [Select]
# mdadm --add /dev/md2 /dev/sda2
mdadm: hot added /dev/sda2
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda2[2](F) sdb2[1]
      390604288 blocks [2/1] [_U]
     
md1 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]
     
unused devices: <none>
It fails right away.
I wonder if I did something bad?
I first tried adding back sda1 via:
# mdadm --add /dev/md1 /dev/sda1
I checked mdstat and saw the "rebuilding..." message.
Then (without waiting... :(  ) I added sda2 via:
# mdadm --add /dev/md2 /dev/sda2
I checked mdstat and saw no "rebuilding...", but both newly added partitions had an (F) next to them. Yikes!

Dare I try adding one of the other drives (that I now suspect has never actually been used all this time)?

I really hope I haven't tanked my server.

« Last Edit: October 25, 2008, 10:16:55 PM by bhamail »

Offline christian

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #8 on: October 25, 2008, 10:57:35 PM »
No, you are OK for now; don't panic. You are still running, right?  :smile:

It would seem sda is sufficiently bad that it won't go back in. I suspect if you check your SMART stats you will see increased errors. You could run a self-test on the disk and see what it says.
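
Something like this (a sketch; your earlier output put the extended test at about 152 minutes):

Code: [Select]
# start the long (extended) self-test; it runs inside the drive itself
smartctl -t long /dev/sda
# when it finishes, check the self-test log and the attribute counters again
smartctl -l selftest /dev/sda
smartctl -A /dev/sda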

If either of the other two disks is identical to the first two in the RAID, then you can easily add one in. If they are different, we need to know more about them (e.g. brand, size, and preferably model). In general, if they are different but bigger, they can be added in, but the usable size will be that of the smaller disk in the array.

Don't forget to run smartctl on the other disks to ensure they are ok.

If one of them is the same and checks out OK, you can use mdadm to add the new disk into the array. Resync to completion, then shut down and remove the dead drive.
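
For identical disks that boils down to something like this (a sketch; it assumes sdb is the healthy member and sdc is the disk being added, so triple-check the device names first):

Code: [Select]
# copy the partition table from the good disk to the blank one
sfdisk -d /dev/sdb | sfdisk /dev/sdc
# add the new partitions into the degraded mirrors
mdadm /dev/md1 -a /dev/sdc1
mdadm /dev/md2 -a /dev/sdc2
# confirm the resync has finished before pulling the dead drive
cat /proc/mdstat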

If the disks are different, tell me what they are and I'll point you in the right direction.

BTW, only md1 and md2 are used now. There used to also be an md0 before SME 7; I think this changed in SME 7 with the use of LVM.
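
(That should also explain the dm-0 and dm-1 devices in your fdisk output: they are LVM volumes sitting on top of md2. If you're curious:)

Code: [Select]
# show the LVM physical volume (md2) and the logical volumes carved from it
pvdisplay
lvdisplay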
SME since 2003

Offline christian

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #9 on: October 25, 2008, 11:00:08 PM »
Sorry, just checked your first post: they are all identical, so go for it. And it is clear sdc and sdd aren't being used.
SME since 2003

Offline electroman00

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #10 on: October 26, 2008, 03:06:58 AM »
bhamail

Quote
Are those "WHEN_FAILED" -> "In_the_past" entries something to worry about?

Could be a bad drive, or a corrupt drive from a power outage if there is no UPS.

Indications are a corrupt drive; bad drives will usually show a lot of errors.

Suggestion: disconnect all drives except the good drive, and mark the good drive.

Make sure you can boot to the good drive.

Then remove that drive, install only the suspected bad drive, and perform non-destructive tests first.
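
For the non-destructive part, a read-only surface scan plus a SMART test is enough (a sketch; badblocks is read-only unless you pass -w, but double-check the device name anyway):

Code: [Select]
# read-only surface scan; slow on a 400 GB disk, but non-destructive
badblocks -sv /dev/sda
# the conveyance self-test is another quick non-destructive check
smartctl -t conveyance /dev/sda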

If the drive fails the tests, reformat it using the manufacturer's diagnostic utilities.

While it's formatting, gently wiggle the drive and power cables.

The format should continue error-free; a failure indicates bad cabling.

That's a fairly new SATA drive with less than 2 years' run time, so you may have a bad cable/connection, controller, or motherboard.

If the motherboard has been running for some time, it's a good idea to check it for bulging capacitors: caps whose tops are not flat but bulge upward.

http://en.wikipedia.org/wiki/Electrolytic_capacitor

After the format, run SMART again on that drive.

Then remove the reformatted (bad) drive....

Set up the good drive as sda and reboot with only that drive installed.

Check its RAID status with SME.

Then shut down and install the reformatted drive as sdb.

Boot and check the SME RAID; it should prompt you to sync the RAID.

Let it sync and keep an eye on it (SMART check) for a few weeks.

If you can't format the bad drive, or it fails SMART, then the drive is likely bad.

Let us know the results...

hth

Added these important notes; read them before formatting any drive.

1. If you use SATA or SCSI drives, the drive devices may move around during boot, so you should not use /dev/sd? to find your drives. The kernel and/or BIOS assigns these names as it sees fit, so there's no guarantee that /dev/sda will always refer to the same physical device. (One way to match device names to physical drives is sketched after this list.)
Therefore it's always a good idea to disconnect any other drive on the system when running diagnostics or formatting.

2. Before formatting any drives (due in part to #1), it's a good idea to verify whether you can boot from the suspected good (RAID) drive by installing only that drive and attempting the boot.

3. Be sure you have identified the physical drives via BIOS settings and jumpers.

4. Run only non-destructive tests first, to verify the faulty drive.

5. Double verify everything you have done before you format any drive.
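
As mentioned in note 1, here is one way to cross-check which /dev/sd? is which physical drive (a sketch; match the serial number smartctl reports against the label on the drive itself):

Code: [Select]
# print identity info, including the serial number, for each device
smartctl -i /dev/sda | grep -i serial
smartctl -i /dev/sdb | grep -i serial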


« Last Edit: October 26, 2008, 11:00:51 AM by electroman00 »

Offline christian

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #11 on: October 26, 2008, 04:21:57 AM »
Quote
Indications are a corrupt drive; bad drives will usually show a lot of errors.

I would generally agree based on parameters 1 and 7 alone; but parameter 196 shows quite a lot of sector remap events, hence my opinion. But what the heck, maybe you can rescue it.

I personally wouldn't trust anything with bad sectors. Just make sure the stats are clear if you are going to re-use it.
SME since 2003

Offline william_syd

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #12 on: October 26, 2008, 06:32:18 AM »
I recently had a drive drop out of an array.

fdisk -l showed that the first partition ended on the same cylinder (13) as the start of the second partition (13).

I went through the motions of failing it, then removing it, and finally reformatting it with the same boundaries as the good drive.

Added it back and it's still going.

fdisk -l as it is now...

Code: [Select]
[root@tiger ~]# fdisk -l

Disk /dev/hda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   fd  Linux raid autodetect
/dev/hda2              14       38913   312464250   fd  Linux raid autodetect

Disk /dev/hdc: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           1          13      104391   fd  Linux raid autodetect
/dev/hdc2              14       38913   312464250   fd  Linux raid autodetect

        ----------------------snip------------------------------


and...

Code: [Select]
[root@tiger ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0] hdc2[1]
      312464128 blocks [2/2] [UU]
     
md1 : active raid1 hda1[0] hdc1[1]
      104320 blocks [2/2] [UU]
     
unused devices: <none>
[root@tiger ~]#


Commands to fail, remove and add:

Code: [Select]
mdadm /dev/md1 -f /dev/hdc1 -r /dev/hdc1
mdadm /dev/md2 -f /dev/hdc2 -r /dev/hdc2
mdadm /dev/md2 -a /dev/hdc2
mdadm /dev/md1 -a /dev/hdc1



Regards,
William

IF I give advice.. it's only if it was me....

Offline bhamail

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #13 on: October 28, 2008, 02:29:57 AM »
Thanks very much for these helpful replies!

I added in one of the unused drives (sdc), but I managed to skip what I think was a critical step:

1. Used "sfdisk ... > out; sfdisk .../sdc < out" from another post to create matching partions on sdc.
2. Got cautious since the disk had never really been used in the raid before, and did the "dd..." to wipe the MBR. (DID NOT remember to reboot after "dd...", as was suggested in yet a third post).
3. added sdc1 and sdc2 to md1 and md2 respectively, and let the synch complete.

Since I had another unused disk (sdd), I went ahead and did the same steps above (still no reboot though) to add sdd1,2 as spares for md1,2.

When I finally rebooted (wondering if it would ever come back up), the RAID array kicked out the sdc disk (and the sdd spares)!
Also, (as warned in other posts) the drives have been re-named after the reboot: fdisk now shows sda, sdb, and sdc; but I know the former sda is toast, so sda must be referring to one of the previously unused drives now.
I think the disk originally known as sda has gone so bad that it doesn't even show up any more in fdisk -l.

I'm guessing the RAID kicked out the new disks because I never rebooted after doing the "dd...", so now I am doing:

1. Used "sfdisk ... > out; sfdisk .../sdc < out" to create matching partitions on sdc, sda.
2. Did NOT do the "dd..." to wipe the MBR; DID reboot after sfdisk just to be sure (and made sure fdisk -l showed valid partitions on all drives, currently sda,b,c).
3. Added sdc1 and sdc2 to md1 and md2 respectively, and am letting the sync run.
Also added sda1,2 as spares for md1,2.
The sync is still grinding away; I've sketched the full sequence below for reference. I will post what happens after the sync completes and I do my first reboot, to find out if the RAID kicks out a drive again.
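
For anyone following along, here is the recipe as I now understand it from the various posts (a sketch; it assumes sdb is the good drive and sdc is the disk being added, and note the MBR wipe plus reboot comes before the sfdisk copy, since the partition table lives in the MBR):

Code: [Select]
# wipe any old MBR on the new disk, then reboot so nothing stale lingers
dd if=/dev/zero of=/dev/sdc bs=512 count=1
# after the reboot, copy the partition table over from the good drive
sfdisk -d /dev/sdb > out
sfdisk /dev/sdc < out
# add the new partitions into the mirrors and let the sync run
mdadm /dev/md1 -a /dev/sdc1
mdadm /dev/md2 -a /dev/sdc2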

The best part of all this is that the SME server is still humming along nicely while I'm doing all this stuff.

One more question while waiting on the sync: when I get access to this machine to swap drives, how can I reliably tell which physical drive is sda, b, c, or d?

Thanks again,
Dan
« Last Edit: October 28, 2008, 02:40:40 AM by bhamail »

Offline electroman00

Re: sme 7.3, 4 disks, Manual intervention may be required
« Reply #14 on: October 28, 2008, 02:55:39 AM »
Simply....

If your original RAID was sda & sdb and sda failed, then you would move sdb into the sda position
and insert a blank drive at sdb.

Select #5 from the server admin console and you're done....well, you have to wait for it to sync.

If sdb failed, then remove it, install a blank drive at sdb, and do the SME #5 sync.

SME will only sync the RAID to a blank drive automatically.
If the replacement drive contains a boot partition, then SME will not sync that drive, for good reasons.

Rule of thumb...

Never do destructive testing, formatting, or fdisking with more than one drive installed.

hth