Koozali.org: home of the SME Server
Obsolete Releases => SME Server 7.x => Topic started by: bhamail on October 25, 2008, 05:16:59 PM
-
I recently got a couple of emails about a degraded RAID array, but I'm not really sure how to repair it. Here's the info I saw listed in other posts. Thanks for any guidance!
(This server has 4 sata disks.)
# fdisk -l
Disk /dev/sda: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104391 fd Linux raid autodetect
/dev/sda2 14 48641 390604410 fd Linux raid autodetect
Disk /dev/sdb: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104391 fd Linux raid autodetect
/dev/sdb2 14 48641 390604410 fd Linux raid autodetect
Disk /dev/sdc: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdc doesn't contain a valid partition table
Disk /dev/sdd: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdd doesn't contain a valid partition table
Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
Disk /dev/md2: 399.9 GB, 399978790912 bytes
2 heads, 4 sectors/track, 97651072 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md2 doesn't contain a valid partition table
Disk /dev/dm-0: 397.7 GB, 397787791360 bytes
2 heads, 4 sectors/track, 97116160 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 2080 MB, 2080374784 bytes
2 heads, 4 sectors/track, 507904 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/dm-1 doesn't contain a valid partition table
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb2[1]
390604288 blocks [2/1] [_U]
md1 : active raid1 sdb1[1]
104320 blocks [2/1] [_U]
unused devices: <none>
# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.01
Creation Time : Wed May 23 23:45:49 2007
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Device Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Sat Oct 25 10:45:35 2008
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
UUID : b842337c:566cdf32:932646ab:ef088413
Events : 0.13108
Number Major Minor RaidDevice State
0 0 0 - removed
1 8 17 1 active sync /dev/sdb1
# mdadm --detail /dev/md2
/dev/md2:
Version : 00.90.01
Creation Time : Wed May 23 23:45:49 2007
Raid Level : raid1
Array Size : 390604288 (372.51 GiB 399.98 GB)
Device Size : 390604288 (372.51 GiB 399.98 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Sat Oct 25 11:13:33 2008
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
UUID : d4a287fb:892c031a:b69a18db:d2fe5e9c
Events : 0.18878483
Number Major Minor RaidDevice State
0 0 0 - removed
1 8 18 1 active sync /dev/sdb2
Thinking of RTFM'ing: is there a "noobie" RAID article you could recommend?
Thanks!
-
bhamail
There is a link at the top of the Forums to the Howtos; look for the one on RAID. It may be helpful.
-
I would also encourage you to verify the health of your disks and try to understand why the array broke before you add the failed disk back in. I've now modified the RAID entry to indicate this.
See also:
http://wiki.contribs.org/Monitor_Disk_Health
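For a quick first pass, something along these lines should do (the wiki page covers this in more detail; the device name is just an example, so repeat for each of sda through sdd):
# smartctl -H /dev/sda    # overall health self-assessment
# smartctl -A /dev/sda    # vendor attribute table; watch the WHEN_FAILED column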
-
Great info in those HowTo's (I can't believe I never noticed these). Thanks!
While running the recommended checks, I only saw one report that seemed problematic:
# smartctl -a /dev/sda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: WDC WD4000YR-01PLB0
Serial Number: WD-WMAMY1624489
Firmware Version: 01.06A01
User Capacity: 400,088,457,216 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 7
ATA Standard is: Not recognized. Minor revision code: 0x1d
Local Time is: Sat Oct 25 14:09:16 2008 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (10530) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 152) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 001 051 Pre-fail Always In_the_past 0
3 Spin_Up_Time 0x0007 225 224 021 Pre-fail Always - 5791
4 Start_Stop_Count 0x0032 100 100 040 Old_age Always - 133
5 Reallocated_Sector_Ct 0x0033 165 165 140 Pre-fail Always - 560
7 Seek_Error_Rate 0x000b 200 001 051 Pre-fail Always In_the_past 0
9 Power_On_Hours 0x0032 080 080 000 Old_age Always - 15061
10 Spin_Retry_Count 0x0013 065 063 051 Pre-fail Always - 128
11 Calibration_Retry_Count 0x0013 100 100 051 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 114
194 Temperature_Celsius 0x0022 117 101 000 Old_age Always - 35
196 Reallocated_Event_Count 0x0032 001 001 000 Old_age Always - 279
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0009 200 200 051 Pre-fail Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 15061 -
# 2 Short offline Completed without error 00% 9954 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Specifically, these bits jumped out at me:
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 001 051 Pre-fail Always In_the_past 0
...
7 Seek_Error_Rate 0x000b 200 001 051 Pre-fail Always In_the_past 0
...
Are those "WHEN_FAILED" -> "In_the_past" entries something to worry about? Only /dev/sda had such entries, none of the other drives showed anything failing.
Thanks again!
-
Hmm...I'm having some trouble understanding the HowTo.
I have four drives, so I'd expect RAID5 +1 from the howto that says:
4-6 Drives - Software RAID 5 + 1 Hot-spare
But mdstat shows only sdb in use. OK, so a couple of drives got kicked out of the array, but I'm not sure which drives to add back where (and I was expecting mdstat to show an md device as RAID5).
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb2[1]
390604288 blocks [2/1] [_U]
md1 : active raid1 sdb1[1]
104320 blocks [2/1] [_U]
unused devices: <none>
I started down the path of adding back sda (not sure this is correct) via:
# mdadm --add /dev/md1 /dev/sda1
then
# mdadm --add /dev/md2 /dev/sda2
but wouldn't that leave /dev/sdc and /dev/sdd unused???
I don't really know how the drives were set up initially (before the recent problems occurred), so I'm a little worried about rebuilding the arrays incorrectly. Thanks for your patience and help!
-
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 001 051 Pre-fail Always In_the_past 0
...
7 Seek_Error_Rate 0x000b 200 001 051 Pre-fail Always In_the_past 0
...
Are those "WHEN_FAILED" -> "In_the_past" entries something to worry about? Only /dev/sda had such entries, none of the other drives showed anything failing.
Doesn't look good to me, especially since it is also the disk that got kicked out.
See also: http://en.wikipedia.org/wiki/S.M.A.R.T. for a good description of these errors.
Also note parameter 196 indicates a number of sectors have been remapped. The disk is still alive but I think it is on its way out. Assuming sdb is clean, you can still add sda back in as it has recovered from its failures, but I would look at replacing sda as it is aging.
If sdb is showing issues then adding sda back in may fail.
-
I have four drives, so I'd expect RAID5 +1 from the howto that says:
4-6 Drives - Software RAID 5 + 1 Hot-spare
Your mdstat says you have RAID1. The fact it is still running on one disk says that too.
If you started with one or two disks and then added the others then those disks will be unused unless you did something to enable them.
-
When I built this machine it had all 4 disks and I installed SME 7.2 (and later upgraded to 7.3) so maybe that's why it is only RAID1? If so, it still seems odd that the md devices don't start from 0 (md0, md1 instead of md1, md2).
Anyway, I'm now having trouble adding sda back:
# mdadm --add /dev/md2 /dev/sda2
mdadm: hot added /dev/sda2
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sda2[2](F) sdb2[1]
390604288 blocks [2/1] [_U]
md1 : active raid1 sdb1[1]
104320 blocks [2/1] [_U]
unused devices: <none>
It fails right away.
I wonder if I did something bad??
I first tried adding back sda1 via:
mdadm --add /dev/md1 /dev/sda1
I checked mdstat and saw the "rebuilding..." message.
Then, (without waiting... :( ) I added sda2 via:
# mdadm --add /dev/md2 /dev/sda2
I checked mdstat and saw no "rebuilding...", but both newly added partitions had the (F) next to them. Yikes!
Dare I try adding one of the other drives (that I now suspect has never actually been used all this time)?
I really hope I haven't tanked my server.
-
No, you are OK for now. Don't panic. You are still running, right? :smile:
It would seem sda is sufficiently bad that it won't go back in. I suspect if you check your SMART stats you will see increased errors. You could run a self-test on the disk and see what it says.
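Something like this would start one (the long test runs in the background and, going by the recommended polling time shown above, takes a couple of hours on that 400 GB disk):
# smartctl -t long /dev/sda       # start an extended offline self-test
# smartctl -l selftest /dev/sda   # check the result once it has finished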
If either of the other two disks is identical to the first two in the RAID then you can easily add one in. If they are different then we need to know more about them (e.g. brand, size, and preferably model). In general if they are different but bigger then they can be added in but the total size will be the smaller of the two disks in the array.
Don't forget to run smartctl on the other disks to ensure they are ok.
If one of them is the same and checks ok then you can use mdadm to add the new disk into the array. resync to completion then shutdown and remove the dead drive.
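Roughly, and only as a sketch (assuming the good disk is /dev/sdb and the identical spare is /dev/sdc; double-check the device names before running anything):
# sfdisk -d /dev/sdb > sdb.layout   # dump the good disk's partition table to a file
# sfdisk /dev/sdc < sdb.layout      # write the same layout to the spare
# mdadm --add /dev/md1 /dev/sdc1    # add the small boot mirror member
# mdadm --add /dev/md2 /dev/sdc2    # add the main mirror member
# cat /proc/mdstat                  # watch the resync progress
Let each resync finish before doing anything else.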
If the disks are different, tell me what they are and I'll point you in the right direction.
BTW, only md1 and md2 are used now. There also used to be an md0 before SME7; I think this changed in SME7 with the use of LVM.
-
Sorry, I just checked your first post: they are all identical, so go for it. And it is clear sdc and sdd aren't being used.
-
bhamail
Are those "WHEN_FAILED" -> "In_the_past" entries something to worry about?
Could be a bad drive or a corrupt drive from a power outage if no UPS.
Indications are a corrupt drive; bad drives will usually have a lot of errors.
Suggest....disconnect all drives except for the good drive, mark the good drive.
Make sure you can boot to the good drive.
Then remove that drive and install only the suspected bad drive and perform non-destructive tests first.
If the drive fails tests, then reformat the bad drive using the MFG diag. utilities.
While it's formatting gently wiggle the drive and power cables.
Format should continue error free, failure indicates bad cabling.
That's a fairly new SATA drive with less than 2 years of run time, so you may have a bad cable/connection, controller, or motherboard.
If the motherboard has been running for some time then it's a good idea to check it for bulging capacitors (caps whose tops are not flat but bulging up).
http://en.wikipedia.org/wiki/Electrolytic_capacitor
After the format, run smart again on that drive.
Then remove the reformatted, bad drive....
Setup the good drive to sda and reboot with only that drive installed.
Check its RAID status with SME.
Then shutdown and install the reformatted drive as sdb.
Boot and check SME RAID; it should prompt you to sync the RAID.
Let it sync and keep an eye on it (smart check) for a few weeks.
If you can't format the bad drive or it fails SMART, then the drive is likely bad.
Let us know the results...
hth
I've added these important notes; read them before formatting any drive.
1. If you use SATA or SCSI drives, the drive devices may move around during boot, so you should not use /dev/sd? to find your drives. The kernel and/or bios assigns these names as it sees fit, so there's no guarantee that /dev/sda will always refer to the same physical device.
Therefore it's always a good idea to disconnect any other drive on the system when running diag. or formatting.
2. Before formatting any drives, and due in part to #1, it's a good idea to confirm whether you can boot from the suspected good (RAID) drive by installing only that drive and attempting the boot up.
3. Be sure you have identified the physical drives via bios settings and jumpers.
4. Run only non-destructive tests first, to verify the faulty drive (see the example after this list).
5. Double verify everything you have done before you format any drive.
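As one example of a non-destructive check, a read-only surface scan does not write to the disk (even so, point it at the right device); the MFG diag utilities mentioned above run roughly the same sort of read test from their own boot disk:
# badblocks -sv /dev/sdX    # read-only scan; -s shows progress, -v is verbose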
-
Indications are a corrupt drive, bad drives will usually have a lot of errors.
I would generally agree based on parameters 1 and 7 alone, but parameter 196 shows quite a lot of remapped sector attempts, hence my opinion. But what the heck, maybe you can rescue it.
I personally wouldn't trust anything with bad sectors. Just make sure the stats are clear if you are going to re-use.
-
I recently had a drive drop out of an array.
fdisk -l showed that the first partition ended on the same cylinder (13) as the start of the second partition (13).
I went through the motions of failing it then removing it and finally reformatting with the same boundaries as the good drive.
Added it back and it's still going.
fdisk -l as it is now...
[root@tiger ~]# fdisk -l
Disk /dev/hda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 13 104391 fd Linux raid autodetect
/dev/hda2 14 38913 312464250 fd Linux raid autodetect
Disk /dev/hdc: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hdc1 * 1 13 104391 fd Linux raid autodetect
/dev/hdc2 14 38913 312464250 fd Linux raid autodetect
----------------------snip------------------------------
and...
[root@tiger ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0] hdc2[1]
312464128 blocks [2/2] [UU]
md1 : active raid1 hda1[0] hdc1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
[root@tiger ~]#
Commands to fail, remove and add.
mdadm /dev/md1 -f /dev/hdc1 -r /dev/hdc1
mdadm /dev/md2 -f /dev/hdc2 -r /dev/hdc2
mdadm /dev/md2 -a /dev/hdc2
mdadm /dev/md1 -a /dev/hdc1
-
Thanks very much for these helpful replies!
I added in one of the unused drives (sdc), but I managed to skip what I think was a critical step:
1. Used "sfdisk ... > out; sfdisk .../sdc < out" from another post to create matching partions on sdc.
2. Got cautious since the disk had never really been used in the raid before, and did the "dd..." to wipe the MBR. (DID NOT remember to reboot after "dd...", as was suggested in yet a third post).
3. added sdc1 and sdc2 to md1 and md2 respectively, and let the synch complete.
Since I had another unused disk (sdd), I went ahead and did the same steps above (still no reboot though) to add sdd1,2 as spares for md1,2.
When I finally rebooted (wondering if it would ever come back up), the raid array kicked out the sdc disk (and sdd spares)!
Also, (as warned in other posts) the drives have been re-named after reboot - fdisk now shows sda, sdb, and sdc; but I know the former sda is toast, so sda must be referring to one of the previously unused drives now.
The disk originally known as sda seems to have gone so bad that it doesn't even show up any more in: fdisk -l
I'm guessing the raid kicked out the new disks because I never rebooted after doing the "dd...", so now I am doing:
1. Used "sfdisk ... > out; sfdisk .../sdc < out" to create matching partions on sdc, sda.
2. did NOT do the "dd..." to wipe the MBR, DID reboot after sfdisk just to be sure (and made sure fdisk -l showed valid partions on all drives - currently sda,b,c).
3. added sdc1 and sdc2 to md1 and md2 respectively, and letting the synch run.
Also added sda1,2 as spares for md1,2.
The synch is still grinding away. I will post what happens after the synch completes and I do my first reboot to find out if the raid kicks out a drive again.
The best part of all this is that SME Server is still humming along nicely while I'm doing all this stuff.
One more question while waiting on the synch: When I get access to this machine to swap drives, how can I reliably tell which physical drive is sda,b,c, or d?
Thanks again,
Dan
-
Simply....
If your original raid was sda & sdb and sda failed, then you would move sdb into the sda position
and insert a blank drive at sdb.
Select #5 from the server admin console and you're done....well, you have to wait for it to sync.
If sdb failed then remove and install a blank drive at sdb and SME #5 sync.
SME will only sync raid to a blank drive automatically.
If the replacement drive contains a boot partition then SME will not sync that drive, for good reasons.
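For reference, one common way to blank just the boot sector and partition table so the drive looks new (this destroys that disk's partition table, so triple-check the target device, and do it with only that drive installed, per the rule of thumb below):
# dd if=/dev/zero of=/dev/sdX bs=512 count=1   # zero the MBR and partition table
# reboot                                       # so the kernel forgets the old table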
Rule of thumb...
Never do destructive testing, formatting, or fdisking with more than one drive installed.
hth
-
If your original raid was sda & sdb and sda failed, then you would move sdb into the sda position
and insert a blank drive at sdb.
"position" is the problem. I don't know from looking at the drives which drive is sda, b, c, or d.
Although I only had a 2-drive RAID1 running, I actually always had 4 drives in the machine (see the prior postings where I discovered I had two drives that had never been doing anything all along...)
For example, in bad old IDE land, I could trace the cable to the IDE slot. Once I knew which IDE controller I was connected to, and if the drive was master or slave, I reliably knew which drive was /dev/hdX, etc....
I'm not in front of the machine, but I assume there are at least 2 sata controllers (if not 4) on the main board, and I'm not sure how to map from drive, to cable, to controller, and ultimately to /dev/sdX to identify which of the 4 physical drives went bad.
I'm not even sure if I can rely on the ordering of controllers/cables now that "sda" swapped with "sdc" or "sdd". I'm sure I'm missing something obvious...
-
Drive jumpers and BIOS settings, and with SCSI the adapter settings, determine what the boot drive is.
Also, the drive serial numbers confirm it.
-
Not sure what you mean about:
Drive jumpers and bios settings
, as I never had to change any drive or mb jumpers, nor force a boot disk in the bios.
But serial number - AHA! That's the obvious one. I can get the drive serial number via:
# smartctl -a /dev/sdX
...
=== START OF INFORMATION SECTION ===
...
Serial Number: ...
, and then just read the serial numbers from the physical drive label. Thanks Again!
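P.S. It looks like smartctl can print just that information section (model, serial number, firmware) without the whole report, which is quicker when the serial is all you want:
# smartctl -i /dev/sda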
-
No MB jumpers anymore thank god.
ATA and SCSI have drive jumpers for drive selection: master/slave for ATA, or 0-7 for SCSI.
SATA drives too, but those you position by plugging into the MB controller connectors, typically numbered 1-4 or 0-3, as you know.
They're not all on the same cable as with ATA and SCSI.
So SATA provides improved redundancy, unlike SCSI where all drives are on one cable and a bad cable can
wipe out an entire RAID array.
I never did think much of the one-cable concept with respect to RAID redundancy.
The cable becomes the weak link in the very thing RAID is supposed to overcome (drive failure).
All bios have drive boot settings, so you could boot to physical drive 2 instead of drive 0.
Something to pay attention to, else you could wipe a good drive.
Thus the advent of the Rule of Thumb, and there are two ways to learn that rule...the easy way or the hard way...!!
The S/N is a definitive ID with respect to system reporting.
JFYI
1. Used "sfdisk ... > out; sfdisk .../sdc < out" to create matching partions on sdc, sda.
2. did NOT do the "dd..." to wipe the MBR, DID reboot after sfdisk just to be sure (and made sure fdisk -l showed valid partions on all drives - currently sda,b,c).
3. added sdc1 and sdc2 to md1 and md2 respectively, and letting the synch run.
Even though you did all that, raid still thought there was no valid boot sector and proceeded to sync the drives to
be identical to each other.
The RAID system doesn't delete and re-init the partitions and copy files; it does a sector read/write/verify sync between the drives through the controller.
It's faster and less CPU intensive.
fdisk will not remove/zero the boot sector when creating/deleting partitions, whereas.... dd will.
The procedure calls for a blank drive because it's better to start with a zero'd out drive to reduce possible r/w/v errors
during the sync.
If Raid detects even one bit out of sync between drives it will force/flag a re-sync.
So it's not unusual for a sync to fail or a re-sync after a power outage even with a UPS.
The drives have to be bit for bit identical or it flags a sync.
In your case, the fact that raid began the sync says it didn't find a valid boot sector, else it would not have started the sync.
RAID pretends to be smarter than an IT tech installing a drive with good data into a RAID stack, and won't harm it in any such event.
Tip
Keep in mind it only takes approx. <10 milliseconds for the system to render a good data drive unusable.
That's less time than it takes for one to say....Oh Sh_t...!!
I use the drink & think procedure: go get a drink and think about what you're about to undertake.
Once you identify it via S/N, use a marker to ID the drive as sda, sdb etc., good or bad.
Failure to ID can lead to further problems down the road.
You might have noticed most all drives have a barcode & S/N on them.
Good to make note of the number before install.
Else.... can you say Oh Sh_t...!! :sad:
Just so you know..... I've never had to say.... Oh Sh_t...! :-P
HaHa....
-
Sorry, I should have added this to clarify.....
nor force a boot disk in the bios.
Bios setup DEL or F2 keys on boot up.
Takes you to the bios setup.
-
Bhamail,
glad this worked out for you and sorry to disappear. I travel a lot.
electroman00,
I've added the drive identification procedure using smartctl to the How-to for others. That's useful.
Also, I note you don't have wiki editing access so if you wish to write up some RAID best practices for the wiki, I would be happy to add it for you. You've given some very good advice above. Perhaps adding a bug for SME Server Documentation would be the best way.
-
Just wanted to add more info about how this all turned out for me:
I think sda was dying (but not quite dead yet), so at some point it managed to get back into the RAID1 array (likely due to a user error on my part).
Later, sda got bad enough that the machine would no longer boot from it. That's when I DID have to go into the bios and DISABLE the bad drive in order for the machine to boot. (I still have not been on site with the machine to swap out the bad drive...and it's way cool that I've been able to do all this remotely, with just one phone call to a helper to walk through the bios to disable sda).
Given my level of skills, does anyone think it madness to consider converting this machine with 4 SATA drives from RAID1 to RAID5? Is it really hard to do this, and/or worth the trouble?
Many Thanks again for all the great help!!!
Dan
-
bhamail
I would suggest you check the date on the suspected drive; if it's less than 3 years to the day,
then you should be able to obtain a new drive under warranty from WD.
You then need to run destructive testing on that drive via WD diagnostics software
and obtain the failure code to submit to WD for warranty.
hth