sme 7.3, 4 disks, Manual intervention may be required

bhamail

« Reply #15 on: October 28, 2008, 03:25:46 AM »

Quote

If your original raid was sda & sdb and sda failed, then you would move sdb into the sda position
and insert a blank drive at sdb.

"position" is the problem. I don't know from looking at the drives which drive is sda, b, c, or d.
Although I only had 2 drive RAID1 running, I actually always had 4 drives in the machine (see the prior postings where I discovered I had two drives that had never been doing anything all along...)
For example, in bad old IDE land, I could trace the cable to the IDE slot. Once I knew which IDE controller I was connected to, and if the drive was master or slave, I reliably knew which drive was /dev/hdX, etc....

I'm not in front of the machine, but I assume there are at least 2 sata controllers (if not 4) on the main board, and I'm not sure how to map from drive, to cable, to controller, and ultimately to /dev/sdX to identify which of the 4 physical drives went bad.
I'm not even sure if I can rely on the ordering of controllers/cables now that "sda" swapped with "sdc" or "sdd". I'm sure I'm missing something obvious...

electroman00

« Reply #16 on: October 28, 2008, 03:33:55 AM »

Drive jumpers and BIOS settings, and with SCSI the adapter settings, determine which drive is the boot drive.

Drive serial numbers also confirm which is which.

bhamail

« Reply #17 on: October 28, 2008, 03:41:13 AM »

Not sure what you mean about:

Quote

Drive jumpers and bios settings

, as I never had to change any drive or mb jumpers, nor force a boot disk in the bios.

But serial number - AHA! That's the obvious one. I can get the drive serial number via:

Code:

# smartctl -a /dev/sdX
...
=== START OF INFORMATION SECTION ===
...
Serial Number:    ...

, and then just read the serial numbers from the physical drive label. Thanks Again!
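
To check all four drives in one go, a small loop saves retyping. This is only a sketch: it assumes smartctl (from smartmontools) is installed, that it is run as root, and that the drives appear as /dev/sda through /dev/sdd. The helper names serial_from_info and list_serials are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the serial number smartctl reports for each drive.
# serial_from_info and list_serials are hypothetical helper names.

# Pull the value of the "Serial Number:" line out of smartctl output on stdin.
serial_from_info() {
    sed -n 's/^Serial Number:[[:space:]]*//p' | head -n 1
}

# Print "device serial" for every device given as an argument.
list_serials() {
    for dev in "$@"; do
        serial=$(smartctl -i "$dev" 2>/dev/null | serial_from_info)
        echo "$dev ${serial:-unknown}"
    done
}

# Usage (as root), assuming the drives are sda..sdd:
#   list_serials /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Each output line can then be matched against the serial number printed on the drive label.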

electroman00

« Reply #18 on: October 28, 2008, 08:32:53 AM »

No MB jumpers anymore, thank god.

ATA and SCSI drives have jumpers for drive selection: master/slave for ATA, or ID 0-7 for SCSI.

SATA drives are positioned instead by which MB controller connector you plug them into, typically numbered 1-4 or 0-3, as you know.
They're not all on the same cable as with ATA and SCSI.

So SATA provides better redundancy than SCSI, where all drives share one cable and a bad cable can
wipe out an entire RAID array.

I never did think much of the one-cable concept with respect to RAID redundancy.
RAID is supposed to overcome a single point of failure (a drive), and then the cable becomes the weak link in the concept.

Every BIOS has drive boot settings, so you could boot from physical drive 2 instead of drive 0.

Something to pay attention to, else you could wipe a good drive.

Thus the advent of the rule of thumb, and there are two ways to learn that rule...the easy way or the hard way...!!

The S/N is a definitive ID with respect to system reporting.

JFYI

Quote

1. Used "sfdisk ... > out; sfdisk .../sdc < out" to create matching partitions on sdc, sda.
2. did NOT do the "dd..." to wipe the MBR, DID reboot after sfdisk just to be sure (and made sure fdisk -l showed valid partitions on all drives - currently sda,b,c).
3. added sdc1 and sdc2 to md1 and md2 respectively, and letting the synch run.
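
For reference, the quoted steps might look roughly like this when spelled out. This is a sketch, not the exact commands that were run: it assumes sdb is the healthy array member and sdc is the drive being added, and plan_readd is a made-up helper that only prints the commands so they can be reviewed before anything is typed as root.

```shell
#!/bin/sh
# Sketch only: print the commands for re-adding a drive to the two RAID1 arrays.
# Nothing is executed here; review the output, then run the lines by hand as root.
# plan_readd is a hypothetical helper. "good" and "new" are device names such
# as sdb and sdc (assumed; confirm yours by serial number first).
plan_readd() {
    good="$1"
    new="$2"
    echo "sfdisk -d /dev/$good > parts.out"     # dump partition table of the good drive
    echo "sfdisk /dev/$new < parts.out"         # write the same layout to the new drive
    echo "mdadm /dev/md1 --add /dev/${new}1"    # add first partition to md1
    echo "mdadm /dev/md2 --add /dev/${new}2"    # add second partition to md2
    echo "cat /proc/mdstat"                     # watch the resync progress
}

plan_readd sdb sdc
```

Swapping the two arguments would overwrite the good drive's partition table, which is why the sketch prints instead of executing.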

Even though you did all that, raid still thought there was no valid boot sector and proceeded to sync the drives to
be identical to each other.

The RAID sync doesn't delete and re-initialise the partitions and then copy files; it does a sector-by-sector read/write/verify between the drives through the controller.
It's faster and less CPU intensive.

fdisk will not remove or zero the boot sector when creating or deleting partitions, whereas dd will.
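
As an illustration of the dd side of that, here is a minimal sketch that zeroes the first 512-byte sector of a target. Note that this sector also holds the partition table, so on a real disk the partitions go with it. zero_first_sector is a made-up helper name, and the rehearsal below runs against a scratch file so no real disk is touched.

```shell
#!/bin/sh
# Sketch: zero the first 512-byte sector (the MBR) of a target.
# zero_first_sector is a hypothetical helper name.
# conv=notrunc keeps a regular file from being truncated (as in the rehearsal
# below); on a real block device it makes no difference.
zero_first_sector() {
    dd if=/dev/zero of="$1" bs=512 count=1 conv=notrunc 2>/dev/null
}

# Rehearsal on a scratch file filled with random data:
img=$(mktemp)
dd if=/dev/urandom of="$img" bs=512 count=4 2>/dev/null
zero_first_sector "$img"
# Dump the first sector; it should now contain only zero bytes.
od -A d -t x1 -N 512 "$img" | head -n 3
rm -f "$img"
```

On a real drive the argument would be the whole-disk device of the NEW drive, confirmed by serial number first.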

The procedure calls for a blank drive because it's better to start with a zero'd out drive to reduce possible r/w/v errors
during the sync.

If RAID detects even one bit out of sync between drives, it will force/flag a re-sync.

So it's not unusual for a sync to fail, or for a re-sync to start after a power outage, even with a UPS.

The drives have to be bit for bit identical or it flags a sync.
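
The sync state of each array is visible in /proc/mdstat. As a rough sketch, the following lists arrays that are running with a missing member; degraded_arrays is a made-up helper name, and it relies on the usual mdstat layout where a healthy two-disk mirror shows [UU] and an underscore marks a missing or failed member.

```shell
#!/bin/sh
# Sketch: list md arrays whose member status shows a missing drive.
# In /proc/mdstat a healthy two-disk mirror shows [UU]; an underscore,
# as in [U_] or [_U], marks a missing or failed member.
# degraded_arrays is a hypothetical helper; it reads mdstat text on stdin.
degraded_arrays() {
    awk '
        # Remember which array the following lines describe.
        /^md[0-9]+ :/ { name = $1 }
        # A status field made of U and _ with at least one underscore.
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    '
}

# Usage on a live system:
#   degraded_arrays < /proc/mdstat
```

No output means every array has all of its members; during a rebuild, cat /proc/mdstat also shows the progress.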

In your case, the fact that raid began the sync says it didn't find a valid boot sector, else it would not have started the sync.

RAID tries to be smarter than an IT tech who installs a drive holding good data into a RAID stack, and won't harm it in any such event.

Tip

Keep in mind it only takes approx. <10 milliseconds for the system to render a good data drive unusable.

That's less time than it takes for one to say....Oh Sh_t...!!

I use the drink & think procedure: go get a drink and think about what you're about to undertake.

Once you identify via S/N, then use a marker to ID the drive sda, sdb etc. good or bad.

Failure to ID can lead to further problems down the road.

You might have noticed most all drives have a barcode & S/N on them.

Good to make note of the number before install.

Else.... can you say Oh Sh_t...!!

Just so you know..... I've never had to say.... Oh Sh_t...!

HaHa....

electroman00

« Reply #19 on: October 28, 2008, 11:05:29 AM »

Sorry, should have added this to clarify.....

Quote

nor force a boot disk in the bios.

Press the DEL or F2 key during boot-up.

That takes you to the BIOS setup.

christian

« Reply #20 on: November 01, 2008, 02:27:34 PM »

Bhamail,
glad this worked out for you and sorry to disappear. I travel a lot.

electroman00,
I've added the drive identification procedure using smartctl to the How-to for others. That's useful.

Also, I note you don't have wiki editing access so if you wish to write up some RAID best practices for the wiki, I would be happy to add it for you. You've given some very good advice above. Perhaps adding a bug for SME Server Documentation would be the best way.

bhamail

« Reply #21 on: November 02, 2008, 08:58:07 AM »

Just wanted to add more info about how this all turned out for me:

I think sda was dying (but not quite dead yet), so at some point it managed to get back into the RAID1 array (likely due to a user error on my part).
Later, sda got bad enough that the machine would no longer boot from it. That's when I DID have to go into the bios and DISABLE the bad drive in order for the machine to boot. (I still have not been on site with the machine to swap out the bad drive...and it's way cool that I've been able to do all this remotely, with just one phone call to a helper to walk through the bios to disable sda).

Given my level of skills, does anyone think it madness to consider converting this machine with 4 SATA drives from RAID1 to RAID5? Is it really hard to do this, and/or worth the trouble?

Many Thanks again for all the great help!!!

Dan

electroman00

« Reply #22 on: November 05, 2008, 12:31:05 AM »

bhamail

I would suggest you check the date on the suspect drive. If it's less than 3 years old to the day,
then you should be able to obtain a new drive under warranty from WD.

You then need to run destructive testing on that drive via WD diagnostics software
and obtain the failure code to submit to WD for warranty.

hth