Koozali.org: home of the SME Server
Obsolete Releases => SME Server 7.x => Topic started by: Tillebeck on August 17, 2008, 10:59:11 PM
-
Hi
I ran into the DegradedArray mail, but the server is still running with no noticeable problems.
I did read this page:
http://wiki.contribs.org/Raid
There is an example with two disks, and it seems it should be more or less obvious which partition is the faulty one ;-) Unfortunately that does not hold for me, so I hope one of you guys will point me in the right direction.
I get this output (single disk only)
[root@hp600 ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdc2[0]
245007232 blocks [2/1] [U_]
md1 : active raid1 hdc1[0] hdd1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
[root@hp600 ~]#
From the link above I guess hdc2 is the one that should be re-added. But just to compare with a similar healthy server, I logged into a test server and ran the same command. There both md1 and md2 were listed as [U_]. So now I am wondering if that is more a difference between single- and multiple-disk servers?...
BR. Anders
-
Tillebeck
In your first example, hdd is the broken device.
The healthy server you are looking at also appears to have problems.
On an IDE system you should be seeing something like this:
Personalities : [raid1]
md2 : active raid1 hda2[0] hdc2[1]
80308864 blocks [2/2] [UU]
md1 : active raid1 hda1[0] hdc1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
On a SATA system like this:
Personalities : [raid1]
md2 : active raid1 sda2[0] sdb2[1]
312464128 blocks [2/2] [UU]
md1 : active raid1 sda1[0] sdb1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
-
Tillebeck,
What Mary is trying to say is that your drives are chained incorrectly on the IDE cables in your machine. The BIOS is probably set
incorrectly as well, forcing it to boot from hdc (HDD-2 in the BIOS). Once you have that straightened out -
Searching the forums will turn up many different commands which attempt to repair or reinitialize the RAID.
A recent discovery may also help: power down the machine from the console, disconnect the power from the drive which is having
difficulty, power the machine back up and go back into the console. After entering the "manage raid" screen, simply exit and
shut the system down once again, reconnecting the faulty drive. After starting the machine up, go back into the console and "manage raid";
this time it will ask you if you'd like to re-add the drive to the system. Select OK, then wait a couple of minutes and check its progress
by re-entering "manage raid".
Hope this helps.........
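For reference, most of the commands you will find in those forum threads boil down to a short mdadm sequence. A minimal sketch, assuming (as in the first post) that the dropped member is /dev/hdd2 and the degraded array is /dev/md2 - substitute whatever your own /proc/mdstat shows:
# if the partition is still listed in the array but flagged faulty (F), remove it first
mdadm /dev/md2 --remove /dev/hdd2
# add the partition back; the kernel resyncs it from the good member
mdadm /dev/md2 --add /dev/hdd2
# the rebuild progress shows up here
cat /proc/mdstat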
-
Unfortunately that does not hold for me, so I hope one of you guys will point me in the right direction.
I get this output (single disk only)
[root@hp600 ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdc2[0]
245007232 blocks [2/1] [U_]
md1 : active raid1 hdc1[0] hdd1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
[root@hp600 ~]#
From the link above I guess hdc2 is the one that should be re-added.
That is incorrect. hdd2 is the partition which should be part of /dev/md2 but isn't. hdc2 is correctly functioning as part of the array. You should look through your logs and discover why hdd2 was kicked out of the array. You can re-add the partition, but perhaps what you should do is to replace the drive, which might be failing.
You also have your two hard drives on /dev/hdc and /dev/hdd, when they would perform better on /dev/hda and /dev/hdc - i.e. as masters on the two IDE channels.
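To help decide between re-adding and replacing, the drive's SMART data is worth a look first. A small sketch, assuming the suspect drive is /dev/hdd and that smartctl (from smartmontools) is available on the system:
# the drive's own overall health verdict
smartctl -H /dev/hdd
# full attribute dump; watch Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable
smartctl -a /dev/hdd
If those counters are climbing, replacing the drive is the safer option; if they are clean, re-adding the partition is usually enough.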
-
Thanks for the help.
I looked at the wiki and at this thread:
http://forums.contribs.org/index.php?topic=39738.0
I ended up doing:
[root@hp600 ~]# mdadm /dev/md2 -a /dev/hdd2
Until now it seems fine. So far so good.
But Charlie Brady pointed out that my disks are set up incorrectly, or at least connected to the wrong IDE cables. Can I just power down the machine, reorder the cables and boot the server again?
BR. Anders
P.S.
I have another (not important) server that gives me the output below. It is fine with me just to reinstall it, but could it be fixed in any way with two broken partitions?
[root@ronja ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0]
245007232 blocks [2/1] [U_]
md1 : active raid1 hda1[0]
104320 blocks [2/1] [U_]
unused devices: <none>
[root@ronja ~]#
-
Tillebeck
That looks like a whole drive has failed and needs to be replaced.
-
Thanks.
I cannot understand the output. I probably have to google a "RAID for dummies" article to get a better understanding.
BR. Anders
-
Tillebeck
RAID1 has two drives, in your case hda & something else, perhaps hdc (it could be hdb, hdc or hdd).
If both drives are connected as Masters on separate channels then they will be hda & hdc, which is usually the optimal fail-safe configuration.
Mirror image partitions hda1 & hdc1 form the boot partition md1 of the RAID array
Mirror image partitions hda2 & hdc2 form the data partition md2 of the RAID array
For all intents and purposes the RAID array looks like one drive to the system, which has a boot partition md1 and a data partition md2
md2 : active raid1 hda2[0]
245007232 blocks [2/1] [U_]
md1 : active raid1 hda1[0]
104320 blocks [2/1] [U_]
In your case, only 1 of the 2 member partitions in each array is active - the [2/1] part.
The active member is on hda in both cases.
The two partitions of the second drive have disappeared - the [U_] part -
which is highly indicative of a total drive failure.
A good drive would look like
Personalities : [raid1]
md2 : active raid1 hda2[0] hdc2[1]
80308864 blocks [2/2] [UU]
md1 : active raid1 hda1[0] hdc1[1]
104320 blocks [2/2] [UU]
Also see
http://wiki.contribs.org/Raid
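If the /proc/mdstat summary is still hard to read, mdadm can print the same information per array and per partition. A brief sketch - the device names here are only examples, use the ones from your own system:
# list each member of the array, its state, and whether the array is degraded
mdadm --detail /dev/md2
# inspect one partition's raid superblock to see which array it belongs to and its role
mdadm --examine /dev/hda2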
-
Do a fdisk -l to see what drives and partitions you have.
[root@tiger ~]# fdisk -l
Disk /dev/hda: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 13 104391 fd Linux raid autodetect
/dev/hda2 14 38913 312464250 fd Linux raid autodetect
Disk /dev/hdc: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hdc1 * 1 13 104391 fd Linux raid autodetect
/dev/hdc2 14 38913 312464250 fd Linux raid autodetect
Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
Disk /dev/md2: 319.9 GB, 319963267072 bytes
2 heads, 4 sectors/track, 78116032 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md2 doesn't contain a valid partition table
Disk /dev/dm-0: 317.7 GB, 317760471040 bytes
2 heads, 4 sectors/track, 77578240 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 2080 MB, 2080374784 bytes
2 heads, 4 sectors/track, 507904 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/dm-1 doesn't contain a valid partition table
[root@tiger ~]#
[root@c3 ~]# fdisk -l
Disk /dev/hda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 13 104391 fd Linux raid autodetect
/dev/hda2 14 9729 78043770 fd Linux raid autodetect
Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
Disk /dev/md2: 79.9 GB, 79916695552 bytes
2 heads, 4 sectors/track, 19510912 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md2 doesn't contain a valid partition table
Disk /dev/dm-0: 79.6 GB, 79658221568 bytes
2 heads, 4 sectors/track, 19447808 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 201 MB, 201326592 bytes
2 heads, 4 sectors/track, 49152 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/dm-1 doesn't contain a valid partition table
[root@c3 ~]#
[root@tiger ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0] hdc2[1]
312464128 blocks [2/2] [UU]
md1 : active raid1 hda1[0] hdc1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
[root@tiger ~]#
[root@c3 ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0]
78043648 blocks [2/1] [U_]
md1 : active raid1 hda1[0]
104320 blocks [2/1] [U_]
unused devices: <none>
[root@c3 ~]#
Two different machines, one with one disk and one with two.
-
--
-
--
-
Electroman, you are right. It doesn't make sense. And right now I do not know why I wrote "single disk".
- There are two disks in the server
- They are ATA disks and they are on the same cable (I think)
- Both disks were in when the server was installed.
- There are both floppy and CD-ROM in both machines
I am quite sure that both servers have two disks. But I am getting in doubt about the second one... But it is not important. I will pick it up some day, drive it to the office, replace the faulty disk (if there are two disks in it) and reinstall it with SME8.
Status for the first server:
After running:
[root@hp600 ~]# mdadm /dev/md2 -a /dev/hdd2
I get this:
[root@hp600 ~]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdd2[1] hdc2[0]
245007232 blocks [2/2] [UU]
md1 : active raid1 hdc1[0] hdd1[1]
104320 blocks [2/2] [UU]
unused devices: <none>
So it seems to be fine again. Now I just have to try to move the cables so the disks end up as hda and hdc.
thanks a lot for your help :-)
BR. Anders
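One note for anyone following along: the resync of a 245 GB partition takes a while, so it is worth watching until both arrays show [UU] again. A small sketch of how that might be done:
# refresh the raid status every few seconds until the rebuild finishes
watch -n 5 cat /proc/mdstat
# or check the state and rebuild percentage of one array directly
mdadm --detail /dev/md2
As for moving the cables: the arrays are assembled from the raid superblocks on the partitions rather than from the device names, so the mirror itself should survive the drives becoming hda and hdc - just make sure the BIOS is set to boot from one of them afterwards.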
-
--
-
Agree :-)
SME8 will just be for testing on a separate server - and a bit of local development. I am using some software that needs PHP 5.2.1, and since SME 7.3 does not (out of the box) work with PHP 5, I am using a few development hosting accounts at different web hosts. It is just so much more convenient to do development on a local server, so SME8 will just be for fun, for testing and maybe even bug reports if I should come across anything.
I have one old server (old old old) that I tried SME8 with. But it loses its internet connection when I reinstall with SME8 or upgrade to SME8 (upgrading a fresh 7.3).
So I have to find another server to try out SME8. Or maybe just a newer PCI Ethernet card will do.
-
- They are ATA disks and they are on the same cable (I think)
That's EVIL.
If the IDE channel fails, your data is gone..
When you work with software RAID1 and IDE/ATA disks, the drives MUST be on different channels, and preferably as masters..
Ciao
Stefano
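If you are unsure how the drives are cabled, the device names themselves already tell you: hda/hdb are master/slave on the first IDE channel, hdc/hdd are master/slave on the second. On a 2.6 kernel with the classic IDE driver you can also check it directly - a rough sketch, paths may vary slightly between kernels:
# the symlinks show which channel (ide0/ide1) each drive hangs off
ls -l /proc/ide/
# model string of a given drive, useful for telling them apart
cat /proc/ide/hda/model
# drive identification from the drive itself, including geometry and DMA modes
hdparm -i /dev/hda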
-
--
-
if ide channel fails, your data is gone.. Maybe
having 2 hds in RAID1 on a single IDE channel means that a failure on that channel is data corruption 99% of the time
and, of course, 2 hds on the same IDE channel are not the best choice for speed.
hd MUST can be on different channels and preferably as master on both channels.. or on the same channel.
if we are talking about a test server or home server, OK.
the hds can differ in size, type, everything..
but I think we are talking about a production server with valuable data..
Some day you'll realize nothing is absolute except for death and taxes...!!
I've already realized it.. don't worry
Well maybe we can add one more, the world absolutely didn't need George W Bush.
this is a bit OT, but I agree 1000% ;-)
Ciao
Stefano
-
--
-
Hi
I have 2 disks on separate ATA IDE cables.
I've been reading about the Soft RAID messages and what they mean and would like to ask for confirmation about this:
I recently got a degraded array message.
My RAID says:
Current RAID status:
│
│ Personalities : [raid1]
│ md1 : active raid1 hda1[0] hdc1[1]
│ 104320 blocks [2/2] [UU]
│ md2 : active raid1 hdc2[1]
│ 59947136 blocks [2/1] [_U]
│ unused devices: <none>
│
│
│ Only some of the RAID devices are unclean.
│
│ Manual intervention may be required.
Am I correct to say that (hdc2) is my problem drive and should be replaced ?
From what I read the [UU] means both are alive, however if both are alive I'm not sure how to interpret the second drive saying [_U] ? Does that mean it's down ?
Please advise ? Thanks
-
Am I correct to say that (hdc2) is my problem drive and should be replaced ?
no.. hdc(1,2) is the good one.. hda has a problem..
check your /var/log/messages and search for hda2
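A sketch of what that log check might look like (log rotation aside, the relevant messages should be in /var/log/messages or one of its rotated copies):
# md and kernel messages mentioning the dropped partition
grep hda2 /var/log/messages
# raid events in general (disk failures, degraded arrays, resyncs)
grep -i raid /var/log/messages | tail -50
# low-level IDE errors reported for the drive itself
grep "hda:" /var/log/messages | tail -50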
HTH
Ciao
Stefano
-
--
-
electroman00, you wrote:
md2 partition on hda failed, notice hda1[0] is missing on md2
It should be hda2[0] that is missing from md2, right?
-
--
-
I'm not sure I understand this
md1 : active raid1 hda1[0] hdc1[1]
104320 blocks [2/2] [UU]
Shows both drives sync'd on the md1 partition.
I only have 2 drives, so how can both drives be sync'd on the md1 partition?
But I think I do understand this now: hda2[0] has failed, which is the md2 [_U] - the underscore showing failed. But is this a failed partition or a failed drive? Or does it mean the same thing?
Am I understanding this correctly: md1 showing 2/2 partitions, and md2 showing 2/1, meaning 1 of 2 partitions?
Thanks for all the help. Sorry I'm having trouble understanding this.
-
--
-
electroman00- Yes, I have the CD-ROM on the primary slave.
I'm getting confused about this.
How can hda1[0] show as sync'd but hda2[0] is missing? Isn't hda1[0] mirroring to hda2[0], which is missing or failed?
I thought that the first drive hda with 2 partitions was mirroring or duplicated to hdc with the same 2 partitions ?
Please excuse my ignorance
Also I'll run those Disk Health tests you linked to and post back on that also.
Thanks
-
--
-
--
-
--
-
I'm not sure I understand this
I only have 2 drives, so how can both drives be sync'd on the md1 partition?
But I think I do understand this now: hda2[0] has failed, which is the md2 [_U] - the underscore showing failed. But is this a failed partition or a failed drive? Or does it mean the same thing?
Am I understanding this correctly: md1 showing 2/2 partitions, and md2 showing 2/1, meaning 1 of 2 partitions?
Thanks for all the help. Sorry I'm having trouble understanding this.
You are confusing the differences between drives, partitions and raid devices. For simplicity, we will use your setup as the example here.
Think of it this way:
-hda and hdc are drives
-hdaX and hdcX are partitions
-md1 and md2 are raid devices
Two or more drives can be mirrored together to construct a raid device (SME does NOT use this concept).
Two or more partitions can be mirrored together to construct a raid device (this is how SME is configured).
In your SME configuration:
-Partition hda1 and partition hdc1 are mirrored together to construct a raid device called md1
-Partition hda2 and partition hdc2 are mirrored together to construct a raid device called md2
A raid device can still operate with only 1 partition active. Your machine is doing this:
hda1 + hdc1 = md1
<non working partition> + hdc2 = md2
Your machine has 1 partition (hda2) that is out of sync, not an entire drive. You still have your original 2 raid devices (md1 and md2) working, but md2 does not have redundancy any more, so if hdc (or hdc2) goes bad, you will lose your data.
Is hda2 bad??? Most likely not. There are a few things I have learned over the years.
1-Cheap cables cause raid devices to go out of sync, always use high quality 80 wire cables.
2-Cheap power supplies cause raid devices to go out of sync, always use good power supplies.
3-Bad memory can cause sync issues, check your memory.
4-Power spikes and sags cause raid devices to go out of sync, ALWAYS use battery backups to keep the power level constant.
5-IBM (now Hitachi) IDE drives don't stay in sync well. Don't use them.
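For what it's worth, once you are satisfied the drive itself is healthy, restoring redundancy here is the same one-liner used earlier in the thread, just applied to this layout. A minimal sketch, assuming hda2 is the member missing from md2 as shown above:
# add the out-of-sync partition back into the degraded array
mdadm /dev/md2 -a /dev/hda2
# md2 should return to [UU] once the resync completes
cat /proc/mdstat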
-
Well, certainly no one could disagree if I said the raid reporting is convoluted to some vast degree.
A simple table format might have served to overcome the confusion the current report presents on first viewing.
The ability to convolute something is a favorite pastime of the Linux world.
It lends itself to the main reason why there aren't more Linux users than M$.
No need to excuse ignorance; on the other hand there is always a need to excuse Stupidity.
So you're not excused, however the possibility exists that you may be excused in the future.
Only time will tell. LOL
hth
electroman00, stick to the topic at hand and keep comments like this to yourself.
Your next off topic comment will be removed and if you continue, you will be banned.
-
--
-
--
-
electroman00- Funny :lol:
pfloor- I understand now that SME creates my raid from 2 partitions on each drive.
Let me consider this out loud again just to be sure I understand this.
For example in my case (of course I'm not positive if this is the case), let's say hda1[0] is the boot partition which is synced to hdc1[1]. In this case the SME system is OK, because md1 checks out, so the first partition on the first drive is sync'd with the first partition on the second drive.
Second, let's say md2 is the data partition etc., which is the second partition on the first drive (hda2[0]) sync'd to the second partition of the second drive (hdc2[1]).
So in this case, for md1, the [UU] shows that both the first partition of the first drive and the first partition of the second drive are OK.
And md2 shows [_U], which indicates the second partition of the first drive has failed and the second partition of the second drive is good.
I think I understand now.
Thanks to all, this is a great help, and please correct me if I'm wrong.
I'll move to testing the drive now using the wiki linked here. Thanks
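For reference, the drive tests can also be run by hand with smartctl. A small sketch, assuming the suspect drive is /dev/hda:
# quick (roughly 2 minute) electrical/mechanical self-test
smartctl -t short /dev/hda
# thorough surface scan; the drive reports its own estimated duration
smartctl -t long /dev/hda
# once a test has finished, read back the results
smartctl -l selftest /dev/hda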
-
Here is the smart report in email to admin:
The following warning/error was logged by the smartd daemon:
Device: /dev/hda, 103 Offline uncorrectable sectors
and this:
The following warning/error was logged by the smartd daemon:
Device: /dev/hda, 13 Currently unreadable (pending) sectors
And this from the smart report:
=== START OF INFORMATION SECTION ===
Device Model: Maxtor 6Y060L0
Serial Number: Y2P6KRYE
Firmware Version: YAR41BW0
User Capacity: 61,492,838,400 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Sat Nov 8 09:04:27 2008 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 181) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 31) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
3 Spin_Up_Time 0x0027 224 223 063 Pre-fail Always - 4981
4 Start_Stop_Count 0x0032 253 253 000 Old_age Always - 41
5 Reallocated_Sector_Ct 0x0033 253 253 063 Pre-fail Always - 1
6 Read_Channel_Margin 0x0001 253 253 100 Pre-fail Offline - 0
7 Seek_Error_Rate 0x000a 253 029 000 Old_age Always - 0
8 Seek_Time_Performance 0x0027 252 245 187 Pre-fail Always - 52170
9 Power_On_Minutes 0x0032 161 161 000 Old_age Always - 247h+58m
10 Spin_Retry_Count 0x002b 253 252 157 Pre-fail Always - 0
11 Calibration_Retry_Count 0x002b 253 252 223 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 253 253 000 Old_age Always - 44
192 Power-Off_Retract_Count 0x0032 253 253 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 253 253 000 Old_age Always - 0
194 Temperature_Celsius 0x0032 253 253 000 Old_age Always - 37
195 Hardware_ECC_Recovered 0x000a 253 252 000 Old_age Always - 543
196 Reallocated_Event_Count 0x0008 253 253 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 0
198 Offline_Uncorrectable 0x0008 253 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0008 199 199 000 Old_age Offline - 0
200 Multi_Zone_Error_Rate 0x000a 253 252 000 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 253 119 000 Old_age Always - 26
202 TA_Increase_Count 0x000a 253 252 000 Old_age Always - 0
203 Run_Out_Cancel 0x000b 253 252 180 Pre-fail Always - 0
204 Shock_Count_Write_Opern 0x000a 253 252 000 Old_age Always - 0
205 Shock_Rate_Write_Opern 0x000a 253 252 000 Old_age Always - 0
207 Spin_High_Current 0x002a 253 252 000 Old_age Always - 0
208 Spin_Buzz 0x002a 253 252 000 Old_age Always - 0
209 Offline_Seek_Performnce 0x0024 179 179 000 Old_age Offline - 0
99 Unknown_Attribute 0x0004 253 253 000 Old_age Offline - 0
100 Unknown_Attribute 0x0004 253 253 000 Old_age Offline - 0
101 Unknown_Attribute 0x0004 253 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
So I did have some power outages, but I am not sure if this could cause this type of failure or if it means I should replace the drive.
Anyhow, I don't understand the message other than degraded raid, and it looks to me like a hard drive failure.
What do you think ?
-
Agent86
Here's something really funny.
Check the date on the drive; if it's more than 38 months old (3 yrs 2 months), get a hammer
and take all your frustrations out on that drive so nobody can recover data from it, and then send it to the scrap yard.
Could have saved a whole thread if I knew it was a MAXTOR drive.
Shame on me, I should have asked.
That 60 gig is probably clocking in at > 4 years old.
Seagate or WD drives - use them for 25 years and they are the best.
Yes, Seagate bought Maxtor a few years ago, but I'm not going to be a guinea pig for Maxtor any more
even if god bought them.
What's that about lipstick on a pig....still a pig...LOL
I've grown quite fond of Maxtor drives if you haven't noticed. LOL
BTW they don't make good hockey pucks either, I tried that.
Just in case you don't have one: http://www.computerhammer.com/
Have a good one...
-
electroman00-
I understand you have had bad experiences with Maxtor; however, I've had bad experiences with Seagate over the past 20 years - in fact I've never had a Seagate drive last more than 1 year for some reason. So I stopped using them and went mostly with Western Digital, however lately those have failed early as well.
My opinion is that all of the drives are about the same to me. I can't say that I've had much luck with Seagate, however I know that many people use them and like them.
Anyhow, this Maxtor was used when I got it, about 3 years old already, and I used it for about 3+ years myself, so I can't complain too much about it. Also I have an additional SquareTrade warranty on my stuff, so I still have it covered under warranty. LOL, that is one good thing about eBay I do like.
Anyhow, thanks for all the help. I'll be looking to upgrade the server and perhaps ditch this old one.
-
Agent86
Tried to AIM ya. Got some server info for ya.
-
Got it thanks,
-
This thread has gone (way) off topic, locking thread.