Koozali.org: home of the SME Server
Obsolete Releases => SME Server 8.x => Topic started by: SchulzStefan on March 19, 2013, 02:52:19 PM
-
SME 8.0, up to date. I replaced two 500 GB SATA disks following this HOW-TO: http://wiki.contribs.org/Raid#Replacing_and_Upgrading_Hard_Drive_after_HD_fail. The new disks have a capacity of 1 TB each. The disks are identical. The server is up and running. No errors. RAID is clean.
I issued the following commands to resize the RAID to the bigger disks:
mdadm --grow /dev/md2 --size=max
pvresize /dev/md2
lvresize -l +100%FREE main/root
resize2fs /dev/mapper/main-root &
A cat /proc/mdstat shows
Personalities: [raid1]
md1: active raid1 sdb1[0] sda1[1]
104320 blocks [2/2] [UU]
md2: active raid1 sdb2[0] sda2[1]
976655552 blocks [2/2] [UU]
unused devices: <none>
fdisk -l shows:
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104384+ fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2 13 121601 976655647 fd Linux raid autodetect
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104384+ fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2 13 121601 976655647 fd Linux raid autodetect
Disk /dev/md2: 1000.0 GB, 1000095285248 bytes
2 heads, 4 sectors/track, 244163888 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md2 doesn't contain a valid partition table
Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
df -h shows:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/main-root
457G 129G 306G 30% /
/dev/md1 99M 46M 48M 50% /boot
none 2,0G 0 2,0G 0% /dev/shm
It seems to me that 500 GB are missing. Did I miss something?
Thanks for any hint.
stefan
-
I'm confused now.
mdadm --detail /dev/md2 | more
/dev/md2:
Version : 0.90
Creation Time : Fri Aug 8 17:01:14 2008
Raid Level : raid1
Array Size : 976655552 (931.41 GiB 1000.10 GB)
Used Dev Size : 976655552 (931.41 GiB 1000.10 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Tue Mar 19 15:25:40 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 7be080c3:58e3a9c4:55bdf7e0:ca9607bf
Events : 0.46776802
Number Major Minor RaidDevice State
0 8 18 0 active sync /dev/sdb2
1 8 2 1 active sync /dev/sda2
Does that mean the df command does not work on RAID arrays?
-
Googling around brings me to the following taken from ubuntuforums.org:
http://ubuntuforums.org/showthread.php?t=980541
mkfs -t ext3 /dev/md2
Does that mean that, before adding a new drive to the RAID, one has to build a filesystem on the new disk? It is not enough to dd if=/dev/sda1 of=/dev/sdb1 the bootsector? Again - did I miss something in the HOW-TO?
Both disks are in sync and both are able to boot the server.
I'm concerned about whether the disks are set up correctly. Could anybody give advice, please?
-
This may help: http://wiki.contribs.org/Hard_Disk_Partitioning
-
Does that mean that, before adding a new drive to the RAID, one has to build a filesystem on the new disk? It is not enough to dd if=/dev/sda1 of=/dev/sdb1 the bootsector? Again - did I miss something in the HOW-TO?
Were the new HDs unused prior to adding them to this system, i.e. had they ever been used in another system before this?
Where in that HOW-TO does it say to dd if=/dev/sda1 of=/dev/sdb1 the bootsector? That is only used on a single-drive RAID1 system, in the section http://wiki.contribs.org/Raid#Adding_another_Hard_Drive_Later_.28Raid1_array_only.29 - it is NOT part of http://wiki.contribs.org/Raid#Upgrading_the_Hard_Drive_Size
-
I'm still confused. It's very clear to me that in a system with only one disk, the disk has to be partitioned and formatted before it can be used. Now, in a RAID1 system, if one disk has to be replaced, all I read is to place a *new* disk in, add it to the RAID and you're done. *New* disk means to me *untouched*.
Do I have to build the RAID again? Meaning: first remove sdb, create a partition table following http://wiki.contribs.org/Hard_Disk_Partitioning, add it again to the RAID, and after a full sync do the same with the other disk? Would it be safe to do this on a live system? (Of course, I do have a backup.)
-
Now, in a RAID1 system, if one disk has to be replaced, all I read is to place a *new* disk in, add it to the RAID and you're done. *New* disk means to me *untouched*.
Yes, just as described here: http://wiki.contribs.org/Raid#Upgrading_the_Hard_Drive_Size
That's it, just follow the steps.
-
This is how I grow my disk space on a RAID1 system.
Usually I will have two brand new, equal-sized hard drives I want to put in the system.
Remove the internet and/or intranet cable connections from the computer.
Starting from a working RAID1 system:
Shut down the system.
Pull the smaller of the two working RAID1 drives out, or if both drives are the same size, pull one drive out.
The pulled drive is then stored for backup purposes and never used again unless I need to restore something from it.
Place a newer, bigger drive in the system.
Boot up the system.
Do not allow any users on the system or any internet activity to the computer.
Rebuild the new drive to RAID1 operating status.
Shut down the system.
Pull the older drive out.
Boot up the system.
Do not allow users on the system.
Expand the volume of the drive to the fullest size (see the command sketch below).
Shut down the system after the expansion on the newer, larger drive.
Put in the second new, larger hard drive.
Hook the internet and intranet connections back up to the computer.
Boot the computer.
Rebuild the RAID1 onto the new second drive.
You are done; the rebuilding process finishes in the background.
Do not reboot the computer until the RAID1 rebuild is complete.
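For the "expand the volume to the fullest size" step, here is a rough sketch of the commands on the standard SME 8 layout used in this thread (md2 holding the LVM volume group "main"; adjust device and volume names to your own system):
mdadm --grow /dev/md2 --size=max        # grow the md array to fill the new partition
pvresize /dev/md2                       # grow the LVM physical volume on top of it
lvresize -l +100%FREE main/root         # give all free extents to the root LV
resize2fs -p /dev/mapper/main-root      # grow the ext3 filesystem online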
-
Let's go over this in order; the order in which you showed things isn't quite right.
fdisk -l shows:
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
/dev/sda2 13 121601 976655647 fd Linux raid autodetect
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
/dev/sdb2 13 121601 976655647 fd Linux raid autodetect
This looks right. The two partitions that will be used in raid are just under 1T in size.
A cat /proc/mdstat shows
md2: active raid1 sdb2[0] sda2[1]
976655552 blocks [2/2] [UU]
This also looks right. Just under 1T.
We are missing a few steps here. There is nothing showing the pv, vg, or lv sizes.
Can you show the output of the following commands:
pvs
vgs
lvs
df -h shows:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/main-root
457G 129G 306G 30% /
This shows that at least the root LV hasn't been resized. Please post the output of the above commands so we can see where the expansion stopped (or broke). You are almost there; we just need to take care of the LVM stuff and then you can grow the filesystem.
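For reference, a sketch of the remaining steps once the pvs/vgs/lvs output is known (assuming the usual md2 -> "main" volume group layout used on this system; check the actual names in your output first):
pvresize /dev/md2                       # make the PV use the whole grown md2
lvresize -l +100%FREE main/root         # extend the root LV into the free space
resize2fs -p /dev/mapper/main-root      # grow the filesystem to fill the LV
df -h                                   # confirm the new size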
-
@slords
Here's the output of:
pvs:
PV VG Fmt Attr PSize PFree
/dev/md2 main lvm2 a-- 931,41G 0
vgs:
VG #PV #LV #SN Attr VSize VFree
main 1 2 0 wz--n- 931,41G 0
lvs:
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
root main -wi-ao 929,47G
swap main -wi-ao 1,94G
@TerryF
I followed the How-To http://wiki.contribs.org/Raid#Upgrading_the_Hard_Drive_Size. This is the result after syncing:
fdisk -l:
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104384+ fd Linux raid autodetect
***Partition 1 does not end on cylinder boundary.***
***This error has been reported since I synced the new (unpartitioned and unformatted) 1TB disk with the old 500GB disk.***
/dev/sda2 13 121601 976655647 fd Linux raid autodetect
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104384+ fd Linux raid autodetect
***Partition 1 does not end on cylinder boundary.***
***This error was not reported before adding the disk to the RAID.***
/dev/sdb2 13 121601 976655647 fd Linux raid autodetect
Disk /dev/md2: 1000.0 GB, 1000095285248 bytes
2 heads, 4 sectors/track, 244163888 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md2 doesn't contain a valid partition table
Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
df -h still shows
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/main-root
457G 138G 297G 32% /
/dev/md1 99M 46M 48M 50% /boot
none 2,0G 0 2,0G 0% /dev/shm
@purvis
Could you show the output of df -h?
-
...
Expand the volume of the drive to the fullest size
...
I wonder whether that step could be automated (it's the trickiest step). If SME Server starts to boot up and has only one drive with partitions that don't fill the drive, put up a "please wait" banner and resize the volume...
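Purely to illustrate the idea, a hypothetical (untested) sketch of such a boot-time check, assuming the standard md2 / "main" layout; this is not part of SME Server:
#!/bin/bash
# Hypothetical grow-on-boot helper (sketch only, untested).
pvresize /dev/md2                                          # no-op if the PV already fills md2
FREE=$(vgs --noheadings -o vg_free_count main | tr -d ' ')
if [ "$FREE" -gt 0 ]; then                                 # only resize if there are free extents
    lvresize -l +100%FREE main/root
    resize2fs /dev/mapper/main-root
fi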
-
What does that mean to me?
-
What does that mean to me?
Nothing. Just a suggestion for possible (but probably difficult) improvement to the software.
-
Nothing. Just a suggestion for possible (but probably difficult) improvement to the software.
I'd like to know: did I miss something? If not, is the RAID set up correctly? If it is, why is df reporting the wrong size, and what does that mean for any software running on the server? And if it is not set up correctly, how do I set up the RAID properly?
-
You said earlier "It is not enough to dd if=/dev/sda1 of=/dev/sdb1 the bootsector?"
Did you do that?
-
Yes.
-
I do not like the idea of anything automatically increasing the size of my drives.
I have installed sme on smaller drives and then installed larger drives for reasons of my own.
If somebody wants to run a bash script to increase a volume, one that can be downloaded, I have no issue with that.
-
SchulzStefan
It is not enough to dd if=/dev/sda1 of=/dev/sdb1 the bootsector?
Where did you get that command from?
man dd implies it would copy sda1 to sdb1; I am not sure what that achieves.
According to
http://wiki.contribs.org/Raid#Reusing_Hard_Drives
the command to "zero" a drive & prepare it for reuse is
dd if=/dev/zero of=/dev/sdx bs=512 count=1
after doing the above you MUST reboot so that the empty partition table gets read correctly.
(replace sdx with sda, sdb, sdc etc)
If a drive is brand new & NEVER used then you should be able to add it to a RAID array in sme server & it will be added & sync'd by the system after manually selecting to do so from the console menu. You should NOT need to issue commands of any sort to either prepare the drive, or format it, or add a filesystem or add partitions etc; that will all be done by the system after manually selecting to add the drive from the console menu (log in as admin to access this, or type console from the root command line prompt).
If the drive has been used (either a brand new or second hand drive) in Windows, Linux or wherever, then you MUST clear the drive & prepare it for use BEFORE adding it to an array.
The command to do this is
dd if=/dev/zero of=/dev/sdx bs=512 count=1
after doing the above you MUST reboot so that the empty partition table gets read correctly.
where /dev/sdx is the location of the drive to be cleared eg /dev/sdc
Then you can proceed to rebuild your array using the larger drives:
add both larger disks
check that the array is in sync
run cat /proc/mdstat to check this before proceeding
then follow the procedure to enlarge the array (see the sketch below).
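A quick way to confirm the sync has finished before enlarging anything (the same commands used elsewhere in this thread):
cat /proc/mdstat                        # both arrays should show [UU] and no resync progress line
mdadm --detail /dev/md2                 # "State : clean" means the array is fully in sync
Then run the wiki's grow sequence (mdadm --grow, pvresize, lvresize, resize2fs).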
-
pvs:
PV VG Fmt Attr PSize PFree
/dev/md2 main lvm2 a-- 931,41G 0
vgs:
VG #PV #LV #SN Attr VSize VFree
main 1 2 0 wz--n- 931,41G 0
lvs:
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
root main -wi-ao 929,47G
swap main -wi-ao 1,94G
All of this shows that things have expanded as they should. The only step left is to actually expand the root filesystem.
What happens when you run the following command:
resize2fs -p /dev/mapper/main-root
I don't know why the & is on the final command on the wiki. The actual expansion of the filesystem shouldn't take that long and there is no reason to put it in the background. If this is an sme8 system then the only reason that command should fail would be if your root filesystem is ext4 instead of ext3. If this is the case then you should replace the resize2fs command above with resize4fs.
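If you are not sure whether the root filesystem is ext3 or ext4, a quick way to check before picking resize2fs or resize4fs (standard commands, nothing SME-specific):
df -T /                                               # prints the filesystem type of /
tune2fs -l /dev/mapper/main-root | grep -i features   # ext4 lists features such as extent, uninit_bg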
Sorry it has taken me so long to get back to you.
-
@mary
If a drive is brand new & NEVER used then you should be able to add it to a RAID array in sme server & it will be added & sync'd by the system after manually selecting to do so from the console menu. You should NOT need to issue commands of any sort to either prepare the drive, or format it, or add a filesystem or add partitions etc; that will all be done by the system after manually selecting to add the drive from the console menu (log in as admin to access this, or type console from the root command line prompt).
That's what I did first. Both disks were in sync, but the last one I added was not able to boot after removing sda. To test this, I unplugged sda from the motherboard and changed sdb to sda. The synced drive didn't boot.
Therefore I followed this: http://wiki.contribs.org/Raid#Adding_another_Hard_Drive_Later_.28Raid1_array_only.29. I copied the boot partition to the new drive with dd if=/dev/sda1 of=/dev/sdb1. Before doing this I followed http://wiki.contribs.org/Hard_Disk_Partitioning. After syncing and rebooting, both drives are now able to boot. In my understanding that's the way it should be: if one disk fails, the other should be bootable. This is the case right now.
@slords
Thanks for coming back. The system is up and everybody is able to work, so time doesn't really matter. Here's the result of the command:
resize2fs -p /dev/mapper/main-root
resize2fs 1.39 (29-May-2006)
Filesystem at /dev/mapper/main-root is mounted on /; on-line resizing required
Performing an on-line resize of /dev/mapper/main-root to 243654656 (4k) blocks.
The filesystem on /dev/mapper/main-root is now 243654656 blocks long.
I'll reboot the system and check df -h again.
-
@slords
Here's the output after the reboot:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/main-root
915G 141G 729G 17% /
/dev/md1 99M 46M 48M 50% /boot
none 2,0G 0 2,0G 0% /dev/shm
mdadm --detail /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Fri Aug 8 17:01:14 2008
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Used Dev Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Sun Mar 24 09:30:10 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : b5f1b131:fe27265a:85dfe98f:3fb577a2
Events : 0.19506
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 1 1 active sync /dev/sda1
mdadm --detail /dev/md2
/dev/md2:
Version : 0.90
Creation Time : Fri Aug 8 17:01:14 2008
Raid Level : raid1
Array Size : 976655552 (931.41 GiB 1000.10 GB)
Used Dev Size : 976655552 (931.41 GiB 1000.10 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Sun Mar 24 09:50:33 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 7be080c3:58e3a9c4:55bdf7e0:ca9607bf
Events : 0.46794156
Number Major Minor RaidDevice State
0 8 18 0 active sync /dev/sdb2
1 8 2 1 active sync /dev/sda2
fdisk -l
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 13 104384+ fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2 13 121601 976655647 fd Linux raid autodetect
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104384+ fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2 13 121601 976655647 fd Linux raid autodetect
Disk /dev/md2: 1000.0 GB, 1000095285248 bytes
2 heads, 4 sectors/track, 244163888 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md2 doesn't contain a valid partition table
Disk /dev/md1: 106 MB, 106823680 bytes
2 heads, 4 sectors/track, 26080 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
Seems the size is now correct. Thank you so far. There are still errors reported with the fdisk -l command. Is further investigation necessary?
-
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/main-root
915G 141G 729G 17% /
/dev/md1 99M 46M 48M 50% /boot
none 2,0G 0 2,0G 0% /dev/shm
This looks good. You should now have access to your extra space.
Seems the size is now correct. Thank you so far. There are still errors reported with the fdisk -l command. Is further investigation necessary?
fdisk -l tries to find partitions on all devices. We don't care about all devices. Just about the two hard drives, and they look good.
-
Thanks to all for staying with me.
stefan
-
Well - I've got a few problems since resizing the system.
Two days ago the server stopped working. The error on the console was "i/o error ext3 journal". The server was bootable and resynced automatically. In the morning hours today, during an affa backup, the server died again. On restart the machine dropped to the console (Ctrl-D to boot up, or root password for maintenance) with an inconsistent filesystem, asking to perform an fsck. Fsck found a lot of errors, which have been corrected.
After booting again, sdb2 had been removed from md2. I added the disk back with mdadm --add /dev/md2 /dev/sdb2. The server is up and syncing now.
I think there must be something wrong. In my experience with SME Server since 5.x, always running RAID1, I don't remember having problems like this. Is it possible to find the reason why the server is running so unstably, or should I use my backup to rebuild the server from scratch?
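To keep an eye on the resync and look for the cause, a few standard checks (nothing SME-specific):
watch cat /proc/mdstat                                 # shows resync progress and estimated finish time
mdadm --detail /dev/md2                                # after the resync, State should read "clean"
grep -iE 'ext3|md2|sdb' /var/log/messages | tail -50   # look for I/O errors around the crash times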
Some more information about the disks:
smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.18-348.3.1.el5PAE] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EFRX-68JCSN0
Serial Number: WD-WCC1U0647211
LU WWN Device Id: 5 0014ee 207ef4bfd
Firmware Version: 01.01A01
User Capacity: 1.000.204.886.016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Mar 28 12:19:46 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (13680) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 156) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x30bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 140 137 021 Pre-fail Always - 3983
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 30
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 310
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 30
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 15
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 14
194 Temperature_Celsius 0x0022 111 107 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 278 hours (11 days + 14 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 08 8a 96 03 e0 Error: IDNF at LBA = 0x0003968a = 235146
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 08 8a 96 03 e0 08 8d+14:10:43.909 WRITE DMA
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
And the other one:
smartctl -a /dev/sdb
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.18-348.3.1.el5PAE] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EFRX-68JCSN0
Serial Number: WD-WCC1U0641728
LU WWN Device Id: 5 0014ee 2b29a02e9
Firmware Version: 01.01A01
User Capacity: 1.000.204.886.016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Mar 28 12:20:48 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (13320) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 152) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x30bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 137 135 021 Pre-fail Always - 4141
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 237
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 9
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 5
194 Temperature_Celsius 0x0022 112 110 000 Old_age Always - 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1 occurred at disk power-on lifetime: 219 hours (9 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
10 51 50 02 57 04 e0 Error: IDNF at LBA = 0x00045702 = 284418
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 50 02 57 04 e0 08 07:00:40.145 WRITE DMA
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Both drives are brand new, originally packed and sealed, bought from my local dealer.
-
After replacing the disks and having the trouble that the server died twice, I also checked the BIOS settings of the machine. I found that the onboard hardware RAID was enabled. I'm quite sure that I had it disabled before changing the disks. I disabled the RAID again. The server is up and running. No problems so far. I'll report in a few days whether the server is running stably.
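One thing that may be worth checking, assuming the dmraid tool is installed (a hypothetical suggestion, not something from the HOW-TO): the onboard controller may have written its own RAID metadata to the disks while the BIOS RAID was enabled, and leftover metadata can confuse the system.
dmraid -r                       # list any BIOS/fake-RAID metadata found on the disks
# dmraid -r -E /dev/sdX         # erase such metadata from a disk - only with a verified backup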
-
Server is still up and running. Some more information:
LOG before I replaced the disks:
Mar 13 13:35:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 66 to 64
Mar 13 13:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 60 to 63
Mar 13 14:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 64 to 63
Mar 13 14:35:54 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 62
Mar 13 16:05:54 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 62
Mar 13 18:35:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
Mar 13 18:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
Mar 13 19:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 62
Mar 13 19:05:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 63
Mar 13 19:35:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 63
Mar 13 19:35:54 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 62
Mar 13 20:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 63 to 62
Mar 13 21:05:54 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
Mar 13 21:05:54 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
Mar 13 21:35:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 62
Mar 13 21:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 62
Mar 14 00:35:54 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 58
Mar 14 00:35:54 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 60
Mar 14 01:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 58 to 57
Mar 14 01:05:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 60 to 59
Mar 14 01:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 59 to 58
Mar 14 02:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 57 to 58
Mar 14 02:05:54 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 58 to 59
Mar 14 03:05:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 59 to 60
Mar 14 04:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 58 to 59
Mar 14 04:35:56 smartd[2842]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 10 Spin_Retry_Count changed from 195 to 196
Mar 14 05:05:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 60 to 61
Mar 14 05:35:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 59 to 60
Mar 14 05:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 62
Mar 14 05:35:55 smartd[2842]: Device: /dev/sdc [SAT], SMART Usage Attribute: 9 Power_On_Hours changed from 89 to 88
Mar 14 07:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
Mar 14 12:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 60
Mar 14 13:35:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 60 to 59
Mar 14 14:35:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 59 to 60
Mar 14 14:35:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 60 to 61
Mar 14 21:05:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 61 to 62
Mar 14 22:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 60 to 66
Mar 14 22:05:53 smartd[2842]: Device: /dev/sdb [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 66
Mar 14 23:05:53 smartd[2842]: Device: /dev/sda [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 66 to 65
LOG after changing the disks:
Mar 28 18:18:26 smartd[2672]: Device: /dev/sda [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 28 19:48:25 smartd[2672]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 28 20:18:25 smartd[2672]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Mar 28 23:48:25 smartd[2672]: Device: /dev/sda [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
---
Mar 31 00:48:26 smartd[2672]: Device: /dev/sda [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 31 01:18:25 smartd[2672]: Device: /dev/sda [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Mar 31 01:48:25 smartd[2672]: Device: /dev/sda [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 31 01:48:25 smartd[2672]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 31 03:48:25 smartd[2672]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Mar 31 04:18:26 smartd[2672]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Mar 31 05:18:25 smartd[2672]: Device: /dev/sdb [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Mar 31 06:48:25 smartd[2672]: Device: /dev/sda [SAT], SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
I'm not familiar with smartd. Does this mean anything for the stability of the server, or is further investigation needed?
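One low-impact thing to try: the smartctl output above shows no self-tests have ever been logged, so running the drives' own self-tests might tell more (standard smartctl usage; repeat for /dev/sdb):
smartctl -t short /dev/sda         # quick test, about 2 minutes on these drives
smartctl -t long /dev/sda          # full surface scan, roughly 2.5 hours on these drives
smartctl -l selftest /dev/sda      # read the results when the test has finished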
-
I have no explanation - since April 01 there have been no more messages from smartd. The server is up and running stably.
-
After replacing the disks and having the trouble that the server died twice, I also checked the BIOS settings of the machine. I found that the onboard hardware RAID was enabled. I'm quite sure that I had it disabled before changing the disks. I disabled the RAID again. The server is up and running. No problems so far. I'll report in a few days whether the server is running stably.
A long time ago I was arguing for using only SW RAID rather than HW RAID when available (in another thread I really do not remember now), and Charlie Brady asked me "why not enable both of them".
I said I had heard about problems with this... but just could not find any examples to show him.
Now I have them. Thank you!
I hope you read this Charlie! :D