Koozali.org: home of the SME Server
Obsolete Releases => SME Server 7.x => Topic started by: jekal on August 27, 2010, 05:16:08 PM
-
Hi,
I've received error emails.
First:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda, Self-Test Log error count increased from 0 to 1
So I checked with smartctl, ran an offline test, and got this message:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda, 1 Offline uncorrectable sectors
[root@server ~]# smartctl -a /dev/sda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD753LJ
Serial Number: S13UJ1KQ318199
Firmware Version: 1AA01109
User Capacity: 750,156,374,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Not recognized. Minor revision code: 0x52
Local Time is: Fri Aug 27 17:02:34 2010 CEST
==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x06) Offline data collection activity
was aborted by the device with a fatal error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 25) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: (11081) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 186) minutes.
Conveyance self-test routine
recommended polling time: ( 20) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 1
3 Spin_Up_Time 0x0007 084 084 011 Pre-fail Always - 5800
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 28
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 10045
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 14179
10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 28
13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age Always - 0
183 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
184 Unknown_Attribute 0x0033 100 100 099 Pre-fail Always - 0
187 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 2
188 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
190 Unknown_Attribute 0x0022 071 068 000 Old_age Always - 521863197
194 Temperature_Celsius 0x0022 072 067 000 Old_age Always - 28 (Lifetime Min/Max 0/8475)
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 65118827
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Offline Completed: read failure 00% 13722 1465143145
# 2 Offline Aborted by host 00% 13706 -
# 3 Short offline Aborted by host 90% 13703 -
# 4 Short offline Aborted by host 10% 13703 -
# 5 Short offline Aborted by host 00% 5634 -
# 6 Extended offline Aborted by host 00% 608 -
So I guess drive sda of my RAID1 set is faulty. (I even had reboot problems in the past: the system didn't come up on the first try. Half a year ago I had a degraded array, which I was able to fix with mdadm.)
What puzzles me a bit is that mdadm reports no problem:
[root@server ~]# mdadm --detail --verbose /dev/md1
/dev/md1:
Version : 00.90.01
Creation Time : Fri May 30 22:20:45 2008
Raid Level : raid1
Array Size : 104320 (101.89 MiB 106.82 MB)
Device Size : 104320 (101.89 MiB 106.82 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Wed Aug 25 21:05:46 2010
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 2054586d:4d07c956:5a22832e:59b02ab6
Events : 0.5928
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
So the question is: is the hard drive defective or not?
Any ideas?
Jens
-
smartd says your drive has a problem. Either the md RAID1 hasn't noticed, or it did notice in the past, but you overrode it by using mdadm to re-add the drive to the array.
If your data and/or time is important to you, then replace the drive. If it's under warranty, send it back.
The md raid layer won't fail a disk just because smartd found a problem during a self-test.
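To keep an eye on the drive until it is replaced, the sector-related counters can be pulled out of the attribute table. This is only a rough sketch: the function name smart_sector_summary is made up for illustration, and it assumes the ten-column attribute layout shown in the smartctl output above (ID in column 1, name in column 2, raw value in column 10).

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print name=raw_value for the
# sector-health attributes: 5 (Reallocated_Sector_Ct),
# 196 (Reallocated_Event_Count), 197 (Current_Pending_Sector)
# and 198 (Offline_Uncorrectable).
# Assumes the 10-column layout from the output quoted in this thread.
smart_sector_summary() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

# Usage on a live system (needs root and smartmontools):
#   smartctl -A /dev/sda | smart_sector_summary
```

In the output posted above, only Offline_Uncorrectable has a non-zero raw value (1). If any of these counters keeps climbing, that is one more reason not to wait with the replacement.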
-
Thanks Charlie,
that's what I intend to do.
As far as I understand from several postings here, I just have to exchange the disk and use the "mirror repair" in the admin menu after the reboot.
Jens
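For the disk swap described above, one cautious option is to mark the sda partitions as failed and remove them from their arrays before shutting down. This may not be required, since the admin menu may handle the rebuild without it. The sketch below only prints the mdadm commands so they can be reviewed first. The helper name print_removal_commands is hypothetical; md1 on sda1 comes from the mdadm output in this thread, while md2 on sda2 is an assumption about the second array and should be checked against /proc/mdstat.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands that mark each partition of one
# disk as failed and then remove it from its array.
# Arguments: the disk name, followed by array:partition-number pairs.
print_removal_commands() {
    disk=$1
    shift
    for pair in "$@"; do
        array=${pair%%:*}   # text before the colon, e.g. md1
        part=${pair#*:}     # text after the colon, e.g. 1
        echo "mdadm /dev/$array --fail /dev/$disk$part --remove /dev/$disk$part"
    done
}

# md1:1 is taken from the thread; md2:2 is an assumed second array.
print_removal_commands sda md1:1 md2:2
```

After checking the printed lines against the real array layout, they can be run as root one at a time, and the disk exchanged with the machine powered off.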