Koozali.org: home of the SME Server

defect disk drive in RAID

jekal

« on: August 27, 2010, 05:16:08 PM »
Hi,

I have received error emails.

First:
Quote
The following warning/error was logged by the smartd daemon:

Device: /dev/sda, Self-Test Log error count increased from 0 to 1

So I checked with smartctl, ran an offline test, and then got this message:

Quote
The following warning/error was logged by the smartd daemon:

Device: /dev/sda, 1 Offline uncorrectable sectors

Quote
[root@server ~]# smartctl -a /dev/sda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJ1KQ318199
Firmware Version: 1AA01109
User Capacity:    750,156,374,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Not recognized. Minor revision code: 0x52
Local Time is:    Fri Aug 27 17:02:34 2010 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x06)   Offline data collection activity
               was aborted by the device with a fatal error.
               Auto Offline Data Collection: Disabled.
Self-test execution status:      (  25)   The self-test routine was aborted by
               the host.
Total time to complete Offline
data collection:        (11081) seconds.
Offline data collection
capabilities:           (0x7b) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               Offline surface scan supported.
               Self-test supported.
               Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 186) minutes.
Conveyance self-test routine
recommended polling time:     (  20) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0007   084   084   011    Pre-fail  Always       -       5800
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       28
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10045
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       14179
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       28
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   071   068   000    Old_age   Always       -       521863197
194 Temperature_Celsius     0x0022   072   067   000    Old_age   Always       -       28 (Lifetime Min/Max 0/8475)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       65118827
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Offline             Completed: read failure       00%     13722         1465143145
# 2  Offline             Aborted by host               00%     13706         -
# 3  Short offline       Aborted by host               90%     13703         -
# 4  Short offline       Aborted by host               10%     13703         -
# 5  Short offline       Aborted by host               00%      5634         -
# 6  Extended offline    Aborted by host               00%       608         -

So I guess drive sda of my RAID1 set is faulty. (I have even had reboot problems in the past, where the system didn't come up on the first try. Half a year ago I had a degraded array, which I was able to fix with mdadm.)
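To pick the relevant rows out of a long smartctl report like the one above, a small filter can help. This is only a minimal sketch: the function name `flag_smart_attributes` and the set of watched attribute IDs (5, 196, 197, 198) are assumptions made for illustration, and the parser expects the ten-column attribute table layout shown in the quote.

```python
# Attribute IDs watched for media problems (an assumption of this sketch):
# 5 Reallocated_Sector_Ct, 196 Reallocated_Event_Count,
# 197 Current_Pending_Sector, 198 Offline_Uncorrectable.
WATCHED_IDS = {5, 196, 197, 198}


def flag_smart_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[fields[1]] = int(raw)
    return flagged
```

Fed the text of `smartctl -a /dev/sda`, it would return only the watched attributes whose raw counter is above zero, which makes a single bad sector easier to spot among the other rows.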

What puzzles me a bit is that mdadm reports no problem:
Quote
[root@server ~]# mdadm --detail --verbose /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Fri May 30 22:20:45 2008
     Raid Level : raid1
     Array Size : 104320 (101.89 MiB 106.82 MB)
    Device Size : 104320 (101.89 MiB 106.82 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Aug 25 21:05:46 2010
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 2054586d:4d07c956:5a22832e:59b02ab6
         Events : 0.5928

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

So the question is: is the hard drive defective or not?

Any ideas?

Jens

CharlieBrady

« Reply #1 on: August 27, 2010, 07:07:12 PM »
smartd says your drive has a problem. Either the md RAID1 hasn't noticed, or it did notice in the past, but you overrode it by using mdadm to re-add the drive to the array.

If your data and/or time is important to you, then replace the drive. If it's under warranty, send it back.

The md raid layer won't fail a disk just because smartd found a problem during a self-test.
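Because md and SMART track different things, it can be useful to read md's own view of each member separately from the SMART report. The following is a rough sketch, not a tool shipped with SME Server: the function name `md_member_states` is made up for illustration, and it assumes the member table layout printed by `mdadm --detail` in the quote above.

```python
def md_member_states(mdadm_detail_output):
    """Map each member device to the state md reports for it."""
    states = {}
    for line in mdadm_detail_output.splitlines():
        fields = line.split()
        # Member rows: Number, Major, Minor, RaidDevice, state words..., device path.
        if len(fields) < 6 or not all(f.isdigit() for f in fields[:4]):
            continue
        if fields[-1].startswith("/dev/"):
            states[fields[-1]] = " ".join(fields[4:-1])
    return states
```

A member that md still lists as "active sync" can nevertheless have a failed SMART self-test, which is the situation in this thread, so a clean result here is not evidence that the disk is healthy.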

jekal

« Reply #2 on: August 27, 2010, 07:16:32 PM »
Thx Charlie,

that's what I intend to do.
As far as I understand from several postings here, I just have to exchange the disk and use the "mirror repair" option in the admin menu after the reboot.
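Before pulling the disk, some people first mark its partitions as failed and remove them from the array by hand. Whether that is needed on SME Server is an assumption here, not something confirmed in this thread. The sketch below only builds the mdadm command lines for review and does not run them. The helper name `removal_commands` is hypothetical, and the array-to-partition mapping must be taken from your own `mdadm --detail` output; only /dev/md1 and /dev/sda1 appear above.

```python
def removal_commands(members):
    """Build (not run) mdadm commands that fail and then remove each member partition.

    members maps an md array to the partition on the disk being replaced.
    """
    commands = []
    for array, partition in sorted(members.items()):
        commands.append(f"mdadm {array} --fail {partition}")
        commands.append(f"mdadm {array} --remove {partition}")
    return commands


# Print the commands for review; nothing is executed.
for cmd in removal_commands({"/dev/md1": "/dev/sda1"}):
    print(cmd)
```

Printing the commands instead of executing them leaves a chance to check the device names first, since failing the wrong partition would take the healthy half of the mirror out of the array.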

Jens