Koozali.org: home of the SME Server

HDD Failure? Confusion

Offline jimgoode

HDD Failure? Confusion
« on: November 08, 2010, 05:20:40 PM »
When I came in this morning the server (running in server and gateway mode) was hung. The only message on the console said it was unable to write a journal entry.

None of the consoles would let me shut down or reboot, so I had to power the system off. On reboot, the primary drive (I'm running primary and secondary drives in software RAID) was not recognized. I swapped the drives and tried another reboot; both drives were recognized, but the boot would not progress beyond the "uncompressing kernel" message. I put the primary back in as HDD0, left the secondary out, and rebooted, running a file system check as the system came back up. The system is now up and running.

Can anyone explain the symptoms I experienced?

Additionally, my system had rebooted at 4:11 am this morning for no apparent reason. I cannot find any cron jobs or other events that would trigger a reboot. Also, if power had been cut completely, this particular machine would not have come back on.

TIA,
Jim

Offline janet

Re: HDD Failure? Confusion
« Reply #1 on: November 09, 2010, 12:09:22 AM »
jimgoode

The possibilities are endless.
Hard disk failure, other hardware failure, power glitches, a faulty UPS, a hacker penetrating your system, too many processes overloading the system (i.e. not enough RAM), or a very high spam and virus load in incoming messages, just to name a few. Multiple events could have happened.

First, run smartctl long tests on your drives.
Look at log files for signs of what was happening prior to the problem occurring.

Please consider this: your question is like asking a car mechanic by phone to tell you what is wrong with your car without telling him the make, model, specification, recent repair history, how many mods the car has, etc.!
A lot more information is needed in order to be able to provide meaningful or specific answers.
« Last Edit: November 09, 2010, 12:13:17 AM by mary »
Please search before asking, an answer may already exist.
The Search & other links to useful information are at top of Forum.

Offline jimgoode

Re: HDD Failure? Confusion
« Reply #2 on: November 10, 2010, 05:48:55 PM »
Mary, Thank you for your suggestion.

Hardware: Newly acquired Dell PowerEdge 650 rack mount server.
Software: SME 7.5.1, patches up to date with the exception of 3 received this week.
HDDs: 2 x Seagate Barracuda 7200.8 300GB IDE/PATA, running RAID 1, separate IDE channels.

The HDDs were previously running in a generic tower that began shutting down randomly every 1-3 days. I replaced the power supply, which did not fix the problem. I then acquired the new hardware, installed the original Seagate 300GB HDDs, reconfigured the server, and everything ran fine for over a week. Then the UPS failed. I replaced the UPS with a new one, and the server ran for 2 days and then hung.
That's when I started this thread.

I was only able to get the server to boot up by removing the 2nd HDD and running with a single HDD. The server is running that way, now, and has been up for 2 days.

Acting on your suggestion, I enabled SMART (it had not previously been enabled). I have run both short and long tests on the HDD and everything indicates PASSED or Completed without error.

I have taken the 2nd drive and mounted it in a case as an external USB drive. I have not been able to enable SMART on the 2nd drive. It may have something to do with it being an ATA drive pretending to be SCSI in the external USB case. The smartctl command complains about a bad response to IEC mode page. I've reread the man pages and tried other options without success.
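
For reference, the SMART steps described above could be scripted roughly as in the sketch below. It only builds the smartctl command lines and runs one when asked. The options `-s on`, `-t long`, `-l selftest`, `-H` and `-d sat` are standard smartmontools options, but whether `-d sat` helps here is an assumption: it depends on the smartmontools version installed on the server and on whether the USB enclosure's bridge chip passes ATA commands through. The helper names `smartctl_cmd` and `run`, and the device paths, are illustrative only.

```python
import subprocess


def smartctl_cmd(device, action, usb_bridge=False):
    """Build a smartctl argv list for one step (hypothetical helper)."""
    flags = {
        "enable": ["-s", "on"],              # turn SMART on for the drive
        "long_test": ["-t", "long"],         # start the extended self-test
        "selftest_log": ["-l", "selftest"],  # show self-test results
        "health": ["-H"],                    # overall health verdict
    }
    cmd = ["smartctl"]
    if usb_bridge:
        # Assumption: the enclosure supports SCSI-to-ATA translation (SAT).
        cmd += ["-d", "sat"]
    return cmd + flags[action] + [device]


def run(cmd):
    """Run a command and return its text output (smartctl needs root)."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for action in ("enable", "long_test", "selftest_log"):
        print(" ".join(smartctl_cmd("/dev/hda", action)))
    print(" ".join(smartctl_cmd("/dev/sdb", "health", usb_bridge=True)))
```

If the enclosure does not pass SMART commands through at all, moving the drive back onto an IDE channel for testing is the fallback.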

Taking a really close look at the 2nd drive (the one that was supposed to be a mirror of the 1st), I see that the most recent file on that disk is dated 2009-05-16 03:00:49. That's about 1.5 years ago. As I dig through /var/log/messages I see that as far back as April 24, 2009, drive 1 was being recognized as 300GB and drive 2 was being recognized as 33.8GB. I feel really stupid now. I should have seen this when I first installed these 2 drives back in 2007. I remember having problems getting both drives initially partitioned and formatted as 300GB drives, and I sincerely thought I had solved the problem. Obviously, I did not. I'm now wondering how the system was able to run for so long.

Output from several commands follows:
[root@s01 log]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/main-root
                     286219752  67416204 204264388  25% /
/dev/md1                101018     25193     70609  27% /boot
none                   1167960         0   1167960   0% /dev/shm
/dev/sda1            480688980    106564 456164824   1% /media/usbdisk
s02:/home/usr2/BUILD/Math_Work/QA_Review
                       2482560   2232368    124088  95% /mnt/s02/QA_Review
/dev/sdb2             32138888  24463608   6042680  81% /mnt/usbdisk2

[root@s01 log]# fdisk /dev/hda (drive1, boot drive in server case)
Command (m for help): p

Disk /dev/hda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          13      104391   fd  Linux raid autodetect
/dev/hda2              14       36481   292929210   fd  Linux raid autodetect

[root@s01 log]# fdisk /dev/sdb (drive2 in the external usb case)
Command (m for help): p

Disk /dev/sdb: 33.8 GB, 33820284928 bytes
255 heads, 63 sectors/track, 4111 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          13      104391   fd  Linux raid autodetect
/dev/sdb2              14        4078    32652112+  fd  Linux raid autodetect
/dev/sdb3            4079        4111      265072+  fd  Linux raid autodetect
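
The sizes in the two fdisk listings above can be cross-checked with a little arithmetic. The sketch below is only an illustration using the numbers printed by fdisk; the function name `chs_bytes` is made up for the example. Because fdisk counts whole cylinders, the computed totals come out slightly below the byte counts it reports.

```python
SECTOR_BYTES = 512


def chs_bytes(cylinders, heads=255, sectors_per_track=63):
    """Capacity implied by the cylinder/head/sector geometry fdisk prints."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


if __name__ == "__main__":
    # (cylinders, bytes reported by fdisk), copied from the listings above
    drives = {
        "hda": (36481, 300069052416),
        "sdb": (4111, 33820284928),
    }
    print("bytes per cylinder:", chs_bytes(1))
    for name, (cylinders, reported) in drives.items():
        print(name, "geometry:", chs_bytes(cylinders), "reported:", reported)
```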

After rereading the RAID documentation, I see I have probably upset that process by just physically removing the 2nd HDD. By that, I mean that cat /proc/mdstat yields:

[root@s01 log]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0]
      292929088 blocks [2/1] [U_]

md1 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]

unused devices: <none>

I'm thinking that the "_" implies the server may still be looking for the mirror.
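
In /proc/mdstat, `[2/1]` means the array is configured for 2 devices and 1 is active, and in `[U_]` each `U` is a working member while `_` marks a missing or failed one, so both arrays above are running degraded. A minimal sketch of reading that status programmatically follows; the function name `degraded_arrays` is made up for the example, and the sample text is the output shown above.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active, flags)} for arrays missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        head = re.match(r"^(md\d+)\s*:", line)
        if head:
            current = head.group(1)
            continue
        # Status line looks like: "292929088 blocks [2/1] [U_]"
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            flags = status.group(3)
            if active < expected or "_" in flags:
                result[current] = (expected, active, flags)
            current = None
    return result


SAMPLE = """\
Personalities : [raid1]
md2 : active raid1 hda2[0]
      292929088 blocks [2/1] [U_]

md1 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    for name, (expected, active, flags) in sorted(degraded_arrays(SAMPLE).items()):
        print(f"{name}: {active} of {expected} members active [{flags}]")
```

On a live system the text would come from reading /proc/mdstat instead of the sample string.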
[root@s01 log]# cat /etc/mdadm.conf

# mdadm.conf written out by anaconda
DEVICE partitions
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=21e38650:58d0a772:9bcc2561:b0f7787b
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=582bf6d6:72ca7e22:269b5ec9:11ab0382

[root@s01 log]# cat /proc/partitions
major minor  #blocks  name

   3     0  293036184 hda
   3     1     104391 hda1
   3     2  292929210 hda2
   9     1     104320 md1
   9     2  292929088 md2
 253     0  290783232 dm-0
 253     1    2031616 dm-1
   8     0  488358912 sda
   8     1  488351871 sda1
   8    16   33027622 sdb
   8    17     104391 sdb1
   8    18   32652112 sdb2
   8    19     265072 sdb3
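
The /proc/partitions listing above also shows why the 2nd drive cannot serve as a mirror in its current state: a RAID 1 member has to be at least as large as the partition it mirrors. A rough sketch of that check, using a subset of the rows above, might look like this; `parse_partitions` and `can_mirror` are hypothetical helper names, and the comparison is deliberately simplified.

```python
def parse_partitions(text):
    """Map device name -> size in 1 KiB blocks from /proc/partitions text."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name
        if len(fields) == 4 and fields[2].isdigit():
            sizes[fields[3]] = int(fields[2])
    return sizes


def can_mirror(sizes, source, candidate):
    """True if the candidate partition is at least as large as the source."""
    return sizes[candidate] >= sizes[source]


SAMPLE = """\
major minor  #blocks  name

   3     0  293036184 hda
   3     1     104391 hda1
   3     2  292929210 hda2
   8    16   33027622 sdb
   8    17     104391 sdb1
   8    18   32652112 sdb2
"""

if __name__ == "__main__":
    sizes = parse_partitions(SAMPLE)
    for src, cand in (("hda1", "sdb1"), ("hda2", "sdb2")):
        verdict = "large enough" if can_mirror(sizes, src, cand) else "too small"
        print(f"{cand} as mirror of {src}: {verdict}")
```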

So, it looks like I may have some server cleanup to do. I will also begin working on repartitioning and reformatting drive2 so that it is recognized at its full 300GB. I guess this could be a bad drive that stands no hope of reaching its stated potential.

Thanks again for your advice.
Jim

Offline cactus

Re: HDD Failure? Confusion
« Reply #3 on: November 10, 2010, 06:39:36 PM »
Quote
After rereading the RAID documentation, I see I have probably upset that process by just physically removing the 2nd HDD. By that, I mean that cat /proc/mdstat yields:

[root@s01 log]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0]
      292929088 blocks [2/1] [U_]

md1 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]

unused devices: <none>

I'm thinking that the "_" implies the server may still be looking for the mirror.
[root@s01 log]# cat /etc/mdadm.conf

Are you sure the "_" was not there before you started? I think the journal issue might be a symptom of problems with your RAID. Did you receive e-mails in the admin mailbox? If a RAID issue occurs, you should have received an e-mail.
Be careful whose advice you buy, but be patient with those who supply it. Advice is a form of nostalgia, dispensing it is a way of fishing the past from the disposal, wiping it off, painting over the ugly parts and recycling it for more than its worth ~ Baz Luhrmann - Everybody's Free (To Wear Sunscreen)

Offline janet

Re: HDD Failure? Confusion
« Reply #4 on: November 10, 2010, 06:52:11 PM »
jimgoode

Download a drive manufacturer's diagnostic CD from the Internet (e.g. UBCD; search for it), install that 2nd (possibly faulty) drive in a server on its own, boot from UBCD (or whichever CD you chose), and run the full diagnostic tests on the drive.

I would not even consider using that drive until fully tested and verified to be in good working order.


Quote
Taking a really close look at the 2nd drive (the one that was supposed to be a mirror of the 1st), I see that the most recent file on that disk is dated 2009-05-16 03:00:49. That's about 1.5 years ago. As I dig through /var/log/messages I see that as far back as April 24, 2009, drive 1 was being recognized as 300GB and drive 2 was being recognized as 33.8GB. I feel really stupid now. I should have seen this when I first installed these 2 drives back in 2007. I remember having problems getting both drives initially partitioned and formatted as 300GB drives, and I sincerely thought I had solved the problem. Obviously, I did not. I'm now wondering how the system was able to run for so long.

It was probably just running on one drive for ages.

Offline jimgoode

Re: HDD Failure? Confusion
« Reply #5 on: November 10, 2010, 06:58:48 PM »
cactus,
With only one drive installed I receive emails containing 'A DegradedArray event has been detected on md device /dev/md1.' As best I can recall, I have always received these emails every time the system boots up. These messages occurred even when I had both drives installed.

Mary,
I am preparing to test the drive, now. I agree that the system has probably been running in single drive mode since 4/15/2009.

Thanks,
Jim

Offline cactus

Re: HDD Failure? Confusion
« Reply #6 on: November 10, 2010, 07:00:56 PM »
Quote
With only one drive installed I receive emails containing 'A DegradedArray event has been detected on md device /dev/md1.' As best I can recall, I have always received these emails every time the system boots up. These messages occurred even when I had both drives installed.

As it turns out, you may have been receiving those messages because something has been wrong all along. Hope you get things sorted. One tip: do not ignore those mails next time, but I guess you already know that by now ;-).

Offline purvis

Re: HDD Failure? Confusion
« Reply #7 on: November 12, 2010, 10:24:38 AM »
Search for serverstatus in forum

Offline MSmith

Re: HDD Failure? Confusion
« Reply #8 on: November 12, 2010, 08:30:32 PM »
Assuming this is an important machine, you might consider replacing *both* those drives as they're getting along in years.  A 320-gig PATA drive can be had in the U.S. for about $60.  Given that no or very few computers are made these days with PATA interfaces, it'd probably be wise to get your PATA drives while you can.
...

Offline purvis

Re: HDD Failure? Confusion
« Reply #9 on: November 14, 2010, 05:44:59 PM »
MSmith,
And buy more than two, for when one fails while running RAID.
As for myself, I like dependable equipment, but that seems hard to find these days.
I feel that if you depend on a server to be running, you should try to have an exact spare backup machine.
If the backup machine came fully loaded, pull its memory and hard drives out and make use of them in the identical primary machine. If the primary machine shows signs of problems, simply place the hard drives and memory in the backup machine and boot. You are now back in business.