Mary, Thank you for your suggestion.
Hardware: Newly acquired Dell PowerEdge 650 rack mount server.
Software: SME 7.5.1, patches up to date with the exception of 3 received this week.
HDDs: 2 x Seagate Barracuda 7200.8 300GB IDE/PATA, running RAID 1, separate IDE channels.
The HDDs were previously running in a generic tower that began shutting down randomly every 1-3 days. I replaced the power supply, which did not fix the problem. I then acquired the new hardware, installed the original Seagate 300GB HDDs, reconfigured the server, and everything ran fine for over a week. Then the UPS failed. I replaced the UPS with a new one; the server ran for 2 days and then hung.
That's when I started this thread.
I was only able to get the server to boot by removing the 2nd HDD and running with a single HDD. The server is running that way now and has been up for 2 days.
Acting on your suggestion, I enabled SMART (it had not previously been enabled). I have run both short and long tests on the HDD and everything indicates PASSED or Completed without error.
I have taken the 2nd drive and mounted it in a case as an external USB drive. I have not been able to enable SMART on the 2nd drive. It may have something to do with the ATA drive presenting itself as SCSI in the USB external case. The smartctl command complains about a bad response to the IEC mode page. I've reread the man pages and tried other options without success.
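One avenue I still intend to try: newer smartmontools versions can pass ATA commands through a USB bridge if you spell out the passthrough type with -d. Which type works depends on the bridge chip in the enclosure, so the following is trial and error on my part (the device name /dev/sdb is what my enclosure shows up as; I don't yet know which, if any, of these types my bridge supports):

```shell
# The USB enclosure presents the ATA disk as SCSI, so tell smartctl
# which ATA passthrough to use. Try each until one stops complaining:
smartctl -s on -d sat /dev/sdb         # SAT (SCSI-to-ATA Translation) bridges
smartctl -a -d sat /dev/sdb            # full SMART dump via SAT
smartctl -a -d usbcypress /dev/sdb     # Cypress-based bridges
smartctl -a -d usbjmicron /dev/sdb     # JMicron-based bridges
```

If none of these work, the enclosure's bridge may simply not forward ATA commands, in which case SMART will have to wait until the drive is back on a real IDE channel.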
Taking a close look at the 2nd drive (the one that was supposed to be a mirror of the 1st), I see that the most recent file on that disk is dated 2009-05-16 03:00:49. That's about 1.5 years ago. As I dig through /var/log/messages, I see that as far back as April 24, 2009, drive 1 was being recognized as 300GB while drive 2 was being recognized as 33.8GB. I feel really stupid now. I should have seen this when I first installed these 2 drives back in 2007. I remember having problems getting both drives initially partitioned and formatted as 300GB drives, and I sincerely thought I had solved the problem. Obviously, I did not. I'm now wondering how the system was able to run for so long.
Output from several commands follows:
[root@s01 log]# df -k
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/main-root
286219752 67416204 204264388 25% /
/dev/md1 101018 25193 70609 27% /boot
none 1167960 0 1167960 0% /dev/shm
/dev/sda1 480688980 106564 456164824 1% /media/usbdisk
s02:/home/usr2/BUILD/Math_Work/QA_Review
2482560 2232368 124088 95% /mnt/s02/QA_Review
/dev/sdb2 32138888 24463608 6042680 81% /mnt/usbdisk2
[root@s01 log]# fdisk /dev/hda (drive1, boot drive in server case)
Command (m for help): p
Disk /dev/hda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 13 104391 fd Linux raid autodetect
/dev/hda2 14 36481 292929210 fd Linux raid autodetect
[root@s01 log]# fdisk /dev/sdb (drive2 in the external usb case)
Command (m for help): p
Disk /dev/sdb: 33.8 GB, 33820284928 bytes
255 heads, 63 sectors/track, 4111 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 13 104391 fd Linux raid autodetect
/dev/sdb2 14 4078 32652112+ fd Linux raid autodetect
/dev/sdb3 4079 4111 265072+ fd Linux raid autodetect
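Just to convince myself the 33.8 GB figure is what the kernel actually negotiated and not some fdisk display quirk, the cylinder geometry from the two listings above multiplies out to the same numbers:

```shell
# fdisk reports cylinders of 16065 * 512 = 8225280 bytes.
# Drive 2 (sdb): 4111 cylinders; drive 1 (hda): 36481 cylinders.
sdb_bytes=$((4111 * 16065 * 512))
hda_bytes=$((36481 * 16065 * 512))
echo "sdb: $sdb_bytes bytes"
echo "hda: $hda_bytes bytes"
```

That comes to roughly 33.8 GB for sdb and 300 GB for hda, matching the byte counts fdisk prints, so the small size is real as far as the kernel is concerned.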
After rereading the RAID documentation, I see I have probably upset that process by just physically removing the 2nd HDD. By that, I mean that cat /proc/mdstat yields:
[root@s01 log]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hda2[0]
292929088 blocks [2/1] [U_]
md1 : active raid1 hda1[0]
104320 blocks [2/1] [U_]
unused devices: <none>
I'm thinking that the "_" implies the server may still be looking for the mirror.
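That reading matches the md documentation as I understand it: in the bracketed status, U is an in-sync member and _ is a missing or failed one, and [2/1] means 2 slots with only 1 active. A quick way to pick out degraded arrays is to grep for an underscore in that status field. Since the real /proc/mdstat only exists on the server, I'm demonstrating against the text pasted above:

```shell
# Write the mdstat output from above to a sample file; on the server
# you would grep /proc/mdstat directly instead.
cat > mdstat.sample <<'EOF'
Personalities : [raid1]
md2 : active raid1 hda2[0]
      292929088 blocks [2/1] [U_]
md1 : active raid1 hda1[0]
      104320 blocks [2/1] [U_]
unused devices: <none>
EOF
# Any status like [U_] or [_U] has an underscore where an in-sync
# member should be; -B1 pulls in the mdX line just before it.
degraded=$(grep -B1 '\[.*_.*\]' mdstat.sample | grep -o '^md[0-9]*')
echo "$degraded"
```

Both md1 and md2 show up as degraded, which is consistent with the server still expecting a second member for each array.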
[root@s01 log]# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
DEVICE partitions
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=21e38650:58d0a772:9bcc2561:b0f7787b
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=582bf6d6:72ca7e22:269b5ec9:11ab0382
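Since the 2nd drive is currently visible as /dev/sdb in the USB case, I should be able to confirm whether its partitions really carry those array UUIDs. I haven't run this yet; unlike smartctl, mdadm just reads the superblock off the block device, so the USB bridge shouldn't get in the way:

```shell
# Dump the md superblocks from the USB-cased drive's partitions and
# compare the UUID lines against /etc/mdadm.conf above.
mdadm --examine /dev/sdb1
mdadm --examine /dev/sdb2
```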
[root@s01 log]# cat /proc/partitions
major minor #blocks name
3 0 293036184 hda
3 1 104391 hda1
3 2 292929210 hda2
9 1 104320 md1
9 2 292929088 md2
253 0 290783232 dm-0
253 1 2031616 dm-1
8 0 488358912 sda
8 1 488351871 sda1
8 16 33027622 sdb
8 17 104391 sdb1
8 18 32652112 sdb2
8 19 265072 sdb3
So, it looks like I may have some server cleanup to do. I will also begin working on trying to repartition/reformat drive 2 so that it recognizes the full 300GB. I guess this could be a bad drive that stands no hope of reaching its stated potential.
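My rough plan, assuming the drive goes back in the server on the second IDE channel as /dev/hdb (adjust device names as needed; none of this is tested yet, and step 1 depends on the installed hdparm supporting -N):

```shell
# 1. Check whether the 33.8GB is a capacity clip / Host Protected Area
#    rather than a dead drive; -N prints visible vs. native max sectors.
hdparm -N /dev/hdb
# 2. If the drive does report its full 300GB, wipe the stale md
#    superblocks and clone the partition layout from the good drive.
mdadm --zero-superblock /dev/hdb1
mdadm --zero-superblock /dev/hdb2
sfdisk -d /dev/hda | sfdisk /dev/hdb
# 3. Re-add both halves and let md resync.
mdadm /dev/md1 --add /dev/hdb1
mdadm /dev/md2 --add /dev/hdb2
# 4. Watch the rebuild.
cat /proc/mdstat
```

If step 1 shows the drive itself only reporting 33.8GB, then it's a drive (or jumper/firmware) problem rather than a partitioning one, and no amount of repartitioning will help.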
Thanks, again, for your advice.
Jim