Koozali.org: home of the SME Server

SME RESCUE - Boot question

Offline SchulzStefan

« on: November 17, 2010, 05:38:49 PM »
I'm running a 7.5.1 installation, up-to-date.

The server crashed a few times over the last few days, so I took a closer look at the RAID 1. There are two identical hard drives, which have been running without any problems for the last 2 years. Well, a

[root@sme]# cat /proc/mdstat

shows something like this:

md2 : active raid1 sda2[2] sdb2[1]
      1048704 blocks [2/1] [UU]
      [=>...................]  recovery =  6.4% (67712/1048704) finish=1.2min speed=13542K/sec
md1 : active raid1 sda1[0] sdb1[1]
      255936 blocks [2/2] [UU]
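
For anyone who wants to check this kind of output automatically, here is a minimal Python sketch that parses text in the /proc/mdstat format shown above. It is an illustration only: the function name `parse_mdstat` and the embedded sample are assumptions, and the sample is copied from the post, which describes it as only "something like" the real output. The sketch flags an array as degraded when the [total/active] counter shows fewer active members than total, and recomputes the recovery estimate from the numbers on the progress line.

```python
import re

# Sample copied from the post above (the poster called it approximate).
SAMPLE = """\
md2 : active raid1 sda2[2] sdb2[1]
      1048704 blocks [2/1] [UU]
      [=>...................]  recovery =  6.4% (67712/1048704) finish=1.2min speed=13542K/sec
md1 : active raid1 sda1[0] sdb1[1]
      255936 blocks [2/2] [UU]
"""

HEADER_RE = re.compile(r"^(md\d+)\s*:\s*(\w+)\s+(\w+)\s+(.*)$")
COUNTS_RE = re.compile(r"\[(\d+)/(\d+)\]")
PROGRESS_RE = re.compile(r"recovery\s*=\s*[\d.]+%\s*\((\d+)/(\d+)\).*speed=(\d+)K/sec")


def parse_mdstat(text):
    """Return {array_name: info} for each md array found in mdstat-style text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            arrays[current] = {
                "state": header.group(2),
                "level": header.group(3),
                "members": header.group(4).split(),
            }
            continue
        if current is None:
            continue
        counts = COUNTS_RE.search(line)
        if counts:
            # [2/1] means 2 configured members, 1 currently active.
            total, active = int(counts.group(1)), int(counts.group(2))
            arrays[current]["degraded"] = active < total
        progress = PROGRESS_RE.search(line)
        if progress:
            done, size, speed = (int(g) for g in progress.groups())
            arrays[current]["recovery_percent"] = 100.0 * done / size
            # Remaining 1K blocks divided by K/sec gives seconds.
            arrays[current]["minutes_left"] = (size - done) / speed / 60.0
    return arrays


status = parse_mdstat(SAMPLE)
for name, info in sorted(status.items()):
    print(name, "degraded" if info.get("degraded") else "clean", info["members"])
```

With the sample above, md2 is reported as degraded and md1 as clean, and the recomputed estimate for md2 comes out at roughly 1.2 minutes, which agrees with the finish= value on the progress line.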

I did:

[root@ ~]# fdisk -l /dev/sdb; fdisk -lu /dev/sdb

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          13      104384+  fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2              13      121601   976655647   fd  Linux raid autodetect

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1      208769      104384+  fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2          208770  1953520063   976655647   fd  Linux raid autodetect
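
The two listings describe the same layout in different units. As a rough sketch, the Python below redoes the arithmetic on the sector numbers shown: it maps each start and end sector to a 1-based cylinder number and converts the sector count into the 1 KiB "Blocks" figure, where a trailing "+" marks an odd number of sectors. The names `PARTITIONS`, `cylinder_of` and `blocks_label` are made up for illustration, and this is only arithmetic on the printed numbers; it does not reproduce whatever check fdisk itself uses to print the boundary message.

```python
SECTOR_BYTES = 512
SECTORS_PER_CYLINDER = 255 * 63  # heads * sectors per track = 16065

# (start, end) sectors copied from the `fdisk -lu /dev/sdb` output above.
PARTITIONS = {
    "/dev/sdb1": (1, 208769),
    "/dev/sdb2": (208770, 1953520063),
}


def cylinder_of(sector):
    """1-based cylinder number that contains the given sector."""
    return sector // SECTORS_PER_CYLINDER + 1


def blocks_label(start, end):
    """Size in 1 KiB blocks; '+' is appended when half a block is left over."""
    sectors = end - start + 1
    label = str(sectors * SECTOR_BYTES // 1024)
    return label + "+" if sectors % 2 else label


layout = {
    name: (cylinder_of(start), cylinder_of(end), blocks_label(start, end))
    for name, (start, end) in PARTITIONS.items()
}
for name, row in layout.items():
    print(name, *row)
```

The result matches the cylinder listing: partition 1 ends inside cylinder 13 and partition 2 starts in that same cylinder, which fits the "does not end on cylinder boundary" note.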

and:

smartctl -i /dev/sda
smartctl -i /dev/sdb

and:

smartctl -t short /dev/sda
smartctl -t short /dev/sdb

both showed no errors at all.

and:

[root@ ~]# mdadm --manage /dev/md2 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md2
[root@ ~]# mdadm --manage /dev/md2 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2
[root@ ~]# mdadm --manage /dev/md1 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1
[root@ ~]# mdadm --manage /dev/md1 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1

Booting the server ends in a crash: a file system check is forced, and the machine crashes during the check. All fans are o.k., and the BIOS shows nothing unusual for any temperature.

Now I wanted to boot from the SME Server CD (it is still a 7.4 version) in rescue mode. The machine boots from the CD, but then something goes wrong. It seems to lose the CD drive. After choosing the CD-ROM as the installation medium, I have to choose a driver? I tried a few, but no luck. The same thing happens with the TRK CD: the machine boots and then loses the drive. Same with any other live CD. I checked the BIOS settings, and everything seems normal to me, meaning conservative for a server. Now I'm stuck.

Thanks in advance for any help.
stefan
« Last Edit: November 17, 2010, 07:47:50 PM by SchulzStefan »
And then one day you find ten years have got behind you.

Time, 1973
(Mason, Waters, Wright, Gilmour)

Offline SchulzStefan

Re: SME RESCUE - Boot question
« Reply #1 on: November 17, 2010, 06:35:42 PM »
Meanwhile I did some more tests with the hardware. I changed the hard disk(s), memory, LAN adapter, graphics adapter and CD-ROM drive, and now I think it's the mainboard. The machine crashes no matter what I change. Or it is the processor. But how can I rule this out? As I mentioned, the fans are o.k.

Question: if I buy a new mainboard (probably a different type), is there a chance of getting the two hard disks back on track?

Thanks in advance for any reply.
stefan

Offline idp_qbn

Re: SME RESCUE - Boot question
« Reply #2 on: November 17, 2010, 09:38:52 PM »
Stefan,
I have twice in the past removed HDDs from an older server and successfully mounted them in a newer server. Both times, the system started OK, but I needed to log on at the new server, run "console", and then choose option 2, "Configure this server". This allowed the server to make adjustments for the new N/W adapters, specifically the on-board ones. Then all was well.

I think this depended on the newer servers not being too cutting-edge with lots of new hardware to contend with.

I believe that if necessary, I could have run the install disk in Rescue mode to pick up the hardware changes - in fact this is probably the best way of handling the situation.

I take it there is user data at stake here as well as configuration settings. If you have a backup (done from the SME Server "server-manager" panel), you could do a clean installation on a new server and restore from your backup. This should pick up user data and system settings BUT NOT any contribs you have installed on the old system.

From your first post, it looked like the system was in the process of recovering from the RAID failure when you ran mdadm. That may complicate things. Perhaps you could try mounting ONE disk in a new server and see if it boots. If successful, you can add the second disk - look in the wiki for advice on how to add a second disk: you may have to zero it out first (wipe it, in other words).

Good luck
Ian
___________________
Sydney, NSW, Australia

Offline CharlieBrady

Re: SME RESCUE - Boot question
« Reply #3 on: November 17, 2010, 10:04:42 PM »
Quote
I believe that if necessary, I could have run the install disk in Rescue mode to pick up the hardware changes - in fact this is probably the best way of handling the situation.

No, please do not spread misinformation. Running the install disk in Rescue mode does not change any system configuration items. The only way to adjust to new network hardware changes is to run through the Configuration option in the console.

Offline CharlieBrady

Re: SME RESCUE - Boot question
« Reply #4 on: November 17, 2010, 10:06:41 PM »
Quote
Question: if I buy a new mainboard (probably a different type), is there a chance of getting the two hard disks back on track?

Yes, there is a chance. I would guess there is a good chance.

Offline idp_qbn

Re: SME RESCUE - Boot question
« Reply #5 on: November 17, 2010, 10:34:35 PM »
Yes, Charlie, Sorry.
I have never used the Rescue Disk mode - as I said, just transplanting the disk to a new server and running "console" to reconfigure n/w adapters worked just fine for me.

Ian

Offline janet

Re: SME RESCUE - Boot question
« Reply #6 on: November 18, 2010, 02:17:12 AM »
SchulzStefan

Quote
....I changed the hard disk(s), memory, LAN adapter, graphics adapter and CD-ROM drive, and now I think it's the mainboard. The machine crashes no matter what I change. Or it is the processor. But how can I rule this out? As I mentioned, the fans are o.k.

You did not mention the power supply. I have seen systems fail temporarily and behave quite erratically, in a seemingly unrepeatable and unidentifiable fashion, when the CD/DVD drive spins up and the power supply is faulty or weak. The extra drain on the power supply pulls the voltage rails down, and the whole system becomes erratic under the low voltage.

Your myriad of problems could all be explained by a faulty (weak output) power supply.
Please search before asking, an answer may already exist.
The Search & other links to useful information are at top of Forum.

Offline SchulzStefan

Re: SME RESCUE - Boot question
« Reply #7 on: November 19, 2010, 08:56:11 AM »
Ian, Charlie and Mary,

thank you for your replies. At this point I want to mention that the network was down for only 15 minutes. A BIG THANK YOU TO MICHAEL WEINBERGER and his AFFA.

@ ian: I bought a new board - the machine is back on track.
@ charlie: thank you for your hint. In this case it seems that I do not have to re-configure the machine.
@ mary: I checked the power supply as well, it's o.k.

BUT, there are still a few problems with the disks. Well, I'm able to boot the server, and all data seems to be o.k. But fdisk -lu /dev/sda; fdisk -lu /dev/sdb shows that BOTH disks have an incorrectly partitioned first partition: "Partition 1 does not end on cylinder boundary". I think it may have been caused by the blackouts during several resyncs. Still, it seems strange to me, since both disks were brand new (same vendor, type, size and model) when installed.

O.k., the idea is to get it fixed on one disk, and resync the other one after clearing it. Is there a way to do it? Any idea?

Thanks for any reply.
stefan

Offline SchulzStefan

Re: SME RESCUE - Boot question
« Reply #8 on: November 19, 2010, 10:21:26 AM »
I read bug 2542 - that's all right so far. I could have done a search earlier, sorry.

Anyway, smartd reports on sda "1 Offline uncorrectable sectors". Interesting, because the BIOS reports no errors at all. What counts: BIOS or smartctl? No errors on sdb, from either BIOS or smartctl. I have wiped this one in the meantime.

Does it make sense to rebuild the array under those circumstances?
stefan

Offline SchulzStefan

Re: SME RESCUE - Boot question
« Reply #9 on: November 19, 2010, 11:03:35 AM »
Stuck - syncing aborts and restarts. No luck. Does it make sense to try a backup via USB?

stefan

Offline janet

Re: SME RESCUE - Boot question
« Reply #10 on: November 19, 2010, 12:34:59 PM »
SchulzStefan

Quote
No errors on sdb, from either BIOS or smartctl. I have wiped this one in the meantime.

Why did you do that? You said your machine was booting up OK. That partition/cylinder boundary error is not a problem AFAIK.

How did you wipe the drive, i.e. what commands or method? If you did not do it correctly, that could be the reason for your sync aborting.

Quote
Does it make sense to try a backup via USB?

If you wish, it's your machine & how you resolve this issue is your choice, but jumping quickly from one "fix" to another does not allow time to diagnose. If all else has seemingly failed then a restore is your only option.

Quote
Does it make sense to rebuild the array under those circumstances?

Probably depends on answers given to above questions.

Offline SchulzStefan

Re: SME RESCUE - Boot question
« Reply #11 on: November 19, 2010, 01:45:39 PM »
@mary:
Quote
Why did you do that? You said your machine was booting up OK.
The machine is still booting o.k. on one disk. The machine did not boot on the other. Therefore I cleaned the other one.

Unfortunately, the boot disk is the one with the "1 Offline uncorrectable sectors". And trying to sync ends up in an endless loop.

Quote
That partition/cylinder boundary error is not a problem AFAIK.
Yep, I noticed that already. Nevertheless, the problem is that the disks have been out of sync. And the disk which boots has an error.

Quote
but jumping quickly from one "fix" to another does not allow time to diagnose
Sorry, you're right. I'm too quick to try things and rule them out. After all, it's not possible to do it online...

O.k., I decided to make a backup with dar. I think I'll see in the logs where it breaks. Maybe I can identify the affected files and then exclude them. I'll give this a try. Right now it's more about figuring out what one can do if there's no backup...

The goal is to install a new SME Server on the wiped disk, and then move the backup onto it. We'll see how it works.
stefan

Offline janet

Re: SME RESCUE - Boot question
« Reply #12 on: November 19, 2010, 02:00:12 PM »
SchulzStefan

Quote
The machine is still booting o.k. on one disk....
And the disk which boots has an error.

Do you mean it boots OK and does not give any errors, but when you run a smartctl test it gives the error "1 Offline uncorrectable sectors"?

You should test both disks thoroughly, i.e. a full test, with a disk manufacturer's utility. You can download the UBCD and run many manufacturers' disk diagnostic tests by booting from the CD. I'd also run the smartctl "long" test on both disks.

You did not answer how you prepared (wiped) the sdb disk before trying to resync.
You must delete the MBR.
You can use the dd command or use delpart.exe or similar downloadable partition deleting tools.
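
To make "delete the MBR" concrete: the MBR is the first 512-byte sector of the disk, holding the boot code, the four-entry partition table and the 0x55 0xAA signature in its last two bytes. The Python sketch below is an illustration only, not a command from this thread. It zeroes that sector on a scratch image file created in a temporary directory; `wipe_mbr` and `has_mbr_signature` are hypothetical helper names. Pointing something like this at a real device such as /dev/sdb would need root and destroys the partition table, so double-check the target first.

```python
import os
import tempfile

MBR_BYTES = 512          # the MBR occupies the first sector
SIGNATURE_OFFSET = 510   # last two bytes of that sector hold 0x55 0xAA


def has_mbr_signature(path):
    """True if the first sector of the file ends with the 0x55 0xAA marker."""
    with open(path, "rb") as f:
        sector = f.read(MBR_BYTES)
    return len(sector) == MBR_BYTES and sector[SIGNATURE_OFFSET:] == b"\x55\xaa"


def wipe_mbr(path):
    """Overwrite the first sector with zeros, leaving the rest untouched."""
    with open(path, "r+b") as f:  # r+b: update in place, no truncation
        f.write(b"\x00" * MBR_BYTES)
        f.flush()
        os.fsync(f.fileno())


# Build a fake 4-sector "disk": a boot sector with a valid signature,
# followed by three sectors of payload that must survive the wipe.
workdir = tempfile.mkdtemp()
image = os.path.join(workdir, "scratch.img")
with open(image, "wb") as f:
    f.write(b"\x90" * SIGNATURE_OFFSET + b"\x55\xaa")
    f.write(b"DATA" * 384)

before = has_mbr_signature(image)
wipe_mbr(image)
after = has_mbr_signature(image)
print("signature before:", before, "after:", after)
```

Only the first sector changes; everything after byte 512 is left as it was, which is also why this alone does not erase file systems or other metadata stored further into the disk.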


Quote
The goal is to install a new SME Server on the wiped disk, and then move the backup onto it.

I would consider that disk still suspect until you have done thorough testing on it and the other disk too.

Offline SchulzStefan

Re: SME RESCUE - Boot question
« Reply #13 on: November 19, 2010, 02:14:01 PM »
@mary:

snip -

Nov 19 13:15:45 real-saturn smartd[4269]: smartd version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Nov 19 13:15:45 real-saturn smartd[4269]: Home page is http://smartmontools.sourceforge.net/ 
Nov 19 13:15:45 real-saturn smartd[4269]: Opened configuration file /etc/smartd.conf
Nov 19 13:15:45 real-saturn smartd[4269]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Nov 19 13:15:45 real-saturn smartd[4269]: Problem creating device name scan list
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sda, opened
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sda, not found in smartd database.
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sda, is SMART capable. Adding to "monitor" list.
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sdb, opened
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sdb, not found in smartd database.
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sdb, is SMART capable. Adding to "monitor" list.
Nov 19 13:15:45 real-saturn smartd[4269]: Monitoring 2 ATA and 0 SCSI devices
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
Nov 19 13:15:45 real-saturn smartd[4269]: Sending warning via mail to admin ...
Nov 19 13:15:45 real-saturn smartd[4269]: Warning via mail to admin: successful
Nov 19 13:15:45 real-saturn smartd[4269]: Device: /dev/sda, 1 Offline uncorrectable sectors
Nov 19 13:15:45 real-saturn smartd[4269]: Sending warning via mail to admin ...
Nov 19 13:15:45 real-saturn smartd[4269]: Warning via mail to admin: successful

- snip
Never got any email to the admin account from smartd before.

- snip

Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sda, 1 Offline uncorrectable sectors
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 63
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 109 to 108
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61

- snip
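
A small sketch of how log lines like these could be summarised per device. This is an illustration under stated assumptions, not a smartmontools feature: the function `summarise_smartd` and the regular expressions are made up here and only cover the message shapes visible in the excerpt above.

```python
import re
from collections import defaultdict

# Lines copied from the second log excerpt above.
LOG = """\
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sda, 1 Currently unreadable (pending) sectors
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sda, 1 Offline uncorrectable sectors
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 63
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 109 to 108
Nov 19 13:45:46 real-saturn smartd[4320]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 61
"""

SECTOR_RE = re.compile(
    r"Device: (\S+), (\d+) "
    r"(Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)
ATTR_RE = re.compile(
    r"Device: (\S+), SMART (\w+) Attribute: (\d+) (\S+) changed from (\d+) to (\d+)"
)


def summarise_smartd(text):
    """Return (bad_sectors, changes), each keyed by device path."""
    bad_sectors = defaultdict(dict)
    changes = defaultdict(list)
    for line in text.splitlines():
        m = SECTOR_RE.search(line)
        if m:
            device, count, kind = m.group(1), int(m.group(2)), m.group(3)
            bad_sectors[device][kind] = count
            continue
        m = ATTR_RE.search(line)
        if m:
            device, _kind, attr_id, name, old, new = m.groups()
            changes[device].append((int(attr_id), name, int(old), int(new)))
    return dict(bad_sectors), dict(changes)


bad_sectors, changes = summarise_smartd(LOG)
for device in sorted(set(bad_sectors) | set(changes)):
    print(device, bad_sectors.get(device, {}), changes.get(device, []))
```

In this excerpt only /dev/sda reports pending or uncorrectable sectors; /dev/sdb shows only attribute value changes.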

Quote
You should test both disks thoroughly, i.e. a full test, with a disk manufacturer's utility. You can download the UBCD and run many manufacturers' disk diagnostic tests by booting from the CD. I'd also run the smartctl "long" test on both disks.

I'll do that.

Quote
You did not answer how you prepared (wiped) the sdb disk before trying to resync.
You must delete the MBR.
You can use the dd command or use delpart.exe or similar downloadable partition deleting tools.

Pardon me, I did not delete the MBR. I formatted the disk with mkfs.ext3. I didn't believe that the MBR could be damaged. I just added the clean disk via the install menu. Before I re-install, I'll test the disks and also delete the MBR.

stefan

Offline CharlieBrady

Re: SME RESCUE - Boot question
« Reply #14 on: November 19, 2010, 03:09:27 PM »
Quote
Anyway, smartd reports on sda "1 Offline uncorrectable sectors". Interesting, because the BIOS reports no errors at all. What counts: BIOS or smartctl?

I've never heard of BIOS being a reliable source of information about disk errors.