Koozali.org: home of the SME Server

I think it's dead:-|

Robert Harlow

I think it's dead:-|
« on: October 25, 2002, 04:46:28 PM »
Sirs

For the 3rd time in 2 weeks my 24/7-running SME server 5.5 seems to have *died*.

It was working a short time ago when I ran a client backup of a networked W2k box from the main W2k box (using Retrospect 6) which used the SME as the destination file.

Now the SME is dead. Can't browse to it over the network. The SME's monitor was showing a dead-looking Midnight Commander screen so I attempted to get to the prompt with the usual F2 stuff but just got a DOS screen filled with misplaced lines of hard-to-read error reports - mainly complaining about not being able to open (anything) loggable. Eventually I tried to CtrlAltDel (yes, old M$ tricks die hard I know) but all I got was the inability of the SME box to execute its sbin/restart and similar. This is its current situation while I attempt to get some help from the SME forum.

Is it possible for a failing hdd, I-bay mounted, to pull the SME over this fatally? Recently I reckon I've heard some strange [read this as ominous] noises and had some unavailable reports from a particularly old hdd - all of which turned out to be spurious, but I really do trust my ears:~/

It's dead isn't it, I'm referring to this dynamic iteration of the SME box. Once again. Looks like I have to hit the hardware reset switch once again. Just like old M$ times this is...

This (Linux) SME box has fallen over way more times than any of my W2k boxes. I put all the user files deliberately onto the SME box because of my naive perception of its inherent stability... I am reasonably expert with my W2k boxes but just starting out on Linux (SuSE8) and this SME box (Redhat 7.x).

I see the SCSI LED flashing periodically every few seconds. The SME is on the booting SCSI drive. I hear all the fans and a bunch of 8 hdd/sdd's spinning (well I assume all of them are still spinning...) but that's about as much as I can get from the SME box.

Anything I can try - without a darned prompt and this inability to execute anything from the sbin...?

yours, somewhat disappointedly, Robert

Geoff Bennion

Re: I think it's dead:-|
« Reply #1 on: October 25, 2002, 05:17:52 PM »
Robert Harlow wrote:
>
> Sirs
>
> For the 3rd time in 2 weeks my 24/7-running SME server 5.5
> seems to have *died*.
>
> It was working a short time ago when I ran a client backup of
> a networked W2k box from the main W2k box (using Retrospect
> 6) which used the SME as the destination file.
>
> Now the SME is dead. Can't browse to it over the network. The
> SME's monitor was showing a dead looking Midnight Commander
> screen so I attempted to get to the prompt with the usual F2
> stuff but just got a DOS screen filled with misplaced
> lines of hard-to-read error reports - mainly complaining
> about not being able to open (anything) loggable. Eventually I
> tried to CtrlAltDel (yes, old M$ tricks die hard I know) but
> all I got was the inability of the SME box to execute its
> sbin/restart and similar.  This is its current situation
> while I attempt to get some help from the SME forum.
>


DOS !!!! - Wow that's impressive on a SME Server....


> Is it possible for a failing hdd, I-bay mounted, to pull the
> SME over this fatally? Recently I reckon I've heard some
> strange [read this as ominous] noises and had some
> unavailable reports from a particularly old hdd - all of
> which turned out to be spurious, but I really do trust my
> ears:~/
>


More than likely the problem is with one of the disks.


> It's dead isn't it, I'm referring to this dynamic iteration
> of the SME box. Once again. Looks like I have to hit the
> hardware reset switch once again. Just like old M$ times this
> is...


Any operating system installed on faulty hardware will be unstable; this is not a problem with the software, but the hardware.
M$, on the other hand, can be unstable all by itself.


> This (Linux) SME box has fallen over way more times than any
> of my W2k boxes. I put all the user files deliberately onto
> the SME box because of my naive perception of its inherent
> stability... I am reasonably expert with my W2k boxes but
> just starting out on Linux (SuSE8) and this SME box (Redhat
> 7.x).


How about your naive perception that it is SME at fault ?
Have you even looked at the log files ?


>
> I see the SCSI LED flashing periodically every few seconds.
> The SME is on the booting SCSI drive. I hear all the fans and
> a bunch of 8 hdd/sdd's spinning (well I assume all of them
> are still spinning...) but that's about as much as I can get
> from the SME box.
>
> Anything I can try - without a darned prompt and this
> inability to execute anything from the sbin...?
>


eh ?


> yours, somewhat disappointedly, Robert


Possible problems :

Hard disk(s) are full ?
Faulty Hard Disk(s) ?
Memory Problem ?

To do :

take a look at the logfile in /var/log
post up any errors, and we will have a look.
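
A rough sketch of the "look at the logfile" step, assuming a machine with Python 3 that can read the files (for example after booting from another install and mounting the SME drive). The keyword list and the function names are illustrative assumptions, not anything SME ships; adjust them to whatever the logs actually contain.

```python
import re

# Assumed keywords: strings that often accompany disk or memory trouble.
PATTERN = re.compile(
    r"I/O error|EXT2-fs error|SCSI.*error|kernel panic|Out of memory",
    re.IGNORECASE,
)

def suspicious_lines(lines):
    """Return the lines that match any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]

def scan_log(path="/var/log/messages"):
    """Scan one log file; return an empty list if it cannot be read."""
    try:
        with open(path, errors="replace") as handle:
            return suspicious_lines(handle)
    except OSError:
        return []

if __name__ == "__main__":
    for hit in scan_log():
        print(hit)
```

Anything it prints is a candidate to post up; an empty result only means none of the assumed keywords appeared, not that the box is healthy.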

P.S. Try not to blame SME, until you know it is SME.

Robert Harlow

Re: I think it's dead:-|
« Reply #2 on: October 25, 2002, 06:50:28 PM »
Geoff

[for brevity I'm assuming the previous is still visible in the forum]

I'm -always- prepared to blame any of my hdd/sdd's.
Having quietly heard, what I had, I'm even more so inclined.

     [BTW I pressed the hardware reset some while ago]

Just finished running IBM's and Maxtor's HD utilities from boot.
NOTHING absolutely NOTHING wrong found... makes me wild:~/
Still prefer to believe my ears, they've outforecasted both utilities before.

Hardware [the kit] is emphatically stable, I'm not interested in go-faster-errors.
Kit is proven, stable, isolated, way cooled, UPS'd and physically untouched.
LOVE to look at the log files Geoff, it's just SME won't let me [yet]:-))
The kit just runs SME. It's not overly naive to allude to a SME problem:~/
I'm running it as plain ordinary file server - nothing fancy at all - no mail.
Unmanaged 100Mb network switch - hardware firewall - ISDN LAN router.

Now running MEMTEST86 from boot...
There's only 3 x 128Mb SDRAM sticks in there but it still takes a while.


...only relaying that which was running around on the monitor.
It was particularly present after I tried to CtrlAltDel SME.

Like I said, I could obtain no prompt and was presented with a load of relatively meaningless statements from SME about not being able to execute various commands from sbin. It seemed to me that SME was attempting to get me to kill it manually via the hardware reset but was so depleted that it had no resources left with which to tell me:-) With no further assistance from the forum there did not seem much option left to me and so I killed it manually; have tested the drives; now testing the memory.

Any particular SME logs that you'd recommend?

best wishes [feeling better after caffeine intake] Robert

dave

Re: I think it's dead:-|
« Reply #3 on: October 25, 2002, 08:35:54 PM »
I'm certainly no Linux expert but I am familiar with SME and certain failures that can happen to a system.  

Some things to check, like Geoff mentioned, are disk configuration and capacity.  A volume that's been filled with log files (or a large backup file) can create lots of problems.  I think you'll have to boot the system in single user mode, which should allow you access to the prompt to do some maintenance, including viewing/deleting logs.
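
To illustrate the capacity check, here is a minimal sketch assuming Python 3 on whichever machine has the volumes mounted; from a single user prompt on the SME box itself, `df` gives the same information. The function names and the 90% threshold are assumptions chosen for the example.

```python
import shutil

def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0

def nearly_full(paths, threshold=90.0):
    """Return (path, percent) for every readable path at or above threshold."""
    report = []
    for path in paths:
        try:
            pct = percent_used(path)
        except OSError:
            continue  # not mounted or not readable: skip it
        if pct >= threshold:
            report.append((path, round(pct, 1)))
    return report

if __name__ == "__main__":
    # Mount points listed here are only examples.
    for path, pct in nearly_full(["/", "/var", "/home"]):
        print(f"{path}: {pct}% used")
```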

Something else that may assist in a diagnosis is letting us know what 'the kit' consists of.  What type of computer if it's a brand name, some specifics if it's not.  What type of processor(s), hard drive configurations and capacities, whether or not the drive(s) were set up by SME during install or if there's been some custom configurations performed afterwards.

BTW: I've used the Seagate and Maxtor disk utilities to check failed SCSI (and IDE) drives.  The non-destructive test only reads data on the drive and does some basic mechanical tests - this type of utility is available from most drive manufacturers.  It can't tell you if your file allocation table is corrupt, only that it can read data. I'm not even sure if it can tell if a partition table is corrupt.

Robert Harlow

Re: I think it's dead:-|
« Reply #4 on: October 25, 2002, 09:09:49 PM »
Dave

Sorry... have been chatting to Geoff via email:~/
To recap [more publicly] and quoting only own text...

-------quote on----->
I'd imagine that SME *will* boot in due course. It did so the
other times. Albeit with some automatic testing during the boot.
Last time it displayed an inode error and empty dtime something.
However I DID have to force the hardware reset with at least 7
hard drives still mounted, so I'd imagine there be an issue or two.

The memory test routine takes ages [several hours]. For SME it's
now been 3 strikes and now it's out... I have to resolve this or
else I have to plan something else more resilient/reliable.

All the hdd/sdd's test out OK but I've never fully trusted those
manufacturer utilities:-| All they really do [for me as an
experienced user] is confirm to the idiots at the other end
of the RMA line that you [me] know what I'm talking about
when I tell them a drive or whatever is toast.

Memory is still being tested I do not intend to abort the test,
despite not being able to get on with any *work*. All my working
user data files are resident on that darned SME box:-| I have
backups etc etc but...

SME server v5.5 (not yet updated)
No RAID hard or soft
File Server mode only (no mail)

SCSI Adaptec 2940u2w host adapter
SCSI Adaptec 2904 (CD-type card)
Asus P3B-F and latest BIOS
384Mb SDRAM PC133
Pentium III 600 slot 1 (never overclocked)

Boots SCSI and SME is wholly on a SCSI drive
Two 9Gb LVD 10k SCSI.
One 4.5 LVD 7k2 SCSI
Four IDE's - 2x80Gb - 1x20Gb (old) - 1x120Gb (new)
Drives mostly, if not all, single partition.
Variety of EXT2 and VFAT.
All [now] apparently tested OK by manufacturer's utilities.

No [additional] PCI controller.
I build for reliability/stability...
eg Matrox Millennium video card.
There's an unused Magneto Optical drive in there that has yet
to get installed/working etc.
 
Nothing else I can think of... kit's been untouched for
several months. Occasional browse over the network.
Physically untouched. I planned for stability...
<-------quote off-----

And shortly afterwards...

-------quote on----->
It's only crashed as I've indicated. Otherwise been 100%. I found out
the hard way that EXT2 is worse even than DOS for max file size when
I tried to dump 12Gb MPEG files and make 30Gb backup files on my
otherwise empty 80Gb and 120Gb drives. Boy did I curse Mitel:~/
I thought I'd ditched small file sizes back in 2000 with W2kPro!

BTW I accidentally aborted the memory test routine by *looking*.
Will continue that overnight. Still rather think I have some sort of
SME vulnerability invoked by, perhaps, the backup package
(Retrospect v6 working from W2kPro-SP2 via a network client)

AFAIK everything back up running sweetly. Makes me wild, I KNOW
this will re-occur sometime [soon].

During the initial boot I got the inode error/empty dtime FIXED report,
on the booting SCSI drive (this is where SME is located), but you'd
expect that sort of issue (?) what with a manual hardware reset(?).

No lines with *promisc* found in /var/log/messages...

All my *data* sits on distributed non-SME physical drives
ie on separately mounted I-bays

VFAT is only used on a mounted I-bay drive in historic use for W98
and other various tweaks (so I can learn SuSE8 and be able to see
the files it produces when I boot up later in W2kPro-SP2 with NTFS...)

I keep an eye on the PSU with Asus Probe utility. The drives are
fairly pedestrian. Many many fans. SCSI CDRW and MO drives too
but have never had a problem. This piece of kit is the lightest load
of all the PC's I use here - according to the displays of the UPS,
which itself is only rated at 1500VA ie 1kw... so no lights dimming:-)
<-------quote off-----


I think you may be on to something Dave.
These catastrophic failures of SME *started* around about the same time as I purchased the updater to v6 of Retrospect in order to use its new features to fix older issues. One of these was the use of network clients and the other the ability to break the sheer size of the backup file down to sizes more appropriate to the utterly pathetic neolithic antique sizes that SME's use of EXT2 had inflicted on me.

Retrospect will now both break its files down to 600Mb chunks AND an overall disc capacity ceiling can be specified. Earlier I specified 100% and I think that was unwise, particularly on the remainder of the SCSI drive on which SME was operating [ahem]. My excuse was that I believed that the ceiling figure was a projection or expectation for use. It occurred to me [later] that it *may* pre-book or format that space or otherwise put the mockers on anything attempting to either use that space or perhaps fragment it. The drive still reports space available etc etc and so that was why I believed that its use was dynamic rather than being pre-booked. All of this is just speculation. I think it needs verification. Dantz (Europe) support are now closed.
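
The chunking arithmetic can be sketched as follows. This is only an illustration of splitting one large backup into fixed-size pieces, not a description of how Retrospect does it internally; the decimal units and the function name are assumptions.

```python
def chunk_ranges(total_bytes, chunk_bytes):
    """Yield (offset, length) pairs that cover total_bytes in order."""
    offset = 0
    while offset < total_bytes:
        length = min(chunk_bytes, total_bytes - offset)
        yield offset, length
        offset += length

MB = 1000 * 1000   # assumption: decimal megabytes
GB = 1000 * MB

# A 30Gb backup cut into 600Mb pieces.
ranges = list(chunk_ranges(30 * GB, 600 * MB))
print(len(ranges))  # prints 50
```

Each piece stays well below whatever per-file ceiling the target filesystem imposes, at the cost of 50 files to keep track of instead of one.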

Nothing has been done to the kit after SME was actually installed.

For the booting SCSI drive I opted for the Advanced test not just the Basic. Deemed OK. It's the most intensive before it gets intrusive and overwrites/reformats.

One of the logs recommends some preventative e2fscking or something... so, must get the man pages running now:-)

best wishes, Robert

Robert Harlow

Re: I think it's dead:-|
« Reply #5 on: October 25, 2002, 09:53:36 PM »
Geoff if you're reading this later - I didn't have time to put that top [monitoring] command in... the darned thing has died once again, shortly after I pulled those log files off for you to look at.  Looks like yet another hardware reset with 7 drives still mounted:-|  Super stable huh, guess it just needs me to get involved to pull everything down to some nice unadulterated cold entropy:-(

best wishes, Robert [reaching for the caffeine once again]

Robert Harlow

Re: I think it's dead:-|
« Reply #6 on: October 25, 2002, 10:06:56 PM »
It certainly seems pretty dormant but not quite dead [yet].

Keyboard's CAPS and NUM lights are under control.
Prompt exists.
Bash displays the hdparm help with hdparm being typed in.
But typing in top gives an input/output error.
Typing in mc [Midnight Commander] gives an input/output error but it has a go at deleting something and fails.
Browsing across my network to server-manager fails.
CtrlAltDel does not work at the Bash prompt, it displays
INIT cannot execute /sbin/shutdown

So, this failure time I can still get a prompt displayed...
is there anything useful I can do to this thing before I hit the hardware reset again?

best wishes, Robert

Robert Harlow

Re: I think it's dead:-|
« Reply #7 on: October 26, 2002, 01:17:00 AM »
The foolishness continued... well, by that I mean it got worse.

It got madder still... I could see all the mounted data drives from across the network but I couldn't properly control the SME terminal locally or browse to the server-manager, nor could I reboot/restart locally by command or otherwise softly. Several enforced hard resets later things escalated. The booting SCSI drive finally gave up the ghost and died thus taking this iteration of SME with it into the ether.

A BIOS reorganisation and a cupboard search later got my old W98 image and SuSE8 iteration up and booting from what were mounted drives on the SME. So no power or memory issues I guess.

The IBM test diagnostic cannot cope with a dead IBM sitting in the booting position. How ironic. Not forgetting that this was the very utility proclaiming the dying drive to be of *factory quality*... when I was adamant that at least one of my drives in the nest was not fully OK. Still, by using the Adaptec BIOS to sideline the dead drive the IBM utility has finally acknowledged that this particular drive is, guess what, faulty. Well I never, Well Done IBM, Finally Made It huh?

Guessing here but I think that Retrospect's large files must have precipitated the growing avalanche of damage to the disc by its earlier use of what *had* been unused but dicey areas. Best I can come up with until I can obtain a spare drive or reorganise existing capacity and get a re-installation completed.

It's been a long day. I need to review my options. Thank you for your support through the day.

best wishes, Robert

LucL

Re: I think it's dead:-|
« Reply #8 on: October 26, 2002, 01:15:49 PM »
I've been getting similar issues with my server.

Sudden kernel errors forcing hard boots.
Sometimes after or during scheduled tape backups.
Sometimes during cron jobs.

At boot: inode and non-contiguous file errors.

The hardware is new.

I've replaced every piece of hardware so far except for the hard drive.

If I run fsck it fixes several errors.

My AWSTATS no longer works but I think it's a result of blocks being moved and the configuration of AWSTATS.

Curious as to updates on your situation.
Was the hard drive causing errors?

Robert Harlow

Re: I think it's dead:-|
« Reply #9 on: October 27, 2002, 01:54:40 AM »
LucL

Booting SCSI drive finally expired itself.
The IBM utility had the grace to proclaim it being *faulty*:~/

The myriad of other mounted drives contain my old W98 image
and a separate SuSE8 iteration. Both of these are now working
in the same kit - minus its previously booting SCSI drive of course.

Looks like I have to get my hands dirty...

[update]
Day somewhat occupied with an aunt's 80th birthday celebration.
Tonight and tomorrow we have been told to expect a nasty storm.
Now battening down the hatches here:-)
Dead drive is not really dead but [certainly] has difficulty booting.
Using the above SuSE8 iteration I was able to read various bits of the *dead* drive but was not entirely successful in the attempt to copy/paste its contents to another drive in the nest. However I did drag off some otherwise unbacked up data files which had been orphaned before the appropriate automatic backup had [successfully] taken place:-)
Dug out a venerably old, trusty, noisy, hot-running, Quantum Viking II 7k2 9Gb brick from my goodie box and shoe-horned it into the hole that the absent LVD 10k 9Gb IBM had left. Geoff's earlier concern about my lights dimming when I turned this kit on perhaps needs checking out now with this brick running too - four sdds and four hdds, a SCSI CDRW and a SCSI MO - all on a generic 300W PSU:~/
Kit now returned to full complement of 8 hard drives.
Unhooked the power harnesses from 6 drives, leaving just two SCSI drives for the somewhat scattergun installation routine of SME server v5.5 to install itself blindly. Afterwards I normalised the power harnesses.

SME is now up in its most basic virgin state. I need to redo all the fstab mounting commands, I-Bays, hdparms and other tweaks. That'll be for tomorrow, while the elements do their level best to try and blast us off the Cornish landscape:~/ No doubt the UPS will get a lot of hits presently...

I very much fear that you, too, are facing an imminent hdd/sdd episode. As I have indicated earlier, even the manufacturers' very own utilities on their own brand of drives, with SMART enabled where appropriate, are still virtually useless... IMHO. Trust your ears!!! And your gut feelings too. New drives are JUST as likely to fail as old ones, believe me. There's no such thing as a safe drive:-(

I cannot vouch for RAID, of any flavour, as I have never been tempted to inflict such added complexity on my sanity or to my site. If I could afford a bunch of RAID-able drives I'd much rather use each separately, with diligent backups, and trot out the remainder as and when fatalities might occur! Keeping it as simple as possible is my preferred route. And costs figure rather highly - unfortunately - hence my re-use of a trusty but old Viking while I look around for a present day [equally reliable] replacement.

The intermittent problems are the worst:-| It'd be nice if this sort of thing just happened in a flash, so to speak, the drive fried in a second or two and my ever-faithful nose [literally] could accurately diagnose the source of the problem within minutes:-) Not like I've had for the last few weeks nor, I very much suspect, like you are experiencing. There is another way but it requires a strong nerve. Stress-tests. Find something that invokes the problem reliably [?] and then multiple up the stress until something breaks cover metaphorically and makes a run for it:-) Be ready in waiting with cheque book close by:~/
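
One way to picture such a stress test is a write-then-verify loop. The sketch below assumes Python 3 and a scratch directory on the suspect drive; the function name, block count and block size are all assumptions to be scaled up until something gives.

```python
import hashlib
import os
import tempfile

def write_and_verify(directory, blocks=16, block_size=1024 * 1024):
    """Write random blocks to a scratch file, read them back, compare digests."""
    written = hashlib.sha256()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            for _ in range(blocks):
                block = os.urandom(block_size)
                written.update(block)
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # push the data towards the disk
        read_back = hashlib.sha256()
        with open(path, "rb") as handle:
            for block in iter(lambda: handle.read(block_size), b""):
                read_back.update(block)
        return written.digest() == read_back.digest()
    finally:
        os.remove(path)  # always clean up the scratch file

if __name__ == "__main__":
    print(write_and_verify(tempfile.gettempdir(), blocks=4))
```

A caveat on the design: the read-back may be served from the operating system's cache rather than the platters, so a file comfortably larger than RAM is needed before a pass means much. An I/O error raised part-way through is itself the useful result.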

Night.

best wishes, Robert

Robert Harlow

Re: I think it's dead:-|
« Reply #10 on: October 29, 2002, 09:44:39 PM »
[update]

Storm came and went... it wasn't much.
Now replaced failed SCSI boot drive with trusty (?) but somewhat old spare.
Reloaded SME from new... weird issues have dogged me for two days:-|

The install goes smoothly, so does the reconfiguration & ibay stuff.
Soon as I start to work with live data it all goes pear-shaped:-(
Have burnt and reloaded SME several times now, am getting REALLY good at it but I'd MUCH rather just do it the once and then get on with something else...

Somewhat difficult to explain, but the SME box just doesn't want to get written to. It stops and starts. The network switch lights flash at a stop-start pace and not just winking very quickly like normal. Trying to copy to the SME ends up failing, with the driving end erroring out with a network problem (apparently it's only the SME stuff that's a problem; the other boxes liaise with each other over the network perfectly well).

Maybe someone could help put me out of my extended misery and explain this jewel of a display line...
INIT: Id "nd" respawning too fast: disabled for 5 minutes
...not sure if it's symptomatic of anything but I'm willing to give whatever it means a go if it cures SME of its recent troubles:-|

best wishes, Robert

Robert Harlow

Re: I think it's dead:-|
« Reply #11 on: October 29, 2002, 10:40:33 PM »
Search shows other people reporting this.
Some responses but the most intensive is beyond my ken:~/

FWIW my site is v5.5 from the CD (no update) - it's server only - DHCP is supplied by my hardware firewall - so none of that DNS and mail stuff seemed to apply to me.

Have removed the stray . (period) in front of the domain entry in the setup. It always views as wwwmysite.xxx so I put a period in front of the definition so it viewed as www.mysite.xxx. Looks like that might (?) have rippled down to cause this Id "nd" respawning too fast issue. Early yet.

Still got my stop start b*****r all use to anybody behaving SME server though... I'm beginning to earnestly detest the thing. Shame really, it performed admirably for several months without any issues or trouble at all. Until the SCSI booting hard drive went intermittent on me for three weeks, dying several days ago and then all this started...

best wishes, Robert

Robert Harlow

Re: I think it's dead:-|
« Reply #12 on: October 30, 2002, 12:44:36 AM »
[update]

The latest helpful message to appear on the SME's display...

Aiee,  killing interrupt handler
Kernel panic; Attempted to kill the idle task!
In swapper task not syncing

At which point I find that the keyboard LED is no longer responding nor is anything else... so, guess what, it's the good old hardware reset button once again despite the SCSI light being fully ON:-(((

Nice set of boot up accusations and forced checks, resulting in the maintenance halt (Ctrl-D) and the manual fsck does absolutely no good whatsoever. Can't boot up. Can't repair. Time to pull out the computer once again, take off all the data and power cables to isolate the single drive in the nest on which to allow SME to do its scattergun install. Let's go around once again... again... again....   [/stuck record]

best wishes, Robert

Dave

Re: I think it's dead:-|
« Reply #13 on: October 30, 2002, 01:55:58 AM »
> take off all the data and power cables to isolate the single drive in
> the nest on which to allow SME to do its scattergun install.

Yes, the M$ arrogance of the install is a bit of a PITA but it is tweakable to a limited extent. Have a search through the forums for hints; you do need to be able to burn your own CDs though.

Robert Harlow

Re: I think it's dead:-|
« Reply #14 on: October 30, 2002, 01:34:54 PM »
Thank you Dave; when all of this has blown over I will investigate that, and yes I can burn my own CDs, I can burn my own DVDs too, but all that sort of stuff is only familiar to me on my other boxes (W2kPro-SP2)...

Somehow I've allowed the SME installer to wreak some awful revenge on all four IDE [data] drives which were *completely switched off* in the BIOS, which also had its Boot-using-IDE setting disabled (how/why does/can SME do that???). For good [:-|] measure the SME installer nuked all four relatively empty [programme] SCSI drives and then totally ignored their existence in its subsequent install which took 2hr+ to complete. Before now, when I manually tweaked the SME install process to one or two SCSI drives, my own installs [yes, sadly, I'm almost an expert at running through a SME install] took between 12 and 15 minutes [max].

So now SME is really happy(!?). It put itself onto the biggest and slowest IDE drive which was logically turned off in the BIOS and with the BIOS boot set to disable all IDE booting... So, no surprise that after some 2hrs of happy IDE data destruction (20Gb, 80Gb, 80Gb & 120Gb), the first boot failed citing the reason as being that it couldn't find the OS... The SME installer also put its own swap area on the root partition - ignoring the sublime existence of seven, now supremely empty, other hard drives. And the IDE hdparm settings were a slow motion joke. OK, I accept it does that for safety but the rest of the SME installer idiocy stands:~/

Taking a guess at the --real-- booting requirements I manually got this latest iteration of SME up and running. A user and the workgroup is now set on SME. Access to the network and the other W2kPro-SP2 boxes is now on. All SME logs show no errors (except the usual complaints about the SCSI MO drive). is running on the SME's display, with the parameter and with CPU sorting. shows everything looking really lean.

From either W2kPro-SP2 boxes [there are two] I can start copying a 3minute 200Mb video mpeg (something recorded off Top Of The Pops) onto the SME box into the user area... but it never completes...
* Network lights stop flickering. drops down to tickover activity.
* W2kPro-SP2 box eventually times out with a error.
* This is on a virginally installed SME box .
* The is browsable throughout and afterwards.
* Nothing is reported in the SME's logs as viewed through .
* No other network activity or work [...work? what's that? my entire time exists merely to pander to SME's needs].
* The half-completed mpeg file is actually playable from the SME box until the end of the partial copy.
* The mpeg, such that was copied across, can be deleted from the SME box.
...basically I can't complete the copying of anything bigger than a few megabytes across to the SME without this happening but above scenario can at least be described easily:~/
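
To pin down where such a copy stalls, a chunk-by-chunk copy that records timings may help. This is a sketch under assumptions: Python 3 on the client, the SME share reachable as an ordinary path, and a hypothetical function name; it is not how Windows or Samba copy files internally.

```python
import time

def timed_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks; return [(bytes_copied_so_far, seconds_for_chunk)]."""
    timings = []
    copied = 0
    with open(src, "rb") as reader, open(dst, "wb") as writer:
        while True:
            start = time.monotonic()
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            writer.write(chunk)
            writer.flush()
            copied += len(chunk)
            timings.append((copied, time.monotonic() - start))
    return timings
```

If the last entry before an error always lands at roughly the same byte count, that points at something size-related; if the per-chunk time climbs steadily before the failure, the destination drive or the network path is the more likely suspect.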

Could anyone enlighten me as to what the heck is going on with the SME?

For months SME *used* to run unattended and relatively unloved on a 24/7 PC sitting in an enclosure and connected to a 100Mbit network. It did its stuff and for that I was thankful. After the booting hard drive's decline into failure over a three week interval and the subsequent substitution I have just gone backwards and, now, the data repository that SME was supposed to contain, protect and serve out has been wiped by the SME scattergun installer AND THE DARNED THING STILL HASN'T INSTALLED ITSELF PROPERLY in order for it to do what it's supposed to do:-|

best wishes, Robert