Koozali.org: home of the SME Server

System boot locks up on "Starting network:"

Jason Judge

System boot locks up on "Starting network:"
« on: June 08, 2002, 05:37:39 AM »
I've got a bit of a problem with my SME Server 5.1.2. A few days ago the motherboard network gave up the ghost. It's plugged into a cable modem and this problem occurred a few days after quite a large thunderstorm (by UK standards) - there may be a connection, there may not - but I digress...

The end result is that now the server locks up during the boot sequence at the 'Starting network:' stage, and I have not got a clue how to get around this. It makes no difference whether I disable the network card, install new cards, plug the network cable in or not - it just locks up.

Each time I have to power down to get out of the lockup, the next boot requires a manual fsck, which worries me every time I need to do it.

I suppose what I need to do is boot without the network, run the console to select the new network cards then all should be fine. But how do I bypass the network stage of the boot sequence?

This is getting desperate as I have been without a mail and web server now for a few days - my clients will begin to think I've disappeared!

-- Jason

Jason Judge

Re: System boot locks up on "Starting network:"
« Reply #1 on: June 08, 2002, 02:20:39 PM »
As a more general question, when hardware on your SME Server breaks, how do you upgrade the server if you cannot replace that hardware with hardware of exactly the same type? Is a full installation followed by a tape restore the only option?

-- Jason

robert

Re: System boot locks up on "Starting network:"
« Reply #2 on: June 08, 2002, 03:00:49 PM »
Jason,

Your suggested solution sounds correct to me: bypass the network startup. Here's how:
Like RedHat, SME supports an 'interactive startup', but, unlike RedHat, SME doesn't tell you about this. What you need to do is press 'i' (for interactive) at the beginning of the startup sequence, after the words 'Mitel Networks Server' (or something like that) appear (in red letters), but before it goes to the SME runlevel, which is after mounting the swap partitions. Now, in the interactive startup, don't disable too many services, or you won't be able to reconfigure for the new NIC.
BTW: have you jumpered the mobo to disable the on-board NIC?

Hope this helps,
Robert

Jason Judge

Re: System boot locks up on "Starting network:"
« Reply #3 on: June 08, 2002, 04:03:30 PM »
Robert,

Thanks - you cannot believe how difficult it is to find out which key to press to enter interactive mode! I've searched through four Linux books, the whole of the Redhat site and everywhere else I can think of. Every source mentions how to enable interactive mode 'hotkeys' through the cfg files - but not one actually spells out, to simple people like me, that the hotkey is 'i'. I guess you actually need to see it running.

Anyway, I've tried disabling the network card via the BIOS (it's a jumperless Compaq mobo) but I get the same hang. I think the auto-hardware detect in SME/E-smith must be disabled or something as it still tries to start the motherboard ethernet device (perhaps this is something the main configuration menu does - I hope that once I can get into this, I can enable the relevant network cards?).

I have also tried starting in single-user mode, but this also seems to be disabled. I enter 'emsith single' at the LILO prompt but it still jumps straight into runstate 7 with ALL services being started, so there does not seem to be a way I can mount a writable filesystem so I can change these config files. Perhaps if I could locate my emergency boot disk, that would do it for me (from now on I will stick the boot disk to the side of the server).

-- Jason

Jason Judge

Re: System boot locks up on "Starting network:"
« Reply #4 on: June 08, 2002, 04:12:20 PM »
Hmm. I've tried running the setup program (from the console) again and it still says there is one on-board Compaq network device and one PCI network device in slot 1.

In fact the mobo Compaq network device is disabled and there are now TWO network cards plugged into PCI slots 1 and 2.

So - do I need to run some kind of hardware detect before the new/removed cards can be recognised? Or is this run automatically anyway, but is failing for some reason on my server?

I had a similar problem when I added a second processor and installed from scratch - the second processor just wasn't recognised by SME so I never got the dual-processor kernel. In this case it is not recognising the fact that I have removed (disabled) one network card and added another.

The Compaq diagnostics confirm everything is working fine and IRQs appear to be allocated correctly.

-- Jason

robert

Re: System boot locks up on "Starting network:"
« Reply #5 on: June 08, 2002, 04:27:54 PM »
Which services did you choose not to start in interactive startup? And where does it hang? Try starting just the loggers, keytable, bootstrap-console and local (no lo, eth, ppp, httpd, sshd, samba, atalk, lpd, squid, mysql, dhcpd, ldap, etc.) if it lets you. Of course it could be that your mobo is more seriously damaged than you now know. Let's hope not.

Robert

robert

Re: System boot locks up on "Starting network:"
« Reply #6 on: June 08, 2002, 04:43:45 PM »
Jason Judge wrote:
>
> Hmm. I've tried running the setup program (from the console)
> again and it still says there is one on-board Compaq network
> device and one PCI network device in slot 1.
>
> In fact the mobo Compaq network device is disabled and there
> are now TWO network cards plugged into PCI slots 1 and 2.
>
Are you referring to BIOS setup or were you able to start up SME now?
If it's the BIOS reporting this, it's a BIOS problem and SME has nothing to do with it.
If it's SME, the problem could be that one of the PCI NICs has the same chipset as the on-board NIC and therefore uses the same driver, in which case you'll have to specify for which IRQ to load this driver.
 
> So - do I need to run some kind of hardware detect before the
> new/removed cards can be recognised? Or is this run
> automatically anyway, but is failing for some reason on my
> server?

You shouldn't need to run hardware detection.

>
> I had a similar problem when I added a second processor and
> installed from scratch - the second processor just wasn't
> recognised by SME so I never got the dual-processor kernel.
> In this case it is not recognising the fact that I have
> removed (disabled) one network card and added another.
>
> The Compaq diagnostics confirm everything is working fine and
> IRQs appear to be allocated correctly.
>
> -- Jason

Jason Judge

Re: System boot locks up on "Starting network:"
« Reply #7 on: June 08, 2002, 05:46:58 PM »
I can start up SME now in interactive mode, just skipping the network service. This gets me a console, access to the Server Manager and a shell prompt.

Previous
======
My setup was (and this did work):

* eth0 - PCI slot Realtek RTL8139 (my internal network)
* eth1 - mobo ncr (my external network)

When the machine first went wrong, it was the external network that stopped dead. When I rebooted the first few times, the server started (although it hung for a short while on the network startup) but the external network remained dead. The mobo network 'link' lamp stayed lit even with the network cable removed, so I guessed this to be a hardware fault.

Current
=====
Now - I have plugged in a second Realtek NIC (since the first one worked okay) and disabled the mobo NIC through the BIOS. The mobo NIC no longer lights the link lamp when I plug a cable in - so it is definitely not being initialised.

I've checked the dmesg.log file. The Realtek NICs just aren't there, even though they do get initialised somewhere along the line as the link lamps work as expected (i.e. they light when a cable is plugged in).

However - the "Review Server Config" is now telling me that both eth0 and eth1 are Realtek cards. It must have detected them somehow - but it is still hanging on the network startup. The /etc/sysconfig/hwconf still tells me there is one Compaq (mobo) NIC and one Realtek card - so there seems to be a bit of inconsistency here.

*** The hwconf file is dated April 19 - which is when I installed the SME server, so it is not getting updated during the boot process. Assuming I understand the purpose of this file correctly, changes to hardware are not being detected. Even if it should not be necessary, can I force this to be refreshed from the command line? ***
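
One way to see exactly what that stale file records is to pull out its network entries. This is only a sketch: it assumes hwconf uses the usual Red Hat layout of records separated by a line containing a single '-', each with 'class:' and 'driver:' fields, and 'hwconf_nics' is a made-up helper name rather than anything shipped with SME Server.

```shell
# Print the driver of every NETWORK-class record in hwconf-style input.
# hwconf_nics is a hypothetical helper; the record layout is an assumption.
hwconf_nics() {
    awk '
        $1 == "-"       { cls = ""; next }   # a lone "-" starts a new record
        $1 == "class:"  { cls = $2 }         # remember the class of this record
        $1 == "driver:" && cls == "NETWORK" { print $2 }
    '
}

# Example use on a live system:
#   hwconf_nics < /etc/sysconfig/hwconf
```

Comparing that list with the cards physically installed, and the file date from 'ls -l /etc/sysconfig/hwconf', shows whether the file still describes the old hardware.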

BTW, I'm on SME 5.1.2 on a Compaq Prosignia 740.

The BIOS utility confirms the following:

* Slot1, Ethernet adapter, board ID 10ec8139, Func1, IRQ5
* Slot2, Ethernet adapter, board ID 10ec8139, Func1, IRQ2(9) - I've tried IRQ15.
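
Read as a PCI identifier, each board ID above splits into a vendor ID and a device ID: 10ec is the vendor ID registered to Realtek and 8139 is the RTL8139 device ID. A minimal sketch of that split, assuming the BIOS prints the vendor half first ('split_board_id' is a hypothetical helper, not part of SME Server):

```shell
# Split an 8-hex-digit board ID into PCI vendor and device IDs.
# Assumes the vendor ID comes first, as it appears to in the list above.
split_board_id() {
    id=$1
    vendor=$(printf '%s' "$id" | cut -c1-4)   # first four hex digits
    device=$(printf '%s' "$id" | cut -c5-8)   # last four hex digits
    printf 'vendor=%s device=%s\n' "$vendor" "$device"
}

split_board_id 10ec8139   # prints: vendor=10ec device=8139
```

The same pair is what 'lspci -n' shows in vendor:device form, so it gives a quick cross-check between the BIOS view and what Linux sees.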

There is no reason why I can't use two identical network cards for the internal and external networks, is there? I have no idea now whether the system is hanging for the same reason as it originally hung or a different reason now.

-- Jason

Jason Judge

Re: System boot locks up on "Starting network:"
« Reply #8 on: June 08, 2002, 05:50:44 PM »
> in which case you'll have to specify for which IRQ to load this driver

Any idea where I need to do this? I have two identical NICs installed, one on IRQ 5 and one on IRQ 15. Assuming I get both detected correctly at startup, do I need to edit any files to distinguish between them?

-- Jason

robert

Re: System boot locks up on "Starting network:"
« Reply #9 on: June 08, 2002, 06:54:44 PM »
Jason Judge wrote:
>
> > in which case you'll have to specify for which IRQ to load
> > this driver
>
> Any idea where I need to do this?

I don't think you need to do this now. It's something I thought you might have to do to prevent modprobe from loading a driver for the dead on-board NIC. If the PCI NICs use a different driver from the onboard NIC, it will not be necessary.

> I have two identical NICs
> installed, one on IRQ 5 and one on IRQ 15. Assuming I get
> both detected correctly at startup, do I need to edit any
> files to distinguish between them?

No, but you may need to 'swap ethernet assignment' (or something like that) in the SME configuration console if eth0 and eth1 got assigned wrong. Simply swapping the cables should also work.
Have a look at the output from 'lsmod' and 'ifconfig' to see if this could be the issue.
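
As a sketch of the 'lsmod' half of that check: the helper below reads 'lsmod'-style output on standard input and reports whether a named module is loaded. The name 'module_loaded' is made up for illustration, and 8139too is only an assumed driver name for the Realtek cards (older kernels call it rtl8139).

```shell
# Succeed if the named module appears in lsmod-style output on stdin.
# module_loaded is a hypothetical helper, not an SME Server command.
module_loaded() {
    # NR > 1 skips the "Module  Size  Used by" header line.
    awk -v mod="$1" 'NR > 1 && $1 == mod { found = 1 } END { exit !found }'
}

# Example use on a live system (8139too is an assumed driver name):
#   lsmod | module_loaded 8139too && echo "8139too is loaded"
```

If the module is loaded, 'ifconfig -a' should list the interfaces even when they have not been brought up.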

>
> -- Jason

Jason Judge

Re: System boot locks up on "Starting network:"
« Reply #10 on: June 08, 2002, 10:58:00 PM »
I've swapped the cables over, as my first PCI card, which would have been eth1, would now be eth0 (internal network) in place of the mobo NIC, and my new network card would be eth1 (external network).

Would be, that is, if it worked. I can boot interactive with the network disabled, and it boots fine. As soon as SME tries to bring up interface eth0 then the server hangs (completely - not even ctrl-alt-del will get out of it).

Now eth0 should be my first PCI NIC, which was working. However, I am not sure that SME really is loading the drivers for that card - I believe it either thinks the mobo NIC is still there or it's getting the wrong IRQs or something. Is there some kind of debug I can turn on?

I would go to the Redhat forums to try to get this issue solved, but many of the Redhat configuration tools are not present in SME Server, so I don't expect they would be able to help me.

Any idea what I should be looking at next?

-- Jason

robert

Re: System boot locks up on "Starting network:"
« Reply #11 on: June 09, 2002, 03:36:08 PM »
Hi Jason,
I'm afraid I'm all out of ideas, other than to recommend that you have another long, hard look at /var/log/messages, /home/e-smith/configuration, /etc/modules.conf and the output of 'lspci'. Are you sure the dead on-board NIC wasn't using the same module (is it ne2k-pci?) that the new NICs are using? (I'm not even sure that would cause the problems you're experiencing.) Have you ascertained that the new NICs work?
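
For the /etc/modules.conf part of that check, the lines of interest are the 'alias ethN driver' entries, which name the module loaded for each interface. A rough sketch for listing them ('eth_aliases' is a hypothetical helper name and the sample input is made up, not taken from Jason's server):

```shell
# Print "interface driver" for every ethN alias in modules.conf-style input.
# eth_aliases is a hypothetical helper, not an SME Server command.
eth_aliases() {
    awk '$1 == "alias" && $2 ~ /^eth[0-9]+$/ { print $2, $3 }'
}

# Made-up sample input; on a live system use: eth_aliases < /etc/modules.conf
eth_aliases <<'EOF'
alias parport_lowlevel parport_pc
alias eth0 8139too
alias eth1 8139too
EOF
# prints:
#   eth0 8139too
#   eth1 8139too
```

An alias that still points at the old on-board driver would be one sign that the configuration has not caught up with the hardware change.
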
Good luck,
Robert

Jason Judge

Re: System boot locks up on "Starting network:"
« Reply #12 on: June 09, 2002, 04:42:01 PM »
Thanks for your help everyone. I've finally fixed the problem, and learned a whole load about modules, PCI devices and Linux. First the lessons of the story (which I hope will be helpful to others):

- If the machine locks up during boot, it is most likely a hardware fault. Software will time out, hardware just stops.
- Not all network cards are created equal.
- If you are using PnP PCI network cards, then FORGET about the IRQs and IO memory addresses. These are fully automatic and - usually - out of your control.
- Just because the _symptoms_ of the problem do not change, it does not mean the _cause_ hasn't.

So, the solution:

First the mobo NIC failed. It was using an NCR driver, and this driver locked up the machine during the startup sequence. I disabled this NIC in the BIOS to get it out the way.

Next I installed an extra Realtek RTL8139 card alongside the one I already had in the server. This combination also locked up.

I spent a LONG time tracing what was happening and trying to solve this problem through the OS. I learnt a lot about the OS, and came to the conclusion that everything was being detected and set up correctly.

So, eventually, I ripped out the old RTL8139 cards (I had no reason to really - one had been working in the machine for a long time) and replaced them with slightly different models. They were still the same chipset, same drivers, even the same vendor and version IDs - just slightly different models (they lacked ACPI). This worked.

The cards I pulled checked out okay in another machine, so it looks like the two don't work side by side. The worst thing about this is that the symptoms (i.e. locking up the server) were the same with these two cards installed as they were with the duff mobo card enabled: the fault had moved, but the symptoms were the same, so I assumed the fault was the same.

Anyway, I'm back on-line now and happy that everything is now working for the sake of a pair of four-quid (6 dollar!) NICs, though I have missed a full day I could have spent pottering around the garden!

-- Jason