Koozali.org: home of the SME Server
Obsolete Releases => SME Server 8.x => Topic started by: Brave Dave on May 28, 2012, 02:09:54 PM
-
Is anyone seeing this from the Kernel:
I have:
uname -a
Linux r300 2.6.18-308.4.1.el5PAE #1 SMP Tue Apr 17 17:47:38 EDT 2012 i686 i686 i386 GNU/Linux
Seems to be discussed here:
http://bugs.centos.org/view.php?id=4515
Everything is running sweet, then kabam, the whole lot freezes up, the CPU goes ballistic, then everything comes back, but there are a lot of zombie tasks. It is definitely related to server load - just before this happens the server (a 4-core Xeon in this case) might be running at a load of 4-5 in htop, then it spikes up to 20 and higher.
I'm seeing it when I run VMware Server - others over at CentOS are seeing it with other tasks.
May 28 14:17:51 r300 kernel: INFO: task vmware-vmx:10685 blocked for more than 120 seconds.
May 28 14:17:51 r300 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 28 14:17:51 r300 kernel: vmware-vmx D 000123A4 1012 10685 1 10686 10684 (NOTLB)
May 28 14:17:51 r300 kernel: de20acfc 00003082 64c5bd40 000123a4 00000000 00000000 00000003 0000000a
May 28 14:17:51 r300 kernel: f778e000 65214ac0 000123a4 005b8d80 00000002 f778e10c c5620908 f7782040
May 28 14:17:51 r300 kernel: 00000001 00000000 c53e0d60 00000000 d20af788 c042d81f c574cc3c ffffffff
then a stack of other debugging stuff ...
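To gauge how often these warnings occur and which processes they name, log lines like the ones above can be summarised with a short shell sketch. The function names are hypothetical, and the default log path /var/log/messages is an assumption (the usual syslog destination on CentOS 5, not something stated in this thread).

```shell
#!/bin/sh
# Summarise kernel hung-task warnings per process name.
# Usage: hung_task_summary [logfile]
# The default /var/log/messages is an assumption, not taken from the thread.
hung_task_summary() {
    logfile="${1:-/var/log/messages}"
    # Match lines such as:
    #   ... kernel: INFO: task vmware-vmx:10685 blocked for more than 120 seconds.
    # then print "count name", most frequent first.
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than [0-9][0-9]* seconds.*/\1/p' "$logfile" \
        | sort | uniq -c | sort -rn | awk '{ $1 = $1; print }'
}

# List processes currently in uninterruptible sleep (state D), the state
# shown for vmware-vmx in the log excerpt. The header line is kept.
list_d_state() {
    ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
}
```

Running `hung_task_summary` as root right after a freeze, followed by `list_d_state`, should show whether only vmware-vmx is affected or other tasks as well.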
-
try booting with another kernel and let us know
-
This appears to be a VMware issue, perhaps with the VMware driver, perhaps with the VMware server's I/O performance.
-
> This appears to be a vmware issue, perhaps with the vmware driver, perhaps with the vmware server's I/O performance.
I follow the Italian CentOS user forum and there are some similar issues there, with httpd, for example.
The strange thing is that, AFAICS, only the latest CentOS kernel is affected, but I could be wrong.
-
I'm reading - like Stefano - that there is a wider issue bubbling; it seems wider than just VMware.
I've rebooted one server with an older kernel.
-
> The strange thing is that, AFAICS, only the latest CentOS kernel is affected, but I could be wrong.
The referenced CentOS and RH bug tracker entries are for much older kernels.
If there is a problem with the current RH kernel, then hopefully they will find and fix the problem promptly.
-
> The strange thing is that, AFAICS, only the latest CentOS kernel is affected, but I could be wrong.
No hits on the RH bugzilla that I can see:
site:bugzilla.redhat.com "blocked for more than 120 seconds" 2.6.18-308.4.1.el5
-
I was just curious whether it was being seen at all.
It came about because of pretty heavy write activity inside the VM: a user was using it to do a backup, so it was a large sustained write.
Thanks anyway
-
I think I have this worked out
- The kernel has a queueing mechanism (I know, stating the obvious - it always did)
- It now also has a hung-task check: going by the log message above, it reports tasks that stay blocked for longer than a timeout (it does not kill them), and the default is 120 seconds
- If you introduce a VM task and copy a large file to the disk - or do anything else that involves a long slow write - it is likely to be flagged as a hung task. This task was a zip of a large file across the network (the things end users do - you would think they would ask the sysadmin first)
- You can change the timeout by modifying /etc/sysctl.conf (templated of course) - I extended the parameter to 600 - I could see it changing behaviour - htop got as high as 30 in the load average - so there was a fair build up
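As a minimal sketch of the tuning step: the timeout can be read from, and written to, /proc/sys/kernel/hung_task_timeout_secs, the path quoted in the kernel message itself. That message says writing 0 "disables this message", so the value appears to be a reporting threshold. The helper names and the HUNG_TASK_FILE override are hypothetical, added only so the functions can be tried against an ordinary file first.

```shell
#!/bin/sh
# Path taken from the kernel log message; overridable for dry runs.
HUNG_TASK_FILE="${HUNG_TASK_FILE:-/proc/sys/kernel/hung_task_timeout_secs}"

# Print the current timeout in seconds (0 disables the warning).
get_hung_task_timeout() {
    cat "${1:-$HUNG_TASK_FILE}"
}

# Set a new timeout; rejects anything that is not a non-negative integer.
# Writing to the real /proc file needs root and only lasts until reboot.
set_hung_task_timeout() {
    secs="$1"
    target="${2:-$HUNG_TASK_FILE}"
    case "$secs" in
        ''|*[!0-9]*)
            echo "usage: set_hung_task_timeout SECONDS [FILE]" >&2
            return 1
            ;;
    esac
    echo "$secs" > "$target"
}
```

For example, `set_hung_task_timeout 600` as root would apply the value used here straight away, without touching /etc/sysctl.conf.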
VMware Server is end of life anyway. The way to do things used to be to run VMs inside SME, and you still can; this only exposes a limit. It is probably better to run SME inside the VM ...
See here - playing with the time out:
# cat /etc/e-smith/templates-custom/etc/sysctl.conf/kernel.hung_task_timeout_secs
# To cope with extended timeouts
# with large disk writes from vmware
kernel.hung_task_timeout_secs = 600
(didn't actually test this last part - needs a reboot to check)
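A reboot may not be strictly needed to check this. The sketch below assumes the standard SME Server template tool lives at /sbin/e-smith/expand-template and that `sysctl -p` is available; neither is confirmed in this thread. The function names are hypothetical. The first helper only reads a file, so it can be used to confirm that the custom fragment made it into the expanded /etc/sysctl.conf.

```shell
#!/bin/sh
# Print the value a sysctl.conf-style file assigns to a key
# (the last assignment wins). Returns non-zero if the key is absent.
sysctl_conf_value() {
    key="$1"
    file="$2"
    awk -F= -v k="$key" '
        /^[ \t]*[#;]/ { next }                       # skip comment lines
        { name = $1; gsub(/[ \t]/, "", name) }       # key without whitespace
        name == k { val = $2; gsub(/^[ \t]+|[ \t]+$/, "", val); found = 1 }
        END { if (found) print val; else exit 1 }
    ' "$file"
}

# Regenerate /etc/sysctl.conf from its templates and load it.
# Run as root on the SME Server; the expand-template path is an assumption.
apply_sysctl_template() {
    /sbin/e-smith/expand-template /etc/sysctl.conf && sysctl -p /etc/sysctl.conf
}
```

After `apply_sysctl_template`, running `sysctl_conf_value kernel.hung_task_timeout_secs /etc/sysctl.conf` should print 600 if the fragment was picked up.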
-
> try booting with another kernel and let us know
Didn't help - no change.