Koozali.org: home of the SME Server
Contribs.org Forums => General Discussion => Topic started by: pablitobs on February 19, 2009, 05:36:10 AM
-
Hello folks, I am seeing some weird behaviour on my SME servers and I hope someone can figure it out.
First, my SME structure:
- 1. One SME Server 7.3 as gateway
- 2. One SME Server 7.3 in server-only mode, acting as web application server for the intranet
- 3. Both servers are identical in hardware and new
- 4. An externally hosted web host, running perfectly
So, this is the problem. The web application server connects to the web host server to retrieve and update some DB records. It used to work fine. The web server is configured to allow access only from authorized IPs.
The problem is that the web application server sometimes connects to the web server, and after some seconds it just hangs.
If I telnet to the MySQL server on the web server from the web application server, it usually fails; sometimes it works, sometimes not.
If I telnet to the MySQL server on the web server from the gateway server, it works, so there is no port blocking.
The web server is working fine.
I already flushed hosts and flushed the Squid cache, but the web application server still works on and off.
So is there any way to check why it is hanging and not connecting? Is there any firewall configuration to check? I've been trying to follow the logs, but I can't find a log that tells me anything.
Any help will be appreciated. Thanks.
-
pablitobs
From the description you give, your problem is not clear to me.
I think you mean you are having trouble accessing mysql remotely.
See
http://wiki.contribs.org/SME_Server:Documentation:FAQ#Access_MySQL_from_the_local_network
and
http://wiki.contribs.org/SME_Server:Documentation:FAQ#Access_MySQL_from_a_remote_network
-
Hi, I know what you mean, but that is not the case. What I am trying to do is access a normal website's MySQL database from my server-only SME server, but it works on and off. Sometimes, when running a PHP script from the SME server, it connects fine to the web server hosted outside my intranet, but after three attempts or fewer it gives me a connection error. At first I figured it was the web server, so I decided to run a test. I opened a PuTTY terminal, connected to my gateway server, and then telnetted to the MySQL server on the web host. It worked fine each time I tried. Then, simultaneously, I opened another PuTTY terminal and connected to the other SME server (server-only mode). From there I telnetted to the MySQL server on the web host, but this time it gave a timeout error, or it connected once in 5 tries. So it is not the web server, and it is not the internet connection or a blocked port, as the gateway server always connects fine to the web server. That is my problem.
What I would like to know is why this is happening, as a few days ago it was working fine.
Thanks
-
Do you have anything like Dansguardian installed on the gateway server? Can you ping and tracert to the external web host from the intranet server during an outage?
-
Yes, I have Dansguardian installed, but the IP and the domain are not blocked. In fact, if they were, I believe I would have no access at all, but it is failing 4 out of 5 connections. I can ping the IP and the domain name from the gateway, as well as telnet.
-
Try doing a traceroute from the affected machine to the external server when the issue is occurring. This may shed some light on where the connection is timing out.
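To put numbers on the "once in 5 tries" behaviour described above, the manual telnet test could also be scripted and run on both the gateway and the intranet server. The following is only a rough sketch, assuming Python 3 is available on the box (an assumption, nothing in this thread confirms it); the names `probe` and `summarize` are made up for illustration.

```python
import socket
import time


def probe(host, port, attempts=5, timeout=5.0, delay=1.0):
    """Try a plain TCP connect `attempts` times; return a list of (ok, seconds)."""
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            # Same check as "telnet host port": only the TCP handshake is tested,
            # no MySQL login is attempted.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:
            # Covers timeouts, refused connections and unreachable hosts.
            ok = False
        results.append((ok, time.monotonic() - start))
        if i < attempts - 1:
            time.sleep(delay)
    return results


def summarize(results):
    """Return (successes, total) for a list produced by probe()."""
    return sum(1 for ok, _ in results if ok), len(results)
```

For example, `summarize(probe("69.65.40.154", 3306))` uses the web host IP from this thread and 3306, the default MySQL port. Comparing the result from the gateway with the result from the intranet server at the same moment would show whether only one of the two machines is affected.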
-
Thanks I will do it and I will post the results here...
-
Here is the traceroute:
traceroute to 69.65.40.154 (69.65.40.154), 30 hops max, 38 byte packets
1 pc-00001 (192.168.1.1) 0.142 ms 0.091 ms 0.091 ms
2 118.23.8.0 (118.23.8.0) 3.153 ms 3.303 ms 3.482 ms
3 118.23.5.5 (118.23.5.5) 3.496 ms 3.080 ms 2.748 ms
4 125.206.149.245 (125.206.149.245) 4.731 ms 4.569 ms 4.490 ms
5 60.37.11.41 (60.37.11.41) 2.982 ms 2.843 ms 2.742 ms
6 210.254.188.141 (210.254.188.141) 3.235 ms 3.336 ms 2.737 ms
7 210.254.188.146 (210.254.188.146) 3.238 ms 3.311 ms 3.495 ms
8 210.145.252.186 (210.145.252.186) 4.483 ms 3.331 ms 3.491 ms
9 ae-5.r21.tokyjp01.jp.bb.gin.ntt.net (129.250.11.53) 3.738 ms 3.860 ms 3.491 ms
Icmp checksum is wrong
10 as-2.r21.snjsca04.us.bb.gin.ntt.net (129.250.5.81) 131.171 msIcmp checksum is wrong
131.474 msIcmp checksum is wrong
as-0.r21.lsanca03.us.bb.gin.ntt.net (129.250.3.145) 118.423 ms
11 * * *
12 * * *
13 xe-0.level3.sttlwa01.us.bb.gin.ntt.net (129.250.9.162) 122.802 ms xe-1.level3.sttlwa01.us.bb.gin.ntt.net (129.250.9.210) 99.459 ms xe-0.level3.lsanca03.us.bb.gin.ntt.net (129.250.8.182) 114.283 ms
14 ae-32-52.ebr2.Seattle1.Level3.net (4.68.105.62) 104.327 ms 100.261 ms ae-92-92.ebr2.SanJose1.Level3.net (4.69.134.221) 128.142 ms
15 ae-3.ebr1.Denver1.Level3.net (4.69.132.58) 198.953 ms 184.215 ms ae-2.ebr2.Denver1.Level3.net (4.69.132.54) 213.611 ms
16 ae-1-100.ebr2.Denver1.Level3.net (4.69.132.38) 176.742 ms ae-2.ebr3.SanJose1.Level3.net (4.69.132.9) 125.790 ms ae-3.ebr1.Chicago2.Level3.net (4.69.132.62) 179.203 ms
17 ae-3.ebr1.Chicago2.Level3.net (4.69.132.62) 160.986 ms 161.231 ms 161.128 ms
18 ae-11-53.car1.Chicago1.Level3.net (4.68.101.66) 179.390 ms ae-62-62.ebr2.SanJose1.Level3.net (4.69.134.209) 119.511 ms 135.026 ms
19 ae-11-55.car1.Chicago1.Level3.net (4.68.101.130) 161.628 ms ae-3.ebr1.Denver1.Level3.net (4.69.132.58) 184.191 ms ae-11-55.car1.Chicago1.Level3.net (4.68.101.130) 146.227 ms
20 pos2-1.csr1.Chi3.Servernap.net (69.39.239.170) 179.715 ms 180.994 ms ae0-40.er1.Chi1.Servernap.net (4.79.65.50) 161.619 ms
21 ae-6.ebr1.Chicago1.Level3.net (4.69.140.189) 185.471 ms houston.micfo.com (69.65.40.154) 173.252 ms pos2-1.csr1.Chi3.Servernap.net (69.39.239.170) 147.270 ms
-
9 ae-5.r21.tokyjp01.jp.bb.gin.ntt.net (129.250.11.53) 3.738 ms 3.860 ms 3.491 ms
Icmp checksum is wrong
10 as-2.r21.snjsca04.us.bb.gin.ntt.net (129.250.5.81) 131.171 msIcmp checksum is wrong
131.474 msIcmp checksum is wrong
I'm guessing that this might have something to do with your issue. Can we please see two more traceroutes for comparison:
1. From the intranet server to the external server when everything is working correctly
2. From the gateway to the external server when the error is happening
Thanks :)
-
Hi, as per your request
----------------------------------------------------------------------
2. From the gateway to the external server when the error is happening
----------------------------------------------------------------------
traceroute to 69.65.40.154 (69.65.40.154), 30 hops max, 38 byte packets
1 118.23.8.0 (118.23.8.0) 6.875 ms 3.708 ms 3.219 ms
2 118.23.5.5 (118.23.5.5) 3.221 ms 3.015 ms 2.701 ms
3 125.206.149.245 (125.206.149.245) 4.714 ms 4.449 ms 4.742 ms
4 60.37.11.41 (60.37.11.41) 2.955 ms 3.216 ms 2.977 ms
5 210.254.188.141 (210.254.188.141) 2.979 ms 3.261 ms 2.725 ms
6 210.254.188.146 (210.254.188.146) 3.228 ms 3.067 ms 3.225 ms
7 210.145.252.186 (210.145.252.186) 3.238 ms 3.286 ms 3.480 ms
8 ae-5.r21.tokyjp01.jp.bb.gin.ntt.net (129.250.11.53) 106.918 ms 3.487 ms 3.232 ms
9 as-2.r21.snjsca04.us.bb.gin.ntt.net (129.250.5.81) 116.414 ms as-0.r21.lsanca03.us.bb.gin.ntt.net (129.250.3.145) 125.968 ms as-2.r21.snjsca04.us.bb.gin.ntt.net (129.250.5.81) 116.779 ms
MPLS Label=299792 CoS=6 TTL=1 S=0
10 po-2.r01.lsanca03.us.bb.gin.ntt.net (129.250.3.162) 179.848 ms as-1.r21.sttlwa01.us.bb.gin.ntt.net (129.250.3.87) 123.275 ms *
11 xe-11-1-0.edge1.SanJose3.level3.net (4.68.111.189) 128.300 ms * xe-9-0-0.edge1.SanJose3.level3.net (4.68.110.49) 135.435 ms
12 xe-1.level3.sttlwa01.us.bb.gin.ntt.net (129.250.9.210) 183.340 ms vlan99.csw4.SanJose1.Level3.net (4.68.18.254) 130.977 ms vlan89.csw3.SanJose1.Level3.net (4.68.18.190) 119.894 ms
13 ae-62-62.ebr2.SanJose1.Level3.net (4.69.134.209) 127.100 ms ae-32-52.ebr2.Seattle1.Level3.net (4.68.105.62) 131.300 ms ae-62-62.ebr2.SanJose1.Level3.net (4.69.134.209) 132.882 ms
14 ae-2.ebr2.Denver1.Level3.net (4.69.132.54) 205.076 ms 186.827 ms 214.599 ms
15 ae-63-63.csw1.SanJose1.Level3.net (4.69.134.226) 134.404 ms ae-2.ebr3.SanJose1.Level3.net (4.69.132.9) 131.350 ms ae-63-63.csw1.SanJose1.Level3.net (4.69.134.226) 129.041 ms
16 ae-62-62.ebr2.SanJose1.Level3.net (4.69.134.209) 124.877 ms ae-3.ebr1.Chicago2.Level3.net (4.69.132.62) 161.359 ms ae-62-62.ebr2.SanJose1.Level3.net (4.69.134.209) 131.526 ms
17 ae-11-51.car1.Chicago1.Level3.net (4.68.101.2) 303.539 ms 315.791 ms 212.599 ms
18 ae-3.ebr1.Denver1.Level3.net (4.69.132.58) 190.616 ms 174.020 ms ae0-40.er1.Chi1.Servernap.net (4.79.65.50) 170.891 ms
19 pos2-1.csr1.Chi3.Servernap.net (69.39.239.170) 148.880 ms ae0-40.er1.Chi1.Servernap.net (4.79.65.50) 146.141 ms 161.592 ms
20 pos2-1.csr1.Chi3.Servernap.net (69.39.239.170) 162.878 ms houston.micfo.com (69.65.40.154) 149.128 ms 149.668 ms
---------------------------------------------------------------------------------
1. From the intranet server to the external server when everything is working correctly
--------------------------------------------------------------------------------
traceroute to 69.65.40.154 (69.65.40.154), 30 hops max, 38 byte packets
1 pc-00001 (192.168.1.1) 0.124 ms 0.090 ms 0.103 ms
2 118.23.8.0 (118.23.8.0) 3.198 ms 3.115 ms 3.456 ms
3 118.23.5.5 (118.23.5.5) 2.937 ms 2.626 ms 2.951 ms
4 125.206.149.245 (125.206.149.245) 4.428 ms 4.909 ms 4.487 ms
5 60.37.11.41 (60.37.11.41) 2.996 ms 2.617 ms 3.215 ms
6 210.254.188.141 (210.254.188.141) 2.958 ms 3.130 ms 2.958 ms
7 210.254.188.146 (210.254.188.146) 2.956 ms 3.156 ms 3.457 ms
8 210.145.252.186 (210.145.252.186) 3.496 ms 3.159 ms 3.489 ms
9 ae-5.r21.tokyjp01.jp.bb.gin.ntt.net (129.250.11.53) 3.746 ms 3.850 ms 3.489 ms
Icmp checksum is wrong
10 ae-3.r21.osakjp01.jp.bb.gin.ntt.net (129.250.4.214) 12.499 msIcmp checksum is wrong
as-0.r21.lsanca03.us.bb.gin.ntt.net (129.250.3.145) 119.534 msIcmp checksum is wrong
121.023 ms
11 *Icmp checksum is wrong
as-1.r21.sttlwa01.us.bb.gin.ntt.net (129.250.3.87) 99.660 ms *
12 xe-11-1-0.edge1.SanJose3.level3.net (4.68.111.189) 212.199 ms xe-11-0-0.edge1.SanJose3.level3.net (4.68.111.249) 111.581 ms *
13 vlan89.csw3.SanJose1.Level3.net (4.68.18.190) 136.163 ms xe-0.level3.sttlwa01.us.bb.gin.ntt.net (129.250.9.162) 206.470 ms xe-1.level3.lsanca03.us.bb.gin.ntt.net (129.250.9.86) 116.288 ms
14 ae-32-52.ebr2.Seattle1.Level3.net (4.68.105.62) 106.786 ms ae-93-93.ebr3.LosAngeles1.Level3.net (4.69.137.45) 119.816 ms ae-82-82.ebr2.SanJose1.Level3.net (4.69.134.217) 127.003 ms
15 ae-2.ebr2.Denver1.Level3.net (4.69.132.54) 194.728 ms ae-83-83.ebr3.LosAngeles1.Level3.net (4.69.137.41) 114.800 ms ae-3.ebr1.Denver1.Level3.net (4.69.132.58) 195.191 ms
16 ae-73-73.csw2.SanJose1.Level3.net (4.69.134.230) 129.284 ms ae-1-100.ebr2.Denver1.Level3.net (4.69.132.38) 196.479 ms ae-73-73.csw2.SanJose1.Level3.net (4.69.134.230) 134.501 ms
17 ae-6.ebr1.Chicago1.Level3.net (4.69.140.189) 204.910 ms ae-3.ebr1.Chicago2.Level3.net (4.69.132.62) 161.008 ms ae-83-83.csw3.SanJose1.Level3.net (4.69.134.234) 137.044 ms
18 ae-11-51.car1.Chicago1.Level3.net (4.68.101.2) 374.343 ms ae-3.ebr1.Denver1.Level3.net (4.69.132.58) 181.722 ms ae-6.ebr1.Chicago1.Level3.net (4.69.140.189) 190.738 ms
19 ae-3.ebr1.Denver1.Level3.net (4.69.132.58) 188.244 ms ae0-40.er1.Chi1.Servernap.net (4.79.65.50) 178.792 ms ae-3.ebr1.Denver1.Level3.net (4.69.132.58) 171.763 ms
20 ae0-40.er1.Chi1.Servernap.net (4.79.65.50) 146.506 ms pos2-1.csr1.Chi3.Servernap.net (69.39.239.170) 180.022 ms ae0-40.er1.Chi1.Servernap.net (4.79.65.50) 146.757 ms
21 ae-6.ebr1.Chicago1.Level3.net (4.69.140.189) 181.494 ms ae-3.ebr1.Chicago2.Level3.net (4.69.132.62) 155.782 ms pos2-1.csr1.Chi3.Servernap.net (69.39.239.170) 162.472 ms
22 ae-6.ebr1.Chicago1.Level3.net (4.69.140.189) 186.994 ms 181.500 ms houston.micfo.com (69.65.40.154) 149.129 ms
---------------------------------------------------------------------------------------
I generated this last traceroute as soon as I got a connection between the intranet server and the web server. I then checked the connection from the intranet server to the web server again, and it was failing.
Googling this error, I found that some people believe it is a CentOS bug, but everything was working fine on my servers for the past 4 months, and they are twin servers (hardware and software), so it is hard for me to believe it is a bug. I have Dansguardian; could that be the reason?
Hope the data helps.
-
---------------------------------------------------------------------------------
1. From the intranet server to the external server when everything is working correctly
--------------------------------------------------------------------------------
9 ae-5.r21.tokyjp01.jp.bb.gin.ntt.net (129.250.11.53) 3.746 ms 3.850 ms 3.489 ms
Icmp checksum is wrong
10 ae-3.r21.osakjp01.jp.bb.gin.ntt.net (129.250.4.214) 12.499 msIcmp checksum is wrong
as-0.r21.lsanca03.us.bb.gin.ntt.net (129.250.3.145) 119.534 msIcmp checksum is wrong
121.023 ms
11 *Icmp checksum is wrong
The icmp checksum issue happens when the connection is working as well, so we can probably eliminate this as a potential error - likely it's just the CentOS bug manifesting itself, as you suggest.
With regards to the additional traces you have provided, you can see from the traces that the data is going via a longer and quite different route when things are not working.
What I suggest is this: let's eliminate SME Server as the cause of your issue. Temporarily replace SME with a garden-variety SOHO broadband router (Netgear, DLink or similar), and see if the issue remains. If your connection problems go away, next try reintroducing a vanilla install of SME Server (no contribs etc., just the base install) on a spare box and trying again.
This may give us a better idea as to what might be behind the problem. I'm leaning towards an ISP issue, but if taking the SME gateway out of the picture solves the problem, then I stand to be corrected.
-
OK, right now the servers are in production. I will find a time window today to run the tests, and I will get back to you as soon as I have any news.
Thanks for your help.
-
Hi guys, I finally found the problem. Now the story:
My server is a new Dell PowerEdge T300, full of RAM and space and all that stuff, but also with two NICs, which I connected via the bonding option on the configuration panel when I installed SME 7.x.
Somehow, the problem was that the kernel could not work out which of the NICs should be used when a request came in, so it took a long time to figure it out before the answer was ready, causing delays, latency and the "Icmp checksum is wrong" messages.
I googled it a little and it is something called the ARP Flux Problem (http://linux-ip.net/html/ether-arp.html, scroll almost to the end). It happens when a Linux box is connected to a network segment with multiple network cards: a potential problem with the link layer address to IP address mapping can occur.
So after I unplugged one of the NICs and reconfigured the server to unset the bonding, everything started working fine.
I assume there is a problem with some kind of cache, because the first configuration with the two NICs was done about 4 months ago and it was working fine, but after some time it started to fail a little every day, until two days ago it was impossible to reach any server outside the box.
Thanks for the help and the suggestions. I hope this solution helps other people.
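For anyone who wants to check whether a multi-NIC box might be exposed to the ARP flux behaviour mentioned above, here is a rough sketch that reads the kernel's ARP-related settings for each interface. It assumes Python 3 and that the kernel exposes `arp_ignore`, `arp_announce` and `arp_filter` under `/proc/sys/net/ipv4/conf/`. The function names are made up for illustration, and treating `arp_ignore == 0` as "flux prone" is only a heuristic, not a diagnosis. It is not what I did to fix my server (I removed the bonding).

```python
import os

ARP_SETTINGS = ("arp_ignore", "arp_announce", "arp_filter")


def read_arp_settings(conf_root="/proc/sys/net/ipv4/conf"):
    """Return {interface: {setting: int or None}} for the ARP-related sysctls."""
    settings = {}
    for iface in sorted(os.listdir(conf_root)):
        values = {}
        for name in ARP_SETTINGS:
            path = os.path.join(conf_root, iface, name)
            try:
                with open(path) as fh:
                    values[name] = int(fh.read().strip())
            except (OSError, ValueError):
                # Setting missing or unreadable on this kernel.
                values[name] = None
        settings[iface] = values
    return settings


def flux_prone(settings):
    """List interfaces whose effective arp_ignore is 0 (heuristic only).

    The kernel uses the larger of the "all" value and the per-interface value.
    A missing value is treated as 0 here, which is an assumption of this sketch.
    """
    all_value = (settings.get("all") or {}).get("arp_ignore") or 0
    prone = []
    for iface, values in settings.items():
        if iface in ("all", "default", "lo"):
            continue
        own = values.get("arp_ignore") or 0
        if max(all_value, own) == 0:
            prone.append(iface)
    return prone
```

With `arp_ignore` at 0, the kernel answers ARP requests for any local address on any interface, which is the condition the linux-ip.net page describes. Running `flux_prone(read_arp_settings())` on the server only lists the interfaces in that state; whether that actually causes trouble depends on how the NICs are cabled.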
-
I'm glad you sorted it out!
-
Me too, thanks for your help.
-
Maybe you want to update the wiki with your experience:
http://wiki.contribs.org/KnownProblems#Problem_with_NIC_card_or_integrated_NIC.
-
Sure, it will be an honor, but how can I update the wiki? I went to the link but did not find a way. Sorry.
-
pablitobs
....how can I update the wiki, I went to the link but did not find a way..
If you do not already have wiki edit access, then you must request Wiki edit access by lodging a bug report. See the bugzilla link at top of forums. If you have never used bugzilla before then you will need to register as a new user.
After access has been granted, you login at the top of the wiki page using the same username and password as you use in the Forums. Then you will see the Edit tag alongside each article.
-
If you do not already have wiki edit access, then you must request Wiki edit access by lodging a bug report. See the bugzilla link at top of forums. If you have never used bugzilla before then you will need to register as a new user.
No, that is no longer necessary. The procedure has recently been improved and can be found here: http://wiki.contribs.org/Help:Contents