PHX lab downtime

David Caro dcaroest at redhat.com
Tue Feb 3 15:32:52 UTC 2015


On 02/03, David Caro wrote:
> 
> It took more than one hour :S
> 
> Current status is that all the vms are up and running, all the services are
> working, but we have one host down, ovirt-srv02 is out of the pool of hosts,
> with a strange issue resolving names.
> 
> When running ping it can't resolve names, but with dig it works ok. That's
> usually a misconfiguration in the nsswitch.conf file, but that file looks ok.
> I also checked selinux and iptables.
> 
> I ran strace on ping, and I can see that it opens a socket to the nameserver
> and sends the query, but nothing goes out on the interface... (I had tcpdump
> open in another screen)
> 
> socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4 <0.000786>
> connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, 16) = 0 <0.000028>
> poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}]) <0.000016>
> sendto(4, "ck\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32, MSG_NOSIGNAL, NULL, 0) = 32 <0.000052>
> 
> Any idea is welcome, I'm going to get some sleep now that all the services are
> back up...

Mystery solved! Thanks Max!

The issue was that iproute2 can manage multiple routing tables, and the usual
command 'ip route show' only shows the routes in the main table, while newer
vdsm versions add a separate table for their own routing. When I modified the
gateway and netmask in the routing tables of the hosts, that extra table was
not updated. That led to the strange behavior of ping's UDP requests to the DNS
server being routed through the old gateway while ICMP was being routed through
the updated table.

You can find more info about routing tables and rules here:

http://linux-ip.net/html/routing-tables.html
http://linux-ip.net/html/routing-rpdb.html


and some examples here:

http://linux-ip.net/html/adv-multi-internet.html


Just for future reference, you can see the routes in all the routing tables
with:

  ip route show table all
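
To spot this kind of mismatch faster, something like the following might help.
It is only a rough sketch: the function name is made up, and it assumes that
routes printed without a 'table' keyword belong to the main table. It reads
'ip route show table all' output on stdin and prints one line per default
route, so a stale gateway in a side table stands out.

```shell
# Hypothetical helper, not part of iproute2 or vdsm.
# Reads `ip route show table all` output on stdin and prints
# "<table> <gateway> <device>" for every default route found.
default_gateways_by_table() {
    awk '
    $1 == "default" {
        # Assumption: a route line with no "table" keyword is in the main table.
        table = "main"; gw = "-"; dev = "-"
        for (i = 2; i < NF; i++) {
            if ($i == "table") table = $(i + 1)
            if ($i == "via")   gw    = $(i + 1)
            if ($i == "dev")   dev   = $(i + 1)
        }
        print table, gw, dev
    }'
}
```

On a host you would run it as:

  ip route show table all | default_gateways_by_table

If two tables report different gateways for the same device, 'ip rule show'
lists the rules that decide which table a packet is looked up in, and
'ip route get 8.8.8.8' shows the route the kernel would actually pick for that
destination.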


> 
> 
> ps. ovirt-srv02 was also the hosted engine master, and was the host that broke
> during the upgrade the last time. Maybe it was related to this issue, or this
> issue is related to that...
> 
> See you tomorrow!
> 
> 
> On 02/02, David Caro wrote:
> > 
> > Hi all,
> > 
> > We are having a downtime on some of the vms and hosts on the phx lab. It's
> > caused by an unexpected issue with the engine and dhcp after changing the
> > gateway of the machines to adapt to the new ip range.
> > 
> > It's almost fixed, but we (I) should really get the environment back to a
> > stable state and finish the upgrade.
> > 
> > There are still some issues, but most of them are already fixed; I will fix
> > the rest in less than one hour.


-- 
David Caro

Red Hat S.L.
Continuous Integration Engineer - EMEA ENG Virtualization R&D

Tel.: +420 532 294 605
Email: dcaro at redhat.com
Web: www.redhat.com
RHT Global #: 82-62605