PHX lab downtime

David Caro dcaroest at redhat.com
Wed Feb 4 10:30:47 UTC 2015


On 02/04, Eyal Edri wrote:
> 
> 
> ----- Original Message -----
> > From: "David Caro" <dcaroest at redhat.com>
> > To: "Infra" <infra at ovirt.org>
> > Cc: "Max Kovgan" <mkovgan at redhat.com>
> > Sent: Tuesday, February 3, 2015 5:32:52 PM
> > Subject: Re: PHX lab downtime
> > 
> > On 02/03, David Caro wrote:
> > > 
> > > It took more than one hour :S
> > > 
> > > Current status is that all the vms are up and running, all the services are
> > > working, but we have one host down, ovirt-srv02 is out of the pool of
> > > hosts,
> > > with a strange issue resolving names.
> > > 
> > > When running ping it can't resolve names, but with dig it works ok. That's
> > > usually a misconfiguration in the nsswitch.conf file, but that file is ok.
> > > I also checked selinux and iptables.
> > > 
> > > Did a strace of ping, and I can see it does open a socket to the nameserver
> > > and
> > > sends the query, but nothing goes out the interface... (had tcpdump open in
> > > another screen)
> > > 
> > > socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4 <0.000786>
> > > connect(4, {sa_family=AF_INET, sin_port=htons(53),
> > > sin_addr=inet_addr("8.8.8.8")}, 16) = 0 <0.000028>
> > > poll([{fd=4, events=POLLOUT}], 1, 0)    = 1 ([{fd=4, revents=POLLOUT}])
> > > <0.000016>
> > > sendto(4, "ck\1\0\0\1\0\0\0\0\0\0\3www\6google\3com\0\0\1\0\1", 32,
> > > MSG_NOSIGNAL, NULL, 0) = 32 <0.000052>
> > > 
> > > Any idea is welcome, I'm going to get some sleep now that all the services
> > > are
> > > back up...
> > 
> > Mystery solved! Thanks Max!
> > 
> > The issue was that iproute2 can manage multiple routing tables, and the
> > usual command 'ip route show' only shows the routes in the main kernel
> > table, while newer vdsm adds an extra routing table alongside it. When I
> > modified the gateway and netmask in the routing tables of the hosts, the
> > extra side table was not updated. That led to the strange behavior of the
> > udp requests from ping to the dns server being routed through the old
> > gateway while icmp was being routed through the updated table.
> > 
> > You can find more info about routing tables and rules here:
> > 
> > http://linux-ip.net/html/routing-tables.html
> > http://linux-ip.net/html/routing-rpdb.html
> > 
> > 
> > and some examples here:
> > 
> > http://linux-ip.net/html/adv-multi-internet.html
> > 
> > 
> > Just for future reference, you can see all the routing for all the routing
> > tables with:
> > 
> >   ip route show table all
> > 
> 
> do we have the network info documented on the infra page? 
> worth documenting it or enforcing standard configuration via puppet if possible.

The network setup is documented there, but the first time you run vdsm it
changes the configuration, so I don't think it's ok to change it with puppet
when vdsm is already managing it (not always well, it seems). That would lead
to a race condition between vdsm and puppet changing the network
configuration.

But I'll add the troubleshooting tips to the docs.
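
Something along these lines might help spot a stale side table faster next
time. It's only a rough sketch: the function names are made up for this mail,
and it assumes that 'ip route show table all' prints routes from the main
table without a 'table' keyword and every other route with 'table <name>',
which is what iproute2 does as far as I know. It also only looks at one
default route per table.

```shell
#!/bin/sh
# Hypothetical helpers, not part of vdsm or iproute2.

# Reads `ip route show table all` output on stdin and prints
# "<table> <gateway>" for every default route found.
# Routes without a "table" keyword are assumed to be in the main table.
list_default_gateways() {
    awk '$1 == "default" {
        gw = "-"; table = "main"
        for (i = 2; i < NF; i++) {
            if ($i == "via")   gw = $(i + 1)
            if ($i == "table") table = $(i + 1)
        }
        print table, gw
    }'
}

# Reads the same input and prints "<table> <gateway>" for every table whose
# default gateway differs from the one in the main table.
stale_gateways() {
    list_default_gateways | awk '
        { gw[$1] = $2 }
        END {
            main = ("main" in gw) ? gw["main"] : ""
            for (t in gw)
                if (t != "main" && gw[t] != main)
                    print t, gw[t]
        }'
}
```

On a host you would run it as:

  ip route show table all | stale_gateways

No output means every table agrees with the main one. If something does show
up, 'ip rule show' tells you which rules send traffic to that table, and
'ip route get 8.8.8.8' shows the route the kernel actually picks for that
destination.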


> 
> > 
> > > 
> > > 
> > > ps. ovirt-srv02 was also the hosted engine master, and was the host that
> > > broke
> > > during the upgrade the last time. Maybe it was related to this issue, or
> > > this
> > > issue is related to that...
> > > 
> > > See you tomorrow!
> > > 
> > > 
> > > On 02/02, David Caro wrote:
> > > > 
> > > > Hi all,
> > > > 
> > > > We are having a downtime on some of the vms and hosts on the phx lab.
> > > > It's
> > > > caused by an unexpected issue with the engine and dhcp after changing the
> > > > gateway of the machines to adapt to the new ip range.
> > > > 
> > > > It's almost fixed, but we (I) should really get the environment to a
> > > > really
> > > > stable status again and finish the upgrade.
> > > > 
> > > > There are still some issues, but most of them are already fixed, will fix
> > > > the
> > > > rest in less than one hour.
> > > > 
> > > > 
> > > > 
> > > > --
> > > > David Caro
> > > > 
> > > > Red Hat S.L.
> > > > Continuous Integration Engineer - EMEA ENG Virtualization R&D
> > > > 
> > > > Tel.: +420 532 294 605
> > > > Email: dcaro at redhat.com
> > > > Web: www.redhat.com
> > > > RHT Global #: 82-62605
> > > 
> > > 
> > > 
> > > > _______________________________________________
> > > > Infra mailing list
> > > > Infra at ovirt.org
> > > > http://lists.ovirt.org/mailman/listinfo/infra

-- 
David Caro

Red Hat S.L.
Continuous Integration Engineer - EMEA ENG Virtualization R&D

Tel.: +420 532 294 605
Email: dcaro at redhat.com
Web: www.redhat.com
RHT Global #: 82-62605