Communication errors between engine and nodes?

10 Mar 2015

      Setup: oVirt 3.5.1 w/hosted engine, nodes: CentOS 7, engine: CentOS 6

I am periodically seeing errors like this in my engine web UI:

2015-Mar-10, 04:42 Host node5 is not responding. It will stay in Connecting state for a grace period of 89 seconds and after that an attempt to fence the host will be issued.
2015-Mar-10, 04:42 Host node3 from cluster c1 was chosen as a proxy to execute Status command on Host node5.
2015-Mar-10, 04:42 Status of host node5 was set to Up.
2015-Mar-10, 04:42 Host node5 power management was verified successfully.

The engine.log file has this:

2015-03-10 04:42:23,310 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-40) [75b9e6d9] Command ListVDSCommand(HostName = node5, HostId = 8dfd0195-f386-4e16-9379-a5287221d5bd, vds=Host[node5,8dfd0195-f386-4e16-9379-a5287221d5bd]) execution failed.  Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded 

This seems to happen with a random node sometimes.  The VMs on the node
stay up and don't appear to experience any problem.  I can't find any
sign of a network problem on either the node, the engine, the node
hosting the engine, or the switches.  I don't see anything obvious in
the logs on any of the systems involved either.

The node network setup is VLANs on top of a bond of two NICs, each
connected to a different switch in a two-switch stack.

-- 
Chris Adams <cma@cmadams.net>

Chris Adams

Chris Adams

Lior Vernia

Chris Adams

Roel de Rooy

Chris Adams

Michal Skrivanek

Piotr Kliczewski

Roel de Rooy

Piotr Kliczewski

Roel de Rooy

Piotr Kliczewski

Gabi C

Chris Adams

tags

participants (6)