Setup: oVirt 3.5.1 with hosted engine; nodes: CentOS 7; engine: CentOS 6
I am periodically seeing errors like this in my engine web UI:
2015-Mar-10, 04:42 Host node5 is not responding. It will stay in Connecting state for a
grace period of 89 seconds and after that an attempt to fence the host will be issued.
2015-Mar-10, 04:42 Host node3 from cluster c1 was chosen as a proxy to execute Status
command on Host node5.
2015-Mar-10, 04:42 Status of host node5 was set to Up.
2015-Mar-10, 04:42 Host node5 power management was verified successfully.
The engine.log file has this:
2015-03-10 04:42:23,310 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand]
(DefaultQuartzScheduler_Worker-40) [75b9e6d9] Command ListVDSCommand(HostName = node5,
HostId = 8dfd0195-f386-4e16-9379-a5287221d5bd,
vds=Host[node5,8dfd0195-f386-4e16-9379-a5287221d5bd]) execution failed. Exception:
VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
This happens occasionally, each time on a seemingly random node. The VMs
on the affected node stay up and don't appear to experience any problem.
I can't find any sign of a network problem on the node, the engine, the
node hosting the engine, or the switches. I don't see anything obvious in
the logs on any of the systems involved either.
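
To check whether the errors really are spread randomly across nodes and times, a tally along these lines might help. This is only a rough sketch: the function name is made up for illustration, the patterns are based solely on the single engine.log entry quoted above, and it assumes each entry sits on one physical line in the real file (it is wrapped here only by the mail client).

```python
import re
from collections import Counter

# Patterns derived from the one sample entry quoted above; other oVirt
# versions may format their log lines differently (assumption).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} ")
HOST = re.compile(r"HostName = ([^,\s]+)")
MARKER = "Heartbeat exeeded"  # spelled exactly as it appears in the log


def tally_heartbeat_errors(lines):
    """Count heartbeat errors per (host, 'YYYY-MM-DD HH') bucket.

    `lines` is any iterable of log lines, e.g. an open file object.
    Lines without the marker, a timestamp, or a host name are skipped.
    """
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        ts = TIMESTAMP.match(line)
        host = HOST.search(line)
        if ts and host:
            counts[(host.group(1), ts.group(1))] += 1
    return counts
```

Feeding it the open engine.log file and printing `counts.most_common()` would show whether one node or one time of day dominates.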
The node network setup is VLANs on top of a bond of two NICs, each
connected to a different switch in a two-switch stack.
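
One way to rule out brief link flaps on the bond members is to read the per-slave link failure counters that the Linux bonding driver exposes under /proc/net/bonding/. The sketch below parses that text; the function name is hypothetical, and the bond device name (bond0 in the usage note) is an assumption, since the actual name on these nodes is not given here.

```python
def parse_bond_status(text):
    """Parse the contents of /proc/net/bonding/<bond> into a dict.

    Returns {slave_name: {"mii_status": str, "link_failures": int}}.
    Bond-level fields that appear before the first "Slave Interface:"
    line are ignored.
    """
    slaves = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
            slaves[current] = {"mii_status": None, "link_failures": None}
        elif current is not None and line.startswith("MII Status:"):
            slaves[current]["mii_status"] = line.split(":", 1)[1].strip()
        elif current is not None and line.startswith("Link Failure Count:"):
            slaves[current]["link_failures"] = int(line.split(":", 1)[1])
    return slaves
```

On a node this could be called as `parse_bond_status(open("/proc/net/bonding/bond0").read())`; a link failure count that increases between two runs would point at a flapping NIC or switch port.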
--
Chris Adams <cma(a)cmadams.net>