Martin -
One thing I noticed on all of the nodes is this:
Jun 14 08:11:11 njsevcnp01 ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Jun 14 08:11:11 njsevcnp01.fixflyer.com ovirt-ha-agent[15713]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
Then the agent is restarted:
[root@njsevcnp01 ~]# ps -Aef | grep -i ovirt-ha-agent | grep -iv grep
vdsm 15713 1 0 08:09 ? 00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
I don't know why the connection would time out: as you can see, that log is from node01, and as far as I know the agent talks to the broker on the same host, so I can't figure out why the connection is timing out.
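
For what it's worth, a minimal check of whether the broker is even accepting connections might look like the sketch below. I'm assuming the broker listens on a local Unix socket (something like /var/run/ovirt-hosted-engine-ha/broker.socket); if that assumption is wrong on your version, the path needs adjusting.

# Rough sketch: see whether anything answers on the HA broker's local socket.
# The socket path below is an assumption, not taken from the logs above.
import socket

BROKER_SOCKET = '/var/run/ovirt-hosted-engine-ha/broker.socket'  # assumed path

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect(BROKER_SOCKET)
    print('broker socket is accepting connections')
except socket.error as e:
    print('cannot connect to broker socket: %s' % e)
finally:
    s.close()
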
The other interesting thing is this log from node01. The odd part is that it looks like there is some split brain somewhere in oVirt: the entry for node02 comes back as 'vm not running on this host' rather than 'stale data', even though the engine is in fact running on node02 (see the excerpt from node02 further down). But I don't know the engine internals.
MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp02 (id 2): {hostname: njsevcnp02, host-id: 2, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 25da07df, host-ts: 3030}
MainThread::INFO::2016-06-14 08:13:05,163::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}
And that same log on node02, where the engine is running:
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp01 (id 1): {hostname: njsevcnp01, host-id: 1, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: 260dbf06, host-ts: 327}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::171::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Host njsevcnp03 (id 3): {hostname: njsevcnp03, host-id: 3, engine-status: {reason: vm not running on this host, health: bad, vm: down, detail: unknown}, score: 0, stopped: True, maintenance: False, crc32: c67818cb, host-ts: 10877406}
MainThread::INFO::2016-06-14 08:15:44,451::state_machine::174::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(refresh) Local (id 2): {engine-health: {health: good, vm: up, detail: up}, bridge: True, mem-free: 20702.0, maintenance: False, cpu-load: None, gateway: True}
MainThread::INFO::2016-06-14 08:15:44,452::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1465906544.45 type=state_transition detail=StartState-ReinitializeFSM hostname=njsevcnp02
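
On the stale-data question above: as far as I understand, 'stale data' is reported when a host's published metadata timestamp (the host-ts value in those refresh lines) stops advancing, so a rough way to check is to scan agent.log and see whether each host's host-ts actually moves. A quick-and-dirty sketch, relying only on the log line format shown above (log path is the default on my setup, adjust as needed):

# Rough sketch: report the first and last host-ts seen per host in agent.log,
# to check whether the metadata timestamps are actually advancing.
import re

LOG = '/var/log/ovirt-hosted-engine-ha/agent.log'   # default path, adjust if needed
pattern = re.compile(r'\(refresh\) Host (\S+) \(id \d+\).*host-ts: (\d+)')

seen = {}  # hostname -> [first_ts, last_ts]
with open(LOG) as f:
    for line in f:
        m = pattern.search(line)
        if not m:
            continue
        host, ts = m.group(1), int(m.group(2))
        if host not in seen:
            seen[host] = [ts, ts]
        else:
            seen[host][1] = ts

for host, (first, last) in sorted(seen.items()):
    moving = 'advancing' if last > first else 'NOT advancing'
    print('%s: host-ts %d -> %d (%s)' % (host, first, last, moving))

If a host's host-ts never moves, I would expect the other agents to eventually flag it as stale data; if it does advance, then 'vm not running on this host' with score 0 is the part that confuses me.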