Actually, of the two services, the broker is the only one behaving correctly. The broker stays up once I bring the system up, but the agent restarts constantly. Have a look.
The 11th is when I restarted this node after running 'reinstall' from the web UI:
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2016-06-11 13:09:51 EDT; 3 days ago
Main PID: 1285 (ovirt-ha-broker)
CGroup: /system.slice/ovirt-ha-broker.service
└─1285 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
Jun 15 09:23:56 njsevcnp01 ovirt-ha-broker[1285]: INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Jun 15 09:23:58 njsevcnp01 ovirt-ha-broker[1285]: INFO:mem_free.MemFree:memFree: 26408
Uptime of the broker process:
# ps -Aef | grep -i broker
vdsm 1285 1 2 Jun11 ? 02:27:50 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker --no-daemon
But the agent, on the other hand, is restarting all the time:
# ps -Aef | grep -i ovirt-ha-agent
vdsm 76116 1 0 09:19 ? 00:00:01 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent --no-daemon
9:19 AM ET was the last restart, and the log confirms it:
[root@njsevcnp01 ovirt-hosted-engine-ha]# grep -i 'restarting agent' agent.log | wc -l
232719
And it restarts roughly every 36 seconds:
[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i 'restarting agent'
MainThread::WARNING::2016-06-15 09:23:53,029::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '6'
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'
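To put a number on the restart cadence, here is a minimal Python sketch that takes the six "Restarting agent" timestamps quoted above and computes the gaps between them. It also scales the 232719 count from the grep by the mean gap, which is only a rough estimate: it assumes the interval has been constant and that every counted line in agent.log is one restart attempt.

```python
from datetime import datetime

# Timestamps copied from the "Restarting agent" warnings quoted above.
stamps = [
    "2016-06-15 09:23:53,029",
    "2016-06-15 09:24:28,953",
    "2016-06-15 09:25:04,879",
    "2016-06-15 09:25:40,790",
    "2016-06-15 09:26:17,136",
    "2016-06-15 09:26:53,063",
]

times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]

# Seconds between each consecutive pair of restart attempts.
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
mean_gap = sum(gaps) / len(gaps)

# Count reported by: grep -i 'restarting agent' agent.log | wc -l
restart_count = 232719

# Assumption: the interval was constant over the whole log file.
days_of_restarts = restart_count * mean_gap / 86400

print("gaps (s):", [round(g, 3) for g in gaps])
print("mean gap (s):", round(mean_gap, 2))
print("implied days of restarting:", round(days_of_restarts, 1))
```

If the constant-interval assumption holds, the count in agent.log covers far more wall-clock time than the days since the node was restarted on the 11th, so it may be worth checking when the first "Restarting agent" line in the file was written.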
The full log around each restart looks like this. It says "Connection timed out" but not *what* is timing out, so I have nothing else to go on:
[root@njsevcnp01 ovirt-hosted-engine-ha]# tail -n 300 agent.log | grep -i restart
MainThread::ERROR::2016-06-15 09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
MainThread::ERROR::2016-06-15 09:25:35,785::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:40,790::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '9'
MainThread::ERROR::2016-06-15 09:26:12,131::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:17,136::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '0'
MainThread::ERROR::2016-06-15 09:26:48,058::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:26:53,063::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'
MainThread::ERROR::2016-06-15 09:27:23,969::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:27:28,973::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '2'
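The ERROR and WARNING lines alternate, so the time between them hints at how long the failing call blocks. The sketch below parses four of the lines quoted above (copied verbatim) and prints the gap between each consecutive pair. The `parse` helper is my own and relies only on the `::`-separated layout visible in this log; it is not part of ovirt-hosted-engine-ha.

```python
from datetime import datetime

# Four consecutive lines copied verbatim from the agent.log excerpt above.
LOG = """\
MainThread::ERROR::2016-06-15 09:24:23,948::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:24:28,953::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '7'
MainThread::ERROR::2016-06-15 09:24:59,874::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'njsevcnp01'}: Connection timed out' - trying to restart agent
MainThread::WARNING::2016-06-15 09:25:04,879::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '8'
"""


def parse(line):
    """Return (level, timestamp) from one agent.log line (hypothetical helper)."""
    # Layout seen in the log: thread::LEVEL::timestamp::rest-of-line
    _thread, level, stamp, _rest = line.split("::", 3)
    return level, datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


events = [parse(line) for line in LOG.splitlines() if line.strip()]

# Gap in seconds between each consecutive pair of log events.
gaps = [
    (lvl_a, lvl_b, round((t_b - t_a).total_seconds(), 3))
    for (lvl_a, t_a), (lvl_b, t_b) in zip(events, events[1:])
]

for lvl_a, lvl_b, seconds in gaps:
    print(f"{lvl_a:>7} -> {lvl_b:<7} {seconds:.3f} s")
```

Reading only the timestamps, each cycle looks like about 31 seconds of waiting before "Connection timed out" is raised, then about 5 seconds before the next attempt is logged. That is an inference from the log timing, not something confirmed from the agent source.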
Storage is also completely fine: no logs show anything "going away" or having issues. The engine has a dedicated NFS NAS device, while VM storage is a completely separate storage cluster. Storage has a 100% dedicated backend network, with no changes being made.