I added a hook to rhevm and then restarted the engine service, which triggered a hosted-engine VM shutdown (likely because of the failed liveliness check).

Once the hosted-engine VM shut down, it did not restart on the other host.

On both hosts configured for hosted-engine, the ha-agent logs show each host concluding that it does not have the best score, even though both hosts report the same score (2400). Is there supposed to be a tie-breaking mechanism here? I do notice that the log mentions best REMOTE host, so perhaps I'm interpreting this message incorrectly.

ha-agent logs:

Host 001:

MainThread::INFO::2014-07-21 11:51:57,396::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957917.4 type=state_transition detail=EngineDown-EngineDown hostname='rhev001.miovision.corp'
MainThread::INFO::2014-07-21 11:51:57,397::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host rhev002.miovision.corp (id: 2, score: 2400)
MainThread::INFO::2014-07-21 11:52:07,961::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score
MainThread::INFO::2014-07-21 11:52:07,975::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957927.98 type=state_transition detail=EngineDown-EngineDown hostname='rhev001.miovision.corp'

Host 002:

MainThread::INFO::2014-07-21 11:51:47,405::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957907.41 type=state_transition detail=EngineDown-EngineDown hostname='rhev002.miovision.corp'
MainThread::INFO::2014-07-21 11:51:47,406::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-21 11:51:47,834::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-21 11:51:47,835::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host rhev001.miovision.corp (id: 1, score: 2400)
MainThread::INFO::2014-07-21 11:51:57,870::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score
MainThread::INFO::2014-07-21 11:51:57,883::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957917.88 type=state_transition detail=EngineDown-EngineDown hostname='rhev002.miovision.corp'
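
To make the tie easier to see, here is a minimal Python sketch that pulls the local score and the best remote score out of log lines like the ones above. It assumes the two start_monitoring messages always have exactly the format shown in these excerpts; the function name summarize and the regular expressions are my own and are not part of ovirt-hosted-engine-ha. It only reports what the agent logged and does not reproduce the agent's decision logic.

```python
import re

# Assumed message formats, copied from the start_monitoring lines in the excerpts.
LOCAL_RE = re.compile(r"Current state (\w+) \(score: (\d+)\)")
REMOTE_RE = re.compile(r"Best remote host (\S+) \(id: (\d+), score: (\d+)\)")


def summarize(lines):
    """Return the most recent local and best-remote scores found in `lines`.

    Hypothetical helper for reading ha-agent logs; later lines overwrite
    earlier ones, so the result reflects the last monitoring cycle seen.
    """
    summary = {}
    for line in lines:
        match = LOCAL_RE.search(line)
        if match:
            summary["state"] = match.group(1)
            summary["local_score"] = int(match.group(2))
            continue
        match = REMOTE_RE.search(line)
        if match:
            summary["remote_host"] = match.group(1)
            summary["remote_id"] = int(match.group(2))
            summary["remote_score"] = int(match.group(3))
    # Only report a tie when both scores were actually found.
    if "local_score" in summary and "remote_score" in summary:
        summary["tie"] = summary["local_score"] == summary["remote_score"]
    return summary


if __name__ == "__main__":
    # Two lines taken from the Host 001 excerpt.
    sample = [
        "MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::323::"
        "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
        "(start_monitoring) Current state EngineDown (score: 2400)",
        "MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::328::"
        "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
        "(start_monitoring) Best remote host rhev002.miovision.corp "
        "(id: 2, score: 2400)",
    ]
    print(summarize(sample))
```

Running it against the agent log from each host should show whether both agents see equal local and remote scores in the same cycle.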

This went on for 20 minutes about an hour ago, so I decided to run hosted-engine --vm-start on one of the hosts. The manager VM ran for a few minutes with the engine UI accessible before shutting itself down again.

I then put host 002 into local maintenance mode, and host 001 automatically started the hosted-engine VM. The logging still references host 002 as the 'best remote host' even though its calculated score is now 0:

MainThread::INFO::2014-07-21 12:03:24,011::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405958604.01 type=state_transition detail=EngineUp-EngineUp hostname='rhev001.miovision.corp'
MainThread::INFO::2014-07-21 12:03:24,013::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineUp) sent? ignored
MainThread::INFO::2014-07-21 12:03:24,515::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2014-07-21 12:03:24,516::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host rhev002.miovision.corp (id: 2, score: 0)
MainThread::INFO::2014-07-21 12:03:34,567::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405958614.57 type=state_transition detail=EngineUp-EngineUp hostname='rhev001.miovision.corp'

Once the hosted-engine VM had been up for about 5 minutes, I took host 002 out of local maintenance mode, and the VM has not shut down since.

Is this expected behaviour? Is this the normal recovery process when two hosts that are both configured for hosted-engine are started at the same time? I would have expected that once the hosted-engine VM was detected as bad (the liveliness check failing when I restarted the engine service) and was shut down, it would spin back up on the next available host.

Thanks,
Steve