I added a hook to rhevm and then restarted the engine service, which
triggered a hosted-engine VM shutdown (likely because of the failed
liveliness check).
Once the hosted-engine VM shut down, it did not restart on the other host.
On both hosts configured for hosted-engine I'm seeing logs from ha-agent
where each host thinks the other host has a better score, even though both
report the same score (2400). Is there supposed to be a tie-breaking
mechanism here? I do notice that the log mentions the best REMOTE host, so
perhaps I'm interpreting this message incorrectly.
ha-agent logs:
Host 001:
MainThread::INFO::2014-07-21
11:51:57,396::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1405957917.4 type=state_transition
detail=EngineDown-EngineDown hostname='rhev001.miovision.corp'
MainThread::INFO::2014-07-21
11:51:57,397::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent?
ignored
MainThread::INFO::2014-07-21
11:51:57,924::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-21
11:51:57,924::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host rhev002.miovision.corp (id: 2, score: 2400)
MainThread::INFO::2014-07-21
11:52:07,961::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down, local host does not have best score
MainThread::INFO::2014-07-21
11:52:07,975::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1405957927.98 type=state_transition
detail=EngineDown-EngineDown hostname='rhev001.miovision.corp'
Host 002:
MainThread::INFO::2014-07-21
11:51:47,405::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1405957907.41 type=state_transition
detail=EngineDown-EngineDown hostname='rhev002.miovision.corp'
MainThread::INFO::2014-07-21
11:51:47,406::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineDown-EngineDown) sent?
ignored
MainThread::INFO::2014-07-21
11:51:47,834::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-21
11:51:47,835::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host rhev001.miovision.corp (id: 1, score: 2400)
MainThread::INFO::2014-07-21
11:51:57,870::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Engine down, local host does not have best score
MainThread::INFO::2014-07-21
11:51:57,883::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1405957917.88 type=state_transition
detail=EngineDown-EngineDown hostname='rhev002.miovision.corp'
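To make the suspected stalemate concrete, here is a minimal sketch in Python. It is a hypothetical model of the decision, not the actual ovirt-hosted-engine-ha logic: the function names, the strict comparison, and the lowest-host-id tie-break are all assumptions made for illustration. The host ids and scores are taken from the logs above.

```python
# Hypothetical model only; NOT the real ovirt-hosted-engine-ha code.

def starts_engine_strict(local_score, best_remote_score):
    """Assumed rule: start the VM only if the local score is strictly best."""
    return local_score > best_remote_score


def starts_engine_with_tiebreak(local_id, local_score, remote_id, remote_score):
    """Assumed alternative: on equal scores, the lower host id wins."""
    if local_score != remote_score:
        return local_score > remote_score
    return local_id < remote_id


# Host ids and scores as reported in the ha-agent logs.
scores = {1: 2400, 2: 2400}


def other_host(local_id):
    """Return the id of the single remote host in this two-host setup."""
    return next(host_id for host_id in scores if host_id != local_id)


# Each host evaluates the rule from its own point of view.
strict_result = {
    host_id: starts_engine_strict(scores[host_id], scores[other_host(host_id)])
    for host_id in scores
}
tiebreak_result = {
    host_id: starts_engine_with_tiebreak(
        host_id, scores[host_id], other_host(host_id), scores[other_host(host_id)]
    )
    for host_id in scores
}

print("strict comparison:", strict_result)
print("with tie-break:   ", tiebreak_result)
```

Under the assumed strict rule neither host starts the VM when the scores are tied, which would match the repeated "local host does not have best score" messages on both sides. With a deterministic tie-break exactly one host would proceed.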
This went on for 20 minutes about an hour ago, and I decided to run
hosted-engine --vm-start on one of the hosts. The manager VM ran for a few
minutes with the engine UI accessible before shutting itself down again.
I then put host 002 into local maintenance mode, and host 001 automatically
started the hosted-engine VM. The logging still references host 002 as the
'best remote host' even though its calculated score is now 0:
MainThread::INFO::2014-07-21
12:03:24,011::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1405958604.01 type=state_transition
detail=EngineUp-EngineUp hostname='rhev001.miovision.corp'
MainThread::INFO::2014-07-21
12:03:24,013::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineUp-EngineUp) sent?
ignored
MainThread::INFO::2014-07-21
12:03:24,515::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineUp (score: 2400)
MainThread::INFO::2014-07-21
12:03:24,516::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Best remote host rhev002.miovision.corp (id: 2, score: 0)
MainThread::INFO::2014-07-21
12:03:34,567::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1405958614.57 type=state_transition
detail=EngineUp-EngineUp hostname='rhev001.miovision.corp'
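For comparing what each host believes over time, the score lines can be pulled out of the agent log with a short script. This is a rough sketch: the regular expressions are based only on the message formats visible in the excerpts above, and the function name `extract_scores` is made up for illustration.

```python
import re

# Patterns derived from the message text seen in the excerpts above.
LOCAL_RE = re.compile(r"Current state (?P<state>\w+) \(score: (?P<score>\d+)\)")
REMOTE_RE = re.compile(
    r"Best remote host (?P<host>\S+) \(id: (?P<id>\d+), score: (?P<score>\d+)\)"
)


def extract_scores(log_text):
    """Return (local, remote) lists of score observations found in log_text.

    local  -> [(state, score), ...]
    remote -> [(hostname, host_id, score), ...]
    """
    local = [
        (m.group("state"), int(m.group("score")))
        for m in LOCAL_RE.finditer(log_text)
    ]
    remote = [
        (m.group("host"), int(m.group("id")), int(m.group("score")))
        for m in REMOTE_RE.finditer(log_text)
    ]
    return local, remote


# Sample messages copied from the host 001 excerpt above.
sample = (
    "Current state EngineUp (score: 2400)\n"
    "Best remote host rhev002.miovision.corp (id: 2, score: 0)\n"
)
local, remote = extract_scores(sample)
print(local)
print(remote)
```

Running the same function over the full agent log from each host gives a side-by-side view of the local score and the best remote score at every monitoring cycle.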
Once the hosted-engine VM had been up for about 5 minutes, I took host 002
out of local maintenance mode, and the VM has not shut down since.
Is this expected behaviour? Is this the normal recovery process when two
hosts that both run hosted-engine are started at the same time? I would have
expected that once the hosted-engine VM was detected as bad (the liveliness
check failing after I restarted the engine service) and the VM was shut down,
it would spin back up on the next available host.
Thanks,
Steve