RHEV 3.4 trial hosted-engine either host wants to take ownership

I added a hook to rhevm and then restarted the engine service, which triggered a hosted-engine VM shutdown (likely because of the failed liveliness check). Once the hosted-engine VM shut down, it did not restart on the other host.

On both hosts configured for hosted-engine I'm seeing logs from ha-agent where each host thinks the other host has a better score. Is there supposed to be a tie-breaker mechanism here? I do notice that the log mentions best REMOTE host, so perhaps I'm interpreting this message incorrectly.

ha-agent logs:

Host 001:
MainThread::INFO::2014-07-21 11:51:57,396::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957917.4 type=state_transition detail=EngineDown-EngineDown hostname='rhev001.miovision.corp'
MainThread::INFO::2014-07-21 11:51:57,397::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host rhev002.miovision.corp (id: 2, score: 2400)
MainThread::INFO::2014-07-21 11:52:07,961::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score
MainThread::INFO::2014-07-21 11:52:07,975::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957927.98 type=state_transition detail=EngineDown-EngineDown hostname='rhev001.miovision.corp'

Host 002:
MainThread::INFO::2014-07-21 11:51:47,405::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957907.41 type=state_transition detail=EngineDown-EngineDown hostname='rhev002.miovision.corp'
MainThread::INFO::2014-07-21 11:51:47,406::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-21 11:51:47,834::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-21 11:51:47,835::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host rhev001.miovision.corp (id: 1, score: 2400)
MainThread::INFO::2014-07-21 11:51:57,870::states::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down, local host does not have best score
MainThread::INFO::2014-07-21 11:51:57,883::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405957917.88 type=state_transition detail=EngineDown-EngineDown hostname='rhev002.miovision.corp'

This went on for 20 minutes about an hour ago, and I decided to run --vm-start on one of the hosts. The manager VM ran for a few minutes with the engine UI accessible before shutting itself down again.

I then put host 002 into local maintenance mode, and host 001 auto-started the hosted-engine VM. The logging still references host 002 as the 'best remote host' even though the calculated score is now 0:

MainThread::INFO::2014-07-21 12:03:24,011::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405958604.01 type=state_transition detail=EngineUp-EngineUp hostname='rhev001.miovision.corp'
MainThread::INFO::2014-07-21 12:03:24,013::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineUp) sent? ignored
MainThread::INFO::2014-07-21 12:03:24,515::hosted_engine::323::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineUp (score: 2400)
MainThread::INFO::2014-07-21 12:03:24,516::hosted_engine::328::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host rhev002.miovision.corp (id: 2, score: 0)
MainThread::INFO::2014-07-21 12:03:34,567::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1405958614.57 type=state_transition detail=EngineUp-EngineUp hostname='rhev001.miovision.corp'

Once the hosted-engine VM had been up for about 5 minutes I took host 002 out of local maintenance mode, and the VM has not shut down since.

Is this expected behaviour? Is this the normal recovery process when two hosts that both host hosted-engine are started at the same time? I would have expected that once the hosted-engine VM was detected as bad (the liveliness check failing when I restarted the engine service) and the VM was shut down, it would spin back up on the next available host.

Thanks,
Steve
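For anyone comparing the two hosts' views side by side, here is a minimal Python sketch that pulls the local score and the best remote score out of ha-agent log text. It assumes only the two message formats quoted above ("Current state ... (score: N)" and "Best remote host ... (id: N, score: N)"). The function name extract_scores and the SAMPLE string are made up for this illustration and are not part of ovirt-hosted-engine-ha.

```python
import re

# Matches the two score-bearing ha-agent messages quoted in this thread.
SCORE_RE = re.compile(
    r"Current state (?P<state>\w+) \(score: (?P<local>\d+)\)"
    r"|Best remote host (?P<host>\S+) \(id: (?P<id>\d+), score: (?P<remote>\d+)\)"
)


def extract_scores(log_text):
    """Return (state, local_score, remote_host, remote_score).

    The values come from the last matching lines in log_text; any value
    whose line was not found is returned as None.
    """
    state = local = host = remote = None
    for match in SCORE_RE.finditer(log_text):
        if match.group("state") is not None:
            state = match.group("state")
            local = int(match.group("local"))
        else:
            host = match.group("host")
            remote = int(match.group("remote"))
    return state, local, host, remote


# Two lines copied from the Host 001 log above.
SAMPLE = (
    "MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::323::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
    "(start_monitoring) Current state EngineDown (score: 2400)\n"
    "MainThread::INFO::2014-07-21 11:51:57,924::hosted_engine::328::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::"
    "(start_monitoring) Best remote host rhev002.miovision.corp "
    "(id: 2, score: 2400)\n"
)

if __name__ == "__main__":
    print(extract_scores(SAMPLE))
```

Running it against the agent log from each host makes the situation easy to spot: both hosts report the same local and remote score while the engine stays down.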

Hi Steve,

we had a bug (or two..) in the score comparison logic:
https://bugzilla.redhat.com/show_bug.cgi?id=1093366

It was fixed by:
http://gerrit.ovirt.org/29580
http://gerrit.ovirt.org/29787
http://gerrit.ovirt.org/30025

Unfortunately those fixes did not make it into the current 3.4 releases, but they will be available in the upcoming 3.5 and in any subsequent 3.4 release.

One of the patches (29580) just fixes two words in the code, and you can apply it manually if you want.

Regards

--
Martin Sivák
msivak@redhat.com
Red Hat Czech
RHEV-M SLA / Brno, CZ
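To illustrate the tie-breaker question in general terms, here is a small Python sketch of how a symmetric "start only if my score is strictly better" rule can leave two equally scored hosts waiting on each other, and how a deterministic tie-break (here: lowest host id wins) avoids that. This is a hypothetical model only. It is not the actual ovirt-hosted-engine-ha code, and it does not claim to reproduce what the patches referenced above change. Both function names are invented for this example.

```python
def should_start_strict(local_score, remote_score):
    """Hypothetical rule: start the engine VM only when the local score
    is strictly higher than the best remote score. With equal scores,
    neither host ever starts the VM."""
    return local_score > remote_score


def should_start_with_tiebreak(local_id, local_score, remote_id, remote_score):
    """Hypothetical rule with a tie-breaker: the higher score wins, and
    on equal scores the host with the lower id wins."""
    # Negating the id makes the lower id compare as "greater" on a tie.
    return (local_score, -local_id) > (remote_score, -remote_id)


if __name__ == "__main__":
    # Scores and host ids taken from the logs in this thread.
    hosts = {1: 2400, 2: 2400}
    for local_id, local_score in hosts.items():
        remote_id = 2 if local_id == 1 else 1
        remote_score = hosts[remote_id]
        print(
            local_id,
            should_start_strict(local_score, remote_score),
            should_start_with_tiebreak(
                local_id, local_score, remote_id, remote_score
            ),
        )
```

With both hosts at 2400, the strict rule returns False on both sides, which matches the "local host does not have best score" message appearing on each host. The tie-break variant lets exactly one host proceed.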
Participants (2): Martin Sivak, Steve Dainard