
Hi All,

I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't actually perform an upgrade; I simply wanted to test what would happen when the engine was rebooted.

When the engine didn't restart, I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I tried rebooting both hypervisors. After an hour there was still no sign of the engine starting.

The agent logs don't help me much. The following lines are repeated over and over.

ovirt1 (192.168.19.20):

MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net'
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)

ovirt2 (192.168.19.21):

MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net'
MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.20 (id: 1, score: 2400)

From the above information I decided to simply shut down one hypervisor and see what would happen. The engine did start back up again a few minutes later.

The interesting part is that each hypervisor seems to think the other is a better host, even though both report the same score (2400). The two machines are identical, so there's no reason I can see for this odd behaviour.

In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.

May I suggest that this issue be looked into and some means found to eliminate this kind of standoff? For example, after a few minutes in this state one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.

Regards,
John
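
P.S. To make the suggestion above a little more concrete, here is a minimal sketch in Python. It is not the ovirt-hosted-engine-ha code and does not claim to describe how the agent really scores hosts. The function names, the 180-second grace period and the size of the bonus are all assumptions I made up for illustration. The idea is only that once a tie has lasted longer than the grace period, each host adds a small random bonus to its score, and the host with the strictly highest result starts the engine.

```python
import random

GRACE_SECONDS = 180  # assumed: how long a tie may last before a bonus is applied
MAX_BONUS = 10       # assumed: kept small compared with the base score of 2400


def effective_score(base_score, tie_seconds, rng):
    """Return the base score, plus a small random bonus once the tie is old enough."""
    if tie_seconds < GRACE_SECONDS:
        return base_score
    return base_score + rng.randint(1, MAX_BONUS)


def pick_starter(scores, tie_seconds, rng, max_rounds=100):
    """Return the host with the strictly highest effective score, or None.

    scores maps a host name to its base score. Each round stands for one
    monitoring cycle; if the effective scores still tie, another round is tried.
    """
    for _ in range(max_rounds):
        jittered = {
            host: effective_score(score, tie_seconds, rng)
            for host, score in scores.items()
        }
        best = max(jittered.values())
        winners = [host for host, score in jittered.items() if score == best]
        if len(winners) == 1:
            return winners[0]
    return None  # still tied after max_rounds


if __name__ == "__main__":
    hosts = {"ovirt1": 2400, "ovirt2": 2400}
    rng = random.Random()
    print("tie for 60 s: ", pick_starter(hosts, 60, rng))
    print("tie for 300 s:", pick_starter(hosts, 300, rng))
```

With equal scores and a tie younger than the grace period, pick_starter never finds a unique winner and returns None, which is the situation my two hosts appear to be stuck in. Once the grace period has passed, the random bonus breaks the tie within a few rounds. A real implementation would also need both hosts to agree on the published scores, which this sketch ignores.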