
Hi Daniel,

As per my original post, each host believed the *other* was a better candidate, with the result that neither would start the engine. As you may have read by now, the bug has been confirmed and a fix has been proposed.

Your claim that HA is working is incorrect. A system that requires manual intervention when something goes wrong is not HA.

regards,
John

On 18/08/14 19:18, Daniel Helgenberger wrote:
Hello John,
On Wed, 2014-07-23 at 19:47 -0400, Jason Brooks wrote:
----- Original Message -----
From: "John Gardeniers" <jgardeniers@objectmastery.com>
To: "users" <users@ovirt.org>
Sent: Wednesday, July 23, 2014 4:29:45 PM
Subject: [ovirt-users] Self-hosted engine won't start
Hi All,
I have created a lab with 2 hypervisors and a self-hosted engine. Today I followed the upgrade instructions as described in http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I didn't really do an upgrade but simply wanted to test what would happen when the engine was rebooted.
When the engine didn't restart I re-ran hosted-engine --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't restarted, so I then tried rebooting both hypervisors.

After an hour there was still no sign of the engine starting. The agent logs don't help me much. The following bits are repeated over and over.
ovirt1 (192.168.19.20):
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net'
MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)
ovirt2 (192.168.19.21):
MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net'
MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.20 (id: 1, score: 2400)
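For anyone comparing the two agent logs, a small script can pull out the local state, the local score and the best remote host that each agent reports. This is only a sketch: the regular expressions are written against the log lines quoted above, and the function name summarize and the SAMPLE string are made up for illustration. Other agent versions may word these lines differently.

```python
import re

# Patterns written against the log lines quoted in this thread (an assumption
# about the format; other agent versions may log differently).
LOCAL_RE = re.compile(r"Current state (\w+) \(score: (\d+)\)")
REMOTE_RE = re.compile(r"Best remote host (\S+) \(id: (\d+), score: (\d+)\)")


def summarize(log_text):
    """Return the last reported local state/score and best remote host."""
    summary = {}
    for line in log_text.splitlines():
        local = LOCAL_RE.search(line)
        if local:
            summary["state"] = local.group(1)
            summary["local_score"] = int(local.group(2))
        remote = REMOTE_RE.search(line)
        if remote:
            summary["remote_host"] = remote.group(1)
            summary["remote_id"] = int(remote.group(2))
            summary["remote_score"] = int(remote.group(3))
    return summary


# Two lines copied from the ovirt1 log above.
SAMPLE = """\
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)
"""

if __name__ == "__main__":
    info = summarize(SAMPLE)
    tied = info["local_score"] == info["remote_score"]
    print(info, "tied" if tied else "not tied")
```

Running the same function over the ovirt2 log gives the mirror image: the same local score, with 192.168.19.20 as the best remote host.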
From the above information I decided to simply shut down one hypervisor and see what happens. The engine did start back up again a few minutes later.

I've seen this behavior, too.
Jason
The interesting part is that each hypervisor seems to think the other is a better host.

Where do you get this from? From the line 'Best remote host 192.168.19.20 (id: 1, score: 2400)'?
I assume this is not the case; the HA broker is just looking for the best remote candidate.
But I also have trouble with this behavior, especially when I had the cluster in global maintenance. I resolve this by starting the hosted engine manually while in global maintenance, waiting for {"health": "good", "vm": "up", "detail": "up"}, and disabling global maintenance afterwards.
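The "waiting for" step can be expressed as a check on that status string. The sketch below only parses the JSON shown above. How the string is obtained is not covered here, and the function name engine_is_up is hypothetical.

```python
import json


def engine_is_up(status_json):
    """Return True when the status reports a healthy, running engine VM.

    status_json is assumed to be a JSON object like the one quoted above,
    e.g. {"health": "good", "vm": "up", "detail": "up"}.
    """
    status = json.loads(status_json)
    return status.get("health") == "good" and status.get("vm") == "up"


if __name__ == "__main__":
    print(engine_is_up('{"health": "good", "vm": "up", "detail": "up"}'))
```

Any other value for "health" or "vm", or a missing key, is treated as "not up yet", so a caller would keep waiting before leaving global maintenance.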
I found the HA feature is indeed working; it is best tried out by manually stopping the engine service (service hosted-engine stop). IIRC this should trigger a failover and reboot of the engine.
The two machines are identical, so there's no reason I can see for this odd behaviour. In a lab environment this is little more than an annoying inconvenience. In a production environment it would be completely unacceptable.
May I suggest that this issue be looked into and some means found to eliminate this kind of mutual exclusion? For example, after a few minutes of such an issue one hypervisor could be randomly given a slightly higher weighting, which should result in it being chosen to start the engine.
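A rough sketch of that suggestion follows. It is not how ovirt-hosted-engine-ha works, nor the fix that was proposed for the bug. It simply simulates hosts that advertise a score and add a small random bonus once a tie has lasted longer than a grace period. GRACE_SECONDS, MAX_BONUS and all function names are hypothetical; only the base score of 2400 comes from the logs above.

```python
import random

BASE_SCORE = 2400     # score both hosts report in the logs above
GRACE_SECONDS = 300   # hypothetical: how long a tie may last before jitter applies
MAX_BONUS = 10        # hypothetical: upper bound of the random weighting


def published_score(base_score, tie_seconds, rng):
    """Score a host would advertise; a random bonus is added after the grace period."""
    if tie_seconds < GRACE_SECONDS:
        return base_score
    return base_score + rng.randint(1, MAX_BONUS)


def pick_starter(scores):
    """Return the host id with the strictly highest score, or None on a tie."""
    best = max(scores.values())
    winners = [host for host, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


def resolve_tie(host_ids, base_score, tie_seconds, rng, max_rounds=100):
    """Redraw scores until exactly one host wins, or give up after max_rounds."""
    for _ in range(max_rounds):
        scores = {h: published_score(base_score, tie_seconds, rng) for h in host_ids}
        starter = pick_starter(scores)
        if starter is not None:
            return starter
    return None


if __name__ == "__main__":
    rng = random.Random()
    print("within grace period:", resolve_tie([1, 2], BASE_SCORE, 0, rng))
    print("after grace period:", resolve_tie([1, 2], BASE_SCORE, 600, rng))
```

Within the grace period both hosts keep advertising 2400 and nobody is chosen, which matches the situation in the logs. Once the bonus applies, two hosts can still draw the same value, so the sketch redraws until one host is strictly ahead.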
regards,
John
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
Cheers, Daniel