Hi Daniel,
As per my original post, each host believed the *other* was a better
candidate, with the result that neither would start the engine. As you
may have read by now, the bug has been confirmed and a fix has been
proposed.
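
To make the symptom concrete, here is a minimal Python sketch of how two hosts with identical scores can each defer to the other. It is not the actual ovirt-hosted-engine-ha logic and not the proposed fix: the "start only if strictly better" rule, the Host class, and the host-id tie-break are all assumptions for illustration. Only the score of 2400 and the host ids come from the logs quoted below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    host_id: int
    score: int


def should_start_strict(local: Host, best_remote: Host) -> bool:
    # Hypothetical rule: start the engine only when strictly better
    # than the best remote host.
    return local.score > best_remote.score


def should_start_tiebreak(local: Host, best_remote: Host) -> bool:
    # Same comparison, but an equal score is resolved in favour of
    # the lower host id, so exactly one host wins a tie.
    return (local.score, -local.host_id) > (best_remote.score, -best_remote.host_id)


ovirt1 = Host(host_id=1, score=2400)
ovirt2 = Host(host_id=2, score=2400)

# Each entry is one host's decision, given the other as "best remote host".
strict = [should_start_strict(ovirt1, ovirt2), should_start_strict(ovirt2, ovirt1)]
tiebreak = [should_start_tiebreak(ovirt1, ovirt2), should_start_tiebreak(ovirt2, ovirt1)]

print(strict)    # [False, False]: neither host starts the engine
print(tiebreak)  # [True, False]: host 1 starts it
```

With a strict comparison and equal scores, both decisions are negative, which matches what I saw. Any deterministic tie-break that both hosts evaluate identically would avoid that.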
Your claim that HA is working is incorrect. A system that requires
manual intervention when something goes wrong is not HA.
regards,
John
On 18/08/14 19:18, Daniel Helgenberger wrote:
Hello John,
On Wed, 2014-07-23 at 19:47 -0400, Jason Brooks wrote:
> ----- Original Message -----
>> From: "John Gardeniers" <jgardeniers(a)objectmastery.com>
>> To: "users" <users(a)ovirt.org>
>> Sent: Wednesday, July 23, 2014 4:29:45 PM
>> Subject: [ovirt-users] Self-hosted engine won't start
>>
>> Hi All,
>>
>> I have created a lab with 2 hypervisors and a self-hosted engine. Today
>> I followed the upgrade instructions as described in
>> http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I
>> didn't really do an upgrade but simply wanted to test what would happen
>> when the engine was rebooted.
>>
>> When the engine didn't restart I re-ran hosted-engine
>> --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and
>> ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't
>> restarted, so I then tried rebooting both hypervisors. After an hour
>> there was still no sign of the engine starting. The agent logs don't
>> help me much. The following bits are repeated over and over.
>>
>> ovirt1 (192.168.19.20):
>>
>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net'
>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)
>>
>> ovirt2 (192.168.19.21):
>>
>> MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net'
>> MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.20 (id: 1, score: 2400)
>>
>> From the above information I decided to simply shut down one hypervisor
>> and see what happens. The engine did start back up again a few minutes
>> later.
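
For anyone comparing the agent logs from both hosts, the following Python sketch pulls the local score and the best remote score out of lines in the format quoted above. The regular expressions are inferred only from these excerpts and may not match other versions of the agent; the summarize helper is a hypothetical name, not part of any oVirt tool.

```python
import re

# Patterns inferred from the log excerpts in this thread (an assumption).
STATE_RE = re.compile(r"Current state (\w+) \(score: (\d+)\)")
REMOTE_RE = re.compile(r"Best remote host (\S+) \(id: (\d+), score: (\d+)\)")


def summarize(log_text: str) -> dict:
    """Return the last reported local state/score and best remote host/score."""
    summary = {}
    for match in STATE_RE.finditer(log_text):
        summary["state"] = match.group(1)
        summary["local_score"] = int(match.group(2))
    for match in REMOTE_RE.finditer(log_text):
        summary["remote_host"] = match.group(1)
        summary["remote_score"] = int(match.group(3))
    return summary


# Two lines copied from the ovirt1 excerpt above.
sample = (
    "MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) "
    "Current state EngineDown (score: 2400)\n"
    "MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::"
    "ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) "
    "Best remote host 192.168.19.21 (id: 2, score: 2400)\n"
)

info = summarize(sample)
tied = info["local_score"] == info["remote_score"]
print(info)
print("scores tied" if tied else "scores differ")
```

Running this against the log of each host makes it easy to spot the case where both report the same score for themselves and for the other host.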
> I've seen this behavior, too.
>
> Jason
>
>> The interesting part is that each hypervisor seems to think the other is
>> a better host.
Where do you get this from? From the line:
'Best remote host 192.168.19.20 (id: 1, score: 2400)' ?
I assume this is not the case; the HA broker is just looking for the best
remote candidate.
But I also have trouble with this behavior, especially when I had the
cluster in global maintenance.
I resolve this by starting the hosted engine manually while in global
maintenance, waiting for {"health": "good", "vm": "up", "detail": "up"},
and disabling global maintenance afterwards.
I found the HA feature is indeed working; it is best tested by manually
stopping the engine service (service hosted-engine stop). IIRC this
should trigger a failover and reboot of the engine.
>> The two machines are identical, so there's no reason I
>> can see for this odd behaviour. In a lab environment this is little more
>> than an annoying inconvenience. In a production environment it would be
>> completely unacceptable.
>>
>> May I suggest that this issue be looked into and some means found to
>> eliminate this kind of mutual exclusion? e.g. After a few minutes of
>> such an issue one hypervisor could be randomly given a slightly higher
>> weighting, which should result in it being chosen to start the engine.
>>
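
The random-weighting idea above can be sketched roughly as follows in Python. This is an illustration of the suggestion only, not how the agent works: the timeout, the bonus range, and the function names are all made up, and a real implementation would also need each host to publish its adjusted score so that both sides compare the same numbers.

```python
import random
from typing import Optional

TIE_TIMEOUT_S = 180  # hypothetical stand-in for "a few minutes"
MAX_BONUS = 10       # hypothetical; small compared with a score of 2400


def effective_score(base_score: int, tie_duration_s: float, rng: random.Random) -> int:
    """Add a small random bonus once a tie has lasted longer than the timeout."""
    if tie_duration_s < TIE_TIMEOUT_S:
        return base_score
    return base_score + rng.randint(1, MAX_BONUS)


def pick_winner(base_a: int, base_b: int, tie_duration_s: float,
                rng: random.Random) -> Optional[str]:
    """Return "a" or "b", or None if the adjusted scores are still tied."""
    score_a = effective_score(base_a, tie_duration_s, rng)
    score_b = effective_score(base_b, tie_duration_s, rng)
    if score_a == score_b:
        return None  # still tied; try again on the next monitoring cycle
    return "a" if score_a > score_b else "b"


rng = random.Random()

# Before the timeout nothing changes, so equal scores stay tied.
early = pick_winner(2400, 2400, tie_duration_s=60, rng=rng)

# After the timeout, retry each cycle until the random bonuses differ.
winner = None
while winner is None:
    winner = pick_winner(2400, 2400, tie_duration_s=300, rng=rng)

print(early, winner)
```

Two hosts can still draw the same bonus in a given cycle, so the sketch simply retries; with a range of ten values that should rarely take more than a couple of cycles.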
>> regards,
>> John
>> _______________________________________________
>> Users mailing list
>> Users(a)ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
Cheers,
Daniel