[ovirt-users] Self-hosted engine won't start

John Gardeniers jgardeniers at objectmastery.com
Mon Aug 18 21:42:18 UTC 2014


Hi Daniel,

As per my original post, each host believed the *other* was a better
candidate, with the result that neither would start the engine. As you
may have read by now, the bug has been confirmed and a fix has been
proposed.
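
To make the symptom concrete, here is a toy model in Python. It is only a
sketch under my own assumptions, not the actual ovirt-hosted-engine-ha logic
and not the proposed fix. The strict "my score must beat the best remote
score" rule and the names should_start, starters and break_tie are all
hypothetical. The tie-break is the random weighting suggested in my original
post, and in a real cluster both hosts would have to agree on the adjusted
scores (for example via the shared storage) rather than compute them
independently.

```python
import random


def should_start(local_score, best_remote_score):
    # Assumed rule: start the engine only when the local score is strictly
    # higher than the best remote score.
    return local_score > best_remote_score


def starters(scores):
    """Return the hosts that would start the engine under the assumed rule."""
    result = []
    for host, score in scores.items():
        best_remote = max(s for h, s in scores.items() if h != host)
        if should_start(score, best_remote):
            result.append(host)
    return result


def break_tie(scores, rng):
    """Give one randomly chosen top-scoring host a slightly higher score."""
    top = max(scores.values())
    tied = sorted(h for h, s in scores.items() if s == top)
    adjusted = dict(scores)
    if len(tied) > 1:
        adjusted[rng.choice(tied)] += 1
    return adjusted


# Scores taken from the agent logs: both hosts report 2400.
scores = {"ovirt1": 2400, "ovirt2": 2400}
print(starters(scores))  # [] : neither host starts the engine

adjusted = break_tie(scores, random.Random())
print(starters(adjusted))  # exactly one host now qualifies
```

With equal scores the list of starters is empty, which matches what the logs
show. After the tie-break exactly one host qualifies.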

Your claim that HA is working is incorrect. A system that requires
manual intervention when something goes wrong is not HA.

regards,
John


On 18/08/14 19:18, Daniel Helgenberger wrote:
> Hello John,
>
>
> On Wed, 2014-07-23 at 19:47 -0400, Jason Brooks wrote:
>> ----- Original Message -----
>>> From: "John Gardeniers" <jgardeniers at objectmastery.com>
>>> To: "users" <users at ovirt.org>
>>> Sent: Wednesday, July 23, 2014 4:29:45 PM
>>> Subject: [ovirt-users] Self-hosted engine won't start
>>>
>>> Hi All,
>>>
>>> I have created a lab with 2 hypervisors and a self-hosted engine. Today
>>> I followed the upgrade instructions as described in
>>> http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I
>>> didn't really do an upgrade but simply wanted to test what would happen
>>> when the engine was rebooted.
>>>
>>> When the engine didn't restart I re-ran hosted-engine
>>> --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and
>>> ovirt-ha-broker services on both nodes. 15 minutes later it still hadn't
>>> restarted, so I then tried rebooting both hypervisors. After an hour
>>> there was still no sign of the engine starting. The agent logs don't
>>> help me much. The following bits are repeated over and over.
>>>
>>> ovirt1 (192.168.19.20):
>>>
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Trying: notify time=1406157520.27 type=state_transition
>>> detail=EngineDown-EngineDown hostname='ovirt1.om.net'
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition (EngineDown-EngineDown)
>>> sent? ignored
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Current state EngineDown (score: 2400)
>>> MainThread::INFO::2014-07-24
>>> 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Best remote host 192.168.19.21 (id: 2, score: 2400)
>>>
>>> ovirt2 (192.168.19.21):
>>>
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Trying: notify time=1406157484.01 type=state_transition
>>> detail=EngineDown-EngineDown hostname='ovirt2.om.net'
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>> Success, was notification of state_transition (EngineDown-EngineDown)
>>> sent? ignored
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Current state EngineDown (score: 2400)
>>> MainThread::INFO::2014-07-24
>>> 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>> Best remote host 192.168.19.20 (id: 1, score: 2400)
>>>
>>> From the above information I decided to simply shut down one hypervisor
>>> and see what happens. The engine did start back up again a few minutes
>>> later.
>> I've seen this behavior, too.
>>
>> Jason
>>
>>> The interesting part is that each hypervisor seems to think the other is
>>> a better host. 
> Where do you get this from? From the line: 
> 'Best remote host 192.168.19.20 (id: 1, score: 2400)' ?
>
> I assume this is not the case; the HA broker is just looking for the
> best remote candidate.
>
> But I have also had trouble with this behavior, especially when the
> cluster was in global maintenance.
> I resolve this by starting the hosted engine manually while in global
> maintenance, waiting for {"health": "good", "vm": "up", "detail":
> "up"}, and disabling global maintenance afterwards.
>
> I found the HA feature is indeed working; it is best tried out by
> manually stopping the engine service (service hosted-engine stop). IIRC
> this should trigger a failover and reboot of the engine.
>
>
>>> The two machines are identical, so there's no reason I
>>> can see for this odd behaviour. In a lab environment this is little more
>>> than an annoying inconvenience. In a production environment it would be
>>> completely unacceptable.
>>>
>>> May I suggest that this issue be looked into and some means found to
>>> eliminate this kind of mutual exclusion? e.g. After a few minutes of
>>> such an issue one hypervisor could be randomly given a slightly higher
>>> weighting, which should result in it being chosen to start the engine.
>>>
>>> regards,
>>> John
>>> _______________________________________________
>>> Users mailing list
>>> Users at ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/users
>>>
>
> Cheers, 
> Daniel
>
>
