[ovirt-users] Self-hosted engine won't start
Jiri Moskovcak
jmoskovc at redhat.com
Mon Aug 18 06:29:20 UTC 2014
Hi John,
this is the patch that fixes your problem [1]. It can be found at the top
of that bz page. It's a really simple change, so if you want, you can just
apply it manually on your system without waiting for a patched version.
--Jirka
[1]
http://gerrit.ovirt.org/#/c/31510/2/ovirt_hosted_engine_ha/agent/states.py
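
To give a rough idea of the kind of logic involved (this is only an
illustrative sketch with made-up names, not the actual diff; the real
change is in states.py at the link above): when several hosts report the
same score, the tie has to be broken so that exactly one of them starts
the VM instead of each host deferring to the "best remote host" it sees.

def should_start_engine(local_id, local_score, remote_hosts):
    """Illustrative only: decide whether this host should start the engine VM.

    remote_hosts is a list of (host_id, score) pairs reported by the other
    HA hosts (a hypothetical data shape, not the agent's real structures).
    """
    if not remote_hosts:
        return True
    best_remote = max(score for _hid, score in remote_hosts)
    if local_score != best_remote:
        return local_score > best_remote
    # Scores are tied: break the tie deterministically (lowest host id wins)
    # so that exactly one host acts instead of everyone waiting for the
    # "best remote host".
    tied_ids = [hid for hid, score in remote_hosts if score == best_remote]
    return local_id < min(tied_ids)
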
On 08/18/2014 12:17 AM, John Gardeniers wrote:
> Hi Jirka,
>
> Thanks for the update. It sounds like the same bug but with a few extra
> issues thrown in. For example, comment 9 seems to me to be a completely
> separate bug, although it may affect the issue I reported.
>
> I can't see any mention of how the problem is being resolved, which is
> something I'm interested in, but I'll keep an eye on it.
>
> I'll try the patched version when I get the time and enthusiasm to give
> it another crack.
>
> regards,
> John
>
>
> On 14/08/14 22:57, Jiri Moskovcak wrote:
>> Hi John,
>> after a deeper look I realized that you're probably facing [1]. The
>> patch is ready and I will also backport it to the 3.4 branch.
>>
>> --Jirka
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1093638
>>
>> On 07/29/2014 11:41 PM, John Gardeniers wrote:
>>> Hi Jiri,
>>>
>>> Sorry, I can't supply the log because the hosts have been recycled, but
>>> I'm sure it would have contained exactly the same information you
>>> already have from host2. It's a classic deadlock situation that should
>>> never be allowed to happen. A simple and time-proven solution was in my
>>> original post.
>>>
>>> The reason for recycling the hosts is that I discovered yesterday that
>>> although the engine was still running, it could not be accessed in any
>>> way. When I then found that there was no way to get it restarted, I
>>> decided to abandon the whole idea of self-hosting until I see an
>>> indication that it's production-ready.
>>>
>>> regards,
>>> John
>>>
>>>
>>> On 29/07/14 22:52, Jiri Moskovcak wrote:
>>>> Hi John,
>>>> thanks for the logs. It seems like the engine is running on host2,
>>>> which decides that it doesn't have the best score and shuts the engine
>>>> down, and then neither host wants to start the VM until you restart
>>>> host2. Unfortunately the logs don't contain the part from host1 from
>>>> 2014-07-24 09:XX, which I'd like to investigate because it might
>>>> explain why host1 refused to start the VM when host2 killed it.
>>>>
>>>> Regards,
>>>> Jirka
>>>>
>>>> On 07/28/2014 02:57 AM, John Gardeniers wrote:
>>>>> Hi Jirka,
>>>>>
>>>>> Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
>>>>>
>>>>> Attached are the logs. Thanks for looking.
>>>>>
>>>>> Regards,
>>>>> John
>>>>>
>>>>>
>>>>> On 25/07/14 17:47, Jiri Moskovcak wrote:
>>>>>> On 07/24/2014 11:37 PM, John Gardeniers wrote:
>>>>>>> Hi Jiri,
>>>>>>>
>>>>>>> Perhaps you can tell me how to determine the exact version of
>>>>>>> ovirt-hosted-engine-ha.
>>>>>>
>>>>>> CentOS/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
>>>>>>
>>>>>>> As for the logs, I am not going to attach 60MB
>>>>>>> of logs to an email,
>>>>>>
>>>>>> - there are other ways to share the logs
>>>>>>
>>>>>>> nor can I see any imaginable reason for you wanting to see them
>>>>>>> all, as the bulk is historical. I have already included the
>>>>>>> *relevant* sections. However, if you think there may be some other
>>>>>>> section that may help, feel free to be more explicit about what you
>>>>>>> are looking for. Right now I fail to understand what you might hope
>>>>>>> to see in logs from several weeks ago that you can't get from the
>>>>>>> last day or so.
>>>>>>>
>>>>>>
>>>>>> It's standard practice: people tend to think they know which part of
>>>>>> a log is relevant, but in many cases they're wrong. Asking for the
>>>>>> whole logs has proven to be faster than trying to find the relevant
>>>>>> part through the user. And you're right, I don't need the logs from
>>>>>> last week, just the logs since the last start of the services when
>>>>>> you observed the problem.
>>>>>>
>>>>>> Regards,
>>>>>> Jirka
>>>>>>
>>>>>>> regards,
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>> On 24/07/14 19:10, Jiri Moskovcak wrote:
>>>>>>>> Hi, please provide the exact version of ovirt-hosted-engine-ha
>>>>>>>> and all logs from /var/log/ovirt-hosted-engine-ha/
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Jirka
>>>>>>>>
>>>>>>>> On 07/24/2014 01:29 AM, John Gardeniers wrote:
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I have created a lab with 2 hypervisors and a self-hosted engine.
>>>>>>>>> Today
>>>>>>>>> I followed the upgrade instructions as described in
>>>>>>>>> http://www.ovirt.org/Hosted_Engine_Howto and rebooted the
>>>>>>>>> engine. I
>>>>>>>>> didn't really do an upgrade but simply wanted to test what would
>>>>>>>>> happen
>>>>>>>>> when the engine was rebooted.
>>>>>>>>>
>>>>>>>>> When the engine didn't restart I re-ran hosted-engine
>>>>>>>>> --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and
>>>>>>>>> ovirt-ha-broker services on both nodes. Fifteen minutes later it
>>>>>>>>> still hadn't restarted, so I then tried rebooting both
>>>>>>>>> hypervisors. After an hour there was still no sign of the engine
>>>>>>>>> starting. The agent logs don't help me much; the following bits
>>>>>>>>> are repeated over and over.
>>>>>>>>>
>>>>>>>>> ovirt1 (192.168.19.20):
>>>>>>>>>
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157520.27 type=state_transition detail=EngineDown-EngineDown hostname='ovirt1.om.net'
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.21 (id: 2, score: 2400)
>>>>>>>>>
>>>>>>>>> ovirt2 (192.168.19.21):
>>>>>>>>>
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1406157484.01 type=state_transition detail=EngineDown-EngineDown hostname='ovirt2.om.net'
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineDown-EngineDown) sent? ignored
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Current state EngineDown (score: 2400)
>>>>>>>>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Best remote host 192.168.19.20 (id: 1, score: 2400)
>>>>>>>>>
>>>>>>>>> From the above information I decided to simply shut down one
>>>>>>>>> hypervisor
>>>>>>>>> and see what happens. The engine did start back up again a few
>>>>>>>>> minutes
>>>>>>>>> later.
>>>>>>>>>
>>>>>>>>> The interesting part is that each hypervisor seems to think the
>>>>>>>>> other is
>>>>>>>>> a better host. The two machines are identical, so there's no
>>>>>>>>> reason I
>>>>>>>>> can see for this odd behaviour. In a lab environment this is
>>>>>>>>> little
>>>>>>>>> more
>>>>>>>>> than an annoying inconvenience. In a production environment it
>>>>>>>>> would be
>>>>>>>>> completely unacceptable.
>>>>>>>>>
>>>>>>>>> May I suggest that this issue be looked into and some means found
>>>>>>>>> to eliminate this kind of mutual stand-off? For example, after a
>>>>>>>>> few minutes of such a deadlock one hypervisor could be randomly
>>>>>>>>> given a slightly higher weighting, which should result in it being
>>>>>>>>> chosen to start the engine.
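>>>>>>>>>
>>>>>>>>> To make that concrete, something along these lines could work (a
>>>>>>>>> purely hypothetical sketch, not oVirt code): once the
>>>>>>>>> EngineDown-EngineDown stand-off has lasted longer than some
>>>>>>>>> threshold, add a small random bonus to the local score so that one
>>>>>>>>> host wins the tie.
>>>>>>>>>
>>>>>>>>> import random
>>>>>>>>>
>>>>>>>>> def effective_score(base_score, deadlock_seconds, threshold=300):
>>>>>>>>>     # Hypothetical helper: after `threshold` seconds of deadlock,
>>>>>>>>>     # nudge this host's score by a small random amount so the
>>>>>>>>>     # hosts stop deferring to each other.
>>>>>>>>>     if deadlock_seconds > threshold:
>>>>>>>>>         return base_score + random.randint(1, 50)
>>>>>>>>>     return base_score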
>>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> John
>>>>>>>>> _______________________________________________
>>>>>>>>> Users mailing list
>>>>>>>>> Users at ovirt.org
>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>