Hi John,
After a deeper look I realized that you're probably facing [1]. The
patch is ready and I will also backport it to the 3.4 branch.
--Jirka
[1]
Hi Jiri,
Sorry, I can't supply the log because the hosts have been recycled, but
I'm sure it would have contained exactly the same information that you
already have from host2. It's a classic deadlock situation that should
never be allowed to happen. A simple and time-proven solution was in my
original post.
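
To spell out the solution I mean: once both hosts have been deferring
to each other for more than a few minutes, one of them adds a small
random bonus to its own score so the tie breaks itself. A rough sketch
of the idea in Python (purely illustrative; the names, threshold and
bonus are made up and this is not the actual agent code):

    import random
    import time

    STANDOFF_TIMEOUT = 300  # seconds of mutual deferral before tie-breaking (illustrative)
    TIE_BREAK_BONUS = 50    # small, so it never outweighs real health penalties (illustrative)

    def effective_score(base_score, standoff_started_at):
        """Score this host advertises; adds a random bonus once a
        symmetric standoff has lasted longer than STANDOFF_TIMEOUT."""
        if standoff_started_at is None:
            return base_score
        if time.time() - standoff_started_at < STANDOFF_TIMEOUT:
            return base_score
        # Each host rolls independently, so with high probability the
        # advertised scores differ and exactly one host concludes it is
        # the best candidate to start the engine.
        return base_score + random.randint(1, TIE_BREAK_BONUS)

    # e.g. a host with base score 2400 that has been stuck for 10 minutes:
    print(effective_score(2400, time.time() - 600))

Whichever host ends up advertising the higher score starts the VM; the
other keeps waiting, so the engine comes back without manual
intervention.
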
The reason for recycling the hosts is that I discovered yesterday that
although the engine was still running it could not be accessed in any
way. When I then found that there was no way to get it restarted, I
decided to abandon the whole idea of self-hosting until I see some
indication that it's production-ready.
regards,
John
On 29/07/14 22:52, Jiri Moskovcak wrote:
> Hi John,
> Thanks for the logs. Seems like the engine is running on host2 and it
> decides that it doesn't have the best score and shuts the engine down,
> and then neither of them wants to start the VM until you restart
> host2. Unfortunately the logs don't contain the part from host1 from
> 2014-07-24 09:XX which I'd like to investigate, because it might
> explain why host1 refused to start the VM when host2 killed it.
>
> Regards,
> Jirka
>
> On 07/28/2014 02:57 AM, John Gardeniers wrote:
>> Hi Jiri,
>>
>> Version: ovirt-hosted-engine-ha-1.1.5-1.el6.noarch
>>
>> Attached are the logs. Thanks for looking.
>>
>> Regards,
>> John
>>
>>
>> On 25/07/14 17:47, Jiri Moskovcak wrote:
>>> On 07/24/2014 11:37 PM, John Gardeniers wrote:
>>>> Hi Jiri,
>>>>
>>>> Perhaps you can tell me how to determine the exact version of
>>>> ovirt-hosted-engine-ha.
>>>
>>> CentOS/RHEL/Fedora: rpm -q ovirt-hosted-engine-ha
>>>
>>>> As for the logs, I am not going to attach 60MB
>>>> of logs to an email,
>>>
>>> - there are other ways to share the logs
>>>
>>>> nor can I see any imaginable reason for you wanting
>>>> to see them all, as the bulk is historical. I have already included
>>>> the
>>>> *relevant* sections. However, if you think there may be some other
>>>> section that may help you, feel free to be more explicit about what you
>>>> are looking for. Right now I fail to understand what you might hope to
>>>> see in logs from several weeks ago that you can't get from the last
>>>> day
>>>> or so.
>>>>
>>>
>>> It's standard practice: people tend to think they know which part of
>>> a log is relevant, but in many cases they're wrong. Asking for the
>>> whole logs has proven to be faster than trying to find the relevant
>>> part through the user. And you're right, I don't need the logs from
>>> last week, just the logs since the last start of the services when you
>>> observed the problem.
>>>
>>> Regards,
>>> Jirka
>>>
>>>> regards,
>>>> John
>>>>
>>>>
>>>> On 24/07/14 19:10, Jiri Moskovcak wrote:
>>>>> Hi, please provide the exact versions of ovirt-hosted-engine-ha
>>>>> and all logs from /var/log/ovirt-hosted-engine-ha/
>>>>>
>>>>> Thank you,
>>>>> Jirka
>>>>>
>>>>> On 07/24/2014 01:29 AM, John Gardeniers wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I have created a lab with 2 hypervisors and a self-hosted engine.
>>>>>> Today I followed the upgrade instructions as described in
>>>>>> http://www.ovirt.org/Hosted_Engine_Howto and rebooted the engine. I
>>>>>> didn't really do an upgrade but simply wanted to test what would
>>>>>> happen when the engine was rebooted.
>>>>>>
>>>>>> When the engine didn't restart I re-ran hosted-engine
>>>>>> --set-maintenance=none and restarted the vdsm, ovirt-ha-agent and
>>>>>> ovirt-ha-broker services on both nodes. 15 minutes later it still
>>>>>> hadn't restarted, so I then tried rebooting both hypervisors. After
>>>>>> an hour there was still no sign of the engine starting. The agent
>>>>>> logs don't help me much. The following bits are repeated over and
>>>>>> over.
>>>>>>
>>>>>> ovirt1 (192.168.19.20):
>>>>>>
>>>>>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>>>>> Trying: notify time=1406157520.27 type=state_transition
>>>>>> detail=EngineDown-EngineDown hostname='ovirt1.om.net'
>>>>>> MainThread::INFO::2014-07-24 09:18:40,272::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>>>>> Success, was notification of state_transition (EngineDown-EngineDown)
>>>>>> sent? ignored
>>>>>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>>>>> Current state EngineDown (score: 2400)
>>>>>> MainThread::INFO::2014-07-24 09:18:40,594::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>>>>> Best remote host 192.168.19.21 (id: 2, score: 2400)
>>>>>>
>>>>>> ovirt2 (192.168.19.21):
>>>>>>
>>>>>> MainThread::INFO::2014-07-24 09:18:04,005::brokerlink::108::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>>>>> Trying: notify time=1406157484.01 type=state_transition
>>>>>> detail=EngineDown-EngineDown hostname='ovirt2.om.net'
>>>>>> MainThread::INFO::2014-07-24 09:18:04,006::brokerlink::117::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>>>>>> Success, was notification of state_transition (EngineDown-EngineDown)
>>>>>> sent? ignored
>>>>>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::327::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>>>>> Current state EngineDown (score: 2400)
>>>>>> MainThread::INFO::2014-07-24 09:18:04,324::hosted_engine::332::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
>>>>>> Best remote host 192.168.19.20 (id: 1, score: 2400)
>>>>>>
>>>>>> From the above information I decided to simply shut down one
>>>>>> hypervisor and see what happens. The engine did start back up again
>>>>>> a few minutes later.
>>>>>>
>>>>>> The interesting part is that each hypervisor seems to think the
>>>>>> other is a better host. The two machines are identical, so there's no
>>>>>> reason I can see for this odd behaviour. In a lab environment this is
>>>>>> little more than an annoying inconvenience. In a production
>>>>>> environment it would be completely unacceptable.
>>>>>>
>>>>>> May I suggest that this issue be looked into and some means found to
>>>>>> eliminate this kind of mutual standoff? e.g. after a few minutes of
>>>>>> such an issue one hypervisor could be randomly given a slightly
>>>>>> higher weighting, which should result in it being chosen to start
>>>>>> the engine.
>>>>>>
>>>>>> regards,
>>>>>> John
>>>>>> _______________________________________________
>>>>>> Users mailing list
>>>>>> Users(a)ovirt.org
>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>