[ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

Yedidyah Bar David didi at redhat.com
Sun May 1 12:32:19 UTC 2016


It's very hard to understand your flow when time moves backwards.

Please try again from a clean state. Make sure the clocks on all hosts
are in sync. Then write down the exact time of each action you take:
starting/stopping a host, checking status, etc.
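
One way to keep such a record is a small shell helper that stamps each action with the local time and the host name. This is only a sketch: the function name "note" and the ACTION_LOG path are made up for illustration and are not part of oVirt. It uses local time because the agent.log timestamps quoted below appear to be local time as well.

```shell
#!/bin/bash
# Hypothetical helper (not part of oVirt): append a timestamped note to a file.
# ACTION_LOG is an assumed path; override it in the environment if needed.
ACTION_LOG="${ACTION_LOG:-/tmp/he-test-actions.log}"

note() {
    # Same 'YYYY-MM-DD HH:MM:SS' layout as the agent.log timestamps,
    # followed by the node name and the free-text message.
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(uname -n)" "$*" >> "$ACTION_LOG"
}

# Example use while reproducing the problem:
#   note "powering off host02"
#   hosted-engine --vm-status
#   note "checked vm-status"
```

The notes are only comparable across hosts if the clocks really are in sync.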

Some things to check from your logs:

in agent.host01.log:

MainThread::INFO::2016-04-25 15:32:41,370::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
...
MainThread::INFO::2016-04-25 15:32:44,276::hosted_engine::1147::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
...
MainThread::INFO::2016-04-25 15:32:58,478::states::672::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Mon Apr 25 15:32:58 2016

Why did the engine VM shut down 14 seconds after it was started?
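
To look into the shutdown logged at 15:32:58 above, it can help to cut the agent log down to a narrow time window. A minimal sketch, assuming each record starts on its own line in the form Thread::LEVEL::YYYY-MM-DD HH:MM:SS,mmm::... (the excerpts in this mail were wrapped by the mail client); "log_window" is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the records of an agent log that fall inside
# a time window, together with any continuation lines that follow them.
log_window() {
    # usage: log_window FROM TO FILE   (timestamps as 'YYYY-MM-DD HH:MM:SS')
    awk -F'::' -v from="$1" -v to="$2" '
        # A new record: third "::"-separated field starts with a date.
        $3 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            ts = substr($3, 1, 19)              # drop the ,mmm part
            keep = (ts >= from && ts <= to)     # plain string comparison
        }
        # Lines without a timestamp inherit the decision of the last record.
        keep
    ' "$3"
}

# Example:
#   log_window '2016-04-25 15:32:30' '2016-04-25 15:33:30' agent.host01.log
```

The string comparison works because the timestamp layout is fixed-width and ordered from year down to second.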

Also, in agent.host03.log:

MainThread::INFO::2016-04-25 15:29:53,218::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2016-04-25 15:29:53,223::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1461572993.22 type=state_transition detail=EngineDown-EngineStart hostname='host03.ovirt.forest.go.th'
MainThread::ERROR::2016-04-25 15:30:23,253::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out

Why did the connection to the broker time out?
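
As a side check on clocks: the time= value in the notify line above is seconds since the epoch, so it can be compared with the timestamp at the start of the same log line. A small sketch, assuming GNU date (the -d @SECONDS form); "epoch_to_utc" is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: convert an epoch value such as 1461572993.22 to UTC.
# Assumes GNU date, which accepts "-d @SECONDS".
epoch_to_utc() {
    # ${1%.*} strips the fractional part before handing the value to date.
    date -u -d "@${1%.*}" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1461572993.22   # prints 2016-04-25 08:29:53
```

That is exactly 7 hours behind the 15:29:53 at the start of the log line, which fits a host that logs in local time at UTC+7. Doing the same on each host is a quick way to spot a clock or timezone mismatch.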

Also, in addition to the actions you stated, you changed maintenance mode a lot.

You can try something like this to get some interesting lines from agent.log:

egrep -i 'start eng|shut|vm started|vm running|vm is running on|maintenance detected|migra' agent.log
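
Since the events are spread over several hosts, one option is to merge the per-host agent logs into a single timeline before filtering, so that time only moves forwards. A rough sketch, assuming files named like agent.host01.log as above and one record per line; "merge_agent_logs" is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: merge several agent logs into one timeline.
# Output columns (tab-separated): timestamp, source file, original line.
# Lines without a timestamp (e.g. traceback continuations) are dropped.
merge_agent_logs() {
    # usage: merge_agent_logs agent.host01.log agent.host03.log ...
    tab="$(printf '\t')"
    awk -F'::' '
        $3 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            print $3 "\t" FILENAME "\t" $0
        }
    ' "$@" | LC_ALL=C sort -s -t "$tab" -k1,1
}

# Example:
#   merge_agent_logs agent.host01.log agent.host03.log | grep -Ei 'start eng|shut|migra'
```

The merged order is only meaningful if the hosts' clocks agree, which is why syncing them first matters.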

Best,

On Mon, Apr 25, 2016 at 12:27 PM, Wee Sritippho <wee.s at forest.go.th> wrote:
> The hosted engine storage is located in an external Fibre Channel SAN.
>
>
> On 25/4/2559 16:19, Martin Sivak wrote:
>>
>> Hi,
>>
>> it seems that all nodes lost access to storage for some reason after
>> the host was killed. Where is your hosted engine storage located?
>>
>> Regards
>>
>> --
>> Martin Sivak
>> SLA / oVirt
>>
>>
>> On Mon, Apr 25, 2016 at 10:58 AM, Wee Sritippho <wee.s at forest.go.th>
>> wrote:
>>>
>>> Hi,
>>>
>>> According to the hosted-engine FAQ, the engine VM should be up and running
>>> about 5 minutes after its host is forcibly powered off. However, after
>>> updating oVirt from 3.6.4 to 3.6.5, the engine VM doesn't restart
>>> automatically even after 10+ minutes (I already made sure that global
>>> maintenance mode is set to none). I initially thought it was a time sync
>>> issue, so I installed and enabled ntp on the hosts and the engine. However,
>>> the issue still persists.
>>>
>>> ###Versions:
>>> [root@host01 ~]# rpm -qa | grep ovirt
>>> libgovirt-0.3.3-1.el7_2.1.x86_64
>>> ovirt-vmconsole-1.0.0-1.el7.centos.noarch
>>> ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
>>> ovirt-hosted-engine-ha-1.3.5.3-1.el7.centos.noarch
>>> ovirt-host-deploy-1.4.1-1.el7.centos.noarch
>>> ovirt-engine-sdk-python-3.6.5.0-1.el7.centos.noarch
>>> ovirt-hosted-engine-setup-1.3.5.0-1.el7.centos.noarch
>>> ovirt-release36-007-1.noarch
>>> ovirt-setup-lib-1.0.1-1.el7.centos.noarch
>>> [root@host01 ~]# rpm -qa | grep vdsm
>>> vdsm-infra-4.17.26-0.el7.centos.noarch
>>> vdsm-jsonrpc-4.17.26-0.el7.centos.noarch
>>> vdsm-gluster-4.17.26-0.el7.centos.noarch
>>> vdsm-python-4.17.26-0.el7.centos.noarch
>>> vdsm-yajsonrpc-4.17.26-0.el7.centos.noarch
>>> vdsm-4.17.26-0.el7.centos.noarch
>>> vdsm-cli-4.17.26-0.el7.centos.noarch
>>> vdsm-xmlrpc-4.17.26-0.el7.centos.noarch
>>> vdsm-hook-vmfex-dev-4.17.26-0.el7.centos.noarch
>>>
>>> ###Log files:
>>> https://app.box.com/s/fkurmwagogwkv5smkwwq7i4ztmwf9q9r
>>>
>>> ###After host02 was killed:
>>> [root@host03 wees]# hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> Status up-to-date                  : True
>>> Hostname                           : host01.ovirt.forest.go.th
>>> Host ID                            : 1
>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score                              : 3400
>>> stopped                            : False
>>> Local maintenance                  : False
>>> crc32                              : 396766e0
>>> Host timestamp                     : 4391
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> Status up-to-date                  : True
>>> Hostname                           : host02.ovirt.forest.go.th
>>> Host ID                            : 2
>>> Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
>>> Score                              : 0
>>> stopped                            : True
>>> Local maintenance                  : False
>>> crc32                              : 3a345b65
>>> Host timestamp                     : 1458
>>>
>>>
>>> --== Host 3 status ==--
>>>
>>> Status up-to-date                  : True
>>> Hostname                           : host03.ovirt.forest.go.th
>>> Host ID                            : 3
>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score                              : 3400
>>> stopped                            : False
>>> Local maintenance                  : False
>>> crc32                              : 4c34b0ed
>>> Host timestamp                     : 11958
>>>
>>> ###After host02 was killed for a while:
>>> [root@host03 wees]# hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> Status up-to-date                  : False
>>> Hostname                           : host01.ovirt.forest.go.th
>>> Host ID                            : 1
>>> Engine status                      : unknown stale-data
>>> Score                              : 3400
>>> stopped                            : False
>>> Local maintenance                  : False
>>> crc32                              : 72e4e418
>>> Host timestamp                     : 4415
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> Status up-to-date                  : False
>>> Hostname                           : host02.ovirt.forest.go.th
>>> Host ID                            : 2
>>> Engine status                      : unknown stale-data
>>> Score                              : 0
>>> stopped                            : True
>>> Local maintenance                  : False
>>> crc32                              : 3a345b65
>>> Host timestamp                     : 1458
>>>
>>>
>>> --== Host 3 status ==--
>>>
>>> Status up-to-date                  : False
>>> Hostname                           : host03.ovirt.forest.go.th
>>> Host ID                            : 3
>>> Engine status                      : unknown stale-data
>>> Score                              : 3400
>>> stopped                            : False
>>> Local maintenance                  : False
>>> crc32                              : 4c34b0ed
>>> Host timestamp                     : 11958
>>>
>>> ###After host02 was up again completely:
>>> [root@host03 wees]# hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> Status up-to-date                  : True
>>> Hostname                           : host01.ovirt.forest.go.th
>>> Host ID                            : 1
>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score                              : 0
>>> stopped                            : False
>>> Local maintenance                  : False
>>> crc32                              : f5728fca
>>> Host timestamp                     : 5555
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> Status up-to-date                  : True
>>> Hostname                           : host02.ovirt.forest.go.th
>>> Host ID                            : 2
>>> Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
>>> Score                              : 3400
>>> stopped                            : False
>>> Local maintenance                  : False
>>> crc32                              : e5284763
>>> Host timestamp                     : 715
>>>
>>>
>>> --== Host 3 status ==--
>>>
>>> Status up-to-date                  : True
>>> Hostname                           : host03.ovirt.forest.go.th
>>> Host ID                            : 3
>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score                              : 3400
>>> stopped                            : False
>>> Local maintenance                  : False
>>> crc32                              : bc10c7fc
>>> Host timestamp                     : 13119
>>>
>>> --
>>> Wee
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users at ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/users
>
>
> --
> Wee Sritippho
> Computer Technical Officer, Practitioner Level
> Information Center, Royal Forest Department
> Tel. 025614292-3 ext. 5621
> Mobile 0864678919



-- 
Didi


