[ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)
Martin Sivak
msivak at redhat.com
Wed May 4 11:48:25 UTC 2016
Hi,
you have an ISO domain inside the hosted engine VM, don't you?
MainThread::INFO::2016-05-04
12:28:47,090::ovf_store::109::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
Extracting Engine VM OVF from the OVF_STORE
MainThread::INFO::2016-05-04
12:38:47,504::ovf_store::116::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF)
OVF_STORE volume path:
/rhev/data-center/mnt/blockSD/d2dad0e9-4f7d-41d6-b61c-487d44ae6d5d/images/157b67ef-1a29-4e51-9396-79d3425b7871/a394b440-91bb-4c7c-b344-146240d66a43
There is a 10-minute gap between those two log lines, even though we
normally log something every 10 seconds.
Please check https://bugzilla.redhat.com/show_bug.cgi?id=1332813 to
see if it might be the same issue.
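To spot such silences quickly, a sketch like the following can scan the agent log for gaps between consecutive entries. It assumes the usual "MainThread::LEVEL::YYYY-MM-DD HH:MM:SS,mmm::..." line format and needs gawk (for mktime); find_gaps is just an illustrative helper name, not an existing tool:

```shell
# find_gaps FILE SECONDS - print log entries preceded by a gap longer
# than SECONDS between timestamps (requires gawk for mktime()).
find_gaps() {
    awk -F'::' -v limit="$2" '
        $3 ~ /^[0-9][0-9][0-9][0-9]-/ {
            ts = $3
            gsub(/[-:,]/, " ", ts)   # "2016-05-04 12:28:47,090" -> "2016 05 04 12 28 47 090"
            split(ts, t, " ")
            epoch = mktime(t[1] " " t[2] " " t[3] " " t[4] " " t[5] " " t[6])
            if (prev != "" && epoch - prev > limit)
                printf "gap of %d s before: %s\n", epoch - prev, $0
            prev = epoch
        }' "$1"
}
```

For example, `find_gaps /var/log/ovirt-hosted-engine-ha/agent.log 60` should flag the 10-minute silence shown above.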
Regards
--
Martin Sivak
SLA / oVirt
On Wed, May 4, 2016 at 8:34 AM, Wee Sritippho <wee.s at forest.go.th> wrote:
> I've tried again and made sure all hosts have the same clock.
>
> After adding all 3 hosts, I tested the setup by shutting down host01. The engine was
> restarted on host02 in less than 2 minutes. I enabled and tested power
> management on all hosts (using iLO4), then tried disabling host02's network
> to test fencing. I waited for about 5 minutes and saw in the console that
> host02 wasn't fenced. I thought the fencing didn't work, so I enabled the
> network again. host02 was then fenced immediately after the network was
> re-enabled (I don't know why), and the engine was never restarted, even when
> host02 was up and running again. I had to start the engine VM manually by
> running "hosted-engine --vm-start" on host02.
>
> I thought it might have something to do with iLO4, so I disabled power
> management on all hosts and tried powering off host02 again. After about 10
> minutes, the engine still hadn't started, so I started it manually on host01
> instead.
>
> Here are my recent actions:
>
> 2016-05-04 12:25:51 ICT - ran hosted-engine --vm-status on host01; the VM is
> running on host01
> 2016-05-04 12:28:32 ICT - ran reboot on host01; the engine VM is down
> 2016-05-04 12:34:57 ICT - ran hosted-engine --vm-status on host01; engine
> status on every host is "unknown stale-data", host01's score=0,
> stopped=true
> 2016-05-04 12:37:30 ICT - host01 is pingable
> 2016-05-04 12:41:09 ICT - ran hosted-engine --vm-status on host02; engine
> status on every host is "unknown stale-data", all hosts' score=3400,
> stopped=false
> 2016-05-04 12:43:29 ICT - ran hosted-engine --vm-status on host02; the VM is
> running on host01
>
> Log files: https://app.box.com/s/jjgn14onv19e1qi82mkf24jl2baa2l9s
>
>
> On 1/5/2559 19:32, Yedidyah Bar David wrote:
>>
>> It's very hard to understand your flow when time moves backwards.
>>
>> Please try again from a clean state. Make sure all hosts have the same clock.
>> Then document the exact time you do things - starting/stopping a host,
>> checking status, etc.
>>
>> Some things to check from your logs:
>>
>> in agent.host01.log:
>>
>> MainThread::INFO::2016-04-25
>>
>> 15:32:41,370::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
>> Engine down and local host has best score (3400), attempting to start
>> engine VM
>> ...
>> MainThread::INFO::2016-04-25
>>
>> 15:32:44,276::hosted_engine::1147::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm)
>> Engine VM started on localhost
>> ...
>> MainThread::INFO::2016-04-25
>>
>> 15:32:58,478::states::672::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>> Score is 0 due to unexpected vm shutdown at Mon Apr 25 15:32:58 2016
>>
>> Why?
>>
>> Also, in agent.host03.log:
>>
>> MainThread::INFO::2016-04-25
>>
>> 15:29:53,218::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
>> Engine down and local host has best score (3400), attempting to start
>> engine VM
>> MainThread::INFO::2016-04-25
>>
>> 15:29:53,223::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
>> Trying: notify time=1461572993.22 type=state_transition
>> detail=EngineDown-EngineStart hostname='host03.ovirt.forest.go.th'
>> MainThread::ERROR::2016-04-25
>>
>> 15:30:23,253::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate)
>> Connection closed: Connection timed out
>>
>> Why?
>>
>> Also, in addition to the actions you stated, you changed the maintenance
>> mode a lot.
>>
>> You can try something like this to get the interesting lines from
>> agent.log:
>>
>> egrep -i 'start eng|shut|vm started|vm running|vm is running on|maintenance detected|migra' agent.log
>>
>> Best,
>>
>> On Mon, Apr 25, 2016 at 12:27 PM, Wee Sritippho <wee.s at forest.go.th>
>> wrote:
>>>
>>> The hosted engine storage is located in an external Fibre Channel SAN.
>>>
>>>
>>> On 25/4/2559 16:19, Martin Sivak wrote:
>>>>
>>>> Hi,
>>>>
>>>> it seems that all nodes lost access to storage for some reason after
>>>> the host was killed. Where is your hosted engine storage located?
>>>>
>>>> Regards
>>>>
>>>> --
>>>> Martin Sivak
>>>> SLA / oVirt
>>>>
>>>>
>>>> On Mon, Apr 25, 2016 at 10:58 AM, Wee Sritippho <wee.s at forest.go.th>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> According to the hosted-engine FAQ, the engine VM should be up and
>>>>> running about 5 minutes after its host is forcibly powered off. However,
>>>>> after updating oVirt from 3.6.4 to 3.6.5, the engine VM won't restart
>>>>> automatically even after 10+ minutes (I already made sure that global
>>>>> maintenance mode is set to none). I initially thought it was a time-sync
>>>>> issue, so I installed and enabled NTP on the hosts and the engine.
>>>>> However, the issue still persists.
>>>>>
>>>>> ###Versions:
>>>>> [root at host01 ~]# rpm -qa | grep ovirt
>>>>> libgovirt-0.3.3-1.el7_2.1.x86_64
>>>>> ovirt-vmconsole-1.0.0-1.el7.centos.noarch
>>>>> ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
>>>>> ovirt-hosted-engine-ha-1.3.5.3-1.el7.centos.noarch
>>>>> ovirt-host-deploy-1.4.1-1.el7.centos.noarch
>>>>> ovirt-engine-sdk-python-3.6.5.0-1.el7.centos.noarch
>>>>> ovirt-hosted-engine-setup-1.3.5.0-1.el7.centos.noarch
>>>>> ovirt-release36-007-1.noarch
>>>>> ovirt-setup-lib-1.0.1-1.el7.centos.noarch
>>>>> [root at host01 ~]# rpm -qa | grep vdsm
>>>>> vdsm-infra-4.17.26-0.el7.centos.noarch
>>>>> vdsm-jsonrpc-4.17.26-0.el7.centos.noarch
>>>>> vdsm-gluster-4.17.26-0.el7.centos.noarch
>>>>> vdsm-python-4.17.26-0.el7.centos.noarch
>>>>> vdsm-yajsonrpc-4.17.26-0.el7.centos.noarch
>>>>> vdsm-4.17.26-0.el7.centos.noarch
>>>>> vdsm-cli-4.17.26-0.el7.centos.noarch
>>>>> vdsm-xmlrpc-4.17.26-0.el7.centos.noarch
>>>>> vdsm-hook-vmfex-dev-4.17.26-0.el7.centos.noarch
>>>>>
>>>>> ###Log files:
>>>>> https://app.box.com/s/fkurmwagogwkv5smkwwq7i4ztmwf9q9r
>>>>>
>>>>> ###After host02 was killed:
>>>>> [root at host03 wees]# hosted-engine --vm-status
>>>>>
>>>>>
>>>>> --== Host 1 status ==--
>>>>>
>>>>> Status up-to-date : True
>>>>> Hostname : host01.ovirt.forest.go.th
>>>>> Host ID : 1
>>>>> Engine status : {"reason": "vm not running on this
>>>>> host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>> Score : 3400
>>>>> stopped : False
>>>>> Local maintenance : False
>>>>> crc32 : 396766e0
>>>>> Host timestamp : 4391
>>>>>
>>>>>
>>>>> --== Host 2 status ==--
>>>>>
>>>>> Status up-to-date : True
>>>>> Hostname : host02.ovirt.forest.go.th
>>>>> Host ID : 2
>>>>> Engine status : {"health": "good", "vm": "up",
>>>>> "detail": "up"}
>>>>> Score : 0
>>>>> stopped : True
>>>>> Local maintenance : False
>>>>> crc32 : 3a345b65
>>>>> Host timestamp : 1458
>>>>>
>>>>>
>>>>> --== Host 3 status ==--
>>>>>
>>>>> Status up-to-date : True
>>>>> Hostname : host03.ovirt.forest.go.th
>>>>> Host ID : 3
>>>>> Engine status : {"reason": "vm not running on this
>>>>> host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>> Score : 3400
>>>>> stopped : False
>>>>> Local maintenance : False
>>>>> crc32 : 4c34b0ed
>>>>> Host timestamp : 11958
>>>>>
>>>>> ###After host02 was killed for a while:
>>>>> [root at host03 wees]# hosted-engine --vm-status
>>>>>
>>>>>
>>>>> --== Host 1 status ==--
>>>>>
>>>>> Status up-to-date : False
>>>>> Hostname : host01.ovirt.forest.go.th
>>>>> Host ID : 1
>>>>> Engine status : unknown stale-data
>>>>> Score : 3400
>>>>> stopped : False
>>>>> Local maintenance : False
>>>>> crc32 : 72e4e418
>>>>> Host timestamp : 4415
>>>>>
>>>>>
>>>>> --== Host 2 status ==--
>>>>>
>>>>> Status up-to-date : False
>>>>> Hostname : host02.ovirt.forest.go.th
>>>>> Host ID : 2
>>>>> Engine status : unknown stale-data
>>>>> Score : 0
>>>>> stopped : True
>>>>> Local maintenance : False
>>>>> crc32 : 3a345b65
>>>>> Host timestamp : 1458
>>>>>
>>>>>
>>>>> --== Host 3 status ==--
>>>>>
>>>>> Status up-to-date : False
>>>>> Hostname : host03.ovirt.forest.go.th
>>>>> Host ID : 3
>>>>> Engine status : unknown stale-data
>>>>> Score : 3400
>>>>> stopped : False
>>>>> Local maintenance : False
>>>>> crc32 : 4c34b0ed
>>>>> Host timestamp : 11958
>>>>>
>>>>> ###After host02 was up again completely:
>>>>> [root at host03 wees]# hosted-engine --vm-status
>>>>>
>>>>>
>>>>> --== Host 1 status ==--
>>>>>
>>>>> Status up-to-date : True
>>>>> Hostname : host01.ovirt.forest.go.th
>>>>> Host ID : 1
>>>>> Engine status : {"reason": "vm not running on this
>>>>> host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>> Score : 0
>>>>> stopped : False
>>>>> Local maintenance : False
>>>>> crc32 : f5728fca
>>>>> Host timestamp : 5555
>>>>>
>>>>>
>>>>> --== Host 2 status ==--
>>>>>
>>>>> Status up-to-date : True
>>>>> Hostname : host02.ovirt.forest.go.th
>>>>> Host ID : 2
>>>>> Engine status : {"health": "good", "vm": "up",
>>>>> "detail": "up"}
>>>>> Score : 3400
>>>>> stopped : False
>>>>> Local maintenance : False
>>>>> crc32 : e5284763
>>>>> Host timestamp : 715
>>>>>
>>>>>
>>>>> --== Host 3 status ==--
>>>>>
>>>>> Status up-to-date : True
>>>>> Hostname : host03.ovirt.forest.go.th
>>>>> Host ID : 3
>>>>> Engine status : {"reason": "vm not running on this
>>>>> host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>> Score : 3400
>>>>> stopped : False
>>>>> Local maintenance : False
>>>>> crc32 : bc10c7fc
>>>>> Host timestamp : 13119
>>>>>
>>>>> --
>>>>> Wee
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users at ovirt.org
>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>
>>>
>>> --
>>> Wee Sritippho
>>> Computer Technical Officer (Practitioner Level)
>>> Information Center, Royal Forest Department
>>> Tel. 025614292-3 ext. 5621
>>> Mobile 0864678919
>>>
>>>
>>
>>
>>
>
> --
> Wee
>