[ovirt-users] [hosted-engine] engine VM doesn't respawn when its host was killed (poweroff)

Wee Sritippho wee.s at forest.go.th
Wed May 4 15:13:47 UTC 2016



On 4 May 2016 18:48:25 GMT+07:00, Martin Sivak <msivak at redhat.com> wrote:
>Hi,
>
>you have an ISO domain inside the hosted engine VM, don't you?
>
>MainThread::INFO::2016-05-04 12:28:47,090::ovf_store::109::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) Extracting Engine VM OVF from the OVF_STORE
>MainThread::INFO::2016-05-04 12:38:47,504::ovf_store::116::ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) OVF_STORE volume path: /rhev/data-center/mnt/blockSD/d2dad0e9-4f7d-41d6-b61c-487d44ae6d5d/images/157b67ef-1a29-4e51-9396-79d3425b7871/a394b440-91bb-4c7c-b344-146240d66a43
>
>There is a 10-minute gap between these two log lines, even though we log
>something every 10 seconds.
>
>Please check https://bugzilla.redhat.com/show_bug.cgi?id=1332813 to
>see if it might be the same issue.

Yes, exactly the same issue.

Thank you.
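
For anyone who wants to check their own agent.log for the same symptom, here is a minimal Python sketch that flags large gaps between consecutive log timestamps. It assumes each record sits on a single line, as it does in the log file itself, in the MainThread::INFO::<date> <time>,<ms>::... form quoted above. The 60-second threshold and the name find_gaps are my own choices for illustration, not anything from ovirt-hosted-engine-ha.

```python
import re
from datetime import datetime, timedelta

# Timestamp as it appears in agent.log records, e.g.
# "MainThread::INFO::2016-05-04 12:28:47,090::ovf_store::109::..."
TS_RE = re.compile(r"::(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})::")


def find_gaps(lines, threshold=timedelta(seconds=60)):
    """Return (previous, current) timestamp pairs further apart than threshold.

    The default threshold is an arbitrary assumption; pick one that suits
    how often your agent normally logs.
    """
    gaps = []
    prev = None
    for line in lines:
        match = TS_RE.search(line)
        if not match:
            continue  # lines without a timestamp are skipped
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        ts += timedelta(milliseconds=int(match.group(2)))
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# The two records Martin quoted, joined back onto single lines.
sample = [
    "MainThread::INFO::2016-05-04 12:28:47,090::ovf_store::109::"
    "ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) "
    "Extracting Engine VM OVF from the OVF_STORE",
    "MainThread::INFO::2016-05-04 12:38:47,504::ovf_store::116::"
    "ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore::(getEngineVMOVF) "
    "OVF_STORE volume path: /rhev/data-center/mnt/blockSD/"
    "d2dad0e9-4f7d-41d6-b61c-487d44ae6d5d/images/"
    "157b67ef-1a29-4e51-9396-79d3425b7871/a394b440-91bb-4c7c-b344-146240d66a43",
]

for before, after in find_gaps(sample):
    print(f"gap of {after - before} between {before} and {after}")
```

To run it against a real file, pass an open file object instead of the sample list, e.g. find_gaps(open("agent.log")) with whatever path your agent.log has.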

>Regards
>
>--
>Martin Sivak
>SLA / oVirt
>
>
>On Wed, May 4, 2016 at 8:34 AM, Wee Sritippho <wee.s at forest.go.th>
>wrote:
>> I've tried again and made sure all hosts have the same clock.
>>
>> After adding all 3 hosts, I tested it by shutting down host01. The engine
>> was restarted on host02 in less than 2 minutes. I enabled and tested power
>> management on all hosts (using ilo4), then tried disabling host02's network
>> to test the fencing. I waited for about 5 minutes and saw in the console
>> that host02 wasn't fenced. I thought the fencing didn't work and enabled
>> the network again. host02 was then fenced immediately after the network was
>> enabled (I don't know why), and the engine was never restarted, even after
>> host02 was up and running again. I had to start the engine VM manually by
>> running "hosted-engine --vm-start" on host02.
>>
>> I thought it might have something to do with ilo4, so I disabled power
>> management for all hosts and tried to power off host02 again. After about
>> 10 minutes, the engine still hadn't started, so I manually started it on
>> host01 instead.
>>
>> Here are my recent actions:
>>
>> 2016-05-04 12:25:51 ICT - run hosted-engine --vm-status on host01, vm is
>>     running on host01
>> 2016-05-04 12:28:32 ICT - run reboot on host01, engine vm is down
>> 2016-05-04 12:34:57 ICT - run hosted-engine --vm-status on host01, engine
>>     status on every host is "unknown stale-data", host01's score=0,
>>     stopped=true
>> 2016-05-04 12:37:30 ICT - host01 is pingable
>> 2016-05-04 12:41:09 ICT - run hosted-engine --vm-status on host02, engine
>>     status on every host is "unknown stale-data", all hosts' score=3400,
>>     stopped=false
>> 2016-05-04 12:43:29 ICT - run hosted-engine --vm-status on host02, vm is
>>     running on host01
>>
>> Log files: https://app.box.com/s/jjgn14onv19e1qi82mkf24jl2baa2l9s
>>
>>
>> On 1/5/2016 19:32, Yedidyah Bar David wrote:
>>>
>>> It's very hard to understand your flow when time moves backwards.
>>>
>>> Please try again from a clean state. Make sure all hosts have the same
>>> clock. Then document the exact time you do stuff - starting/stopping a
>>> host, checking status, etc.
>>>
>>> Some things to check from your logs:
>>>
>>> in agent.host01.log:
>>>
>>> MainThread::INFO::2016-04-25 15:32:41,370::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
>>> ...
>>> MainThread::INFO::2016-04-25 15:32:44,276::hosted_engine::1147::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
>>> ...
>>> MainThread::INFO::2016-04-25 15:32:58,478::states::672::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Mon Apr 25 15:32:58 2016
>>>
>>> Why?
>>>
>>> Also, in agent.host03.log:
>>>
>>> MainThread::INFO::2016-04-25 15:29:53,218::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
>>> MainThread::INFO::2016-04-25 15:29:53,223::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1461572993.22 type=state_transition detail=EngineDown-EngineStart hostname='host03.ovirt.forest.go.th'
>>> MainThread::ERROR::2016-04-25 15:30:23,253::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
>>>
>>> Why?
>>>
>>> Also, in addition to the actions you stated, you changed maintenance
>>> mode a lot.
>>>
>>> You can try something like this to get some interesting lines from
>>> agent.log:
>>>
>>> egrep -i 'start eng|shut|vm started|vm running|vm is running on|maintenance detected|migra'
>>>
>>> Best,
>>>
>>> On Mon, Apr 25, 2016 at 12:27 PM, Wee Sritippho <wee.s at forest.go.th>
>>> wrote:
>>>>
>>>> The hosted engine storage is located in an external Fibre Channel SAN.
>>>>
>>>>
>>>> On 25/4/2016 16:19, Martin Sivak wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> it seems that all nodes lost access to storage for some reason after
>>>>> the host was killed. Where is your hosted engine storage located?
>>>>>
>>>>> Regards
>>>>>
>>>>> --
>>>>> Martin Sivak
>>>>> SLA / oVirt
>>>>>
>>>>>
>>>>> On Mon, Apr 25, 2016 at 10:58 AM, Wee Sritippho <wee.s at forest.go.th>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> According to the hosted-engine FAQ, the engine VM should be up and
>>>>>> running about 5 minutes after its host is forcibly powered off.
>>>>>> However, after updating oVirt from 3.6.4 to 3.6.5, the engine VM won't
>>>>>> restart automatically even after 10+ minutes (I already made sure that
>>>>>> global maintenance mode is set to none). I initially thought it was a
>>>>>> time sync issue, so I installed and enabled ntp on the hosts and
>>>>>> engine. However, the issue still persists.
>>>>>>
>>>>>> ###Versions:
>>>>>> [root@host01 ~]# rpm -qa | grep ovirt
>>>>>> libgovirt-0.3.3-1.el7_2.1.x86_64
>>>>>> ovirt-vmconsole-1.0.0-1.el7.centos.noarch
>>>>>> ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
>>>>>> ovirt-hosted-engine-ha-1.3.5.3-1.el7.centos.noarch
>>>>>> ovirt-host-deploy-1.4.1-1.el7.centos.noarch
>>>>>> ovirt-engine-sdk-python-3.6.5.0-1.el7.centos.noarch
>>>>>> ovirt-hosted-engine-setup-1.3.5.0-1.el7.centos.noarch
>>>>>> ovirt-release36-007-1.noarch
>>>>>> ovirt-setup-lib-1.0.1-1.el7.centos.noarch
>>>>>> [root@host01 ~]# rpm -qa | grep vdsm
>>>>>> vdsm-infra-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-jsonrpc-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-gluster-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-python-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-yajsonrpc-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-cli-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-xmlrpc-4.17.26-0.el7.centos.noarch
>>>>>> vdsm-hook-vmfex-dev-4.17.26-0.el7.centos.noarch
>>>>>>
>>>>>> ###Log files:
>>>>>> https://app.box.com/s/fkurmwagogwkv5smkwwq7i4ztmwf9q9r
>>>>>>
>>>>>> ###After host02 was killed:
>>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>>
>>>>>>
>>>>>> --== Host 1 status ==--
>>>>>>
>>>>>> Status up-to-date                  : True
>>>>>> Hostname                           : host01.ovirt.forest.go.th
>>>>>> Host ID                            : 1
>>>>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                              : 3400
>>>>>> stopped                            : False
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : 396766e0
>>>>>> Host timestamp                     : 4391
>>>>>>
>>>>>>
>>>>>> --== Host 2 status ==--
>>>>>>
>>>>>> Status up-to-date                  : True
>>>>>> Hostname                           : host02.ovirt.forest.go.th
>>>>>> Host ID                            : 2
>>>>>> Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
>>>>>> Score                              : 0
>>>>>> stopped                            : True
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : 3a345b65
>>>>>> Host timestamp                     : 1458
>>>>>>
>>>>>>
>>>>>> --== Host 3 status ==--
>>>>>>
>>>>>> Status up-to-date                  : True
>>>>>> Hostname                           : host03.ovirt.forest.go.th
>>>>>> Host ID                            : 3
>>>>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                              : 3400
>>>>>> stopped                            : False
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : 4c34b0ed
>>>>>> Host timestamp                     : 11958
>>>>>>
>>>>>> ###After host02 was killed for a while:
>>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>>
>>>>>>
>>>>>> --== Host 1 status ==--
>>>>>>
>>>>>> Status up-to-date                  : False
>>>>>> Hostname                           : host01.ovirt.forest.go.th
>>>>>> Host ID                            : 1
>>>>>> Engine status                      : unknown stale-data
>>>>>> Score                              : 3400
>>>>>> stopped                            : False
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : 72e4e418
>>>>>> Host timestamp                     : 4415
>>>>>>
>>>>>>
>>>>>> --== Host 2 status ==--
>>>>>>
>>>>>> Status up-to-date                  : False
>>>>>> Hostname                           : host02.ovirt.forest.go.th
>>>>>> Host ID                            : 2
>>>>>> Engine status                      : unknown stale-data
>>>>>> Score                              : 0
>>>>>> stopped                            : True
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : 3a345b65
>>>>>> Host timestamp                     : 1458
>>>>>>
>>>>>>
>>>>>> --== Host 3 status ==--
>>>>>>
>>>>>> Status up-to-date                  : False
>>>>>> Hostname                           : host03.ovirt.forest.go.th
>>>>>> Host ID                            : 3
>>>>>> Engine status                      : unknown stale-data
>>>>>> Score                              : 3400
>>>>>> stopped                            : False
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : 4c34b0ed
>>>>>> Host timestamp                     : 11958
>>>>>>
>>>>>> ###After host02 was up again completely:
>>>>>> [root@host03 wees]# hosted-engine --vm-status
>>>>>>
>>>>>>
>>>>>> --== Host 1 status ==--
>>>>>>
>>>>>> Status up-to-date                  : True
>>>>>> Hostname                           : host01.ovirt.forest.go.th
>>>>>> Host ID                            : 1
>>>>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                              : 0
>>>>>> stopped                            : False
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : f5728fca
>>>>>> Host timestamp                     : 5555
>>>>>>
>>>>>>
>>>>>> --== Host 2 status ==--
>>>>>>
>>>>>> Status up-to-date                  : True
>>>>>> Hostname                           : host02.ovirt.forest.go.th
>>>>>> Host ID                            : 2
>>>>>> Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
>>>>>> Score                              : 3400
>>>>>> stopped                            : False
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : e5284763
>>>>>> Host timestamp                     : 715
>>>>>>
>>>>>>
>>>>>> --== Host 3 status ==--
>>>>>>
>>>>>> Status up-to-date                  : True
>>>>>> Hostname                           : host03.ovirt.forest.go.th
>>>>>> Host ID                            : 3
>>>>>> Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>>>>> Score                              : 3400
>>>>>> stopped                            : False
>>>>>> Local maintenance                  : False
>>>>>> crc32                              : bc10c7fc
>>>>>> Host timestamp                     : 13119
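
The three status dumps above are easier to compare once each host is reduced to a few fields. Here is a rough Python sketch, assuming the plain-text layout shown above with each "key : value" field on its own line. The helper names parse_vm_status and all_stale are hypothetical, and the sample is trimmed from the second dump.

```python
import re

# Section header printed before each host's fields.
HEADER_RE = re.compile(r"--== Host (\d+) status ==--")


def parse_vm_status(text):
    """Parse `hosted-engine --vm-status` text into {host_id: {field: value}}."""
    hosts = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current = hosts.setdefault(int(header.group(1)), {})
            continue
        if current is not None and " : " in line:
            # Split on the first " : " only, so JSON values stay intact.
            key, _, value = line.partition(" : ")
            current[key.strip()] = value.strip()
    return hosts


def all_stale(hosts):
    """True when at least one host was parsed and none is up to date."""
    return bool(hosts) and all(
        fields.get("Status up-to-date") == "False" for fields in hosts.values()
    )


# Trimmed from the "after host02 was killed for a while" output.
SAMPLE = """\
--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : host01.ovirt.forest.go.th
Engine status                      : unknown stale-data
Score                              : 3400

--== Host 2 status ==--

Status up-to-date                  : False
Hostname                           : host02.ovirt.forest.go.th
Engine status                      : unknown stale-data
Score                              : 0
"""

hosts = parse_vm_status(SAMPLE)
for host_id, fields in sorted(hosts.items()):
    print(host_id, fields["Hostname"], fields["Engine status"], fields["Score"])
if all_stale(hosts):
    print("no host has up-to-date status")
```

Keeping the values as strings is deliberate: the Engine status field is sometimes JSON and sometimes plain text such as "unknown stale-data", so the sketch does not try to interpret it.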
>>>>>>
>>>>>> --
>>>>>> Wee
>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list
>>>>>> Users at ovirt.org
>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>
>>>>
>>>> --
>>>> Wee Sritippho
>>>> Computer Technical Officer, Practitioner Level
>>>> Information Center, Royal Forest Department
>>>> Tel. 025614292-3 ext. 5621
>>>> Mobile 0864678919
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Wee
>>

-- 
Wee Sritippho
Computer Technical Officer, Practitioner Level
Information Center, Royal Forest Department
Tel. 025614292-3 ext. 5621
Mobile 0864678919


