I've tried again and made sure all hosts have the same clock.
After adding all 3 hosts, I tested it by shutting down host01. The engine
was restarted on host02 in less than 2 minutes. I enabled and tested
power management on all hosts (using ilo4), then disabled host02's
network to test fencing. I waited for about 5 minutes and saw in the
console that host02 wasn't fenced. I thought the fencing didn't work and
enabled the network again. host02 was then fenced immediately after the
network was enabled (I don't know why), and the engine was never
restarted, even after host02 was up and running again. I had to start the
engine VM manually by running "hosted-engine --vm-start" on host02.
I thought it might have something to do with ilo4, so I disabled power
management for all hosts and tried to power off host02 again. After about
10 minutes, the engine still hadn't started, so I manually started it on
host01 instead.
Here are my recent actions:
2016-05-04 12:25:51 ICT - ran hosted-engine --vm-status on host01, VM is
running on host01
2016-05-04 12:28:32 ICT - ran reboot on host01, engine VM is down
2016-05-04 12:34:57 ICT - ran hosted-engine --vm-status on host01,
engine status on every host is "unknown stale-data", host01's score=0,
stopped=true
2016-05-04 12:37:30 ICT - host01 is pingable
2016-05-04 12:41:09 ICT - ran hosted-engine --vm-status on host02,
engine status on every host is "unknown stale-data", all hosts'
score=3400, stopped=false
2016-05-04 12:43:29 ICT - ran hosted-engine --vm-status on host02, VM is
running on host01
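For checks like the ones above, a small script can condense each status dump into one line per host. This is only a rough sketch in Python 3: it assumes the plain-text layout of "hosted-engine --vm-status" shown elsewhere in this thread ("--== Host N status ==--" followed by "Field : value" lines), and the names parse_vm_status and summarize are made up for illustration, not part of any oVirt tool.

```python
import re

# Header and field layout are assumptions taken from the status dumps in this thread.
HOST_HEADER = re.compile(r"--== Host (\d+) status ==--")
FIELD = re.compile(r"^([A-Za-z][A-Za-z0-9 \-]*?)\s*:\s*(.*)$")


def parse_vm_status(text):
    """Parse plain-text vm-status output into {host_id: {field: value}}."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        # Drop mail quote markers ("> ") so pasted, quoted output also parses.
        line = raw.lstrip("> \t").strip()
        header = HOST_HEADER.search(line)
        if header:
            current = hosts.setdefault(int(header.group(1)), {})
            continue
        field = FIELD.match(line)
        if current is not None and field:
            current[field.group(1)] = field.group(2)
    return hosts


def summarize(hosts):
    """Return one summary line per host: score, stopped flag, stale flag."""
    lines = []
    for host_id in sorted(hosts):
        info = hosts[host_id]
        stale = info.get("Status up-to-date") != "True"
        lines.append(
            "host %d: score=%s stopped=%s stale=%s"
            % (host_id, info.get("Score"), info.get("stopped"), stale)
        )
    return lines
```

Feeding it the saved output of each check, together with the time the check was made, should make it easier to see when the scores and the stale flags changed.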
Log files:
It's very hard to understand your flow when time moves
backwards.
Please try again from a clean state. Make sure all hosts have the same clock.
Then document the exact time you do stuff - starting/stopping a host,
checking status, etc.
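One possible way to keep such a record is a tiny wrapper that stamps every command with a UTC time before running it. This is a sketch only, assuming Python 3 is available on the machine you run it from; stamp, run_logged and the file name actions.log are hypothetical names chosen for this example.

```python
import subprocess
from datetime import datetime, timezone


def stamp(note, now=None):
    """Format a note with a UTC timestamp, e.g. for an action journal."""
    now = now or datetime.now(timezone.utc)
    return "%s - %s" % (now.strftime("%Y-%m-%d %H:%M:%S UTC"), note)


def run_logged(cmd, logfile="actions.log"):
    """Run cmd (a list of arguments) and append a timestamped record to logfile."""
    started = stamp("run " + " ".join(cmd))
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    with open(logfile, "a") as fh:
        # Record when the command started, then everything it printed.
        fh.write(started + "\n" + result.stdout + "\n")
    return result.returncode
```

For example, run_logged(["hosted-engine", "--vm-status"]) would append the status output under the time it was requested. Using UTC everywhere avoids mixing time zones between the hosts and your notes.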
Some things to check from your logs:
in agent.host01.log:
MainThread::INFO::2016-04-25 15:32:41,370::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
...
MainThread::INFO::2016-04-25 15:32:44,276::hosted_engine::1147::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_start_engine_vm) Engine VM started on localhost
...
MainThread::INFO::2016-04-25 15:32:58,478::states::672::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Score is 0 due to unexpected vm shutdown at Mon Apr 25 15:32:58 2016
Why?
Also, in agent.host03.log:
MainThread::INFO::2016-04-25 15:29:53,218::states::488::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine down and local host has best score (3400), attempting to start engine VM
MainThread::INFO::2016-04-25 15:29:53,223::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1461572993.22 type=state_transition detail=EngineDown-EngineStart hostname='host03.ovirt.forest.go.th'
MainThread::ERROR::2016-04-25 15:30:23,253::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
Why?
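To read excerpts like these from several hosts as one flow, it may help to merge the agent logs into a single list sorted by time. The following is a rough Python 3 sketch. The field layout (thread::level::timestamp::module::line::logger::(function) message) is inferred from the lines quoted here, not taken from the agent's logging configuration, and merge_logs is a made-up helper name. The merged order only means something if the hosts' clocks agree.

```python
import re
from datetime import datetime

# Layout inferred from the agent.log excerpts in this thread (an assumption).
LINE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]*)\) (?P<msg>.*)$"
)


def merge_logs(logs):
    """Merge {host: iterable of log lines} into one time-sorted event list.

    Each event is a tuple (timestamp, host, level, function, message).
    Lines that do not match the expected layout are skipped.
    """
    events = []
    for host, lines in logs.items():
        for line in lines:
            match = LINE.match(line.rstrip("\n"))
            if not match:
                continue
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append(
                (ts, host, match.group("level"), match.group("func"), match.group("msg"))
            )
    events.sort(key=lambda event: event[0])
    return events
```

Passing it the open agent.log files from host01 and host03 would, for the lines above, put the broker timeout on host03 before the start attempt on host01.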
Also, in addition to the actions you stated, you changed maintenance mode a lot.
You can try something like this to get some interesting lines from agent.log:
egrep -i 'start eng|shut|vm started|vm running|vm is running on|maintenance detected|migra' agent.log
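If you would rather do the same filtering inside a script, for example before merging logs from several hosts, the pattern can be reused as it is. A minimal Python 3 sketch, where interesting is just an illustrative name:

```python
import re

# Same alternatives as the egrep pattern above, matched case-insensitively.
PATTERN = re.compile(
    r"start eng|shut|vm started|vm running|vm is running on|"
    r"maintenance detected|migra",
    re.IGNORECASE,
)


def interesting(lines):
    """Return only the lines that match one of the keywords."""
    return [line for line in lines if PATTERN.search(line)]
```

Calling interesting(open("agent.log")) should then give the same lines as the egrep command.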
Best,
On Mon, Apr 25, 2016 at 12:27 PM, Wee Sritippho <wee.s(a)forest.go.th> wrote:
> The hosted engine storage is located in an external Fibre Channel SAN.
>
>
> On 25/4/2559 16:19, Martin Sivak wrote:
>> Hi,
>>
>> it seems that all nodes lost access to storage for some reason after
>> the host was killed. Where is your hosted engine storage located?
>>
>> Regards
>>
>> --
>> Martin Sivak
>> SLA / oVirt
>>
>>
>> On Mon, Apr 25, 2016 at 10:58 AM, Wee Sritippho <wee.s(a)forest.go.th> wrote:
>>> Hi,
>>>
>>> According to the hosted-engine FAQ, the engine VM should be up and running
>>> about 5 minutes after its host is forcibly powered off. However, after
>>> updating oVirt from 3.6.4 to 3.6.5, the engine VM won't restart
>>> automatically, even after 10+ minutes (I already made sure that global
>>> maintenance mode is set to none). I initially thought it was a time sync
>>> issue, so I installed and enabled ntp on the hosts and the engine.
>>> However, the issue still persists.
>>>
>>> ###Versions:
>>> [root@host01 ~]# rpm -qa | grep ovirt
>>> libgovirt-0.3.3-1.el7_2.1.x86_64
>>> ovirt-vmconsole-1.0.0-1.el7.centos.noarch
>>> ovirt-vmconsole-host-1.0.0-1.el7.centos.noarch
>>> ovirt-hosted-engine-ha-1.3.5.3-1.el7.centos.noarch
>>> ovirt-host-deploy-1.4.1-1.el7.centos.noarch
>>> ovirt-engine-sdk-python-3.6.5.0-1.el7.centos.noarch
>>> ovirt-hosted-engine-setup-1.3.5.0-1.el7.centos.noarch
>>> ovirt-release36-007-1.noarch
>>> ovirt-setup-lib-1.0.1-1.el7.centos.noarch
>>> [root@host01 ~]# rpm -qa | grep vdsm
>>> vdsm-infra-4.17.26-0.el7.centos.noarch
>>> vdsm-jsonrpc-4.17.26-0.el7.centos.noarch
>>> vdsm-gluster-4.17.26-0.el7.centos.noarch
>>> vdsm-python-4.17.26-0.el7.centos.noarch
>>> vdsm-yajsonrpc-4.17.26-0.el7.centos.noarch
>>> vdsm-4.17.26-0.el7.centos.noarch
>>> vdsm-cli-4.17.26-0.el7.centos.noarch
>>> vdsm-xmlrpc-4.17.26-0.el7.centos.noarch
>>> vdsm-hook-vmfex-dev-4.17.26-0.el7.centos.noarch
>>>
>>> ###Log files:
>>>
>>> https://app.box.com/s/fkurmwagogwkv5smkwwq7i4ztmwf9q9r
>>>
>>> ###After host02 was killed:
>>> [root@host03 wees]# hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> Status up-to-date : True
>>> Hostname : host01.ovirt.forest.go.th
>>> Host ID : 1
>>> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 396766e0
>>> Host timestamp : 4391
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> Status up-to-date : True
>>> Hostname : host02.ovirt.forest.go.th
>>> Host ID : 2
>>> Engine status : {"health": "good", "vm": "up", "detail": "up"}
>>> Score : 0
>>> stopped : True
>>> Local maintenance : False
>>> crc32 : 3a345b65
>>> Host timestamp : 1458
>>>
>>>
>>> --== Host 3 status ==--
>>>
>>> Status up-to-date : True
>>> Hostname : host03.ovirt.forest.go.th
>>> Host ID : 3
>>> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 4c34b0ed
>>> Host timestamp : 11958
>>>
>>> ###After host02 was killed for a while:
>>> [root@host03 wees]# hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> Status up-to-date : False
>>> Hostname : host01.ovirt.forest.go.th
>>> Host ID : 1
>>> Engine status : unknown stale-data
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 72e4e418
>>> Host timestamp : 4415
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> Status up-to-date : False
>>> Hostname : host02.ovirt.forest.go.th
>>> Host ID : 2
>>> Engine status : unknown stale-data
>>> Score : 0
>>> stopped : True
>>> Local maintenance : False
>>> crc32 : 3a345b65
>>> Host timestamp : 1458
>>>
>>>
>>> --== Host 3 status ==--
>>>
>>> Status up-to-date : False
>>> Hostname : host03.ovirt.forest.go.th
>>> Host ID : 3
>>> Engine status : unknown stale-data
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : 4c34b0ed
>>> Host timestamp : 11958
>>>
>>> ###After host02 was up again completely:
>>> [root@host03 wees]# hosted-engine --vm-status
>>>
>>>
>>> --== Host 1 status ==--
>>>
>>> Status up-to-date : True
>>> Hostname : host01.ovirt.forest.go.th
>>> Host ID : 1
>>> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score : 0
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : f5728fca
>>> Host timestamp : 5555
>>>
>>>
>>> --== Host 2 status ==--
>>>
>>> Status up-to-date : True
>>> Hostname : host02.ovirt.forest.go.th
>>> Host ID : 2
>>> Engine status : {"health": "good", "vm": "up", "detail": "up"}
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : e5284763
>>> Host timestamp : 715
>>>
>>>
>>> --== Host 3 status ==--
>>>
>>> Status up-to-date : True
>>> Hostname : host03.ovirt.forest.go.th
>>> Host ID : 3
>>> Engine status : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
>>> Score : 3400
>>> stopped : False
>>> Local maintenance : False
>>> crc32 : bc10c7fc
>>> Host timestamp : 13119
>>>
>>> --
>>> Wee
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users(a)ovirt.org
>>>
>>> http://lists.ovirt.org/mailman/listinfo/users
>
> --
> Wee Sritippho
> Computer Technical Officer, Practitioner Level
> Information Center, Royal Forest Department
> Tel. 025614292-3 ext. 5621
> Mobile. 0864678919