[ovirt-users] HA - Fencing not working when host with engine gets shutdown

Michael Hölzl mh at ins.jku.at
Thu Sep 24 12:31:03 UTC 2015


Ok, thanks!

So, I would still like to know whether you would recommend not using a
hosted engine but rather a separate machine for the engine?

On 09/24/2015 01:24 PM, Martin Perina wrote:
>
> ----- Original Message -----
>> From: "Michael Hölzl" <mh at ins.jku.at>
>> To: "Martin Perina" <mperina at redhat.com>, "Eli Mesika" <emesika at redhat.com>
>> Cc: "Doron Fediuck" <dfediuck at redhat.com>, users at ovirt.org
>> Sent: Thursday, September 24, 2015 12:35:13 PM
>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>
>> Hi,
>>
>> thanks for the detailed answer! In principle, I understand the issue
>> now. However, I cannot fully follow the argument that this is a corner
>> case. In a small or medium-sized company, I would assume that such a
>> setup, consisting of two machines with a hosted engine, is not uncommon,
>> especially as there is documentation online which describes how to
>> deploy this setup. Does that mean that hosted engines are in general not
>> recommended?
>>
>> I am also wondering why fencing could not be triggered by the hosted
>> engine after the DisableFenceAtStartupInSec timeout has passed. In the
>> events log of the engine I keep getting the message "Host hosted_engine_2
>> is not responding. It will stay in Connecting state for a grace period of
>> 120 seconds and after that an attempt to fence the host will be issued.",
>> which would indicate that the engine is actually trying to fence the
>> non-responsive host.
> Unfortunately this message is a bit misleading: it's shown every time
> we start handling a network exception for the host, and it's fired before
> the logic which decides whether to start or skip the fencing process (this
> misleading message is fixed in 3.6). But in the current logic we really
> execute fencing only when the host status is about to change from Connecting
> to Non Responsive, and that happens only the first time, while we are still
> inside the DisableFenceAtStartupInSec interval. During all other attempts the
> host is already in the Non Responsive status, so fencing is skipped.
>
>> On 09/24/2015 11:50 AM, Martin Perina wrote:
>>> ----- Original Message -----
>>>> From: "Eli Mesika" <emesika at redhat.com>
>>>> To: "Martin Perina" <mperina at redhat.com>, "Doron Fediuck"
>>>> <dfediuck at redhat.com>
>>>> Cc: "Michael Hölzl" <mh at ins.jku.at>, users at ovirt.org
>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine
>>>> gets shutdown
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Martin Perina" <mperina at redhat.com>
>>>>> To: "Michael Hölzl" <mh at ins.jku.at>
>>>>> Cc: users at ovirt.org
>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine
>>>>> gets shutdown
>>>>>
>>>>> Hi,
>>>>>
>>>>> sorry for the late response, but you hit a "corner case" :-(
>>>>>
>>>>> Let me start by explaining a few things first:
>>>>>
>>>>> After startup of the engine there's an interval during which fencing is
>>>>> disabled. It's called DisableFenceAtStartupInSec and by default it's
>>>>> set to 5 minutes. It can be changed using
>>>>>
>>>>>    engine-config -s DisableFenceAtStartupInSec
>>>>>
>>>>> but please do that with caution.
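>>>>>
>>>>> For example, a minimal sketch (assuming the value is given in seconds, as
>>>>> the option name suggests, and that the engine has to be restarted to pick
>>>>> up the change):
>>>>>
>>>>>    # raise the startup fencing delay from the default 5 minutes to 10 minutes
>>>>>    engine-config -s DisableFenceAtStartupInSec=600
>>>>>    service ovirt-engine restart
>>>>>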
>>>>>
>>>>> Why do we have such a timeout? It's a prevention of a fencing storm, which
>>>>> could happen during power issues in the whole DC: when both the engine and
>>>>> the hosts are started, it may take a lot of time for huge hosts to come up
>>>>> and for VDSM to start communicating with the engine. So usually the engine
>>>>> is started first, and without this interval the engine would start fencing
>>>>> hosts which are just starting ...
>>>>>
>>>>> Another thing: if we cannot properly fence the host, we cannot determine
>>>>> whether there is just a communication issue between the engine and the
>>>>> host, so we cannot restart HA VMs on another host. The only thing we can
>>>>> do is offer the manual "Mark host as rebooted" option to the administrator.
>>>>> If the administrator executes this option, we try to restart the HA VMs on
>>>>> a different host ASAP, because the admin took responsibility for validating
>>>>> that the VMs are really not running.
>>>>>
>>>>>
>>>>> When the engine is started, the following actions related to fencing are taken:
>>>>>
>>>>> 1. Get the status of all hosts from the DB and schedule Non Responding Treatment
>>>>>    after the DisableFenceAtStartupInSec timeout has passed
>>>>>
>>>>> 2. Try to communicate with all hosts and refresh their status
>>>>>
>>>>>
>>>>> If some host becomes Non Responsive during the DisableFenceAtStartupInSec
>>>>> interval, we skip fencing and the administrator will see a message in the
>>>>> Events tab that the host is Non Responsive, but that fencing is disabled
>>>>> due to the startup interval. So the administrator has to take care of such
>>>>> a host manually.
>>>>>
>>>>>
>>>>> Now what happened in your case:
>>>>>
>>>>>  1. The hosted engine VM is running on host1 together with other VMs
>>>>>  2. The status of host1 and host2 is Up
>>>>>  3. You kill/shut down host1 -> the hosted engine VM is also shut down ->
>>>>>     no engine is running to detect the issue with host1 and change its
>>>>>     status to Non Responsive
>>>>>  4. In the meantime the hosted engine VM is started on host2 -> it will
>>>>>     read the host statuses from the DB, but all hosts are Up -> it will
>>>>>     try to communicate with host1, but it's unreachable -> so it changes
>>>>>     the host1 status to Non Responsive and starts Non Responsive Treatment
>>>>>     for host1 -> Non Responsive Treatment is aborted because the engine is
>>>>>     still inside the DisableFenceAtStartupInSec interval
>>>>>
>>>>>
>>>>> So in a normal deployment (without hosted engine) the admin is notified
>>>>> that the host where the engine is running crashed and was rebooted, so he
>>>>> has to take a look and do manual steps if needed.
>>>>>
>>>>> In a hosted engine deployment it's an issue because the hosted engine VM
>>>>> can be restarted on a different host also in cases other than crashes
>>>>> (for example, if the host is overloaded, hosted engine can stop the hosted
>>>>> engine VM and restart it on a different host, but this shouldn't happen
>>>>> too often).
>>>>>
>>>>> At the moment the only solution for this is manual: let the administrator
>>>>> be notified that the hosted engine VM was restarted on a different host,
>>>>> so the administrator can check manually what the cause of this restart was
>>>>> and execute manual steps if needed.
>>>>>
>>>>> So to summarize: at the moment I don't see any reliable automatic solution
>>>>> for this :-( and fencing storm prevention is more important. But feel free
>>>>> to create a bug for this issue; maybe we can think of at least some
>>>>> improvement for this use case.
>>>> Thanks for the detailed explanation, Martin.
>>>> Really a corner case; let's see if we get more input on that from other
>>>> users.
>>>> Maybe when the hosted engine VM is restarted on another node we can ask
>>>> for the reason and act accordingly.
>>>> Doron, with the current implementation, is the reason for the hosted
>>>> engine VM restart stored anywhere?
>>> I have already discussed this with Martin Sivak, and hosted engine doesn't
>>> touch the engine db at all. We discussed a possible solution with Martin,
>>> which we could do in master and maybe in 3.6 if agreed:
>>>
>>>  1. Just after the start of the engine we can read from the db the name of
>>>     the host which the hosted engine VM is running on and store it somewhere
>>>     in memory for Non Responding Treatment
>>>
>>>  2. As a part of Non Responding Treatment we can add some hosted engine
>>>     specific logic:
>>>       IF we are running as hosted engine AND
>>>          we are inside the DisableFenceAtStartupInSec interval AND
>>>          the non responsive host is the host stored above in step 1. AND
>>>          the hosted engine VM is running on a different host
>>>       THEN
>>>          execute fencing for the non responsive host even when we are
>>>          inside the DisableFenceAtStartupInSec interval
>>>
>>> But it can cause unnecessary fencing in the case that the whole datacenter
>>> recovers from a power failure.
>>>
>>>>> Thanks
>>>>>
>>>>> Martin Perina
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Michael Hölzl" <mh at ins.jku.at>
>>>>>> To: "Martin Perina" <mperina at redhat.com>
>>>>>> Cc: users at ovirt.org
>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
>>>>>> engine
>>>>>> gets shutdown
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The whole engine.log including the shutdown time (the shutdown was
>>>>>> performed around 9:19):
>>>>>> http://pastebin.com/cdY9uTkJ
>>>>>>
>>>>>> vdsm.log of host01 (the host which kept on running and took over the
>>>>>> engine), split into 3 uploads (due to pastebin's 512 kB limit):
>>>>>> 1 : http://pastebin.com/dr9jNTek
>>>>>> 2 : http://pastebin.com/cuyHL6ne
>>>>>> 3 : http://pastebin.com/7x2ZQy1y
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> could you please post the whole engine.log (from the time at which you
>>>>>>> turned off the host with the engine VM) and also vdsm.log from both hosts?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Martin Perina
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Michael Hölzl" <mh at ins.jku.at>
>>>>>>>> To: users at ovirt.org
>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with engine
>>>>>>>> gets
>>>>>>>> 	shutdown
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> we are trying to set up an oVirt environment with two hosts, both
>>>>>>>> connected to an iSCSI storage device, with a hosted engine and power
>>>>>>>> management configured over iLO. So far it seems to work fine in our
>>>>>>>> testing setup, and starting/stopping VMs works smoothly with proper
>>>>>>>> scheduling between those hosts. So we wanted to test HA for the VMs now
>>>>>>>> and started to manually shut down a host while there are still VMs
>>>>>>>> running on that machine (to simulate a power failure or a kernel panic).
>>>>>>>> The expected outcome was that all machines where HA is enabled are
>>>>>>>> booted again. This works if the machine with the failure does not have
>>>>>>>> the engine running. If the machine with the hosted engine VM gets shut
>>>>>>>> down, the host gets into the "Not Responsive" state and all VMs end up
>>>>>>>> in an unknown state. However, the engine itself starts correctly on the
>>>>>>>> second host and it seems like it tries to fence the other host (as
>>>>>>>> expected). Events which we get in the Open Virtualization Manager:
>>>>>>>> 1. Host hosted_engine_2 is non responsive
>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a proxy to
>>>>>>>> execute Status command on Host hosted_engine_2.
>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power
>>>>>>>> management configured. Please check the host status, manually reboot
>>>>>>>> it,
>>>>>>>> and click "Confirm Host Has Been Rebooted"
>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in Connecting
>>>>>>>> state for a grace period of 124 seconds and after that an attempt to
>>>>>>>> fence the host will be issued.
>>>>>>>>
>>>>>>>> Event 4 keeps coming every 3 minutes. Complete engine.log file
>>>>>>>> during engine boot-up: http://pastebin.com/D6xS3Wfy
>>>>>>>> So the engine detects that the machine is not responding and wants to
>>>>>>>> fence it. But although the host has power management configured over
>>>>>>>> iLO, the engine thinks that it does not. As a result the second host
>>>>>>>> does not get fenced and the VMs are not migrated to the running machine.
>>>>>>>> In the log files there are also a lot of timeout exceptions, but I guess
>>>>>>>> that this is because the host cannot connect to the other machine.
>>>>>>>>
>>>>>>>> Did anybody face similar problems with HA? Or any clue what the problem
>>>>>>>> might be?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>>> ----
>>>>>>>> oVirt version: 3.5.4
>>>>>>>> Hosted engine VM OS: CentOS 6.5
>>>>>>>> Host machines OS: CentOS 7
>>>>>>>>
>>>>>>>> P.S. We should also note that we had problems with the command
>>>>>>>> fence_ipmilan at the beginning. We were receiving the message "Unable to
>>>>>>>> obtain correct plug status or plug is not available" whenever the
>>>>>>>> command fence_ipmilan was called. However, the command fence_ilo4
>>>>>>>> worked. So we now use a simple script in place of fence_ipmilan that
>>>>>>>> calls fence_ilo4 and passes the arguments through.
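>>>>>>>>
>>>>>>>> For reference, a minimal sketch of such a wrapper (an assumption of how
>>>>>>>> it could look, not necessarily our exact script; the fence_ilo4 path may
>>>>>>>> differ on other systems):
>>>>>>>>
>>>>>>>>    #!/bin/sh
>>>>>>>>    # Stand-in for fence_ipmilan: hand argv (and inherited stdin, where
>>>>>>>>    # the fence options are passed) over to fence_ilo4 unchanged.
>>>>>>>>    exec /usr/sbin/fence_ilo4 "$@"
>>>>>>>>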


