[ovirt-users] HA - Fencing not working when host with engine gets shutdown

Michael Hölzl mh at ins.jku.at
Fri Sep 25 06:19:52 UTC 2015


Thanks for the help! I will definitely stay tuned for updates on this
matter.

Michael

On 09/24/2015 03:13 PM, Martin Perina wrote:
> I created a bug covering this:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1266099
>
> ----- Original Message -----
>> From: "Martin Sivak" <msivak at redhat.com>
>> To: "Michael Hölzl" <mh at ins.jku.at>
>> Cc: "Martin Perina" <mperina at redhat.com>, users at ovirt.org
>> Sent: Thursday, September 24, 2015 2:59:52 PM
>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>
>> Hi Michael,
>>
>> Martin summed up the situation neatly. I would just add that this issue
>> does not depend on the size of your setup. The same would happen to HA
>> VMs running on the same host as the hosted engine even if the cluster
>> had 50 hosts...
>>
>> About the recommended way of engine deployment: It really is about
>> whether you can tolerate your engine being down for a longer time
>> (starting another host using a backup db).
>>
>> Hosted engine restores your management in an automated way and without
>> any data loss. However I agree that the fact that you have to tend to
>> your HA VMs manually after an engine restart is not nice. Fortunately
>> that should only happen when your host (or vdsm) dies and does not
>> come up for an extended period of time.
>>
>> The summary would be: there will be no HA handling while the host
>> running the engine is down, regardless of whether the deployment is
>> hosted engine or standalone engine. If the issue is related to the
>> software only then there is no real difference.
>>
>> - When a host with the standalone engine dies, the VMs are fine, but
>> if anything happens while the engine is down (and reinstalling a
>> standalone engine takes time + you need a very fresh db backup) you
>> might again face issues with HA VMs being down or not starting when
>> the engine comes up.
>>
>> - When a hosted engine dies because of a host failure, some VMs
>> generally disappear with it. The engine will come up automatically and
>> HA VMs from the original host have to be restarted manually.
>> This requires some manual action, but I see it as less demanding than
>> the first case.
>>
>> - When a hosted engine VM is stopped properly by the tooling it will
>> be restarted elsewhere and it will be able to connect to the original
>> host just fine. The engine will then make sure that all HA VMs are up
>> even if the VMs died while the engine was down.
>>
>> So I would recommend hosted engine based deployment. And ask for a bit
>> of patience as we have a plan to mitigate the second case to some
>> extent without compromising the fencing storm prevention.
>>
>> Best regards
>>
>> --
>> Martin Sivak
>> msivak at redhat.com
>> SLA RHEV-M
>>
>>
>> On Thu, Sep 24, 2015 at 2:31 PM, Michael Hölzl <mh at ins.jku.at> wrote:
>>> Ok, thanks!
>>>
>>> So, I would still like to know if you would recommend not to use hosted
>>> engines but rather another machine for the engine?
>>>
>>> On 09/24/2015 01:24 PM, Martin Perina wrote:
>>>> ----- Original Message -----
>>>>> From: "Michael Hölzl" <mh at ins.jku.at>
>>>>> To: "Martin Perina" <mperina at redhat.com>, "Eli Mesika"
>>>>> <emesika at redhat.com>
>>>>> Cc: "Doron Fediuck" <dfediuck at redhat.com>, users at ovirt.org
>>>>> Sent: Thursday, September 24, 2015 12:35:13 PM
>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine
>>>>> gets shutdown
>>>>>
>>>>> Hi,
>>>>>
>>>>> thanks for the detailed answer! In principle, I understand the issue
>>>>> now. However, I cannot fully follow the argument that this is a corner
>>>>> case. In a small or medium-sized company, I would assume that such a
>>>>> setup, consisting of two machines with a hosted engine, is not uncommon.
>>>>> Especially as there is documentation online which describes how to
>>>>> deploy this setup. Does that mean that hosted engines are in general not
>>>>> recommended?
>>>>>
>>>>> I am also wondering why the fencing could not be triggered by the hosted
>>>>> engine after the DisableFenceAtStartupInSec timeout? In the events log
>>>>> of the engine I keep on getting the message "Host hosted_engine_2 is not
>>>>> responding. It will stay in Connecting state for a grace period of 120
>>>>> seconds and after that an attempt to fence the host will be issued.",
>>>>> which would indicate that the engine is actually trying to fence the non
>>>>> responsive host.
>>>> Unfortunately this message is a bit misleading: it is shown every time
>>>> we start handling a network exception for the host, and it is fired
>>>> before the logic which decides whether to start or skip the fencing
>>>> process (the misleading message is fixed in 3.6). In the current logic
>>>> we really execute fencing only when the host status is about to change
>>>> from Connecting to NonResponsive, and this happens only the first time,
>>>> when we are still in the DisableFenceAtStartupInSec interval. During all
>>>> other attempts the host is already in status Non Responsive, so fencing
>>>> is skipped.
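
The behaviour described above can be pictured as a small state model. This is
only a sketch under assumptions: the names Host and handle_network_exception
are hypothetical and are not taken from the oVirt engine source. Only the
setting name DisableFenceAtStartupInSec, its 5 minute default and the host
states come from this thread.

```python
from dataclasses import dataclass

# Default of 5 minutes, as stated in this thread.
DISABLE_FENCE_AT_STARTUP_SEC = 300


@dataclass
class Host:
    name: str
    status: str = "Up"  # "Up", "Connecting" or "NonResponsive"


def handle_network_exception(host, seconds_since_engine_start):
    """Hypothetical model of one round of handling an unreachable host.

    Fencing is only considered on the transition into NonResponsive, and it
    is skipped when that transition falls inside the startup interval.
    """
    if host.status == "NonResponsive":
        # Later rounds find the host already NonResponsive: no transition,
        # so fencing is never attempted again.
        return "skipped_already_non_responsive"
    host.status = "NonResponsive"
    if seconds_since_engine_start < DISABLE_FENCE_AT_STARTUP_SEC:
        return "skipped_startup_interval"
    return "fenced"
```

In this model, a host that first becomes unreachable 60 seconds after engine
start is never fenced, even when it is checked again after the interval has
passed, which matches what was observed in the thread.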
>>>>
>>>>> On 09/24/2015 11:50 AM, Martin Perina wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Eli Mesika" <emesika at redhat.com>
>>>>>>> To: "Martin Perina" <mperina at redhat.com>, "Doron Fediuck"
>>>>>>> <dfediuck at redhat.com>
>>>>>>> Cc: "Michael Hölzl" <mh at ins.jku.at>, users at ovirt.org
>>>>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
>>>>>>> engine
>>>>>>> gets shutdown
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Martin Perina" <mperina at redhat.com>
>>>>>>>> To: "Michael Hölzl" <mh at ins.jku.at>
>>>>>>>> Cc: users at ovirt.org
>>>>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
>>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
>>>>>>>> engine
>>>>>>>> gets shutdown
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> sorry for the late response, but you hit a "corner case" :-(
>>>>>>>>
>>>>>>>> Let me explain a few things first:
>>>>>>>>
>>>>>>>> After startup of the engine there's an interval during which fencing is
>>>>>>>> disabled. It's called DisableFenceAtStartupInSec and by default it's
>>>>>>>> set to 5 minutes. It can be changed using
>>>>>>>>
>>>>>>>>    engine-config -s DisableFenceAtStartupInSec=<seconds>
>>>>>>>>
>>>>>>>> but please do that with caution.
>>>>>>>>
>>>>>>>> Why do we have such a timeout? It prevents a fencing storm, which
>>>>>>>> could happen during power issues in the whole DC: when both the engine
>>>>>>>> and the hosts are started, large hosts may take a long time to come up
>>>>>>>> and for VDSM to start communicating with the engine. The engine is
>>>>>>>> usually started first, and without this interval it would start
>>>>>>>> fencing hosts which are just starting ...
>>>>>>>>
>>>>>>>> Another thing: if we cannot properly fence the host, we cannot rule
>>>>>>>> out that there is just a communication issue between engine and host,
>>>>>>>> so we cannot restart HA VMs on another host. The only thing we can do
>>>>>>>> is offer the manual "Mark host as rebooted" option to the
>>>>>>>> administrator. If the administrator executes this option, we try to
>>>>>>>> restart HA VMs on a different host ASAP, because the admin took
>>>>>>>> responsibility for validating that the VMs are really not running.
>>>>>>>>
>>>>>>>>
>>>>>>>> When the engine is started, the following actions related to fencing
>>>>>>>> are taken:
>>>>>>>>
>>>>>>>> 1. Get status of all hosts from DB and schedule Non Responding
>>>>>>>>    Treatment after DisableFenceAtStartupInSec timeout is passed
>>>>>>>>
>>>>>>>> 2. Try to communicate with all hosts and refresh their status
>>>>>>>>
>>>>>>>>
>>>>>>>> If some host becomes Non Responsive during the
>>>>>>>> DisableFenceAtStartupInSec interval, we skip fencing and the
>>>>>>>> administrator will see a message in the Events tab that the host is
>>>>>>>> Non Responsive, but fencing is disabled due to the startup interval.
>>>>>>>> So the administrator has to take care of such a host manually.
>>>>>>>>
>>>>>>>>
>>>>>>>> Now what happened in your case:
>>>>>>>>
>>>>>>>>  1. Hosted engine VM is running on host1 with other VMs
>>>>>>>>  2. Status of host1 and host2 is Up
>>>>>>>>  3. You kill/shutdown host1 -> hosted engine VM is also shut down ->
>>>>>>>>     no engine is running to detect the issue with host1 and change
>>>>>>>>     its status to Non Responsive
>>>>>>>>  4. In the meantime hosted engine VM is started on host2 -> it will
>>>>>>>>     read host status from DB, but all hosts are Up -> it will try to
>>>>>>>>     communicate with host1, but it's unreachable -> so it changes
>>>>>>>>     host1 status to Non Responsive and starts Non Responsive
>>>>>>>>     Treatment for host1 -> Non Responsive Treatment is aborted
>>>>>>>>     because engine is still in DisableFenceAtStartupInSec
>>>>>>>>
>>>>>>>>
>>>>>>>> So in a normal deployment (without hosted engine) the admin is
>>>>>>>> notified that the host where the engine is running crashed and was
>>>>>>>> rebooted, so he has to take a look and do manual steps if needed.
>>>>>>>>
>>>>>>>> In a hosted engine deployment it's an issue because the hosted engine
>>>>>>>> VM can be restarted on a different host also in cases other than
>>>>>>>> crashes (for example, if the host is overloaded, hosted engine can
>>>>>>>> stop the hosted engine VM and restart it on a different host, but this
>>>>>>>> shouldn't happen too often).
>>>>>>>>
>>>>>>>> At the moment the only solution for this is manual: let the
>>>>>>>> administrator be notified that the hosted engine VM was restarted on a
>>>>>>>> different host, so the administrator can check manually what caused
>>>>>>>> the restart and execute manual steps if needed.
>>>>>>>>
>>>>>>>> So to summarize: at the moment I don't see any reliable automatic
>>>>>>>> solution for this :-( and fencing storm prevention is more important.
>>>>>>>> But feel free to create a bug for this issue, maybe we can think of at
>>>>>>>> least some improvement for this use case.
>>>>>>> Thanks for the detailed explanation, Martin.
>>>>>>> Really a corner case; let's see if we get more input on that from
>>>>>>> other users.
>>>>>>> Maybe when the hosted engine VM is restarted on another node we can
>>>>>>> ask for the reason and act accordingly.
>>>>>>> Doron, with the current implementation, is the reason for a hosted
>>>>>>> engine VM restart stored anywhere?
>>>>>> I have already discussed this with Martin Sivak and hosted engine
>>>>>> doesn't touch the engine db at all. We discussed this possible solution
>>>>>> with Martin, which we could do in master and maybe in 3.6 if agreed:
>>>>>>
>>>>>>  1. Just after start of engine we can read from the db the name of the
>>>>>>     host which hosted engine VM is running on and store it somewhere in
>>>>>>     memory for Non Responding Treatment
>>>>>>
>>>>>>  2. As a part of Non Responding Treatment we can add some hosted engine
>>>>>>     specific logic:
>>>>>>       IF we are running as hosted engine AND
>>>>>>          we are inside DisableFenceAtStartupInSec interval AND
>>>>>>          non responsive host is the host stored above in step 1. AND
>>>>>>          hosted engine VM is running on different host
>>>>>>       THEN
>>>>>>          execute fencing for non responsive host even when we are
>>>>>>          inside DisableFenceAtStartupInSec interval
>>>>>>
>>>>>> But it can cause an unnecessary fence in the case that the whole
>>>>>> datacenter recovers from a power failure.
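
The proposed condition can be written down as a single predicate. This is a
minimal sketch, not engine code: the function name and all parameter names
are hypothetical, and the host recorded in step 1 is modelled as
previous_engine_host (the value read from the db just after engine start).

```python
def should_fence_during_startup(running_as_hosted_engine,
                                inside_startup_interval,
                                non_responsive_host,
                                previous_engine_host,
                                current_engine_host):
    """Hypothetical predicate for the proposed hosted engine exception.

    Returns True only when all four conditions from the proposal hold, i.e.
    when fencing should be executed even though the engine is still inside
    the DisableFenceAtStartupInSec interval.
    """
    return bool(
        running_as_hosted_engine
        and inside_startup_interval
        # The unreachable host is the one the db recorded in step 1.
        and non_responsive_host == previous_engine_host
        # The hosted engine VM has since moved to a different host.
        and current_engine_host != previous_engine_host
    )
```

For the scenario in this thread (engine VM previously on host1, now running
on host2, host1 unreachable during the startup interval) the predicate is
true. As noted above, it would also be true after a datacenter-wide power
failure in which the engine VM comes up on another host first, which is where
the unnecessary fence could come from.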
>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Martin Perina
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Michael Hölzl" <mh at ins.jku.at>
>>>>>>>>> To: "Martin Perina" <mperina at redhat.com>
>>>>>>>>> Cc: users at ovirt.org
>>>>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
>>>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
>>>>>>>>> engine
>>>>>>>>> gets shutdown
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The whole engine.log including the shutdown time (was performed
>>>>>>>>> around 9:19)
>>>>>>>>> http://pastebin.com/cdY9uTkJ
>>>>>>>>>
>>>>>>>>> vdsm.log of host01 (the host which kept on running and took over the
>>>>>>>>> engine) split into 3 uploads (limit of 512 kB of pastebin):
>>>>>>>>> 1 : http://pastebin.com/dr9jNTek
>>>>>>>>> 2 : http://pastebin.com/cuyHL6ne
>>>>>>>>> 3 : http://pastebin.com/7x2ZQy1y
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> could you please post the whole engine.log (from the time when you
>>>>>>>>>> turned off the host with engine VM) and also vdsm.log from both
>>>>>>>>>> hosts?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Martin Perina
>>>>>>>>>>
>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>> From: "Michael Hölzl" <mh at ins.jku.at>
>>>>>>>>>>> To: users at ovirt.org
>>>>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
>>>>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with
>>>>>>>>>>> engine
>>>>>>>>>>> gets
>>>>>>>>>>>        shutdown
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> we are trying to set up an oVirt environment with two hosts, both
>>>>>>>>>>> connected to an iSCSI storage device, a hosted engine and power
>>>>>>>>>>> management configured over iLO. So far it seems to work fine in our
>>>>>>>>>>> testing setup and starting/stopping VMs works smoothly with proper
>>>>>>>>>>> scheduling between those hosts. So we wanted to test HA for the VMs
>>>>>>>>>>> now and started to manually shut down a host while there are still
>>>>>>>>>>> VMs running on that machine (to simulate power failure or a kernel
>>>>>>>>>>> panic). The expected outcome was that all machines where HA is
>>>>>>>>>>> enabled are booted again. This works if the machine with the failure
>>>>>>>>>>> does not have the engine running. If the machine with the hosted
>>>>>>>>>>> engine VM gets shut down, the host gets into the "Not Responsive"
>>>>>>>>>>> state and all VMs end up in an unknown state. However, the engine
>>>>>>>>>>> itself starts correctly on the second host and it seems like it
>>>>>>>>>>> tries to fence the other host (as expected). Events which we get in
>>>>>>>>>>> the open virtualization manager:
>>>>>>>>>>> 1. Host hosted_engine_2 is non responsive
>>>>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a proxy
>>>>>>>>>>>    to execute Status command on Host hosted_engine_2.
>>>>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power
>>>>>>>>>>>    management configured. Please check the host status, manually
>>>>>>>>>>>    reboot it, and click "Confirm Host Has Been Rebooted"
>>>>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in
>>>>>>>>>>>    Connecting state for a grace period of 124 seconds and after
>>>>>>>>>>>    that an attempt to fence the host will be issued.
>>>>>>>>>>>
>>>>>>>>>>> Event 4 keeps coming every 3 minutes. Complete engine.log file
>>>>>>>>>>> during engine boot up: http://pastebin.com/D6xS3Wfy
>>>>>>>>>>> So the host detects the machine is not responding and wants to
>>>>>>>>>>> fence it. But although the host has power management configured over
>>>>>>>>>>> iLO, the engine thinks that it does not. As a result the second host
>>>>>>>>>>> does not get fenced and VMs are not migrated to the running machine.
>>>>>>>>>>> In the log files there are also a lot of timeout exceptions. But I
>>>>>>>>>>> guess that this is because the host cannot connect to the other
>>>>>>>>>>> machine.
>>>>>>>>>>>
>>>>>>>>>>> Did anybody face similar problems with HA? Or any clue what the
>>>>>>>>>>> problem might be?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ----
>>>>>>>>>>> oVirt version: 3.5.4
>>>>>>>>>>> Hosted engine VM OS: CentOS 6.5
>>>>>>>>>>> Host Machines OS: CentOS 7
>>>>>>>>>>>
>>>>>>>>>>> P.S. We also have to note that we had problems with the command
>>>>>>>>>>> fence_ipmilan at the beginning. We were receiving the message
>>>>>>>>>>> "Unable to obtain correct plug status or plug is not available,"
>>>>>>>>>>> whenever the command fence_ipmilan was called. However, the command
>>>>>>>>>>> fence_ilo4 worked. So we use a simple script for fence_ipmilan now
>>>>>>>>>>> that calls fence_ilo4 and passes the arguments.
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Users mailing list
>>>>>>>>>>> Users at ovirt.org
>>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>>>>


