[ovirt-users] HA - Fencing not working when host with engine gets shutdown

Martin Perina mperina at redhat.com
Thu Sep 24 13:13:37 UTC 2015


I created a bug covering this:

https://bugzilla.redhat.com/show_bug.cgi?id=1266099

----- Original Message -----
> From: "Martin Sivak" <msivak at redhat.com>
> To: "Michael Hölzl" <mh at ins.jku.at>
> Cc: "Martin Perina" <mperina at redhat.com>, users at ovirt.org
> Sent: Thursday, September 24, 2015 2:59:52 PM
> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
> 
> Hi Michael,
> 
> Martin summed up the situation neatly; I would just add that this
> issue is not specific to the size of your setup. The same would happen
> to HA VMs running on the same host as the hosted engine even if the
> cluster had 50 hosts...
> 
> About the recommended way of engine deployment: it really comes down
> to whether you can tolerate your engine being down for a longer time
> (setting up another host using a backup db).
> 
> Hosted engine restores your management in an automated way and without
> any data loss. However, I agree that having to tend to your HA VMs
> manually after an engine restart is not nice. Fortunately that should
> only happen when your host (or vdsm) dies and does not come up for an
> extended period of time.
> 
> The summary would be: there will be no HA handling if the host running
> the engine is down, regardless of whether the deployment is hosted
> engine or standalone engine. If the issue is related to the software
> only, then there is no real difference.
> 
> - When a host with the standalone engine dies, the VMs are fine, but
> if anything happens while the engine is down (and reinstalling a
> standalone engine takes time, plus you need a very fresh db backup)
> you might again face issues with HA VMs being down or not starting
> when the engine comes back up.
> 
> - When a hosted engine dies because of a host failure, some VMs
> generally disappear with it. The engine will come up automatically,
> but HA VMs from the original host have to be manually pushed back to
> work. This requires some manual action, but I see it as less demanding
> than the first case.
> 
> - When a hosted engine VM is stopped properly by the tooling, it will
> be restarted elsewhere and will be able to connect to the original
> host just fine. The engine will then make sure that all HA VMs are up,
> even if the VMs died while the engine was down.
> 
> So I would recommend a hosted-engine-based deployment. And I ask for a
> bit of patience, as we have a plan for mitigating the second case to
> some extent without compromising the fencing storm prevention.
> 
> Best regards
> 
> --
> Martin Sivak
> msivak at redhat.com
> SLA RHEV-M
> 
> 
> On Thu, Sep 24, 2015 at 2:31 PM, Michael Hölzl <mh at ins.jku.at> wrote:
> > Ok, thanks!
> >
> > So, I would still like to know whether you would recommend not using
> > a hosted engine but rather a separate machine for the engine?
> >
> > On 09/24/2015 01:24 PM, Martin Perina wrote:
> >>
> >> ----- Original Message -----
> >>> From: "Michael Hölzl" <mh at ins.jku.at>
> >>> To: "Martin Perina" <mperina at redhat.com>, "Eli Mesika"
> >>> <emesika at redhat.com>
> >>> Cc: "Doron Fediuck" <dfediuck at redhat.com>, users at ovirt.org
> >>> Sent: Thursday, September 24, 2015 12:35:13 PM
> >>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine
> >>> gets shutdown
> >>>
> >>> Hi,
> >>>
> >>> thanks for the detailed answer! In principle, I understand the issue
> >>> now. However, I cannot fully follow the argument that this is a corner
> >>> case. In a small or medium-sized company, I would assume that such a
> >>> setup, consisting of two machines with a hosted engine, is not uncommon.
> >>> Especially as there is documentation online which describes how to
> >>> deploy this setup. Does that mean that hosted engines are in general not
> >>> recommended?
> >>>
> >>> I am also wondering why the fencing could not be triggered by the hosted
> >>> engine after the DisableFenceAtStartupInSec timeout? In the events log
> >>> of the engine I keep on getting the message "Host hosted_engine_2 is not
> >>> responding. It will stay in Connecting state for a grace period of 120
> >>> seconds and after that an attempt to fence the host will be issued.",
> >>> which would indicate that the engine is actually trying to fence the non
> >>> responsive host.
> >> Unfortunately this message is a bit misleading: it's shown every time
> >> we start handling a network exception for the host, and it's fired
> >> before the logic which decides whether to start or skip the fencing
> >> process (this misleading message is fixed in 3.6). But in the current
> >> logic we really execute fencing only when the host status is about to
> >> change from Connecting to Non Responsive, and that happens only on the
> >> first attempt, while we are still inside the DisableFenceAtStartupInSec
> >> interval. During all other attempts the host is already in status Non
> >> Responsive, so fencing is skipped.
> >>
> >>> On 09/24/2015 11:50 AM, Martin Perina wrote:
> >>>> ----- Original Message -----
> >>>>> From: "Eli Mesika" <emesika at redhat.com>
> >>>>> To: "Martin Perina" <mperina at redhat.com>, "Doron Fediuck"
> >>>>> <dfediuck at redhat.com>
> >>>>> Cc: "Michael Hölzl" <mh at ins.jku.at>, users at ovirt.org
> >>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
> >>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
> >>>>> engine
> >>>>> gets shutdown
> >>>>>
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> From: "Martin Perina" <mperina at redhat.com>
> >>>>>> To: "Michael Hölzl" <mh at ins.jku.at>
> >>>>>> Cc: users at ovirt.org
> >>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
> >>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
> >>>>>> engine
> >>>>>> gets shutdown
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> sorry for the late response, but you hit a "corner case" :-(
> >>>>>>
> >>>>>> Let me start by explaining a few things first:
> >>>>>>
> >>>>>> After engine startup there's an interval during which fencing is
> >>>>>> disabled. It's called DisableFenceAtStartupInSec and by default it's
> >>>>>> set to 5 minutes. It can be changed using
> >>>>>>
> >>>>>>    engine-config -s DisableFenceAtStartupInSec=<seconds>
> >>>>>>
> >>>>>> but please do that with caution.
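> >>>>>>
> >>>>>> For illustration only (the value is in seconds, 600 is just an
> >>>>>> example and not a recommendation, and the engine service has to be
> >>>>>> restarted for the change to take effect):
> >>>>>>
> >>>>>>    # show the current value
> >>>>>>    engine-config -g DisableFenceAtStartupInSec
> >>>>>>    # raise the interval to 10 minutes, then restart the engine
> >>>>>>    engine-config -s DisableFenceAtStartupInSec=600
> >>>>>>    service ovirt-engine restart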
> >>>>>>
> >>>>>> Why do we have such a timeout? It's a prevention against fencing
> >>>>>> storms, which could happen during power issues in the whole DC: when
> >>>>>> both engine and hosts are starting, it may take a lot of time for
> >>>>>> huge hosts to come up and for VDSM to start communicating with the
> >>>>>> engine. So usually the engine is started first, and without this
> >>>>>> interval the engine would start fencing hosts which are just
> >>>>>> starting ...
> >>>>>>
> >>>>>> Another thing: if we cannot properly fence the host, we cannot
> >>>>>> determine whether there's just a communication issue between engine
> >>>>>> and host, so we cannot restart HA VMs on another host. The only
> >>>>>> thing we can do is offer the manual "Mark host as rebooted" option
> >>>>>> to the administrator. If the administrator executes this option, we
> >>>>>> try to restart the HA VMs on a different host ASAP, because the
> >>>>>> admin took responsibility for validating that the VMs are really
> >>>>>> not running.
> >>>>>>
> >>>>>>
> >>>>>> When the engine is started, the following fencing-related actions
> >>>>>> are taken:
> >>>>>>
> >>>>>> 1. Get the status of all hosts from the DB and schedule Non
> >>>>>>    Responding Treatment to run after the DisableFenceAtStartupInSec
> >>>>>>    timeout has passed
> >>>>>>
> >>>>>> 2. Try to communicate with all hosts and refresh their status
> >>>>>>
> >>>>>>
> >>>>>> If some host becomes Non Responsive during the
> >>>>>> DisableFenceAtStartupInSec interval, we skip fencing and the
> >>>>>> administrator will see a message in the Events tab saying that the
> >>>>>> host is Non Responsive but fencing is disabled due to the startup
> >>>>>> interval. So the administrator has to take care of such a host
> >>>>>> manually.
> >>>>>>
> >>>>>>
> >>>>>> Now what happened in your case:
> >>>>>>
> >>>>>>  1. Hosted engine VM is running on host1 together with other VMs
> >>>>>>  2. Status of host1 and host2 is Up
> >>>>>>  3. You kill/shutdown host1 -> the hosted engine VM is also shut
> >>>>>>     down -> no engine is running to detect the issue with host1 and
> >>>>>>     change its status to Non Responsive
> >>>>>>  4. In the meantime the hosted engine VM is started on host2 -> it
> >>>>>>     reads the host statuses from the DB, but all hosts are Up -> it
> >>>>>>     tries to communicate with host1, but host1 is unreachable -> so
> >>>>>>     it changes host1 status to Non Responsive and starts Non
> >>>>>>     Responsive Treatment for host1 -> Non Responsive Treatment is
> >>>>>>     aborted because the engine is still within the
> >>>>>>     DisableFenceAtStartupInSec interval
> >>>>>>
> >>>>>> So in a normal deployment (without hosted engine) the admin is
> >>>>>> notified that the host where the engine is running crashed and was
> >>>>>> rebooted, so he has to take a look and do manual steps if needed.
> >>>>>>
> >>>>>> In a hosted engine deployment it's an issue, because the hosted
> >>>>>> engine VM can be restarted on a different host also in cases other
> >>>>>> than crashes (for example, if the host is overloaded, hosted engine
> >>>>>> can stop the hosted engine VM and restart it on a different host,
> >>>>>> but this shouldn't happen too often).
> >>>>>>
> >>>>>> At the moment the only solution for this is manual: the
> >>>>>> administrator is notified that the hosted engine VM was restarted on
> >>>>>> a different host, so the administrator can check manually what
> >>>>>> caused the restart and execute manual steps if needed.
> >>>>>>
> >>>>>> So to summarize: at the moment I don't see any reliable automatic
> >>>>>> solution for this :-( and the fencing storm prevention is more
> >>>>>> important. But feel free to create a bug for this issue; maybe we
> >>>>>> can think of at least some improvement for this use case.
> >>>>> Thanks for the detailed explanation Martin.
> >>>>> Really a corner case; let's see if we get more input on that from
> >>>>> other users.
> >>>>> Maybe when the hosted engine VM is restarted on another node we can
> >>>>> ask for the reason and act accordingly.
> >>>>> Doron, with the current implementation, is the reason for a hosted
> >>>>> engine VM restart stored anywhere?
> >>>> I have already discussed this with Martin Sivak, and hosted engine
> >>>> doesn't touch the engine db at all. We discussed with Martin a
> >>>> possible solution, which we could do in master and maybe in 3.6 if
> >>>> agreed:
> >>>>
> >>>>  1. Just after engine start we can read from the db the name of the
> >>>>     host which the hosted engine VM is running on and store it
> >>>>     somewhere in memory for Non Responding Treatment
> >>>>
> >>>>  2. As a part of Non Responding Treatment we can add some hosted
> >>>>     engine specific logic:
> >>>>       IF we are running as hosted engine AND
> >>>>          we are inside the DisableFenceAtStartupInSec interval AND
> >>>>          the non responsive host is the host stored in step 1 AND
> >>>>          the hosted engine VM is running on a different host
> >>>>       THEN
> >>>>          execute fencing for the non responsive host even though we
> >>>>          are inside the DisableFenceAtStartupInSec interval
> >>>>
> >>>> But it can cause an unnecessary fence in the case that the whole
> >>>> datacenter recovers from a power failure.
> >>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>> Martin Perina
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Michael Hölzl" <mh at ins.jku.at>
> >>>>>>> To: "Martin Perina" <mperina at redhat.com>
> >>>>>>> Cc: users at ovirt.org
> >>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
> >>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with
> >>>>>>> engine
> >>>>>>> gets shutdown
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> The whole engine.log including the shutdown time (the shutdown was
> >>>>>>> performed around 9:19):
> >>>>>>> http://pastebin.com/cdY9uTkJ
> >>>>>>>
> >>>>>>> vdsm.log of host01 (the host which kept on running and took over
> >>>>>>> the engine), split into 3 uploads (pastebin's 512 kB limit):
> >>>>>>> 1 : http://pastebin.com/dr9jNTek
> >>>>>>> 2 : http://pastebin.com/cuyHL6ne
> >>>>>>> 3 : http://pastebin.com/7x2ZQy1y
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> could you please post the whole engine.log (from the time at which
> >>>>>>>> you turned off the host with the engine VM) and also vdsm.log from
> >>>>>>>> both hosts?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>>> Martin Perina
> >>>>>>>>
> >>>>>>>> ----- Original Message -----
> >>>>>>>>> From: "Michael Hölzl" <mh at ins.jku.at>
> >>>>>>>>> To: users at ovirt.org
> >>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
> >>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with
> >>>>>>>>> engine
> >>>>>>>>> gets
> >>>>>>>>>        shutdown
> >>>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> we are trying to set up an oVirt environment with two hosts, both
> >>>>>>>>> connected to an iSCSI storage device, with a hosted engine and
> >>>>>>>>> power management configured over iLO. So far it seems to work fine
> >>>>>>>>> in our testing setup, and starting/stopping VMs works smoothly
> >>>>>>>>> with proper scheduling between those hosts. So we wanted to test
> >>>>>>>>> HA for the VMs now and started to manually shut down a host while
> >>>>>>>>> there were still VMs running on that machine (to simulate a power
> >>>>>>>>> failure or a kernel panic). The expected outcome was that all
> >>>>>>>>> machines where HA is enabled are booted again. This works if the
> >>>>>>>>> machine with the failure does not have the engine running. If the
> >>>>>>>>> machine with the hosted engine VM gets shut down, the host gets
> >>>>>>>>> into the "Not Responsive" state and all VMs end up in an unknown
> >>>>>>>>> state. However, the engine itself starts correctly on the second
> >>>>>>>>> host and it seems like it tries to fence the other host (as
> >>>>>>>>> expected) - events which we get in the Open Virtualization Manager:
> >>>>>>>>> 1. Host hosted_engine_2 is non responsive
> >>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a
> >>>>>>>>>    proxy to execute Status command on Host hosted_engine_2.
> >>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power
> >>>>>>>>>    management configured. Please check the host status, manually
> >>>>>>>>>    reboot it, and click "Confirm Host Has Been Rebooted"
> >>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in
> >>>>>>>>>    Connecting state for a grace period of 124 seconds and after
> >>>>>>>>>    that an attempt to fence the host will be issued.
> >>>>>>>>>
> >>>>>>>>> Event 4 keeps coming every 3 minutes. Complete engine.log file
> >>>>>>>>> during engine boot up: http://pastebin.com/D6xS3Wfy
> >>>>>>>>> So the engine detects that the machine is not responding and
> >>>>>>>>> wants to fence it. But although the host has power management
> >>>>>>>>> configured over iLO, the engine thinks that it does not. As a
> >>>>>>>>> result the second host does not get fenced and the VMs are not
> >>>>>>>>> migrated to the running machine.
> >>>>>>>>> In the log files there are also a lot of timeout exceptions, but
> >>>>>>>>> I guess that this is because the host cannot connect to the other
> >>>>>>>>> machine.
> >>>>>>>>>
> >>>>>>>>> Did anybody face similar problems with HA? Or any clue what the
> >>>>>>>>> problem
> >>>>>>>>> might be?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Michael
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> ----
> >>>>>>>>> oVirt version: 3.5.4
> >>>>>>>>> Hosted engine VM OS: CentOS 6.5
> >>>>>>>>> Host machines OS: CentOS 7
> >>>>>>>>>
> >>>>>>>>> P.S. We also have to note that we had problems with the command
> >>>>>>>>> fence_ipmilan at the beginning. We were receiving the message
> >>>>>>>>> "Unable to obtain correct plug status or plug is not available"
> >>>>>>>>> whenever fence_ipmilan was called. However, the command fence_ilo4
> >>>>>>>>> worked. So we now use a simple script in place of fence_ipmilan
> >>>>>>>>> that calls fence_ilo4 and passes the arguments along (a minimal
> >>>>>>>>> sketch follows below).
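> >>>>>>>>>
> >>>>>>>>> Roughly like this (a minimal sketch; the fence_ilo4 path is the
> >>>>>>>>> usual location from the fence-agents package and may differ on
> >>>>>>>>> other systems, and fence agents also read their options from
> >>>>>>>>> stdin, which exec forwards unchanged):
> >>>>>>>>>
> >>>>>>>>> #!/bin/sh
> >>>>>>>>> # stand-in for fence_ipmilan: hand everything over to fence_ilo4,
> >>>>>>>>> # both the command-line arguments and the options piped on stdin
> >>>>>>>>> exec /usr/sbin/fence_ilo4 "$@"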
> 


