Thanks for the help! I will definitely stay tuned for updates on this
matter.
Michael
On 09/24/2015 03:13 PM, Martin Perina wrote:
I created a bug covering this:
https://bugzilla.redhat.com/show_bug.cgi?id=1266099
----- Original Message -----
> From: "Martin Sivak" <msivak(a)redhat.com>
> To: "Michael Hölzl" <mh(a)ins.jku.at>
> Cc: "Martin Perina" <mperina(a)redhat.com>, users(a)ovirt.org
> Sent: Thursday, September 24, 2015 2:59:52 PM
> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>
> Hi Michael,
>
> Martin summed up the situation neatly; I would just add that this issue
> is not limited to the size of your setup. The same would happen to HA
> VMs running on the same host as the hosted engine even if the cluster
> had 50 hosts...
>
> About the recommended way of engine deployment: It really comes down to
> whether you can tolerate your engine being down for a longer time
> (starting another host using a backup db).
>
> Hosted engine restores your management in an automated way and without
> any data loss. However I agree that the fact that you have to tend to
> your HA VMs manually after an engine restart is not nice. Fortunately
> that should only happen when your host (or vdsm) dies and does not
> come up for an extended period of time.
>
> The summary would be: there will be no HA handling if the host
> running the engine is down, independently of whether the deployment is
> hosted engine or standalone engine. If the issue is related to the
> software only, then there is no real difference.
>
> - When a host with the standalone engine dies, the VMs are fine, but
> if anything happens while the engine is down (and reinstalling a
> standalone engine takes time + you need a very fresh db backup) you
> might again face issues with HA VMs being down or not starting when
> the engine comes up.
>
> - When a hosted engine dies because of a host failure, some VMs
> generally disappear with it. The engine will come up automatically and
> HA VMs from the original hosts have to be manually pushed to work.
> This requires some manual action, but I see it as less demanding than
> the first case.
>
> - When a hosted engine VM is stopped properly by the tooling it will
> be restarted elsewhere and it will be able to connect to the original
> host just fine. The engine will then make sure that all HA VMs are up
> even if the VMs died while the engine was down.
>
> So I would recommend a hosted engine based deployment. And I ask for a bit
> of patience, as we have a plan for how to mitigate the second case to some
> extent without compromising the fencing storm prevention.
>
> Best regards
>
> --
> Martin Sivak
> msivak(a)redhat.com
> SLA RHEV-M
>
>
> On Thu, Sep 24, 2015 at 2:31 PM, Michael Hölzl <mh(a)ins.jku.at> wrote:
>> Ok, thanks!
>>
>> So, I would still like to know whether you would recommend not using a
>> hosted engine but rather a separate machine for the engine?
>>
>> On 09/24/2015 01:24 PM, Martin Perina wrote:
>>> ----- Original Message -----
>>>> From: "Michael Hölzl" <mh(a)ins.jku.at>
>>>> To: "Martin Perina" <mperina(a)redhat.com>, "Eli Mesika"
>>>> <emesika(a)redhat.com>
>>>> Cc: "Doron Fediuck" <dfediuck(a)redhat.com>, users(a)ovirt.org
>>>> Sent: Thursday, September 24, 2015 12:35:13 PM
>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>
>>>> Hi,
>>>>
>>>> thanks for the detailed answer! In principle, I understand the issue
>>>> now. However, I cannot fully follow the argument that this is a corner
>>>> case. In a small or medium-sized company, I would assume that such a
>>>> setup, consisting of two machines with a hosted engine, is not uncommon.
>>>> Especially as there is documentation online which describes how to
>>>> deploy this setup. Does that mean that hosted engines are in general not
>>>> recommended?
>>>>
>>>> I am also wondering why the fencing could not be triggered by the hosted
>>>> engine after the DisableFenceAtStartupInSec timeout? In the events log
>>>> of the engine I keep on getting the message "Host hosted_engine_2 is not
>>>> responding. It will stay in Connecting state for a grace period of 120
>>>> seconds and after that an attempt to fence the host will be issued.",
>>>> which would indicate that the engine is actually trying to fence the non
>>>> responsive host.
>>> Unfortunately this message is a bit misleading: it's shown every time
>>> we start handling a network exception for the host, and it's fired before
>>> the logic that decides whether to start or skip the fencing process (this
>>> misleading message is fixed in 3.6). But in the current logic we really
>>> execute fencing only when the host status is about to change from
>>> Connecting to Non Responsive, and this happens only the first time, while
>>> we are still in the DisableFenceAtStartupInSec interval. During all other
>>> attempts the host is already in status Non Responsive, so fencing is
>>> skipped.
>>>
>>>> On 09/24/2015 11:50 AM, Martin Perina wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Eli Mesika" <emesika(a)redhat.com>
>>>>>> To: "Martin Perina" <mperina(a)redhat.com>, "Doron Fediuck"
>>>>>> <dfediuck(a)redhat.com>
>>>>>> Cc: "Michael Hölzl" <mh(a)ins.jku.at>, users(a)ovirt.org
>>>>>> Sent: Thursday, September 24, 2015 11:38:39 AM
>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Martin Perina" <mperina(a)redhat.com>
>>>>>>> To: "Michael Hölzl" <mh(a)ins.jku.at>
>>>>>>> Cc: users(a)ovirt.org
>>>>>>> Sent: Thursday, September 24, 2015 11:02:21 AM
>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> sorry for the late response, but you hit a "corner case" :-(
>>>>>>>
>>>>>>> Let me start by explaining a few things first:
>>>>>>>
>>>>>>> After startup of the engine there's an interval during which fencing
>>>>>>> is disabled. It's called DisableFenceAtStartupInSec and by default it's
>>>>>>> set to 5 minutes. It can be changed using
>>>>>>>
>>>>>>> engine-config -s DisableFenceAtStartupInSec=<seconds>
>>>>>>>
>>>>>>> but please do that with caution.
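[Editorial note: for illustration, changing it could look roughly like this on the engine machine. This is a hedged sketch, not from the original mail: 600 seconds is just an example value, and engine-config changes only take effect after an engine restart.]

```shell
# Show the current startup fencing grace period (default is 5 minutes).
engine-config -g DisableFenceAtStartupInSec

# Example only: raise the grace period to 10 minutes.
engine-config -s DisableFenceAtStartupInSec=600

# engine-config changes only apply after the engine service is restarted
# (the engine VM in this thread runs CentOS 6, hence "service", not systemctl).
service ovirt-engine restart
```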
>>>>>>>
>>>>>>> Why do we have such a timeout? It prevents a fencing storm, which
>>>>>>> could happen during power issues in the whole DC: when both engine
>>>>>>> and hosts are started, it may take a lot of time for huge hosts to
>>>>>>> come up and for VDSM to start communicating with the engine. So
>>>>>>> usually the engine is started first, and without this interval the
>>>>>>> engine would start fencing hosts which are just starting ...
>>>>>>>
>>>>>>> Another thing: if we cannot properly fence the host, we cannot
>>>>>>> determine whether there is just a communication issue between engine
>>>>>>> and host, so we cannot restart HA VMs on another host. The only
>>>>>>> thing we can do is to offer the manual "Mark host as rebooted"
>>>>>>> option to the administrator. If the administrator executes this
>>>>>>> option, we try to restart HA VMs on a different host ASAP, because
>>>>>>> the admin took the responsibility of validating that the VMs are
>>>>>>> really not running.
>>>>>>>
>>>>>>>
>>>>>>> When the engine is started, the following actions related to fencing
>>>>>>> are taken:
>>>>>>>
>>>>>>> 1. Get the status of all hosts from the DB and schedule Non Responding
>>>>>>> Treatment after the DisableFenceAtStartupInSec timeout has passed
>>>>>>>
>>>>>>> 2. Try to communicate with all hosts and refresh their status
>>>>>>>
>>>>>>>
>>>>>>> If some host becomes Non Responsive during the
>>>>>>> DisableFenceAtStartupInSec interval, we skip fencing and the
>>>>>>> administrator will see a message in the Events tab that the host is
>>>>>>> Non Responsive, but fencing is disabled due to the startup interval.
>>>>>>> So the administrator has to take care of such a host manually.
>>>>>>>
>>>>>>>
>>>>>>> Now what happened in your case:
>>>>>>>
>>>>>>> 1. Hosted engine VM is running on host1 with other VMs
>>>>>>> 2. Status of host1 and host2 is Up
>>>>>>> 3. You kill/shutdown host1 -> hosted engine VM is also shut down ->
>>>>>>> no engine is running to detect the issue with host1 and change its
>>>>>>> status to Non Responsive
>>>>>>> 4. In the meantime the hosted engine VM is started on host2 -> it
>>>>>>> will read host status from the DB, but all hosts are up -> it will
>>>>>>> try to communicate with host1, but it's unreachable -> so it changes
>>>>>>> host1 status to Non Responsive and starts Non Responsive Treatment
>>>>>>> for host1 -> Non Responsive Treatment is aborted because the engine
>>>>>>> is still in the DisableFenceAtStartupInSec interval
>>>>>>>
>>>>>>>
>>>>>>> So in a normal deployment (without hosted engine) the admin is
>>>>>>> notified that the host where the engine is running crashed and was
>>>>>>> rebooted, so he has to take a look and do manual steps if needed.
>>>>>>>
>>>>>>> In a hosted engine deployment it's an issue because the hosted
>>>>>>> engine VM can be restarted on a different host also in cases other
>>>>>>> than crashes (for example if the host is overloaded, hosted engine
>>>>>>> can stop the hosted engine VM and restart it on a different host,
>>>>>>> but this shouldn't happen too often).
>>>>>>>
>>>>>>> At the moment the only solution for this is manual: let the
>>>>>>> administrator be notified that the hosted engine VM was restarted on
>>>>>>> a different host, so the administrator can check manually what the
>>>>>>> cause of this restart was and execute manual steps if needed.
>>>>>>>
>>>>>>> So to summarize: at the moment I don't see any reliable automatic
>>>>>>> solution for this :-( and fencing storm prevention is more
>>>>>>> important. But feel free to create a bug for this issue, maybe we
>>>>>>> can think of at least some improvement for this use case.
>>>>>> Thanks for the detailed explanation Martin.
>>>>>> Really a corner case, let's see if we get more input on that from
>>>>>> other users.
>>>>>> Maybe when the hosted engine VM is restarted on another node we can
>>>>>> ask for the reason and act accordingly.
>>>>>> Doron, with the current implementation, is the reason for the hosted
>>>>>> engine VM restart stored anywhere?
>>>>> I have already discussed this with Martin Sivak and hosted engine
>>>>> doesn't touch the engine db at all. We discussed with Martin this
>>>>> possible solution, which we could do in master and maybe in 3.6 if
>>>>> agreed:
>>>>>
>>>>> 1. Just after the start of the engine we can read from the db the name
>>>>>    of the host which the hosted engine VM is running on and store it
>>>>>    somewhere in memory for Non Responding Treatment
>>>>>
>>>>> 2. As a part of Non Responding Treatment we can add some hosted engine
>>>>>    specific logic:
>>>>>      IF we are running as hosted engine AND
>>>>>         we are inside the DisableFenceAtStartupInSec interval AND
>>>>>         the non responsive host is the host stored above in step 1. AND
>>>>>         the hosted engine VM is running on a different host
>>>>>      THEN
>>>>>         execute fencing for the non responsive host even when we are
>>>>>         inside the DisableFenceAtStartupInSec interval
>>>>>
>>>>> But it can cause an unnecessary fence in the case that the whole
>>>>> datacenter recovers from a power failure.
>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Martin Perina
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Michael Hölzl" <mh(a)ins.jku.at>
>>>>>>>> To: "Martin Perina" <mperina(a)redhat.com>
>>>>>>>> Cc: users(a)ovirt.org
>>>>>>>> Sent: Monday, September 21, 2015 4:47:06 PM
>>>>>>>> Subject: Re: [ovirt-users] HA - Fencing not working when
host with
>>>>>>>> engine
>>>>>>>> gets shutdown
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The whole engine.log including the shutdown time (was performed
>>>>>>>> around 9:19):
>>>>>>>> http://pastebin.com/cdY9uTkJ
>>>>>>>>
>>>>>>>> vdsm.log of host01 (the host which kept on running and took over
>>>>>>>> the engine) split into 3 uploads (limit of 512 kB of pastebin):
>>>>>>>> 1 : http://pastebin.com/dr9jNTek
>>>>>>>> 2 : http://pastebin.com/cuyHL6ne
>>>>>>>> 3 : http://pastebin.com/7x2ZQy1y
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On 09/21/2015 03:00 PM, Martin Perina wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> could you please post the whole engine.log (from the time when you
>>>>>>>>> turned off the host with the engine VM) and also vdsm.log from
>>>>>>>>> both hosts?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Martin Perina
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Michael Hölzl" <mh(a)ins.jku.at>
>>>>>>>>>> To: users(a)ovirt.org
>>>>>>>>>> Sent: Monday, September 21, 2015 10:27:08 AM
>>>>>>>>>> Subject: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> we are trying to set up an oVirt environment with two hosts, both
>>>>>>>>>> connected to an iSCSI storage device, a hosted engine and power
>>>>>>>>>> management configured over iLO. So far it seems to work fine in
>>>>>>>>>> our testing setup, and starting/stopping VMs works smoothly with
>>>>>>>>>> proper scheduling between those hosts. So we wanted to test HA
>>>>>>>>>> for the VMs now and started to manually shut down a host while
>>>>>>>>>> there are still VMs running on that machine (to simulate a power
>>>>>>>>>> failure or a kernel panic). The expected outcome was that all
>>>>>>>>>> machines where HA is enabled are booted again. This works if the
>>>>>>>>>> machine with the failure does not have the engine running. If the
>>>>>>>>>> machine with the hosted engine VM gets shut down, the host gets
>>>>>>>>>> into the "Not Responsive" state and all VMs end up in an unknown
>>>>>>>>>> state. However, the engine itself starts correctly on the second
>>>>>>>>>> host and it seems like it tries to fence the other host (as
>>>>>>>>>> expected) - events which we get in the Open Virtualization
>>>>>>>>>> Manager:
>>>>>>>>>> 1. Host hosted_engine_2 is non responsive
>>>>>>>>>> 2. Host hosted_engine_1 from cluster Default was chosen as a
>>>>>>>>>> proxy to execute Status command on Host hosted_engine_2.
>>>>>>>>>> 3. Host hosted_engine_2 became non responsive. It has no power
>>>>>>>>>> management configured. Please check the host status, manually
>>>>>>>>>> reboot it, and click "Confirm Host Has Been Rebooted"
>>>>>>>>>> 4. Host hosted_engine_2 is not responding. It will stay in
>>>>>>>>>> Connecting state for a grace period of 124 seconds and after that
>>>>>>>>>> an attempt to fence the host will be issued.
>>>>>>>>>>
>>>>>>>>>> Event 4 keeps coming continuously every 3 minutes. Complete
>>>>>>>>>> engine.log file during engine boot up:
>>>>>>>>>> http://pastebin.com/D6xS3Wfy
>>>>>>>>>> So the host detects that the machine is not responding and wants
>>>>>>>>>> to fence it. But although the host has power management
>>>>>>>>>> configured over iLO, the engine thinks that it does not. As a
>>>>>>>>>> result the second host does not get fenced and VMs are not
>>>>>>>>>> migrated to the running machine.
>>>>>>>>>> In the log files there are also a lot of timeout exceptions. But
>>>>>>>>>> I guess that this is because the host cannot connect to the other
>>>>>>>>>> machine.
>>>>>>>>>>
>>>>>>>>>> Did anybody face similar problems with HA? Or any clue what the
>>>>>>>>>> problem might be?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ----
>>>>>>>>>> ovirt version: 3.5.4
>>>>>>>>>> Hosted engine VM OS: CentOS 6.5
>>>>>>>>>> Host Machines OS: CentOS 7
>>>>>>>>>>
>>>>>>>>>> P.S. We also have to note that we had problems with the command
>>>>>>>>>> fence_ipmilan at the beginning. We were receiving the message
>>>>>>>>>> "Unable to obtain correct plug status or plug is not available,"
>>>>>>>>>> whenever the command fence_ipmilan was called. However, the
>>>>>>>>>> command fence_ilo4 worked. So we use a simple script for
>>>>>>>>>> fence_ipmilan now that calls fence_ilo4 and passes the arguments.
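[Editorial note: a minimal sketch of such a wrapper, not from the original mail. The paths are assumptions: it writes to /tmp so it is harmless to run, whereas a real wrapper would have to sit wherever the fence proxy expects fence_ipmilan, e.g. /usr/sbin/fence_ipmilan.]

```shell
# Create a one-line stand-in for fence_ipmilan that simply delegates to
# fence_ilo4, forwarding stdin and all command-line arguments unchanged.
cat > /tmp/fence_ipmilan <<'EOF'
#!/bin/sh
exec /usr/sbin/fence_ilo4 "$@"
EOF
chmod +x /tmp/fence_ipmilan
```

Because fence agents read their options from stdin as well as argv, the `exec ... "$@"` form forwards both without modification.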
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Users mailing list
>>>>>>>>>> Users(a)ovirt.org
>>>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users