Hi,
sorry for the late response, but you hit a "corner case" :-(
Let me start by explaining a few things first:
After engine startup there's an interval during which fencing is
disabled. It's controlled by the DisableFenceAtStartupInSec option and
by default it's set to 5 minutes. It can be changed using
engine-config -s DisableFenceAtStartupInSec=<seconds>
but please do that with caution.
Why do we have such a timeout? It prevents a fencing storm, which
could happen during power issues in the whole DC: when both the engine and
the hosts are started, large hosts may take a long time to come up and for
VDSM to start communicating with the engine. So usually the engine is started
first, and without this interval the engine would start fencing hosts which
are just starting up ...
Another thing: if we cannot properly fence a host, we cannot determine
whether there is just a communication issue between the engine and the host,
so we cannot restart HA VMs on another host. The only thing we can do is to
offer the manual "Confirm Host Has Been Rebooted" option to the administrator.
If the administrator executes this option, we try to restart the HA VMs on a
different host ASAP, because the admin has taken the responsibility of
verifying that the VMs are really not running.
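The rule above can be sketched as follows (a minimal illustration with hypothetical names, not the actual engine code):

```python
def may_restart_ha_vms(fence_succeeded: bool, admin_confirmed_reboot: bool) -> bool:
    """HA VMs may only be restarted on another host when we are certain the
    original host is really down: either fencing succeeded, or the
    administrator explicitly clicked "Confirm Host Has Been Rebooted"
    and thereby took responsibility for the verification."""
    return fence_succeeded or admin_confirmed_reboot
```

In every other case the engine must assume the VMs might still be running on the unreachable host, and restarting them elsewhere could corrupt shared storage.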
When the engine is started, the following fencing-related actions are taken:
1. Get the status of all hosts from the DB and schedule Non Responsive
Treatment after the DisableFenceAtStartupInSec timeout has passed
2. Try to communicate with all hosts and refresh their status
If some host becomes Non Responsive during the DisableFenceAtStartupInSec
interval, we skip fencing and the administrator will see a message in the
Events tab that the host is Non Responsive but fencing is disabled due to the
startup interval. So the administrator has to take care of such a host manually.
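The startup window check itself boils down to something like this (a hedged sketch, assuming the default 300-second value; names are hypothetical, not the engine's actual internals):

```python
import time

DISABLE_FENCE_AT_STARTUP_IN_SEC = 300  # engine default: 5 minutes


def fencing_allowed(engine_start_time, now=None):
    """Fencing is skipped while the engine is still inside the
    DisableFenceAtStartupInSec window after startup; a host that goes
    Non Responsive in that window only produces an event in the Events
    tab, and the administrator must handle it manually."""
    if now is None:
        now = time.time()
    return (now - engine_start_time) >= DISABLE_FENCE_AT_STARTUP_IN_SEC
```

With this logic, a Non Responsive Treatment scheduled 100 seconds after engine startup is aborted, while one scheduled after 301 seconds proceeds to fence the host.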
Now what happened in your case:
1. The hosted engine VM is running on host1 along with other VMs
2. The status of host1 and host2 is Up
3. You kill/shut down host1 -> the hosted engine VM is also shut down -> no
engine is running to detect the issue with host1 and change its status to Non
Responsive
4. In the meantime the hosted engine VM is started on host2 -> it reads the
host statuses from the DB, but all hosts are Up -> it tries to communicate with
host1, but it's unreachable -> so it changes host1's status to Non Responsive
and starts Non Responsive Treatment for host1 -> the Non Responsive Treatment
is aborted because the engine is still within the DisableFenceAtStartupInSec
interval
So in a normal deployment (without hosted engine) the admin is notified that
the host where the engine is running crashed and was rebooted, so he has to
take a look and perform manual steps if needed.
In a hosted engine deployment it's an issue, because the hosted engine VM can
be restarted on a different host also in cases other than crashes (for
example, if a host is overloaded, hosted engine can stop the hosted engine VM
and restart it on a different host, but this shouldn't happen too often).
At the moment the only solution for this is manual: let the administrator be
notified that the hosted engine VM was restarted on a different host, so the
administrator can check manually what the cause of the restart was and execute
manual steps if needed.
So to summarize: at the moment I don't see any reliable automatic solution
for this :-( and fencing storm prevention is more important. But feel free to
create a bug for this issue; maybe we can think of at least some improvement
for this use case.
Thanks
Martin Perina
----- Original Message -----
From: "Michael Hölzl" <mh(a)ins.jku.at>
To: "Martin Perina" <mperina(a)redhat.com>
Cc: users(a)ovirt.org
Sent: Monday, September 21, 2015 4:47:06 PM
Subject: Re: [ovirt-users] HA - Fencing not working when host with engine gets shutdown
Hi,
The whole engine.log including the shutdown time (was performed around 9:19)
http://pastebin.com/cdY9uTkJ
vdsm.log of host01 (the host which kept on running and took over the
engine) split into 3 uploads (limit of 512 kB of pastebin):
1 :
http://pastebin.com/dr9jNTek
2 :
http://pastebin.com/cuyHL6ne
3 :
http://pastebin.com/7x2ZQy1y
Michael
On 09/21/2015 03:00 PM, Martin Perina wrote:
> Hi,
>
> could you please post whole engine.log (from the time which you turned off
> the host with engine VM) and also vdsm.log from both hosts?
>
> Thanks
>
> Martin Perina
>
> ----- Original Message -----
>> From: "Michael Hölzl" <mh(a)ins.jku.at>
>> To: users(a)ovirt.org
>> Sent: Monday, September 21, 2015 10:27:08 AM
>> Subject: [ovirt-users] HA - Fencing not working when host with engine gets
>> shutdown
>>
>> Hi all,
>>
>> we are trying to setup an ovirt environment with two hosts, both
>> connected to a ISCSI storage device, a hosted engine and power
>> management configured over ILO. So far it seems to work fine in our
>> testing setup and starting/stopping VMs works smoothly with proper
>> scheduling between those hosts. So we wanted to test HA for the VMs now
>> and started to manually shutdown a host while there are still VMs
>> running on that machine (to simulate power failure or a kernel panic).
>> The expected outcome was that all machines where HA is enabled are
>> booted again. This works if the machine with the failure does not have
>> the engine running. If the machine with the hosted engine VM gets
>> shutdown, the host gets in the "Not Responsive" state and all VMs end up
>> in an unknown state. However, the engine itself starts correctly on the
>> second host and it seems like it tries to fence the other host (as
>> expected) - Events which we get in the open virtualization manager:
>> 1. Host hosted_engine_2 is non responsive
>> 2. Host hosted_engine_1 from cluster Default was chosen as a proxy to
>> execute Status command on Host hosted_engine_2.
>> 3. Host hosted_engine_2 became non responsive. It has no power
>> management configured. Please check the host status, manually reboot it,
>> and click "Confirm Host Has Been Rebooted"
>> 4. Host hosted_engine_2 is not responding. It will stay in Connecting
>> state for a grace period of 124 seconds and after that an attempt to
>> fence the host will be issued.
>>
>> Event 4 is continuously coming every 3 minutes. Complete engine.log file
>> during engine boot up:
>> http://pastebin.com/D6xS3Wfy
>> So the host detects the machine is not responding and wants to fence it.
>> But although the host has power management configured over ILO, the
>> engine thinks that it is not. As a result the second host does not get
>> fenced and VMs are not migrated to the running machine.
>> In the log files there are also a lot of timeout exceptions. But I guess
>> that this is because the host cannot connect to the other machine.
>>
>> Did anybody face similar problems with HA? Or any clue what the problem
>> might be?
>>
>> Thanks,
>> Michael
>>
>>
>> ----
>> ovirt version: 3.5.4
>> Hosted engine VM OS: CentOS 6.5
>> Host Machines OS: CentOS 7
>>
>> P.S. We also have to note that we had problems with the command
>> fence_ipmilan at the beginning. We were receiving the message "Unable to
>> obtain correct plug status or plug is not available," whenever the
>> command fence_ipmilan was called. However, the command fence_ilo4
>> worked. So we use a simple script for fence_ipmilan now that calls
>> fence_ilo4 and passes the arguments.
>> _______________________________________________
>> Users mailing list
>> Users(a)ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>