Hi all,
we are trying to set up an oVirt environment with two hosts, both
connected to an iSCSI storage device, a hosted engine, and power
management configured over iLO. So far it seems to work fine in our
testing setup and starting/stopping VMs works smoothly with proper
scheduling between those hosts. So we wanted to test HA for the VMs now
and started to manually shut down a host while VMs were still
running on that machine (to simulate a power failure or a kernel panic).
The expected outcome was that all VMs where HA is enabled are
booted again. This works if the machine with the failure does not have
the engine running. If the machine with the hosted engine VM gets
shut down, the host goes into the "Not Responsive" state and all VMs end up
in an unknown state. However, the engine itself starts correctly on the
second host, and it seems to try to fence the other host (as
expected). These are the events we get in the Open Virtualization Manager:
1. Host hosted_engine_2 is non responsive
2. Host hosted_engine_1 from cluster Default was chosen as a proxy to
execute Status command on Host hosted_engine_2.
3. Host hosted_engine_2 became non responsive. It has no power
management configured. Please check the host status, manually reboot it,
and click "Confirm Host Has Been Rebooted"
4. Host hosted_engine_2 is not responding. It will stay in Connecting
state for a grace period of 124 seconds and after that an attempt to
fence the host will be issued.
Event 4 repeats every 3 minutes. The complete engine.log file
from the engine boot-up:
http://pastebin.com/D6xS3Wfy
So the host detects that the machine is not responding and wants to fence it.
But although the host has power management configured over iLO, the
engine thinks that it does not. As a result, the second host does not get
fenced and the VMs are not migrated to the running machine.
The log files also contain a lot of timeout exceptions. But I guess
that this is because the host cannot connect to the other machine.
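To separate "the fence agent does not work" from "the engine does not see the power management settings", the agent can be queried by hand from the surviving host. The following is only a minimal sketch of such a check: the address and credentials are placeholders, the helper names are made up for illustration, and it assumes fence_ilo4 is on the PATH and accepts the usual fence-agent short options (-a address, -l login, -p password, -o action).

```python
import subprocess

# Assumption: fence_ilo4 is installed and resolvable via PATH on this host.
FENCE_AGENT = "fence_ilo4"


def build_status_command(address, username, password, agent=FENCE_AGENT):
    """Build the command line that asks the iLO for the power status."""
    return [
        agent,
        "-a", address,    # iLO address of the host to query
        "-l", username,   # iLO login
        "-p", password,   # iLO password (visible in the process list)
        "-o", "status",   # only query the state, do not power anything off
    ]


def check_power_status(address, username, password):
    """Run the status query and return (exit code, combined output)."""
    result = subprocess.run(
        build_status_command(address, username, password),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Placeholder values, not taken from our setup.
    code, output = check_power_status("192.0.2.10", "admin", "secret")
    print("exit code:", code)
    print(output)
```

If this query answers correctly from the host that the engine picks as proxy, the agent and the iLO credentials are probably fine, and the problem is more likely in what the engine has stored for that host.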
Has anybody faced similar problems with HA? Or does anyone have a clue what
the problem might be?
Thanks,
Michael
----
oVirt version: 3.5.4
Hosted engine VM OS: CentOS 6.5
Host machines OS: CentOS 7
P.S. We should also note that we had problems with the command
fence_ipmilan at the beginning. We were receiving the message "Unable to
obtain correct plug status or plug is not available," whenever the
command fence_ipmilan was called. However, the command fence_ilo4
worked. So we now use a simple script in place of fence_ipmilan that calls
fence_ilo4 and passes the arguments through.
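For reference, a wrapper of this kind could look roughly like the sketch below. This is not our exact script, just an illustration of the idea. It assumes fence_ilo4 is resolvable via PATH, and that whatever calls the wrapper passes its options either as command-line arguments or on stdin (the replaced process inherits stdin unchanged).

```python
#!/usr/bin/env python
"""Hypothetical stand-in for fence_ipmilan that delegates to fence_ilo4."""
import os
import sys

# Assumption: fence_ilo4 can be found via PATH.
TARGET_AGENT = "fence_ilo4"


def build_command(argv):
    """Return the fence_ilo4 command line for the given wrapper argv."""
    # argv[0] is the wrapper's own name; everything after it is passed on.
    return [TARGET_AGENT] + list(argv[1:])


def main(argv):
    command = build_command(argv)
    # Replace this process with fence_ilo4, so that stdin and the exit
    # status are those of the real agent and not of the wrapper.
    os.execvp(command[0], command)


if __name__ == "__main__":
    main(sys.argv)
```

Using exec instead of starting a child process keeps the wrapper transparent: the caller sees the exit code and the output of fence_ilo4 directly.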