On Thu, Aug 5, 2021 at 5:45 PM Gianluca Cecchi
<gianluca.cecchi(a)gmail.com> wrote:
Hello,
suppose the latest 4.4.7 environment, installed with an external engine and two
hosts, one in one site and one in another.
For storage I have one FC storage domain.
I try to simulate a sort of "site failure scenario" to see what kind of HA I
should expect.
The 2 hosts have power mgmt configured through fence_ipmilan.
I have 2 VMs, one configured as HA with a lease on storage (Resume Behavior: kill)
and one not marked as HA.
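For completeness, the HA and lease settings of a VM can also be checked through
the engine REST API; a minimal example, assuming a hypothetical engine FQDN,
admin credentials and VM id:
# curl -s -k -u 'admin@internal:***' 'https://engine.example.com/ovirt-engine/api/vms/<vm-id>'
and then looking at the high_availability and lease elements in the returned XML.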
Initially host1 is SPM and it is the host that runs the two VMs.
Fencing of host1 from host2 initially works OK. I can also test it from the command line:
# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S
/usr/local/bin/pwd.sh -o status
Status: ON
On host2 I then prevent it from reaching host1's iDRAC:
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623
-j DROP
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 1 -j ACCEPT
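These are non-permanent direct rules; just for reference, something like the
following should list them and remove them again after the test:
# firewall-cmd --direct --get-all-rules
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 1 -j ACCEPT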
Why do you need to prevent access from host2 to host1? Hosts do not
access each other unless you migrate VMs between them.
so that:
# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S
/usr/local/bin/pwd.sh -o status
2021-08-05 15:06:07,254 ERROR: Failed: Unable to obtain correct plug status or plug is
not available
On host1 I generate panic:
# date ; echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
Thu Aug 5 15:06:24 CEST 2021
host1 correctly completes its crash dump (kdump integration is enabled) and reboots,
but I stop it at the GRUB prompt, so that from host2's point of view host1 is
unreachable and its power state cannot be determined by fencing.
Crashing the host and preventing it from booting is fine, but isn't it
simpler to stop the host using power management?
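For example, something like this (reusing the same fence agent options as your
status check above; just an illustration, not tested here) should power the host
off from the fencing point of view:
# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S /usr/local/bin/pwd.sh -o off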
At this point I thought that the VM lease functionality would kick in and host2
would be able to restart the HA VM, since it is able to see that the lease is not
held by the other host and so it can acquire the lock itself....
Once host1 disappears from the system, the engine should detect that the HA VM
is in an unknown state, and start it on the other host.
But you killed the SPM, and without an SPM some operations cannot
work until a new SPM is selected. And we do not have a way to start the SPM
on another host *before* the old SPM host reboots and we can
verify that the old host is no longer the SPM.
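If you want to check which host currently holds the SPM role, you can query the
engine REST API, for example (hypothetical engine address and credentials):
# curl -s -k -u 'admin@internal:***' 'https://engine.example.com/ovirt-engine/api/hosts'
and look at the spm status reported for each host.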
Instead it goes into a loop of power fencing attempts.
I wait about 25 minutes without any effect other than the continuous attempts.
After 2 minutes host2 correctly becomes SPM and the VMs are marked as unknown.
I wonder how host2 became the SPM. This should not be possible before
host1 is rebooted. Did you use "Confirm host was rebooted" in the engine?
At a certain point, after the failed attempts to power fence host1, I see
the event:
Failed to power fence host host1. Please check the host status and it's power
management settings, and then manually reboot it and click "Confirm Host Has Been
Rebooted"
If I select the host and choose "Confirm Host Has Been Rebooted", then the two VMs
are marked as down and the HA one is correctly started on host2.
But this requires my manual intervention.
So host2 became the SPM after you chose "Confirm Host Has Been Rebooted"?
Is the behavior above the expected one, or should the use of VM leases have
allowed host2 to bypass the fencing failure and start the HA VM with the lease?
Otherwise I don't understand the reason to have the lease at all....
The VM lease allows the engine to start an HA VM on another host when it cannot
access the original host the VM was running on.
The VM can be started only if it is not actually running on the original host: a
running VM keeps its lease alive, so other hosts will not be able to acquire it.
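If you want to see this from the host side, sanlock can show the lockspaces and
resources (including the VM lease) currently held on the host running the VM,
for example:
# sanlock client status
This is just a quick way to confirm that the lease is actually held while the VM
is running.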
I suggest you file an ovirt-engine bug with clear instructions on how to
reproduce the issue.
You can check this presentation on the topic:
https://www.youtube.com/watch?v=WLnU_YsHWtU
Nir