On Thu, Aug 5, 2021 at 5:45 PM Gianluca Cecchi
<gianluca.cecchi(a)gmail.com> wrote:
Hello,
suppose the latest 4.4.7 environment, installed with an external engine and two
hosts, one in one site and one in another.
For storage I have one FC storage domain.
I try to simulate a sort of "site failure scenario" to see what kind of HA I
should expect.
The 2 hosts have power mgmt configured through fence_ipmilan.
I have 2 VMs, one configured as HA with a lease on storage (Resume Behavior: kill)
and one not marked as HA.
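For completeness, the HA and lease settings of a VM can also be checked through
the engine REST API; a minimal example, assuming a hypothetical engine FQDN,
admin credentials and VM id:
# curl -s -k -u 'admin@internal:***' 'https://engine.example.com/ovirt-engine/api/vms/<vm-id>'
and then looking at the high_availability and lease elements in the returned XML.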
Initially host1 is SPM and it is the host that runs the two VMs.
Fencing of host1 from host2 initially works OK. I can also test it from the command line:
# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S
/usr/local/bin/pwd.sh -o status
Status: ON
On host2 I then prevent it from reaching host1's iDRAC:
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623
-j DROP
firewall-cmd --direct --add-rule ipv4 filter OUTPUT 1 -j ACCEPT
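These are non-permanent direct rules; just for reference, something like the
following should list them and remove them again after the test:
# firewall-cmd --direct --get-all-rules
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 1 -j ACCEPT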
Why do you need to prevent access from host2 to host1? Hosts do not
access each other unless you migrate VMs between them.
so that:
# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S
/usr/local/bin/pwd.sh -o status
2021-08-05 15:06:07,254 ERROR: Failed: Unable to obtain correct plug status or plug is
not available
On host1 I generate panic:
# date ; echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
Thu Aug 5 15:06:24 CEST 2021
host1 correctly completes its crash dump (kdump integration is enabled) and reboots,
but I stop it at the GRUB prompt, so that from host2's point of view host1 is
unreachable and its power state cannot be determined by fencing.
Crashing the host and preventing it from booting is fine, but isn't it
simpler to stop the host using power management?
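For example, something like this (reusing the same fence agent options as your
status check above; just an illustration, not tested here) should power the host
off from the fencing point of view:
# fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S /usr/local/bin/pwd.sh -o off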
At this point I thought that the VM lease functionality would kick in and host2
would be able to restart the HA VM, since it is able to see that the lease is not
held by the other host and so it can acquire the lock itself....
Once host1 disappears from the system, the engine should detect that the HA VM
is in an unknown state, and start it on the other host.
But you killed the SPM, and without an SPM some operations cannot
work until a new SPM is selected. And we do not have a way to start the SPM
on another host *before* the old SPM host reboots and we can
verify that the old host is no longer the SPM.
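If you want to check which host currently holds the SPM role, you can query the
engine REST API, for example (hypothetical engine address and credentials):
# curl -s -k -u 'admin@internal:***' 'https://engine.example.com/ovirt-engine/api/hosts'
and look at the spm status reported for each host.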
Instead it goes into a loop of power fencing attempts.
I wait about 25 minutes without any effect other than the continuous attempts.
After 2 minutes host2 correctly becomes SPM and the VMs are marked as unknown.
I wonder how host2 became the SPM. This should not be possible before
host1 is rebooted. Did you use "Confirm host was rebooted" in the engine?
At a certain point, after the failed attempts to power fence host1, I see
the event:
Failed to power fence host host1. Please check the host status and it's power
management settings, and then manually reboot it and click "Confirm Host Has Been
Rebooted"
If I select the host and choose "Confirm Host Has Been Rebooted", then the two VMs
are marked as down and the HA one is correctly started on host2.
But this requires my manual intervention.
So host2 became the SPM after you chose "Confirm Host Has Been Rebooted"?
Is the behavior above the expected one, or should the use of VM leases have
allowed host2 to bypass the fencing failure and start the HA VM with the lease?
Otherwise I don't understand the reason to have the lease at all....
The VM lease allows the engine to start an HA VM on another host when it cannot
access the original host the VM was running on.
The VM can be started only if it is not actually running on the original host: a
running VM keeps its lease alive, so other hosts will not be able to acquire it.
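If you want to see this from the host side, sanlock can show the lockspaces and
resources (including the VM lease) currently held on the host running the VM,
for example:
# sanlock client status
This is just a quick way to confirm that the lease is actually held while the VM
is running.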
I suggest you file an ovirt-engine bug with clear instructions on how to
reproduce the issue.
You can check this presentation on the topic:
https://www.youtube.com/watch?v=WLnU_YsHWtU
Nir