On Wed, Oct 12, 2022 at 9:14 AM Martin Perina <mperina@redhat.com> wrote:


On Tue, Oct 11, 2022 at 1:42 PM Klaas Demter <klaasdemter@gmail.com> wrote:

Don't storage leases solve that problem?


Not entirely. You cannot kill a VM via a storage lease; you can only use it for detection: even though we lost the connection from the engine to the host (and that means to its VMs as well), we can check whether the host/VM leases are still being refreshed, and if they are, we do not try to restart the VM on a different host.

I seem to recall an HA VM also works (gets restarted on another node) when a hypervisor completely loses power, i.e. there is no response from the fencing device. I'd expect it to work the same without a fencing device.


So if that happens, it's not a completely correct setup. If you want reliable power management, then your power management network should be independent of your data network, so that if there is an issue with the data network, you can still use the power management network to check the power status and perform a reboot if needed. Of course, if both networks are down, then you have a problem, but that should be a rare case.


Greetings

Klaas




In September 2021 I ran some Active-Active DR tests on an environment based on RHV 4.4.x, with one host in Site A and one host in Site B.
See also here:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html/disaster_recovery_guide/active_active

Cluster configuration (a REST query sketch to verify these settings follows the list):
. Enable fencing --> yes
. Skip fencing if host has live lease --> yes
. Skip fencing on cluster connectivity issues --> yes, with threshold 50%
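
In case it is useful, the same settings can also be checked through the REST API. This is only a rough sketch: the engine FQDN (engine.example.com), the cluster name (MYCLUSTER) and the admin@internal credentials are placeholders for your own environment.

  # query the cluster and inspect the fencing_policy element in the XML output
  curl -s -k -u 'admin@internal:PASSWORD' \
    -H 'Accept: application/xml' \
    'https://engine.example.com/ovirt-engine/api/clusters?search=name%3DMYCLUSTER'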

I simulated (through iptables rules) the unreachability of both the host in Site B and its IPMI device.
One HA VM (ha-vm) and one non-HA VM (non-ha-vm) were running on the host in Site B.
I then generated a kernel panic on the host in Site B (so that it would not renew its leases).
(The host in Site B, based on RHEL 8, reboots automatically after the crash dump, so I stopped it in the BIOS boot phase to keep the server from coming up again.)
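
Something along these lines can be used to reproduce the two failures; the IP addresses below are only placeholders for the rhvh1 management and IPMI interfaces:

  # on the engine / Site A side: drop traffic towards the Site B host and its IPMI device
  iptables -A OUTPUT -d 192.0.2.10 -j DROP   # rhvh1 management IP (placeholder)
  iptables -A OUTPUT -d 192.0.2.11 -j DROP   # rhvh1 IPMI/BMC IP (placeholder)

  # on the Site B host: trigger a kernel panic so that its leases stop being renewed
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger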
The VM ha-vm was correctly restarted on the host in Site A after the defined timeout:

Sep 10, 2021, 6:09:51 PM Host rhvh1 is not responding. It will stay in Connecting state for a grace period of 81 seconds and after that an attempt to fence the host will be issued.
Sep 10, 2021, 6:09:51 PM VDSM rhvh1 command Get Host Statistics failed: Connection timeout for host 'rhvh1', last response arrived 22501 ms ago.
...
Sep 10, 2021, 6:11:25 PM VM ha-vm was set to the Unknown status.
Sep 10, 2021, 6:11:25 PM VM non-ha-vm was set to the Unknown status.
Sep 10, 2021, 6:11:25 PM Host rhvh1 became non responsive and was not restarted due to Fencing Policy: 50 percents of the Hosts in the Cluster have connectivity issues.
...
Sep 10, 2021, 6:13:43 PM Trying to restart VM ha-vm on Host rhvh2

And the VM ha-vm became active and operational.
Note that the non-HA VM non-ha-vm remains in Unknown status.
If I remove the iptables rules and let rhvh1 boot, it correctly rejoins the cluster without trying to restart the VM.
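Removing the rules just means deleting what was added before, for example:

  iptables -D OUTPUT -d 192.0.2.10 -j DROP
  iptables -D OUTPUT -d 192.0.2.11 -j DROP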

The only limitation is that if the site with the isolation problem is the one where the SPM host was running, you still have HA for the VMs, but you cannot elect a new SPM.
So you cannot, for example, add new disks or change the size of existing ones.
But this is an acceptable temporary situation in the DR scenario I was simulating.

If you try to force rhvh2 to become SPM you get:
Error while executing action: Cannot force select SPM. Unknown Data Center status.
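
The same attempt can be made through the REST API; I'm writing this from memory, so take it only as a sketch (engine FQDN, credentials and the host UUID are placeholders) and verify the action name against the API reference:

  curl -s -k -u 'admin@internal:PASSWORD' \
    -H 'Content-Type: application/xml' -d '<action/>' \
    'https://engine.example.com/ovirt-engine/api/hosts/RHVH2_UUID/forceselectspm'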

To get a new SPM (on rhvh2 in my case), in a real scenario (which I simulated before letting rhvh1 boot into the OS) you have to verify the real state of Site B and confirm that everything there has been powered off (to prevent future data corruption if Site B comes up again), and then select

"confirm host has been rebooted" on rhvh1

You get a confirmation window ("Are you sure?") with:

Please make sure the Host 'rhvh1' has been manually shut down or rebooted.
This Host is the SPM. Executing this operation on a Host that was not properly manually rebooted could lead to Storage corruption condition!
If the host has not been manually rebooted hit 'Cancel'.
Confirm Operation --> check the box
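
As far as I know, the same confirmation can also be scripted through the REST fence action with the manual fence type; again only a sketch from memory with placeholder values, to be verified against the API reference:

  curl -s -k -u 'admin@internal:PASSWORD' \
    -H 'Content-Type: application/xml' \
    -d '<action><fence_type>manual</fence_type></action>' \
    'https://engine.example.com/ovirt-engine/api/hosts/RHVH1_UUID/fence'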

At this point rhvh2 becomes the new SPM, the non-HA VM non-ha-vm transitions from Unknown to Down, and the DC becomes Up.
From an events point of view you get:

Sep 10, 2021, 6:23:40 PM Vm non-ha-vm was shut down due to rhvh1 host reboot or manual fence
Sep 10, 2021, 6:23:41 PM All VMs' status on Non Responsive Host rhvh1 were changed to 'Down' by user@internal
Sep 10, 2021, 6:23:41 PM Manual fence for host rhvh1 was started.
Sep 10, 2021, 6:23:43 PM Storage Pool Manager runs on Host rhvh2 (Address: rhvh2), Data Center MYDC.

At this point you can start the non-ha-vm VM:
Sep 10, 2021, 6:24:44 PM VM non-ha-vm was started by user@internal (Host: rhvh2).
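
For the record, the start can also be done via the REST API (placeholder values as above, and the VM UUID is hypothetical):

  curl -s -k -u 'admin@internal:PASSWORD' \
    -H 'Content-Type: application/xml' -d '<action/>' \
    'https://engine.example.com/ovirt-engine/api/vms/NON_HA_VM_UUID/start'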

During these tests I opened a case because the SPM-related limitation was not documented in the DR guide, and I got it added (see section 2.3, Storage Considerations).

What is described above should also be applicable to oVirt > 4.4 for DR, and it could somehow be adapted to cover HA needs when IPMI is missing.
But for sure it is only a sort of workaround, to be avoided in production.

I suggest you test all the scenarios you want to manage, to verify the expected behavior.

HIH digging more,
Gianluca