On Wed, Oct 12, 2022 at 9:14 AM Martin Perina <mperina(a)redhat.com> wrote:
On Tue, Oct 11, 2022 at 1:42 PM Klaas Demter <klaasdemter(a)gmail.com>
wrote:
> Don't storage leases solve that problem?
>
Not entirely: you cannot kill a VM via a storage lease. You can only
detect that, even though the engine has lost its connection to the host
(and thus to its VMs), the host/VM leases are still being refreshed; if
so, we do not try to restart the VM on a different host.
> I seem to recall that an HA VM also works (gets restarted on another
> node) when a hypervisor completely loses power, i.e. there is no response
> from the fencing device. I'd expect it to work the same without a fencing
> device.
>
If that happens, the setup is not completely correct. If you want
reliable power management, then your power management network should be
independent of your data network, so that if there is an issue with the
data network you can still use the power management network to check the
power status and perform a reboot if needed. Of course, if both networks
are down then you have a problem, but that should be a rare case.
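As an illustration (the BMC address and credentials below are just
placeholders), checking the power status and rebooting over the management
network with ipmitool could look like:

  # query the BMC for the current power state
  ipmitool -I lanplus -H 10.0.1.121 -U ipmiuser -P ipmipass chassis power status
  # power-cycle the server through the BMC
  ipmitool -I lanplus -H 10.0.1.121 -U ipmiuser -P ipmipass chassis power cycle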
> Greetings
>
> Klaas
>
>
In September 2021 I simulated some Active-Active DR tests on an
environment based on RHV 4.4.x, with 1 host in Site A and 1 host in Site B.
See also here:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/...
Cluster configuration:
. enable fencing --> yes
. skip fencing if host has live lease --> yes
. skip fencing on cluster connectivity issues --> yes with threshold 50%
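For the record, the same fencing policy can also be set outside the Admin
Portal; a rough sketch via the REST API (engine URL, password and cluster
ID are placeholders, element names as I read them in the API model) would
be something like:

  curl -k -u 'admin@internal:PASSWORD' -X PUT \
    -H 'Content-Type: application/xml' \
    -d '<cluster>
          <fencing_policy>
            <enabled>true</enabled>
            <skip_if_sd_active><enabled>true</enabled></skip_if_sd_active>
            <skip_if_connectivity_broken>
              <enabled>true</enabled>
              <threshold>50</threshold>
            </skip_if_connectivity_broken>
          </fencing_policy>
        </cluster>' \
    https://engine.example.com/ovirt-engine/api/clusters/CLUSTER_ID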
I simulated (through iptables rules) the unreachability of the host in
Site B and of its IPMI device.
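Such rules can be as simple as plain DROP rules; a minimal sketch,
assuming they are added wherever the engine (and the fence proxy host in
Site A) would reach rhvh1 from, with 10.0.0.21 / 10.0.0.121 as placeholders
for the rhvh1 data IP and its IPMI/BMC address:

  # drop traffic towards the rhvh1 data interface and its BMC
  iptables -A OUTPUT -d 10.0.0.21  -j DROP
  iptables -A OUTPUT -d 10.0.0.121 -j DROP
  # drop anything coming back from them
  iptables -A INPUT  -s 10.0.0.21  -j DROP
  iptables -A INPUT  -s 10.0.0.121 -j DROP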
One HA VM (ha-vm) and one non-HA VM (non-ha-vm) were running on the host
in Site B.
I then generated a kernel panic on the host in Site B (so that it stops
renewing its leases). The RHEL 8 based host in Site B reboots automatically
after the crash dump, so I stopped it in the BIOS boot phase so that the
server would not come up again.
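A panic can be triggered, assuming sysrq is enabled on the host, with the
usual sysrq trigger on rhvh1:

  # make sure the magic sysrq interface is enabled, then crash the kernel
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger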
The VM ha-vm was correctly restarted on the host in Site A after the
defined timeout:
Sep 10, 2021, 6:09:51 PM Host rhvh1 is not responding. It will stay in
Connecting state for a grace period of 81 seconds and after that an attempt
to fence the host will be issued.
Sep 10, 2021, 6:09:51 PM VDSM rhvh1 command Get Host Statistics failed:
Connection timeout for host 'rhvh1', last response arrived 22501 ms ago.
...
Sep 10, 2021, 6:11:25 PM VM ha-vm was set to the Unknown status.
Sep 10, 2021, 6:11:25 PM VM non-ha-vm was set to the Unknown status.
Sep 10, 2021, 6:11:25 PM Host rhvh1 became non responsive and was not
restarted due to Fencing Policy: 50 percents of the Hosts in the Cluster
have connectivity issues.
...
Sep 10, 2021, 6:13:43 PM Trying to restart VM ha-vm on Host rhvh2
And the VM ha-vm becomes active and operational.
Note that the non-HA VM non-ha-vm remains in Unknown status.
If I remove the iptables rules and let rhvh1 boot, it correctly rejoins
the cluster without trying to restart the VM.
The only limitation is that if the site with the isolation problem is the
one where the SPM host is running, you still have HA for the VMs, but you
cannot elect a new SPM.
So you cannot, for example, add new disks or change the size of existing
ones.
But this is an acceptable temporary situation for the DR scenario I was
simulating.
If you try to force rhvh2 to become SPM you get:
Error while executing action: Cannot force select SPM. Unknown Data Center
status.
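If I remember the action name correctly, the equivalent attempt through
the REST API is the host forceselectspm action, e.g.:

  curl -k -u 'admin@internal:PASSWORD' -X POST \
    -H 'Content-Type: application/xml' -d '<action/>' \
    https://engine.example.com/ovirt-engine/api/hosts/RHVH2_HOST_ID/forceselectspm

with the password and host ID as placeholders; while the DC status is
unknown it should be rejected in the same way.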
To get a new SPM (on rhvh2 in my case), in a real scenario (which I
simulated before letting rhvh1 boot into the OS) you have to verify the
real state of Site B and that everything there has been powered off (to
prevent future data corruption if Site B comes up again), and then select
"confirm host has been rebooted" on rhvh1.
You then get a window with "Are you sure?":
Please make sure the Host 'rhvh1' has been manually shut down or rebooted.
This Host is the SPM. Executing this operation on a Host that was not
properly manually rebooted could lead to Storage corruption condition!
If the host has not been manually rebooted hit 'Cancel'.
Confirm Operation --> check the box
At this point rhvh2 becomes the new SPM, the non-HA VM non-ha-vm
transitions from Unknown status to Down, and the DC becomes up.
From an events point of view you get:
Sep 10, 2021, 6:23:40 PM Vm non-ha-vm was shut down due to rhvh1 host
reboot or manual fence
Sep 10, 2021, 6:23:41 PM All VMs' status on Non Responsive Host rhvh1 were
changed to 'Down' by user@internal
Sep 10, 2021, 6:23:41 PM Manual fence for host rhvh1 was started.
Sep 10, 2021, 6:23:43 PM Storage Pool Manager runs on Host rhvh2 (Address:
rhvh2), Data Center MYDC.
At this point you can start the non-ha-vm VM
Sep 10, 2021, 6:24:44 PM VM non-ha-vm was started by user@internal (Host:
rhvh2).
During these tests I opened a case because the SPM-related limitation was
not documented in the DR guide, and I got it added (see paragraph 2.3,
Storage Considerations).
What is described above should also be applicable to oVirt > 4.4 for DR,
and could somehow be applied to cover HA needs when IPMI is missing.
But for sure it is only a sort of workaround, to be avoided in production.
I suggest you test all the scenarios you want to manage, to verify the
expected behavior.
HIH digging more,
Gianluca