Hi,
I always thought the SPM role is also "managed" by a storage lease :)
But that does not seem to be the case.
So this means a storage lease is only useful if the host is not the SPM?
If the SPM host is completely unreachable, neither via the OS nor via power
management, then the storage lease won't help restart its VMs on other
hosts automatically? This is definitely something I did not consider
when building my environment.
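
As a side note, if one wants to see which leases a host currently holds, I
assume something like this would show it (assuming sanlock is the mechanism
behind the storage leases; the exact output format may vary):

# sanlock client status

That should list the lockspaces and resources the host has acquired, so it at
least shows whether a VM lease is still held there.
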
Greetings
Klaas
On 8/9/21 6:25 PM, Nir Soffer wrote:
> On Thu, Aug 5, 2021 at 5:45 PM Gianluca Cecchi
> <gianluca.cecchi(a)gmail.com> wrote:
>> Hello,
>> supposing latest 4.4.7 environment installed with an external engine and two
>> hosts, one in one site and one in another site.
>> For storage I have one FC storage domain.
>> I try to simulate a sort of "site failure scenario" to see what kind of
>> HA I should expect.
>>
>> The 2 hosts have power mgmt configured through fence_ipmilan.
>>
>> I have 2 VMs, one configured as HA with lease on storage (Resume Behavior: kill)
>> and one not marked as HA.
>>
>> Initially host1 is SPM and it is the host that runs the two VMs.
>>
>> Fencing of host1 from host2 initially works OK. I can also test it from the
>> command line:
>> # fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S /usr/local/bin/pwd.sh -o status
>> Status: ON
>>
>> On host2 I then prevent reaching host1 iDRAC:
>> firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
>> firewall-cmd --direct --add-rule ipv4 filter OUTPUT 1 -j ACCEPT
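>>
>> (For reference, the direct rules can be listed, and removed again after the
>> test, with something like the following, mirroring the --add-rule syntax:
>>
>> # firewall-cmd --direct --get-all-rules
>> # firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
>> # firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 1 -j ACCEPT
>> )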
> Why do you need to prevent access from host1 to host2? Hosts do not
> access each other unless you migrate vms between hosts.
>
>> so that:
>>
>> # fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S /usr/local/bin/pwd.sh -o status
>> 2021-08-05 15:06:07,254 ERROR: Failed: Unable to obtain correct plug status or plug is not available
>>
>> On host1 I generate panic:
>> # date ; echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
>> Thu Aug 5 15:06:24 CEST 2021
>>
>> host1 correctly completes its crash dump (kdump integration is enabled) and
>> reboots, but I stop it at the grub prompt, so that host1 is unreachable from
>> host2's point of view and its power state cannot be determined via fencing either
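>>
>> (As a sanity check before the test, I would expect something like
>>
>> # kdumpctl status
>>
>> to confirm that kdump is operational on the host; this is only a pre-check,
>> not part of the scenario itself.)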
> Crashing the host and preventing it from booting is fine, but isn't it
> simpler to stop the host using power management?
>
>> At this point I thought that the VM lease functionality would have come into play and
>> host2 would be able to re-start the HA VM, as it is able to see that the lease is not
>> held by the other host and so it can acquire the lock itself....
> Once host1 disappears from the system, the engine should detect that the HA VM
> is in unknown status, and start it on the other host.
>
> But you killed the SPM, and without an SPM some operations cannot
> work until a new SPM is selected. And for the SPM we don't have a way
> to start it on another host *before* the old SPM host reboots and we can
> verify that the old host is no longer the SPM.
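>
> In case it helps, which host the engine currently considers the SPM can also
> be checked via the REST API, roughly like this (engine FQDN and credentials
> are placeholders here, and the exact XML layout may differ between versions):
>
> # curl -s -k -u admin@internal:password 'https://engine.example.com/ovirt-engine/api/hosts' | grep -i -A2 '<spm'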
>
>> Instead it goes into the power-fencing attempt loop.
>> I wait about 25 minutes without any effect other than the continuous attempts.
>>
>> After 2 minutes host2 correctly becomes SPM and VMs are marked as unknown
> I wonder how host2 became the SPM. This should not be possible before
> host1 is rebooted. Did you use "Confirm host was rebooted" in engine?
>
>> At a certain point after the failures in power fencing host1, I see the event:
>>
>> Failed to power fence host host1. Please check the host status and it's power
>> management settings, and then manually reboot it and click "Confirm Host Has Been
>> Rebooted"
>>
>> If I select the host and choose "Confirm Host Has Been Rebooted", then the
>> two VMs are marked as down and the HA one is correctly booted by host2.
>>
>> But this requires my manual intervention.
> So host2 became the SPM after you chose "Confirm Host Has Been
> Rebooted"?
>
>> Is the behavior above the expected one, or should the use of VM leases have
>> allowed host2 to bypass the fencing inability and start the HA VM with the lease?
>> Otherwise I don't understand the reason to have the lease at all....
> The VM lease allows the engine to start an HA VM on another host when it
> cannot access the original host the VM was running on.
>
> The VM can start only if it is not running on the original host. If the VM
> is still running there, it will keep the lease alive, and other hosts will
> not be able to acquire it.
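>
> To see the lease from the host side: on oVirt hosts a read-only virsh
> connection works without credentials, and the lease shows up as a device in
> the VM's libvirt XML, roughly like this (the VM name is a placeholder):
>
> # virsh -r dumpxml my-ha-vm | grep -A5 '<lease'
>
> "sanlock client status" on the host similarly lists the lease resources it
> currently holds.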
>
> I suggest you file an ovirt-engine bug with clear instructions on how to
> reproduce the issue.
>
> You can check this presentation on this topic:
>
> https://www.youtube.com/watch?v=WLnU_YsHWtU
>
> Nir