Okay, so the SPM is only blocking me if I have thin provisioned block
storage that needs extending. This is luckily not the case because we
primarily use NFS. It is something I had never thought about, though; in
my head, storage leases completely solved the "host crashes and power
management does not answer" case, until I read your mail :)
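
A quick way to double check this on our side (just a sketch; the mount path
layout and the UUIDs below are placeholders for whatever vdsm actually mounts):

# qemu-img info /rhev/data-center/mnt/<server>:_<export>/<sd-uuid>/images/<img-uuid>/<vol-uuid>

If "disk size" is well below "virtual size" in the output, the image is simply
a sparse file that grows on demand on the NFS export, with no SPM involved.
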
Thanks for the detailed explanation Nir!
Greetings
Klaas
On 8/10/21 11:20 AM, Nir Soffer wrote:
> On Tue, Aug 10, 2021 at 12:05 PM Klaas Demter <klaasdemter(a)gmail.com> wrote:
>> I always thought the SPM role is also "managed" by a storage lease :)
> The SPM is using a storage lease to ensure we have only one SPM. But due to
> the master mount, we cannot start a new SPM even if the old SPM does not hold
> the lease, since that would corrupt the master filesystem, which is used to keep SPM tasks.
>
>> But that does not seem to be the case.
>>
>> So this means a storage lease is only useful if the host is not the SPM?
>> If the SPM host is completely unreachable, not via OS, not via power
>> management, then the storage lease won't help to restart VMs on other
>> hosts automatically? This is definitely something I did not consider
>> when building my environment.
> Starting VMs should not depend on the SPM; this is the basic design. If an issue
> with the SPM breaks starting VMs, it is a bug that we need to fix.
>
> The only known dependency is extending thin provisioned disks on block storage.
> Without the SPM this cannot happen, since the SPM is the only host that
> can extend the logical volumes.
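
(Only as an illustration of what that means, not necessarily what vdsm runs
verbatim: on a block domain each volume is an LV in the storage domain's VG,
so an extension is conceptually something like

# lvextend -L +1G <sd-uuid>/<vol-uuid>

with placeholder VG/LV names, and the SPM is the only host allowed to change
the LVM metadata of that shared VG.)
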
>
>> On 8/9/21 6:25 PM, Nir Soffer wrote:
>>> On Thu, Aug 5, 2021 at 5:45 PM Gianluca Cecchi
>>> <gianluca.cecchi(a)gmail.com> wrote:
>>>> Hello,
>>>> supposing latest 4.4.7 environment installed with an external engine and
>>>> two hosts, one in one site and one in another site.
>>>> For storage I have one FC storage domain.
>>>> I try to simulate a sort of "site failure scenario" to see what
>>>> kind of HA I should expect.
>>>>
>>>> The 2 hosts have power mgmt configured through fence_ipmilan.
>>>>
>>>> I have 2 VMs, one configured as HA with lease on storage (Resume
>>>> Behavior: kill) and one not marked as HA.
>>>>
>>>> Initially host1 is SPM and it is the host that runs the two VMs.
>>>>
>>>> Fencing of host1 from host2 initially works ok. I can also test it from
>>>> the command line:
>>>> # fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L
>>>> operator -S /usr/local/bin/pwd.sh -o status
>>>> Status: ON
>>>>
>>>> On host2 I then prevent reaching host1's iDRAC:
>>>> # firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
>>>> # firewall-cmd --direct --add-rule ipv4 filter OUTPUT 1 -j ACCEPT
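
(For what it's worth, such runtime --direct rules can be listed and dropped
again with the matching firewall-cmd calls, e.g.

# firewall-cmd --direct --get-all-rules
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
# firewall-cmd --direct --remove-rule ipv4 filter OUTPUT 1 -j ACCEPT

and, not being --permanent, they also disappear on a firewalld reload.)
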
>>> Why do you need to prevent access from host1 to host2? Hosts do not
>>> access each other unless you migrate vms between hosts.
>>>
>>>> so that:
>>>>
>>>> # fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L
>>>> operator -S /usr/local/bin/pwd.sh -o status
>>>> 2021-08-05 15:06:07,254 ERROR: Failed: Unable to obtain correct plug
>>>> status or plug is not available
>>>>
>>>> On host1 I generate panic:
>>>> # date ; echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
>>>> Thu Aug 5 15:06:24 CEST 2021
>>>>
>>>> host1 correctly completes its crash dump (kdump integration is enabled)
>>>> and reboots, but I stop it at the GRUB prompt, so that host1 is unreachable
>>>> from host2's point of view and its power state cannot be determined via fencing.
>>> Crashing the host and preventing it from booting is fine, but isn't it
>>> simpler to stop the host using power management?
>>>
>>>> At this point I thought that the VM lease functionality would have come into
>>>> play and host2 would be able to restart the HA VM, as it can see that the lease
>>>> is not held by the other host and so it can acquire the lock itself...
>>> Once host1 disappears from the system, engine should detect that the HA VM
>>> is at unknown status, and start it on the other host.
>>>
>>> But you killed the SPM, and without the SPM some operations cannot
>>> work until a new SPM is selected. And for the SPM we don't have a way
>>> to start it on another host *before* the old SPM host reboots and we can
>>> verify that the old host is no longer the SPM.
>>>
>>>> Instead it goes into a loop of power fencing attempts.
>>>> I wait about 25 minutes without any effect, only continuous attempts.
>>>>
>>>> After 2 minutes host2 correctly becomes SPM and VMs are marked as
>>>> unknown
>>> I wonder how host2 became the SPM. This should not be possible before
>>> host 1 is rebooted. Did you use "Confirm host was rebooted" in engine?
>>>
>>>> At a certain point after the failures in power fencing host1, I see the event:
>>>>
>>>> Failed to power fence host host1. Please check the host status and
>>>> it's power management settings, and then manually reboot it and click
>>>> "Confirm Host Has Been Rebooted"
>>>>
>>>> If I select the host and choose "Confirm Host Has Been Rebooted",
>>>> then the two VMs are marked as down and the HA one is correctly booted by host2.
>>>>
>>>> But this requires my manual intervention.
>>> So host2 became the SPM after you chose "Confirm Host Has Been Rebooted"?
>>>
>>>> Is the behavior above the expected one, or should the use of VM leases have
>>>> allowed host2 to bypass the fencing inability and start the HA VM with the
>>>> lease? Otherwise I don't understand the reason to have the lease at all...
>>> The VM lease allows the engine to start an HA VM on another host when it cannot
>>> access the original host the VM was running on.
>>>
>>> The VM can start only if it is not running on the original host. If
>>> the VM is still running, it will keep the lease alive, and other hosts
>>> will not be able to acquire it.
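
(A rough way to see this in practice, assuming the VM lease is the usual
sanlock based storage lease: on the host where the VM is still running,

# sanlock client status

should list the lease among the resources currently held, which is exactly
why another host cannot acquire it while the VM is alive.)
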
>>>
>>> I suggest you file an ovirt-engine bug with clear instructions on how
>>> to reproduce the issue.
>>>
>>> You can check this presentation on this topic:
>>>
>>> https://www.youtube.com/watch?v=WLnU_YsHWtU
>>>
>>> Nir