
Okay, so the SPM is only blocking me if I have thin-provisioned block storage that needs extending. Luckily that is not the case here, because we primarily use NFS. It is something I had never thought about, though; in my head the storage leases completely solved the "host crashes and power management does not answer" case, until I read your mail :)

Thanks for the detailed explanation Nir!

Greetings
Klaas

On 8/10/21 11:20 AM, Nir Soffer wrote:
> On Tue, Aug 10, 2021 at 12:05 PM Klaas Demter <klaasdemter@gmail.com> wrote:
>> I always thought the SPM role is also "managed" by a storage lease :) But that does not seem to be the case.
>
> The SPM is using a storage lease to ensure we have only one SPM. But due to the master mount, we cannot start a new SPM even if the old SPM does not hold the lease, since it would corrupt the master filesystem, used to keep SPM tasks.
>
>> So this means a storage lease is only useful if the host is not the SPM? If the SPM host is completely unreachable, not via OS, not via power management, then the storage lease won't help to restart VMs on other hosts automatically? This is definitely something I did not consider when building my environment.
>
> Starting VMs should not depend on the SPM, this is the basic design. If issues with the SPM break starting VMs, this is a bug that we need to fix.
>
> The only known dependency is extending thin-provisioned disks on block storage. Without the SPM this cannot happen, since the SPM is the only host that can extend the logical volumes.
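
Side note: a quick way to check whether a block storage domain even contains thin-provisioned volumes that the SPM would need to extend is to list the logical volumes of the domain's volume group on any connected host. A rough sketch, assuming the usual oVirt layout where the VG is named after the storage domain UUID; the UUID below is a placeholder:

  # list the disk volumes of a block storage domain and their current sizes
  lvs -o lv_name,lv_size,lv_attr,lv_tags <storage-domain-uuid>

File-based domains such as NFS have no logical volumes to extend, so this SPM dependency does not apply to them.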
>> On 8/9/21 6:25 PM, Nir Soffer wrote:
>>> On Thu, Aug 5, 2021 at 5:45 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
>>>> Hello, supposing a latest 4.4.7 environment installed with an external engine and two hosts, one in one site and one in another site. For storage I have one FC storage domain. I try to simulate a sort of "site failure scenario" to see what kind of HA I should expect.
>>>>
>>>> The 2 hosts have power management configured through fence_ipmilan.
>>>>
>>>> I have 2 VMs, one configured as HA with a lease on storage (Resume Behavior: kill) and one not marked as HA.
>>>>
>>>> Initially host1 is the SPM and it is the host that runs the two VMs.
>>>> Fencing of host1 from host2 initially works ok. I can test it also from the command line:
>>>>
>>>> # fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S /usr/local/bin/pwd.sh -o status
>>>> Status: ON
>>>>
>>>> On host2 I then prevent reaching host1's iDRAC:
>>>>
>>>> firewall-cmd --direct --add-rule ipv4 filter OUTPUT 0 -d 10.10.193.152 -p udp --dport 623 -j DROP
>>>> firewall-cmd --direct --add-rule ipv4 filter OUTPUT 1 -j ACCEPT
>>>
>>> Why do you need to prevent access from host1 to host2? Hosts do not access each other unless you migrate VMs between hosts.
>>>
>>>> so that:
>>>>
>>>> # fence_ipmilan -a 10.10.193.152 -P -l my_fence_user -A password -L operator -S /usr/local/bin/pwd.sh -o status
>>>> 2021-08-05 15:06:07,254 ERROR: Failed: Unable to obtain correct plug status or plug is not available
>>>> On host1 I generate a panic:
>>>>
>>>> # date ; echo 1 > /proc/sys/kernel/sysrq ; echo c > /proc/sysrq-trigger
>>>> Thu Aug 5 15:06:24 CEST 2021
>>>>
>>>> host1 correctly completes its crash dump (kdump integration is enabled) and reboots, but I stop it at the grub prompt so that host1 is unreachable from host2's point of view and its power state cannot be determined through fencing either.
>>>
>>> Crashing the host and preventing it from booting is fine, but isn't it simpler to stop the host using power management?
>>>> At this point I thought that the VM lease functionality would have come into play and host2 would be able to restart the HA VM, as it is able to see that the lease is not held by the other host and so it can acquire the lock itself....
>>>
>>> Once host1 disappears from the system, the engine should detect that the HA VM is in an unknown status, and start it on the other host.
>>>
>>> But you killed the SPM, and without the SPM some operations cannot work until a new SPM is selected. And for the SPM we don't have a way to start it on another host *before* the old SPM host reboots and we can verify that the old host is not the SPM.
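
Side note: which host currently holds the SPM role can also be queried from the host side, independent of the engine UI. A rough sketch, assuming the vdsm-client verb below exists in your vdsm version; the pool UUID is a placeholder:

  # ask vdsm whether this host holds the SPM role for the data center
  vdsm-client StoragePool getSpmStatus storagepoolID=<pool-uuid>

Running this on each host during the failure would show whether host2 really took over the SPM role before host1 was confirmed down.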
>>>> Instead it goes into the power-fence attempt loop; I wait about 25 minutes without any effect other than continuous attempts.
>>>>
>>>> After 2 minutes host2 correctly becomes SPM and the VMs are marked as unknown.
>>>
>>> I wonder how host2 became the SPM. This should not be possible before host1 is rebooted. Did you use "Confirm host was rebooted" in engine?
>>>> At a certain point, after the failures in power fencing host1, I see the event:
>>>>
>>>> Failed to power fence host host1. Please check the host status and it's power management settings, and then manually reboot it and click "Confirm Host Has Been Rebooted"
>>>>
>>>> If I select the host and choose "Confirm Host Has Been Rebooted", then the two VMs are marked as down and the HA one is correctly booted by host2.
>>>>
>>>> But this requires my manual intervention.
>>>
>>> So host2 became the SPM after you chose "Confirm Host Has Been Rebooted"?
>>>> Is the behavior above the expected one, or should the use of VM leases have allowed host2 to bypass the fencing inability and start the HA VM with its lease? Otherwise I don't understand the reason to have the lease at all....
>>>
>>> The VM lease allows the engine to start an HA VM on another host when it cannot access the original host the VM was running on.
>>>
>>> The VM can start only if it is not running on the original host. If the VM is running it will keep the lease live, and other hosts will not be able to acquire it.
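
Side note: the lease state is visible from the host side through sanlock. A rough sketch; the output format varies between versions:

  # list the lockspaces and resources (leases) this host currently holds;
  # while a VM with a lease runs here, its lease shows up as an acquired
  # resource, which is exactly what prevents another host from acquiring it
  sanlock client status

On the host that runs the VM you should see the VM lease among the acquired resources.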
>>> I suggest you file an ovirt-engine bug with clear instructions on how to reproduce the issue.
>>>
>>> You can check this presentation on this topic: https://www.youtube.com/watch?v=WLnU_YsHWtU
>>>
>>> Nir