HA VMs fail to start on host failure

In a host failure situation, oVirt tries to restart the HA VMs on other hosts in the cluster, but this more often than not fails because qemu-kvm is unable to acquire a write lock on the qcow2 image. oVirt attempts the restart several times, each time on a different host but with the same outcome, after which it gives up. We then have to log into the oVirt web interface and start the VM manually, which works fine; by that time, we assume, enough time has passed for the lock to clear itself.

This behaviour is seen with CentOS 7.6, libvirt 4.5.0-10 and vdsm 4.30.13-1.

Log excerpt from the hosted engine:

2019-04-24 17:05:26,653+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-82) [] VM 'ef7e04f0-764a-4cfe-96bf-c0862f1f5b83'(vm-21.example.local) moved from 'WaitForLaunch' --> 'Down'
2019-04-24 17:05:26,710+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-82) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm-21.example.local is down with error.
Exit message: internal error: process exited while connecting to monitor:
2019-04-24T16:04:48.049352Z qemu-kvm: -drive file=/rhev/data-center/mnt/192.168.111.111:_/21a1390b-b73b-46b1-85b9-2bbf9bba5308/images/c9d96ab6-cb0b-4fba-9b07-096ff750c7f7/16da3660-1afe-40a3-b868-3a74e74bab2f,format=qcow2,if=none,id=drive-ua-c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,serial=c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,werror=stop,rerror=stop,cache=none,aio=threads: 'serial' is deprecated, please use the corresponding option of '-device' instead
2019-04-24T16:04:48.079989Z qemu-kvm: -drive file=/rhev/data-center/mnt/192.168.111.111:_/21a1390b-b73b-46b1-85b9-2bbf9bba5308/images/c9d96ab6-cb0b-4fba-9b07-096ff750c7f7/16da3660-1afe-40a3-b868-3a74e74bab2f,format=qcow2,if=none,id=drive-ua-c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,serial=c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,werror=stop,rerror=stop,cache=none,aio=threads: Failed to get "write" lock

So my question is: how can I either force oVirt to keep trying to restart the VM, or delay the initial restart attempt long enough for the lock to clear?
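For context on why the delayed manual start succeeds: qemu holds an advisory file lock on the image, and a second opener fails until the stale lock from the dead host is released (e.g. after the NFS lock lease expires). The small sketch below is an illustration only, not oVirt or qemu code; it reproduces the pattern with flock(1) on a temporary file:

```shell
#!/bin/sh
# Illustration only (not qemu code): a second opener cannot take the lock
# while the "failed host" still holds it, mirroring qemu's
# 'Failed to get "write" lock'; once the holder goes away, a retry succeeds.
LOCKFILE=$(mktemp)

# The failed host's qemu process still holds an exclusive lock on fd 9.
exec 9>"$LOCKFILE"
flock -x 9

# Restart attempt on another file description fails immediately.
if flock -n "$LOCKFILE" true; then first=acquired; else first=busy; fi

# The stale lock clears (host fenced, lease expires): release fd 9.
exec 9>&-

# A later retry succeeds, like the manual start from the web UI.
if flock -n "$LOCKFILE" true; then second=acquired; else second=busy; fi

echo "first=$first second=$second"
rm -f "$LOCKFILE"
```

This prints first=busy second=acquired, which matches what we see: the restart fails while the lock is stale and works once enough time has passed.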
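One avenue that may answer the question directly: the engine's HA restart policy is tunable via engine-config. On our engine version the relevant keys appear to be MaxNumOfTriesToRunFailedAutoStartVm and RetryToRunAutoStartVmIntervalInSeconds; the key names and defaults are an assumption here, so verify them with `engine-config -l` before changing anything. A hedged sketch:

```shell
# Sketch, not verified on every oVirt release: stretch the HA restart
# retries so they outlast the stale image lock. Key names are assumptions;
# confirm with `engine-config -l` first.
engine-config -g MaxNumOfTriesToRunFailedAutoStartVm
engine-config -g RetryToRunAutoStartVmIntervalInSeconds

engine-config -s MaxNumOfTriesToRunFailedAutoStartVm=20
engine-config -s RetryToRunAutoStartVmIntervalInSeconds=60
systemctl restart ovirt-engine   # settings take effect after an engine restart
```

With 20 tries spaced 60 seconds apart the engine would keep retrying for roughly 20 minutes, which should be longer than any reasonable lock-expiry window.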
participants (1)
- tezmobile@googlemail.com