In a host failure situation we see oVirt try to restart the affected VMs on other hosts in
the cluster, but more often than not this fails because qemu-kvm cannot acquire a
write lock on the qcow2 image. oVirt attempts the restart several times, each time on a
different host, with the same outcome, after which it gives up.
We then have to log into the oVirt web interface and start the VM manually, which works
fine (presumably because by then enough time has passed for the lock to clear).
This behaviour is seen with CentOS 7.6, libvirt 4.5.0-10 and vdsm 4.30.13-1.
Log excerpt from hosted engine:
2019-04-24 17:05:26,653+01 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer]
(EE-ManagedThreadFactory-engineScheduled-Thread-82) [] VM
'ef7e04f0-764a-4cfe-96bf-c0862f1f5b83'(vm-21.example.local) moved from
'WaitForLaunch' --> 'Down'
2019-04-24 17:05:26,710+01 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedThreadFactory-engineScheduled-Thread-82) [] EVENT_ID: VM_DOWN_ERROR(119), VM
vm-21.example.local is down with error. Exit message: internal error: process exited while
connecting to monitor: 2019-04-24T16:04:48.049352Z qemu-kvm: -drive
file=/rhev/data-center/mnt/192.168.111.111:_/21a1390b-b73b-46b1-85b9-2bbf9bba5308/images/c9d96ab6-cb0b-4fba-9b07-096ff750c7f7/16da3660-1afe-40a3-b868-3a74e74bab2f,format=qcow2,if=none,id=drive-ua-c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,serial=c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,werror=stop,rerror=stop,cache=none,aio=threads:
'serial' is deprecated, please use the corresponding option of '-device'
instead
2019-04-24T16:04:48.079989Z qemu-kvm: -drive
file=/rhev/data-center/mnt/192.168.111.111:_/21a1390b-b73b-46b1-85b9-2bbf9bba5308/images/c9d96ab6-cb0b-4fba-9b07-096ff750c7f7/16da3660-1afe-40a3-b868-3a74e74bab2f,format=qcow2,if=none,id=drive-ua-c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,serial=c9d96ab6-cb0b-4fba-9b07-096ff750c7f7,werror=stop,rerror=stop,cache=none,aio=threads:
Failed to get "write" lock
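Incidentally, I assume the stale lock can be probed by hand on the target host with
qemu-img, since it takes its own lock on the image (this is only my understanding of how
the image locking works, so correct me if I'm wrong):

# run on a host that failed to start the VM; while the lock from the dead host
# is still outstanding I'd expect this to fail with the same "write" lock error,
# and once it succeeds a manual start should work too
qemu-img info /rhev/data-center/mnt/192.168.111.111:_/21a1390b-b73b-46b1-85b9-2bbf9bba5308/images/c9d96ab6-cb0b-4fba-9b07-096ff750c7f7/16da3660-1afe-40a3-b868-3a74e74bab2f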
So my question is: how can I either make oVirt keep retrying the VM restart, or delay the
initial restart attempt long enough for the lock to clear?
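For reference, the only related knobs I have found so far are these engine-config keys,
which I believe control how many times and how often the engine retries auto-starting a
highly available VM; I am not certain they apply to this situation, so please correct me
if they are not the right ones:

# on the engine machine: show the current retry count and interval
engine-config -g MaxNumOfTriesToRunFailedAutoStartVm
engine-config -g RetryToRunAutoStartVmIntervalInSeconds

# for example, space the retries further apart, then restart the engine
engine-config -s RetryToRunAutoStartVmIntervalInSeconds=120
systemctl restart ovirt-engine

Would adjusting these be the right approach, or is there a better way to handle this?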