VM HostedEngine is down with error
by souvaliotimaria@mail.com
Hello everyone,
I have a replica 2 + arbiter installation, and this morning the Hosted Engine gave the following error in the UI and resumed on a different node (node3) than the one it was originally running on (node1). (The original node has more memory than the one it ended up on, but it had a better memory usage percentage at the time.) Also, the only reason I discovered that the migration had happened, and that there was an error in Events, was that I logged in to the oVirt web interface for a routine inspection. Besides that, everything was working properly and still is.
The error that popped up is the following:
VM HostedEngine is down with error. Exit message: internal error: qemu unexpectedly closed the monitor:
2020-09-01T06:49:20.749126Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
2020-09-01T06:49:20.927274Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,id=ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,bootindex=1,write-cache=on: Failed to get "write" lock
Is another process using the image?.
From what I could gather, this concerns the following snippet from HostedEngine.xml, which is the virtio disk of the Hosted Engine:
<disk type='file' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
  <source file='/var/run/vdsm/storage/80f6e393-9718-4738-a14a-64cf43c3d8c2/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <serial>d5de54b6-9f8e-4fba-819b-ebf6780757d2</serial>
  <alias name='ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
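For anyone who wants to double-check how I matched the error to this disk, here is a minimal Python sketch. It only compares the id= value from the qemu-kvm error line with the alias name in the disk XML, using strings copied from this post. The function names device_id and disk_for_alias are my own, hypothetical helpers and not part of any oVirt or libvirt tool.

```python
import re
import xml.etree.ElementTree as ET

# Device arguments copied from the qemu-kvm error line in this post.
QEMU_ERROR = (
    "qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,"
    "addr=0x7,drive=drive-ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,"
    "id=ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,bootindex=1,write-cache=on: "
    'Failed to get "write" lock'
)

# Disk element copied from HostedEngine.xml in this post.
DISK_XML = """
<disk type='file' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
  <source file='/var/run/vdsm/storage/80f6e393-9718-4738-a14a-64cf43c3d8c2/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <serial>d5de54b6-9f8e-4fba-819b-ebf6780757d2</serial>
  <alias name='ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
"""


def device_id(error_line):
    """Return the value of the id= option in a qemu -device string, or None."""
    match = re.search(r"(?:^|,)id=([^,:\s]+)", error_line)
    return match.group(1) if match else None


def disk_for_alias(disk_xml, alias):
    """Return target and source of the disk if its alias matches, else None."""
    disk = ET.fromstring(disk_xml)
    alias_el = disk.find("alias")
    if alias_el is None or alias_el.get("name") != alias:
        return None
    return {
        "target": disk.find("target").get("dev"),
        "source": disk.find("source").get("file"),
    }


if __name__ == "__main__":
    dev = device_id(QEMU_ERROR)
    print("device id from error:", dev)
    print("matching disk:", disk_for_alias(DISK_XML, dev))
```

This only confirms which disk image the lock error refers to. It does not say which process was holding the lock.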
I've tried looking into the logs and the output of the sar command, but I couldn't find anything that relates to the above errors or that explains why this happened. Is this a Gluster or a QEMU problem?
The Hosted Engine had been manually migrated to node1 five days earlier.
Is there a standard practice I could follow to determine what happened and secure my system?
Thank you very much for your time,
Maria Souvalioti