On Tue, Sep 1, 2020 at 7:17 PM <souvaliotimaria(a)mail.com> wrote:
I have a replica 2 + arbiter installation and this morning the Hosted Engine gave the
following error in the UI and resumed on a different node (node3) than the one it was
originally running on (node1). (The original node has more memory than the one it ended
up on, but the latter had a better memory-usage percentage at the time.) Also, the only
way I discovered that the migration had happened and that there was an Error in Events
was that I logged in to the oVirt web interface for a routine inspection. Besides that,
everything was working properly and still is.
The error that popped up is the following:
VM HostedEngine is down with error. Exit message: internal error: qemu unexpectedly
closed the monitor:
2020-09-01T06:49:20.749126Z qemu-kvm: warning: All CPU(s) up to maxcpus should be
described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and
will be removed in future
2020-09-01T06:49:20.927274Z qemu-kvm: -device
Failed to get "write" lock
Is another process using the image?.
It's quite likely that this isn't the root cause.
Please check your logs from before that.
Above looks like something (ovirt-ha-agent?) tried to start the hosted
engine VM, but failed due to locking - most likely, because it was
already up elsewhere (on some other host?).
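One way to verify the "already up elsewhere" theory by hand is to check, on each host, whether any local process still references the HE disk image. A minimal sketch, assuming a standard shell on the hosts; the IMG path is a placeholder you must replace with the actual image path from HostedEngine.xml:

```shell
# Placeholder path -- substitute the real HE disk image path (assumption,
# not taken from this thread).
IMG='/var/run/vdsm/storage/<sd_uuid>/<img_uuid>'

# List any local process whose command line references the image;
# a qemu-kvm hit here means the VM (and its write lock) is live on this host.
holders=$(ps axww | grep -F "$IMG" | grep -v grep || true)
if [ -n "$holders" ]; then
  echo "image in use on this host:"
  echo "$holders"
else
  echo "no local process references $IMG"
fi
```

Running this on every host should show at most one host with a qemu process holding the image; two hits would explain the lock failure.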
So you want to check when/where the VM was started before this error, and then
carefully check any errors from before it was started.
Also, check that the clocks on all your machines are in sync.
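To find when/where the VM was last started, the HA agent log on each host can be grepped for start events. A sketch; the exact message wording varies between versions, so the sample line and log path below are illustrative assumptions, not quotes from this installation:

```shell
# On a real host you would grep the agent log directly, e.g.:
#   grep -E 'Engine vm (started|is running)' /var/log/ovirt-hosted-engine-ha/agent.log
# Here the filter is demonstrated on an illustrative sample line:
sample='MainThread::INFO::2020-09-01 09:49:21::hosted_engine::(start_monitoring) Engine vm started on localhost'
echo "$sample" | grep -E 'Engine vm (started|is running)'
```

Comparing the timestamps of such lines across hosts (with clocks in sync) shows which host started the VM last, and which still thought it owned it.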
From what I could gather, this concerns the following snippet from HostedEngine.xml,
namely the virtio disk of the Hosted Engine:
<disk type='file' device='disk' snapshot='no'>
<driver name='qemu' type='raw' cache='none'
error_policy='stop' io='threads' iothread='1'/>
<seclabel model='dac' relabel='no'/>
<target dev='vda' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x00'
I've tried looking into the logs and the output of the sar command, but I couldn't find
anything to relate to the above errors or to determine why this happened. Is this a
Gluster or a QEMU problem?
Likely, but hard to tell without more information.
The Hosted Engine had been manually migrated to node1 five days earlier.
Is there a standard practice I could follow to determine what happened and secure my
Nothing, other than checking the logs.
Check, on all of your hosts:
And on the engine (likely won't help in this case, but just in case):
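As a concrete starting point, these are the usual log locations on a default oVirt 4.x install (an assumption; adjust for your own layout):

```shell
# Host-side logs (assumed default paths, on each oVirt host):
host_logs='/var/log/vdsm/vdsm.log
/var/log/ovirt-hosted-engine-ha/agent.log
/var/log/ovirt-hosted-engine-ha/broker.log
/var/log/libvirt/qemu/HostedEngine.log'
# Engine-side log (inside the Hosted Engine VM):
engine_log='/var/log/ovirt-engine/engine.log'

# Report which of these exist on the current machine:
for f in $host_logs $engine_log; do
  if [ -e "$f" ]; then echo "present: $f"; else echo "missing: $f"; fi
done
```

vdsm.log and agent.log are the most likely places to show what happened around 06:49, and the libvirt qemu log is where the "Failed to get write lock" message itself lands.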
Thank you very much for your time,
Good luck and best regards,