
On Tue, Sep 1, 2020 at 7:17 PM <souvaliotimaria@mail.com> wrote:
Hello everyone,
I have a replica 2 + arbiter installation, and this morning the Hosted Engine gave the following error on the UI and resumed on a different node (node3) than the one it was originally running on (node1). (The original node has more memory than the one it ended up on, but it had a better memory usage percentage at the time.) Also, the only way I discovered that the migration had happened and that there was an Error in Events was that I logged in to the oVirt web interface for a routine inspection. Besides that, everything was working properly and still is.
The error that popped is the following:
VM HostedEngine is down with error. Exit message: internal error: qemu unexpectedly closed the monitor: 2020-09-01T06:49:20.749126Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future 2020-09-01T06:49:20.927274Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,scsi=off,bus=pci.0,addr=0x7,drive=drive-ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,id=ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2,bootindex=1,write-cache=on: Failed to get "write" lock Is another process using the image?.
It's quite likely that this isn't the root cause. Please check your logs from before that. The above looks like something (ovirt-ha-agent?) tried to start the hosted engine VM but failed due to locking, most likely because it was already up elsewhere (on some other host?). So you want to check when and where the VM was started before this error, and then carefully check for any errors from before it was started. Also, check that the clocks on all your machines are in sync.
From what I could gather, this concerns the following snippet from HostedEngine.xml, which is the virtio disk of the Hosted Engine:
<disk type='file' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads' iothread='1'/>
  <source file='/var/run/vdsm/storage/80f6e393-9718-4738-a14a-64cf43c3d8c2/d5de54b6-9f8e-4fba-819b-ebf6780757d2/a48555f4-be23-4467-8a54-400ae7baf9d7'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <serial>d5de54b6-9f8e-4fba-819b-ebf6780757d2</serial>
  <alias name='ua-d5de54b6-9f8e-4fba-819b-ebf6780757d2'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
I've tried looking into the logs and the output of the sar command, but I couldn't find anything related to the above errors or determine why they happened. Is this a Gluster or a QEMU problem?
Likely, but hard to tell without more information.
The Hosted Engine had been manually migrated to node1 five days earlier.
Is there a standard practice I could follow to determine what happened and secure my system?
Nothing, other than checking the logs. Check, on all of your hosts:

/var/log/messages
/var/log/vdsm/*
/var/log/ovirt-hosted-engine-ha/*

And on the engine (likely won't help in this case, but just in case):

/var/log/ovirt-engine/*
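If it helps, the log check can be roughly scripted. Below is a minimal Python sketch, to be run on each host, that scans the host log locations listed above for lines mentioning the hosted engine VM or the lock failure. The keyword list and the helper names (find_matches, open_log, scan) are assumptions made for illustration, not part of oVirt or vdsm; adjust the keywords to whatever you are hunting for.

```python
import glob
import gzip
import lzma
import os

# Host log locations named in the reply above.
HOST_LOG_PATTERNS = [
    "/var/log/messages",
    "/var/log/vdsm/*",
    "/var/log/ovirt-hosted-engine-ha/*",
]
# Illustrative keywords only (an assumption); taken from the error text in this thread.
KEYWORDS = ["HostedEngine", "Failed to get"]


def find_matches(lines, keywords):
    """Return (line_number, line) pairs containing any keyword, case-insensitively."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(k in text.lower() for k in lowered):
            hits.append((number, text))
    return hits


def open_log(path):
    """Open a plain, .gz, or .xz log as text, replacing undecodable bytes."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def scan(patterns, keywords):
    """Yield (path, line_number, line) for each matching line in matching files."""
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            if not os.path.isfile(path):
                continue  # skip subdirectories
            try:
                with open_log(path) as handle:
                    for number, text in find_matches(handle, keywords):
                        yield path, number, text
            except (OSError, EOFError, lzma.LZMAError) as error:
                print(f"skipped {path}: {error}")


if __name__ == "__main__":
    for path, number, text in scan(HOST_LOG_PATTERNS, KEYWORDS):
        print(f"{path}:{number}: {text}")
```

Comparing the timestamps of the hits across hosts should show where and when the VM was started before the lock error appeared. This only narrows down where to look; it does not replace reading the surrounding log lines.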
Thank you very much for your time,
Good luck and best regards, -- Didi