Hello, I have a cluster made of 3 nodes in a "self-hosted-engine" topology.
The storage is implemented with Gluster in a replica 2 + arbiter topology.
I have two Gluster volumes:
glen - is the volume used by hosted-engine vm
gv0 - is the volume used by VMs
The physical disks are 4TB SSDs used only to accommodate the VMs (including the hosted-engine VM).
I have continuous VM hangs, including the hosted-engine VM, and this causes a lot of trouble.
The hangs happen asynchronously, even while management operations are running on the
VMs (migration, cloning...).
After a while the VM is released, but on the VM console the kernel complains about CPU
hangs or timer hangs, and the only solution is to shut down/power off the VM.
This affects the hosted engine too: hosted-engine --vm-status reports
"state=EngineUpBadHealth".
This is the log on the host during the event:
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info> [1662018538.0166] device (vnet73): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info> [1662018538.0168] device (vnet73): released from master device ovirtmgmt
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: Unable to read from monitor: Connection reset by peer
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: internal error: qemu unexpectedly closed the monitor: 2022-09-01T07:48:57.930955Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-0a1a501c-fc45-430f-bfd3-076172cec406,bootindex=1,write-cache=on,serial=0a1a501c-fc45-430f-bfd3-076172cec406,werror=stop,rerror=stop: Failed to get "write" lock Is another process using the image [/run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741]?
Sep 01 07:48:58 ovirt-node3.ovirt kvm[268578]: 5 guests now active
Sep 01 07:48:58 ovirt-node3.ovirt systemd[1]: machine-qemu\x2d67\x2dHostedEngine.scope: Succeeded.
Sep 01 07:48:58 ovirt-node3.ovirt systemd-machined[1613]: Machine qemu-67-HostedEngine terminated.
Sep 01 07:49:08 ovirt-node3.ovirt systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Sep 01 07:49:08 ovirt-node3.ovirt ovirt-ha-agent[3338]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped on localhost
Sep 01 07:49:14 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5083]: s4 delta_renew long write time 11 sec
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5033]: s3 delta_renew long write time 11 sec
Sep 01 07:49:34 ovirt-node3.ovirt libvirtd[2496]: Domain id=65 name='ocr-Brain-28-ovirt.dmz.ssis' uuid=00425fb1-c24b-4eaa-9683-534d66b2cb04 is tainted: custom-ga-command
Sep 01 07:49:47 ovirt-node3.ovirt sudo[268984]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/privsep-helper --privsep_context os_brick.privileged.default --privsep_sock_path /tmp/tmp1iolt06i/privsep.sock
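To see how often the slow sanlock renewals in the log above occur, and whether they line up with the VM hangs, the journal could be filtered with something like the following. This is only a rough sketch: the function name slow_renewals and the field names are my own, the pattern is modelled on the two sanlock lines above, and I am assuming the s3/s4 token identifies the sanlock lockspace.

```python
import re
from typing import List, Tuple

# Pattern modelled on the sanlock lines in the log above, for example:
#   sanlock[1633]: 2022-09-01 07:49:28 1706161 [5083]: s4 delta_renew long write time 11 sec
# \s+ is used between tokens so that entries wrapped across lines still match.
SLOW_RENEW = re.compile(
    r"sanlock\[\d+\]:\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+\d+\s+\[\d+\]:\s+"
    r"(?P<space_id>s\d+)\s+delta_renew\s+long\s+write\s+time\s+(?P<secs>\d+)\s+sec"
)


def slow_renewals(journal_text: str) -> List[Tuple[str, str, int]]:
    """Return (timestamp, space_id, seconds) for every slow renewal found.

    space_id is the s3/s4 token (assumed to be the lockspace identifier).
    """
    return [
        (m.group("ts"), m.group("space_id"), int(m.group("secs")))
        for m in SLOW_RENEW.finditer(journal_text)
    ]
```

The input would be the journal text saved to a file or captured from the host; each returned tuple can then be compared with the time at which a VM hang was observed.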
This is what Gluster reports:
[root@ovirt-node3 ~]# gluster volume heal gv0 info
Brick ovirt-node2.ovirt:/brickgv0/_gv0
Status: Connected
Number of entries: 0
Brick ovirt-node3.ovirt:/brickgv0/gv0_1
Status: Connected
Number of entries: 0
Brick ovirt-node4.ovirt:/dati/_gv0
Status: Connected
Number of entries: 0
[root@ovirt-node3 ~]# gluster volume heal glen info
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0
Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0
Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0
So it seems healthy.
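The check above could also be scripted, so that it can be repeated while a hang is in progress. A minimal sketch, assuming the output format shown above; the function name heal_entries is hypothetical, and the handling of a non-numeric count is an assumption about what an unreachable brick might print.

```python
from typing import Dict


def heal_entries(heal_info_output: str) -> Dict[str, int]:
    """Map each brick to its pending heal count from 'gluster volume heal <vol> info' output."""
    counts: Dict[str, int] = {}
    brick = None
    for raw in heal_info_output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            # Remember which brick the following lines belong to.
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            # Assumption: a non-numeric value means the count is unavailable;
            # it is recorded as -1 so it stands out from a clean 0.
            counts[brick] = int(value) if value.isdigit() else -1
    return counts
```

Any brick with a value other than 0 would be worth looking at; in the output above every brick reports 0.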
I don't know how to address the issue, but it is a serious problem.