Are there any sanlock errors indicating storage problems?
Have you checked the Gluster logs for errors or signs of network disruption?
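
For example, something like the following on each host, around the time of a hang, should show whether sanlock was struggling to renew its storage leases or a Gluster client lost contact with a brick (log paths assume a default oVirt Node install):

journalctl -u sanlock --since "2022-09-01 07:40" --until "2022-09-01 08:00"
grep -iE 'renew|delta|error' /var/log/sanlock.log
grep -iE 'disconnect|readv|error' /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log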

Best Regards,
Strahil Nikolov 

On Thu, Sep 1, 2022 at 12:18, Diego Ercolani
<diego.ercolani@ssis.sm> wrote:
Hello, I have a cluster made of 3 nodes in a "self-hosted-engine" topology.
The storage is implemented with Gluster in a replica 2 + arbiter topology.
I have two Gluster volumes:
glen - the volume used by the hosted-engine VM
gv0 - the volume used by the other VMs
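
If it helps, I can post the full output of the commands below, which should show the exact brick layout and volume options (just the commands here, output omitted for brevity):

[root@ovirt-node3 ~]# gluster volume info glen
[root@ovirt-node3 ~]# gluster volume info gv0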

The physical disks are 4 TB SSDs used only to accommodate VMs (including the hosted engine).

I have continuous VM hangs, even of the hosted engine, which causes a lot of trouble; the hangs happen asynchronously, even while there are management operations on the VMs (migration, cloning...).

After a while the VM is released, but on its console the kernel complains about CPU or timer hangs, and the only solution is to shut down/power off the VM... even the hosted engine: in fact "hosted-engine --vm-status" reports "state=EngineUpBadHealth".

This is the log on the host during one of these events:
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info>  [1662018538.0166] device (vnet73): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info>  [1662018538.0168] device (vnet73): released from master device ovirtmgmt
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: Unable to read from monitor: Connection reset by peer
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: internal error: qemu unexpectedly closed the monitor: 2022-09-01T07:48:57.930955Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-0a1a501c-fc45-430f-bfd3-076172cec406,bootindex=1,write-cache=on,serial=0a1a501c-fc45-430f-bfd3-076172cec406,werror=stop,rerror=stop: Failed to get "write" lock Is another process using the image [/run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741]?
Sep 01 07:48:58 ovirt-node3.ovirt kvm[268578]: 5 guests now active
Sep 01 07:48:58 ovirt-node3.ovirt systemd[1]: machine-qemu\x2d67\x2dHostedEngine.scope: Succeeded.
Sep 01 07:48:58 ovirt-node3.ovirt systemd-machined[1613]: Machine qemu-67-HostedEngine terminated.
Sep 01 07:49:08 ovirt-node3.ovirt systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Sep 01 07:49:08 ovirt-node3.ovirt ovirt-ha-agent[3338]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped on localhost
Sep 01 07:49:14 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5083]: s4 delta_renew long write time 11 sec
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5033]: s3 delta_renew long write time 11 sec
Sep 01 07:49:34 ovirt-node3.ovirt libvirtd[2496]: Domain id=65 name='ocr-Brain-28-ovirt.dmz.ssis' uuid=00425fb1-c24b-4eaa-9683-534d66b2cb04 is tainted: custom-ga-command
Sep 01 07:49:47 ovirt-node3.ovirt sudo[268984]:    root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/privsep-helper --privsep_context os_brick.privileged.default --privsep_sock_path /tmp/tmp1iolt06i/privsep.sock
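
About the qemu error above ("Failed to get 'write' lock"): I am not sure these are the right checks, but I could try to see whether another process still holds that image open, e.g.:

[root@ovirt-node3 ~]# lsof /run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741
[root@ovirt-node3 ~]# lslocks | grep qemu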

This is what Gluster reports:
[root@ovirt-node3 ~]# gluster volume heal gv0 info
Brick ovirt-node2.ovirt:/brickgv0/_gv0
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickgv0/gv0_1
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_gv0
Status: Connected
Number of entries: 0


[root@ovirt-node3 ~]# gluster volume heal glen info
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0

So it seems healthy.

I don't know how to address this issue, but it is a big problem.