VMs hang periodically: gluster problem?

Hello, I have a cluster made of 3 nodes in a "self-hosted-engine" topology. The storage is implemented with Gluster in a "2 replica + arbiter" topology. I have two gluster volumes:
glen - the volume used by the hosted-engine VM
gv0 - the volume used by the other VMs
The physical disks are 4TB SSDs used only to accommodate VMs (including the hosted-engine).

I have continuous VM hangs, even of the hosted-engine, and this causes a lot of trouble. The hangs happen asynchronously, even while management operations are running on VMs (migration, cloning...). After a while the VM is freed, but on its console the kernel complains about CPU or timer hangs, and the only solution is to shut down/power off the VM... even the hosted-engine: in fact it happens that "hosted-engine --vm-status" reports "state=EngineUpBadHealth".

This is the log on the host while the event happens:

Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info> [1662018538.0166] device (vnet73): state change: activated -> unmanaged (reason 'unmanaged', sys-iface-state: 'removed')
Sep 01 07:48:58 ovirt-node3.ovirt NetworkManager[1923]: <info> [1662018538.0168] device (vnet73): released from master device ovirtmgmt
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: Unable to read from monitor: Connection reset by peer
Sep 01 07:48:58 ovirt-node3.ovirt libvirtd[2496]: internal error: qemu unexpectedly closed the monitor: 2022-09-01T07:48:57.930955Z qemu-kvm: -device virtio-blk-pci,iothread=iothread1,bus=pci.6,addr=0x0,drive=libvirt-1-format,id=ua-0a1a501c-fc45-430f-bfd3-076172cec406,bootindex=1,write-cache=on,serial=0a1a501c-fc45-430f-bfd3-076172cec406,werror=stop,rerror=stop: Failed to get "write" lock Is another process using the image [/run/vdsm/storage/3577c21e-f757-4405-97d1-0f827c9b4e22/0a1a501c-fc45-430f-bfd3-076172cec406/f65dab86-67f1-46fa-87c0-f9076f479741]?
Sep 01 07:48:58 ovirt-node3.ovirt kvm[268578]: 5 guests now active
Sep 01 07:48:58 ovirt-node3.ovirt systemd[1]: machine-qemu\x2d67\x2dHostedEngine.scope: Succeeded.
Sep 01 07:48:58 ovirt-node3.ovirt systemd-machined[1613]: Machine qemu-67-HostedEngine terminated.
Sep 01 07:49:08 ovirt-node3.ovirt systemd[1]: NetworkManager-dispatcher.service: Succeeded.
Sep 01 07:49:08 ovirt-node3.ovirt ovirt-ha-agent[3338]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine ERROR Engine VM stopped on localhost
Sep 01 07:49:14 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:18 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt vdsm[3335]: WARN Failed to retrieve Hosted Engine HA info: timed out
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5083]: s4 delta_renew long write time 11 sec
Sep 01 07:49:28 ovirt-node3.ovirt sanlock[1633]: 2022-09-01 07:49:28 1706161 [5033]: s3 delta_renew long write time 11 sec
Sep 01 07:49:34 ovirt-node3.ovirt libvirtd[2496]: Domain id=65 name='ocr-Brain-28-ovirt.dmz.ssis' uuid=00425fb1-c24b-4eaa-9683-534d66b2cb04 is tainted: custom-ga-command
Sep 01 07:49:47 ovirt-node3.ovirt sudo[268984]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/privsep-helper --privsep_context os_brick.privileged.default --privsep_sock_path /tmp/tmp1iolt06i/privsep.sock

This is the indication I have on gluster:

[root@ovirt-node3 ~]# gluster volume heal gv0 info
Brick ovirt-node2.ovirt:/brickgv0/_gv0
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickgv0/gv0_1
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_gv0
Status: Connected
Number of entries: 0

[root@ovirt-node3 ~]# gluster volume heal glen info
Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0

So it seems healthy. I don't know how to address the issue, but it is a serious problem.
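For reference, a quick way to see how often sanlock is struggling with lease renewals on a host (a rough sketch, assuming sanlock's default log location) is something like:

# count slow lease renewals reported by sanlock itself
grep -c 'delta_renew long' /var/log/sanlock.log
# and look at the most recent renewal complaints, with timestamps
grep 'delta_renew' /var/log/sanlock.log | tail -n 20

Frequent "delta_renew long write time" entries mean the storage backing the lease files is intermittently slow, which would match the VM pauses.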

Versions are the latest available: ovirt-host-4.5.0-3.el8.x86_64, ovirt-engine-4.5.2.4-1.el8.noarch, glusterfs-server-10.2-1.el8s.x86_64

Any sanlock errors to indicate storage problems? Have you checked the Gluster logs for errors or indication of network disruption?

Best Regards,
Strahil Nikolov

I don't have disk problems: I enabled smartd and I run a periodic test (smartctl -t long <device>). But in sanlock I do have some problems, and the gluster heal logs are not clean either. The last event I recorded is today at 00:28 local time (2022-09-04 22:28 GMT); this is when node3 sent the mail "ovirt-hosted-engine state transition EngineMaybeAway-EngineDown" (Received: from ovirt-node3.ovirt Mon, 5 Sep 2022 00:28:45 +0200 (CEST)). Here is a bunch of logs from the three nodes; the gluster volume of the hosted-engine is "glen": https://cloud.ssis.sm/index.php/s/dSRyT3MM6ESnrb9

I really don't understand this. I was monitoring vdsm.log on one node (node2) and I saw these complaints:

2022-09-06 14:08:27,105+0000 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1/45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata', 1, 'Read timeout')
2022-09-06 14:08:27,105+0000 INFO (check/loop) [storage.monitor] Domain 45b4f14c-8323-482f-90ab-99d8fd610018 became INVALID (monitor:482)
2022-09-06 14:08:27,149+0000 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata', 1, 'Read timeout')
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [api.virt] START getStats() from=::1,54242, vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:48)
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [api] FINISH getStats error=Virtual machine does not exist: {'vmId': '8486ed73-df34-4c58-bfdc-7025dec63b7f'} (api:129)
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [api.virt] FINISH getStats return={'status': {'code': 1, 'message': "Virtual machine does not exist: {'vmId': '8486ed73-df34-4c58-bfdc-7025dec63b7f'}"}} from=::1,54242, vmId=8486ed73-df34-4c58-bfdc-7025dec63b7f (api:54)
2022-09-06 14:08:27,814+0000 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.getStats failed (error 1) in 0.00 seconds (__init__:312)
2022-09-06 14:08:31,357+0000 ERROR (check/loop) [storage.monitor] Error checking path /rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata (monitor:511)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/monitor.py", line 509, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/check.py", line 398, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
vdsm.storage.exception.MiscFileReadException: Internal file read failure: ('/rhev/data-center/mnt/glusterSD/localhost:_glen/3577c21e-f757-4405-97d1-0f827c9b4e22/dom_md/metadata', 1, 'Read timeout')
2022-09-06 14:08:32,918+0000 INFO (periodic/5) [Executor] Worker was discarded (executor:305)

But, on the same node, from the command line I can issue a simple cat without any problem:

[root@ovirt-node2 ~]# cat "/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata"
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

and

[root@ovirt-node2 ~]# cat '/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata'
ALIGNMENT=1048576
BLOCK_SIZE=512
CLASS=Data
DESCRIPTION=gv0
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=da146814-f823-40e0-bd7b-8478dcfa38cd
REMOTE_PATH=localhost:/gv0
ROLE=Regular
SDUUID=60b7f172-08ed-4a22-8414-31fd5b100d72
TYPE=GLUSTERFS
VERSION=5
_SHA_CKSUM=a63324fa9b3030c3ffa35891c2d6c4e129c76af9

After a while I retried the same cat, and the host console hung... So, sometimes, gluster revokes access to a file?! Why? I think this "hang" is the source of all my problems.
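One detail that may explain the cat-vs-monitor difference: as far as I understand, vdsm's path checker reads the metadata with O_DIRECT, while a plain cat can be served from the client-side cache. A rough way to reproduce something closer to what the monitor does (just a sketch using standard dd/timeout, not vdsm's actual checker):

timeout 10 dd if=/rhev/data-center/mnt/glusterSD/localhost:_gv0/60b7f172-08ed-4a22-8414-31fd5b100d72/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct \
    && echo "direct read OK" || echo "direct read failed or timed out"

If this intermittently times out while cat keeps working, the problem is in the uncached read path (network or bricks), not in the file itself.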

I suspect you have network issues. Check the gluster log for the client side: /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<node_name>:_<volume_name>.log

Best Regards,
Strahil Nikolov

Yes, it seems so, but I cannot record any error on the interfaces: I have 0 TX errors and 0 RX errors... All three nodes are connected through a single switch. I set the MTU to 9000 to help gluster transfers, but I cannot record any error. I have 10Gb/s unsaturated links between the hosts. In /var/log/vdsm/vdsm.log, on all the nodes, I periodically log:

2022-09-11 12:30:31,708+0000 ERROR (periodic/134) [ovirt_hosted_engine_ha.client.client.HAClient] Malformed metadata for host 2: received 0 of 512 expected bytes (client:137)

I cannot understand what is happening.
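Since MTU 9000 is in play, one thing worth ruling out (not from the thread, just a generic check) is a switch port or NIC that silently drops jumbo frames: large gluster writes would then stall while small packets still flow. A sketch, assuming the gluster traffic runs over the management network:

# 8972 = 9000 bytes MTU minus 28 bytes of IP+ICMP headers; -M do forbids fragmentation
ping -c 3 -M do -s 8972 ovirt-node3.ovirt
# per-interface counters: look at drops and overruns too, not only errors
ip -s link show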

Hello. I did a full backup using Veeam, but I recorded many errors in the gluster log. This is the log (https://cloud.ssis.sm/index.php/s/KRimf5MLXK3Ds3d). The log is from the same node where the veeam proxy and the backed-up VMs reside; both run on the gv1 storage domain. Note that hours are reported in GMT; the backup started at 15:55:48 GMT and ended at 16:04:35. The log is full of errors that I cannot understand:

...
[2022-09-12 15:56:24.983887 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2620:client4_0_lookup_cbk] 0-gv1-client-2: remote operation failed. [{path=<gfid:2ca980cf-3cda-4f56-84bd-c089cdeb5878>}, {gfid=2ca980cf-3cda-4f56-84bd-c089cdeb5878}, {errno=2}, {error=No such file or directory}]
...
[2022-09-12 15:56:42.570987 +0000] W [fuse-bridge.c:2979:fuse_readv_cbk] 0-glusterfs-fuse: 891795: READ => -1 gfid=2906fff6-2292-451c-879c-547e189d0bb3 fd=0x55b916d199b8 (Invalid argument)
...

I tried to resolve the gfids, e.g.:

[root@ovirt-node3 ~]# mount -t glusterfs -o aux-gfid-mount,ro ovirt-node2.ovirt:/gv1 /mnt/
[root@ovirt-node3 ~]# gluster volume set gv1 build-pgfid on
volume set: success
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text /mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878
getfattr: /mnt/.gfid/2ca980cf-3cda-4f56-84bd-c089cdeb5878: No such file or directory
[root@ovirt-node3 ~]# getfattr -n glusterfs.ancestry.path -e text /mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3
getfattr: /mnt/.gfid/2906fff6-2292-451c-879c-547e189d0bb3: No such file or directory
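Another way to chase those gfids is directly on a brick (a sketch only, assuming the gv1 brick lives under /brickgv1/gv1 on the data nodes and that the gfid still exists on at least one brick):

# the gfid is indexed under .glusterfs/<first two hex chars>/<next two hex chars>/
ls -l /brickgv1/gv1/.glusterfs/2c/a9/2ca980cf-3cda-4f56-84bd-c089cdeb5878
# for regular files that entry is a hardlink; find its sibling to get the real path
find /brickgv1/gv1 -samefile /brickgv1/gv1/.glusterfs/2c/a9/2ca980cf-3cda-4f56-84bd-c089cdeb5878 2>/dev/null

If the entry is missing on every brick, the client was looking up a file that had already been removed, which would make the "No such file or directory" warnings harmless.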

I see some entries that are not good:

[2022-09-11 03:50:26.131393 +0000] W [MSGID: 108001] [afr-transaction.c:1016:afr_handle_quorum] 0-gv1-replicate-0: 228740f8-1d14-4253-b95b-47e5feb6a3cc: Failing WRITE as quorum is not met [Invalid argument]

When the backup runs, what is the output of 'gluster pool list'? Which gluster version are you using (rpm -qa | grep gluster)? Is it possible to rate limit the backup?

Best Regards,
Strahil Nikolov

Thank you for the analysis. The version is the latest distributed in the ovirt@centos8 distribution:

[root@ovirt-node2 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch
[root@ovirt-node3 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch
[root@ovirt-node4 ~]# rpm -qa | grep '\(glusterfs-server\|ovirt-node\)'
ovirt-node-ng-image-update-placeholder-4.5.2-1.el8.noarch
glusterfs-server-10.2-1.el8s.x86_64
ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
python3-ovirt-node-ng-nodectl-4.4.2-1.el8.noarch
ovirt-node-ng-image-update-4.5.2-1.el8.noarch

During a backup (or any I/O, even not too intensive judging by the SSD LEDs), the only thing I noticed is a sort of lag: I issue "gluster volume heal glen|gv0|gv1 info" and the answer takes 4-5 seconds to come back, even though it always reports 0 entries and the nodes are always connected, e.g.:

Brick ovirt-node2.ovirt:/brickhe/_glen
Status: Connected
Number of entries: 0

Brick ovirt-node3.ovirt:/brickhe/glen
Status: Connected
Number of entries: 0

Brick ovirt-node4.ovirt:/dati/_glen
Status: Connected
Number of entries: 0

Regarding the "rate limit": I didn't work on QoS, but the destination is an NFS SATA RAID5 NAS published via a 1Gb link, so I think I have a ~20MB/s cap by architecture anyway; the gluster bricks are all built on SATA SSD drives, on which I recorded a throughput of 200MB/s. I also tried to monitor performance via iotop without recording a bandwidth problem, and I monitored the network via iftop, recording no saturation and no errors.

Following a thread on the gluster mailing list (https://lists.gluster.org/pipermail/gluster-users/2022-September/040063.html) I tried the same test, writing and reading every 1/10 of a second:

[root@ovirt-node2 ~]# su - vdsm -s /bin/bash
[vdsm@ovirt-node2 ~]$ cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1; do date +'%s.%N' | tee testfile ; done

[root@ovirt-node3 ~]# su - vdsm -s /bin/bash
[vdsm@ovirt-node3 ~]$ cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do date +' %s.%N'; cat testfile ; done

[root@ovirt-node4 ~]# su - vdsm -s /bin/bash
[vdsm@ovirt-node4 ~]$ cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do date +' %s.%N'; cat testfile ; done

The result is that the nodes reading from glusterfs see the file content update only about once per second,
more or less: to report the test I selected timestamp for node2 (the write node) betweeen 1663228352 and 1663228356, for node3 and 4 between 1663228353 and 1663228356: node2: 1663228352.589998302 1663228352.695887198 1663228352.801681699 1663228352.907548634 1663228353.011931276 1663228353.115904115 1663228353.222383590 1663228353.329941123 1663228353.436480791 1663228353.540536995 1663228353.644858473 1663228353.749470221 1663228353.853969491 1663228353.958703186 1663228354.062732971 1663228354.166616934 1663228354.270398507 1663228354.373989214 1663228354.477149100 1663228354.581862187 1663228354.686177524 1663228354.790362507 1663228354.894673446 1663228354.999136257 1663228355.102889616 1663228355.207043913 1663228355.312522545 1663228355.416667384 1663228355.520897473 1663228355.624582255 1663228355.728590069 1663228355.832979634 1663228355.937309737 1663228356.042289521 1663228356.146565174 1663228356.250773672 1663228356.356361818 1663228356.460048755 1663228356.565054968 1663228356.669126850 1663228356.773807899 1663228356.878011739 1663228356.983842597 node3: 1663228353.027991911 1663228352.064562785 1663228353.129696675 1663228353.115904115 1663228353.232351572 1663228353.115904115 1663228353.334188748 1663228353.115904115 1663228353.436208688 1663228353.115904115 1663228353.538268493 1663228353.115904115 1663228353.641266519 1663228353.115904115 1663228353.743094997 1663228353.115904115 1663228353.845244131 1663228353.115904115 1663228353.947049766 1663228353.115904115 1663228354.048876741 1663228353.115904115 1663228354.150979017 1663228354.062732971 1663228354.254198339 1663228354.062732971 1663228354.356197640 1663228354.270398507 1663228354.459541685 1663228354.270398507 1663228354.561548541 1663228354.270398507 1663228354.664280563 1663228354.270398507 1663228354.766557007 1663228354.270398507 1663228354.868323950 1663228354.270398507 1663228354.970093106 1663228354.270398507 1663228355.072195391 1663228354.270398507 1663228355.174244725 1663228354.270398507 1663228355.276335829 1663228355.207043913 1663228355.380891924 1663228355.207043913 1663228355.483239210 1663228355.207043913 1663228355.585240135 1663228355.207043913 1663228355.687486532 1663228355.207043913 1663228355.789322563 1663228355.207043913 1663228355.891375906 1663228355.207043913 1663228355.993212340 1663228355.207043913 1663228356.094918478 1663228355.207043913 1663228356.196910915 1663228355.207043913 1663228356.299065941 1663228356.250773672 1663228356.402899261 1663228356.250773672 1663228356.504898603 1663228356.250773672 1663228356.606802284 1663228356.250773672 1663228356.709301567 1663228356.250773672 1663228356.811021872 1663228356.250773672 1663228356.913016384 node4: 1663228353.091321741 1663228352.589998302 1663228353.192613374 1663228352.589998302 1663228353.293885581 1663228352.589998302 1663228353.395206449 1663228352.589998302 1663228353.496548707 1663228352.589998302 1663228353.597914229 1663228352.589998302 1663228353.699232292 1663228353.644858473 1663228353.801913729 1663228353.644858473 1663228353.903240939 1663228353.644858473 1663228354.005266462 1663228353.644858473 1663228354.106513772 1663228353.644858473 1663228354.207908519 1663228353.644858473 1663228354.309278430 1663228353.644858473 1663228354.410932083 1663228353.644858473 1663228354.512335006 1663228353.644858473 1663228354.613632691 1663228353.644858473 1663228354.714993594 1663228354.686177524 1663228354.817491576 1663228354.686177524 1663228354.918799429 1663228354.686177524 1663228355.020439267 1663228354.686177524 
1663228355.121832050 1663228354.686177524 1663228355.223172355 1663228354.686177524 1663228355.324540271 1663228354.686177524 1663228355.425842454 1663228354.686177524 1663228355.527215380 1663228354.686177524 1663228355.628587564 1663228354.686177524 1663228355.729968575 1663228355.832452340 1663228355.933988683 1663228356.036161934 1663228356.137618036 1663228356.239021910 1663228356.340352667 1663228356.441720413 1663228356.543116683 1663228356.644490180 1663228356.745882022 1663228356.669126850 1663228356.848421632 1663228356.669126850 1663228356.949864539

Analyzing this test I can see these issues:
1. node2 seems never to have written the value 1663228352.064562785 that node3 read...
2. node4 seemed disconnected between 1663228355.628587564 and 1663228356.745882022
3. node3 and node4 only read updates to the test file that are ~1 second old
I think that by digging here we have found the tip of the iceberg....
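To put a number on that staleness, the reader loop could also compute the lag directly (just a sketch using the same testfile as above; it assumes the writer loop is still running on node2 and that the node clocks are in sync):

cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1
while sleep 0.1; do
    now=$(date +%s.%N)     # local clock on the reading node
    seen=$(cat testfile)   # last value this client actually sees
    # print how old the visible write is, in seconds
    awk -v a="$now" -v b="$seen" 'BEGIN {printf "lag %.3f s\n", a - b}'
done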

Sorry, I see that the editor bring away all the head spaces that indent the timestamp. I retried the test, hoping to find the same error, and I found it. On node3. I changed the code of the read routine: cd /rhev/data-center/mnt/glusterSD/ovirt-node2.ovirt:_gv1; while sleep 0.1 ; do date +'Timestamp:%s.%N'; cat testfile ; done Also I have to point that in my gluster configuration: node2 and node3 are replicating while node4 is the arbiter. I find this: node2: 1663233449.088250919 1663233449.192357508 1663233449.296979848 1663233449.401279036 1663233449.504945285 1663233449.609107728 1663233449.713468581 1663233449.817435890 1663233449.922132348 1663233450.030449768 1663233450.134975317 1663233450.239171022 1663233450.342905278 1663233450.447466303 1663233450.551867180 1663233450.658387123 1663233450.762761972 1663233450.868063254 1663233450.973718716 1663233451.077074998 1663233451.181540916 1663233451.286831549 1663233451.393060700 1663233451.500488204 1663233451.606233103 1663233451.711308978 1663233451.816455012 1663233451.922142384 1663233452.028786138 1663233452.134080858 1663233452.239052098 1663233452.343540758 1663233452.449015706 1663233452.553832377 1663233452.658255495 1663233452.762774092 1663233452.866525770 1663233452.970784862 1663233453.075297458 1663233453.178379039 1663233453.281728609 1663233453.385722608 1663233453.489965321 1663233453.593885612 1663233453.698436388 1663233453.802415640 1663233453.906987275 1663233454.010658544 1663233454.114877122 1663233454.218459344 1663233454.322761948 1663233454.428025821 1663233454.533464752 1663233454.637652754 1663233454.741783087 1663233454.845600527 1663233454.950286885 1663233455.055143240 1663233455.161169524 1663233455.265582394 1663233455.369963173 1663233455.475453048 1663233455.580044209 1663233455.684503325 1663233455.788750947 1663233455.894135415 1663233455.998738750 node3: Timestamp:1663233450.000172185 1663233449.296979848 Timestamp:1663233450.101871259 1663233449.296979848 Timestamp:1663233450.204006554 1663233449.296979848 Timestamp:1663233450.306014420 1663233449.296979848 Timestamp:1663233450.407890669 1663233450.342905278 Timestamp:1663233450.511435794 1663233450.342905278 Timestamp:1663233450.613144044 1663233450.342905278 Timestamp:1663233450.714936282 1663233450.342905278 Timestamp:1663233450.816689957 1663233450.342905278 Timestamp:1663233450.919563686 1663233450.342905278 Timestamp:1663233451.021558628 1663233450.342905278 Timestamp:1663233451.123617850 1663233450.342905278 Timestamp:1663233451.225769366 1663233450.342905278 Timestamp:1663233451.327726226 1663233450.342905278 Timestamp:1663233451.429934369 1663233451.393060700 Timestamp:1663233451.532945857 1663233451.393060700 Timestamp:1663233451.634935468 1663233451.393060700 Timestamp:1663233451.737058041 1663233451.393060700 Timestamp:1663233451.839167797 1663233451.393060700 Timestamp:1663233451.941486148 1663233451.393060700 Timestamp:1663233452.043288336 1663233451.393060700 Timestamp:1663233452.145090644 1663233451.393060700 Timestamp:1663233452.246825425 1663233451.393060700 Timestamp:1663233452.348501234 1663233451.393060700 Timestamp:1663233452.450351853 Timestamp:1663233452.553106458 Timestamp:1663233452.655222156 Timestamp:1663233452.757315704 Timestamp:1663233452.859298562 Timestamp:1663233452.961655817 Timestamp:1663233453.063383043 Timestamp:1663233453.165180993 Timestamp:1663233453.266883792 Timestamp:1663233453.368890215 Timestamp:1663233453.470586924 1663233453.385722608 Timestamp:1663233453.573171648 1663233453.385722608 
Timestamp:1663233453.675160288 1663233453.385722608 Timestamp:1663233453.777281257 1663233453.385722608 Timestamp:1663233453.879306084 1663233453.385722608 Timestamp:1663233453.981588858 1663233453.385722608 Timestamp:1663233454.083371309 1663233453.385722608 Timestamp:1663233454.185268095 1663233453.385722608 Timestamp:1663233454.287256013 1663233453.385722608 Timestamp:1663233454.389068540 1663233453.385722608 Timestamp:1663233454.490809573 1663233454.428025821 Timestamp:1663233454.593597380 1663233454.428025821 Timestamp:1663233454.695329646 1663233454.428025821 Timestamp:1663233454.797029330 1663233454.428025821 Timestamp:1663233454.899000216 1663233454.428025821 node4: Timestam:1663233450.043398632 1663233449.817435890 Timestam:1663233450.144889219 1663233449.817435890 Timestam:1663233450.246423969 1663233449.817435890 Timestam:1663233450.347730771 1663233449.817435890 Timestam:1663233450.449109919 1663233449.817435890 Timestam:1663233450.550659616 1663233449.817435890 Timestam:1663233450.652173237 1663233449.817435890 Timestam:1663233450.753610724 1663233449.817435890 Timestam:1663233450.855978621 1663233450.762761972 Timestam:1663233450.958988505 1663233450.762761972 Timestam:1663233451.060495133 1663233450.762761972 Timestam:1663233451.162022459 1663233450.762761972 Timestam:1663233451.263371279 1663233450.762761972 Timestam:1663233451.364879118 1663233450.762761972 Timestam:1663233451.466311416 1663233450.762761972 Timestam:1663233451.567914685 1663233450.762761972 Timestam:1663233451.669375115 1663233450.762761972 Timestam:1663233451.770867676 1663233450.762761972 Timestam:1663233451.872871088 1663233451.816455012 Timestam:1663233451.975552084 1663233451.816455012 Timestam:1663233452.077173476 1663233451.816455012 Timestam:1663233452.178816431 1663233451.816455012 Timestam:1663233452.280202922 1663233451.816455012 Timestam:1663233452.381909265 1663233451.816455012 Timestam:1663233452.483352152 1663233451.816455012 Timestam:1663233452.584928620 1663233451.816455012 Timestam:1663233452.686376887 1663233451.816455012 Timestam:1663233452.787821177 1663233451.816455012 Timestam:1663233452.889583926 1663233452.866525770 Timestam:1663233452.991955099 1663233452.866525770 Timestam:1663233453.093469030 1663233452.866525770 Timestam:1663233453.195055899 1663233452.866525770 Timestam:1663233453.296546997 1663233452.866525770 Timestam:1663233453.397983753 1663233452.866525770 Timestam:1663233453.499491321 1663233452.866525770 Timestam:1663233453.600947784 1663233452.866525770 Timestam:1663233453.702383187 1663233452.866525770 Timestam:1663233453.803887890 1663233452.866525770 Timestam:1663233453.905604823 1663233453.802415640 Timestam:1663233454.008283815 1663233453.906987275 Timestam:1663233454.110699351 1663233453.906987275 Timestam:1663233454.212204857 1663233453.906987275 Timestam:1663233454.313646111 1663233453.906987275 Timestam:1663233454.415141497 1663233453.906987275 Timestam:1663233454.516619631 1663233453.906987275 Timestam:1663233454.618069160 1663233453.906987275 Timestam:1663233454.719575016 1663233453.906987275 Timestam:1663233454.821136887 1663233453.906987275 Timestam:1663233454.923443840 1663233453.906987275

Can you test the backup after setting:

status=$(gluster volume get <VOLUME_NAME> cluster.choose-local | awk '/choose-local/ {print $2}')
gluster volume set <VOLUME_NAME> cluster.choose-local true

And after the test:

gluster volume set <VOLUME_NAME> cluster.choose-local $status

Best Regards,
Strahil Nikolov

The current setting is:

[root@ovirt-node2 ~]# gluster volume get glen cluster.choose-local | awk '/choose-local/ {print $2}'
off
[root@ovirt-node2 ~]# gluster volume get gv0 cluster.choose-local | awk '/choose-local/ {print $2}'
off
[root@ovirt-node2 ~]# gluster volume get gv1 cluster.choose-local | awk '/choose-local/ {print $2}'
off

It is stated in the "virt" group: /var/lib/glusterd/groups/virt:cluster.choose-local=off

I set cluster.choose-local to true on every gluster volume and started migrating the Hosted Engine around... a bunch of VMs froze and after a while the Hosted Engine hung as well....

To complete the picture, here is the full option set for glen (the Hosted Engine volume), gv0 and gv1 (the volumes used by VMs):

[root@ovirt-node3 ~]# gluster volume info gv1

Volume Name: gv1
Type: Replicate
Volume ID: 863221f4-e11c-4589-95e9-aa3948e177f5
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt-node2.ovirt:/brickgv1/gv1
Brick2: ovirt-node3.ovirt:/brickgv1/gv1
Brick3: ovirt-node4.ovirt:/dati/gv1 (arbiter)
Options Reconfigured:
storage.build-pgfid: off
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 30
performance.client-io-threads: on
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: true
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
transport.address-family: inet
nfs.disable: on
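For reference, the whole recommended option set from that group file can be (re)applied in one go; a hedged sketch, assuming the stock "virt" group file shipped with glusterfs-server:

# apply the virtualization profile (includes cluster.choose-local=off) to a volume
gluster volume set gv1 group virt
# verify the value afterwards
gluster volume get gv1 cluster.choose-local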

During this time (Hosted Engine hung), this appears on the host where the Hosted Engine is supposed to be running:

2022-09-15 13:59:27,762+0000 WARN (Thread-10) [virt.vm] (vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutdown by QEMU Guest Agent failed (agent probably inactive) (vm:5490)
2022-09-15 13:59:27,762+0000 WARN (Thread-10) [virt.vm] (vmId='8486ed73-df34-4c58-bfdc-7025dec63b7f') Shutting down with guest agent FAILED (vmpowerdown:115)
2022-09-15 13:59:28,780+0000 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f001d318ef0>> operation failed (periodic:204)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 202, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 493, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 814, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable

Set it back to the original value. The option picks the local brick for reading instead of the fastest one (which could be either a remote or a local one), which could help with bandwidth issues.

Can you provide details about the bricks, like HW RAID/JBOD, RAID type (0, 5, 6, 10), stripe size, stripe width, filesystem (I expect XFS but it's nice to know), etc.?

Also share the gluster client log from the node where the backup proxy is. It should be something like: /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<host>:_gv1.log

Best Regards,
Strahil Nikolov

Parameter cluster.choose-local is set back to off. I confirm the filesystems of the bricks are all XFS, as required. I started this farm only as a test bench of an oVirt implementation, so I used 3 hosts based on Ryzen 5 desktop hardware, each with 4 DDR modules (4 x 32GB), one disk for the OS and the others used as data bricks or NFS targets, all SATA based, while the OS is installed on an internal M.2 disk. node4, which doesn't need much space since it is the arbiter, uses only the internal M.2 disk. Every host is equipped with a dual-channel Intel X520 chipset with 2 SFP+ ports configured with a 9000 packet size. The access LAN is the management LAN (and also the LAN used by gluster); the VLANs are the "production" VLANs.

node2:
/dev/mapper/glustervg-glhe on /brickhe type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
Devices file mpath_uuid part4-mpath-Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y PVID 9iyl5761LWcy3AYy36fNcPk0fADjNYtC last seen on /dev/mapper/Samsung_SSD_860_EVO_500GB_S4XBNF0N334942Y4 not found.
PV /dev/mapper/Samsung_SSD_870_EVO_4TB_S6BCNG0R300064E VG glustervg lvm2 [<3.64 TiB / 1.54 TiB free]
PV /dev/nvme0n1p4 VG glustervg lvm2 [<287.02 GiB / <287.02 GiB free]
PV /dev/nvme0n1p3 VG onn_ovirt-node2 lvm2 [177.15 GiB / <33.71 GiB free]
PV /dev/mapper/ST4000NM000A-2HZ100_WJG1ZC85 VG daticold lvm2 [<3.64 TiB / 2.44 TiB free]

node3:
/dev/mapper/glustervg-glhe on /brickhe type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv1 on /brickgv1 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/mapper/glustervg-gv0 on /brickgv0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
[root@ovirt-node3 ~]# pvscan -v
PV /dev/sda VG glustervg lvm2 [<3.64 TiB / <1.64 TiB free]
PV /dev/nvme0n1p4 VG glustervg lvm2 [<287.02 GiB / <187.02 GiB free]
PV /dev/nvme0n1p3 VG onn_ovirt-node3 lvm2 [177.15 GiB / <33.71 GiB free]
Total: 3 [4.09 TiB] / in use: 3 [4.09 TiB] / in no VG: 0 [0 ]

node4:
/dev/mapper/onn_ovirt--node4-gluster on /dati type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=64k,sunit=128,swidth=128,noquota)
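Given the consumer-grade SSDs, a crude way to check how they behave under small synchronous writes (which is roughly what sanlock lease renewals and storage-domain metadata updates look like) is a direct, synced dd on the brick filesystem. A sketch only: it writes a scratch file at the mount point root, outside the /brickgv1/gv1 brick directory, so no gluster data is touched:

dd if=/dev/zero of=/brickgv1/ddtest.tmp bs=4k count=1000 oflag=direct,dsync
rm -f /brickgv1/ddtest.tmp

If the reported rate collapses to a few hundred KB/s, the drive is struggling with synchronous writes and the cache is the likely bottleneck.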

I hope you realize that modern consumer SSDs have a small cache (at least according to https://www.storagereview.com/review/samsung-860-evo-ssd-review ), so we can't rule out the disks.

Use gluster's top command to view the read (17.2.6) and write (17.2.7) performance of the bricks before (regular usage), during (high load) and after the backup (regular usage): https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/ht...

Best Regards,
Strahil Nikolov
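A hedged sketch of what those checks could look like on this setup (volume and brick names taken from the gv1 layout above; block size and count are arbitrary and just need to be kept identical across runs):

# sample the brick's raw read and write throughput as seen by gluster
gluster volume top gv1 read-perf bs 4096 count 1000 brick ovirt-node2.ovirt:/brickgv1/gv1 list-cnt 10
gluster volume top gv1 write-perf bs 4096 count 1000 brick ovirt-node2.ovirt:/brickgv1/gv1 list-cnt 10

Running the same two commands before, during and after the backup makes the before/during/after comparison meaningful.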

I tried to measure the I/O using gluster volume top, but its results seem very cryptic to me (they need a deeper analysis and I don't have the time now). Thank you very much for your analysis. If I understood correctly, the problem is that the consumer SSD cache is too weak to help when even a small number (~15) of not particularly I/O-intensive VMs is running, so I/O hangs because the performance is poor, and this hangs the VMs: the VM kernel thinks the CPU has hung and so it crashes. This seems to be the case....

If possible, it would be very useful to have a sort of profiler in the gluster environment that highlights issues related to the speed of the underlying storage infrastructure, whether the problem is in the disks or in the network; in any case the errors reported to the user are rather misleading, as they suggest a data integrity issue ("cannot read..." or something like this).

Only for reference, these are the first lines of the "open" top command (currently I don't experience problems):

[root@ovirt-node2 ~]# gluster volume top gv1 open
Brick: ovirt-node2.ovirt:/brickgv1/gv1
Current open fds: 15, Max open fds: 38, Max openfd time: 2022-09-19 07:27:20.033304 +0000
Count filename
=======================
331763 /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/inbox
66284 /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/leases
53939 /45b4f14c-8323-482f-90ab-99d8fd610018/dom_md/metadata.new
169 /45b4f14c-8323-482f-90ab-99d8fd610018/images/910fa026-d30b-4be2-9111-3c9f4f646fde/b7d6f39a-1481-4f5c-84fd-fc43f9e14d71
[...]
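On the "sort of profiler" wish: gluster does ship a per-FOP latency profiler; a hedged sketch of using it around a backup window (it adds some overhead, so it is usually switched off afterwards):

gluster volume profile gv1 start
# ... run the backup or wait for a hang ...
gluster volume profile gv1 info      # per-brick call counts and avg/min/max latency per FOP
gluster volume profile gv1 stop

High WRITE/FSYNC latencies here would point at the bricks, while long LOOKUP/READ latencies with low brick load would point more toward the network.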

It's just a guess until there is proof in gluster's top read-perf/write-perf. Can you share at least the read-perf? I'm pretty confident that the issue is not network-related, as cluster.choose-local requires all reads to be local (reducing the network usage).

Best Regards,
Strahil Nikolov

I did it following the {read,write}-perf examples reported in paragraphs 12.6 and 12.7 of https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/ht...; here are the results: https://cloud.ssis.sm/index.php/s/9bncnNSopnFReRS
participants (2)
- Diego Ercolani
- Strahil Nikolov