Hi everyone,

We have an oVirt cluster with 5 nodes; 3 of them provide the storage via GlusterFS with replica 3.
The cluster runs 87 VMs and has 9 TB of storage, of which 4 TB is in use.
The oVirt Engine version is 4.1.8.2 and GlusterFS is 3.8.15.
The servers run in an HP BladeCenter and are connected to each other via 10 GbE.

Currently we have the problem that all oVirt nodes periodically stop responding in the cluster, with the following error messages in the oVirt web interface:

VDSM glustervirt05 command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
Host glustervirt05 is not responding. It will stay in Connecting state for a grace period of 68 seconds and after that an attempt to fence the host will be issued.
Host glustervirt05 does not enforce SELinux. Current status: PERMISSIVE
Executing power management status on Host glustervirt05 using Proxy Host glustervirt02 and Fence Agent ilo4:xxx.xxx.xxx.xxx.
Manually synced the storage devices from host glustervirt05
Status of host glustervirt05 was set to Up.


In the vdsm log file I can find the following message:
2019-11-26 11:18:22,909+0100 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'volumeName': u'data'}, 'jsonrpc': '2.0', 'method': u'GlusterVolume.healInfo', 'id': u'2e86ed2c-3e79-42c1-a7e4-c09bfbfc7794'} at 0x7fb938373190> timeout=60, duration=180 at 0x316a6d0> task#=2859802 at 0x1b70dd0> (executor:351)
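
This seems to be the same call that shows up as GetGlusterVolumeHealInfoVDS in the web interface. If I read the log entry correctly, it should be reproducible by hand on a node via vdsm-client (namespace, method and parameter names are taken from the log above; the exact invocation is my assumption, and vdsm-client may not be present on every 4.1 node):

[root@glustervirt01 ~]# time vdsm-client GlusterVolume healInfo volumeName=data

If that also takes around 3 minutes, the 60 second worker timeout from the log is simply exceeded by the slow heal info call.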


And I figured out that the gluster heal info command takes very long:
[root@glustervirt01 ~]# time gluster volume heal data info
Brick glustervirt01:/gluster/data/brick1
Status: Connected
Number of entries: 0

Brick glustervirt02:/gluster/data/brick1
Status: Connected
Number of entries: 0

Brick glustervirt03:/gluster/data/brick2
Status: Connected
Number of entries: 0


real 3m3.626s
user 0m0.593s
sys 0m0.559s
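
To narrow down where those 3 minutes are spent, I could imagine temporarily enabling volume profiling to get per-brick latency and FOP statistics while the heal info command runs (just a sketch; profiling adds some overhead, so I would only keep it on briefly):

[root@glustervirt01 ~]# gluster volume profile data start
[root@glustervirt01 ~]# time gluster volume heal data info
[root@glustervirt01 ~]# gluster volume profile data info
[root@glustervirt01 ~]# gluster volume profile data stop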


Another strange behavior is that one virtual machine (a PostgreSQL database) stops running unexpectedly every one or two days ...
The only thing that was changed on this VM recently was a resize of its disk.
VM replication-zabbix is down with error. Exit message: Lost connection with qemu process.
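
To find out why the qemu process goes away, I would check the qemu log of that VM on the host it was running on at the time (the path below is the standard libvirt location and the file name should match the VM name, but that is an assumption on my part):

less /var/log/libvirt/qemu/replication-zabbix.log

That log should at least show whether qemu crashed on its own or lost access to its storage.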

And when we add or delete a larger disk of approximately 100 GB on GlusterFS, the Gluster cluster freaks out and won't respond anymore.
This also results in paused VMs ...
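
I am assuming the data volume has sharding enabled, as is usual for oVirt/Gluster hyperconverged setups (an assumption on my part, I have not pasted the volume options here). In that case creating or deleting a 100 GB image means touching a large number of shard files at once, which might explain the load. The relevant settings can be checked with:

[root@glustervirt01 ~]# gluster volume get data features.shard
[root@glustervirt01 ~]# gluster volume get data features.shard-block-size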


Does anyone have an idea what could cause such problems?