Forwarding to oVirt users list.

---------- Forwarded message ---------
From: <srivathsa.puliyala@dunami.com>
Date: Wed, May 6, 2020 at 12:01 PM
Subject: Ovirt host GetGlusterVolumeHealInfoVDS failed events
To: <infra@ovirt.org>

Hi,

We have an oVirt cluster with 4 hosts, with the hosted engine running on one of them (all the nodes provide storage with GlusterFS).
Currently there are 53 VMs running.
The version of the oVirt-Engine is 4.2.8.2-1.el7 and GlusterFS is 3.12.15.

For the past week, we have been seeing multiple events in the oVirt UI about GetGlusterVolumeHealInfoVDS. They come from all the nodes in no particular order, at roughly one ERROR event every ~13 minutes.

Sample from the Events dashboard:
May 4, 2020, 2:32:14 PM - Status of host <host-1> was set to Up.
May 4, 2020, 2:32:11 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 2:31:55 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:31:55 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 2:19:14 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:19:12 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:18:49 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:18:49 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 2:05:55 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:05:54 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:05:35 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:05:35 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:52:45 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:52:44 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:52:22 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:52:22 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:39:11 PM - Status of host <host-4> was set to Up.
May 4, 2020, 1:39:11 PM - Manually synced the storage devices from host <host-4>
May 4, 2020, 1:39:11 PM - Host <host-4> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:39:11 PM - VDSM <host-4> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:26:29 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:26:28 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:26:11 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:26:11 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:13:10 PM - Status of host <host-1> was set to Up.
May 4, 2020, 1:13:08 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 1:12:51 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:12:51 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
and so on.....
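
As a rough check on the ~13-minute figure, the gaps between consecutive failures can be computed from the sample above. This is only a sketch: it assumes the event lines are pasted as plain text in exactly the format shown, and the names EVENTS, failure_times and gaps_in_seconds are made up for this illustration (they are not part of oVirt or VDSM). Only the failure lines from the sample are included.

```python
from datetime import datetime

# Failure lines copied from the sample above (hypothetical variable name).
EVENTS = """\
May 4, 2020, 2:31:55 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:18:49 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:05:35 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:52:22 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:39:11 PM - VDSM <host-4> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:26:11 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:12:51 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
"""

# Timestamp layout as shown in the sample; assumes an English locale.
STAMP_FORMAT = "%b %d, %Y, %I:%M:%S %p"


def failure_times(text):
    """Return the sorted timestamps of all heal-info failure events."""
    times = []
    for line in text.splitlines():
        if "GetGlusterVolumeHealInfoVDS failed" not in line:
            continue
        # Everything before the first " - " is the timestamp.
        stamp, _, _ = line.partition(" - ")
        times.append(datetime.strptime(stamp.strip(), STAMP_FORMAT))
    return sorted(times)


def gaps_in_seconds(times):
    """Return the gap between each pair of consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


times = failure_times(EVENTS)
gaps = gaps_in_seconds(times)
for gap in gaps:
    print(f"{gap // 60} min {gap % 60} s")
```

For this sample the gaps all fall between 13 min 0 s and 13 min 20 s, regardless of which host reports the failure, which suggests a periodic trigger rather than random network drops.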

When I look at the Compute > Hosts dashboard, I see the host status change to DOWN when the VDSM event (GetGlusterVolumeHealInfoVDS failed) appears, and the status is automatically set back to UP almost immediately.
FYI: while the host status is DOWN, the VMs running on that host are not migrated, and everything keeps running fine.

This is happening all day. Is there anything I can do to troubleshoot it? I would appreciate your comments.
_______________________________________________
Infra mailing list -- infra@ovirt.org
To unsubscribe send an email to infra-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/GNE3QC7GLEER4ZPHGP3H6M27DPSKCQO3/

Anton Marchukov
Associate Manager - RHV DevOps - Red Hat