oVirt host GetGlusterVolumeHealInfoVDS failed events
by srivathsa.puliyala@dunami.com
Hi,
We have an oVirt cluster with 4 hosts and a hosted engine running on one of them (all the nodes provide the storage with GlusterFS).
Currently there are 53 VMs running.
The oVirt Engine version is 4.2.8.2-1.el7 and GlusterFS is 3.12.15.
For the past week we have had multiple events popping up in the oVirt UI about GetGlusterVolumeHealInfoVDS failures from all the nodes, seemingly at random, roughly one ERROR event every ~13 minutes.
Sample from the Events dashboard:
May 4, 2020, 2:32:14 PM - Status of host <host-1> was set to Up.
May 4, 2020, 2:32:11 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 2:31:55 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:31:55 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:19:14 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:19:12 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:18:49 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:18:49 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:05:55 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:05:54 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:05:35 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:05:35 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:52:45 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:52:44 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:52:22 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:52:22 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:39:11 PM - Status of host <host-4> was set to Up.
May 4, 2020, 1:39:11 PM - Manually synced the storage devices from host <host-4>
May 4, 2020, 1:39:11 PM - Host <host-4> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:39:11 PM - VDSM <host-4> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:26:29 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:26:28 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:26:11 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:26:11 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:13:10 PM - Status of host <host-1> was set to Up.
May 4, 2020, 1:13:08 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 1:12:51 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:12:51 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
and so on.....
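To confirm which hosts are affected and how often the failure recurs, I was thinking of pulling the GetGlusterVolumeHealInfoVDS lines out of engine.log with a rough sketch like the one below (the log path is the standard engine location on our setup; the script itself is just my own quick idea, not anything from the oVirt tooling):

#!/usr/bin/env python3
# Rough sketch: print the timestamps of GetGlusterVolumeHealInfoVDS entries
# in engine.log, to see how often the failures recur and which hosts appear.
# LOG_PATH is the standard oVirt engine log location; adjust if yours differs.
import re

LOG_PATH = "/var/log/ovirt-engine/engine.log"

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if "GetGlusterVolumeHealInfoVDS" in line:
            # Engine log lines normally start with "YYYY-MM-DD HH:MM:SS,mmm"
            match = re.match(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", line)
            stamp = match.group(0) if match else "unknown time"
            print(stamp, "-", line.strip()[:160])

From those timestamps it should be clear whether the ~13-minute cadence is real and whether it cycles through the four hosts in turn.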
When I look at the Compute > Hosts dashboard, I see the host status go DOWN at the moment the VDSM event (GetGlusterVolumeHealInfoVDS failed) appears, and the status is automatically set back to UP almost immediately.
FYI: while a host status is DOWN, the VMs running on that host do not migrate and everything keeps running perfectly fine.
This is happening all day. Is there something I can troubleshoot? I appreciate your comments.
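One thing I was going to check myself: since the error is a message timeout, my guess is that the underlying heal-info query is simply taking too long on the hosts. Here is a minimal sketch of what I plan to run directly on each host, assuming the VDSM verb wraps "gluster volume heal <VOLNAME> info" (the volume name below is a placeholder):

#!/usr/bin/env python3
# Rough sketch, run on a Gluster host: time the heal-info query to see
# whether it regularly runs longer than the engine's VDSM communication
# timeout (the vdsTimeout setting in engine-config, if I read it right).
# VOLUME is a placeholder; substitute the real volume name.
import subprocess
import time

VOLUME = "myvolume"  # placeholder volume name

start = time.monotonic()
result = subprocess.run(
    ["gluster", "volume", "heal", VOLUME, "info"],
    capture_output=True,
    text=True,
)
elapsed = time.monotonic() - start

print(f"heal info for {VOLUME}: exit code {result.returncode}, took {elapsed:.1f}s")
if result.returncode != 0:
    print(result.stderr.strip())

If the query itself takes several minutes on these volumes, that would explain the timeouts; otherwise I would start looking at the network between the engine and the hosts, since the event text mentions communication issues.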