Fwd: Ovirt host GetGlusterVolumeHealInfoVDS failed events


On May 6, 2020 1:21:07 PM GMT+03:00, Anton Marchukov <amarchuk@redhat.com> wrote:
Forwarding to oVirt users list.
---------- Forwarded message ---------
From: <srivathsa.puliyala@dunami.com>
Date: Wed, May 6, 2020 at 12:01 PM
Subject: Ovirt host GetGlusterVolumeHealInfoVDS failed events
To: <infra@ovirt.org>
Hi,
We have an oVirt cluster with 4 hosts and the hosted engine running on one of them (all of the nodes provide the storage with GlusterFS). Currently there are 53 VMs running. The oVirt Engine version is 4.2.8.2-1.el7 and GlusterFS is 3.12.15.
For the past week we have had multiple events popping up in the oVirt UI about GetGlusterVolumeHealInfoVDS, coming from all of the nodes at random, roughly one ERROR event every ~13 minutes.
Sample event dashboard example:

May 4, 2020, 2:32:14 PM - Status of host <host-1> was set to Up.
May 4, 2020, 2:32:11 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 2:31:55 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:31:55 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 2:19:14 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:19:12 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:18:49 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:18:49 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 2:05:55 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:05:54 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:05:35 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:05:35 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:52:45 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:52:44 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:52:22 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:52:22 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:39:11 PM - Status of host <host-4> was set to Up.
May 4, 2020, 1:39:11 PM - Manually synced the storage devices from host <host-4>
May 4, 2020, 1:39:11 PM - Host <host-4> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:39:11 PM - VDSM <host-4> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:26:29 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:26:28 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:26:11 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:26:11 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

May 4, 2020, 1:13:10 PM - Status of host <host-1> was set to Up.
May 4, 2020, 1:13:08 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 1:12:51 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:12:51 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

...and so on.
When I look at the Compute > Hosts dashboard, I see the host status go to DOWN when the VDSM event (GetGlusterVolumeHealInfoVDS failed) pops up, and the host status is automatically set back to UP almost immediately. FYI: while the host status is DOWN, the VMs running on that host do not migrate and everything keeps running perfectly fine.
This is happening all day. Is there something I can troubleshoot? I appreciate your comments.
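One related check: as far as I can tell, the heal-info poll behind GetGlusterVolumeHealInfoVDS ultimately runs 'gluster volume heal <VOLNAME> info' on the host, so timing that command directly on one of the hosts shows whether the delay comes from Gluster itself rather than from the engine-to-host connection. A rough sketch in Python; the volume name is only a placeholder:

    #!/usr/bin/env python3
    # Rough check: time the heal-info query that the engine polls.
    # Run on one of the hosts; "data" is only a placeholder volume name.
    import subprocess
    import time

    VOLUMES = ["data"]  # replace with the actual Gluster volume name(s)

    for vol in VOLUMES:
        start = time.monotonic()
        proc = subprocess.run(
            ["gluster", "volume", "heal", vol, "info"],
            capture_output=True, text=True,
        )
        elapsed = time.monotonic() - start
        bricks = proc.stdout.count("Number of entries:")
        print(f"volume={vol} rc={proc.returncode} "
              f"took={elapsed:.1f}s bricks_reporting={bricks}")
        # If this regularly takes minutes, the engine-side call is a
        # likely candidate for the "Message timeout" events above.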
Hi Srivathsa,

Based on the logs I have the feeling that you have some communication problems there. Could you check:

1. System load and bandwidth utilization on one of the affected nodes
2. Log in to one of the hosts and run ping (to the engine) in a 'screen' or 'tmux' session for a longer period
3. Run ping from the engine to each of the hosts (in separate 'screen' or 'tmux' sessions) and store that data in separate files

Best Regards,
Strahil Nikolov
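For points 2 and 3, here is a minimal sketch of such a ping logger in Python; the target names are placeholders, so use the engine FQDN when running it on a host, or the four host FQDNs when running it on the engine. It is meant to be left running in a 'screen' or 'tmux' session as suggested:

    #!/usr/bin/env python3
    # Rough sketch: probe each target once per second and append
    # timestamped results to one log file per target, so latency spikes
    # or lost replies can be matched against the engine event times.
    import subprocess
    import time
    from datetime import datetime

    TARGETS = ["engine.example.com"]  # placeholder; list real FQDNs here
    INTERVAL = 1                      # seconds between probes

    while True:
        for target in TARGETS:
            proc = subprocess.run(
                ["ping", "-c", "1", "-W", "2", target],
                capture_output=True, text=True,
            )
            stamp = datetime.now().isoformat(timespec="seconds")
            if proc.returncode == 0:
                # keep only the "time=..." round-trip figure from ping
                rtt = next((tok for tok in proc.stdout.split()
                            if tok.startswith("time=")), "time=?")
                line = f"{stamp} {target} ok {rtt}"
            else:
                line = f"{stamp} {target} LOST"
            with open(f"ping-{target}.log", "a") as fh:
                fh.write(line + "\n")
        time.sleep(INTERVAL)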
participants (2)
-
Anton Marchukov
-
Strahil Nikolov