On May 6, 2020 1:21:07 PM GMT+03:00, Anton Marchukov <amarchuk(a)redhat.com> wrote:
Forwarding to oVirt users list.
---------- Forwarded message ---------
From: <srivathsa.puliyala(a)dunami.com>
Date: Wed, May 6, 2020 at 12:01 PM
Subject: Ovirt host GetGlusterVolumeHealInfoVDS failed events
To: <infra(a)ovirt.org>
Hi,
We have an oVirt cluster with 4 hosts and the hosted engine running on one of
them (all of the nodes provide storage with GlusterFS).
Currently there are 53 VMs running.
The version of the oVirt Engine is 4.2.8.2-1.el7 and GlusterFS is 3.12.15.
For the past week we have had multiple events popping up in the oVirt UI
about GetGlusterVolumeHealInfoVDS from all of the nodes at random, roughly
one ERROR event every ~13 minutes.
Sample Event dashboard example:
May 4, 2020, 2:32:14 PM - Status of host <host-1> was set to Up.
May 4, 2020, 2:32:11 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 2:31:55 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:31:55 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:19:14 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:19:12 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:18:49 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:18:49 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:05:55 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:05:54 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:05:35 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:05:35 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:52:45 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:52:44 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:52:22 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:52:22 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:39:11 PM - Status of host <host-4> was set to Up.
May 4, 2020, 1:39:11 PM - Manually synced the storage devices from host <host-4>
May 4, 2020, 1:39:11 PM - Host <host-4> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:39:11 PM - VDSM <host-4> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:26:29 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:26:28 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:26:11 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:26:11 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:13:10 PM - Status of host <host-1> was set to Up.
May 4, 2020, 1:13:08 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 1:12:51 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:12:51 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
and so on.....
When I look at the Compute > Hosts dashboard, I see the host status set to
DOWN when the VDSM event (GetGlusterVolumeHealInfoVDS failed) appears, and
the host status is automatically set back to UP almost immediately.
FYI: while the host status is DOWN, the VMs running on that host are not
migrated and everything keeps running perfectly fine.
This is happening all day. Is there something I can troubleshoot?
Appreciate your comments.
Hi Srivathsa,
Based on the logs, I have the feeling that you have some communication problems there.
Could you check:
1. System load and bandwidth utilization on one of the affected nodes
2. Log in to one of the hosts and run ping (to the engine) in a 'screen' or 'tmux' session for a longer period
3. Run ping from the engine to each of the hosts (in separate 'screen' or 'tmux' sessions) and store that data in separate files (example commands below)
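For example, something along these lines (a rough sketch only; 'ovirt-engine.example.com' and the host names are placeholders for your own FQDNs, and 'sar' needs the sysstat package installed):

  # 1. Load and bandwidth on an affected node
  uptime
  top -b -n 1 | head -20
  sar -n DEV 1 10

  # 2. On a host, ping the engine with timestamps, inside 'screen' or 'tmux'
  ping -D ovirt-engine.example.com | tee /var/tmp/ping_engine_$(hostname -s).log

  # 3. On the engine, one detached session per host, each writing to its own file
  for h in host-1 host-2 host-3 host-4; do
      screen -dmS ping_$h bash -c "ping -D $h | tee /var/tmp/ping_$h.log"
  done

The timestamped ping logs ('-D') should show whether packet loss or latency spikes line up with the ~13 minute GetGlusterVolumeHealInfoVDS timeouts you see in the UI.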
Best Regards,
Strahil Nikolov