oVirt host GetGlusterVolumeHealInfoVDS failed events

Hi,

We have an oVirt cluster with 4 hosts and the hosted engine running on one of them (all the nodes provide the storage via GlusterFS). There are currently 53 VMs running. The oVirt Engine version is 4.2.8.2-1.el7 and GlusterFS is 3.12.15.

For the past week we have been seeing multiple GetGlusterVolumeHealInfoVDS failure events in the oVirt UI, raised randomly from all of the nodes, roughly one ERROR event every ~13 minutes.

Sample from the Events dashboard:

May 4, 2020, 2:32:14 PM - Status of host <host-1> was set to Up.
May 4, 2020, 2:32:11 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 2:31:55 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:31:55 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:19:14 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:19:12 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:18:49 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:18:49 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 2:05:55 PM - Status of host <host-2> was set to Up.
May 4, 2020, 2:05:54 PM - Manually synced the storage devices from host <host-2>
May 4, 2020, 2:05:35 PM - Host <host-2> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 2:05:35 PM - VDSM <host-2> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:52:45 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:52:44 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:52:22 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:52:22 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:39:11 PM - Status of host <host-4> was set to Up.
May 4, 2020, 1:39:11 PM - Manually synced the storage devices from host <host-4>
May 4, 2020, 1:39:11 PM - Host <host-4> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:39:11 PM - VDSM <host-4> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:26:29 PM - Status of host <host-3> was set to Up.
May 4, 2020, 1:26:28 PM - Manually synced the storage devices from host <host-3>
May 4, 2020, 1:26:11 PM - Host <host-3> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:26:11 PM - VDSM <host-3> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues
May 4, 2020, 1:13:10 PM - Status of host <host-1> was set to Up.
May 4, 2020, 1:13:08 PM - Manually synced the storage devices from host <host-1>
May 4, 2020, 1:12:51 PM - Host <host-1> is not responding. Host cannot be fenced automatically because power management for the host is disabled.
May 4, 2020, 1:12:51 PM - VDSM <host-1> command GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by communication issues

and so on...
In the Compute > Hosts dashboard, the host status flips to DOWN at the moment the GetGlusterVolumeHealInfoVDS failure event is raised, and is set back to UP almost immediately afterwards. FYI: while a host shows DOWN, the VMs running on it do not migrate and everything keeps running perfectly fine. This happens all day. Is there something I can troubleshoot? I would appreciate your comments.
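Since the failing call is the gluster heal-info query, one thing worth measuring first is how long that query actually takes on each host when run by hand. This is only a sketch, assuming the volume name vm_gv0 that appears in the VDSM traceback further down (repeat for each of your volumes); the commands are standard gluster CLI:

  # time roughly the same query that VDSM runs for GetGlusterVolumeHealInfoVDS
  time gluster volume heal vm_gv0 info

  # count of entries pending heal per brick, much cheaper than the full listing
  gluster volume heal vm_gv0 statistics heal-count

  # confirm all bricks and self-heal daemons are online
  gluster volume status vm_gv0

If the first command takes minutes rather than seconds, the VDSM worker running it stays blocked well past its 60-second timeout, which would line up with the "Worker blocked ... duration=7980" warning shown below.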

Ping looks fine in both directions (engine to host and host to engine). While troubleshooting further through the logs, I found the errors below in various files:

###################
VDSM <host> command Get Host Capabilities failed: Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80}

############
May 9 03:54:04 <host> vdsm[26934]: WARN Worker blocked: <Worker name=jsonrpc/4 running <Task <JsonRpcTask {'params': {u'volumeName': u'vm_gv0'}, 'jsonrpc': '2.0', 'method': u'GlusterVolume.healInfo', 'id': u'f4e56ab9-6916-4938-821a-1b9aab2ef162'} at 0x7fb886fd8dd0> timeout=60, duration=7980 at 0x7fb886edc910> task#=14247 at 0x7fb8a4035450>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 129, in healInfo
  return self._gluster.volumeHealInfo(volumeName)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
  rv = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 776, in volumeHealInfo
  return {'healInfo': self.svdsmProxy.glusterVolumeHealInfo(volumeName)}
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in glusterVolumeHealInfo
File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
  kind, result = conn.recv()

#########
cat /var/log/messages | grep 'database connection failed'
May 9 07:25:59 <host> ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

#######
/var/log/ovirt-hosted-engine-ha/agent.log
MainThread::ERROR::2020-05-09 11:32:33,089::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2020-05-09 11:32:33,089::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2020-05-09 11:32:43,926::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.16 started
MainThread::INFO::2020-05-09 11:32:43,984::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: <hostname>
MainThread::ERROR::2020-05-09 11:33:49,369::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 412, in start_monitoring
    self._initialize_vdsm()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 569, in _initialize_vdsm
    logger=self._log
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 468, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 411, in __vdsm_json_rpc_connect
    timeout=VDSM_MAX_RETRY * VDSM_DELAY
RuntimeError: Couldn't connect to VDSM within 60 seconds
MainThread::ERROR::2020-05-09 11:33:49,371::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2020-05-09 11:33:49,371::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2020-05-09 11:34:00,216::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.16 started
MainThread::INFO::2020-05-09 11:34:00,326::hosted_engine::244::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Found certificate common name: <hostname>
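The "Worker blocked" warning shows a GlusterVolume.healInfo task that had already been running for 7980 seconds and is sitting in conn.recv(), i.e. waiting on supervdsmd, and the "Too many tasks" / 'current_tasks': 80 error suggests these stuck calls are exhausting the jsonrpc worker pool, which could also explain why ovirt-ha-agent cannot reach VDSM within 60 seconds. A few things I would look at on an affected host; this is only a sketch, assuming the standard oVirt 4.2 log paths and service names:

  # how often workers get stuck, and on which methods
  grep 'Worker blocked' /var/log/vdsm/vdsm.log | tail -n 20

  # supervdsmd is the process that actually runs the gluster CLI for healInfo
  tail -n 100 /var/log/vdsm/supervdsm.log
  journalctl -u vdsmd -u supervdsmd --since "1 hour ago"

  # look for gluster heal-info processes that have been running for a long time
  ps -eo pid,etime,cmd | grep '[g]luster volume heal'

On the engine side it may also be worth checking how often the engine polls heal info; the polling interval may be tunable via engine-config, but I am not sure of the exact option name, so list and grep rather than trusting a guess:

  engine-config -l | grep -i gluster
  # then: engine-config -g <OptionName>   (option name to be confirmed from the listing above)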