Possible vdsmd issues - "WARN Worker blocked"

On Fri, Aug 31, 2018 at 2:26 AM Johnson, Tim <tjj@uic.edu> wrote:
Hi all,
We have been having a rather strange issue with starting the self-hosted engine: the vdsm and ovirt-ha processes are throwing errors and cannot be started.
We are running oVirt 4.2 with Gluster storage (not hyperconverged): Gluster 3.12.13 replicated volumes with an arbiter brick.
The storage is mounted on all of the oVirt hosts, and we can read and touch files on the mounts.
Unfortunately we cannot pin down when this started happening, so we don't know whether it began after a specific update or some other event.
vdsmd is throwing errors (see the logs and package versions at the bottom), and when we check the status of the self-hosted engine we get:
# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
Thanks for any assistance.
systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Thu 2018-08-30 13:24:40 CDT; 2s ago
Process: 17135 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
Main PID: 17135 (code=exited, status=157)
systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-08-30 13:10:35 CDT; 14min ago
Main PID: 14863 (ovirt-ha-broker)
Tasks: 4
CGroup: /system.slice/ovirt-ha-broker.service
└─14863 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker
Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to update state.
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
if (self._status_broker._inquire_whiteboard_lock() or
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
self.host_id, self._lease_file)
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 128, in host_id
raise ex.HostIdNotLockedError("Host id is not set")
HostIdNotLockedError: Host id is not set
Looks like the hosted engine agent fails to inquire the sanlock lease. This is most likely an issue with storage or SELinux. Can you attach /var/log/sanlock.log and /var/log/messages?
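For anyone hitting the same symptom, here is a rough diagnostic checklist (not part of the original thread; it assumes the default sanlock and audit log locations) to see whether sanlock is running, whether it ever acquired the hosted-engine lockspace, and whether SELinux is denying it:
# is sanlock running, and does it hold any lockspaces?
systemctl status sanlock -l
sanlock client status
# recent lease acquisition errors, if any
tail -n 50 /var/log/sanlock.log
# is SELinux enforcing, and are there recent denials involving sanlock?
getenforce
ausearch -m avc -ts recent | grep -i sanlock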
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Trying to restart the broker
Aug 30 13:19:42 rrb-vmhost1.rrc.local python[14863]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
Aug 30 13:19:48 rrb-vmhost1.rrc.local python[14863]: communication with ABRT daemon failed: timed out
glusterfs-api-3.12.13-1.el7.x86_64
glusterfs-server-3.12.13-1.el7.x86_64
glusterfs-fuse-3.12.13-1.el7.x86_64
vdsm-gluster-4.20.35-1.el7.x86_64
centos-release-gluster312-1.0-2.el7.centos.noarch
glusterfs-client-xlators-3.12.13-1.el7.x86_64
glusterfs-geo-replication-3.12.13-1.el7.x86_64
glusterfs-libs-3.12.13-1.el7.x86_64
glusterfs-cli-3.12.13-1.el7.x86_64
glusterfs-rdma-3.12.13-1.el7.x86_64
glusterfs-events-3.12.13-1.el7.x86_64
glusterfs-3.12.13-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.13-1.el7.x86_64
glusterfs-gnfs-3.12.13-1.el7.x86_64
ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-iso-uploader-4.2.0-1.el7.centos.noarch
ovirt-engine-setup-base-4.2.5.3-1.el7.noarch
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-guest-agent-common-1.0.14-1.el7.noarch
python-ovirt-engine-sdk4-4.2.7-2.el7.x86_64
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.5.3-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-engine-tools-backup-4.2.5.3-1.el7.noarch
ovirt-guest-agent-windows-1.0.14-1.el7.centos.noarch
cockpit-ovirt-dashboard-0.11.31-1.el7.noarch
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-engine-lib-4.2.5.3-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-guest-tools-iso-4.2-1.el7.centos.noarch
ovirt-imageio-common-1.4.2-0.el7.noarch
ovirt-hosted-engine-setup-2.2.25-1.el7.noarch
ovirt-imageio-daemon-1.4.2-0.el7.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.13-1.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
vdsm-jsonrpc-4.20.35-1.el7.noarch
vdsm-hook-ethtool-options-4.20.35-1.el7.noarch
vdsm-gluster-4.20.35-1.el7.x86_64
vdsm-api-4.20.35-1.el7.noarch
vdsm-yajsonrpc-4.20.35-1.el7.noarch
vdsm-client-4.20.35-1.el7.noarch
vdsm-hook-fcoe-4.20.35-1.el7.noarch
vdsm-network-4.20.35-1.el7.x86_64
vdsm-hook-vmfex-dev-4.20.35-1.el7.noarch
vdsm-hook-vhostmd-4.20.35-1.el7.noarch
vdsm-http-4.20.35-1.el7.noarch
vdsm-hook-openstacknet-4.20.35-1.el7.noarch
vdsm-python-4.20.35-1.el7.noarch
vdsm-hook-vfio-mdev-4.20.35-1.el7.noarch
vdsm-common-4.20.35-1.el7.noarch
vdsm-4.20.35-1.el7.x86_64
Aug 30 13:02:42 rrb-vmhost1.rrc.local vdsm[6792]: WARN Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f713c0b0fd0> at 0x7f713c063050> timeout=15, duration=75 at 0x7f7170698c90> task#=354 at 0x7f713c10ec90>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 580, in __call__
stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats
ret['haStats'] = _getHaInfo()
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
stats = instance.get_all_stats()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
stats = broker.get_stats_from_storage()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
result = self._proxy.get_stats()
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
return self.__send(self.__name, args)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
verbose=self.__verbose
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
response = h.getresponse(buffering=True)
File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
response.begin()
File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
version, status, reason = self._read_status()
File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File: "/usr/lib64/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
This shows that the hosted engine agent does not respond to vdsm. The root cause is on the hosted engine side, but vdsm should not let workers block on accessing the agent. It should instead time out after a short delay, or move this monitoring to a separate thread, so that it does not block other tasks that depend on this thread pool. Please file a vdsm bug for this.
Nir
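To illustrate the timeout idea, here is a minimal sketch of the approach Nir describes, not actual vdsm code; the helper name call_with_timeout and the 5-second deadline are invented for the example. The blocking broker call would be wrapped so the periodic worker gives up after a deadline instead of hanging on the socket read:

import threading

def call_with_timeout(func, timeout, default=None):
    # Run func() in a helper thread; if it has not finished within
    # `timeout` seconds, return `default` instead of blocking the
    # calling worker thread indefinitely.
    result = {}

    def target():
        try:
            result['value'] = func()
        except Exception as err:
            result['error'] = err

    helper = threading.Thread(target=target)
    helper.daemon = True  # a stuck RPC must not keep the process alive
    helper.start()
    helper.join(timeout)
    if helper.is_alive() or 'error' in result:
        return default
    return result['value']

# Illustrative use in the periodic sampling path:
# ha_stats = call_with_timeout(broker.get_stats_from_storage, 5, default={})

The real fix belongs in vdsm itself (or in moving HA monitoring off the shared executor pool), which is why filing a bug as suggested is the right next step.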
Participants (2):
- Johnson, Tim
- Nir Soffer