Hi all,
We have had a bit of a strange issue with the starting of the self-hosted engine due to the vdsm and ovirt-ha processes having errors and not able to be started.
Running Ovirt version 4.2 with Gluster mounts (not hyper converged) gluster 3.12.13 replicated volumes with arbiter brick.
The mounts for the storage are mounted on all the Ovirt hosts and able to read/touch files on the mounts.
Unfortunately we can not pin a specific timeframe when this started happening. Meaning we don’t know if it started after a specific update or any other event.
The vdsmd is throwing errors (bottom – with version numbers) when trying to check the status of the self -hosted engine getting :
# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
Thanks for any assistance.
systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Thu 2018-08-30 13:24:40 CDT; 2s ago
Process: 17135 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
Main PID: 17135 (code=exited, status=157)
systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-08-30 13:10:35 CDT; 14min ago
Main PID: 14863 (ovirt-ha-broker)
Tasks: 4
CGroup: /system.slice/ovirt-ha-broker.service
└─14863 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker
Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2,
host_id=1): timeout during domain acquisition
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2,
host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2,
host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2,
host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to update state.
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
if (self._status_broker._inquire_whiteboard_lock() or
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
self.host_id, self._lease_file)
File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 128, in host_id
raise ex.HostIdNotLockedError("Host id is not set")
HostIdNotLockedError: Host id is not set
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Trying to restart the broker
Aug 30 13:19:42 rrb-vmhost1.rrc.local python[14863]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
Aug 30 13:19:48 rrb-vmhost1.rrc.local python[14863]: communication with ABRT daemon failed: timed out
glusterfs-api-3.12.13-1.el7.x86_64
glusterfs-server-3.12.13-1.el7.x86_64
glusterfs-fuse-3.12.13-1.el7.x86_64
vdsm-gluster-4.20.35-1.el7.x86_64
centos-release-gluster312-1.0-2.el7.centos.noarch
glusterfs-client-xlators-3.12.13-1.el7.x86_64
glusterfs-geo-replication-3.12.13-1.el7.x86_64
glusterfs-libs-3.12.13-1.el7.x86_64
glusterfs-cli-3.12.13-1.el7.x86_64
glusterfs-rdma-3.12.13-1.el7.x86_64
glusterfs-events-3.12.13-1.el7.x86_64
glusterfs-3.12.13-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.13-1.el7.x86_64
glusterfs-gnfs-3.12.13-1.el7.x86_64
ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-iso-uploader-4.2.0-1.el7.centos.noarch
ovirt-engine-setup-base-4.2.5.3-1.el7.noarch
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-guest-agent-common-1.0.14-1.el7.noarch
python-ovirt-engine-sdk4-4.2.7-2.el7.x86_64
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.5.3-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-engine-tools-backup-4.2.5.3-1.el7.noarch
ovirt-guest-agent-windows-1.0.14-1.el7.centos.noarch
cockpit-ovirt-dashboard-0.11.31-1.el7.noarch
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-engine-lib-4.2.5.3-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-guest-tools-iso-4.2-1.el7.centos.noarch
ovirt-imageio-common-1.4.2-0.el7.noarch
ovirt-hosted-engine-setup-2.2.25-1.el7.noarch
ovirt-imageio-daemon-1.4.2-0.el7.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.13-1.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
vdsm-jsonrpc-4.20.35-1.el7.noarch
vdsm-hook-ethtool-options-4.20.35-1.el7.noarch
vdsm-gluster-4.20.35-1.el7.x86_64
vdsm-api-4.20.35-1.el7.noarch
vdsm-yajsonrpc-4.20.35-1.el7.noarch
vdsm-client-4.20.35-1.el7.noarch
vdsm-hook-fcoe-4.20.35-1.el7.noarch
vdsm-network-4.20.35-1.el7.x86_64
vdsm-hook-vmfex-dev-4.20.35-1.el7.noarch
vdsm-hook-vhostmd-4.20.35-1.el7.noarch
vdsm-http-4.20.35-1.el7.noarch
vdsm-hook-openstacknet-4.20.35-1.el7.noarch
vdsm-python-4.20.35-1.el7.noarch
vdsm-hook-vfio-mdev-4.20.35-1.el7.noarch
vdsm-common-4.20.35-1.el7.noarch
vdsm-4.20.35-1.el7.x86_64
Aug 30 13:02:42 rrb-vmhost1.rrc.local vdsm[6792]: WARN Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f713c0b0fd0> at 0x7f713c063050> timeout=15,
duration=75 at 0x7f7170698c90> task#=354 at 0x7f713c10ec90>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 580, in __call__
stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats
ret['haStats'] = _getHaInfo()
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
stats = instance.get_all_stats()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
stats = broker.get_stats_from_storage()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
result = self._proxy.get_stats()
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
return self.__send(self.__name, args)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
verbose=self.__verbose
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
response = h.getresponse(buffering=True)
File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
response.begin()
File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
version, status, reason = self._read_status()
File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File: "/usr/lib64/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)