
Hi all,

We have a strange issue starting the self-hosted engine: the vdsm and ovirt-ha processes are reporting errors and cannot be started.

We are running oVirt 4.2 with Gluster mounts (not hyperconverged), Gluster 3.12.13, replicated volumes with an arbiter brick. The storage is mounted on all the oVirt hosts, and we can read and touch files on the mounts.

Unfortunately we cannot pin down when this started happening, so we don't know whether it began after a specific update or some other event.

vdsmd is throwing errors (at the bottom, along with version numbers). When we try to check the status of the self-hosted engine, we get:

# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

Thanks for any assistance.

systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-08-30 13:24:40 CDT; 2s ago
  Process: 17135 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 17135 (code=exited, status=157)

systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-08-30 13:10:35 CDT; 14min ago
 Main PID: 14863 (ovirt-ha-broker)
    Tasks: 4
   CGroup: /system.slice/ovirt-ha-broker.service
           └─14863 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 128, in host_id
    raise ex.HostIdNotLockedError("Host id is not set")
HostIdNotLockedError: Host id is not set
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Trying to restart the broker
Aug 30 13:19:42 rrb-vmhost1.rrc.local python[14863]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
Aug 30 13:19:48 rrb-vmhost1.rrc.local python[14863]: communication with ABRT daemon failed: timed out

glusterfs-api-3.12.13-1.el7.x86_64
glusterfs-server-3.12.13-1.el7.x86_64
glusterfs-fuse-3.12.13-1.el7.x86_64
vdsm-gluster-4.20.35-1.el7.x86_64
centos-release-gluster312-1.0-2.el7.centos.noarch
glusterfs-client-xlators-3.12.13-1.el7.x86_64
glusterfs-geo-replication-3.12.13-1.el7.x86_64
glusterfs-libs-3.12.13-1.el7.x86_64
glusterfs-cli-3.12.13-1.el7.x86_64
glusterfs-rdma-3.12.13-1.el7.x86_64
glusterfs-events-3.12.13-1.el7.x86_64
glusterfs-3.12.13-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.13-1.el7.x86_64
glusterfs-gnfs-3.12.13-1.el7.x86_64

ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-iso-uploader-4.2.0-1.el7.centos.noarch
ovirt-engine-setup-base-4.2.5.3-1.el7.noarch
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-guest-agent-common-1.0.14-1.el7.noarch
python-ovirt-engine-sdk4-4.2.7-2.el7.x86_64
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.5.3-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-engine-tools-backup-4.2.5.3-1.el7.noarch
ovirt-guest-agent-windows-1.0.14-1.el7.centos.noarch
cockpit-ovirt-dashboard-0.11.31-1.el7.noarch
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-engine-lib-4.2.5.3-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-guest-tools-iso-4.2-1.el7.centos.noarch
ovirt-imageio-common-1.4.2-0.el7.noarch
ovirt-hosted-engine-setup-2.2.25-1.el7.noarch
ovirt-imageio-daemon-1.4.2-0.el7.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.13-1.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64

vdsm-jsonrpc-4.20.35-1.el7.noarch
vdsm-hook-ethtool-options-4.20.35-1.el7.noarch
vdsm-gluster-4.20.35-1.el7.x86_64
vdsm-api-4.20.35-1.el7.noarch
vdsm-yajsonrpc-4.20.35-1.el7.noarch
vdsm-client-4.20.35-1.el7.noarch
vdsm-hook-fcoe-4.20.35-1.el7.noarch
vdsm-network-4.20.35-1.el7.x86_64
vdsm-hook-vmfex-dev-4.20.35-1.el7.noarch
vdsm-hook-vhostmd-4.20.35-1.el7.noarch
vdsm-http-4.20.35-1.el7.noarch
vdsm-hook-openstacknet-4.20.35-1.el7.noarch
vdsm-python-4.20.35-1.el7.noarch
vdsm-hook-vfio-mdev-4.20.35-1.el7.noarch
vdsm-common-4.20.35-1.el7.noarch
vdsm-4.20.35-1.el7.x86_64

Aug 30 13:02:42 rrb-vmhost1.rrc.local vdsm[6792]: WARN Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f713c0b0fd0> at 0x7f713c063050> timeout=15, duration=75 at 0x7f7170698c90> task#=354 at 0x7f713c10ec90>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 580, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats
  ret['haStats'] = _getHaInfo()
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
  stats = instance.get_all_stats()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
  stats = broker.get_stats_from_storage()
File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
  result = self._proxy.get_stats()
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
  return self.__send(self.__name, args)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
  verbose=self.__verbose
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
  return self.single_request(host, handler, request_body, verbose)
File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
  response = h.getresponse(buffering=True)
File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
  response.begin()
File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
  version, status, reason = self._read_status()
File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
  line = self.fp.readline(_MAXLINE + 1)
File: "/usr/lib64/python2.7/socket.py", line 476, in readline
  data = self._sock.recv(self._rbufsize)
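
In case it helps with triage: the vdsm warning above shows the periodic worker stuck in a socket read while waiting for the broker's get_stats reply, with duration=75 against timeout=15. Below is a minimal sketch for pulling those two numbers out of such a line when scanning a longer log. The helper name and the regular expression are my own assumptions (this is not part of vdsm or ovirt-hosted-engine-ha), and I am assuming both values are in seconds.

```python
import re

# Hypothetical helper, not part of vdsm: matches the timeout and duration
# fields of a "Worker blocked" warning like the one pasted above.
WORKER_BLOCKED = re.compile(r"Worker blocked:.*?timeout=(\d+), duration=(\d+)")


def parse_worker_blocked(line):
    """Return (timeout, duration) as ints, or None if the line does not match."""
    match = WORKER_BLOCKED.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The warning line from the log above, reproduced as a test input.
sample = (
    "Aug 30 13:02:42 rrb-vmhost1.rrc.local vdsm[6792]: WARN Worker blocked: "
    "<Worker name=periodic/1 running <Task <Operation action=<vdsm.virt."
    "sampling.HostMonitor object at 0x7f713c0b0fd0> at 0x7f713c063050> "
    "timeout=15, duration=75 at 0x7f7170698c90> task#=354 at 0x7f713c10ec90>"
)

timeout, duration = parse_worker_blocked(sample)
print("timeout=%d duration=%d overrun=%d" % (timeout, duration, duration - timeout))
```

Running the same function over every line of the vdsm log would show whether the blocked duration keeps growing between warnings, which would suggest the broker never answers rather than answering slowly.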