On Fri, Aug 31, 2018 at 2:26 AM Johnson, Tim <tjj(a)uic.edu> wrote:
Hi all,
We have had a strange issue with starting the self-hosted engine: the vdsm
and ovirt-ha processes are reporting errors and cannot be started.
We are running oVirt 4.2 with Gluster storage (not hyperconverged):
Gluster 3.12.13 replicated volumes with an arbiter brick.
The storage is mounted on all of the oVirt hosts, and we can read and touch
files on the mounts.
Unfortunately we cannot pin down a specific timeframe when this started
happening, so we don't know whether it began after a particular update or
some other event.
vdsmd is throwing errors (shown at the bottom, along with package versions),
and when trying to check the status of the self-hosted engine we get:
# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared
storage. Please ensure that ovirt-ha-agent is running and the storage
server is reachable.
Thanks for any assistance.
systemctl status ovirt-ha-agent -l
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-08-30 13:24:40 CDT; 2s ago
  Process: 17135 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
 Main PID: 17135 (code=exited, status=157)
systemctl status ovirt-ha-broker -l
● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-08-30 13:10:35 CDT; 14min ago
 Main PID: 14863 (ovirt-ha-broker)
    Tasks: 4
   CGroup: /system.slice/ovirt-ha-broker.service
           └─14863 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker
Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.
Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 128, in host_id
    raise ex.HostIdNotLockedError("Host id is not set")
HostIdNotLockedError: Host id is not set
Looks like the hosted engine agent fails to inquire the sanlock lease. This is
most likely an issue with storage or SELinux.
Can you attach /var/log/sanlock.log and /var/log/messages?
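In the meantime, a quick check from the host could be to inquire that lockspace
directly with the python sanlock binding. This is only a minimal diagnostic
sketch, not part of the HA code; the lockspace name and the lease file path
below are assumptions and need to be adjusted to the actual hosted-engine
storage domain mount:

import sanlock

HOST_ID = 1                  # host_id=1, as in the broker log above
LOCKSPACE = "hosted-engine"  # assumed lockspace name used by the HA agent
# Assumed layout: <storage mount>/<sd_uuid>/ha_agent/hosted-engine.lockspace
LEASE_FILE = ("/rhev/data-center/mnt/<your-gluster-mount>/"
              "d0bfc335-ab6c-4378-9bcb-2f5f833431c2/ha_agent/"
              "hosted-engine.lockspace")

try:
    held = sanlock.inq_lockspace(LOCKSPACE, HOST_ID, LEASE_FILE)
    print("host %d holds the %r lockspace: %s" % (HOST_ID, LOCKSPACE, held))
except sanlock.SanlockException as err:
    # I/O errors or timeouts here point at storage; permission errors point
    # at SELinux or file ownership on the lease file.
    print("sanlock inquiry failed: %s" % err)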
Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Trying to restart the broker
Aug 30 13:19:42 rrb-vmhost1.rrc.local python[14863]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
Aug 30 13:19:48 rrb-vmhost1.rrc.local python[14863]: communication with ABRT daemon failed: timed out
glusterfs-api-3.12.13-1.el7.x86_64
glusterfs-server-3.12.13-1.el7.x86_64
glusterfs-fuse-3.12.13-1.el7.x86_64
vdsm-gluster-4.20.35-1.el7.x86_64
centos-release-gluster312-1.0-2.el7.centos.noarch
glusterfs-client-xlators-3.12.13-1.el7.x86_64
glusterfs-geo-replication-3.12.13-1.el7.x86_64
glusterfs-libs-3.12.13-1.el7.x86_64
glusterfs-cli-3.12.13-1.el7.x86_64
glusterfs-rdma-3.12.13-1.el7.x86_64
glusterfs-events-3.12.13-1.el7.x86_64
glusterfs-3.12.13-1.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64
python2-gluster-3.12.13-1.el7.x86_64
glusterfs-gnfs-3.12.13-1.el7.x86_64
ovirt-setup-lib-1.1.4-1.el7.centos.noarch
ovirt-iso-uploader-4.2.0-1.el7.centos.noarch
ovirt-engine-setup-base-4.2.5.3-1.el7.noarch
ovirt-host-dependencies-4.2.3-1.el7.x86_64
ovirt-guest-agent-common-1.0.14-1.el7.noarch
python-ovirt-engine-sdk4-4.2.7-2.el7.x86_64
ovirt-engine-setup-plugin-ovirt-engine-common-4.2.5.3-1.el7.noarch
ovirt-vmconsole-1.0.5-4.el7.centos.noarch
ovirt-engine-tools-backup-4.2.5.3-1.el7.noarch
ovirt-guest-agent-windows-1.0.14-1.el7.centos.noarch
cockpit-ovirt-dashboard-0.11.31-1.el7.noarch
ovirt-host-deploy-1.7.4-1.el7.noarch
ovirt-engine-lib-4.2.5.3-1.el7.noarch
ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch
ovirt-hosted-engine-ha-2.2.16-1.el7.noarch
ovirt-guest-tools-iso-4.2-1.el7.centos.noarch
ovirt-imageio-common-1.4.2-0.el7.noarch
ovirt-hosted-engine-setup-2.2.25-1.el7.noarch
ovirt-imageio-daemon-1.4.2-0.el7.noarch
ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch
ovirt-provider-ovn-driver-1.2.13-1.el7.noarch
ovirt-host-4.2.3-1.el7.x86_64
vdsm-jsonrpc-4.20.35-1.el7.noarch
vdsm-hook-ethtool-options-4.20.35-1.el7.noarch
vdsm-gluster-4.20.35-1.el7.x86_64
vdsm-api-4.20.35-1.el7.noarch
vdsm-yajsonrpc-4.20.35-1.el7.noarch
vdsm-client-4.20.35-1.el7.noarch
vdsm-hook-fcoe-4.20.35-1.el7.noarch
vdsm-network-4.20.35-1.el7.x86_64
vdsm-hook-vmfex-dev-4.20.35-1.el7.noarch
vdsm-hook-vhostmd-4.20.35-1.el7.noarch
vdsm-http-4.20.35-1.el7.noarch
vdsm-hook-openstacknet-4.20.35-1.el7.noarch
vdsm-python-4.20.35-1.el7.noarch
vdsm-hook-vfio-mdev-4.20.35-1.el7.noarch
vdsm-common-4.20.35-1.el7.noarch
vdsm-4.20.35-1.el7.x86_64
Aug 30 13:02:42 rrb-vmhost1.rrc.local vdsm[6792]: WARN Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f713c0b0fd0> at 0x7f713c063050> timeout=15, duration=75 at 0x7f7170698c90> task#=354 at 0x7f713c10ec90>, traceback:
  File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
    self.__bootstrap_inner()
  File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
    self.run()
  File: "/usr/lib64/python2.7/threading.py", line 765, in run
    self.__target(*self.__args, **self.__kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
    ret = func(*args, **kwargs)
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
    self._execute_task()
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
    task()
  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
    self._callable()
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
    self._func()
  File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 580, in __call__
    stats = hostapi.get_stats(self._cif, self._samples.stats())
  File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats
    ret['haStats'] = _getHaInfo()
  File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
    stats = instance.get_all_stats()
  File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats
    stats = broker.get_stats_from_storage()
  File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage
    result = self._proxy.get_stats()
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
    verbose=self.__verbose
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
    response = h.getresponse(buffering=True)
  File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse
    response.begin()
  File: "/usr/lib64/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File: "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
This shows that the hosted engine agent does not respond to vdsm.
The root cause is in hosted engine, but vdsm should not let workers block on
accessing the agent. It should instead time out after a short delay, or move
this monitoring to a separate thread, so that it does not block other tasks
that depend on this thread pool.
Please file a vdsm bug for this.
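For what it's worth, a minimal sketch of the timeout idea (not vdsm code; the
helper name and the 5 second value are made up for illustration): run the
broker query in a short-lived helper thread and report HA stats as unavailable
if it does not answer in time, instead of blocking the executor worker.

import threading

def get_ha_stats_with_timeout(get_all_stats, timeout=5.0):
    """Run get_all_stats() in a helper thread; return None if it times out."""
    result = {}

    def _worker():
        try:
            result["stats"] = get_all_stats()
        except Exception as err:  # surface broker errors to the caller
            result["error"] = err

    t = threading.Thread(target=_worker)
    t.daemon = True   # a stuck broker call must not keep the process alive
    t.start()
    t.join(timeout)
    if t.is_alive():
        # Broker did not answer in time; give up instead of blocking the
        # periodic executor thread pool for 75+ seconds as in the log above.
        return None
    if "error" in result:
        raise result["error"]
    return result.get("stats")

Something along the lines of get_ha_stats_with_timeout(instance.get_all_stats)
in _getHaInfo() would express the idea, though the real fix belongs in vdsm.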
Nir
_______________________________________________
Users mailing list -- users(a)ovirt.org
To unsubscribe send an email to users-leave(a)ovirt.org
Privacy Statement:
https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/P7ULPZG5OO3...