On Fri, Aug 31, 2018 at 2:26 AM Johnson, Tim <tjj@uic.edu> wrote:

  Hi all,

 

     We have had a bit of a strange issue with the starting of the self-hosted engine due to the vdsm and ovirt-ha processes having errors and  not able to be started.

Running Ovirt version 4.2 with Gluster mounts (not hyper converged) gluster 3.12.13  replicated volumes with arbiter brick.

 

  The mounts for the storage are mounted on all the Ovirt hosts and able to read/touch files on the mounts.

 

Unfortunately we can not pin a specific timeframe when this started happening. Meaning we don’t know if it started after a specific update or any other event.

      The vdsmd is throwing errors (bottom – with version numbers) when trying to check the status of the  self -hosted engine getting :

    

# hosted-engine --vm-status

The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

 

 

 

Thanks for any assistance.

 

 

 

 

 

systemctl status ovirt-ha-agent -l

● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent

   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)

   Active: activating (auto-restart) (Result: exit-code) since Thu 2018-08-30 13:24:40 CDT; 2s ago

  Process: 17135 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)

Main PID: 17135 (code=exited, status=157)

 

 

 

 

 

systemctl status ovirt-ha-broker -l

● ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker

   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled; vendor preset: disabled)

   Active: active (running) since Thu 2018-08-30 13:10:35 CDT; 14min ago

Main PID: 14863 (ovirt-ha-broker)

    Tasks: 4

   CGroup: /system.slice/ovirt-ha-broker.service

           └─14863 /usr/bin/python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker

 

Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Started oVirt Hosted Engine High Availability Communications Broker.

Aug 30 13:10:35 rrb-vmhost1.rrc.local systemd[1]: Starting oVirt Hosted Engine High Availability Communications Broker...

Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition

Aug 30 13:14:44 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition

Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker ERROR Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition

Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.listener.Action.start_domain_monitor ERROR Error in RPC call: Failed to start monitoring domain (sd_uuid=d0bfc335-ab6c-4378-9bcb-2f5f833431c2, host_id=1): timeout during domain acquisition

Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Failed to update state.

                                                              Traceback (most recent call last):

                                                                File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run

                                                                  if (self._status_broker._inquire_whiteboard_lock() or

                                                                File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 190, in _inquire_whiteboard_lock

                                                                  self.host_id, self._lease_file)

                                                                File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 128, in host_id

                                                                  raise ex.HostIdNotLockedError("Host id is not set")

                                                              HostIdNotLockedError: Host id is not set


Looks like hosted engine agent fail to inquire sanlock lease. This is mostly likely issue 
with storage or selinux.

Can you attach /var/log/sanlock.log and /var/log/messages?
 

Aug 30 13:19:42 rrb-vmhost1.rrc.local ovirt-ha-broker[14863]: ovirt-ha-broker ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update ERROR Trying to restart the broker

Aug 30 13:19:42 rrb-vmhost1.rrc.local python[14863]: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'

Aug 30 13:19:48 rrb-vmhost1.rrc.local python[14863]: communication with ABRT daemon failed: timed out

 

 

 

 

 

glusterfs-api-3.12.13-1.el7.x86_64

glusterfs-server-3.12.13-1.el7.x86_64

glusterfs-fuse-3.12.13-1.el7.x86_64

vdsm-gluster-4.20.35-1.el7.x86_64

centos-release-gluster312-1.0-2.el7.centos.noarch

glusterfs-client-xlators-3.12.13-1.el7.x86_64

glusterfs-geo-replication-3.12.13-1.el7.x86_64

glusterfs-libs-3.12.13-1.el7.x86_64

glusterfs-cli-3.12.13-1.el7.x86_64

glusterfs-rdma-3.12.13-1.el7.x86_64

glusterfs-events-3.12.13-1.el7.x86_64

glusterfs-3.12.13-1.el7.x86_64

libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.7.x86_64

python2-gluster-3.12.13-1.el7.x86_64

glusterfs-gnfs-3.12.13-1.el7.x86_64

 

 

ovirt-setup-lib-1.1.4-1.el7.centos.noarch

ovirt-iso-uploader-4.2.0-1.el7.centos.noarch

ovirt-engine-setup-base-4.2.5.3-1.el7.noarch

ovirt-host-dependencies-4.2.3-1.el7.x86_64

ovirt-guest-agent-common-1.0.14-1.el7.noarch

python-ovirt-engine-sdk4-4.2.7-2.el7.x86_64

ovirt-engine-setup-plugin-ovirt-engine-common-4.2.5.3-1.el7.noarch

ovirt-vmconsole-1.0.5-4.el7.centos.noarch

ovirt-engine-tools-backup-4.2.5.3-1.el7.noarch

ovirt-guest-agent-windows-1.0.14-1.el7.centos.noarch

cockpit-ovirt-dashboard-0.11.31-1.el7.noarch

ovirt-host-deploy-1.7.4-1.el7.noarch

ovirt-engine-lib-4.2.5.3-1.el7.noarch

ovirt-engine-sdk-python-3.6.9.1-1.el7.noarch

ovirt-hosted-engine-ha-2.2.16-1.el7.noarch

ovirt-guest-tools-iso-4.2-1.el7.centos.noarch

ovirt-imageio-common-1.4.2-0.el7.noarch

ovirt-hosted-engine-setup-2.2.25-1.el7.noarch

ovirt-imageio-daemon-1.4.2-0.el7.noarch

ovirt-vmconsole-host-1.0.5-4.el7.centos.noarch

ovirt-provider-ovn-driver-1.2.13-1.el7.noarch

ovirt-host-4.2.3-1.el7.x86_64

 

 

vdsm-jsonrpc-4.20.35-1.el7.noarch

vdsm-hook-ethtool-options-4.20.35-1.el7.noarch

vdsm-gluster-4.20.35-1.el7.x86_64

vdsm-api-4.20.35-1.el7.noarch

vdsm-yajsonrpc-4.20.35-1.el7.noarch

vdsm-client-4.20.35-1.el7.noarch

vdsm-hook-fcoe-4.20.35-1.el7.noarch

vdsm-network-4.20.35-1.el7.x86_64

vdsm-hook-vmfex-dev-4.20.35-1.el7.noarch

vdsm-hook-vhostmd-4.20.35-1.el7.noarch

vdsm-http-4.20.35-1.el7.noarch

vdsm-hook-openstacknet-4.20.35-1.el7.noarch

vdsm-python-4.20.35-1.el7.noarch

vdsm-hook-vfio-mdev-4.20.35-1.el7.noarch

vdsm-common-4.20.35-1.el7.noarch

vdsm-4.20.35-1.el7.x86_64

 

 

 

 

 

Aug 30 13:02:42 rrb-vmhost1.rrc.local vdsm[6792]: WARN Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f713c0b0fd0> at 0x7f713c063050> timeout=15, duration=75 at 0x7f7170698c90> task#=354 at 0x7f713c10ec90>, traceback:

                                                  File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap

                                                    self.__bootstrap_inner()

                                                  File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner

                                                    self.run()

                                                  File: "/usr/lib64/python2.7/threading.py", line 765, in run

                                                    self.__target(*self.__args, **self.__kwargs)

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run

                                                    ret = func(*args, **kwargs)

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run

                                                    self._execute_task()

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task

                                                    task()

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__

                                                    self._callable()

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__

                                                   self._func()

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 580, in __call__

                                                    stats = hostapi.get_stats(self._cif, self._samples.stats())

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 77, in get_stats

                                                    ret['haStats'] = _getHaInfo()

                                                  File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo

                                                    stats = instance.get_all_stats()

                                                  File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 94, in get_all_stats

                                                    stats = broker.get_stats_from_storage()

                                                  File: "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 135, in get_stats_from_storage

                                                    result = self._proxy.get_stats()

                                                  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__

                                                    return self.__send(self.__name, args)

                                                  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request

                                                    verbose=self.__verbose

                                                  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request

                                                    return self.single_request(host, handler, request_body, verbose)

                                                  File: "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request

                                                    response = h.getresponse(buffering=True)

                                                  File: "/usr/lib64/python2.7/httplib.py", line 1113, in getresponse

                                                    response.begin()

                                                  File: "/usr/lib64/python2.7/httplib.py", line 444, in begin

                                                    version, status, reason = self._read_status()

                                                  File: "/usr/lib64/python2.7/httplib.py", line 400, in _read_status

                                                    line = self.fp.readline(_MAXLINE + 1)

                                                  File: "/usr/lib64/python2.7/socket.py", line 476, in readline

                                                    data = self._sock.recv(self._rbufsize)


This show that hosted engine agent does not respond to vdsm.

The root cause is hosted engine, but vdsm should not let workers block
on accessing the agent. It should time out instead after short delay, or
move this monitoring to separate thread, so it does not block other tasks
that depend on this thread pool.

Please file vdsm bug for this.

Nir
 

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/P7ULPZG5OO37ECKCSEQB76GGCTWA4G7Q/