
I'm running a three server HCI. Up and running on 4.3.7 with no problems. Today I updated to 4.3.8. Engine upgraded fine, rebooted. First host updated fine, rebooted and let all gluster volumes heal. Put second host in maintenance, upgraded successfully, rebooted. Waited for gluster volumes to heal for over an hour but the heal process was not completing. I tried restarting gluster services as well as the host with no success. I'm in a state right now where there are pending heals on almost all of my volumes. Nothing is reporting split-brain, but the heals are not completing. All vms are still currently running except hosted engine. Hosted engine was running but on the 2nd host I upgraded I was seeing errors such as: Dec 12 16:34:39 orchard2 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID': u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID': u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201, message=Volume does not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',)) I shut down the engine VM and attempted a manual heal on the engine volume. I cannot start the engine on any host now. I get: The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable. I'm seeing ovirt-ha-agent crashing on all three nodes: Dec 12 18:30:48 orchard0 python: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker' Dec 12 18:30:48 orchard0 abrt-server: Duplicate: core backtrace Dec 12 18:30:48 orchard0 abrt-server: DUP_OF_DIR: /var/tmp/abrt/Python-2019-03-14-21:02:52-44318 Dec 12 18:30:48 orchard0 abrt-server: Deleting problem directory Python-2019-12-12-18:30:48-23193 (dup of Python-2019-03-14-21:02:52-44318) Dec 12 18:30:49 orchard0 vdsm[6087]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished? Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE Dec 12 18:30:49 orchard0 systemd: Unit ovirt-ha-broker.service entered failed state. Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service failed. Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service holdoff time over, scheduling restart. Dec 12 18:30:49 orchard0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked. Dec 12 18:30:49 orchard0 systemd: Stopped oVirt Hosted Engine High Availability Communications Broker. Here is what gluster volume heal info on engine looks like, it's similar on other volumes as well (although more heals pending on some of those): gluster volume heal engine info Brick gluster0:/gluster_bricks/engine/engine /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta /d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids Status: Connected Number of entries: 4 Brick gluster1:/gluster_bricks/engine/engine /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta /d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids Status: Connected Number of entries: 4 Brick gluster2:/gluster_bricks/engine/engine Status: Connected Number of entries: 0 I don't see much in vdsm.log and gluster logs look fairly normal to me, I'm not seeing any obvious errors in the gluster logs. As far as I can tell the underlying storage is fine. Why are my gluster volumes not healing and why is self-hosted engine failing to start? More agent and broker logs: ==> agent.log <== MainThread::ERROR::2019-12-12 18:36:09,056::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors MainThread::ERROR::2019-12-12 18:36:09,058::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent return action(he) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper return he.start_monitoring() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring self._initialize_broker() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker m.get('options', {})) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor ).format(t=type, o=options, e=e) RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}] MainThread::ERROR::2019-12-12 18:36:09,058::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent MainThread::ERROR::2019-12-12 18:36:19,619::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors MainThread::ERROR::2019-12-12 18:36:19,619::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent return action(he) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper return he.start_monitoring() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring self._initialize_broker() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker m.get('options', {})) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor ).format(t=type, o=options, e=e) RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}] MainThread::ERROR::2019-12-12 18:36:19,619::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent MainThread::ERROR::2019-12-12 18:36:30,568::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors MainThread::ERROR::2019-12-12 18:36:30,570::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent return action(he) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper return he.start_monitoring() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring self._initialize_broker() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker m.get('options', {})) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor ).format(t=type, o=options, e=e) RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}] MainThread::ERROR::2019-12-12 18:36:30,570::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent MainThread::ERROR::2019-12-12 18:36:41,581::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors MainThread::ERROR::2019-12-12 18:36:41,583::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent return action(he) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper return he.start_monitoring() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring self._initialize_broker() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker m.get('options', {})) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor ).format(t=type, o=options, e=e) RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}] MainThread::ERROR::2019-12-12 18:36:41,583::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent