On Wed, Jul 21, 2021 at 1:01 AM Valerio Luccio <valerio.luccio(a)nyu.edu> wrote:
Hi list,
I have a hosted engine running on a CentOS 8 host. The engine and all the VMs are stored
on a 4-node Gluster cluster. I had some issues with Gluster, and then the hosted engine stopped
working (even though the virtualization dashboard showed 4 virtual machines running). I
tried to "systemctl restart" the hosted engine, but it failed. I then rebooted
the server, and the hosted engine still will not come up. Note that the server has no issue
mounting the Gluster volume:
$ df
hydra1:/MRIData 390664407040 20530130012 370134277028 6% /rhev/data-center/mnt/glusterSD/hydra1:_MRIData
$ ls -l /rhev/data-center/mnt/glusterSD/hydra1\:_MRIData/6547dc22-b89e-4f14-8958-c9e8d27b29a4/
drwxr-xr-x. 2 vdsm kvm 4.0K Mar 29 12:24 dom_md
drwxr-xr-x. 2 vdsm kvm 4.0K Jul 20 17:47 ha_agent
drwxr-xr-x. 12 vdsm kvm 4.0K Apr 1 16:32 images
drwxr-xr-x. 4 vdsm kvm 4.0K Mar 29 12:24 master
Here "hydra1" is one of my Gluster nodes and MRIData is the volume name.
Here is the relevant snippet from /var/log/ovirt-hosted-engine-ha/agent.log:
MainThread::INFO::2021-07-20
17:29:07,584::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
ovirt-hosted-engine-ha agent 2.4.6 started
MainThread::INFO::2021-07-20
17:29:07,594::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname)
Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-07-20
17:29:07,635::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Initializing ha-broker connection
MainThread::INFO::2021-07-20
17:29:07,636::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor)
Starting monitor network, options {'addr': '192.168.39.65',
'network_test': 'dns', 'tcp_t_address': '',
'tcp_t_port': ''}
MainThread::ERROR::2021-07-20
17:29:07,636::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Failed to start necessary monitors
MainThread::ERROR::2021-07-20
17:29:07,637::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback
(most recent call last):
File
"/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 85, in start_monitor
response = self._proxy.start_monitor(type, options)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
return self.__send(self.__name, args)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
verbose=self.__verbose
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
return self.single_request(host, handler, request_body, verbose)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
http_conn = self.send_request(host, handler, request_body, verbose)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
self.send_content(connection, request_body)
File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
connection.endheaders(request_body)
File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/lib64/python3.6/http/client.py", line 974, in send
self.connect()
File
"/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line
74, in connect
self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory
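The agent could not find the broker's Unix socket, which suggests that the
broker (ovirt-ha-broker) is down. Can you please check it: its status, its
log (/var/log/ovirt-hosted-engine-ha/broker.log), and whether it restarts
cleanly?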
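To illustrate what the last line of the traceback above means: the agent decodes a hex-encoded path and connects to it as a Unix socket, and errno 2 is what you get when nothing exists at that path. Below is a minimal sketch of that behaviour, not the actual ovirt-hosted-engine-ha code. The function name try_connect and the "broker.socket" path are made up for the example; only the b16decode-then-connect step is taken from the traceback.

```python
import base64
import os
import socket
import tempfile


def try_connect(encoded_path):
    """Decode a hex-encoded socket path and try to connect to it."""
    path = base64.b16decode(encoded_path).decode()
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        return "connected"
    except FileNotFoundError as exc:
        # No socket file at all: the server side never created it.
        return "missing: {}".format(exc)
    except ConnectionRefusedError as exc:
        # Socket file exists, but no process is listening on it.
        return "refused: {}".format(exc)
    finally:
        sock.close()


# Hypothetical path inside a fresh temp directory, so nothing can exist there.
missing = os.path.join(tempfile.mkdtemp(), "broker.socket")
encoded = base64.b16encode(missing.encode())
print(try_connect(encoded))
```

The "missing" branch corresponds to the FileNotFoundError in your log. A "refused" result would instead point at a stale socket file left behind by a process that has since exited.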
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
131, in _run_agent
return action(he)
File
"/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line
55, in action_proper
return he.start_monitoring()
File
"/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 437, in start_monitoring
self._initialize_broker()
File
"/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 561, in _initialize_broker
m.get('options', {}))
File
"/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 91, in start_monitor
).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor
via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network',
options: {'addr': '192.168.39.65', 'network_test': 'dns',
'tcp_t_address': '', 'tcp_t_port': ''}]
MainThread::ERROR::2021-07-20
17:29:07,637::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to
restart agent
MainThread::INFO::2021-07-20
17:29:07,637::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting
down
I'm puzzled by that "Certificate common name not found", which I had not
seen before.
This looks like a harmless bug. I have now filed it, mainly for reference;
I'm not sure it's worth fixing:
https://bugzilla.redhat.com/show_bug.cgi?id=1984262
The FQDN of the hosted engine resolves fine on the server, as does
the FQDN of the server itself. The IP address it seems to use for the network monitor is
that of one of the university's gateways.
Any ideas? Any way to debug this further?
See above - check the broker.
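As a rough sketch of the "status" part of that check, something like the following could be run on the host. It only wraps `systemctl is-active`. The unit name ovirt-ha-broker appears in your traceback; ovirt-ha-agent is assumed here to be the agent's unit name on your install. The helper names unit_state and report are made up for the example.

```python
import subprocess

# ovirt-ha-broker is named in the traceback; ovirt-ha-agent is an assumption
# about the agent's unit name on this host.
UNITS = ("ovirt-ha-broker", "ovirt-ha-agent")


def unit_state(unit, run=subprocess.run):
    """Return the state string printed by `systemctl is-active <unit>`."""
    result = run(
        ["systemctl", "is-active", unit],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # Python 3.6 compatible spelling of text=True
        check=False,  # a non-zero exit just means "not active"
    )
    return result.stdout.strip()


def report(run=subprocess.run):
    """Return a {unit: state} mapping for both HA services."""
    return {unit: unit_state(unit, run=run) for unit in UNITS}
```

Calling `print(report())` on the host shows both states at once. If the broker is anything other than "active", the next places to look are `journalctl -u ovirt-ha-broker` and /var/log/ovirt-hosted-engine-ha/broker.log.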
Thanks and best regards,
--
Didi