Cannot start hosted-engine

Hi list,

I have a hosted engine running on a CentOS 8 server. The engine and all the VMs are stored on a 4-node gluster. I had some issues with the gluster and then the hosted-engine stopped working (even though the virtualization dashboard showed 4 virtual machines running). I tried to "systemctl restart" the hosted-engine, but it failed. I then rebooted the server, and the hosted-engine still will not come up. Note that the server has no issue mounting the gluster:

$ df
hydra1:/MRIData 390664407040 20530130012 370134277028 6% /rhev/data-center/mnt/glusterSD/hydra1:_MRIData
$ ls -l /rhev/data-center/mnt/glusterSD/hydra1\:_MRIData/6547dc22-b89e-4f14-8958-c9e8d27b29a4/
drwxr-xr-x. 2 vdsm kvm 4.0K Mar 29 12:24 dom_md
drwxr-xr-x. 2 vdsm kvm 4.0K Jul 20 17:47 ha_agent
drwxr-xr-x. 12 vdsm kvm 4.0K Apr 1 16:32 images
drwxr-xr-x. 4 vdsm kvm 4.0K Mar 29 12:24 master

Where "hydra1" is one of my gluster nodes and MRIData is the volume name.

Here is the relevant snippet from /var/log/ovirt-hosted-engine-ha/agent.log:

MainThread::INFO::2021-07-20 17:29:07,584::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.6 started
MainThread::INFO::2021-07-20 17:29:07,594::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
MainThread::INFO::2021-07-20 17:29:07,635::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
MainThread::INFO::2021-07-20 17:29:07,636::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '192.168.39.65', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
MainThread::ERROR::2021-07-20 17:29:07,636::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
MainThread::ERROR::2021-07-20 17:29:07,637::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
    response = self._proxy.start_monitor(type, options)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib64/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
    self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
    ).format(t=type, o=options, e=e)
ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '192.168.39.65', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
MainThread::ERROR::2021-07-20 17:29:07,637::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
MainThread::INFO::2021-07-20 17:29:07,637::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down

I'm puzzled by that "Certificate common name not found", which I had not seen before. The FQDN of the hosted engine resolves fine on the server, and so does the FQDN of the server itself. The IP address it seems to try to use for the network is that of one of the university's gateways.

Any ideas? Any way to debug this further?

Thanks in advance,
--
Valerio Luccio (212) 998-8736
Center for Brain Imaging
4 Washington Place, Room 157
New York University
New York, NY 10003
"In an open world, who needs windows or gates ?"

Seems to be a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1727581
Can you update the host?

Best Regards,
Strahil Nikolov

On Wed, Jul 21, 2021 at 1:01, Valerio Luccio <valerio.luccio@nyu.edu> wrote:
[...]

On Wed, Jul 21, 2021 at 1:01 AM Valerio Luccio <valerio.luccio@nyu.edu> wrote:
[...]
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
self.sock.connect(base64.b16decode(self.host))
FileNotFoundError: [Errno 2] No such file or directory
This seems to indicate that the broker is down. Can you check it, please - log, restart, status, etc.?
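For readers following along, here is a minimal sketch of why that FileNotFoundError points at the broker. It assumes only what the traceback shows: the agent talks to the broker over a Unix domain socket, and connecting to a socket path that does not exist fails with errno 2 (ENOENT). The path name `broker.socket` and the helper names are hypothetical, chosen for illustration; this is not the oVirt code itself.

```python
import errno
import os
import socket
import tempfile


def connect_errno(socket_path):
    """Try to connect to a Unix socket; return None on success, else the errno."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(socket_path)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        client.close()


def demo():
    """Compare connecting to a missing socket file with connecting to a live listener."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical socket path, standing in for the broker's real one.
        path = os.path.join(tmp, "broker.socket")

        # Case 1: no process has created the socket file (the "broker is down" case).
        results["missing"] = connect_errno(path)

        # Case 2: a listener has bound the path (the "broker is up" case).
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            server.bind(path)
            server.listen(1)
            results["listening"] = connect_errno(path)
        finally:
            server.close()
    return results


if __name__ == "__main__":
    for case, err in demo().items():
        print(case, "ok" if err is None else errno.errorcode[err])
```

The missing-file case fails with ENOENT, which Python raises as FileNotFoundError, the same exception as in the quoted traceback. That is consistent with the broker process not running, or having exited, at the moment the agent tried to connect.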
[...]
I'm puzzled by that "Certificate common name not found", which I had not seen before.
This seems like a harmless bug. I have now filed it, mainly for reference; I'm not sure it's worth fixing: https://bugzilla.redhat.com/show_bug.cgi?id=1984262
The fqdn of the hosted engine resolves fine on the server, so does the fqdn of the server itself. The ip address it seems to try to use for the network is that of one of the university's gateways.
Any ideas ? Any way to debug this further ?
See above - check the broker.

Thanks and best regards,
--
Didi
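On the "any way to debug this further" question, one low-tech aid is to pull only the ERROR entries out of the log so they are not buried in traceback frames. The sketch below assumes the line layout seen in the agent.log snippet quoted in this thread (thread, level, timestamp, module, line number, logger, function, message, separated by `::`); the regular expression and function names are illustrative, and other log files may use a different layout.

```python
import re

# Layout inferred from the agent.log lines quoted in this thread (an assumption):
# Thread::LEVEL::timestamp::module::lineno::logger::(function) message
LOG_LINE = re.compile(
    r"^(?P<thread>\w+)::(?P<level>[A-Z]+)::"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>\w+)::(?P<lineno>\d+)::(?P<logger>[\w.]+)::"
    r"\((?P<func>\w+)\) (?P<message>.*)$"
)


def parse_entries(lines):
    """Group log lines into entries; non-matching lines attach to the previous entry."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_LINE.match(line)
        if match:
            entry = match.groupdict()
            entry["extra"] = []  # continuation lines, e.g. traceback frames
            entries.append(entry)
        elif entries and line.strip():
            entries[-1]["extra"].append(line)
    return entries


def errors_only(entries):
    """Keep only the entries logged at ERROR level."""
    return [e for e in entries if e["level"] == "ERROR"]


# Sample lines copied from the snippet quoted in this thread (traceback shortened).
SAMPLE = [
    "MainThread::INFO::2021-07-20 17:29:07,584::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.6 started",
    "MainThread::ERROR::2021-07-20 17:29:07,636::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors",
    "MainThread::ERROR::2021-07-20 17:29:07,637::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):",
    "FileNotFoundError: [Errno 2] No such file or directory",
]

if __name__ == "__main__":
    for entry in errors_only(parse_entries(SAMPLE)):
        print(entry["time"], entry["func"], entry["message"])
        for extra in entry["extra"]:
            print("    " + extra)
```

To run it against a real file, pass an open file object instead of `SAMPLE`, for example `parse_entries(open("/var/log/ovirt-hosted-engine-ha/agent.log"))`. Whether the same pattern fits the broker's own log is untested here.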

On 7/21/21 1:46 AM, Yedidyah Bar David wrote:
On Wed, Jul 21, 2021 at 1:01 AM Valerio Luccio <valerio.luccio@nyu.edu> wrote:
[...]
This seems to indicate that the broker is down. Can you check it, please - log, restart, status, etc.?
[...]
See above - check the broker.
Thanks and best regards,
Indeed Didi, the broker starts, but then fails. It would seem that some of my images have disappeared, which probably has to do with my gluster issues. I'll try to fix that first and see if I can recover the images.

Thanks,
--
Valerio Luccio (212) 998-8736
Center for Brain Imaging
4 Washington Place, Room 158
New York University
New York, NY 10003
"In an open world, who needs windows or gates ?"

participants (3)
- Strahil Nikolov
- Valerio Luccio
- Yedidyah Bar David