I have updated the Bugzilla report with all of the details included below, plus some additional ones. I figured it was better to err on the side of providing too many details than not enough.
For the oVirt list's edification, I will note that restarting vdsmd on all 3 hosts did fix the problem -- to an extent. Unfortunately, my hosted engine is still not starting (although I can now connect to the hosted-engine storage), and I see this output every time I try to start it:
[root@cha2-storage ~]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
VM in WaitForLaunch
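
In case it helps with diagnosis, the next checks I can think of would be to ask vdsm directly whether it knows about that VM at all (just a sketch using the standard hosted-engine / vdsm-client tools; the vmID below is the one from the error above):

# Does vdsm on this host know about the engine VM at all?
vdsm-client Host getVMList

# If the ID shows up, query its state directly:
vdsm-client VM getStats vmID=ffd77d79-a699-455e-88e2-f55ee53166ef

# And re-check the HA agent's view of things:
hosted-engine --vm-status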
I'm not sure if that's because I screwed something up when I was doing the Gluster maintenance, or what.
But at this point, does this mean I have to re-deploy the hosted engine? To confirm: if I re-deploy the hosted engine, will all of my regular VMs remain intact? I have over 20 VMs in this environment, and it would be a big deal to have to rebuild all 20+ of them.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 2:41 PM, Nir Soffer <nsoffer(a)redhat.com> wrote:

On Fri, Aug 13, 2021 at 9:13 PM David White via Users <users(a)ovirt.org> wrote:
> Hello,
>
> It appears that my Manager / hosted-engine isn't working, and I'm unable to get it to start.
>
> I have a 3-node HCI cluster, but right now, Gluster is only running on 1 host (so no replication).
>
> I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager.
>
> The storage is mounted, but here's what I see:
>
> [root@cha2-storage dwhite]# hosted-engine --vm-status
> The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
>
> [root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
> ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
>    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
>    Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
>  Main PID: 3591872 (ovirt-ha-agent)
>     Tasks: 1 (limit: 409676)
>    Memory: 21.5M
>    CGroup: /system.slice/ovirt-ha-agent.service
>            └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
>
> Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
>
> Any time I try to do anything like connect the engine storage, disconnect the engine storage, or connect to the console, it just sits there doing nothing, and I eventually have to Ctrl-C out of it.
>
> Maybe I just have to be patient? When I Ctrl-C, I get a traceback:
>
> [root@cha2-storage dwhite]# hosted-engine --console
> ^CTraceback (most recent call last):
>   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
>     "__main__", mod_spec)
>   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
>     args.command(args)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
>     f(*args, **kwargs)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
>     cli = ohautil.connect_vdsm_json_rpc()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
>     __vdsm_json_rpc_connect(logger, timeout)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
>     timeout=timeout)
>   File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
>     outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
>     nr_retries, reconnect_interval)
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
>     client = StompClient(utils.create_connected_socket(host, port, sslctx),
>   File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
>     sock.connect((host, port))
>   File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
>     self._real_connect(addr, False)
>   File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
>     self.do_handshake()
>   File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
>     self._sslobj.do_handshake()
>   File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
>     self._sslobj.do_handshake()
>
> This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
>
> MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
> MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
> MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
>     self._storage_broker_instance = self._get_storage_broker()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
>     return storage_broker.StorageBroker()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
>     self._backend.connect()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
>     sserver.connect_storage_server()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
>     'Connection to storage server failed'
> RuntimeError: Connection to storage server failed
> MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
> MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
> MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
>
> And I see this in /var/log/vdsm/vdsm.log:
>
> 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> Traceback (most recent call last):
>   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
>   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
>   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
>   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
>   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> OSError: [Errno 24] Too many open files
This may be this bug:

Since vdsm will never recover from this error without a restart, you should start by restarting the vdsmd service on all hosts.
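
Something like this, on each host in turn (a sketch; it assumes a standard hosted-engine host where vdsm runs as the vdsmd systemd unit):

# Restart vdsm on this host:
systemctl restart vdsmd

# Then check whether the HA agent can read the hosted-engine metadata again:
hosted-engine --vm-status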
After restarting vdsmd, connecting to the storage server may succeed.

Please also report this bug; we need to understand whether this is the same issue or a different one.
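
If you want to confirm it is the same file-descriptor leak before filing, comparing vdsm's open descriptors against its limit is a quick check (a sketch using the standard /proc layout; the MainPID lookup is just one way to find the vdsm process):

# Find vdsm's main PID, count its open fds, and show its fd limit:
pid=$(systemctl show -p MainPID --value vdsmd)
ls /proc/$pid/fd | wc -l
grep 'open files' /proc/$pid/limits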
Vdsm should recover from such critical errors by exiting, so that leaks cause service restarts (maybe every few days) instead of downtime of the entire system.

Nir