
Of course, right after I sent this email, I went back over to one of my consoles, re-ran "hosted-engine --vm-status", and saw that the engine was up. I can confirm my hosted engine is now online and healthy.

So, to recap: restarting vdsmd solved my problem. I added plenty of detail to the Bugzilla, and I generated an sosreport on two of my three systems prior to restarting vdsmd.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 9:31 PM, David White <dmwhite823@protonmail.com> wrote:
I have updated the Bugzilla with all of the details I included below, as well as additional details.
I figured better to err on the side of providing too many details than not enough.
For the oVirt list's edification, I will note that restarting vdsmd on all 3 hosts did fix the problem -- to an extent. Unfortunately, my hosted-engine is still not starting (although I can now clearly connect to the hosted-engine storage), and I see this output every time I try to start the hosted-engine:
[root@cha2-storage ~]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
VM in WaitForLaunch
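(For anyone else hitting this, here is roughly how I have been checking whether vdsm knows about the HostedEngine VM at all. This is only a sketch; the UUID is my engine VM's ID from the error above, and the exact vdsm-client syntax may vary slightly by version.)

vdsm-client Host getVMList      # the engine VM's UUID should appear here if vdsm knows about it
vdsm-client VM getStats vmID=ffd77d79-a699-455e-88e2-f55ee53166ef
hosted-engine --vm-status       # the agent's view of the same state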
I'm not sure if that's because I screwed up when I was doing gluster maintenance, or what.
But at this point, does this mean I have to re-deploy the hosted engine?
To confirm: if I re-deploy the hosted engine, will all of my regular VMs remain intact? I have over 20 VMs in this environment, and it would be a major ordeal to have to rebuild all 20+ of them.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 2:41 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Fri, Aug 13, 2021 at 9:13 PM David White via Users <users@ovirt.org> wrote:
Hello,
It appears that my Manager / hosted-engine isn't working, and I'm unable to get it to start.
I have a 3-node HCI cluster, but right now, Gluster is only running on 1 host (so no replication).
I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager.
The storage is mounted, but here's what I see:
[root@cha2-storage dwhite]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
[root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
Main PID: 3591872 (ovirt-ha-agent)
Tasks: 1 (limit: 409676)
Memory: 21.5M
CGroup: /system.slice/ovirt-ha-agent.service
└─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Any time I try to do anything (connect the engine storage, disconnect the engine storage, or connect to the console), it just sits there doing nothing, and I eventually have to Ctrl-C out of it.
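(Concretely, the hosted-engine subcommands I mean are these; all three just hang for me:)

hosted-engine --connect-storage
hosted-engine --disconnect-storage
hosted-engine --console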
Maybe I just need to be more patient? When I Ctrl-C, I get a traceback:
[root@cha2-storage dwhite]# hosted-engine --console
^CTraceback (most recent call last):
File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
args.command(args)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
cli = ohautil.connect_vdsm_json_rpc()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
__vdsm_json_rpc_connect(logger, timeout)
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
timeout=timeout)
File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
nr_retries, reconnect_interval)
File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
client = StompClient(utils.create_connected_socket(host, port, sslctx),
File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
sock.connect((host, port))
File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
self._real_connect(addr, False)
File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
self.do_handshake()
File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
self._sslobj.do_handshake()
File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
self._sslobj.do_handshake()
This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
self._storage_broker_instance = self._get_storage_broker()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
return storage_broker.StorageBroker()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in init
self._backend.connect()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
sserver.connect_storage_server()
File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
'Connection to storage server failed'
RuntimeError: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
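(For what it's worth, a rough way to sanity-check whether the hosted-engine gluster volume is actually mounted on the host is below. This is only a sketch; /rhev/data-center/mnt/glusterSD/ is the usual vdsm mount location, and the exact directory name under it will differ per setup.)

grep glusterfs /proc/mounts                # is the gluster volume mounted at all?
ls /rhev/data-center/mnt/glusterSD/        # the hosted-engine storage domain should appear under here once connected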
And I see this in /var/log/vdsm/vdsm.log:
2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
Traceback (most recent call last):
File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
File "/usr/lib64/python3.6/socket.py", line 205, in accept
OSError: [Errno 24] Too many open files
This may be this bug:
Since vdsm will never recover from this error without a restart, you should
start by restarting the vdsmd service on all hosts.
After restarting vdsmd, connecting to the storage server may succeed.
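Something like this on each host is what I have in mind (a rough sketch, not a guaranteed fix):

systemctl restart vdsmd
hosted-engine --connect-storage     # should stop hanging once vdsm is healthy again
hosted-engine --vm-status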
Please also report this bug; we need to understand whether this is the same
issue or a different one.
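If possible, please also attach an sosreport from the affected hosts, ideally collected before restarting vdsmd so the leaked state is captured (rough sketch; on newer sos versions the command is "sos report"):

sosreport          # or: sos report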
Vdsm should recover from critical errors like this by exiting, so that leaks
cause a service restart (maybe every few days) instead of downtime
for the entire system.
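In the meantime, a rough way to see how close a host is to the limit (a sketch; it assumes a systemd-managed vdsmd and a systemd new enough to support --value) is to count the descriptors the main vdsm process is holding:

pid=$(systemctl show -p MainPID --value vdsmd)
ls /proc/"$pid"/fd | wc -l                  # current number of open file descriptors
grep 'open files' /proc/"$pid"/limits       # the soft and hard limits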
Nir