
On Fri, Aug 13, 2021 at 9:13 PM David White via Users <users@ovirt.org> wrote:
Hello,

It appears that my Manager / hosted-engine isn't working, and I'm unable to get it to start.
I have a 3-node HCI cluster, but right now, Gluster is only running on 1 host (so no replication). I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager.
The storage is mounted, but here's what I see:
[root@cha2-storage dwhite]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

[root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
 Main PID: 3591872 (ovirt-ha-agent)
    Tasks: 1 (limit: 409676)
   Memory: 21.5M
   CGroup: /system.slice/ovirt-ha-agent.service
           └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
Any time I try to do anything, like connect the engine storage, disconnect the engine storage, or connect to the console, it just sits there and doesn't do anything, and I eventually have to Ctrl-C out of it. Maybe I have to be patient? When I Ctrl-C, I get a traceback:
[root@cha2-storage dwhite]# hosted-engine --console
^CTraceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
[root@cha2-storage dwhite]#     args.command(args)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
    f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
    cli = ohautil.connect_vdsm_json_rpc()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
    timeout=timeout)
  File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
    outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
  File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
    nr_retries, reconnect_interval)
  File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
    client = StompClient(utils.create_connected_socket(host, port, sslctx),
  File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
    sock.connect((host, port))
  File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
    self._real_connect(addr, False)
  File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
    self.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
    self._sslobj.do_handshake()

This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
    self._backend.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
    sserver.connect_storage_server()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
    'Connection to storage server failed'
RuntimeError: Connection to storage server failed

MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors

And I see this in /var/log/vdsm/vdsm.log:
2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
Traceback (most recent call last):
  File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
  File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
  File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
  File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
  File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
  File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
  File "/usr/lib64/python3.6/socket.py", line 205, in accept
OSError: [Errno 24] Too many open files

This may be this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1926589

Since vdsm will never recover from this error without a restart, you should start by restarting the vdsmd service on all hosts. After restarting vdsmd, connecting to the storage server may succeed.

Please also report this bug; we need to understand whether this is the same issue or a different one.

Vdsm should recover from such critical errors by exiting, so that leaks cause service restarts (maybe every few days) instead of downtime of the entire system.

Nir
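
Before restarting vdsmd, it may be worth recording how close the process is to its open-files limit, since that number would be useful in a bug report about "Too many open files". The following is only a minimal sketch, assuming a Linux host with /proc mounted; the function names and the proc_root parameter are made up for illustration and are not part of vdsm or any oVirt tool.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of file descriptors currently open in process `pid`.

    Each open descriptor appears as one entry under <proc_root>/<pid>/fd.
    Reading another user's process normally requires root.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_nofile_limit(pid, proc_root="/proc"):
    """Return the soft "Max open files" limit of process `pid`.

    Parses <proc_root>/<pid>/limits, whose columns are:
    name, soft limit, hard limit, units.
    Returns None if the limit is "unlimited" or the row is missing.
    """
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        for line in f:
            if line.startswith("Max open files"):
                # "Max open files" is three words, so the soft limit
                # is the fourth whitespace-separated field.
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None
```

For example, with the vdsmd PID taken from "systemctl show -p MainPID vdsmd", a result from count_open_fds(pid) that is at or near soft_nofile_limit(pid) would be consistent with a file descriptor leak. Listing the targets in /proc/<pid>/fd (for instance with "ls -l") may then show which kind of descriptor is accumulating.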