Of course, right after I sent the email below, I went back to one of my consoles, re-ran
"hosted-engine --vm-status", and saw that the engine was up. I can confirm my hosted
engine is now online and healthy.
So to recap: restarting vdsmd solved my problem.
I provided lots of details in the Bugzilla, and I generated an sosreport on two of my
three systems prior to restarting vdsmd.
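For anyone who finds this thread in the archives later, the recovery steps on each host were essentially the following (reconstructed from memory, so treat it as a sketch rather than an exact transcript; "host" stands in for each of my three hostnames):

[root@host ~]# sosreport                  # capture diagnostics first, while the problem is still visible
[root@host ~]# systemctl restart vdsmd    # restart vdsm to clear the leaked file descriptors
[root@host ~]# hosted-engine --vm-status  # confirm the HA agent and engine storage recover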
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 9:31 PM, David White <dmwhite823(a)protonmail.com>
wrote:
I have updated the Bugzilla with all of the details I included below,
as well as additional details.
I figured better to err on the side of providing too many details
than not enough.
For the oVirt list's edification, I will note that restarting
vdsmd on all 3 hosts did fix the problem -- to an extent. Unfortunately, my hosted-engine
is still not starting (although I can now clearly connect to the hosted-engine storage),
and I see this output every time I try to start the hosted-engine:
[root@cha2-storage ~]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
VM in WaitForLaunch
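If it helps with diagnosis: my understanding is that you can also ask vdsm directly which VMs it currently knows about (I'm going from memory on the exact syntax), which should show whether the VM ID above is known to vdsm at all:

[root@cha2-storage ~]# vdsm-client Host getVMList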
I'm not sure if that's because I screwed up when I was doing
gluster maintenance, or what.
But at this point, does this mean I have to re-deploy the hosted
engine?
To confirm: if I re-deploy the hosted engine, will all of my regular
VMs remain intact? I have over 20 VMs in this environment, and it would be a major
undertaking to have to rebuild all 20+ of them.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 2:41 PM, Nir Soffer <nsoffer(a)redhat.com> wrote:
> On Fri, Aug 13, 2021 at 9:13 PM David White via Users <users(a)ovirt.org> wrote:
> >
> > Hello,
> >
> > It appears that my Manager / hosted-engine isn't working, and I'm unable to get it to start.
> >
> > I have a 3-node HCI cluster, but right now, Gluster is only running on 1 host (so no replication).
> >
> > I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager.
> >
> > The storage is mounted, but here's what I see:
> >
> > [root@cha2-storage dwhite]# hosted-engine --vm-status
> > The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
> >
> > [root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
> > ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
> >    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
> >    Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
> >  Main PID: 3591872 (ovirt-ha-agent)
> >     Tasks: 1 (limit: 409676)
> >    Memory: 21.5M
> >    CGroup: /system.slice/ovirt-ha-agent.service
> >            └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
> >
> > Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
> >
> > Any time I try to do anything like connect the engine storage, disconnect the engine storage, or connect to the console, it just sits there and doesn't do anything, and I eventually have to Ctrl-C out of it.
> >
> > Maybe I have to be patient? When I Ctrl-C, I get a traceback error:
> >
> > [root@cha2-storage dwhite]# hosted-engine --console
> > ^CTraceback (most recent call last):
> >   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
> >     "__main__", mod_spec)
> >   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
> >     exec(code, run_globals)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
> >     args.command(args)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
> >     f(*args, **kwargs)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
> >     cli = ohautil.connect_vdsm_json_rpc()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
> >     __vdsm_json_rpc_connect(logger, timeout)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
> >     timeout=timeout)
> >   File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
> >     outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
> >     nr_retries, reconnect_interval)
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
> >     client = StompClient(utils.create_connected_socket(host, port, sslctx),
> >   File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
> >     sock.connect((host, port))
> >   File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
> >     self._real_connect(addr, False)
> >   File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
> >     self.do_handshake()
> >   File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
> >     self._sslobj.do_handshake()
> >   File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
> >     self._sslobj.do_handshake()
> >
> > This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
> >
> > MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
> > MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
> > MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
> >     self._storage_broker_instance = self._get_storage_broker()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
> >     return storage_broker.StorageBroker()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
> >     self._backend.connect()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
> >     sserver.connect_storage_server()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
> >     'Connection to storage server failed'
> > RuntimeError: Connection to storage server failed
> > MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
> > MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
> > MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> > MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> > MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> > MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
> >
> > And I see this in /var/log/vdsm/vdsm.log:
> >
> > 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> > Traceback (most recent call last):
> >   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
> >   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
> >   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
> >   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
> >   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> > OSError: [Errno 24] Too many open files
>
> This may be this bug:
>
> Since vdsm will never recover from this error without a reboot, you should start by restarting the vdsmd service on all hosts.
>
> After restarting vdsmd, connecting to the storage server may succeed.
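>
> As a quick sanity check (rough commands, assuming vdsm runs as the vdsmd
> systemd service), you can compare how many file descriptors the vdsm
> process has open against its limit:
>
>   PID=$(systemctl show -p MainPID --value vdsmd)
>   ls /proc/$PID/fd | wc -l              # descriptors currently open
>   grep 'open files' /proc/$PID/limits   # the limit being hit (EMFILE)
>
> If the count is at or near the limit, that matches the "Too many open files"
> error above, and it will not go down until vdsmd is restarted.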
>
> Please also report this bug, we need to understand if this is the same issue or another issue.
>
> Vdsm should recover from such critical errors by exiting, so leaks will cause service restarts (maybe every few days) instead of downtime of the entire system.
>
> Nir