Hi!
We have a problem with multiple hosts stuck in the Connecting state, and I'm hoping somebody
here can help us wrap our heads around it.
All hosts except one show very similar symptoms, so I'll focus on one host that represents
the rest.
The host is stuck in the Connecting state, and this is what we see in the oVirt log files.
/var/log/ovirt-engine/engine.log:
2023-04-20 09:51:53,021+03 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesAsyncVDSCommand]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-37) [] Command
'GetCapabilitiesAsyncVDSCommand(HostName = ABC010-176-XYZ,
VdsIdAndVdsVDSCommandParametersBase:{hostId='2c458562-3d4d-4408-afc9-9a9484984a91',
vds='Host[ABC010-176-XYZ,2c458562-3d4d-4408-afc9-9a9484984a91]'})' execution
failed: org.ovirt.vdsm.jsonrpc.client.ClientConnectionException: SSL session is invalid
2023-04-20 09:55:16,556+03 ERROR
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-67) [] EVENT_ID:
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ABC010-176-XYZ command Get Host Capabilities
failed: Message timeout which can be caused by communication issues
/var/log/vdsm/vdsm.log:
2023-04-20 17:48:51,977+0300 INFO (vmrecovery) [vdsm.api] START
getConnectedStoragePoolsList() from=internal, task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef
(api:31)
2023-04-20 17:48:51,977+0300 INFO (vmrecovery) [vdsm.api] FINISH
getConnectedStoragePoolsList return={'poollist': []} from=internal,
task_id=ebce7c8c-6ded-454e-9aee-86edf72764ef (api:37)
2023-04-20 17:48:51,978+0300 INFO (vmrecovery) [vds] recovery: waiting for storage pool
to go up (clientIF:723)
Both engine.log and vdsm.log are flooded with these messages; they repeat at regular
intervals ad infinitum. This is one common symptom shared by multiple hosts in our
deployment: they all have these message loops in engine.log and vdsm.log.
Running vdsm-client Host getConnectedStoragePools also returns an empty list ([]) on all
hosts (interestingly, one host did show a Storage Pool UUID and yet it was still stuck in
the Connecting state).
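For reference, this is the exact check we ran on every host (the output below is just the
empty result we got; formatting may vary between vdsm-client versions):

  # vdsm-client Host getConnectedStoragePools
  []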
This particular host (ABC010-176-XYZ) is connected to 3 Ceph iSCSI Storage Domains, and
lsblk shows 3 block devices with matching UUIDs among their device components. So the
storage seems to be connected, but the Storage Pool is not? How is that even possible?
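In case it matters, this is roughly how we matched the block devices to the Storage
Domains (the UUID below is a placeholder; on our block Storage Domains the SD UUID shows
up as the LVM VG name sitting on top of the multipath device):

  # lsblk -o NAME,TYPE,SIZE
  # vgs -o vg_name,pv_name | grep <storage-domain-uuid>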
Now, what's even weirder is that we tried rebooting the host (via the Administrator
Portal) and it didn't help. We even tried removing and re-adding the host in the
Administrator Portal, but to no avail.
Additionally, the host refused to go into Maintenance mode, so we had to force it by
manually updating the Engine DB.
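For completeness, this is roughly the update we ran via psql against the engine
PostgreSQL database on the engine machine (the numeric status value is the one that
corresponded to Maintenance in our version's VDSStatus enum; please double-check it for
your version before touching the DB):

  UPDATE vds_dynamic
     SET status = 2   -- 2 = Maintenance in our version
   WHERE vds_id = '2c458562-3d4d-4408-afc9-9a9484984a91';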
We also tried reinstalling the host via the Administrator Portal and ran into another
weird problem. I'm not sure whether it's related or deserves a dedicated discussion
thread, but basically the underlying Ansible playbook exited with the following error
message:
"stdout" : "fatal: [10.10.10.176]: UNREACHABLE! =>
{\"changed\": false, \"msg\": \"Data could not be sent to remote
host \\\"10.10.10.176\\\". Make sure this host can be reached over ssh: \",
\"unreachable\": true}",
Counterintuitively, just before running Reinstall via the Administrator Portal we had been
able to reboot the same host (which, as you know, oVirt also does via Ansible). So, no
changes on the host in between, just different Ansible playbooks. To confirm that we
actually had ssh access to the host, we ran

  ssh -p $PORT root@10.10.10.176 -i /etc/pki/ovirt-engine/keys/engine_id_rsa

and it worked.
That made us scratch our heads for a while, but what seems to have fixed Ansible's ssh
access problem was a manual full stop of all VDSM-related systemd services on the host. It
was just a wild guess, but as soon as we stopped all the VDSM services, Ansible stopped
complaining about not being able to reach the target host and successfully did its job.
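Concretely, the services we stopped were roughly these (from memory, so the list may not
be exhaustive):

  systemctl stop vdsmd.service supervdsmd.service mom-vdsm.service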
I'm sure you'd like to see more logs, but I'm not certain what exactly is relevant. There
are a ton of logs, as this deployment comprises nearly 80 hosts. So I guess it's best if
you just request specific logs, messages or configuration details, and I'll cherry-pick
what's relevant.
We don't really understand what's going on and would appreciate any help. We've tried
just about everything we could think of to resolve this issue and are running out of ideas
for what to do next.
If you have any questions just ask and I'll do my best to answer them.