Re: 3node HCI fails when HostedEngineLocal is trying to add additional Gluster members

Most probably one of vdsm's or supervdsm's PreExec tasks is doing it (they have several, so you can run them manually until you find out which one).

Just try the following:

    systemctl stop vdsmd supervdsmd
    systemctl start supervdsmd
    (check for the certs)
    systemctl start vdsmd

Keep in mind that the chain of events (at least for me) is:

1. VG activation
2. VDO activation
3. Gluster brick is mounted (I use a systemd service due to the deps between vdo, gluster brick and glusterd)
4. Glusterd and libvirt are started
5. Sanlock is started
6. Supervdsm
7. Vdsm

If this is a host that will host the HostedEngine VM:

8. Ovirt-ha-broker
9. Ovirt-ha-agent

After cleanup, did you reboot?

Best Regards,
Strahil Nikolov

On Dec 4, 2019 17:14, thomas@hoberg.net wrote:
After spending another couple of hours trying to track down the problem, I have found that the "lost connection" seems to be due to KVM shutting down because it cannot find the certificates for the Spice and VNC connections in /etc/pki/vdsm/*, where 'ovirt-hosted-engine-cleanup' deleted them.
So now I wonder: who is supposed to (re-)generate them afterwards?
Assuming that it was a much earlier step, I proceeded to completely undo the deployment, get rid of the Gluster setup etc. and start from the very beginning, only to find that this didn't change a thing: the certificates were still missing...
...while something or someone *did* generate them when I tried a distinct and new set of nodes for counter-testing.
That setup failed with an Ansible error (reported separately), but I have now grown afraid of using 'ovirt-hosted-engine-cleanup' when I don't know how to get the ciphers/keys for /etc/pki/vdsm/{spice|vnc} regenerated...
Can anyone shed some light into this darkness?
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/Z7AFFFU6KMDPSB...
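
A quick way to see whether the Spice and VNC certificates mentioned in the quoted message are present, for example between the "systemctl start supervdsmd" and "systemctl start vdsmd" steps, might look like the sketch below. It only checks that each given sub-directory holds at least one *.pem file. The function name check_pki_dir is made up for illustration, and the sub-directory names in the usage line are assumptions, not taken from the thread.

```shell
#!/bin/sh
# Sketch: report whether each expected certificate sub-directory exists
# and holds at least one *.pem file.
# Usage: check_pki_dir BASE_DIR SUBDIR...
# (check_pki_dir is a hypothetical helper, not part of vdsm or oVirt.)
check_pki_dir() {
    base=$1
    shift
    bad=0
    for sub in "$@"; do
        dir="$base/$sub"
        count=0
        for f in "$dir"/*.pem; do
            # An unmatched glob stays literal, so test each candidate.
            if [ -f "$f" ]; then
                count=$((count + 1))
            fi
        done
        if [ "$count" -gt 0 ]; then
            echo "OK $dir ($count pem files)"
        else
            echo "MISSING $dir"
            bad=1
        fi
    done
    # Non-zero exit status if any sub-directory had no *.pem file.
    return "$bad"
}
```

Example call, assuming the certificates live in sub-directories named libvirt-spice and libvirt-vnc (adjust to whatever the host actually uses):

    check_pki_dir /etc/pki/vdsm libvirt-spice libvirt-vnc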

Hi Strahil,

first of all, thanks for following up on this...

I think I'll put that list of yours on the wall: it's a key piece of documentation that I found missing. Perhaps you could reconstruct it from systemd dependencies, but...

I may not have rebooted... It takes a long time on these older HP servers and it sometimes brings about additional challenges, e.g. the device-mapper keeps finding DM signatures on storage that I had definitely told it to delete... so if I forget over the reboot of three nodes (all of which have to be accessed through a myriad of ILO tunnels), VDO setup will fail again with a "filter on /dev/sdb..."

But I do restart the network and all ovirt related daemons... and check that they are in an expected state (e.g. ha-broker and ha-agent stopped after a "complete" cleanup).

I am currently blaming an Ansible change (and waiting for more statistical evidence from others): "The 'ovirt_host_facts' module has been renamed to 'ovirt_host_info', and the renamed one no longer returns ansible_facts"

But I also managed to make /etc/resolv.conf not world readable on some of the nodes, which seems to have unexpected effects... Didn't have time to clean that cluster yet... (four HCI clusters for testing currently, all failing one way or another)
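
The "check that they are in an expected state" step can be sketched as a small script that walks the daemons in the start order from Strahil's list (steps 4 to 9) and prints what systemd reports for each. Only vdsmd and supervdsmd appear as unit names in the thread; the other unit names below are assumptions, and unit_state and report_units are hypothetical helper names.

```shell
#!/bin/sh
# Units in the start order from Strahil's list (steps 4 to 9).
# Only vdsmd and supervdsmd are named in the thread; the rest are assumed.
UNITS="glusterd libvirtd sanlock supervdsmd vdsmd ovirt-ha-broker ovirt-ha-agent"

# Print the state of one unit as reported by systemd.
# Kept as a separate function so it can be overridden on a machine
# without systemd.
unit_state() {
    systemctl is-active "$1" 2>/dev/null
}

# Print one "unit state" line per unit, in start order.
report_units() {
    for unit in $UNITS; do
        # is-active exits non-zero for anything but "active"; ignore that
        # and keep whatever state string was printed.
        state=$(unit_state "$unit") || true
        printf '%-16s %s\n' "$unit" "${state:-unknown}"
    done
}
```

Running report_units after a cleanup should then show at a glance whether, say, ovirt-ha-broker and ovirt-ha-agent are really stopped.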