3-node HCI fails when HostedEngineLocal is trying to add additional Gluster members

Three Gen8 HP360 recalled from retirement, each with a single 1TB TLC SATA SSD for boot and the oVirt /engine volume, a 7x4TB HDD RAID6 for /vmstore and /data, and 10Gbit NICs and network. All CentOS 7.7, updated daily.

These machines may not be used exclusively for oVirt, so I don't want to re-install the OS when an oVirt setup fails; instead I try my best to clean up the nodes before doing another oVirt installation run. They ran oVirt for a week or two using a completely distinct set of storage, so they are fundamentally sound, but we wanted higher storage capacity, so I swapped everything and re-installed CentOS very much the same way as before. The first oVirt setup went smoothly, but the cluster crumbled without much usage. I won't go into details here, because I didn't want to investigate that for now; instead I focused on cleaning up the old setup and redoing the installation. I know the docs actually recommend starting with wiped hardware, but operationally that would be a show-stopper for the intended use case. So I cleaned up as best I could (ovirt-hosted-engine-cleanup, with and without redoing the whole Gluster storage setup, which, apart from SSD caching not working, gives me no issues).

Undoing the network changes to the point where the oVirt HCI wizard stops complaining is a bit more involved. I typically run:
- vdsm-tool ovn-unconfigure
- vdsm-tool clear-nets (at this point I need to switch to the console)
- vdsm-tool remove-config
and then I still need to edit /etc/sysconfig/network-scripts/ifcfg-<ethernet-device> to bring the physical adapter back to life. Sometimes I also need to remove the ovirtmgmt bridge manually, etc. Whether I remove and redo the Gluster setup has a bit of an effect on the re-installation, but it doesn't make a difference in what follows.

So here is where I currently get stuck, consistently: the wizard has gone through preparing the Gluster storage (which is completely functional at that point), has created the local VM on the installation node, installed the Postgres database, filled it, etc.; basically it has oVirt up and running with the primary Gluster node and now wants to add the second and third nodes. At that point I get "Connection lost" in the web wizard, evidently as a consequence of Ansible fiddling around heavily to set up the local bridge for the VM. I remember that for the scripted variant of the setup it is recommended to run the script inside 'screen' or 'tmux' to ensure its execution isn't interrupted by that, but for the GUI variant there evidently *should* be some other type of protection, perhaps via the re-connecting nature of HTTP... Pushing the "Reconnect" button in the GUI at that point doesn't return you to that point of the setup but only offers to redeploy, while the HostedEngineLocal VM is still there and running.

I ssh'd into the machine, started looking for errors and warnings, and saw that the installation had gone rather far without incident. OTOPI had completely finished, the WildFly server is up and running, and the Postgres database is fully installed and running smoothly. The only thing I can find is that the setup is trying to add the additional Gluster nodes but complains that these nodes (it quotes their Gluster UUIDs) are not part of the "cluster". An investigation of the Postgres database shows that the 'gluster_server' table indeed only has the primary node in it. I don't know which part of the process should have added the other two nodes, but there seems to be no *remaining* connectivity issue with the Gluster members.
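This is roughly how I looked at the table, in case anyone wants to check the same thing; the database name 'engine' is the default, and the plain psql invocation is an assumption (on 4.3 the engine may be using an SCL PostgreSQL, so psql may not be on the postgres user's PATH):

    # on the HostedEngineLocal VM: which Gluster hosts does the engine know about?
    # 'engine' is the default engine database name; adjust if yours differs
    sudo -u postgres psql engine -c "SELECT * FROM gluster_server;"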
I installed gscli and connected to all three nodes and volumes without issue. I am guessing at this point that the complex rewiring of the software-defined network causes a temporary issue and a race condition that I don't know how to recover from. Since the oVirt management GUI is actually fully operational and can be reached from the primary node via the temporary bridge, I went into the GUI and even managed to add the two additional nodes without any problems. Their installation went through without any issues, they showed up in the gluster_server table in Postgres, and basically the installation could have proceeded from that point, except... that I don't know how to restart the process from there: it still has to 'beam' the local VM onto the Gluster storage and restart it there.

I have gone through the process three times now, with absolutely identical results. I could use some help on how to recover from that situation, which looks like a race condition and nothing that a re-installation of everything would really resolve. In the meantime, I'll try the scripted variant under 'screen' to see if that fares better.
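For the record, the scripted attempt I mean is simply something like this (the session name is just my choice; the Gluster part is already prepared, so only the hosted-engine deployment is run):

    yum install -y screen
    screen -S he-deploy        # survive the connection loss when the bridge gets rewired
    hosted-engine --deploy
    # if the ssh session drops, re-attach with:
    #   screen -r he-deploy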

I think that you can go on with the installation (as far as I remember, the next phase is the HostedEngine deployment) on the same node. You should not use the single-node setup, but the other one. At the end, the engine (once migrated to the gluster volume and started up by the ovirt-ha-broker/ovirt-ha-agent) will detect the gluster cluster once you add all nodes in oVirt. Then you won't have any issues managing the storage (although I prefer the cli approach).

Best Regards,
Strahil Nikolov
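P.S. For reference, the CLI checks I mean are along these lines (the volume name 'engine' is assumed from the HCI wizard defaults):

    gluster peer status                 # all three nodes should show 'Peer in Cluster (Connected)'
    gluster volume info engine          # brick list and volume options
    gluster volume heal engine info     # pending heals; should be empty on a healthy volume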

Thanks, Strahil, for your suggestions. Actually, I was far beyond the pick-up point you describe: the Gluster storage had all been prepared and was operable, and even the local VM was already running and accessible via the GUI. But I picked up your hint to try to continue with the scripted variant and found that it allowed me much better insight into what was going on. I am a little worried, though, that it actually works somewhat differently from the GUI wizard variant; in any case, the failures don't seem identical. That would have implications for test automation, which I'd rather not have to worry about.

In a separate test of the same operation on a separate set of hardware, the installation got significantly further, up to the point where the local VM had actually been moved onto the Gluster storage and into the cluster, but then failed a validation step at the very end (while the VM is actually up and running, albeit only with the primary host, and that one listed as "Unresponsive" while it's hosting the VM...). I have opened a separate ticket for that...
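For what it's worth, this is how I checked what the HA stack thinks of the host and the engine VM at that point (assuming the hosted-engine tooling is installed on the host):

    hosted-engine --vm-status                        # engine VM health and host score as seen by the HA agent
    systemctl status ovirt-ha-agent ovirt-ha-broker  # the services that are supposed to (re)start the engine VM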

After spending another couple of hours trying to track down the problem, I have found that the "lost connection" seems to be due to KVM shutting down because it cannot find the certificates for the Spice and VNC connections in /etc/pki/vdsm/*, from where 'ovirt-hosted-engine-cleanup' had deleted them. So now I wonder: who is supposed to (re-)generate them afterwards? Assuming that it was a much earlier step, I proceeded to completely undo the deployment, get rid of the Gluster setup, etc., and start from the very beginning, only to find that this didn't change a thing: it still missed those certificates...

...while something or someone *did* generate them when I tried a distinct, new set of nodes for counter-testing. That setup failed with an Ansible error (reported separately), but I have now grown afraid of using 'ovirt-hosted-engine-cleanup' when I don't know how to get the certificates/keys for /etc/pki/vdsm/{spice|vnc} regenerated... Can anyone shed some light into this darkness?
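In case someone wants to reproduce the check: this is roughly what I look at after running the cleanup. The directory layout under /etc/pki/vdsm and the 'certificates' configurator are assumptions based on what I see on these hosts; the authoritative hint is the error in the local VM's libvirt log:

    # what did ovirt-hosted-engine-cleanup leave behind?
    ls -lR /etc/pki/vdsm/
    # which cert/key path does qemu actually complain about?
    grep -iE 'certificate|tls' /var/log/libvirt/qemu/HostedEngineLocal.log
    # which configurators does this vdsm know about, and are they configured?
    vdsm-tool is-configured
    # (assumption, not verified) if a 'certificates' configurator is listed,
    # forcing it might regenerate the missing self-signed certs:
    #   vdsm-tool configure --module certificates --force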