New subject: 3node HCI fails when HostedEngineLocal is trying to add additional Gluster members

1 Dec 2019

      Three Gen8 HP360 recalled from retirement with single 1TB TLC SATA SSD for boot and oVirt /engine and 7x4TB HDD RAID6 for /vmstore and /data, 10Gbit NICs and network.

All CentOS 7.7 updated daily.

These machines may not be used exclusively for oVirt so I don't want to re-install the OS, when an oVirt setup fails: Instead I try my best to clean up the nodes when doing another oVirt installation run.

They ran oVirt for a week or two using a completely distinct set of storage, so they are fundamentally sound, but we wanted higher storage capacity so I swapped everything and re-installed CentOS very much the same way as before.

The first oVirt setup went smoothly but the cluster crumbled without much usage. I won't go into details here, because I didn't want to investigate for now, instead I focussed on redoing the installation and cleaning up the old setup.

I know the docs actually recommend starting with wiped hardware, but operationally that would be a show-stopper for the intended use case.

So I cleaned up the best I can (ovirt-hosted-engine-cleanup with and without redoing the whole Gluster storage setup, where apart from SSD caching not working, I don't have issues).

Undoing the network changes in such a way that the oVirt HCI wizard ceases complaining is a bit more involved. I typically run:
- vdsm-tool ovn-unconfigure
- vdsm-tool clear-nets (now need to switch to the console)
- vdsm-tool remove-config

and then I still need to edit /etc/sysconfig/network-scripts/ifcfg-<ethernet-device> to bring the physical adapter back to life.
Sometimes I still need to remove the ovirtmgmnt bridge manually etc.

Whether I remove and redo the Gluster as a bit of an effect in re-installation, but it doesn't make a difference in what follows.

So here is where I am currently getting stuck consistently:

The wizard is gone through preparing the Gluster storage (which is completely functional at that point), has created the local VM on the installation node, installed the Postgres database, filled it etc. basically has oVirt up and running with the primary Gluster node and now would like to add the second and third nodes.

At that point I get "Connection lost" in the Web-Wizard, evidently as a consequence of Ansible fiddling around heavily to set up the local bridge for the VM. I remember that for the scripted variant of the setup it is recommended to run the script behind 'screen' or 'tmux' in order to ensure its execution isn't interrupted by that. But for the GUI variant, evidently there *should* be some other type of potection, perhaps via the re-connecting nature of HTTP...

Pushing the "Reconnect" button on the GUI at that point doesn't return you to the point of the setup, but only offers to redeploy, while the HostedEngineLocal is still there and running.

I ssh'd into the machine and started looking for errors and warning and saw that the installation had gone rather far without incidence. OTOPI had completely finished the WildFly server is up and running the Postgres database fully installed and running smoothly, the only thing I can find is that it's trying to add the additional gluster nodes, but complains that these nodes (quotes gluster-UUIDs) are not part of the "cluster". An investigation into the Postgres database shows, that the 'gluster_server' table indeed only has the primary node in it.

I don't know what part of the process should have added the other two nodes, but there seems to be no *remaining* connectivity issue with the Gluster members. I installed gscli and connected to all three nodes and volumes without issue.

I am guessing at this point, that the complex rewiring of the software defined network is causing a temporary issue and a race condition that I don't know how to recover from.

Since the oVirt management GUI is actually fully operational and can be reached from the primary node via the temporary bridge, I went into the GUI and even managed to add the additional two nodes without any problems. Their installation went through without any issues, they showed up in the gluster_servers table on Postgress and basically the installation could have proceeded from that point, except... that I don't know how to restart the process from that point: It still has to 'beam' the local VM into the Gluster storage and restart it there.

I have gone through the process three times now, with absolutely identical results.

I could use some help how to recover from that situation, which looks like a race condition, nothing a re-installation of everything would really resolve.

In the mean-time, I'll try the scripted variant on 'screen' to see if that fares better.

3node HCI fails when HostedEngineLocal is trying to add additional Gluster members

thomas＠hoberg.net

Strahil Nikolov

thomas＠hoberg.net

thomas＠hoberg.net

tags

participants (2)