Three Gen8 HP360 machines recalled from retirement, each with a single 1TB TLC SATA SSD for boot and the oVirt /engine, and a 7x4TB HDD RAID6 for /vmstore and /data; 10Gbit NICs and network.
All run CentOS 7.7, updated daily.
These machines may not be used exclusively for oVirt, so I don't want to re-install the
OS whenever an oVirt setup fails: instead I try my best to clean up the nodes before doing
another oVirt installation run.
They ran oVirt for a week or two using a completely distinct set of storage, so they are
fundamentally sound, but we wanted higher storage capacity so I swapped everything and
re-installed CentOS very much the same way as before.
The first oVirt setup went smoothly, but the cluster crumbled without much usage. I
won't go into details here, because I didn't want to investigate for now; instead
I focused on redoing the installation and cleaning up the old setup.
I know the docs actually recommend starting with wiped hardware, but operationally that
would be a show-stopper for the intended use case.
So I cleaned up as best I could (ovirt-hosted-engine-cleanup, with and without redoing the
whole Gluster storage setup; apart from SSD caching not working, I have no issues with
the Gluster side).
Undoing the network changes in such a way that the oVirt HCI wizard ceases complaining is
a bit more involved. I typically run:
- vdsm-tool ovn-unconfigure
- vdsm-tool clear-nets (this drops the network, so switch to the console now)
- vdsm-tool remove-config
and then I still need to edit /etc/sysconfig/network-scripts/ifcfg-<ethernet-device>
to bring the physical adapter back to life.
Sometimes I also need to remove the ovirtmgmt bridge manually, etc.
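Collected into one place, the per-node cleanup I run looks roughly like this sketch; the device name eno1 is a placeholder for the actual ethernet adapter, and the ifcfg edits depend on your original network config:

```shell
# Sketch of the per-node cleanup between oVirt HCI attempts.
# Run from the physical console: clear-nets cuts network connectivity.

ovirt-hosted-engine-cleanup    # remove hosted-engine leftovers on the host
vdsm-tool ovn-unconfigure      # undo the OVN integration
vdsm-tool clear-nets           # tear down the vdsm-managed networks
vdsm-tool remove-config        # drop vdsm's persisted configuration

# Restore the plain config for the physical adapter, then bounce it.
# Edit /etc/sysconfig/network-scripts/ifcfg-eno1 by hand first
# (ONBOOT=yes, remove any BRIDGE= line left behind):
ifdown eno1 && ifup eno1

# If the management bridge is still present, remove it manually:
ip link set ovirtmgmt down 2>/dev/null
brctl delbr ovirtmgmt 2>/dev/null
```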
Whether I remove and redo the Gluster setup has a bit of an effect on the re-installation,
but it doesn't make a difference in what follows.
So here is where I am currently getting stuck consistently:
The wizard has gone through preparing the Gluster storage (which is completely functional
at that point), has created the local VM on the installation node, installed the Postgres
database, filled it, etc.; basically it has oVirt up and running with the primary Gluster
node and now wants to add the second and third nodes.
At that point I get "Connection lost" in the web wizard, evidently as a
consequence of Ansible fiddling around heavily to set up the local bridge for the VM. I
remember that for the scripted variant of the setup it is recommended to run the script
inside 'screen' or 'tmux' in order to ensure its execution isn't
interrupted by that. But for the GUI variant, evidently there *should* be some other type
of protection, perhaps via the re-connecting nature of HTTP...
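For reference, the screen-protected scripted run I mean is simply this; the session name is my own choice:

```shell
# Run the scripted hosted-engine deployment inside a detachable session,
# so the network reshuffle can't kill the controlling terminal.
screen -S he-deploy          # or: tmux new -s he-deploy
hosted-engine --deploy

# If the SSH session drops anyway, reconnect and re-attach:
#   screen -r he-deploy      # or: tmux attach -t he-deploy
```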
Pushing the "Reconnect" button in the GUI at that point doesn't return you
to where the setup left off; it only offers to redeploy, while the HostedEngineLocal VM is
still there and running.
I ssh'd into the machine and started looking for errors and warnings, and saw that the
installation had gone rather far without incident: OTOPI had completely finished, the
WildFly server was up and running, and the Postgres database was fully installed and
running smoothly. The only thing I could find was that the engine was trying to add the
additional Gluster nodes, but complained that these nodes (quoting Gluster UUIDs) were not
part of the "cluster". An investigation of the Postgres database showed that the
'gluster_server' table indeed only had the primary node in it.
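The check itself is a one-liner; this assumes you are inside the HostedEngineLocal VM and that the engine database is named 'engine' (the default):

```shell
# Inspect the engine's view of the Gluster cluster membership.
# On my setup only the primary node's UUID shows up here.
su - postgres -c "psql engine -c 'SELECT * FROM gluster_server;'"
```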
I don't know which part of the process should have added the other two nodes, but there
seems to be no *remaining* connectivity issue with the Gluster members: I installed gscli
and connected to all three nodes and volumes without issue.
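The same connectivity check can be done with the stock gluster CLI on any of the three nodes; the volume name 'engine' matches the HCI wizard's default, adjust if yours differ:

```shell
# Verify Gluster peer connectivity from any node.
gluster peer status            # each peer should be 'Peer in Cluster (Connected)'
gluster pool list              # lists the UUIDs the engine complained about
gluster volume status engine   # per-brick status of the engine volume
```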
I am guessing at this point that the complex rewiring of the software-defined network is
causing a temporary issue and a race condition that I don't know how to recover from.
Since the oVirt management GUI is actually fully operational and reachable from the
primary node via the temporary bridge, I went into the GUI and even managed to add the
two additional nodes without any problems. Their installation went through without any
issues, they showed up in the gluster_server table in Postgres, and the installation
basically could have proceeded from that point, except... that I don't know how to
restart the process from there: it still has to 'beam' the local VM onto the
Gluster storage and restart it there.
I have gone through the process three times now, with absolutely identical results.
I could use some help recovering from that situation, which looks like a race
condition; nothing a re-installation of everything would really resolve.
In the meantime, I'll try the scripted variant under 'screen' to see whether it
fares better.