New subject: Procedure to replace out out of three hyperconverged nodes

Monday, 19 August 2019

On my silent Atom based three node Hyperconverged journey I hit upon a snag: Evidently
they are too slow for Ansible.

The Gluster storage part went all great and perfect on fresh oVirt node images that I had
configured to leave an empty partition instead of the standard /dev/sdb, but the
HostedEngine setup part would then fail without any log-visible error while the transient
VM HostedEngineLocal was supposed to be launched and the Wizard would just show
"deployment failed" and go ahead and delete the VM.

I then moved the SSD to a more powerful Xeon-D 1541 CPU and after some fiddling with the
network (I miss good old eth0!), this also failed the deployment, but also failed to
delete the temporary VM image, because that actually turned out to be running: I could
even connect to its console and investigate the logs for any clues as to what might have
gone wrong (nothing visible): Evidently Ansible was running out of patience just a tiny
bit too early.

And then I kicked it into high-gear with an i7-7700K again using the same SSD with a
working three node Gluster all in sync, which still took what felt like an hour to creep
through every step, but got it done, primary node on i7, secondary nodes on Atoms, with
full migration capabilities etc.

I then had to do some fiddling, because the HostedEngine had configued the Cluster CPU
architecture to Skylake-Spectre, but after that I migrated it to an Atom node and was now
ready to move the primary to the intended Atom hardware target.

But at that point the overlay network has already been configured and evidently it's
tied to the device name of the 10Gbit NIC on the i7 workstation and I haven't been
able to make it work with the Atom. The Gluster runs fine, but the CPU node is reported
"non-operational" and re-installation fails, because the ovirtmgmt network
isn't properly configured.

That specific issue may be seem way out of what oVirt should support, yet a HA-embedded
edge platform may very well see nodes having to be replaced or renewed with as little
interruption or downtime as possible, which is why I am asking the larger question:

How can you replace a) a failed "burned" node or b) upgrade nodes while
maintaining fault tolerance?

The distinction in b) would be that it's a planned maneuver during normal operations
without downtime.

I'd want to do it pretty much like I have been playing with compute nodes, creating
new ones, pushing VMs on them, pushing them out to other hosts, removing and replacing
them seamlessly... Except that the Gluster nodes are special and much harder to replace,
than a pure Gluster storage brick... from what I see

I welcome any help 
- for fixing the network config in my limiping Atom 1:3 cluster
- eliminating the need to fiddle with an i7 because of Ansible timing
- ensuring long-term operability of a software defined datacenter with changing hardware

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Procedure to replace out out of three hyperconverged nodes