On Mon, Aug 19, 2019 at 10:55 PM <thomas@hoberg.net> wrote:
On my silent, Atom-based three-node hyperconverged journey I hit a snag: evidently these machines are too slow for Ansible.

The Gluster storage part went perfectly on fresh oVirt Node images that I had configured to leave an empty partition instead of the standard /dev/sdb. The HostedEngine setup part, however, would then fail without any log-visible error while the transient VM HostedEngineLocal was supposed to be launched; the wizard would just show "deployment failed" and go ahead and delete the VM.

I then moved the SSD to a more powerful Xeon D-1541 machine and, after some fiddling with the network (I miss good old eth0!), the deployment failed again. This time, though, it also failed to delete the temporary VM image, because that VM actually turned out to be running: I could even connect to its console and investigate the logs for clues as to what might have gone wrong (nothing visible). Evidently Ansible was running out of patience just a tiny bit too early.

And then I kicked it into high gear with an i7-7700K, again using the same SSD and a working three-node Gluster all in sync. It still took what felt like an hour to creep through every step, but it got the job done: primary node on the i7, secondary nodes on the Atoms, with full migration capabilities etc.

I then had to do some fiddling, because the HostedEngine had configured the cluster CPU type as Skylake-Spectre, but after that I migrated it to an Atom node and was ready to move the primary to its intended Atom hardware target.

But at that point the overlay network had already been configured, and evidently it is tied to the device name of the 10 Gbit NIC in the i7 workstation; I haven't been able to make it work on the Atom. Gluster runs fine, but as a compute node it is reported as "non-operational", and re-installation fails because the ovirtmgmt network isn't properly configured.
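To see whether the host's persisted network definition still references the old NIC name, it can help to compare what vdsm knows with what the kernel sees. A minimal diagnostic sketch (my own, not an oVirt tool), assuming it runs on an oVirt node with vdsm and iproute2 installed:

#!/usr/bin/env python3
# Compare the networks vdsm has configured on this host with the NIC
# names actually present, to spot an ovirtmgmt definition that still
# points at the old 10 Gbit device from the i7 box.
import subprocess

def run(cmd):
    """Run a command and return its output, or a short error note."""
    try:
        return subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError) as err:
        return "(failed: {})\n".format(err)

print("Networks known to vdsm:")
print(run(["vdsm-tool", "list-nets"]))        # should include ovirtmgmt

print("NICs present on this host:")
print(run(["ip", "-brief", "link", "show"]))  # compare the device names

If the names don't line up, the usual route is to reassign ovirtmgmt to the right NIC via "Setup Host Networks" on the host's Network Interfaces tab in the Administration Portal and then reinstall the host; "vdsm-tool restore-nets" rolls the node back to its last persisted network configuration if an experiment goes wrong.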

That specific issue may seem well outside what oVirt is expected to support, yet an HA-embedded edge platform may very well see nodes having to be replaced or renewed with as little interruption or downtime as possible, which is why I am asking the larger question:

How can you a) replace a failed ("burned") node or b) upgrade nodes, while maintaining fault tolerance?

a) You could add a new node to the cluster, replace the bricks of the failed node with bricks on the new node, and then remove the failed node (see the sketch after b) below).

b) The cluster supports rolling upgrades, so nodes are updated one at a time, always keeping two replicas of the Gluster bricks online so there is no data unavailability.
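For a), here is a rough sketch of the Gluster side of that replacement, driving the plain gluster CLI from Python; the volume name, hostnames and brick paths are made-up placeholders, and the new host still has to be added (and the dead one removed) in the oVirt Administration Portal as well:

#!/usr/bin/env python3
# Rough sketch of replacing a burned node's bricks with bricks on a
# freshly added node via the gluster CLI. "vmstore", "failed-node" and
# "new-node" are placeholders for your own volume and hostnames.
import subprocess

VOLUME = "vmstore"
OLD_BRICK = "failed-node:/gluster_bricks/vmstore/vmstore"
NEW_BRICK = "new-node:/gluster_bricks/vmstore/vmstore"

def gluster(*args):
    """Run a gluster CLI command, echoing it first for visibility."""
    cmd = ["gluster"] + list(args)
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Add the replacement node to the trusted storage pool.
gluster("peer", "probe", "new-node")

# 2. Move the failed node's brick onto the new node (repeat per volume).
gluster("volume", "replace-brick", VOLUME, OLD_BRICK, NEW_BRICK,
        "commit", "force")

# 3. Check self-heal status; wait for pending entries to drain to zero.
gluster("volume", "heal", VOLUME, "info")

# 4. Once every volume is moved and healed, drop the unreachable peer.
gluster("peer", "detach", "failed-node", "force")

The same heal check in step 3 is what gates b) as well: during a rolling upgrade you only move on to the next node once "gluster volume heal <vol> info" shows no pending entries, so two replicas stay online throughout.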


The distinction in b) would be that it's a planned maneuver during normal operations without downtime.

I'd want to do it pretty much like I have been playing with compute nodes: creating new ones, pushing VMs onto them, pushing them out to other hosts, removing and replacing them seamlessly... except that the Gluster nodes are special and, from what I see, much harder to replace than a pure Gluster storage brick.

I welcome any help with
- fixing the network config in my limping Atom 1:3 cluster
- eliminating the need to fiddle with an i7 because of Ansible timing
- ensuring long-term operability of a software-defined datacenter with changing hardware