On Mon, Aug 19, 2019 at 10:55 PM <thomas@hoberg.net> wrote:
On my silent, Atom-based three-node hyperconverged journey I hit upon a
snag: evidently the Atoms are too slow for Ansible.
The Gluster storage part went perfectly on fresh oVirt Node images that I
had configured to leave an empty partition instead of the standard
/dev/sdb. The HostedEngine setup part, however, would then fail without
any log-visible error while the transient VM HostedEngineLocal was
supposed to be launched: the wizard would just show "deployment failed"
and go ahead and delete the VM.
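In case it helps anyone else, these are roughly the places I ended up
checking afterwards (paths are from my oVirt Node install and may well
differ on yours):

    # is the transient VM actually there or already gone?
    virsh -r list --all
    # logs written by the deployment wizard itself
    ls -lrt /var/log/ovirt-hosted-engine-setup/
    # per-domain log libvirt keeps for the local bootstrap VM
    less /var/log/libvirt/qemu/HostedEngineLocal.log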
I then moved the SSD to a more powerful Xeon D-1541 machine and, after
some fiddling with the network (I miss good old eth0!), the deployment
failed there as well, but this time it also failed to delete the temporary
VM image, because that VM actually turned out to be running: I could even
connect to its console and investigate its logs for any clue as to what
might have gone wrong (nothing visible). Evidently Ansible was running out
of patience just a tiny bit too early.
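If there is a supported knob for giving Ansible more patience I would love
to hear about it; as a crude starting point I grepped the hosted-engine
setup role for its retry counters (assuming the role lives under
/usr/share/ansible/roles, as it does on my node image):

    # the tasks that poll with retries/delay are what decide how long
    # the deployment is willing to wait for the local VM
    grep -rn "retries" /usr/share/ansible/roles/ovirt.hosted_engine_setup/tasks/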
I then kicked it into high gear with an i7-7700K, again using the same SSD
and a working three-node Gluster all in sync. The deployment still took
what felt like an hour to creep through every step, but it got done:
primary node on the i7, secondary nodes on the Atoms, with full migration
capabilities etc. I then had to do some fiddling, because the HostedEngine
had configured the cluster CPU architecture as Skylake-Spectre, but after
that I migrated it to an Atom node and was ready to move the primary to
the intended Atom hardware target.
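For reference, this is roughly how I checked which CPU type the cluster
had ended up with through the REST API (engine FQDN and password are
placeholders, and I'm assuming xmllint is around for pretty-printing):

    # dump the cluster definitions and look at the <cpu><type> element
    curl -s -k -u admin@internal:PASSWORD \
        https://engine.example.com/ovirt-engine/api/clusters \
        | xmllint --format - | grep -A3 '<cpu>'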
But at that point the overlay network had already been configured, and
evidently it is tied to the device name of the 10 Gbit NIC on the i7
workstation; I haven't been able to make it work with the Atom. Gluster
runs fine, but the host is reported "non-operational" and re-installation
fails, because the ovirtmgmt network isn't properly configured.
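For anyone comparing notes: on the host I checked the NIC name that VDSM
has persisted for ovirtmgmt against the interface names the Atom actually
offers (the persistence path is the one on my node install and may
differ):

    # which device name VDSM thinks ovirtmgmt should sit on
    cat /var/lib/vdsm/persistence/netconf/nets/ovirtmgmt
    # the interface names this particular box really has
    ip -br link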
That specific issue may seem way outside what oVirt should support, yet an
HA-embedded edge platform may very well see nodes having to be replaced or
renewed with as little interruption or downtime as possible, which is why
I am asking the larger question: how can you a) replace a failed ("burned")
node or b) upgrade nodes while maintaining fault tolerance?
a) You could add a new node to the cluster, replace the bricks on the
failed node with bricks on the new node, and then remove the failed node
(rough example commands below).
b) The cluster supports rolling upgrades, so nodes are updated one at a
time, ensuring there are always two replicas of the Gluster bricks online
and therefore no data unavailability.
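For a), roughly along these lines, with hostnames, volume name and brick
paths only as examples (they follow the oVirt hyperconverged defaults):

    # bring the replacement node into the trusted pool
    gluster peer probe newnode.example.com
    # move the failed node's brick onto the new node (repeat per volume)
    gluster volume replace-brick engine \
        failednode.example.com:/gluster_bricks/engine/engine \
        newnode.example.com:/gluster_bricks/engine/engine \
        commit force
    # let self-heal catch the new brick up before detaching the old peer
    gluster volume heal engine info summary
    gluster peer detach failednode.example.com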
The distinction in b) would be that it's a planned maneuver during normal
operations, without downtime.
I'd want to do it pretty much the way I have been playing with compute
nodes: creating new ones, pushing VMs onto them, pushing them out to other
hosts, removing and replacing them seamlessly... except that the Gluster
nodes are special and, from what I see, much harder to replace than a pure
Gluster storage brick.
I welcome any help
- for fixing the network config in my limping Atom 1:3 cluster
- for eliminating the need to fiddle with an i7 because of Ansible timing
- for ensuring the long-term operability of a software-defined datacenter
with changing hardware