For my home lab I operate a 3-node HCI cluster on 100% passively cooled
Atoms, mostly to run light infrastructure services such as LDAP and
NextCloud.
I then add workstations or even laptops to the cluster as pure compute
hosts for bigger but temporary workloads; those machines may run a
different OS most of the time or simply be switched off. From oVirt's
point of view they are just put into maintenance and then shut down
until needed again. No fencing or power management, all manual.
All nodes, even the HCI ones, run CentOS 7 with more of a workstation
configuration, so updates pile up pretty quickly.
After I recently upgraded one of these extra compute nodes, I found my
three-node HCI cluster not just faltering but actually very hard to
reactivate at all.
The faltering is a distinct issue: I have the impression that reboots
of oVirt nodes cause broadcast storms on my rather simplistic 10Gbit L2
switch, something a normal CentOS instance (or any other OS) doesn't
cause, but that's for another post.
Now what struck me was that the gluster daemons on the three HCI nodes
kept complaining about a lack of quorum long after the network was back
to normal, even though all three of them were there, saw each other
perfectly in "gluster peer status", and were ready without any healing
issues pending at all.
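For reference, this is roughly what I ran on each of the three nodes to
convince myself that the pool itself was healthy (the volume names are
just the usual HCI defaults, substitute your own):

    gluster peer status
    gluster volume status all
    gluster volume heal engine info
    gluster volume heal data info

All peers showed up as connected and the heal info came back empty.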
Glusterd would nevertheless complain on all three nodes that there was
no quorum for the bricks and stop them.
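Those complaints show up in the glusterd log on each node; something
like this pulls them out (the quoted wording is from memory, so treat
it as approximate rather than a verbatim copy):

    grep -i quorum /var/log/glusterfs/glusterd.log

    # roughly what I believe I saw:
    #   "Server quorum lost for volume engine. Stopping local bricks."
    #   "Server quorum regained for volume engine. Starting local bricks."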
That went away as soon as I started one additional compute node, a node
that was a gluster peer (because an oVirt host added to an HCI cluster
always gets added to the Gluster trusted pool, even if it contributes
no storage) but had no bricks. Immediately the gluster daemons on the
three nodes with contributing bricks reported quorum as restored and
launched the volumes (and thus all the rest of oVirt), even though in
terms of *storage bricks* nothing had changed.
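If I understand Gluster's server-side quorum correctly, that would be
consistent with how the ratio is computed: it counts peers in the
trusted pool, not bricks. A purely hypothetical head count (my real
numbers differ) would look like this:

    3 HCI peers with bricks + 4 compute-only peers = 7 peers in the pool
    only the 3 HCI nodes up:  3/7 = ~43%, below the >50% default, bricks get stopped
    one compute node started: 4/7 = ~57%, above the default, bricks come back up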
I am afraid that downing that extra compute-only oVirt node will bring
the HCI down again: clearly not the type of redundancy it is designed
to deliver.
Evidently such compute-only hosts (and gluster peers) get included in
some quorum deliberations even if they hold not a single brick, neither
storage nor arbiter.
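If it really is the server-side (peer-level) quorum that bites me here,
the relevant options should be visible with something like this (volume
name again just the usual HCI default):

    gluster volume get engine all | grep quorum

My understanding is that the HCI setup puts cluster.server-quorum-type
to "server", which makes glusterd count every peer in the pool, bricks
or not; please correct me if that reading is wrong.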
To me that seems like a bug, if that is indeed what happens: this is
where I need your advice and suggestions.
AFAIK HCI is a late addition to oVirt/RHEV, as storage and compute were
originally designed to be completely distinct. In fact there are still
remnants of documentation which seem to prohibit using a node for both
compute and storage... which is exactly what HCI is all about.
And I have seen compute nodes with "matching" storage (parts of a
distinct HCI setup that was taken down but still had all its storage
and Gluster elements operable) being happily absorbed into an HCI
cluster, with all the Gluster storage appearing in the GUI etc.,
without any manual creation or inclusion of bricks: fully automatic
(and undocumented)!
In that case it makes sense to widen the scope of the quorum
calculation, because the additional nodes are hyperconverged elements
with contributing bricks. It also seems to be the only way to turn a
3-node HCI into a 6- or 9-node one.
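Growing the storage part would then presumably look like the usual
add-brick round (host names and brick paths below are made up, the real
ones would follow whatever the deployment used):

    gluster volume add-brick data replica 3 \
        host4:/gluster_bricks/data/data \
        host5:/gluster_bricks/data/data \
        host6:/gluster_bricks/data/data

which turns the replica-3 volume into a 2x3 distributed-replicate one,
and at that point it is perfectly reasonable for the three new peers to
vote on quorum as well.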
But if you really just want to add compute nodes without bricks, those
shouldn't get "quorum votes": without storage they play no role in the
redundancy.
I can easily imagine the missing "if then else" in the code here, but I
was actually very surprised to see those failure and success messages
coming from glusterd itself, which to my understanding is pretty much
unrelated to oVirt on top: not from the management engine (which wasn't
running anyway), and not from VDSM.
Re-creating the scenario is rather scary, even though I have already
gone through it three times just trying to bring my HCI back up. And
there are such verbose logs all over the place that I'd like some
advice on which ones I should post.
But simply speaking: Gluster peers should get no quorum voting rights
on volumes unless they contribute bricks to them. That rule seems
broken.
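If someone can confirm that it really is the server-side quorum
counting brick-less peers, I am tempted to test that by relaxing it on
one volume, although I am not sure that is wise, since it also weakens
the split-brain protection:

    gluster volume set engine cluster.server-quorum-type none

(volume name again just an example). The cleaner alternative would
probably be to detach the compute-only peers before powering them off,
but since oVirt added them to the pool itself, I have no idea how it
would react to a manual "gluster peer detach".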
Those in the know, please let me know whether I am on a wild-goose
chase or whether there is a real issue here that deserves a bug report.
_______________________________________________
I have skipped a huge part of your e-mail because it was too long
(don't get offended).
Can you summarize in one (or two) sentences what exactly the problem
is? Is the UI not detecting the Gluster status, is quorum preventing
you from starting VMs, or is it something else?
Best Regards,
Strahil Nikolov