Running a couple of oVirt clusters on left-over hardware in an R&D niche of the data
center. Lots of switches/proxies are still at 100Mbit, and just checking for updates via
'yum update' can take a while, even time out two times out of three.
The network between the nodes is 10Gbit though, faster than any other part of the
hardware, including some of the SSDs and RAID arrays: cluster communication should be
excellent, even if everything goes through a single port.
After moving some servers to a new IP range, where there are even more hops to the proxy,
I was shocked to see the three HCI nodes in one cluster almost permanently report bad HA
scores, which of course becomes a real issue when it hits all three. The entire cluster
really starts to 'wobble'...
Trying to find the reason for the bad score turns up nothing obvious: the machines have
been running just fine, with very light loads, no downtimes, no reboots, etc.
But looking at the events recorded for the hosts, something like "Failed to check for
available updates on host <name> with message 'Failed to run check-update of
host '<host>'. Error: null'." comes up pretty often. Moreover, when I then have all
three servers run the update check from the GUI, I can find myself locked out of the
oVirt GUI, and once I get back in, all non-active HostedEngine hosts are suddenly back
in the 'low HA score' state.
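(In case anyone wants to correlate the two on their own setup: the failed update checks
should show up in the engine log on the HostedEngine VM, and the score changes in the HA
agent log on each host; the paths below are the defaults on my nodes, adjust if yours
differ.)

  # on the HostedEngine VM
  grep -i 'check-update' /var/log/ovirt-engine/engine.log

  # on each host
  grep -i 'score' /var/log/ovirt-hosted-engine-ha/agent.log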
So I have this inkling that the ability (or not) to run the update check counts toward
the HA score, which... IMHO would be quite mad. It would have production clusters going
haywire just because an external internet connection is interrupted...
Any feedback on this?
P.S.
Only minutes later, after noticing that the HA scores reported by hosted-engine --vm-status
were really in the low 2000s range overall, I did a quick Google search and found this:
ovirt-ha-agent - host score penalties
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovir...
NOTE: These values must be the same for all hosts in the HA cluster!
base-score=3400
gateway-score-penalty=1600
not-uptodate-config-penalty=1000 // not 'knowing if there are updates' is not the
same as 'knowing it is missing critical patches'
mgmt-bridge-score-penalty=600
free-memory-score-penalty=400
cpu-load-score-penalty=1000
engine-retry-score-penalty=50
cpu-load-penalty-min=0.4
cpu-load-penalty-max=0.9
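If I read those numbers right, the math lines up with what I'm seeing: 3400 (base-score)
- 1000 (not-uptodate-config-penalty) = 2400, and a handful of 50-point engine-retry
penalties on top would presumably explain the rest of the drop into the low 2000s.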
So now I know how to fix it for myself, but I'd consider this pretty much a bug: when the
update check fails, that really only implies that the update check could not go through.
It doesn't mean the cluster is fundamentally unhealthy.
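For anyone who wants to see which host is sitting at which score without wading through
the full status output, something along these lines should do it (the --json flag and the
'hostname'/'score' field names are what my installation prints, adjust if your version
differs):

  hosted-engine --vm-status --json | python3 -m json.tool | grep -E '"(hostname|score)"'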
Now, I understand that this failure mode is next to impossible to hit inside Red Hat's
own network, where the update servers are local.
But having a cluster's HA score depend on something 'just now happening' at the far
edges of the Internet... seems like a very bad design decision.
Please comment and/or tell me how and where I should file this as a bug.