[ovirt-users] Re: Does a slow Web proxy (can't run the update-check) ruin HA scores? (Solved, ...but really?)

27 Jun 2020

Most probably the host's ICMP echo requests to the gateway are getting lost. That incurs enough of a penalty for your engine to be moved away from the host.

Which 'penalty' did you disable to stabilize your environment?
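
One way to test the lost-ping theory from a host is to probe the gateway repeatedly and watch for failures. The following is only a rough sketch: it shells out to the Linux `ping` utility (iputils flags `-c` and `-W`), the function name `gateway_reachable` and the `runner` parameter are made up for illustration, and it is not how ovirt-ha-agent itself performs its check.

```python
import subprocess


def gateway_reachable(gateway, count=3, timeout_s=1, runner=subprocess.run):
    """Return True if `ping` reports at least one echo reply from `gateway`.

    `runner` is a hypothetical injection point so the check can be exercised
    without touching the network; by default it is subprocess.run.
    """
    # iputils ping: -c = number of echo requests, -W = per-reply timeout (s)
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), gateway]
    result = runner(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # ping exits with 0 when at least one reply was received
    return result.returncode == 0
```

Calling `gateway_reachable("<gateway address>")` in a loop on each host for a few minutes should show whether replies are dropped intermittently.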

Best Regards,
Strahil Nikolov

On 27 June 2020 18:19:58 GMT+03:00, thomas@hoberg.net wrote:
...
Running a couple of oVirt clusters on left-over hardware in an R&D
niche of the data center. Lots of switches/proxies still at 100Mbit and
just checking for updates via 'yum update' can take a while, even time
out 2 times out of 3.
The network between the nodes is 10Gbit though, faster than any other
part of the hardware, including some SSDs and RAIDs: Cluster
communication should be excellent, even if everything goes through a
single port.
After moving some servers to a new IP range, where there are even more
hops to the proxy, I am shocked to see the three HCI nodes in one
cluster almost permanently report bad HA scores, which of course
becomes a real issue, when it hits all three. The entire cluster really
starts to 'wobble'....
Trying to find the reason for that bad score and there is nothing
obvious: Machines have been running just fine, very light loads, no
downtimes, reboots etc.
But looking at the events recorded on hosts, something like "Failed to
check for available updates on host <name> with message 'Failed to run
check-update of host '<host>'. Error: null'." does come up pretty
often. Moreover, when I then have all three servers run the update
check on the GUI, I can find myself locked-out of the oVirt GUI and
once I get back in, all non-active HostedEngine hosts are suddenly back
in the 'low HA score' state.
So I have this inkling that the ability (or not) to run the
update check is counting into the HA score, which ... IMHO would be
quite mad. It would have production clusters go haywire, just because
an external internet connection is interrupted...
Any feedback on this?
P.S.
Only minutes later, after noticing the ha-scores reported by
hosted-engine --vm-status were really in the low 2000s range overall, I
did a quick Google and found this:
ovirt-ha-agent - host score penalties
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/agent/agent.conf

[score]
# NOTE: These values must be the same for all hosts in the HA cluster!
base-score=3400
gateway-score-penalty=1600
not-uptodate-config-penalty=1000 // 'not knowing if there are updates' is not the same as 'knowing it is missing critical patches'
mgmt-bridge-score-penalty=600
free-memory-score-penalty=400
cpu-load-score-penalty=1000
engine-retry-score-penalty=50
cpu-load-penalty-min=0.4
cpu-load-penalty-max=0.9
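
To see how these numbers could combine, here is a small arithmetic sketch using the [score] values quoted above. It assumes the simplest possible model: each active penalty is subtracted in full from the base score, and the result is floored at zero. The function name `host_score` is made up, and the real agent may behave differently (for example, the cpu-load penalty is presumably scaled between cpu-load-penalty-min and cpu-load-penalty-max, which is ignored here).

```python
# Values copied from the [score] section quoted above.
BASE_SCORE = 3400
PENALTIES = {
    "gateway": 1600,
    "not-uptodate-config": 1000,
    "mgmt-bridge": 600,
    "free-memory": 400,
    "cpu-load": 1000,  # assumption: applied all-or-nothing, no min/max scaling
}
ENGINE_RETRY_PENALTY = 50


def host_score(active, engine_retries=0):
    """Assumed model: base score minus every active penalty, floored at 0."""
    score = BASE_SCORE - sum(PENALTIES[name] for name in active)
    score -= engine_retries * ENGINE_RETRY_PENALTY
    return max(score, 0)


print(host_score([]))                       # 3400
print(host_score(["gateway"]))              # 1800
print(host_score(["not-uptodate-config"]))  # 2400
```

Under this model a single 1000-point penalty leaves 2400, while the gateway penalty on its own leaves 1800.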
So now I know how to fix it for me, but I'd consider this pretty much a
bug: When the update check fails, that implies really only that the
update check could not go through. It doesn't mean the cluster is
fundamentally unhealthy.
Now I understand how that negative feedback is next to impossible
inside Red Hat's network, where update servers are local.
But basing a cluster's HA score on something that is 'just now
happening' at the far edges of the Internet... seems a very bad
design decision.
Please comment and/or tell me how and where I should file this as a
bug.