Does a slow Web proxy (can't run the update-check) ruin HA scores? (Solved, ...but really?)

Running a couple of oVirt clusters on left-over hardware in an R&D niche of the data center. Lots of switches/proxies are still at 100Mbit, and just checking for updates via 'yum update' can take a while, even time out two times out of three. The network between the nodes is 10Gbit though, faster than any other part of the hardware, including some SSDs and RAIDs: cluster communication should be excellent, even if everything goes through a single port.

After moving some servers to a new IP range, where there are even more hops to the proxy, I am shocked to see the three HCI nodes in one cluster almost permanently report bad HA scores, which of course becomes a real issue when it hits all three. The entire cluster really starts to 'wobble'...

Trying to find the reason for that bad score, there is nothing obvious: the machines have been running just fine, very light loads, no downtimes, no reboots etc. But looking at the events recorded on the hosts, something like "Failed to check for available updates on host <name> with message 'Failed to run check-update of host '<host>'. Error: null'." comes up pretty often. Moreover, when I then have all three servers run the update check from the GUI, I can find myself locked out of the oVirt GUI, and once I get back in, all non-active HostedEngine hosts are suddenly back in the 'low HA score' state.

So I have this nagging impression that the ability (or not) to run the update check is counted into the HA score, which... IMHO would be quite mad. It would have production clusters go haywire just because an external internet connection is interrupted...

Any feedback on this?

P.S. Only minutes later, after noticing the HA scores reported by hosted-engine --vm-status were really in the low 2000s overall, I did a quick Google search and found this:

ovirt-ha-agent - host score penalties
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_engine_ha/agent/agent.conf

    [score]
    # NOTE: These values must be the same for all hosts in the HA cluster!
    base-score=3400
    gateway-score-penalty=1600
    not-uptodate-config-penalty=1000  // not 'knowing if there are updates' is not the same as 'knowing it is missing critical patches'
    mgmt-bridge-score-penalty=600
    free-memory-score-penalty=400
    cpu-load-score-penalty=1000
    engine-retry-score-penalty=50
    cpu-load-penalty-min=0.4
    cpu-load-penalty-max=0.9

So now I know how to fix it for me, but I'd consider this pretty much a bug: when the update check fails, that really only implies that the update check could not go through. It doesn't mean the cluster is fundamentally unhealthy.

Now I understand how that negative feedback would be next to impossible to notice inside Red Hat's own network, where the update servers are local. But having a cluster HA score depend on something 'just now happening' on the far edges of the Internet... seems a very bad design decision.

Please comment and/or tell me how and where I should file this as a bug.
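P.P.S. To make the arithmetic concrete, here is a back-of-the-envelope sketch (my own illustration in Python, not the actual ovirt-ha-agent scoring code) of how those agent.conf penalties combine into the numbers that hosted-engine --vm-status reports:

    # Illustration only -- not the real ovirt-ha-agent scoring code.
    # Penalty values are the agent.conf defaults quoted above.
    BASE_SCORE = 3400
    PENALTIES = {
        "gateway": 1600,              # ICMP echo to the gateway lost
        "not-uptodate-config": 1000,  # hosted-engine config out of date
        "mgmt-bridge": 600,           # management bridge down
        "free-memory": 400,           # not enough free memory
        "cpu-load": 1000,             # CPU load above threshold
        "engine-retry": 50,           # per engine start retry
    }

    def host_score(active):
        """Subtract every currently active penalty from the base score."""
        return BASE_SCORE - sum(PENALTIES[name] for name in active)

    # A host that only fails the gateway check drops straight down:
    print(host_score(["gateway"]))  # 1800 -- about where my hosts sat

A single failed gateway check alone knocks a host from 3400 down to 1800, which lines up with the neighbourhood of scores I was seeing.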

Most probably the hosts' ICMP echo requests to the gateway get lost. This incurs enough of a penalty that your engine is moved away from the host. Which 'penalty' did you disable to stabilize your environment?

Best Regards,
Strahil Nikolov
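P.S. A quick way to see whether the gateway probe is what's failing is to reproduce it by hand; a rough stand-in (illustrative only, not the agent's actual monitor code, and the gateway address below is a placeholder for your management network's default gateway):

    # Illustrative stand-in for the gateway liveness check: plain ICMP
    # echo requests to the management network's default gateway.
    import subprocess

    GATEWAY = "192.0.2.1"  # placeholder -- substitute your own gateway

    def gateway_alive(addr, count=3, timeout_s=2):
        """True if the gateway answered at least one ICMP echo request."""
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout_s), addr],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    if not gateway_alive(GATEWAY):
        print("gateway unreachable -- expect the 1600-point score penalty")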

I'd say you were close! I tried fiddling with the penalties, but that didn't do anything good. But once I found that hosted-engine --vm-status displayed the score across the hosts, I saw they were constantly very low; the 1600 gateway penalty seems a proper match.

I then reinstalled the cluster, bypassing any dependency on DNS, which may be a little slow as it's not under my control. I have fully fleshed-out /etc/hosts files to accelerate that, but those seem to be ignored sometimes, or only come into play when a DNS lookup has outright failed, not just taken too long. In the cockpit setup screen you get to choose whether to use DNS, ping or TCP for a liveness check, I guess for the ovirt-ha-agent or -broker, and I also chose 'ping' there, which made the cockpit screen immediately happy, while the 'dns' setting seemed to take a long time. With that, I see scores of 3400 all around, so I guess that nailed it.

I've found the Python code that implements the ovirt-ha monitors, but I can't find a broker.conf file or any other entry where the mechanism is actually configured, so that I could change and test different settings without a re-installation. I quite like the liberty a proper DNS might give me, in case I need to move networks again. Yet after this, I'm very motivated to go back to plain old hardwired IPv4.

Pretty confident it wasn't the missing package updates now (sorry guys!), but at least it got me looking in the proper direction...
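P.S. For anyone who wants to verify that slow DNS, rather than a dead link, is what drags out the 'dns' liveness option: timing a plain resolver lookup shows it directly. An illustrative snippet (the hostname is made up):

    # Illustration only; 'ovirt-node1.example.com' is a made-up name.
    import socket
    import time

    FQDN = "ovirt-node1.example.com"

    start = time.monotonic()
    try:
        addr = socket.gethostbyname(FQDN)
        print(f"resolved {FQDN} -> {addr} in {time.monotonic() - start:.2f}s")
    except OSError as err:
        print(f"lookup failed after {time.monotonic() - start:.2f}s: {err}")
    # gethostbyname() goes through the normal NSS order ('files dns' in
    # /etc/nsswitch.conf), so a hit in /etc/hosts returns instantly,
    # while a slow upstream DNS server shows up as seconds of delay.

A hit in /etc/hosts should come back instantly, which is presumably why the 'ping' option satisfied the cockpit screen right away while 'dns' stalled.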