Hi Didi,
thank you for the quick response.
On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm
> <ovirt(a)timmi.org> wrote:
>> Hi List,
>>
>> I'm trying to understand why my hosted engine is moved from one node to
>> another from time to time.
>> It is happening sometimes multiple times a day, but there are also days
>> without it.
>>
>> I can see the following in the ovirt-hosted-engine-ha/agent.log:
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>> Penalizing score by 1600 due to network status
>>
>> After that the engine will be shut down and started on another host.
>> The oVirt Admin Portal is showing the following around the same time:
>> Invalid status on Data Center Default. Setting status to Non Responsive.
>>
>> But the whole cluster is working normally during that time.
>>
>> I believe that I somehow have a network issue on my side, but I have no
>> clue what kind of check is causing the network status to be penalized.
>>
>> Does anyone have an idea how to investigate this further?
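
(To put the penalty into context, as far as I understand the HA scoring:
a healthy host starts from a base score of 3400, so this check drops the
affected host to

    3400 (base score) - 1600 (network penalty) = 1800

and the agent migrates the engine once another host scores more than 800
points higher, which a healthy 3400 against 1800 easily is. The 3400 base
and the 800 threshold are my reading of the agent defaults, so please
correct me if they are off.)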
> Please check also broker.log. Do you see 'dig' failures?
Yes, I found them as well:
Thread-1::WARNING::2021-07-19
08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
> This happened several times already on our CI infrastructure, but yours is
> the first report from an actual real user. See also:
> https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWA...
So I understand that the following command is triggered to test the
network: "dig +tries=1 +time=5".
> I didn't open a bug for this (yet?), also because I never reproduced it on
> my own machines and am not sure about the exact failing flow. If this is
> reliably reproducible for you, you might want to test the patch I pushed:
> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my
nodes (CentOS 8.4 based) that can host my engine.
> Other ideas/opinions about how to enhance this part of the monitoring
> are most welcome.
>
> If this phenomenon is new for you, and you can reliably say it's not due to
> a recent "natural" higher network load, I wonder if it's due to some weird
> bug/change somewhere.
I'm quite sure that I have been seeing this since we moved to 4.4.(4).
Just for housekeeping: I'm running 4.4.7 now.
Thanks and best regards,