Hi Didi,
thank you for the quick response.
On 19.07.21 at 07:59, Yedidyah Bar David wrote:
> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm
> <ovirt(a)timmi.org> wrote:
>> Hi List,
>>
>> I'm trying to understand why my hosted engine is moved from one node to
>> another from time to time.
>> It is happening sometimes multiple times a day, but there are also days
>> without it.
>>
>> I can see the following in the ovirt-hosted-engine-ha/agent.log:
>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>> Penalizing score by 1600 due to network status
>>
>> After that the engine will be shut down and started on another host.
>> The oVirt Admin Portal is showing the following around the same time:
>> Invalid status on Data Center Default. Setting status to Non Responsive.
>>
>> But the whole cluster is working normally during that time.
>>
>> I believe that I somehow have a network issue on my side, but I have no
>> clue what kind of check is causing the network status to be penalized.
>>
>> Does anyone have an idea how to investigate this further?
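
(To put the penalty into context, as far as I understand the HA scoring:
a healthy host starts from a base score of 3400, so this check drops the
affected host to

    3400 (base score) - 1600 (network penalty) = 1800

and the agent migrates the engine once another host scores more than 800
points higher, which a healthy 3400 against 1800 easily is. The 3400 base
and the 800 threshold are my reading of the agent defaults, so please
correct me if they are off.)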
> Please check also broker.log. Do you see 'dig' failures?
Yes, I found them as well:
Thread-1::WARNING::2021-07-19
08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
> This happened several times already on our CI infrastructure, but yours is
> the first report from an actual real user. See also:
> https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWA...
So I understand that the following command is triggered to test the
network: "dig +tries=1 +time=5".
> I didn't open a bug for this (yet?), also because I never reproduced it on
> my own machines and am not sure about the exact failing flow. If this is
> reliably reproducible for you, you might want to test the patch I pushed:
> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596

I'm happy to give it a try.
Please confirm that I need to replace this file (network.py) on all my
nodes (CentOS 8.4 based) that can host my engine.
> Other ideas/opinions about how to enhance this part of the monitoring
> are most welcome.
>
> If this phenomenon is new for you, and you can reliably say it's not due to
> a recent "natural" higher network load, I wonder if it's due to some weird
> bug/change somewhere.
I'm quite sure that I have been seeing this since we moved to 4.4.(4).
Just for housekeeping: I'm running 4.4.7 now.
Thanks and best regards,