On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <ovirt(a)timmi.org> wrote:
On 19.07.21 at 10:52, Yedidyah Bar David wrote:
> On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>
>> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
>>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
>>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>> Hi Didi,
>>>>>>
>>>>>> thank you for the quick response.
>>>>>>
>>>>>>
>>>>>> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
>>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>>>> Hi List,
>>>>>>>>
>>>>>>>> I'm trying to understand why my hosted engine is moved from one node to
>>>>>>>> another from time to time.
>>>>>>>> It is happening sometimes multiple times a day, but there are also days
>>>>>>>> without it.
>>>>>>>>
>>>>>>>> I can see the following in the ovirt-hosted-engine-ha/agent.log:
>>>>>>>>
>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status
>>>>>>>>
>>>>>>>> After that the engine will be shut down and started on another host.
>>>>>>>> The oVirt Admin portal is showing the following around the same time:
>>>>>>>> Invalid status on Data Center Default. Setting status to Non Responsive.
>>>>>>>>
>>>>>>>> But the whole cluster is working normally during that time.
>>>>>>>>
>>>>>>>> I believe that I somehow have a network issue on my side, but I have no
>>>>>>>> clue what kind of check is causing the network status to be penalized.
>>>>>>>>
>>>>>>>> Does anyone have an idea how to investigate this further?
>>>>>>> Please check also broker.log. Do you see 'dig' failures?
>>>>>> Yes I found them as well.
>>>>>>
>>>>>> Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
>>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>>>>>> ;; global options: +cmd
>>>>>> ;; connection timed out; no servers could be reached
>>>>>>
>>>>>>> This happened several times already on our CI infrastructure, but yours is
>>>>>>> the first report from an actual real user. See also:
>>>>>>>
>>>>>>> https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWA...
>>>>>> So I understand that the following command is triggered to test the
>>>>>> network: "dig +tries=1 +time=5"
>>>>> Indeed.
>>>>>
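To illustrate - conceptually, the check boils down to something like this (a
simplified sketch in Python, not the actual network.py code; the names are
illustrative):

    import subprocess

    def dns_network_test(use_tcp=False):
        # Run dig against the resolvers from /etc/resolv.conf; a non-zero
        # exit status (e.g. "connection timed out; no servers could be
        # reached") counts as a failed network test.
        cmd = ['dig', '+tries=1', '+time=5']
        if use_tcp:
            # Roughly the idea behind the patch mentioned below: also try
            # the query over TCP instead of only UDP.
            cmd.append('+tcp')
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0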
>>>>>>> I didn't open a bug for this (yet?), also because I never reproduced it on
>>>>>>> my own machines and am not sure about the exact failing flow. If this is
>>>>>>> reliably reproducible for you, you might want to test the patch I pushed:
>>>>>>>
>>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
>>>>>> I'm happy to give it a try.
>>>>>> Please confirm that I need to replace this file (network.py) on all my
>>>>>> nodes (CentOS 8.4 based) which can host my engine.
>>>>> It definitely makes sense to do so, but in principle there is no problem
>>>>> with applying it only on some of them. That's especially useful if you try
>>>>> this first on a test env and try to force a reproduction somehow (overload
>>>>> the network, disconnect stuff, etc.).
>>>> OK will give it a try and report back.
>>> Thanks and good luck.
> Do I need to restart anything after that change?
Yes, the broker. This might restart some other services there, so it's best to
put the host into maintenance while doing this.
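Concretely, that would be something like (from memory, please double-check):
put the host into maintenance with 'hosted-engine --set-maintenance --mode=local',
then 'systemctl restart ovirt-ha-broker', and when done,
'hosted-engine --set-maintenance --mode=none'.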
> Also please confirm that the comma after TCP is correct, as there wasn't one
> after the timeout in row 110 before.
It is correct, but not mandatory. We (my team, at least) often add it in such
cases so that a theoretical future patch that adds another parameter does not
have to add it again (thus making that patch smaller and hopefully cleaner).
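A tiny illustration, with made-up names, just to show the style:

    # Before - note the trailing comma after the last argument:
    result = network_test(
        tries=1,
        timeout=5,
    )

    # A hypothetical later patch adding a TCP flag then only adds one
    # line, instead of also having to touch the 'timeout=5' line:
    result = network_test(
        tries=1,
        timeout=5,
        tcp=True,
    )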
>>>
>>>>>>> Other ideas/opinions about how to enhance this part of the monitoring
>>>>>>> are most welcome.
>>>>>>>
>>>>>>> If this phenomenon is new for you, and you can reliably say it's not due
>>>>>>> to a recent "natural" higher network load, I wonder if it's due to some
>>>>>>> weird bug/change somewhere.
>>>>>> I'm quite sure that I have been seeing this since we moved to 4.4.(4).
>>>>>> Just for housekeeping: I'm running 4.4.7 now.
>>>>> We have used 'dig' as the network monitor since 4.3.5, around one year
>>>>> before 4.4 was released: https://bugzilla.redhat.com/1659052
>>>>>
>>>>> Which version did you use before 4.4?
>>>> The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10, before we migrated
>>>> to 4.4.4.
>>> I now realize that in the above-linked bug we only changed the default for new
>>> setups. So if you deployed HE before 4.3.5, an upgrade to a later 4.3 would not
>>> change the default (as opposed to the upgrade to 4.4, which was actually a
>>> new deployment with engine backup/restore). Do you know which version
>>> your cluster was originally deployed with?
>> Hm, I'm sorry but I don't recall this. I'm quite sure that we started
> OK, thanks for trying.
>
>> with 4.0 something. But we moved to an HE setup around September 2019;
>> I don't recall the version. We also installed the backup from the old
>> installation into the HE environment, if I'm not wrong.
> If indeed this change was the trigger for you, you can rather easily try to
> change this to 'ping' and see if this helps - I think it's enough to change
> 'network_test' to 'ping' in /etc/ovirt-hosted-engine/hosted-engine.conf
> and restart the broker - I didn't try it, though. But generally speaking, I do
> not think we want to change the default back to 'ping', but rather make 'dns'
> work better/well. We had valid reasons to move away from ping...
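For reference, I mean something like this in
/etc/ovirt-hosted-engine/hosted-engine.conf (untested, so please verify on one
host first):

    network_test=ping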
> OK, I will try this if the TCP change does not help me.
Ok.
In parallel, especially if this is reproducible, you might want to do some
general monitoring of your network - packet losses, etc. - and correlate this
with the failures you see.
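For example, something as simple as 'mtr --report --report-cycles 100 <your
DNS server>' from each host, or a long-running ping against the resolvers in
/etc/resolv.conf, should show whether you get packet loss around the times
the score is penalized.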
Best regards,
--
Didi