On 21.07.21 at 12:17, Yedidyah Bar David wrote:
On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David
<didi(a)redhat.com> wrote:
> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <ovirt(a)timmi.org> wrote:
>>
>>
>> On 19.07.21 at 10:52, Yedidyah Bar David wrote:
>>> On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
>>>>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
>>>>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>>>> Hi Didi,
>>>>>>>>
>>>>>>>> thank you for the quick response.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
>>>>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>>>>>> Hi List,
>>>>>>>>>>
>>>>>>>>>> I'm trying to understand why my hosted engine is moved from one
>>>>>>>>>> node to another from time to time.
>>>>>>>>>> It is happening sometimes multiple times a day. But there are also
>>>>>>>>>> days without it.
>>>>>>>>>>
>>>>>>>>>> I can see the following in the ovirt-hosted-engine-ha/agent.log:
>>>>>>>>>>
>>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>>>>>>>>> Penalizing score by 1600 due to network status
>>>>>>>>>>
>>>>>>>>>> After that the engine will be shut down and started on another host.
>>>>>>>>>> The oVirt Admin portal is showing the following around the same time:
>>>>>>>>>> Invalid status on Data Center Default. Setting status to Non Responsive.
>>>>>>>>>>
>>>>>>>>>> But the whole cluster is working normally during that time.
>>>>>>>>>>
>>>>>>>>>> I believe that I somehow have a network issue on my side, but I have
>>>>>>>>>> no clue what kind of check is causing the network status to be
>>>>>>>>>> penalized.
>>>>>>>>>>
>>>>>>>>>> Does anyone have an idea how to investigate this further?
>>>>>>>>> Please also check broker.log. Do you see 'dig' failures?
>>>>>>>> Yes I found them as well.
>>>>>>>>
>>>>>>>> Thread-1::WARNING::2021-07-19
>>>>>>>> 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
>>>>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>>>>>>>> ;; global options: +cmd
>>>>>>>> ;; connection timed out; no servers could be reached
>>>>>>>>
>>>>>>>>> This happened several times already on our CI infrastructure, but
>>>>>>>>> yours is the first report from an actual real user. See also:
>>>>>>>>>
>>>>>>>>> https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWA...
>>>>>>>> So I understand that the following command is triggered to test the
>>>>>>>> network: "dig +tries=1 +time=5"
>>>>>>> Indeed.
>>>>>>>
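For reference, the 'dns' network test boils down to running a bare
`dig +tries=1 +time=5` against the resolvers in /etc/resolv.conf and treating
a timeout as a failed probe. A minimal sketch of that pass/fail decision (my
own simplified reconstruction for illustration, not the actual code from
ovirt-hosted-engine-ha's network.py):

```python
def dig_succeeded(output: str, returncode: int) -> bool:
    """Judge a 'dig +tries=1 +time=5' run roughly the way the broker's
    DNS test does: a non-zero exit or a timeout message in the output
    counts as a failed probe (simplified sketch)."""
    return returncode == 0 and "connection timed out" not in output

# The broker.log excerpt above would be classified as a failure
# (dig documents exit code 9 for "no reply from server"):
failed_run = (
    "; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5\n"
    ";; global options: +cmd\n"
    ";; connection timed out; no servers could be reached\n"
)
print(dig_succeeded(failed_run, 9))  # prints False
```

Repeated failures of this probe are what accumulate into the "Penalizing
score by 1600 due to network status" message in agent.log.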
>>>>>>>>> I didn't open a bug for this (yet?), also because I never reproduced
>>>>>>>>> it on my own machines and am not sure about the exact failing flow.
>>>>>>>>> If this is reliably reproducible for you, you might want to test the
>>>>>>>>> patch I pushed:
>>>>>>>>>
>>>>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
Now filed this bug and linked to it in the above patch. Thanks for your report!
https://bugzilla.redhat.com/show_bug.cgi?id=1984356

Perfect, I added myself to CC as well.
I have implemented the change on one of my nodes, restarted
ovirt-ha-broker and moved the engine to that node.
Since then the issue has not occurred. I guess I will leave it running
until the end of the week and then move the engine back to an unchanged
node to see whether the issue comes back.
Best regards,
>>>>>>>> I'm happy to give it a try.
>>>>>>>> Please confirm that I need to replace this file (network.py) on all my
>>>>>>>> nodes (CentOS 8.4 based) which can host my engine.
>>>>>>> It definitely makes sense to do so, but in principle there is no
>>>>>>> problem with applying it only on some of them. That's especially
>>>>>>> useful if you try this first on a test env and try to enforce a
>>>>>>> reproduction somehow (overload the network, disconnect stuff, etc.).
>>>>>> OK will give it a try and report back.
>>>>> Thanks and good luck.
>> Do I need to restart anything after that change?
> Yes, the broker. This might restart some other services there, so best put the
> host to maintenance during this.
>
>> Also please confirm that the comma after TCP is correct, as there
>> wasn't one after the timeout parameter in row 110 before.
> It is correct, but not mandatory. We (my team, at least) often add it
> in such cases so that a theoretical future patch that adds another
> parameter does not require adding it again (thus making that patch
> smaller and hopefully cleaner).
>
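To illustrate the trailing-comma point with a toy example (the function here
is a made-up stand-in, not the real call in network.py):

```python
def connect_check(host, timeout=5, tcp=False):
    # Hypothetical stand-in for the patched call in network.py,
    # purely to show the diff mechanics of a trailing comma.
    return (host, timeout, tcp)

# Python allows a trailing comma after the last argument, so a future
# patch that appends another keyword argument below is a one-line diff
# instead of also touching the previous last line to add its comma:
result = connect_check(
    "engine.example.com",
    timeout=5,
    tcp=True,
)
```

Without the trailing comma, adding `tcp=True` would have changed the
`timeout=5` line too, making the patch noisier.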
>>>>>>>>> Other ideas/opinions about how to enhance this part of the
>>>>>>>>> monitoring are most welcome.
>>>>>>>>>
>>>>>>>>> If this phenomenon is new for you, and you can reliably say it's not
>>>>>>>>> due to a recent "natural" higher network load, I wonder if it's due
>>>>>>>>> to some weird bug/change somewhere.
>>>>>>>> I'm quite sure that I have seen this since we moved to 4.4.(4).
>>>>>>>> Just for housekeeping: I'm running 4.4.7 now.
>>>>>>> We use 'dig' as the network monitor since 4.3.5, around one year
>>>>>>> before 4.4 was released: https://bugzilla.redhat.com/1659052
>>>>>>>
>>>>>>> Which version did you use before 4.4?
>>>>>> The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10 before we
>>>>>> migrated to 4.4.4.
>>>>> I now realize that in the above-linked bug we only changed the default,
>>>>> for new setups. So if you deployed HE before 4.3.5, an upgrade to a
>>>>> later 4.3 would not change the default (as opposed to the upgrade to
>>>>> 4.4, which was actually a new deployment with engine backup/restore).
>>>>> Do you know which version your cluster was originally deployed with?
>>>> Hm, I'm sorry but I don't recall this. I'm quite sure that we started
>>> OK, thanks for trying.
>>>
>>>> with 4.0 something. But we moved to a HE setup around September 2019.
>>>> I don't recall the version. But we also installed the backup from
>>>> the old installation into the HE environment, if I'm not wrong.
>>> If indeed this change was the trigger for you, you can rather easily try
>>> to change this to 'ping' and see if this helps - I think it's enough to
>>> change 'network_test' to 'ping' in
>>> /etc/ovirt-hosted-engine/hosted-engine.conf and restart the broker -
>>> didn't try, though. But generally speaking, I do not think we want to
>>> change the default back to 'ping', but rather make 'dns' work
>>> better/well. We had valid reasons to move away from ping...
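For anyone following along, hosted-engine.conf is a simple key=value file, so
the switch Didi describes is a one-line edit followed by a broker restart. A
small untested helper sketch for flipping such a key (the path and the
'network_test'/'ping' values come from the mail above; the helper itself is
mine, and you should back up the file and put the host into maintenance
before touching it, per the earlier advice):

```python
def set_conf_option(conf_text: str, key: str, value: str) -> str:
    """Replace (or append) key=value in a simple key=value config text.
    Sketch only -- back up hosted-engine.conf before editing, and
    restart ovirt-ha-broker afterwards for the change to take effect."""
    out, seen = [], False
    for line in conf_text.splitlines():
        if line.split("=", 1)[0].strip() == key:
            out.append(f"{key}={value}")  # overwrite the existing setting
            seen = True
        else:
            out.append(line)
    if not seen:
        out.append(f"{key}={value}")      # key was absent: append it
    return "\n".join(out) + "\n"

# e.g., applied to the text of /etc/ovirt-hosted-engine/hosted-engine.conf:
#   set_conf_option(conf_text, "network_test", "ping")
```

This only illustrates the edit itself; reverting is the same call with "dns".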
>> OK I will try this if the tcp change does not help me.
> Ok.
>
> In parallel, especially if this is reproducible, you might want to do
> some general
> monitoring of your network - packet losses, etc. - and correlate this with the
> failures you see.
>
> Best regards,
> --
> Didi