On 21.07.21 at 12:33, Christoph Timm wrote:
On 21.07.21 at 12:17, Yedidyah Bar David wrote:
> On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David <didi(a)redhat.com>
> wrote:
>> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>
>>>
>>> On 19.07.21 at 10:52, Yedidyah Bar David wrote:
>>>> On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <ovirt(a)timmi.org>
>>>> wrote:
>>>>> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
>>>>>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm
>>>>>> <ovirt(a)timmi.org> wrote:
>>>>>>> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
>>>>>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm
>>>>>>>> <ovirt(a)timmi.org> wrote:
>>>>>>>>> Hi Didi,
>>>>>>>>>
>>>>>>>>> thank you for the quick response.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
>>>>>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm
>>>>>>>>>> <ovirt(a)timmi.org> wrote:
>>>>>>>>>>> Hi List,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to understand why my hosted engine is moved from
>>>>>>>>>>> one node to another from time to time.
>>>>>>>>>>> It is happening sometimes multiple times a day. But there
>>>>>>>>>>> are also days without it.
>>>>>>>>>>>
>>>>>>>>>>> I can see the following in the ovirt-hosted-engine-ha/agent.log:
>>>>>>>>>>>
>>>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
>>>>>>>>>>> Penalizing score by 1600 due to network status
>>>>>>>>>>>
>>>>>>>>>>> After that the engine will be shut down and started on
>>>>>>>>>>> another host.
>>>>>>>>>>> The oVirt Admin Portal is showing the following around the
>>>>>>>>>>> same time:
>>>>>>>>>>> Invalid status on Data Center Default. Setting status to
>>>>>>>>>>> Non Responsive.
>>>>>>>>>>>
>>>>>>>>>>> But the whole cluster is working normally during that time.
>>>>>>>>>>>
>>>>>>>>>>> I believe that I somehow have a network issue on my side,
>>>>>>>>>>> but I have no clue what kind of check is causing the
>>>>>>>>>>> network status to be penalized.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have an idea how to investigate this further?
>>>>>>>>>> Please also check broker.log. Do you see 'dig' failures?
>>>>>>>>> Yes I found them as well.
>>>>>>>>>
>>>>>>>>> Thread-1::WARNING::2021-07-19
>>>>>>>>> 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
>>>>>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>>>>>>>>> ;; global options: +cmd
>>>>>>>>> ;; connection timed out; no servers could be reached
>>>>>>>>>
>>>>>>>>>> This happened several times already on our CI
>>>>>>>>>> infrastructure, but yours is the first report from an actual
>>>>>>>>>> real user. See also:
>>>>>>>>>>
>>>>>>>>>> https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWA...
>>>>>>>>>>
>>>>>>>>> So I understand that the following command is triggered to
>>>>>>>>> test the network: "dig +tries=1 +time=5"
>>>>>>>> Indeed.
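>>>>>>>>
>>>>>>>> You can run the same check by hand on a host to see how your
>>>>>>>> resolvers behave. Without a name argument, dig queries the
>>>>>>>> nameservers from /etc/resolv.conf for the root NS records;
>>>>>>>> roughly speaking, a failed/timed-out query is what gets the
>>>>>>>> score penalized:
>>>>>>>>
>>>>>>>>    dig +tries=1 +time=5 >/dev/null 2>&1; echo "exit=$?"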
>>>>>>>>
>>>>>>>>>> I didn't open a bug for this (yet?), also because I never
>>>>>>>>>> reproduced it on my own machines and am not sure about the
>>>>>>>>>> exact failing flow. If this is reliably reproducible for you,
>>>>>>>>>> you might want to test the patch I pushed:
>>>>>>>>>>
>>>>>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
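>>>>>>>>>>
>>>>>>>>>> Conceptually, the idea is not to rely only on a single UDP
>>>>>>>>>> attempt, but to also try the query over TCP - roughly like
>>>>>>>>>> the following sketch (the real change is in network.py, of
>>>>>>>>>> course, not a shell script):
>>>>>>>>>>
>>>>>>>>>>    dig +tries=1 +time=5 >/dev/null 2>&1 \
>>>>>>>>>>        || dig +tcp +tries=1 +time=5 >/dev/null 2>&1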
> I have now filed this bug and linked to it in the above patch. Thanks
> for your report!
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1984356
Perfect, I added myself to CC as well.
I have implemented the change on one of my nodes, restarted the
ovirt-ha-broker and moved the engine to that node.
Since then the issue has not occurred. I guess I will leave it running
until the end of the week and then move the engine back to an unchanged
node to see whether the issue comes back.
So I had no issue with the changed host until now. I moved the engine
to a different host in the morning and now the issue is back. So I will
implement the fix on all my hosts now.
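For anyone finding this thread later, what I'm doing per host is
roughly the following (a sketch; the exact location of network.py
depends on the installed Python version, so I look it up first):

    # find the installed copy of the broker's network check
    find /usr/lib/python3*/site-packages/ovirt_hosted_engine_ha -name 'network.py'
    # put this host's HA agent into local maintenance first
    hosted-engine --set-maintenance --mode=local
    # ... replace the file found above with the patched version ...
    systemctl restart ovirt-ha-broker
    hosted-engine --set-maintenance --mode=none

Afterwards I watch broker.log for the "DNS query failed" warnings and
"hosted-engine --vm-status" for the score.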
I hope this fix will be permanently included in the next release.
>
> Best regards,
>
>>>>>>>>> I'm happy to give it a try.
>>>>>>>>> Please confirm that I need to replace this file (network.py)
>>>>>>>>> on all my nodes (CentOS 8.4 based) which can host my engine.
>>>>>>>> It definitely makes sense to do so, but in principle there is
>>>>>>>> no problem with applying it only on some of them. That's
>>>>>>>> especially useful if you try this first on a test env and try
>>>>>>>> to force a reproduction somehow (overload the network,
>>>>>>>> disconnect stuff, etc.).
>>>>>>> OK, I will give it a try and report back.
>>>>>> Thanks and good luck.
>>> Do I need to restart anything after that change?
>> Yes, the broker. This might restart some other services there, so
>> best put the host into maintenance during this.
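>>
>> For example, something like this (one way of doing it, using HA
>> local maintenance; putting the host to maintenance via the
>> Administration Portal works as well):
>>
>>    hosted-engine --set-maintenance --mode=local
>>    systemctl restart ovirt-ha-broker
>>    hosted-engine --set-maintenance --mode=none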
>>
>>> Also, please confirm that the comma after TCP is correct, as there
>>> wasn't one after the timeout in row 110 before.
>> It is correct, but not mandatory. We (my team, at least) often add it
>> in such cases so that a theoretical future patch that adds another
>> parameter does not need to add it again (making that patch smaller
>> and hopefully cleaner).
>>
>>>>>>>>>> Other ideas/opinions about how to enhance this part of the
>>>>>>>>>> monitoring are most welcome.
>>>>>>>>>>
>>>>>>>>>> If this phenomenon is new for you, and you can reliably say
>>>>>>>>>> it's not due to a recent "natural" increase in network load,
>>>>>>>>>> I wonder if it's due to some weird bug/change somewhere.
>>>>>>>>> I'm quite sure that I have been seeing this since we moved to
>>>>>>>>> 4.4.(4).
>>>>>>>>> Just for housekeeping: I'm running 4.4.7 now.
>>>>>>>> We have been using 'dig' as the network monitor since 4.3.5,
>>>>>>>> around one year before 4.4 was released:
>>>>>>>> https://bugzilla.redhat.com/1659052
>>>>>>>>
>>>>>>>> Which version did you use before 4.4?
>>>>>>> The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10, before
>>>>>>> migrating to 4.4.4.
>>>>>> I now realize that in the above-linked bug we only changed the
>>>>>> default for new setups. So if you deployed HE before 4.3.5, an
>>>>>> upgrade to a later 4.3 would not change the default (as opposed
>>>>>> to the upgrade to 4.4, which was actually a new deployment with
>>>>>> engine backup/restore). Do you know which version your cluster
>>>>>> was originally deployed with?
>>>>> Hm, I'm sorry but I don't recall this. I'm quite sure that we
>>>>> started
>>>> OK, thanks for trying.
>>>>
>>>>> with 4.0 something. But we moved to an HE setup around September
>>>>> 2019. I don't recall the version. But we also restored the backup
>>>>> from the old installation into the HE environment, if I'm not
>>>>> wrong.
>>>> If indeed this change was the trigger for you, you can rather
>>>> easily try to change this to 'ping' and see if it helps - I think
>>>> it's enough to change 'network_test' to 'ping' in
>>>> /etc/ovirt-hosted-engine/hosted-engine.conf and restart the broker
>>>> - I didn't try it, though. But generally speaking, I do not think
>>>> we want to change the default back to 'ping', but rather make
>>>> 'dns' work better/well. We had valid reasons to move away from
>>>> ping...
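>>>>
>>>> That is, on each HA host, something like (untested, as said):
>>>>
>>>>    # /etc/ovirt-hosted-engine/hosted-engine.conf
>>>>    network_test=ping
>>>>
>>>> followed by a restart of the broker:
>>>>
>>>>    systemctl restart ovirt-ha-broker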
>>> OK, I will try this if the TCP change does not help.
>> Ok.
>>
>> In parallel, especially if this is reproducible, you might want to
>> do some general monitoring of your network - packet loss, etc. - and
>> correlate it with the failures you see.
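>>
>> For example, a crude loop on each host, logging timeouts of the same
>> dig invocation the broker uses, with timestamps you can correlate
>> with agent.log:
>>
>>    while true; do
>>        dig +tries=1 +time=5 >/dev/null 2>&1 \
>>            || echo "$(date -Is) dig timed out"
>>        sleep 10
>>    done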
>>
>> Best regards,
>> --
>> Didi
>
>