On Fri, Jul 23, 2021 at 6:17 PM Christoph Timm <ovirt(a)timmi.org> wrote:
On 21.07.21 at 12:33, Christoph Timm wrote:
>
> On 21.07.21 at 12:17, Yedidyah Bar David wrote:
>> On Mon, Jul 19, 2021 at 2:20 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>> On Mon, Jul 19, 2021 at 1:54 PM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>
>>>>
>>>>> On 19.07.21 at 10:52, Yedidyah Bar David wrote:
>>>>>> On Mon, Jul 19, 2021 at 11:39 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>>> On 19.07.21 at 10:25, Yedidyah Bar David wrote:
>>>>>>>> On Mon, Jul 19, 2021 at 11:02 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>>>> On 19.07.21 at 09:27, Yedidyah Bar David wrote:
>>>>>>>>> On Mon, Jul 19, 2021 at 10:04 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>>>>>> Hi Didi,
>>>>>>>>>>
>>>>>>>>>> thank you for the quick response.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 19.07.21 at 07:59, Yedidyah Bar David wrote:
>>>>>>>>>>> On Mon, Jul 19, 2021 at 8:39 AM Christoph Timm <ovirt(a)timmi.org> wrote:
>>>>>>>>>>>> Hi List,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to understand why my hosted engine is moved from one node to another from time to time.
>>>>>>>>>>>> It sometimes happens multiple times a day, but there are also days without it.
>>>>>>>>>>>>
>>>>>>>>>>>> I can see the following in ovirt-hosted-engine-ha/agent.log:
>>>>>>>>>>>>
>>>>>>>>>>>> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to network status
>>>>>>>>>>>>
>>>>>>>>>>>> After that the engine is shut down and started on another host.
>>>>>>>>>>>> The oVirt Admin Portal shows the following around the same time:
>>>>>>>>>>>> Invalid status on Data Center Default. Setting status to Non Responsive.
>>>>>>>>>>>>
>>>>>>>>>>>> But the whole cluster is working normally during that time.
>>>>>>>>>>>>
>>>>>>>>>>>> I believe I somehow have a network issue on my side, but I have no clue what kind of check is causing the network status to be penalized.
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have an idea how to investigate this further?
>>>>>>>>>>> Please check also broker.log. Do you see 'dig' failures?
>>>>>>>>>> Yes, I found them as well.
>>>>>>>>>>
>>>>>>>>>> Thread-1::WARNING::2021-07-19 08:02:00,032::network::120::network.Network::(_dns) DNS query failed:
>>>>>>>>>> ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>>>>>>>>>> ;; global options: +cmd
>>>>>>>>>> ;; connection timed out; no servers could be reached
>>>>>>>>>>
>>>>>>>>>>> This happened several times already on our CI infrastructure, but yours is the first report from an actual real user. See also:
>>>>>>>>>>>
>>>>>>>>>>> https://lists.ovirt.org/archives/list/infra@ovirt.org/thread/LIGS5WXGEKWA...
>>>>>>>>>>>
>>>>>>>>>> So I understand that the following command is triggered to test the network: "dig +tries=1 +time=5"
>>>>>>>>> Indeed.
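
For reference, below is a minimal Python sketch of such a dig-based probe, so you can reproduce the check by hand on a host. This is not the actual broker code from network.py; the dig options mirror the command quoted above, but the exit-code handling is an assumption.

#!/usr/bin/env python3
# Minimal sketch of a dig-based DNS liveness probe, similar in spirit to the
# broker's network check. NOT the actual ovirt-hosted-engine-ha code.
import subprocess

def dns_check(tries=1, timeout=5):
    """Return True if 'dig' gets an answer from at least one configured DNS server."""
    cmd = ["dig", "+tries=%d" % tries, "+time=%d" % timeout]
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    # dig exits non-zero and prints "connection timed out; no servers could be
    # reached" when none of the resolvers in /etc/resolv.conf answer.
    return result.returncode == 0 and "connection timed out" not in result.stdout

if __name__ == "__main__":
    print("DNS check passed" if dns_check() else "DNS check failed")
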
>>>>>>>>>
>>>>>>>>>>> I didn't open a bug for this (yet?), partly because I never reproduced it on my own machines and am not sure about the exact failing flow. If this is reliably reproducible for you, you might want to test the patch I pushed:
>>>>>>>>>>>
>>>>>>>>>>> https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
>> I have now filed this bug and linked to it in the above patch. Thanks for your report!
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1984356
> Perfect, I added myself to CC as well.
>
> I have implemented the change on one of my nodes, restarted the
> ovirt-ha-broker, and moved the engine to that node.
> Since then the issue has not occurred. I guess I will leave it running
> until the end of the week and then move the engine back to an unchanged
> node to see whether the issue comes back.
I had no issue with the changed host until now. This morning I moved the engine
to a different host, and now the issue is back. So I will implement the fix on
all my hosts now.
I hope this fix will be permanently included in the next release.
Yes, the bug is targeted at 4.4.8 and the patch is merged.
Best regards,
>>
>> Best regards,
>>
>>>>>>>>>> I'm happy to give it a try.
>>>>>>>>>> Please confirm that I need to replace this file (network.py) on all my nodes (CentOS 8.4 based) which can host my engine.
>>>>>>>>> It definitely makes sense to do so, but in principle there is no problem with applying it only on some of them. That's especially useful if you try this first on a test env and try to force a reproduction somehow (overload the network, disconnect stuff, etc.).
>>>>>>>> OK, will give it a try and report back.
>>>>>>> Thanks and good luck.
>>>> Do I need to restart anything after that change?
>>> Yes, the broker. This might restart some other services there, so best put the host into maintenance during this.
>>>
>>>> Also, please confirm that the comma after TCP is correct, as there wasn't one after the timeout in row 110 before.
>>> It is correct, but not mandatory. We (my team, at least) often add one in such cases so that a future patch adding another parameter does not have to add it again (thus keeping that patch smaller and hopefully cleaner).
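
As an illustration of that convention, here is a hypothetical Python sketch; it is not the actual network.py change, and run_dig_query is just a placeholder name.

# Hypothetical sketch (not the actual network.py code) of why a trailing comma
# is handy: a later keyword argument becomes a one-line addition in the diff,
# without touching the previous line.

def run_dig_query(tries, timeout, tcp=False):
    """Placeholder that only echoes the options it would pass to dig."""
    opts = ["+tries=%d" % tries, "+time=%d" % timeout]
    if tcp:
        opts.append("+tcp")
    return opts

options = run_dig_query(
    tries=1,
    timeout=5,
    tcp=True,  # trailing comma: the next added parameter won't modify this line
)
print(options)
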
>>>
>>>>>>>>>>> Other ideas/opinions about how to enhance this part of the monitoring are most welcome.
>>>>>>>>>>>
>>>>>>>>>>> If this phenomenon is new for you, and you can reliably say it's not due to a recent "natural" higher network load, I wonder if it's due to some weird bug/change somewhere.
>>>>>>>>>> I'm quite sure that I have been seeing this since we moved to 4.4(.4).
>>>>>>>>>> Just for housekeeping: I'm running 4.4.7 now.
>>>>>>>>> We have used 'dig' as the network monitor since 4.3.5, around one year before 4.4 was released: https://bugzilla.redhat.com/1659052
>>>>>>>>>
>>>>>>>>> Which version did you use before 4.4?
>>>>>>>> The last 4.3 versions were 4.3.7, 4.3.9 and 4.3.10, before migrating to 4.4.4.
>>>>>>> I now realize that in the above-linked bug we only changed the default for new setups. So if you deployed HE before 4.3.5, upgrading to a later 4.3 would not change the default (as opposed to upgrading to 4.4, which was actually a new deployment with engine backup/restore). Do you know which version your cluster was originally deployed with?
>>>>>> Hm, I'm sorry, but I don't recall. I'm quite sure that we started
>>>>> OK, thanks for trying.
>>>>>
>>>>>> with 4.0 something. We moved to a HE setup around September 2019, but I don't recall the version. We also restored the backup from the old installation into the HE environment, if I'm not mistaken.
>>>>> If indeed this change was the trigger for you, you can rather easily try changing this to 'ping' and see if it helps - I think it's enough to change 'network_test' to 'ping' in /etc/ovirt-hosted-engine/hosted-engine.conf and restart the broker - I didn't try it, though.
>>>>> But generally speaking, I do not think we want to change the default back to 'ping', but rather make 'dns' work better/well. We had valid reasons to move away from ping...
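
If you want to script that switch across several hosts, here is a hypothetical Python helper (not an oVirt tool). The file path and the 'network_test' key come from the paragraph above; everything else is an assumption, and you still need to restart ovirt-ha-broker yourself afterwards.

#!/usr/bin/env python3
# Hypothetical helper (not shipped with oVirt): set the 'network_test' key in
# hosted-engine.conf. Run as root on a host, then restart ovirt-ha-broker.
import re
import sys

CONF = "/etc/ovirt-hosted-engine/hosted-engine.conf"

def set_network_test(value, conf_path=CONF):
    with open(conf_path) as f:
        text = f.read()
    if re.search(r"^network_test=", text, flags=re.M):
        # Replace the existing key in place.
        text = re.sub(r"^network_test=.*$", "network_test=%s" % value, text, flags=re.M)
    else:
        # Append the key if it is not present yet.
        text = text.rstrip("\n") + "\nnetwork_test=%s\n" % value
    with open(conf_path, "w") as f:
        f.write(text)

if __name__ == "__main__":
    set_network_test(sys.argv[1] if len(sys.argv) > 1 else "ping")
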
>>>> OK, I will try this if the TCP change does not help me.
>>> Ok.
>>>
>>> In parallel, especially if this is reproducible, you might want to do some general monitoring of your network - packet losses, etc. - and correlate this with the failures you see.
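
If it helps with that correlation, here is a hypothetical Python helper (not part of oVirt) that counts the "DNS query failed" warnings in broker.log per hour. The message text comes from the excerpt earlier in this thread, and the default log path is an assumption that may differ on your systems.

#!/usr/bin/env python3
# Hypothetical helper: count "DNS query failed" warnings in broker.log per hour,
# to correlate with external network monitoring (packet loss graphs, etc.).
import collections
import re
import sys

LOG = "/var/log/ovirt-hosted-engine-ha/broker.log"  # assumed default path

def failures_per_hour(path=LOG):
    counts = collections.Counter()
    # Matches timestamps like "2021-07-19 08:02:00,032" and keeps the hour part.
    stamp = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}")
    with open(path) as f:
        for line in f:
            if "DNS query failed" in line:
                m = stamp.search(line)
                if m:
                    counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else LOG
    for hour, n in sorted(failures_per_hour(path).items()):
        print(hour, n)
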
>>>
>>> Best regards,
>>> --
>>> Didi
>>
>>
--
Didi