Adding Ales as well.
AFAIK vdsm does not actively poll the engine for liveness, nor does it
retry. But retries might happen at a deeper infra level, where Marcin is
the person to ask, IIUC.
Right, no retries in vdsm: we send replies or events, and we don't have
any way to tell whether the engine got the message.
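
For illustration only, the sending side is conceptually fire-and-forget,
roughly like this simplified sketch (not the actual vdsm code or
transport):

    import json

    def send_event(sock, method, params):
        # A JSON-RPC notification carries no "id", so no response is
        # expected and the sender cannot tell whether the peer ever
        # received it. `sock` is assumed to be a connected socket.
        msg = json.dumps({"jsonrpc": "2.0",
                          "method": method,
                          "params": params})
        sock.sendall(msg.encode("utf-8"))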
On Wed, Jul 7, 2021 at 4:40 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Wed, Jun 23, 2021 at 12:30 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >
> > On Wed, Jun 23, 2021 at 12:02 PM Sandro Bonazzola <sbonazzo(a)redhat.com> wrote:
> >>
> >>
> >>
> >> On Wed, Jun 23, 2021 at 07:48 Yedidyah Bar David <didi(a)redhat.com> wrote:
> >>>
> >>> On Wed, Jun 9, 2021 at 12:13 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >>> >
> >>> > On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >>> > >
> >>> > > On Tue, Jun 8, 2021 at 6:08 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
> >>> > > >
> >>> > > > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> >>> > > > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
> >>> > > > Build Number: 2046
> >>> > > > Build Status: Failure
> >>> > > > Triggered By: Started by timer
> >>> > > >
> >>> > > > -------------------------------------
> >>> > > > Changes Since Last Success:
> >>> > > > -------------------------------------
> >>> > > > Changes for Build #2046
> >>> > > > [Eitan Raviv] network: force select spm - wait for dc status
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > -----------------
> >>> > > > Failed Tests:
> >>> > > > -----------------
> >>> > > > 1 tests failed.
> >>> > > > FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
> >>> > > >
> >>> > > > Error Message:
> >>> > > > ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
> >>> > >
> >>> > > - The engine VM went down:
> >>> > >
> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >>> > >
> >>> > > MainThread::INFO::2021-06-08 05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
> >>> > > MainThread::INFO::2021-06-08 05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 960 due to network status
> >>> > >
> >>> > > - Because HA monitoring failed to get a reply from the dns server:
> >>> > >
> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >>> > >
> >>> > > Thread-1::WARNING::2021-06-08 05:07:25,486::network::120::network.Network::(_dns) DNS query failed:
> >>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >>> > > ;; global options: +cmd
> >>> > > ;; connection timed out; no servers could be reached
> >>> > >
> >>> > > Thread-3::INFO::2021-06-08 05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
> >>> > > Thread-5::INFO::2021-06-08 05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
> >>> > > Thread-2::INFO::2021-06-08 05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
> >>> > > Thread-1::WARNING::2021-06-08 05:07:33,011::network::120::network.Network::(_dns) DNS query failed:
> >>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >>> > > ;; global options: +cmd
> >>> > > ;; connection timed out; no servers could be reached
> >>> > >
> >>> > > Thread-4::INFO::2021-06-08 05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.3196, engine=0.1724, non-engine=0.1472
> >>> > > Thread-3::INFO::2021-06-08 05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
> >>> > > Thread-5::INFO::2021-06-08 05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
> >>> > > Thread-1::WARNING::2021-06-08 05:07:40,535::network::120::network.Network::(_dns) DNS query failed:
> >>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> >>> > > ;; global options: +cmd
> >>> > > ;; connection timed out; no servers could be reached
> >>> > >
> >>> > > Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
> >>> > >
> >>> > > - Not sure why. DNS servers:
> >>> > >
> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >>> > >
> >>> > > # Generated by NetworkManager
> >>> > > search lago.local
> >>> > > nameserver 192.168.200.1
> >>> > > nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
> >>> > > nameserver fd8f:1391:3a82:200::1
> >>>
> >>> Now it happened again:
> >>>
> >>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/
> >>>
> >>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/174...
> >>>
> >>> Thread-1::INFO::2021-06-22 18:57:29,134::network::88::network.Network::(action) Successfully verified network status
> >>> ...
> >>>
> >>> Thread-1::WARNING::2021-06-22 18:58:13,390::network::92::network.Network::(action) Failed to verify network status, (0 out of 5)
> >>> Thread-1::INFO::2021-06-22 18:58:15,761::network::88::network.Network::(action) Successfully verified network status
> >>> ...
> >>>
> >>> > >
> >>> > > - The command we run is 'dig +tries=1 +time=5', which defaults to
> >>> > > querying for '.' (the dns root). This is normally cached locally, but
> >>> > > has a TTL of 86400, meaning it can be cached for up to one day. So if
> >>> > > we ran this query right after it expired, _and_ the local dns server
> >>> > > then had some issues forwarding our request (due to external issues,
> >>> > > perhaps), it would fail like this. I am going to ignore this failure
> >>> > > for now, assuming it was temporary, but it might be worth opening an
> >>> > > RFE on ovirt-hosted-engine-ha asking for some more flexibility -
> >>> > > setting the query string or something similar. I think that this bug
> >>> > > is probably quite hard to reproduce, because normally all hosts will
> >>> > > use the same dns server, and problems with it will affect all of them
> >>> > > similarly.
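> >>> > >
> >>> > > For illustration, a more flexible check could look roughly like
> >>> > > this sketch (hypothetical names, not the actual
> >>> > > ovirt-hosted-engine-ha code):
> >>> > >
> >>> > >     import subprocess
> >>> > >
> >>> > >     def dns_check(query=".", tries=1, timeout=5):
> >>> > >         # dig exits with 0 when it gets an answer, and with 9
> >>> > >         # when no server could be reached.
> >>> > >         cmd = ["dig", query,
> >>> > >                "+tries=%d" % tries, "+time=%d" % timeout]
> >>> > >         return subprocess.call(
> >>> > >             cmd, stdout=subprocess.DEVNULL) == 0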
> >>> > >
> >>> > > - Anyway, it seems like there were temporary connectivity issues on
> >>> > > the network there. A minute later:
> >>> > >
> >>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >>> > >
> >>> > > Thread-1::INFO::2021-06-08 05:08:08,143::network::88::network.Network::(action) Successfully verified network status
> >>> > >
> >>> > > But that was too late and the engine VM was already on its way down.
> >>> > >
> >>> > > A remaining open question is whether we should retry before giving
> >>> > > up, and where - in the SDK, in OST code, etc. - or whether this
> >>> > > should be considered normal.
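> >>> > >
> >>> > > If we did it at the OST level, a retry wrapper could be as simple
> >>> > > as this sketch (hypothetical helper, not existing OST code):
> >>> > >
> >>> > >     import time
> >>> > >
> >>> > >     def with_retries(func, attempts=3, delay=10,
> >>> > >                      errors=(Exception,)):
> >>> > >         # Retry func() on the given errors, re-raising the last
> >>> > >         # failure; delay is in seconds.
> >>> > >         for i in range(attempts):
> >>> > >             try:
> >>> > >                 return func()
> >>> > >             except errors:
> >>> > >                 if i == attempts - 1:
> >>> > >                     raise
> >>> > >                 time.sleep(delay)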
> >>>
> >>> What do you think?
> >>
> >>
> >> The question is: is a retry also in place on the vdsm side? Because if it
> >> fails on vdsm, it's better to fail here as well. If there's a retry
> >> process in vdsm for all network calls, I think we can relax the check
> >> here and retry before giving up.
> >
> >
> > No idea, adding Eitan.
>
> I talked with Eitan about this in private, and he'll check. Thanks.
>
> This has happened more often recently.
>
> I pushed a patch [1] to test the network alongside OST, and one of the
> CI check-patch runs for it also failed due to this reason [2] (check
> broker.log on host-0). The log generated by this patch [3] ends with
> "Passed 1311 out of 1338", meaning it lost 27 replies in less than an
> hour, which IMO is quite a lot. The latest version of the patch tries
> dig with '+tcp' - if that's enough to make it pass with (close to)
> zero losses, perhaps we can do the same in HA.
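>
> Roughly, the check that patch adds boils down to a loop like this
> sketch (illustrative only - see [1] for the actual patch):
>
>     import subprocess
>     import time
>
>     passed = total = 0
>     deadline = time.time() + 3600  # keep checking for about an hour
>     while time.time() < deadline:
>         total += 1
>         rc = subprocess.call(
>             ["dig", "+tcp", "+tries=1", "+time=5"],
>             stdout=subprocess.DEVNULL)
>         if rc == 0:
>             passed += 1
>     print("Passed %d out of %d" % (passed, total))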
>
> Thanks and best regards,
>
> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/115586
> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/
> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/177...
> --
> Didi
>
_______________________________________________
Infra mailing list -- infra(a)ovirt.org
To unsubscribe send an email to infra-leave(a)ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/64EOPWAPNDK...