On Wed, Jun 23, 2021 at 07:48, Yedidyah Bar David <didi(a)redhat.com> wrote:
On Wed, Jun 9, 2021 at 12:13 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >
> > On Tue, Jun 8, 2021 at 6:08 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
> > >
> > > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> > > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
> > > Build Number: 2046
> > > Build Status: Failure
> > > Triggered By: Started by timer
> > >
> > > -------------------------------------
> > > Changes Since Last Success:
> > > -------------------------------------
> > > Changes for Build #2046
> > > [Eitan Raviv] network: force select spm - wait for dc status
> > >
> > >
> > >
> > >
> > > -----------------
> > > Failed Tests:
> > > -----------------
> > > 1 tests failed.
> > > FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
> > >
> > > Error Message:
> > > ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
> >
> > - The engine VM went down:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >
> > MainThread::INFO::2021-06-08 05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
> > MainThread::INFO::2021-06-08 05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 960 due to network status
> >
> > - Because HA monitoring failed to get a reply from the dns server:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >
> > Thread-1::WARNING::2021-06-08 05:07:25,486::network::120::network.Network::(_dns) DNS query failed:
> > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> > ;; global options: +cmd
> > ;; connection timed out; no servers could be reached
> >
> > Thread-3::INFO::2021-06-08 05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
> > Thread-5::INFO::2021-06-08 05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
> > Thread-2::INFO::2021-06-08 05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
> > Thread-1::WARNING::2021-06-08 05:07:33,011::network::120::network.Network::(_dns) DNS query failed:
> > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> > ;; global options: +cmd
> > ;; connection timed out; no servers could be reached
> >
> > Thread-4::INFO::2021-06-08 05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.3196, engine=0.1724, non-engine=0.1472
> > Thread-3::INFO::2021-06-08 05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
> > Thread-5::INFO::2021-06-08 05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
> > Thread-1::WARNING::2021-06-08 05:07:40,535::network::120::network.Network::(_dns) DNS query failed:
> > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
> > ;; global options: +cmd
> > ;; connection timed out; no servers could be reached
> >
> > Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
> >
> > - Not sure why. DNS servers:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >
> > # Generated by NetworkManager
> > search lago.local
> > nameserver 192.168.200.1
> > nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
> > nameserver fd8f:1391:3a82:200::1
This has now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/174...
Thread-1::INFO::2021-06-22 18:57:29,134::network::88::network.Network::(action) Successfully verified network status
...
Thread-1::WARNING::2021-06-22 18:58:13,390::network::92::network.Network::(action) Failed to verify network status, (0 out of 5)
Thread-1::INFO::2021-06-22 18:58:15,761::network::88::network.Network::(action) Successfully verified network status
...
> >
> > - The command we run is 'dig +tries=1 +time=5', which defaults to
> > querying for '.' (the DNS root). The answer is normally cached locally,
> > but it has a TTL of 86400 seconds, meaning it can be cached for up to
> > one day. So if we ran this query right after the cached answer expired,
> > _and_ the local DNS server then had trouble forwarding our request
> > (perhaps due to external issues), it would fail like this. I am going
> > to ignore this failure for now, assuming it was temporary, but it might
> > be worth opening an RFE on ovirt-hosted-engine-ha asking for some more
> > flexibility, such as setting the query string or something similar. I
> > think this bug is probably quite hard to reproduce, because normally
> > all hosts use the same DNS server, and problems with it affect all of
> > them similarly.
> >
> > - Anyway, it seems like there were temporary connectivity issues on
> > the network there. A minute later:
> >
> > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
> >
> > Thread-1::INFO::2021-06-08 05:08:08,143::network::88::network.Network::(action) Successfully verified network status
> >
> > But that was too late and the engine VM was already on its way down.
> >
> > A remaining open question is whether we should retry before giving up,
> > and where - in the SDK, in OST code, etc. - or whether this should be
> > considered normal.
What do you think?
The question is whether a retry is also in place on the vdsm side. If the
call fails on vdsm, it's better to fail here as well. If vdsm retries all of
its network calls, I think we can relax the check here and retry before
giving up.
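For illustration only, here is a minimal sketch of what "retry before giving up" could look like, combined with the configurable query name suggested above. This is not the actual ovirt-hosted-engine-ha, vdsm or OST code: the function names `dns_query_ok` and `check_with_retry`, the retry count and the delay are all assumptions. The only thing taken from the thread is the `dig +tries=1 +time=5` invocation and its default query for '.'.

```python
import subprocess
import time


def dns_query_ok(name=".", tries=1, timeout=5, runner=subprocess.run):
    """Return True if dig gets an answer for `name` (hypothetical helper).

    The default arguments mirror 'dig +tries=1 +time=5', which queries '.'.
    `runner` is injectable so the function can be exercised without dig.
    """
    cmd = ["dig", f"+tries={tries}", f"+time={timeout}", name]
    try:
        result = runner(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # dig is not installed or could not be started.
        return False
    # dig exits non-zero when no server could be reached.
    return result.returncode == 0


def check_with_retry(check, attempts=3, delay=2.0, sleep=time.sleep):
    """Call `check` up to `attempts` times, sleeping `delay` seconds between
    failed attempts. Returns True on the first success, False otherwise."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Assumed policy: three attempts, two seconds apart.
    print(check_with_retry(dns_query_ok, attempts=3, delay=2.0))
```

Whether such a retry belongs in the HA agent, the SDK or the OST code is exactly the open question; the sketch only shows that the policy can be kept separate from the check itself.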
--
Sandro Bonazzola
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
Red Hat EMEA
Red Hat respects your work life balance. Therefore there is no need to
answer this email out of your office hours.