
Adding Ales as well. AFAIK vdsm does not actively poll the engine for liveness, nor does it do any retries. But retries might exist at a deeper infra level, where Marcin is the person to ask, IIUC.

On Wed, Jul 7, 2021 at 4:40 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Jun 23, 2021 at 12:30 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Jun 23, 2021 at 12:02 PM Sandro Bonazzola <sbonazzo@redhat.com> wrote:
On Wed, Jun 23, 2021 at 07:48 Yedidyah Bar David <didi@redhat.com> wrote:
On Wed, Jun 9, 2021 at 12:13 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Jun 8, 2021 at 6:08 AM <jenkins@jenkins.phx.ovirt.org> wrote:
> Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
> Build Number: 2046
> Build Status: Failure
> Triggered By: Started by timer
>
> -------------------------------------
> Changes Since Last Success:
> -------------------------------------
> Changes for Build #2046
> [Eitan Raviv] network: force select spm - wait for dc status
>
> -----------------
> Failed Tests:
> -----------------
> 1 tests failed.
> FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
>
> Error Message:
> ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
- The engine VM went down:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...
MainThread::INFO::2021-06-08 05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2021-06-08 05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 960 due to network status
- Because HA monitoring failed to get a reply from the dns server:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...
Thread-1::WARNING::2021-06-08 05:07:25,486::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-3::INFO::2021-06-08 05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
Thread-5::INFO::2021-06-08 05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2021-06-08 05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2021-06-08 05:07:33,011::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-4::INFO::2021-06-08 05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.3196, engine=0.1724, non-engine=0.1472
Thread-3::INFO::2021-06-08 05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
Thread-5::INFO::2021-06-08 05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-1::WARNING::2021-06-08 05:07:40,535::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
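- As an aside, the numbers above fit a penalty proportional to the failed probes in the 5-sample window: with 2 of 5 checks passing, 3/5 of an assumed maximum network penalty of 1600 gives exactly the 960 in the agent log. This is only my reading of the logs, not a quote of the HA scoring code:

    # Hypothetical reconstruction of the 960-point penalty above;
    # MAX_NETWORK_PENALTY is an assumption, not a quoted constant.
    MAX_NETWORK_PENALTY = 1600
    successes, window = 2, 5
    penalty = MAX_NETWORK_PENALTY * (window - successes) // window
    assert penalty == 960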
- Not sure why. DNS servers:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...
# Generated by NetworkManager
search lago.local
nameserver 192.168.200.1
nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
nameserver fd8f:1391:3a82:200::1
Now happened again:
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/...
Thread-1::INFO::2021-06-22 18:57:29,134::network::88::network.Network::(action) Successfully verified network status
...
Thread-1::WARNING::2021-06-22 18:58:13,390::network::92::network.Network::(action) Failed to verify network status, (0 out of 5)
Thread-1::INFO::2021-06-22 18:58:15,761::network::88::network.Network::(action) Successfully verified network status
...
- The command we run is 'dig +tries=1 +time=5', which defaults to querying for '.' (the dns root). This is normally cached locally, but has a TTL of 86400, meaning it can be cached for up to one day. So if we ran this query right after it expired, _and_ then the local dns server had some issues forwarding our request (due to external issues, perhaps), then it would fail like this. I am going to ignore this failure for now, assuming it was temporary, but it might be worth opening an RFE on ovirt-hosted-engine-ha asking for some more flexibility - setting the query string or something similar. I think that this bug is probably quite hard to reproduce, because normally, all hosts will use the same dns server, and problems with it will affect all of them similarly.
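- If such an RFE materialized, the check could accept the query name (and transport) as parameters. A minimal sketch in Python, assuming a hypothetical 'query' option that does not exist in ovirt-hosted-engine-ha today:

    import subprocess

    def dns_check(query='.', timeout=5, tcp=False):
        # query='.' mirrors the current hard-coded behavior; a
        # host-specific name with a short TTL would avoid depending
        # on the long-lived cache entry for the root.
        cmd = ['dig', '+tries=1', '+time=%d' % timeout]
        if tcp:
            cmd.append('+tcp')
        cmd.append(query)
        # dig exits non-zero (printing 'connection timed out; no
        # servers could be reached') when nothing answers in time.
        return subprocess.run(cmd, capture_output=True).returncode == 0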
- Anyway, it seems like there were temporary connectivity issues on the network there. A minute later:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...
Thread-1::INFO::2021-06-08 05:08:08,143::network::88::network.Network::(action) Successfully verified network status
But that was too late, and the engine VM was already on its way down.
A remaining open question is whether we should retry before giving up, and where - in the SDK, in OST code, etc. - or whether this should be considered normal.
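If we went for a retry in OST code, a minimal sketch could look like the following. The helper name and retry parameters are mine, not existing OST code; it retries only on ovirtsdk4.Error (the connection-level failure seen above), since retrying on every exception would mask real test failures:

    import time

    import ovirtsdk4

    def with_retries(call, attempts=3, delay=10):
        # Retry a flaky SDK call a few times before giving up, so a
        # short engine outage does not fail the whole suite.
        for i in range(attempts):
            try:
                return call()
            except ovirtsdk4.Error:
                if i == attempts - 1:
                    raise
                time.sleep(delay)

    # Usage (hypothetical): with_retries(lambda: template_service.export(...))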
What do you think?
The question is: is a retry also in place on the vdsm side? Because if it fails in vdsm, it's better to fail here as well. If there's a retry process in vdsm for all network calls, I think we can relax the check here and retry before giving up.
No idea, adding Eitan.
I talked with Eitan about this in private, and he'll check. Thanks.
This has been happening more often recently.
I pushed a patch [1] to test the network alongside OST, and one of the CI check-patch runs for it also failed due to this reason [2] (check broker.log on host-0). The log generated by this patch [3] ends with "Passed 1311 out of 1338", meaning it lost 27 replies in less than an hour, which IMO is quite a lot. The latest version of the patch tries dig with '+tcp' - if that's enough to make it pass with (close to) zero losses, perhaps we can do the same in HA.
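For reference, the measurement loop that patch runs can be sketched roughly like this (my reconstruction for illustration, not the actual patch code - see [1] for that):

    import subprocess
    import time

    passed = total = 0
    while True:
        total += 1
        # '+tcp' forces the query over TCP; if UDP replies are being
        # dropped somewhere on the test network, this variant should
        # pass with (close to) zero losses.
        ok = subprocess.run(
            ['dig', '+tcp', '+tries=1', '+time=5'],
            capture_output=True,
        ).returncode == 0
        if ok:
            passed += 1
        print('Passed %d out of %d' % (passed, total))
        time.sleep(2)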
Thanks and best regards,
[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/115586
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/...
--
Didi