On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
On Tue, Jun 8, 2021 at 6:08 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
>
> Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
> Build Number: 2046
> Build Status: Failure
> Triggered By: Started by timer
>
> -------------------------------------
> Changes Since Last Success:
> -------------------------------------
> Changes for Build #2046
> [Eitan Raviv] network: force select spm - wait for dc status
>
>
>
>
> -----------------
> Failed Tests:
> -----------------
> 1 tests failed.
> FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
>
> Error Message:
> ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
- The engine VM went down:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
MainThread::INFO::2021-06-08
05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop)
Current state EngineUp (score: 3400)
MainThread::INFO::2021-06-08
05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 960 due to network status
- Because HA monitoring failed to get a reply from the DNS server:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
Thread-1::WARNING::2021-06-08
05:07:25,486::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-3::INFO::2021-06-08
05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
Thread-5::INFO::2021-06-08
05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats)
VM is up on this host with healthy engine
Thread-2::INFO::2021-06-08
05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found
bridge ovirtmgmt in up state
Thread-1::WARNING::2021-06-08
05:07:33,011::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-4::INFO::2021-06-08
05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load)
System load total=0.3196, engine=0.1724, non-engine=0.1472
Thread-3::INFO::2021-06-08
05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
Thread-5::INFO::2021-06-08
05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats)
VM is up on this host with healthy engine
Thread-1::WARNING::2021-06-08
05:07:40,535::network::120::network.Network::(_dns) DNS query failed:
; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
;; global options: +cmd
;; connection timed out; no servers could be reached
Thread-1::WARNING::2021-06-08
05:07:40,535::network::92::network.Network::(action) Failed to verify
network status, (2 out of 5)
- Not sure why. DNS servers:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
# Generated by NetworkManager
search lago.local
nameserver 192.168.200.1
nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
nameserver fd8f:1391:3a82:200::1
- The command we run is 'dig +tries=1 +time=5', which defaults to
querying for '.' (the DNS root). This is normally cached locally, but
it has a TTL of 86400 seconds, meaning it can be cached for up to one
day. So if we ran this query right after the cached entry expired,
_and_ the local DNS server then had trouble forwarding our request
(due to external issues, perhaps), it would fail like this. I am going
to ignore this failure for now, assuming it was temporary, but it
might be worth opening an RFE on ovirt-hosted-engine-ha asking for
some more flexibility - making the query name configurable or
something similar (see the sketch below). I think this bug is probably
quite hard to reproduce, because normally all hosts use the same DNS
server, and problems with it affect all of them similarly.
- Anyway, it seems like there were temporary connectivity issues on
the network there. A minute later:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
Thread-1::INFO::2021-06-08
05:08:08,143::network::88::network.Network::(action) Successfully
verified network status
But that was too late, and the engine VM was already on its way down.
A remaining open question is whether we should retry before giving up,
and where - in the SDK, in OST code, etc. (see the retry sketch below)
- or whether this should be considered normal.
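
Regarding that RFE idea, what I have in mind is roughly the sketch
below. This is not the actual ovirt-hosted-engine-ha code - the
dns_probe() helper, its query_name knob and the example FQDN are all
made up for illustration - but it runs the same kind of dig probe
while letting the query target be something other than the root zone:

#!/usr/bin/env python3
# Minimal sketch of a DNS probe with a configurable query name, instead of
# always asking dig for '.' (the root zone). Not the agent's real code; the
# query_name knob and the example FQDN below are assumptions.
import subprocess


def dns_probe(query_name='.', tries=1, timeout=5):
    """Return True if dig got an answer from one of the configured servers."""
    cmd = ['dig', '+tries={}'.format(tries), '+time={}'.format(timeout),
           query_name]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=tries * timeout + 5)
    except (OSError, subprocess.TimeoutExpired):
        return False
    # dig exits with a non-zero status when no servers could be reached.
    return result.returncode == 0


if __name__ == '__main__':
    print(dns_probe())                     # today's behaviour: query '.'
    print(dns_probe('engine.lago.local'))  # hypothetical: query the engine FQDN

Querying a name we actually depend on (the engine FQDN, say) would also
avoid relying on the cached root-zone entry and its one-day TTL.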
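
As for the retry question: if we decide to handle it on the OST side,
it could be as simple as the sketch below. Again, only ovirtsdk4.Error
is the real exception type from the traceback above; the helper name,
the number of attempts and the delay are made up for illustration:

# Sketch of an OST-side retry helper; names and numbers are illustrative.
import time

import ovirtsdk4


def retry_on_engine_error(func, attempts=3, delay=30):
    """Call func(), retrying if the engine is temporarily unreachable."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ovirtsdk4.Error:
            if attempt == attempts:
                raise  # give up and let the test fail, as it does today
            # The engine VM may just be restarting (e.g. after an HA score
            # drop), so wait a bit before the next attempt.
            time.sleep(delay)

# In a test this would be used roughly as:
#   retry_on_engine_error(lambda: template_service.export(storage_domain=sd))

The obvious downside is that retrying can mask real regressions, which
is exactly the "or whether this should be considered normal" part of
the question.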
This now happened again [1] (with [2], for testing [3], but I don't
think that's related).
Sparing you the DNS lines (you can search the log for details), but it
happened twice within a few minutes. The first one was non-fatal, as
it was "resolved" quickly:
Thread-1::WARNING::2021-06-09
10:46:28,504::network::92::network.Network::(action) Failed to verify
network status, (4 out of 5)
Thread-1::INFO::2021-06-09
10:46:31,737::network::88::network.Network::(action) Successfully
verified network status
The second one was "fatal" - it lowered the score enough for the agent
to stop the VM:
Thread-1::WARNING::2021-06-09
10:50:26,809::network::120::network.Network::(_dns) DNS query failed:
...
Thread-1::WARNING::2021-06-09
10:51:06,090::network::92::network.Network::(action) Failed to verify
network status, (4 out of 5)
Then, it did resolve, but this was too late:
Thread-1::INFO::2021-06-09
10:51:09,292::network::88::network.Network::(action) Successfully
verified network status
So the network wasn't completely dead (4 out of 5 attempts failed, and
it recovered in less than a minute), but it was bad enough.
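
By the way, to measure how long such a window lasted without reading
the whole log, something like the sketch below should work against
broker.log (the file name/path is whatever the Jenkins artifacts
contain; the line format is copied from the excerpts above):

# Sketch: pull the network-check failure/recovery lines out of broker.log
# to see how long an outage window lasted. The log path is an assumption;
# the line format matches the excerpts quoted above.
import re
import sys

PATTERN = re.compile(
    r'^Thread-\d+::(?:WARNING|INFO)::'
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::network::\d+::'
    r'network\.Network::\(action\) '
    r'(Failed to verify network status.*'
    r'|Successfully verified network status)'
)


def network_check_events(path):
    with open(path) as log:
        for line in log:
            match = PATTERN.match(line)
            if match:
                yield match.group(1), match.group(2)


if __name__ == '__main__':
    # e.g.: python3 network_events.py broker.log
    for timestamp, message in network_check_events(sys.argv[1]):
        print(timestamp, message)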
[1]