Re: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-master - Build # 2046 - Failure!

On Tue, Jun 8, 2021 at 6:08 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
Build Number: 2046
Build Status: Failure
Triggered By: Started by timer
-------------------------------------
Changes Since Last Success:
-------------------------------------
Changes for Build #2046
[Eitan Raviv] network: force select spm - wait for dc status
-----------------
Failed Tests:
-----------------
1 tests failed.
FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
Error Message: ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
- The engine VM went down:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...

MainThread::INFO::2021-06-08 05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
MainThread::INFO::2021-06-08 05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 960 due to network status

- Because HA monitoring failed to get a reply from the DNS server:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...

Thread-1::WARNING::2021-06-08 05:07:25,486::network::120::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5 ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-3::INFO::2021-06-08 05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
Thread-5::INFO::2021-06-08 05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2021-06-08 05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
Thread-1::WARNING::2021-06-08 05:07:33,011::network::120::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5 ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-4::INFO::2021-06-08 05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.3196, engine=0.1724, non-engine=0.1472
Thread-3::INFO::2021-06-08 05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
Thread-5::INFO::2021-06-08 05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-1::WARNING::2021-06-08 05:07:40,535::network::120::network.Network::(_dns) DNS query failed: ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5 ;; global options: +cmd ;; connection timed out; no servers could be reached
Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)

- Not sure why. DNS servers:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...

# Generated by NetworkManager
search lago.local
nameserver 192.168.200.1
nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
nameserver fd8f:1391:3a82:200::1

- The command we run is 'dig +tries=1 +time=5', which defaults to querying for '.' (the DNS root). This is normally cached locally, but it has a TTL of 86400 seconds, meaning it can be cached for up to one day. So if we ran this query right after it expired, _and_ the local DNS server then had some issues forwarding our request (due to external issues, perhaps), it would fail like this. I am going to ignore this failure for now, assuming it was temporary, but it might be worth opening an RFE on ovirt-hosted-engine-ha asking for some more flexibility - e.g. making the query string configurable. I think this bug is probably quite hard to reproduce, because normally all hosts use the same DNS server, and problems with it affect all of them similarly.
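For context on the failing check itself: it boils down to running dig once with a short timeout, treating a non-zero exit status as a miss, and scoring several such probes per monitoring cycle. A minimal, purely illustrative sketch of that mechanism - the function names and the 3-out-of-5 threshold below are assumptions, not the actual ovirt-hosted-engine-ha code:

import subprocess

def dns_probe(tries=1, timeout=5):
    """Run one 'dig' query (no name given, so it queries '.', the DNS root)
    and report whether any configured nameserver answered in time."""
    result = subprocess.run(
        ["dig", f"+tries={tries}", f"+time={timeout}"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # dig exits with a non-zero status when no servers could be reached.
    return result.returncode == 0

def network_ok(probes=5, required=3):
    """Run several probes and report the pass count; the 3-out-of-5
    threshold is made up for illustration, not the broker's real policy."""
    passed = sum(1 for _ in range(probes) if dns_probe())
    print(f"Verified network status: ({passed} out of {probes})")
    return passed >= required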
- Anyway, it seems like there were temporary connectivity issues on the network there. A minute later:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...

Thread-1::INFO::2021-06-08 05:08:08,143::network::88::network.Network::(action) Successfully verified network status

But that was too late and the engine VM was already on its way down.

A remaining open question is whether we should retry before giving up, and where - in the SDK, in OST code, etc. - or whether this should be considered normal.

Best regards,
--
Didi
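Regarding the open question above about retrying on the OST/SDK side, one way such a retry could look, as a rough sketch only - the helper name, retry count, and delay are invented for illustration, and the commented usage shows a hypothetical service call rather than actual OST code:

import time

import ovirtsdk4 as sdk

def call_with_retry(func, attempts=3, delay=10):
    """Retry an SDK call a few times on transport-level errors instead of
    failing on the first 'Connection refused'; attempts/delay are arbitrary."""
    for i in range(attempts):
        try:
            return func()
        except sdk.Error:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Hypothetical usage inside a test, assuming an existing 'templates_service':
# template = call_with_retry(
#     lambda: templates_service.template_service(template_id).get()
# )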

On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <didi@redhat.com> wrote:
But that was too late and the engine VM was already on its way down.
A remaining open question is whether we should retry before giving up, and where - in the SDK, in OST code, etc. - or whether this should be considered normal.
This now happened again [1] (with [2], for testing [3], but I don't think that's related):

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2049/...

Sparing you the DNS lines (you can search the log for details), but it happened twice in a few minutes. The first one was non-fatal, as it was "resolved" quickly:

Thread-1::WARNING::2021-06-09 10:46:28,504::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)
Thread-1::INFO::2021-06-09 10:46:31,737::network::88::network.Network::(action) Successfully verified network status

The second was "fatal" - it caused the score to become low and the agent to stop the VM:

Thread-1::WARNING::2021-06-09 10:50:26,809::network::120::network.Network::(_dns) DNS query failed: ...
Thread-1::WARNING::2021-06-09 10:51:06,090::network::92::network.Network::(action) Failed to verify network status, (4 out of 5)

Then it did resolve, but this was too late:

Thread-1::INFO::2021-06-09 10:51:09,292::network::88::network.Network::(action) Successfully verified network status

So the network wasn't completely dead (4 of 5 failed, and it got better in less than a minute), but it was bad enough.

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2049/
[2] https://jenkins.ovirt.org/job/oVirt_ovirt-ansible-collection_standard-check-...
[3] https://github.com/oVirt/ovirt-ansible-collection/pull/277

Best regards,
--
Didi

On Wed, Jun 9, 2021 at 12:13 PM Yedidyah Bar David <didi@redhat.com> wrote:
Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
- Not sure why. DNS servers:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/...
# Generated by NetworkManager
search lago.local
nameserver 192.168.200.1
nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
nameserver fd8f:1391:3a82:200::1
It now happened again:

https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/...

Thread-1::INFO::2021-06-22 18:57:29,134::network::88::network.Network::(action) Successfully verified network status
...
Thread-1::WARNING::2021-06-22 18:58:13,390::network::92::network.Network::(action) Failed to verify network status, (0 out of 5)
Thread-1::INFO::2021-06-22 18:58:15,761::network::88::network.Network::(action) Successfully verified network status
...
A remaining open question is whether we should retry before giving up, and where - in the SDK, in OST code, etc. - or whether this should be considered normal.
What do you think?

Best regards,
--
Didi

On Wed, Jun 23, 2021 at 7:48 AM Yedidyah Bar David <didi@redhat.com> wrote:
A remaining open question is whether we should retry before giving up, and where - in the SDK, in OST code, etc. - or whether this should be considered normal.
What do you think?
The question is: is there also a retry in place on the vdsm side? Because if it fails in vdsm, it's better to fail here as well. If there's a retry process in vdsm for all network calls, I think we can relax the check here and retry before giving up.
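For illustration, relaxing the check as suggested could mean re-running the DNS probe once or twice before counting the monitoring cycle as failed. This is only a sketch with invented names and an arbitrary retry count, not the actual broker code:

def dns_check_with_retry(probe, retries=2):
    """Re-run the DNS probe a couple of times before declaring this
    monitoring cycle failed, so a single lost reply does not lower the
    score. 'probe' is any callable returning True/False."""
    for _ in range(1 + retries):
        if probe():
            return True
    return False

# e.g. dns_check_with_retry(dns_probe), with a probe like the earlier sketch.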
--
Sandro Bonazzola
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
Red Hat EMEA <https://www.redhat.com/>
sbonazzo@redhat.com

On Wed, Jun 23, 2021 at 12:02 PM Sandro Bonazzola <sbonazzo@redhat.com> wrote:
The question is: is there also a retry in place on the vdsm side? Because if it fails in vdsm, it's better to fail here as well. If there's a retry process in vdsm for all network calls, I think we can relax the check here and retry before giving up.
No idea, adding Eitan.

Best regards,
--
Didi

On Wed, Jun 23, 2021 at 12:30 PM Yedidyah Bar David <didi@redhat.com> wrote:
The question is: is there also a retry in place on the vdsm side? Because if it fails in vdsm, it's better to fail here as well. If there's a retry process in vdsm for all network calls, I think we can relax the check here and retry before giving up.
No idea, adding Eitan.
I talked with Eitan about this in private, and he'll check. Thanks.

This happened more often recently.

I pushed a patch [1] to test the network alongside OST, and one of the CI check-patch runs for it also failed for this reason [2] (check broker.log on host-0). The log generated by this patch [3] ends with "Passed 1311 out of 1338", meaning it lost 27 replies in less than an hour, which IMO is quite a lot. The latest version of the patch tries dig with '+tcp' - if that's enough to make it pass with (close to) zero losses, perhaps we can do the same in HA (see the sketch below).

Thanks and best regards,

[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/115586
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/...
--
Didi
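The counting described above ("Passed 1311 out of 1338") boils down to a loop like the one sketched here - shown only to illustrate the approach; the probe interval, duration, and exact output format are assumptions, not the actual patch:

import subprocess
import time

def dig_ok(use_tcp=True, timeout=5):
    """One probe: query the DNS root with dig, optionally over TCP instead
    of UDP, and treat a non-zero exit status as a lost reply."""
    cmd = ["dig", "+tries=1", f"+time={timeout}"]
    if use_tcp:
        cmd.append("+tcp")
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

def probe_loop(duration=3600, interval=3):
    """Probe repeatedly for about an hour and print 'Passed X out of Y'."""
    passed = total = 0
    deadline = time.time() + duration
    while time.time() < deadline:
        total += 1
        if dig_ok():
            passed += 1
        time.sleep(interval)
    print(f"Passed {passed} out of {total}")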

Adding Ales as well. AFAIK vdsm does not actively poll the engine for liveness, nor does it do any retries. But retries might exist at a deeper infra level, where Marcin is the person to ask, IIUC.

On Wed, Jul 7, 2021 at 4:40 PM Yedidyah Bar David <didi@redhat.com> wrote:
This happened more often recently.

On Wed, Jul 7, 2021 at 4:42 PM Eitan Raviv <eraviv@redhat.com> wrote:
Adding Ales as well. AFAIK vdsm does not actively poll the engine for liveness, nor does it do any retries. But retries might exist at a deeper infra level, where Marcin is the person to ask, IIUC.
I now filed a bug to track this [1], and I think I managed to verify my patch to use +tcp [2]. So please review/merge the patch and flag/target/ack the bug. Thanks.

[1] https://bugzilla.redhat.com/1984356
[2] https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/115596
-- Didi

On Wed, Jul 7, 2021 at 4:42 PM Eitan Raviv <eraviv@redhat.com> wrote:
Adding Ales as well. AFAIK vdsm does not actively poll the engine for liveness, nor does it do any retries. But retries might exist at a deeper infra level, where Marcin is the person to ask, IIUC.
Right - no retries in vdsm; we send replies or events, and we don't have any way to tell whether the engine got the message.
Participants (4):
- Eitan Raviv
- Nir Soffer
- Sandro Bonazzola
- Yedidyah Bar David