Adding Ales as well.
AFAIK vdsm does not actively poll the engine for liveness, nor does it
do any retries. But retries might happen at a deeper infra level, where
Marcin is the person to ask, IIUC.
On Wed, Jul 7, 2021 at 4:40 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
On Wed, Jun 23, 2021 at 12:30 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Wed, Jun 23, 2021 at 12:02 PM Sandro Bonazzola <sbonazzo(a)redhat.com> wrote:
>>
>>
>>
>> On Wed, Jun 23, 2021 at 07:48 Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>
>>> On Wed, Jun 9, 2021 at 12:13 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>> >
>>> > On Tue, Jun 8, 2021 at 9:01 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>> > >
>>> > > On Tue, Jun 8, 2021 at 6:08 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
>>> > > >
>>> > > > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
>>> > > > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2046/
>>> > > > Build Number: 2046
>>> > > > Build Status: Failure
>>> > > > Triggered By: Started by timer
>>> > > >
>>> > > > -------------------------------------
>>> > > > Changes Since Last Success:
>>> > > > -------------------------------------
>>> > > > Changes for Build #2046
>>> > > > [Eitan Raviv] network: force select spm - wait for dc status
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > > -----------------
>>> > > > Failed Tests:
>>> > > > -----------------
>>> > > > 1 tests failed.
>>> > > > FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_template_export
>>> > > >
>>> > > > Error Message:
>>> > > > ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at 0x5624fe64d108>, 7, 'Failed to connect to 192.168.200.99 port 443: Connection refused')]
>>> > >
>>> > > - The engine VM went down:
>>> > >
>>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
>>> > >
>>> > > MainThread::INFO::2021-06-08 05:07:34,414::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 3400)
>>> > > MainThread::INFO::2021-06-08 05:07:44,575::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 960 due to network status
>>> > >
>>> > > - Because HA monitoring failed to get a reply from the dns server:
>>> > >
>>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
>>> > >
>>> > > Thread-1::WARNING::2021-06-08 05:07:25,486::network::120::network.Network::(_dns) DNS query failed:
>>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>>> > > ;; global options: +cmd
>>> > > ;; connection timed out; no servers could be reached
>>> > >
>>> > > Thread-3::INFO::2021-06-08 05:07:28,543::mem_free::51::mem_free.MemFree::(action) memFree: 1801
>>> > > Thread-5::INFO::2021-06-08 05:07:28,972::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
>>> > > Thread-2::INFO::2021-06-08 05:07:31,532::mgmt_bridge::65::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt in up state
>>> > > Thread-1::WARNING::2021-06-08 05:07:33,011::network::120::network.Network::(_dns) DNS query failed:
>>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>>> > > ;; global options: +cmd
>>> > > ;; connection timed out; no servers could be reached
>>> > >
>>> > > Thread-4::INFO::2021-06-08 05:07:37,433::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.3196, engine=0.1724, non-engine=0.1472
>>> > > Thread-3::INFO::2021-06-08 05:07:37,839::mem_free::51::mem_free.MemFree::(action) memFree: 1735
>>> > > Thread-5::INFO::2021-06-08 05:07:39,146::engine_health::246::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
>>> > > Thread-1::WARNING::2021-06-08 05:07:40,535::network::120::network.Network::(_dns) DNS query failed:
>>> > > ; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> +tries=1 +time=5
>>> > > ;; global options: +cmd
>>> > > ;; connection timed out; no servers could be reached
>>> > >
>>> > > Thread-1::WARNING::2021-06-08 05:07:40,535::network::92::network.Network::(action) Failed to verify network status, (2 out of 5)
>>> > >
>>> > > - Not sure why. DNS servers:
>>> > >
>>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
>>> > >
>>> > > # Generated by NetworkManager
>>> > > search lago.local
>>> > > nameserver 192.168.200.1
>>> > > nameserver fe80::5054:ff:fe0c:9ad0%ovirtmgmt
>>> > > nameserver fd8f:1391:3a82:200::1
>>>
>>> Now it happened again:
>>>
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17434/
>>>
>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/174...
>>>
>>> Thread-1::INFO::2021-06-22 18:57:29,134::network::88::network.Network::(action) Successfully verified network status
>>> ...
>>>
>>> Thread-1::WARNING::2021-06-22 18:58:13,390::network::92::network.Network::(action) Failed to verify network status, (0 out of 5)
>>> Thread-1::INFO::2021-06-22 18:58:15,761::network::88::network.Network::(action) Successfully verified network status
>>> ...
>>>
>>> > >
>>> > > - The command we run is 'dig +tries=1 +time=5', which defaults
>>> > > to querying for '.' (the dns root). This is normally cached
>>> > > locally, but has a TTL of 86400 seconds, meaning it can be cached
>>> > > for up to one day. So if we ran this query right after it
>>> > > expired, _and_ then the local dns server had some issues
>>> > > forwarding our request (due to external issues, perhaps), then it
>>> > > would fail like this. I am going to ignore this failure for now,
>>> > > assuming it was temporary, but it might be worth opening an RFE
>>> > > on ovirt-hosted-engine-ha asking for some more flexibility -
>>> > > setting the query string or something similar. I think that this
>>> > > bug is probably quite hard to reproduce, because normally, all
>>> > > hosts will use the same dns server, and problems with it will
>>> > > affect all of them similarly.
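>>> > >
>>> > > For illustration, a minimal sketch of what such a configurable
>>> > > check could look like (the function name and parameters are
>>> > > hypothetical, not the actual ovirt-hosted-engine-ha API; it only
>>> > > assumes 'dig' is on the PATH):
>>> > >
>>> > >     import subprocess
>>> > >
>>> > >     def dns_check(query='.', tries=1, timeout=5):
>>> > >         # Mirrors the current 'dig +tries=1 +time=5' invocation,
>>> > >         # which queries the dns root by default; making 'query'
>>> > >         # configurable is the flexibility the RFE would ask for.
>>> > >         cmd = ['dig', '+tries=%d' % tries, '+time=%d' % timeout, query]
>>> > >         # dig exits non-zero (9) when no servers could be reached
>>> > >         return subprocess.run(cmd, capture_output=True).returncode == 0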
>>> > >
>>> > > - Anyway, it seems like there were temporary connectivity issues
>>> > > on the network there. A minute later:
>>> > >
>>> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/20...
>>> > >
>>> > > Thread-1::INFO::2021-06-08 05:08:08,143::network::88::network.Network::(action) Successfully verified network status
>>> > >
>>> > > But that was too late and the engine VM was already on its way
>>> > > down.
>>> > >
>>> > > A remaining open question is whether we should retry before
>>> > > giving up, and where - in the SDK, in OST code, etc. - or
>>> > > whether this should be considered normal.
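>>> > >
>>> > > If we do retry in OST code, a minimal sketch could look like
>>> > > this (the helper and its parameters are hypothetical, not
>>> > > existing OST code; it assumes the failure surfaces as
>>> > > ovirtsdk4.Error, as in the report above):
>>> > >
>>> > >     import time
>>> > >
>>> > >     import ovirtsdk4
>>> > >
>>> > >     def with_retries(fn, attempts=3, delay=10):
>>> > >         # Retry a flaky SDK call a few times before giving up,
>>> > >         # instead of failing the test on the first refused
>>> > >         # connection.
>>> > >         for i in range(attempts):
>>> > >             try:
>>> > >                 return fn()
>>> > >             except ovirtsdk4.Error:
>>> > >                 if i == attempts - 1:
>>> > >                     raise
>>> > >                 time.sleep(delay)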
>>>
>>> What do you think?
>>
>>
>> The question is: is a retry in place on the vdsm side as well?
>> Because if it fails on vdsm, it's better to fail here as well. If
>> there's a retry process in vdsm for all network calls, I think we
>> can relax the check here and retry before giving up.
>
>
> No idea, adding Eitan.
I talked with Eitan about this in private, and he'll check. Thanks.
This has been happening more often recently.
I pushed a patch [1] to test the network alongside OST, and one of the
CI check-patch runs for it also failed due to this reason [2] (check
broker.log on host-0). The log generated by this patch [3] ends with
"Passed 1311 out of 1338", meaning it lost 27 replies in less than an
hour, which IMO is quite a lot. The latest version of the patch tries
dig with '+tcp' - if that's enough to make it pass with (close to)
zero losses, perhaps we can do the same in HA.
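
Roughly, the patch measures something like the following (a simplified
sketch, not the actual patch code; the loop count and sleep interval
are made up):

    import subprocess
    import time

    passed = total = 0
    for _ in range(1000):  # the real patch loops for as long as OST runs
        total += 1
        # '+tcp' forces the query over TCP; the default UDP transport
        # is where the lost replies are suspected to happen
        cmd = ['dig', '+tcp', '+tries=1', '+time=5']
        if subprocess.run(cmd, capture_output=True).returncode == 0:
            passed += 1
        time.sleep(2)
    print('Passed %d out of %d' % (passed, total))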
Thanks and best regards,
[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/115586
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/17711/
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/177...
--
Didi