Hi,

The first problem, 'Waiting for agent to be ready', looks like a race condition between the HE broker and vdsm.
The broker calls vdsm at the very moment vdsm restarts, and then keeps waiting for a result that never arrives.
The timeout on that call is set to 15 minutes, but the test fails after 10.

I'm looking into how to fix it.
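
One possible direction, just as a sketch and not the actual broker code
(call_with_retries, CallTimeout and the numbers below are made-up
placeholders), would be to retry with a shorter per-attempt timeout so
the total wait stays under the suite's 10-minute limit:

import time

class CallTimeout(Exception):
    pass

def call_with_retries(call, attempt_timeout=60, deadline=8 * 60):
    # 'call' stands in for the real vdsm client call and is assumed to
    # raise CallTimeout when its per-attempt timeout expires.
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        try:
            return call(timeout=attempt_timeout)
        except CallTimeout:
            # vdsm may have restarted mid-call; retry against the new instance
            continue
    raise CallTimeout("vdsm did not answer within %d seconds" % deadline)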

For the second problem, I'm not sure what the cause is yet. I will look into it more.


Andrej

On Mon, 2 Jul 2018 at 10:53, Yedidyah Bar David <didi@redhat.com> wrote:
On Mon, Jul 2, 2018 at 9:47 AM, Eyal Edri <eedri@redhat.com> wrote:
>
>
>
> On Mon, Jul 2, 2018 at 9:32 AM Yedidyah Bar David <didi@redhat.com> wrote:
>>
>> Hi all,
>>
>> I noticed that our hosted-engine suites [1] have been failing often
>> recently, and decided to have a look at [2], which is on 4.2 and should
>> hopefully be "rock solid" and basically never fail.
>>
>> I looked at these, [3][4][5][6][7], which are all the ones that still
>> appear in [2] and marked as failed.
>>
>> Among them:
>>
>> - All but one failed while "Waiting for agent to be ready" and timing
>> out after 10 minutes, as part of 008_restart_he_vm.py, which was added
>> a month ago [8] and then patched [9].
>>
>> - The other one [7] failed while "Waiting for engine to migrate", also
>> eventually timing out after 10 minutes, as part of
>> 010_local_mainentance.py, which was also added in [9].
>>
>> I also had a look at the last ones that succeeded, builds 329 to 337
>> of [2]. There:
>>
>> - "Waiting for agent to be ready" took between 26 and 48 seconds
>>
>> - "Waiting for engine to migrate" took between 69 and 82 seconds
>>
>> Assuming these numbers are reasonable (which might be debatable), 10
>> minutes indeed sounds like a reasonable timeout, and I think we should
>> handle each failure specifically. Did anyone check them? Was it an
>> infra issue/load/etc.? A bug? Something else?
>
>
> Suites should be monitored by their respective maintainers; the infra team doesn't have the capacity or resources
> to monitor any new suite that is running in CI.


OK.

Dug a bit more into [6] below, and saw this:

The agent is stuck in the following loop:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/328/artifact/exported-artifacts/test_logs/he-basic-ansible-suite-4.2/post-008_restart_he_vm.py/lago-he-basic-ansible-suite-4-2-host-0/_var_log/ovirt-hosted-engine-ha/agent.log

MainThread::INFO::2018-06-28
20:06:37,547::brokerlink::77::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor)
Starting monitor ping, options {'addr': '192.168.200.1'}
MainThread::ERROR::2018-06-28
20:06:37,548::hosted_engine::538::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
Failed to start necessary monitors
MainThread::ERROR::2018-06-28
20:06:37,550::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
line 535, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
line 83, in start_monitor
    .format(type, options, e))
RequestError: Failed to start monitor ping, options {'addr':
'192.168.200.1'}: [Errno 2] No such file or directory

So it seems the agent tries to connect to the broker to start the ping
monitor, and fails. The broker log has:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/328/artifact/exported-artifacts/test_logs/he-basic-ansible-suite-4.2/post-008_restart_he_vm.py/lago-he-basic-ansible-suite-4-2-host-0/_var_log/ovirt-hosted-engine-ha/broker.log


MainThread::INFO::2018-06-28
20:06:28,027::storage_backends::345::ovirt_hosted_engine_ha.lib.storage_backends::(connect)
Connecting the storage
MainThread::INFO::2018-06-28
20:06:28,028::storage_server::255::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Connecting storage server
MainThread::INFO::2018-06-28
20:06:28,093::storage_server::262::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Connecting storage server

and that's it; that's the end of the log. So it seems the broker is trying
to connect to the storage before it has initialized the ping monitor (and
the other monitors), and waits "forever" (while OST times out after 10 minutes).
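
As a side note, "[Errno 2] No such file or directory" is exactly what you
get when connecting to a Unix socket whose listener does not exist yet,
which would be consistent with the broker not having created its socket
before getting stuck on the storage. A quick standalone illustration, not
the HA code itself (the socket path is made up):

import errno
import socket

def try_broker_socket(path="/tmp/nonexistent-broker.socket"):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
    except socket.error as e:
        # Same "[Errno 2] No such file or directory" as in agent.log
        assert e.errno == errno.ENOENT
        print("broker socket not there yet: %s" % e)
    finally:
        s.close()

try_broker_socket()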

vdsm has:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/328/artifact/exported-artifacts/test_logs/he-basic-ansible-suite-4.2/post-008_restart_he_vm.py/lago-he-basic-ansible-suite-4-2-host-0/_var_log/vdsm/vdsm.log

2018-06-28 20:06:36,793-0400 INFO  (vmrecovery) [vdsm.api] START
getConnectedStoragePoolsList(options=None) from=internal,
task_id=c8f60a38-d4c5-48ee-8806-a645bd2f4316 (api:46)
2018-06-28 20:06:36,793-0400 INFO  (vmrecovery) [vdsm.api] FINISH
getConnectedStoragePoolsList return={'poollist': []} from=internal,
task_id=c8f60a38-d4c5-48ee-8806-a645bd2f4316 (api:52)
2018-06-28 20:06:36,793-0400 INFO  (vmrecovery) [vds] recovery:
waiting for storage pool to go up (clientIF:707)
...

2018-06-28 20:06:41,799-0400 INFO  (vmrecovery) [vdsm.api] START
getConnectedStoragePoolsList(options=None) from=internal,
task_id=23ab1933-e0eb-4dc2-8404-3cf016437ec0 (api:46)
2018-06-28 20:06:41,799-0400 INFO  (vmrecovery) [vdsm.api] FINISH
getConnectedStoragePoolsList return={'poollist': []} from=internal,
task_id=23ab1933-e0eb-4dc2-8404-3cf016437ec0 (api:52)
2018-06-28 20:06:41,799-0400 INFO  (vmrecovery) [vds] recovery:
waiting for storage pool to go up (clientIF:707)

etc.
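
For reference, the timestamps show the vmrecovery thread polling every
~5 seconds and always getting an empty poollist. Schematically it behaves
like the loop below; this is only an illustration, not vdsm's actual code,
and get_connected_storage_pools is a stand-in for the
getConnectedStoragePoolsList verb:

import logging
import time

log = logging.getLogger("vmrecovery")

def wait_for_storage_pool(get_connected_storage_pools, interval=5):
    # Keeps polling until at least one storage pool is connected; in the
    # log above the list stays empty, so this never returns.
    while True:
        pools = get_connected_storage_pools()
        if pools:
            return pools
        log.info("recovery: waiting for storage pool to go up")
        time.sleep(interval)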

Could not find anything interesting in supervdsm.log.

Adding Simone and Nir.

Any idea?

Whoever has a clue, please reply with a changed subject, so it's
easier to follow.

Thanks.

>
>
> Having said that, if a specific issue is reported, whether it's a Lago, OST or infra issue, we'll of course do our best to find and fix it.
>
>>
>>
>> I didn't check the logs yet, might do this later. Also didn't check
>> the failures in other jobs in [1].
>>
>> Best regards,
>>
>> [1] https://jenkins.ovirt.org/search/?q=he-basic
>>
>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/
>>
>> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/310/consoleFull
>>
>> [4] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/320/consoleFull
>>
>> [5] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/321/consoleFull
>>
>> [6] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/328/consoleFull
>>
>> [7] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-ansible-suite-4.2/336/
>>
>> [8] https://gerrit.ovirt.org/91952
>>
>> [9] https://gerrit.ovirt.org/92341
>> --
>> Didi
>
>
>
> --
>
> Eyal edri
>
>
> MANAGER
>
> RHV DevOps
>
> EMEA VIRTUALIZATION R&D
>
>
> Red Hat EMEA
>
> TRIED. TESTED. TRUSTED.
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)




--
Didi