Adding Marcin, who is investigating the issue.

On Thu, Mar 28, 2019 at 11:56 AM Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,

I want to verify [1]. So I ran the manual job, basic suite 4.3 [2].

It failed [3] with $subject.

Right before that, verify_add_hosts did succeed, and took 59 seconds.

I had a brief look at the code of verify_add_hosts, and it checks
that at least one host is UP.

The patch [1] *might* have caused ansible-host-deploy to take longer,
still not sure. In any case, ansible-host-deploy finished at 05:48:45
[5], 14 seconds before verify_add_hosts finished at 09:48:59 [4].

Can it be that a host is considered UP (from the POV of
verify_add_hosts), but is still not ready for creating storage?

Yes, unfortunately, since the beginning of oVirt, a host changes its status to Up as soon as the engine is able to communicate with it, and only afterwards are additional actions, such as connecting to storage, executed on the host.


If the host fails any of those actions, its status changes to NonOperational. But I'm not aware of any limitation that would prevent the 1st host from adding the master storage domain.
Tal/Freddy any thoughts?
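
To illustrate the status behaviour described above (Up is reported first, and a later failure moves the host to NonOperational), here is a minimal sketch of a stricter wait that only succeeds once every host has stayed Up for several consecutive polls. This is not what verify_add_hosts does today. The function name, the get_statuses callable, the lowercase status strings and all parameters are assumptions made up for the example and would have to be mapped to the real SDK calls.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_stable_up(
    get_statuses: Callable[[], Dict[str, str]],  # hypothetical: host name -> status string
    hosts: Iterable[str],
    stable_polls: int = 3,
    interval: float = 1.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once all hosts were 'up' for stable_polls polls in a row.

    Return False if any host turns 'non_operational' or the timeout expires.
    The status strings are placeholders, not the SDK's real enum values.
    """
    wanted = list(hosts)
    deadline = time.monotonic() + timeout
    streak = 0  # number of consecutive polls in which every host was 'up'
    while True:
        statuses = get_statuses()
        if any(statuses.get(h) == "non_operational" for h in wanted):
            return False
        if all(statuses.get(h) == "up" for h in wanted):
            streak += 1
            if streak >= stable_polls:
                return True
        else:
            streak = 0  # a host left 'up' (or is still installing): start over
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Requiring a streak instead of a single Up reading is only one possible way to cover the window between "engine can talk to the host" and "post-activation actions finished"; how many polls are enough is a guess and would need tuning.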

AFAIR, Marcin told me that he was not able to reproduce it outside Jenkins (both hosts were always Up before adding the master storage domain), but in the failed Jenkins OST runs only a single host was ever available (the 2nd one was still installing).
Marcin, do you have any updates?


Also, host-1, at the point when OST was killed and logs were collected, had
just finished host-deploy and hadn't yet started
ansible-host-deploy, according to engine.log. But host-0 should
have been enough, I think.


Thanks,

[1] https://gerrit.ovirt.org/99000
[2] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/4431/
[3] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/4431/testReport/junit/(root)/002_bootstrap/add_master_storage_domain/
[4] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/4431/consoleFull
[5] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests_manual/4431/artifact/exported-artifacts/test_logs/basic-suite-4.3/post-002_bootstrap.py/lago-basic-suite-4-3-engine/_var_log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20190328054747-lago-basic-suite-4-3-host-0-626cf955.log
--
Didi


--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.