For brave investigators, a similar issue at a later stage of the same test can
be found here [1]. It shows the same symptom of a failed fence action, but this
time it causes the failure when adding the storage itself:
*2019-09-12 09:53:32,571-04 ERROR
[org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default
task-1) [] Operation Failed: [Cannot attach Storage. There is no active
Host in the Data Center.]*
[1]
https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821
On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <dfodor(a)redhat.com> wrote:
Hello all,
Lately I have witnessed multiple failures of the add_master_storage_domain test
that were related neither to the changes themselves nor to any infra issue. One
example can be found here [1].
After investigating with huge help from Milan, the issue appears to be that the
host suddenly falls from the up state to some non-up state.
1. add_storage_domain picks a random host that is in the up state
2. meanwhile, the engine starts a fence action for it, so something has probably
gone wrong with the host; the fence action fails with:
*[org.ovirt.engine.core.bll.pm.FenceProxyLocator]
(EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not run
fence action on host 'lago-basic-suite-master-host-0', no suitable proxy
host was found.*
3. the test fails because it cannot attach the domain to a non-up host:
*[org.ovirt.engine.api.restapi.resource.AbstractBackendResource]
(default task-1) [] Operation Failed: [Cannot add storage server connection
when Host status is not up]*
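The three steps above amount to a check-then-use race: the host is up when it is picked, but no longer up by the time the storage connection is attempted. A minimal sketch of that sequence follows. The `Host` class, the `pick_up_host` and `add_storage_connection` helpers, and the `non_responsive` status value are all hypothetical stand-ins for illustration, not the real test-suite or SDK code.

```python
import random
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical stand-in for an engine host record; not the oVirt SDK type.
    name: str
    status: str  # "up" or anything else


def pick_up_host(hosts):
    """Step 1: pick a random host that is currently in the up state."""
    up_hosts = [h for h in hosts if h.status == "up"]
    if not up_hosts:
        raise RuntimeError("There is no active Host in the Data Center.")
    return random.choice(up_hosts)


def add_storage_connection(host):
    """Step 3: the request is rejected if the host is no longer up."""
    if host.status != "up":
        raise RuntimeError(
            "Cannot add storage server connection when Host status is not up"
        )
    return f"connected via {host.name}"


hosts = [Host("lago-basic-suite-master-host-0", "up")]
chosen = pick_up_host(hosts)

# Step 2: between the pick and the use, the host leaves the up state
# (the status value here is an assumed placeholder).
chosen.status = "non_responsive"

try:
    add_storage_connection(chosen)
    outcome = "ok"
except RuntimeError as err:
    outcome = str(err)

print(outcome)
```

The point of the sketch is only that the status check in step 1 gives no guarantee about the status at step 3; it does not explain why the host left the up state.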
For better orientation in the failed job's engine log [1], the fence action for
the host fails at
:46:12,842-04
and the engine learns it cannot connect storage to the host at
:46:16,105-04
The test itself, add_master_storage_domain, starts at ~ :46:13,753
(according to the lago log).
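To make the ordering of these three events explicit, here is a small sketch that sorts them and prints the offsets in milliseconds. It assumes all three timestamps fall within the same hour and that the engine and lago logs use comparable clocks. The hour is left out because it is not given above, and the helper name `to_millis` is made up for this example.

```python
def to_millis(stamp):
    """Convert 'MM:SS,mmm' to milliseconds past the (unspecified) hour."""
    minutes, rest = stamp.split(":")
    seconds, millis = rest.split(",")
    return (int(minutes) * 60 + int(seconds)) * 1000 + int(millis)


# Timestamps copied from the message above, with the timezone suffix dropped.
events = [
    ("fence action fails (engine log)", "46:12,842"),
    ("engine cannot connect storage (engine log)", "46:16,105"),
    ("add_master_storage_domain starts (lago log, approximate)", "46:13,753"),
]

timeline = sorted(events, key=lambda e: to_millis(e[1]))
first = to_millis(timeline[0][1])
offsets = [to_millis(stamp) - first for _, stamp in timeline]

for (label, stamp), offset in zip(timeline, offsets):
    print(f"+{offset:5d} ms  :{stamp}  {label}")
```

Under those assumptions, the fence action fails roughly 0.9 s before the test starts, and the storage connection error follows about 2.4 s after the test starts.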
Could you please check this?
Thanks
[1]
https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
[2]
https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/arti...