
For brave investigators: a similar issue at a later stage of the same test can be found here [1]. The symptom is the same (the fence action fails), but this time it causes adding the storage itself to fail:

*2019-09-12 09:53:32,571-04 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-1) [] Operation Failed: [Cannot attach Storage. There is no active Host in the Data Center.]*

[1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15821

On Fri, Sep 13, 2019 at 5:09 PM Dusan Fodor <dfodor@redhat.com> wrote:
Hello all,

Lately I have witnessed multiple failures of the add_master_storage_domain test that were related neither to the changes themselves nor to any infra issue. One example can be found here [1]. After investigation, with huge help from Milan, the issue is that the host suddenly falls from the up state to some state other than up.
1. add_storage_domain picks a random host that is in the up state.
2. Meanwhile, the engine starts a fence action for it, so probably something went wrong with the host; the fence action fails with:
   *[org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engineScheduled-Thread-38) [6692895f] Can not run fence action on host 'lago-basic-suite-master-host-0', no suitable proxy host was found.*
3. The test fails because it cannot attach the domain to a non-up host:
   *[org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-1) [] Operation Failed: [Cannot add storage server connection when Host status is not up]*
For better orientation in the failed job's engine log [1]: the fence action for the host fails at :46:12,842-04, and the engine learns it cannot connect storage to the host at :46:16,105-04.
The test itself, add_master_storage_domain, starts at ~ :46:13,753 (according to the lago log).
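To make the ordering of these three events easier to see, here is a minimal Python sketch that computes each event's offset from the test start. It only uses the timestamps quoted above; the event labels and the helper names (`to_millis`, `offsets_from`) are made up for illustration. The hour is not given in the timestamps, so only minutes, seconds and milliseconds are parsed, and the sketch assumes all three events fall within the same hour and that the engine and lago clocks are comparable (the lago time is approximate anyway).

```python
import re

# Timestamps copied from the message above. The hour is omitted there,
# so only minutes, seconds and milliseconds are parsed.
# The labels are illustrative, not taken from any log.
EVENTS = {
    "fence action fails": ":46:12,842",
    "add_master_storage_domain starts": ":46:13,753",
    "engine cannot connect storage": ":46:16,105",
}


def to_millis(stamp):
    """Convert ':MM:SS,mmm' to milliseconds within the hour."""
    match = re.fullmatch(r":(\d{2}):(\d{2}),(\d{3})", stamp)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    minutes, seconds, millis = (int(group) for group in match.groups())
    return (minutes * 60 + seconds) * 1000 + millis


def offsets_from(reference, events):
    """Return each event's offset in ms relative to the reference event."""
    base = to_millis(events[reference])
    return {name: to_millis(stamp) - base for name, stamp in events.items()}


OFFSETS = offsets_from("add_master_storage_domain starts", EVENTS)

# Print the events in chronological order with signed offsets.
for name, delta in sorted(OFFSETS.items(), key=lambda item: item[1]):
    print(f"{delta:+6d} ms  {name}")
```

Under those assumptions, the fence failure lands about 0.9 s before the test starts and the storage connection failure about 2.4 s after it, which matches the sequence described in steps 1 to 3.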
Could you please check this? Thanks
[1] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829
[2] https://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/15829/artifac...