On Fri, Dec 29, 2017 at 2:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
top posting is evil.

On Fri, Dec 29, 2017 at 1:00 PM, Marcin Mirecki <mmirecki@redhat.com> wrote:
>
> On Thu, Dec 28, 2017 at 11:48 PM, Yaniv Kaul <ykaul@redhat.com> wrote:
>>
>>
>>
>> On Fri, Dec 29, 2017 at 12:26 AM, Barak Korren <bkorren@redhat.com> wrote:
>>>
>>> On 29 December 2017 at 00:22, Barak Korren <bkorren@redhat.com> wrote:
>>> > On 28 December 2017 at 20:02, Dan Kenigsberg <danken@redhat.com> wrote:
>>> >> Yet
>>> >> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4559/
>>> >> (which is the gating job for https://gerrit.ovirt.org/#/c/85797/2 )
>>> >> still fails.
>>> >> Could you look into why, Marcin?
>>> >> The failure seems unrelated to ovn, as it is about a *host* losing
>>> >> connectivity. But it reproduces too often, so we need to get to the
>>> >> bottom of it.
>>> >>
>>> >
>>> > Re-sending the change through the gate yielded a different error:
>>> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4563/
>>> >
>>> > If this is still unrelated, we need to think seriously about what is
>>> > causing this large number of unrelated failures. We cannot do any
>>> > accurate reporting when failures are this sporadic.
>>> >
>>>
>>> And here is yet another host connectivity issue failing a test for a
>>> change that should have no effect whatsoever (it's a tox patch for
>>> vdsm):
>>>
>>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4565/
>>
>>
>> I've added a fair number of changes this week. I doubt they are related,
>> but the one that stands out
>> is the addition of a fence-agent to one of the hosts.
>> https://gerrit.ovirt.org/#/c/85817/ disables this specific test, just in
>> case.
>>
>> I don't think it causes an issue, but it's the only one I can suspect
>> when looking at the git log.

> Trying to rebuild Barak's build resulted in another failure:
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/
> (with the same problem as Dan's build)
>
> The engine log contains a few "IOException: Broken pipe" errors,
> which seem to correspond to a vdsm restart: "[vds] Exiting (vdsmd:170)".
> Yet looking at my local successful run, I see the same messages in the log,
> so I don't see any other obvious reason for the problem so far.
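>
> (For reference, one way to cross-check the two, using the engine and vdsm
> logs from the run's exported test_logs; the file names below are just an
> example, adjust to the actual artifact paths:)
>
>   # engine side: when the SSH channel to the host broke
>   grep -n 'Broken pipe' engine.log
>   # host side: when vdsmd exited / was restarted
>   grep -n 'Exiting' vdsm.log
>
> and then compare the timestamps of the matching lines.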


This actually points back to ykaul's fencing patch. And indeed,
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-005_network_by_label.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
has

2017-12-29 05:26:07,712-05 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient]
(EE-ManagedThreadFactory-engine-Thread-417) [1a4f9963] Executed:
'/usr/bin/vdsm-tool service-restart vdsmd'

which means that the Engine decided to soft-fence the host by restarting
vdsm. There are multiple communication errors prior to the soft fencing,
but maybe waiting a bit longer would have kept the host alive.
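
If we just want to give the host more slack before soft fencing kicks in,
the relevant knobs should be the engine-config vdsm timeouts, something
like the following (key names quoted from memory and may differ per
version, so please verify with 'engine-config --list' on the engine first):

  # current values of the vdsm communication timeouts
  engine-config -g vdsTimeout
  engine-config -g vdsConnectionTimeout
  engine-config -g VDSAttemptsToResetCount

  # example only: be more patient before declaring the host unresponsive
  engine-config -s vdsTimeout=300
  systemctl restart ovirt-engine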

Note that there's a test called vdsm recovery, where we actually stop and start VDSM - perhaps it happens there?
Anyway, I've disabled the test that adds fencing. I don't think this is the cause, but let's see.
Y.