<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 29, 2017 at 2:21 PM, Dan Kenigsberg <span dir="ltr"><<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">top posting is evil.<br>
<div><div class="h5"><br>
On Fri, Dec 29, 2017 at 1:00 PM, Marcin Mirecki <<a href="mailto:mmirecki@redhat.com">mmirecki@redhat.com</a>> wrote:<br>
><br>
> On Thu, Dec 28, 2017 at 11:48 PM, Yaniv Kaul <<a href="mailto:ykaul@redhat.com">ykaul@redhat.com</a>> wrote:<br>
>><br>
>><br>
>><br>
>> On Fri, Dec 29, 2017 at 12:26 AM, Barak Korren <<a href="mailto:bkorren@redhat.com">bkorren@redhat.com</a>> wrote:<br>
>>><br>
>>> On 29 December 2017 at 00:22, Barak Korren <<a href="mailto:bkorren@redhat.com">bkorren@redhat.com</a>> wrote:<br>
>>> > On 28 December 2017 at 20:02, Dan Kenigsberg <<a href="mailto:danken@redhat.com">danken@redhat.com</a>> wrote:<br>
> >>> >> Yet
> >>> >> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4559/
> >>> >> (which is the gating job for https://gerrit.ovirt.org/#/c/85797/2 )
> >>> >> still fails.
> >>> >> Could you look into why, Marcin?
> >>> >> The failure seems unrelated to ovn, as it is about a *host* losing
> >>> >> connectivity. But it reproduces too often, so we need to get to the
> >>> >> bottom of it.
> >>> >>
> >>> >
> >>> > Re-sending the change through the gate yielded a different error:
> >>> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4563/
> >>> >
> >>> > If this is still unrelated, we need to think seriously about what is
> >>> > causing this large number of unrelated failures. We cannot do any
> >>> > accurate reporting when failures are sporadic.
> >>> >
> >>>
> >>> And here is yet another host connectivity issue failing a test for a
> >>> change that should have no effect whatsoever (it's a tox patch for
> >>> vdsm):
> >>>
> >>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4565/
> >>
> >>
> >> I've added a fair number of changes this week. I doubt they are related,
> >> but the one that stands out is the addition of a fence-agent to one of
> >> the hosts. https://gerrit.ovirt.org/#/c/85817/ disables this specific
> >> test, just in case.
> >>
> >> I don't think it causes an issue, but looking at the git log it's the
> >> only one I can suspect.
>
> > Trying to rebuild Barak's build resulted in another failure:
> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/
> > (with the same problem as Dan's build)
> >
> > The engine log contains a few "IOException: Broken pipe" errors,
> > which seem to correspond to a vdsm restart: "[vds] Exiting (vdsmd:170)",
> > yet looking at my local successful run, I see the same issues in the log.
> > I don't see any other obvious reason for the problem so far.
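
One way to check whether those broken pipes really line up with vdsmd
restarts (rather than with something else dropping the connection) is to
compare timestamps across the two logs in the job artifacts. A rough
sketch, assuming the host's vdsm.log sits next to the engine log in the
usual lago artifact tree (the vdsm.log path is a guess; the grep strings
are the ones quoted above):

  # engine side: timestamps around each broken pipe
  grep -B 5 'IOException: Broken pipe' engine.log | grep -E '^2017-12-29' | awk '{print $1, $2}' | sort -u

  # host side: timestamps of each vdsm shutdown
  grep '\[vds\] Exiting (vdsmd:170)' vdsm.log | awk '{print $1, $2}'

If the two lists line up one-to-one, the broken pipes are most likely just
the engine side of those restarts, which would also explain why the same
messages show up in the successful local run.
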
>
> This actually points back to ykaul's fencing patch. And indeed,
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-005_network_by_label.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
> has
>
> 2017-12-29 05:26:07,712-05 DEBUG
> [org.ovirt.engine.core.uutils.ssh.SSHClient]
> (EE-ManagedThreadFactory-engine-Thread-417) [1a4f9963] Executed:
> '/usr/bin/vdsm-tool service-restart vdsmd'
>
> which means that Engine decided that it wants to kill vdsm. There are
> multiple communication errors prior to the soft fencing, but maybe
> waiting a bit longer would have kept the host alive.

Note that there's a test called vdsm recovery, where we actually stop and start VDSM - perhaps it's there?
Anyway, I've disabled the test that adds fencing. I don't think this is the cause, but let's see.
Y.
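
A quick way to see how often the engine resorted to soft fencing in a run
like this, and what communication errors preceded each attempt, is to grep
the same engine.log for the restart command quoted above. A rough sketch
(the error strings in the second grep are just examples, adjust as needed):

  # when, and how many times, did the engine soft-fence the host?
  grep -n "vdsm-tool service-restart vdsmd" engine.log

  # what led up to each soft fence?
  grep -B 40 "vdsm-tool service-restart vdsmd" engine.log | grep -iE 'Broken pipe|VDSNetworkException|not responding'

If it turns out the grace period before soft fencing is simply too short
for a loaded CI host, the host-communication timeouts are exposed through
engine-config (engine-config -a | grep -i timeout on the engine VM lists
what is tunable on that version).
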