<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 29, 2017 at 2:21 PM, Dan Kenigsberg <span dir="ltr"><<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">top posting is evil.<br>
<div><div class="h5"><br>
On Fri, Dec 29, 2017 at 1:00 PM, Marcin Mirecki <<a href="mailto:mmirecki@redhat.com">mmirecki@redhat.com</a>> wrote:<br>
><br>
> On Thu, Dec 28, 2017 at 11:48 PM, Yaniv Kaul <<a href="mailto:ykaul@redhat.com">ykaul@redhat.com</a>> wrote:<br>
>><br>
>><br>
>><br>
>> On Fri, Dec 29, 2017 at 12:26 AM, Barak Korren <<a href="mailto:bkorren@redhat.com">bkorren@redhat.com</a>> wrote:<br>
>>><br>
>>> On 29 December 2017 at 00:22, Barak Korren <<a href="mailto:bkorren@redhat.com">bkorren@redhat.com</a>> wrote:<br>
>>> > On 28 December 2017 at 20:02, Dan Kenigsberg <<a href="mailto:danken@redhat.com">danken@redhat.com</a>> wrote:<br>
> >>> >> Yet
> >>> >> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4559/
> >>> >> (which is the gating job for https://gerrit.ovirt.org/#/c/85797/2 )
> >>> >> still fails.
> >>> >> Could you look into why, Marcin?
> >>> >> The failure seems unrelated to ovn, as it is about a *host* losing
> >>> >> connectivity. But it reproduces too often, so we need to get to the
> >>> >> bottom of it.
> >>> >>
> >>> >
> >>> > Re-sending the change through the gate yielded a different error:
> >>> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4563/
> >>> >
> >>> > If this is still unrelated, we need to think seriously about what is
> >>> > causing this large number of unrelated failures. We cannot do any
> >>> > accurate reporting when failures are sporadic.
> >>> >
> >>>
> >>> And here is yet another host connectivity issue failing a test for a
> >>> change that should have no effect whatsoever (it's a tox patch for
> >>> vdsm):
> >>>
> >>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4565/
> >>
> >>
> >> I've added a fair number of changes this week. I doubt they are related,
> >> but the one that stands out is the addition of a fence-agent to one of
> >> the hosts. https://gerrit.ovirt.org/#/c/85817/ disables this specific
> >> test, just in case.
> >>
> >> I don't think it causes an issue, but looking at the git log it's the
> >> only one I can suspect.
>
> > Trying to rebuild Barak's build resulted in another failure:
> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/
> > (with the same problem as Dan's build)
> >
> > The engine log contains a few "IOException: Broken pipe" errors,
> > which seem to correspond to a vdsm restart: "[vds] Exiting (vdsmd:170)",
> > yet looking at my local successful run, I see the same issues in the log.
> > I don't see any other obvious reason for the problem so far.
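
One way to check whether those broken pipes really line up with vdsmd
restarts (rather than with something else dropping the connection) is to
compare timestamps across the two logs in the job artifacts. A rough
sketch, assuming the host's vdsm.log sits next to the engine log in the
usual lago artifact tree (the vdsm.log path is a guess; the grep strings
are the ones quoted above):

  # engine side: timestamps around each broken pipe
  grep -B 5 'IOException: Broken pipe' engine.log | grep -E '^2017-12-29' | awk '{print $1, $2}' | sort -u

  # host side: timestamps of each vdsm shutdown
  grep '\[vds\] Exiting (vdsmd:170)' vdsm.log | awk '{print $1, $2}'

If the two lists line up one-to-one, the broken pipes are most likely just
the engine side of those restarts, which would also explain why the same
messages show up in the successful local run.
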
>
> This actually points back to ykaul's fencing patch. And indeed,
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-005_network_by_label.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
> has
>
> 2017-12-29 05:26:07,712-05 DEBUG
> [org.ovirt.engine.core.uutils.ssh.SSHClient]
> (EE-ManagedThreadFactory-engine-Thread-417) [1a4f9963] Executed:
> '/usr/bin/vdsm-tool service-restart vdsmd'
>
> which means that Engine decided that it wants to kill vdsm. There are
> multiple communication errors prior to the soft fencing, but maybe
> waiting a bit longer would have kept the host alive.

Note that there's a test called vdsm recovery, where we actually stop and start VDSM - perhaps it's there?
Anyway, I've disabled the test that adds fencing. I don't think this is the cause, but let's see.
Y.
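
A quick way to see how often the engine resorted to soft fencing in a run
like this, and what communication errors preceded each attempt, is to grep
the same engine.log for the restart command quoted above. A rough sketch
(the error strings in the second grep are just examples, adjust as needed):

  # when, and how many times, did the engine soft-fence the host?
  grep -n "vdsm-tool service-restart vdsmd" engine.log

  # what led up to each soft fence?
  grep -B 40 "vdsm-tool service-restart vdsmd" engine.log | grep -iE 'Broken pipe|VDSNetworkException|not responding'

If it turns out the grace period before soft fencing is simply too short
for a loaded CI host, the host-communication timeouts are exposed through
engine-config (engine-config -a | grep -i timeout on the engine VM lists
what is tunable on that version).
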