
Top posting is evil.

On Fri, Dec 29, 2017 at 1:00 PM, Marcin Mirecki <mmirecki@redhat.com> wrote:
On Thu, Dec 28, 2017 at 11:48 PM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Fri, Dec 29, 2017 at 12:26 AM, Barak Korren <bkorren@redhat.com> wrote:
On 29 December 2017 at 00:22, Barak Korren <bkorren@redhat.com> wrote:
On 28 December 2017 at 20:02, Dan Kenigsberg <danken@redhat.com> wrote:
Yet http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4559/ (which is the gating job for https://gerrit.ovirt.org/#/c/85797/2 ) still fails. Could you look into why, Marcin? The failure seems unrelated to ovn, as it is about a *host* losing connectivity. But it reproduces too often, so we need to get to the bottom of it.
Re-sending the change through the gate yielded a different error: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4563/
If this is still unrelated, we need to think seriously about what is causing this large number of unrelated failures. We cannot do any accurate reporting when failures are sporadic.
And here is yet another host connectivity issue failing a test for a change that should have no effect whatsoever (it's a tox patch for vdsm):
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4565/
I've added a fair number of changes this week. I doubt they are related, but the one that stands out is the addition of a fence-agent to one of the hosts. https://gerrit.ovirt.org/#/c/85817/ disables this specific test, just in case.
I don't think it causes an issue, but looking at the git log, it's the only one I can suspect.
Trying to rebuild Barak's build resulted in another failure: http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/ (with the same problem as Dan's build)
The engine log contains a few "IOException: Broken pipe" errors which seem to correspond to a vdsm restart: "[vds] Exiting (vdsmd:170)". Yet looking at my local successful run, I see the same issues in the log. I don't see any other obvious reason for the problem so far.
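To check whether the broken pipes really line up with the vdsm restart, something like the rough sketch below could be run against local copies of the job artifacts. It is not part of the job itself; the file names and the leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp prefix are assumptions about the log format, so adjust as needed.

#!/usr/bin/env python3
# Sketch: correlate "Broken pipe" errors in engine.log with vdsm exit
# markers in vdsm.log by timestamp. Paths and timestamp format are assumed.
import re
from datetime import datetime, timedelta

TS_RE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})')

def timestamps(path, needle):
    """Yield the datetime of every line in `path` containing `needle`."""
    with open(path, errors='replace') as f:
        for line in f:
            if needle in line:
                m = TS_RE.match(line)
                if m:
                    yield datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S,%f')

def correlate(engine_log, vdsm_log, window=timedelta(seconds=30)):
    """Print each Broken pipe event and any vdsm exit within `window` of it."""
    exits = list(timestamps(vdsm_log, '[vds] Exiting'))
    for p in timestamps(engine_log, 'IOException: Broken pipe'):
        near = [e for e in exits if abs(e - p) <= window]
        print(p, '->', near if near else 'no vdsm restart nearby')

if __name__ == '__main__':
    # Hypothetical local copies of the downloaded artifacts.
    correlate('engine.log', 'vdsm.log')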
This actually points back to ykaul's fencing patch. And indeed, http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/artifact/... has:

2017-12-29 05:26:07,712-05 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-417) [1a4f9963] Executed: '/usr/bin/vdsm-tool service-restart vdsmd'

which means that the Engine decided it wants to kill vdsm. There are multiple communication errors prior to the soft fencing, but maybe waiting a bit longer would have kept the host alive.
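If we want to see how much grace the Engine actually gives the host before soft-fencing, a minimal sketch like the one below (run on the engine machine) could dump the relevant settings. The option names here are assumptions from memory about engine-config keys, not verified against this setup; `engine-config --list` should be used to confirm them before drawing conclusions.

#!/usr/bin/env python3
# Sketch: query engine-config for the timeouts that govern host
# non-responsive treatment. All key names below are assumptions.
import subprocess

ASSUMED_KEYS = [
    'vdsConnectionTimeout',        # assumed: timeout for connecting to vdsm
    'VdsRefreshRate',              # assumed: host polling interval
    'TimeoutToResetVdsInSeconds',  # assumed: grace period before reset/fencing
    'VDSAttemptsToResetCount',     # assumed: retries before soft fencing
]

for key in ASSUMED_KEYS:
    try:
        out = subprocess.run(
            ['engine-config', '-g', key],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(out)
    except (OSError, subprocess.CalledProcessError) as err:
        print('%s: could not query (%s)' % (key, err))

Comparing those values against the gap between the first communication error and the 05:26:07 service-restart in the log above would tell us whether raising a timeout is worth trying.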