[ovirt-devel] [ OST Failure Report ] [ oVirt master ] [ 24/12/2017 ] [use_ovn_provider]

Yaniv Kaul ykaul at redhat.com
Fri Dec 29 14:38:25 UTC 2017


On Fri, Dec 29, 2017 at 2:21 PM, Dan Kenigsberg <danken at redhat.com> wrote:

> top posting is evil.
>
> On Fri, Dec 29, 2017 at 1:00 PM, Marcin Mirecki <mmirecki at redhat.com>
> wrote:
> >
> > On Thu, Dec 28, 2017 at 11:48 PM, Yaniv Kaul <ykaul at redhat.com> wrote:
> >>
> >>
> >>
> >> On Fri, Dec 29, 2017 at 12:26 AM, Barak Korren <bkorren at redhat.com>
> >> wrote:
> >>>
> >>> On 29 December 2017 at 00:22, Barak Korren <bkorren at redhat.com> wrote:
> >>> > On 28 December 2017 at 20:02, Dan Kenigsberg <danken at redhat.com>
> >>> > wrote:
> >>> >> Yet
> >>> >> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4559/
> >>> >> (which is the gating job for https://gerrit.ovirt.org/#/c/85797/2 )
> >>> >> still fails.
> >>> >> Could you look into why, Marcin?
> >>> >> The failure seems unrelated to ovn, as it is about a *host* losing
> >>> >> connectivity. But it reproduces too often, so we need to get to the
> >>> >> bottom of it.
> >>> >>
> >>> >
> >>> > Re-sending the change through the gate yielded a different error:
> >>> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4563/
> >>> >
> >>> > If this is still unrelated, we need to think seriously about what
> >>> > is causing this large number of unrelated failures. We cannot do
> >>> > any accurate reporting when failures are sporadic.
> >>> >
> >>>
> >>> And here is yet another host connectivity issue failing a test for a
> >>> change that should have no effect whatsoever (it's a tox patch for
> >>> vdsm):
> >>>
> >>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4565/
> >>
> >>
> >> I've added a fair number of changes this week. I doubt they are related,
> >> but the one that stands out is the addition of a fence-agent to one of
> >> the hosts. https://gerrit.ovirt.org/#/c/85817/ disables this specific
> >> test, just in case.
> >>
> >> I don't think it causes an issue, but looking at the git log, it's the
> >> only one I can suspect.
>
> > Trying to rebuild Barak's build resulted in another failure:
> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/
> > (with the same problem as Dan's build)
> >
> > The engine log contains a few "IOException: Broken pipe" errors, which
> > seem to correspond to a vdsm restart ("[vds] Exiting (vdsmd:170)"),
> > yet looking at my local successful run, I see the same entries in the
> > log. I don't see any other obvious reason for the problem so far.
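
A quick way to check whether those broken pipes actually line up with the
vdsm restarts is to dump both sets of log lines side by side. A rough
Python sketch (the file names are just placeholders for the engine and
vdsm logs pulled from the exported artifacts):

    # List engine-side broken pipes next to vdsm's own "Exiting" lines,
    # so the timestamps can be compared by eye.
    import re

    def grep(path, pattern):
        rx = re.compile(pattern)
        with open(path, errors='replace') as f:
            return [line.rstrip() for line in f if rx.search(line)]

    print('--- engine.log: Broken pipe ---')
    for line in grep('engine.log', r'IOException: Broken pipe'):
        print(line)
    print('--- vdsm.log: vdsmd exiting ---')
    for line in grep('vdsm.log', r'\[vds\] Exiting'):
        print(line)
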
>
>
> This actually points back to ykaul's fencing patch. And indeed,
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/artifact/exported-artifacts/basic-suit-master-el7/test_logs/basic-suite-master/post-005_network_by_label.py/lago-basic-suite-master-engine/_var_log/ovirt-engine/engine.log
> has
>
> 2017-12-29 05:26:07,712-05 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-417) [1a4f9963] Executed: '/usr/bin/vdsm-tool service-restart vdsmd'
>
> which means that the engine decided to kill vdsm. There are
> multiple communication errors prior to the soft fencing, but maybe
> waiting a bit longer would have kept the host alive.
>
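
For reference, the soft-fence restarts (and roughly how many communication
errors precede each one) are easy to pull out of engine.log. A minimal
sketch - the error patterns below are only examples of what to count,
adjust them to whatever actually shows up in the log:

    # Print each SSHClient line that restarts vdsmd, preceded by a count
    # of how many "communication error"-looking lines were seen before it.
    import re

    restart_rx = re.compile(
        r"SSHClient.*Executed: '/usr/bin/vdsm-tool service-restart vdsmd'")
    error_rx = re.compile(r'Broken pipe|VDSNetworkException')

    errors_seen = 0
    with open('engine.log', errors='replace') as f:
        for line in f:
            if error_rx.search(line):
                errors_seen += 1
            if restart_rx.search(line):
                print('%d error lines before soft fence:' % errors_seen)
                print(line.rstrip())
                errors_seen = 0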

Note that there's a test called vdsm recovery, where we actually stop and
start VDSM - perhaps the restart happens there?
Anyway, I've disabled the test that adds fencing. I don't think it is the
cause, but let's see.
Y.