Top posting is evil.
On Fri, Dec 29, 2017 at 1:00 PM, Marcin Mirecki <mmirecki(a)redhat.com> wrote:
On Thu, Dec 28, 2017 at 11:48 PM, Yaniv Kaul <ykaul(a)redhat.com> wrote:
>
>
>
> On Fri, Dec 29, 2017 at 12:26 AM, Barak Korren <bkorren(a)redhat.com> wrote:
>>
>> On 29 December 2017 at 00:22, Barak Korren <bkorren(a)redhat.com> wrote:
>> > On 28 December 2017 at 20:02, Dan Kenigsberg <danken(a)redhat.com> wrote:
>> >> Yet
>> >> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4559/
>> >> (which is the gating job for https://gerrit.ovirt.org/#/c/85797/2)
>> >> still fails.
>> >> Could you look into why, Marcin?
>> >> The failure seems unrelated to ovn, as it is about a *host* losing
>> >> connectivity. But it reproduces too often, so we need to get to the
>> >> bottom of it.
>> >>
>> >
>> > Re-sending the change through the gate yielded a different error:
>> > http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4563/
>> >
>> > If this is still unrelated, we need to think seriously about what is
>> > causing this large number of unrelated failures. We cannot do any
>> > accurate reporting when failures are sporadic.
>> >
>>
>> And here is yet another host connectivity issue failing a test for a
>> change that should have no effect whatsoever (it's a tox patch for
>> vdsm):
>>
>> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4565/
>
>
> I've added a fair number of changes this week. I doubt they are related,
> but the one that stands out
> is the addition of a fence-agent to one of the hosts.
>
> https://gerrit.ovirt.org/#/c/85817/ disables this specific test, just in
> case.
>
> I don't think it causes an issue, but looking at the git log it's the
> only change I can suspect.
Trying to rebuild Barak's build resulted in another failure:
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/
(with the same problem as Dan's build)
The engine log contains a few "IOException: Broken pipe" errors,
which seem to correspond to a vdsm restart: "[vds] Exiting (vdsmd:170)",
yet looking at my local successful run, I see the same issues in the log.
I don't see any other obvious reasons for the problem so far.
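
For what it's worth, here is a rough way to line the two logs up (a Python
sketch only; the file names engine.log and vdsm.log and the timestamp layout
are assumptions based on the snippets above, not the actual artifact paths):

#!/usr/bin/env python
# Rough helper to line up "Broken pipe" errors in engine.log with vdsm
# restarts in vdsm.log. File names and log formats are assumptions based
# on the snippets quoted in this thread, not on the actual job artifacts.

import sys


def grep_with_timestamps(path, needle):
    """Return (timestamp, line) pairs for lines containing `needle`.

    The timestamp is taken to be the first two whitespace-separated
    fields of the line (date and time), matching the engine log excerpt
    quoted below.
    """
    hits = []
    with open(path) as f:
        for line in f:
            if needle in line:
                fields = line.split()
                stamp = " ".join(fields[:2]) if len(fields) >= 2 else ""
                hits.append((stamp, line.rstrip()))
    return hits


def main():
    engine_log = sys.argv[1] if len(sys.argv) > 1 else "engine.log"
    vdsm_log = sys.argv[2] if len(sys.argv) > 2 else "vdsm.log"

    print("=== engine: Broken pipe ===")
    for stamp, line in grep_with_timestamps(engine_log,
                                            "IOException: Broken pipe"):
        print(stamp, "|", line)

    print("=== vdsm: Exiting ===")
    for stamp, line in grep_with_timestamps(vdsm_log, "[vds] Exiting"):
        print(stamp, "|", line)


if __name__ == "__main__":
    main()

Running this against the logs from the job artifacts should show whether
every "Broken pipe" burst is preceded by a vdsm exit, or whether some of
them stand on their own.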
This actually points back to ykaul's fencing patch. And indeed,
http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/4571/artifa...
has
2017-12-29 05:26:07,712-05 DEBUG
[org.ovirt.engine.core.uutils.ssh.SSHClient]
(EE-ManagedThreadFactory-engine-Thread-417) [1a4f9963] Executed:
'/usr/bin/vdsm-tool service-restart vdsmd'
which means that Engine decided to kill vdsm. There were multiple
communication errors prior to the soft fencing, but maybe waiting a bit
longer would have kept the host alive.
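
To make the "waiting a bit longer" point concrete, the idea would be roughly
the following (a hypothetical Python sketch, not the engine's actual fencing
code; the class name, thresholds and defaults are invented for illustration):

# Illustration only: NOT the engine's real fencing logic, just a sketch of
# the "wait a bit longer before soft fencing" idea from this thread. The
# class name, thresholds and defaults are made up for the example.

import time


class SoftFencePolicy:
    """Trigger soft fencing only after a sustained run of comm errors."""

    def __init__(self, max_errors=5, grace_seconds=60):
        self.max_errors = max_errors        # consecutive failures required
        self.grace_seconds = grace_seconds  # minimum unreachable time
        self._errors = 0
        self._first_error_at = None

    def record_success(self):
        # Any successful heartbeat resets the counters.
        self._errors = 0
        self._first_error_at = None

    def record_error(self):
        # Remember when the current run of failures started.
        if self._errors == 0:
            self._first_error_at = time.monotonic()
        self._errors += 1

    def should_soft_fence(self):
        # Fence only if we saw enough consecutive errors *and* the host
        # has been unreachable for at least the grace period.
        if self._errors < self.max_errors or self._first_error_at is None:
            return False
        return time.monotonic() - self._first_error_at >= self.grace_seconds

Only when should_soft_fence() returns True would '/usr/bin/vdsm-tool
service-restart vdsmd' be run over SSH, as in the log above; shorter
hiccups would be ridden out instead of restarting vdsm.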