[ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

Dan Kenigsberg danken at redhat.com
Wed Apr 25 15:34:40 UTC 2018


On Wed, Apr 25, 2018 at 5:57 PM, Martin Perina <mperina at redhat.com> wrote:
>
>
> On Tue, Apr 24, 2018 at 3:28 PM, Dan Kenigsberg <danken at redhat.com> wrote:
>>
>> On Tue, Apr 24, 2018 at 4:17 PM, Ravi Shankar Nori <rnori at redhat.com>
>> wrote:
>> >
>> >
>> > On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken at redhat.com>
>> > wrote:
>> >>
>> >> Ravi's patch is in, but a similar problem remains, and the test cannot
>> >> be put back into its place.
>> >>
>> >> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> >> requests queued up. At one point, the host resumed its connection,
>> >> before the requests had been cleared from the queue. After the host is
>> >> up, the following tests resume, and at a pseudorandom point in time,
>> >> an old getCapsAsync request times out and kills our connection.
>> >>
>> >> I believe that as long as ANY request is in flight, the monitoring
>> >> lock should not be released, and the host should not be declared as
>> >> up.
>> >>
>> >>
>> >
>> >
>> > Hi Dan,
>> >
>> > Can I have the link to the job on Jenkins so I can look at the logs?
>>
>> We disabled a network test that started failing after getCapsAsync was
>> merged.
>> Please own its re-introduction to OST: https://gerrit.ovirt.org/#/c/90264/
>>
>> Its most recent failure
>> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>> has been discussed by Alona and Piotr over IRC.
>
>
> So https://bugzilla.redhat.com/1571768 was created to cover this issue
> discovered during Alona's and Piotr's conversation. But after further
> discussion we have found out that this issue is not related to the
> non-blocking thread changes in engine 4.2 and that this behavior has existed
> since the beginning of vdsm-jsonrpc-java. Ravi will continue to verify the
> fix for BZ1571768 along
> with other locking changes he already posted to see if they will help
> network OST to succeed.
>
> But the fix for BZ1571768 is too dangerous for 4.2.3. Let's try to fix that
> on master and see whether it introduces any regressions. If not,
> then we can backport to 4.2.4.

I sense that there is a regression in connection management that
coincided with the introduction of async monitoring.
I am not alone: Gal Ben Haim was reluctant to take our test back.

Do you think that it is now safe to take it in
https://gerrit.ovirt.org/#/c/90264/ ?
I'd appreciate your support there. I don't want any test to be skipped
without a very good reason.
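
To make the rule proposed earlier in the thread concrete (the host should not
be declared up while ANY request is still in flight), here is a minimal
sketch. It is not engine or vdsm-jsonrpc-java code: the class HostMonitor, its
methods, and the status strings are all hypothetical names chosen for
illustration, and it assumes every request is eventually resolved by a reply,
an error, or a timeout.

```python
import threading


class HostMonitor:
    """Hypothetical per-host monitor; not the real engine class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()  # ids of requests sent but not yet resolved
        self.status = "Connecting"  # hypothetical status string

    def request_sent(self, request_id):
        """Record a request (e.g. a capabilities query) as in flight."""
        with self._lock:
            self._in_flight.add(request_id)

    def request_done(self, request_id):
        """Call on reply, error, or timeout; each of them resolves the request."""
        with self._lock:
            self._in_flight.discard(request_id)

    def try_mark_up(self):
        """Declare the host up only if no request is still in flight."""
        with self._lock:
            if self._in_flight:
                return False
            self.status = "Up"
            return True
```

With this shape, a stale request left over from before the reconnect keeps
try_mark_up() returning False until that request is resolved, so it cannot
time out later against a host that has already been declared up.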

