[ovirt-devel] [ OST Failure Report ] [ oVirt 4.2 ] [ 2018-04-04 ] [006_migrations.prepare_migration_attachments_ipv6]

Ravi Shankar Nori rnori at redhat.com
Tue Apr 24 13:36:56 UTC 2018


On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina at redhat.com> wrote:

>
>
> On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori at redhat.com>
> wrote:
>
>>
>>
>> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken at redhat.com>
>> wrote:
>>
>>> Ravi's patch is in, but a similar problem remains, and the test cannot
>>> be put back into its place.
>>>
>>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>>> requests queued up. At one point, the host resumed its connection,
>>> before the requests had been cleared from the queue. After the host is
>>> up, the following tests resume, and at a pseudorandom point in time,
>>> an old getCapsAsync request times out and kills our connection.
>>>
>>> I believe that as long as ANY request is in flight, the monitoring
>>> lock should not be released, and the host should not be declared as
>>> up.
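>>>
>>> Roughly what I mean, as a minimal sketch with made-up names (this is
>>> not the actual engine code): keep a per-host count of in-flight
>>> getCapsAsync requests, and only allow the monitoring lock to be
>>> released (or the host to be declared up) once that count drops to zero.
>>>
>>>     import java.util.concurrent.atomic.AtomicInteger;
>>>
>>>     // Hypothetical per-host gate; names are illustrative only.
>>>     class MonitoringGate {
>>>         private final AtomicInteger inFlight = new AtomicInteger(0);
>>>
>>>         // Call right before a getCapsAsync request is sent.
>>>         void requestSent() {
>>>             inFlight.incrementAndGet();
>>>         }
>>>
>>>         // Call when a request completes, fails or times out.
>>>         // Returns true only when the last outstanding request has
>>>         // drained - the only point where releasing the monitoring
>>>         // lock and declaring the host up would be safe.
>>>         boolean requestDone() {
>>>             return inFlight.decrementAndGet() == 0;
>>>         }
>>>     }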
>>>
>>>
>>>
>>
>> Hi Dan,
>>
>> Can I have the link to the job on Jenkins so I can look at the logs?
>>
>
> http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
>>
>

From the logs, the only VDS lock that is being released twice is the VDS_FENCE
lock. I opened a BZ [1] for it and will post a fix.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300


>
>>
>>
>>> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rnori at redhat.com>
>>> wrote:
>>> > This [1] should fix the multiple lock release issue
>>> >
>>> > [1] https://gerrit.ovirt.org/#/c/90077/
>>> >
>>> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori at redhat.com>
>>> wrote:
>>> >>
>>> >> Working on a patch, will post a fix
>>> >>
>>> >> Thanks
>>> >>
>>> >> Ravi
>>> >>
>>> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan at redhat.com>
>>> wrote:
>>> >>>
>>> >>> Hi all,
>>> >>>
>>> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
>>> >>> responsible for the mess.
>>> >>>
>>> >>> - 08:29:47 - engine loses connectivity to host
>>> >>> 'lago-basic-suite-4-2-host-0'.
>>> >>>
>>> >>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host
>>> >>> (unsuccessfully).
>>> >>>
>>> >>>      * before each "getCapabilitiesAsync" the monitoring lock is taken
>>> >>> (VdsManager.refreshImpl)
>>> >>>
>>> >>>      * "getCapabilitiesAsync" immediately fails and throws
>>> >>> 'VDSNetworkException: java.net.ConnectException: Connection refused'.
>>> >>> The exception is caught by
>>> >>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls
>>> >>> 'onFailure' of the callback and re-throws the exception.
>>> >>>
>>> >>>          catch (Throwable t) {
>>> >>>             // first handling: onFailure() releases the monitoring lock
>>> >>>             getParameters().getCallback().onFailure(t);
>>> >>>             // re-thrown, so refreshImpl catches it and releases again
>>> >>>             throw t;
>>> >>>          }
>>> >>>
>>> >>>     * The 'onFailure' of the callback releases the "monitoringLock"
>>> >>> ('postProcessRefresh()->afterRefreshTreatment()-> if (!succeeded)
>>> >>> lockManager.releaseLock(monitoringLock);')
>>> >>>
>>> >>>     * 'VdsManager.refreshImpl' catches the network exception, marks
>>> >>> 'releaseLock = true' and tries to release the already released lock
>>> >>> (a sketch of an idempotent release guard follows the timeline below).
>>> >>>
>>> >>>       The following warning is printed to the log -
>>> >>>
>>> >>>       WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release
>>> >>> exclusive lock which does not exist, lock key:
>>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>> >>>
>>> >>>
>>> >>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
>>> >>>
>>> >>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>>> >>>
>>> >>>
>>> >>>     * SetupNetworks takes the monitoring lock.
>>> >>>
>>> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
>>> >>> from 4 minutes ago from its queue and prints a VDSNetworkException:
>>> >>> Vds timeout occured.
>>> >>>
>>> >>>       * When the first request is removed from the queue
>>> >>> ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for
>>> >>> the second time) -> the monitoring lock is released (the lock taken by
>>> >>> SetupNetworks!); see the ownership-check sketch further below.
>>> >>>
>>> >>>       * The other requests removed from the queue also try to release
>>> >>> the monitoring lock, but there is nothing to release.
>>> >>>
>>> >>>       * The following warning log is printed -
>>> >>>         WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release
>>> >>> exclusive lock which does not exist, lock key:
>>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>> >>>
>>> >>> - 08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started.
>>> >>> Why? I'm not 100% sure, but I guess the root cause is the late processing
>>> >>> of 'getCapabilitiesAsync', which causes the loss of the monitoring lock,
>>> >>> together with the late and multiple processing of the failure.
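>>> >>>
>>> >>> To make the double release concrete: one way to avoid it would be to make
>>> >>> the release idempotent per acquisition, so that both the callback's
>>> >>> 'onFailure' and 'refreshImpl' may call release, but only the first call
>>> >>> actually reaches the lock manager. A minimal sketch (illustrative names,
>>> >>> not the real InMemoryLockManager API):
>>> >>>
>>> >>>     import java.util.concurrent.atomic.AtomicBoolean;
>>> >>>
>>> >>>     // Hypothetical wrapper around one acquisition of the monitoring lock.
>>> >>>     class SingleReleaseLock {
>>> >>>         private final AtomicBoolean released = new AtomicBoolean(false);
>>> >>>         // e.g. () -> lockManager.releaseLock(monitoringLock)
>>> >>>         private final Runnable doRelease;
>>> >>>
>>> >>>         SingleReleaseLock(Runnable doRelease) {
>>> >>>             this.doRelease = doRelease;
>>> >>>         }
>>> >>>
>>> >>>         // Safe to call from onFailure(), refreshImpl() and the response
>>> >>>         // tracker; only the first caller actually releases.
>>> >>>         void releaseOnce() {
>>> >>>             if (released.compareAndSet(false, true)) {
>>> >>>                 doRelease.run();
>>> >>>             }
>>> >>>         }
>>> >>>     }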
>>> >>>
>>> >>>
>>> >>> Ravi, the 'getCapabilitiesAsync' failure is handled twice, and there are
>>> >>> three attempts to release the lock. Please share your opinion on how it
>>> >>> should be fixed.
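>>> >>>
>>> >>> One possible direction for the second problem (a stale callback releasing
>>> >>> a lock that SetupNetworks acquired later) - again only a sketch with
>>> >>> made-up names, not a claim about how the engine lock should actually be
>>> >>> changed: let every acquisition return a token, and ignore a release that
>>> >>> presents a token that no longer matches the current holder.
>>> >>>
>>> >>>     import java.util.UUID;
>>> >>>     import java.util.concurrent.atomic.AtomicReference;
>>> >>>
>>> >>>     // Hypothetical ownership-checked lock; illustrative only.
>>> >>>     class OwnedMonitoringLock {
>>> >>>         private final AtomicReference<UUID> owner = new AtomicReference<>();
>>> >>>
>>> >>>         // Returns a token for this acquisition, or null if already held.
>>> >>>         UUID tryAcquire() {
>>> >>>             UUID token = UUID.randomUUID();
>>> >>>             return owner.compareAndSet(null, token) ? token : null;
>>> >>>         }
>>> >>>
>>> >>>         // A release with a stale token (e.g. from a getCapabilitiesAsync
>>> >>>         // callback fired minutes later) is ignored, so it cannot free a
>>> >>>         // lock that a later flow such as SetupNetworks is holding.
>>> >>>         boolean release(UUID token) {
>>> >>>             return token != null && owner.compareAndSet(token, null);
>>> >>>         }
>>> >>>     }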
>>> >>>
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Alona.
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken at redhat.com>
>>> wrote:
>>> >>>>
>>> >>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas at redhat.com>
>>> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri at redhat.com>
>>> wrote:
>>> >>>>>>
>>> >>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>> >>>>>> Is it still failing?
>>> >>>>>>
>>> >>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren at redhat.com>
>>> >>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken at redhat.com>
>>> wrote:
>>> >>>>>>> > No, I am afraid that we have not managed to understand why setting
>>> >>>>>>> > an ipv6 address took the host off the grid. We shall continue
>>> >>>>>>> > researching this next week.
>>> >>>>>>> >
>>> >>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>>> >>>>>>> > but could it possibly be related (I really doubt that)?
>>> >>>>>>> >
>>> >>>>>
>>> >>>>>
>>> >>>>> Sorry, but I do not see how this problem is related to VDSM.
>>> >>>>> There is nothing that indicates that there is a VDSM problem.
>>> >>>>>
>>> >>>>> Has the RPC connection between Engine and VDSM failed?
>>> >>>>>
>>> >>>>
>>> >>>> Further up the thread, Piotr noticed that (at least on one failure of
>>> >>>> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
>>> >>>> process was restarted. However, this does not seem to happen in all
>>> >>>> cases where this test fails.
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> Devel mailing list
>>> >>>> Devel at ovirt.org
>>> >>>> http://lists.ovirt.org/mailman/listinfo/devel
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>
>>
>
>
> --
> Martin Perina
> Associate Manager, Software Engineering
> Red Hat Czech s.r.o.
>