On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori(a)redhat.com> wrote:
On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken(a)redhat.com> wrote:
> Ravi's patch is in, but a similar problem remains, and the test cannot
> be put back into its place.
>
> It seems that while Vdsm was taken down, a couple of getCapsAsync
> requests queued up. At one point the host resumed its connection,
> before those requests had been cleared from the queue. After the host
> is up, the following tests resume, and at a pseudorandom point in
> time an old getCapsAsync request times out and kills our connection.
>
> I believe that as long as ANY request is in flight, the monitoring
> lock should not be released, and the host should not be declared as
> Up.
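>
> Roughly the direction I have in mind (just a sketch with made-up
> names, not the actual VdsManager code): count the outstanding
> requests and release the monitoring lock only when the last one has
> completed, whether by success, failure or timeout.
>
>     import java.util.concurrent.atomic.AtomicInteger;
>
>     // Hypothetical guard, not real engine code.
>     class MonitoringLockGuard {
>         private final AtomicInteger inFlight = new AtomicInteger();
>
>         void onRequestSent() {
>             inFlight.incrementAndGet();
>         }
>
>         // Called from every completion path: success, failure, timeout.
>         void onRequestCompleted(Runnable releaseMonitoringLock) {
>             if (inFlight.decrementAndGet() == 0) {
>                 releaseMonitoringLock.run();
>             }
>         }
>     }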
>
>
>
Hi Dan,
Can I have the link to the job on Jenkins so I can look at the logs?
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/
> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rnori(a)redhat.com>
> wrote:
> > This [1] should fix the multiple lock release issue
> >
> > [1] https://gerrit.ovirt.org/#/c/90077/
> >
> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori(a)redhat.com> wrote:
> >>
> >> Working on a patch will post a fix
> >>
> >> Thanks
> >>
> >> Ravi
> >>
> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan(a)redhat.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
> >>> responsible for the mess.
> >>>
> >>> - 08:29:47 - engine loses connectivity to host
> >>> 'lago-basic-suite-4-2-host-0'.
> >>>
> >>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host
> >>> (unsuccessfully).
> >>>
> >>>     * before each "getCapabilitiesAsync" the monitoring lock is
> >>> taken (VdsManager.refreshImpl)
> >>>
> >>>     * "getCapabilitiesAsync" immediately fails and throws
> >>> 'VDSNetworkException: java.net.ConnectException: Connection refused'.
> >>> The exception is caught by
> >>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls
> >>> 'onFailure' of the callback and re-throws the exception:
> >>>
> >>> catch (Throwable t) {
> >>> getParameters().getCallback().onFailure(t);
> >>> throw t;
> >>> }
> >>>
> >>>     * The 'onFailure' of the callback releases the "monitoringLock"
> >>> ('postProcessRefresh() -> afterRefreshTreatment() ->
> >>> if (!succeeded) lockManager.releaseLock(monitoringLock);')
> >>>
> >>>     * 'VdsManager.refreshImpl' catches the network exception, marks
> >>> 'releaseLock = true' and tries to release the already-released lock.
> >>>
> >>> The following warning is printed to the log -
> >>>
> >>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release
> >>> exclusive lock which does not exist, lock key:
> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>>
> >>> - 08:30:51 a successful getCapabilitiesAsync is sent.
> >>>
> >>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
> >>>
> >>>
> >>> * SetupNetworks takes the monitoring lock.
> >>>
> >>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
> >>> from 4 minutes ago out of its queue and prints a VDSNetworkException:
> >>> Vds timeout occurred.
> >>>
> >>>     * When the first request is removed from the queue
> >>> ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the
> >>> second time) -> the monitoring lock is released (the lock taken by
> >>> SetupNetworks! - see the simplified sketch below the timeline).
> >>>
> >>>     * The other requests removed from the queue also try to release the
> >>> monitoring lock, but there is nothing to release.
> >>>
> >>> * The following warning log is printed -
> >>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release
> >>> exclusive lock which does not exist, lock key:
> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
> >>>
> >>> - 08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started.
> >>> Why? I'm not 100% sure, but I guess the late processing of
> >>> 'getCapabilitiesAsync', which causes the loss of the monitoring lock,
> >>> and the late + multiple processing of the failure are the root cause.
> >>>
> >>>
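> >>> To make the 'stolen lock' concrete, here is a simplified model of the
> >>> release path (made-up code, not the real InMemoryLockManager API):
> >>> the release is keyed only by the lock key, so the stale
> >>> getCapabilitiesAsync callback frees the very same '<vdsId>VDS_INIT'
> >>> lock that SetupNetworks acquired later.
> >>>
> >>>     import java.util.HashSet;
> >>>     import java.util.Set;
> >>>
> >>>     // Simplified stand-in for the engine's lock manager.
> >>>     class KeyOnlyLockManager {
> >>>         private final Set<String> locks = new HashSet<>();
> >>>
> >>>         synchronized boolean acquire(String key) {
> >>>             return locks.add(key);
> >>>         }
> >>>
> >>>         synchronized void release(String key) {
> >>>             // Nothing ties the release to the flow that acquired the
> >>>             // lock, so any caller holding the key string can free it.
> >>>             if (!locks.remove(key)) {
> >>>                 System.out.println("Trying to release exclusive lock "
> >>>                         + "which does not exist, lock key: " + key);
> >>>             }
> >>>         }
> >>>     }
> >>>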
> >>> Ravi, a 'getCapabilitiesAsync' failure is handled twice, and the lock
> >>> release is attempted three times. Please share your opinion regarding
> >>> how it should be fixed.
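> >>>
> >>> One direction I'd consider (just a rough sketch with made-up names,
> >>> not the actual engine code) is making the refresh callback one-shot,
> >>> so whichever path fires first - the connection-refused failure or the
> >>> later ResponseTracker timeout - is the only one allowed to release
> >>> the monitoring lock:
> >>>
> >>>     import java.util.concurrent.atomic.AtomicBoolean;
> >>>
> >>>     // Hypothetical one-shot wrapper, not real engine code.
> >>>     class OneShotRefreshCallback {
> >>>         private final AtomicBoolean completed = new AtomicBoolean();
> >>>         private final Runnable releaseMonitoringLock;
> >>>
> >>>         OneShotRefreshCallback(Runnable releaseMonitoringLock) {
> >>>             this.releaseMonitoringLock = releaseMonitoringLock;
> >>>         }
> >>>
> >>>         void onFailure(Throwable t) {
> >>>             // Only the first completion releases the lock; the stale
> >>>             // second invocation becomes a harmless no-op.
> >>>             if (completed.compareAndSet(false, true)) {
> >>>                 releaseMonitoringLock.run();
> >>>             }
> >>>         }
> >>>     }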
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Alona.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken(a)redhat.com> wrote:
> >>>>
> >>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas(a)redhat.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri(a)redhat.com> wrote:
> >>>>>>
> >>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
> >>>>>> Is it still failing?
> >>>>>>
> >>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren(a)redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken(a)redhat.com> wrote:
> >>>>>>> > No, I am afraid that we have not managed to understand why setting
> >>>>>>> > an ipv6 address took the host off the grid. We shall continue
> >>>>>>> > researching this next week.
> >>>>>>> >
> >>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
> >>>>>>> > but could it possibly be related (I really doubt that)?
> >>>>>>> >
> >>>>>
> >>>>>
> >>>>> Sorry, but I do not see how this problem is related to VDSM.
> >>>>> There is nothing that indicates that there is a VDSM problem.
> >>>>>
> >>>>> Has the RPC connection between Engine and VDSM failed?
> >>>>>
> >>>>
> >>>> Further up the thread, Piotr noticed that (at least on one failure of
> >>>> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
> >>>> process was restarted. However, this does not seem to happen in all
> >>>> cases where this test fails.
> >>>>
> >>>
> >>>
> >>
> >
>
--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.