On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina(a)redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori(a)redhat.com>
wrote:
>
>
> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken(a)redhat.com>
> wrote:
>
>> Ravi's patch is in, but a similar problem remains, and the test cannot
>> be put back into its place.
>>
>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> requests queued up. At one point, the host resumed its connection
>> before the requests had been cleared from the queue. After the host is
>> up, the following tests resume, and at a pseudorandom point in time,
>> an old getCapsAsync request times out and kills our connection.
>>
>> I believe that as long as ANY request is in flight, the monitoring
>> lock should not be released, and the host should not be declared up.
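For illustration, here is a rough sketch of that idea in Java. All the
names below are hypothetical -- this is not the actual VdsManager/monitoring
code, just the shape of "count outstanding requests and only release the
monitoring lock / declare the host Up once the count is back to zero":

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical helper, one instance per host.
    class InFlightTracker {
        private final AtomicInteger inFlight = new AtomicInteger();

        // called right before a getCapsAsync request is sent
        void requestSent() {
            inFlight.incrementAndGet();
        }

        // called on every completion path: success, immediate connection
        // failure, and ResponseTracker timeout; returns true only for the
        // last outstanding request
        boolean requestCompleted() {
            return inFlight.decrementAndGet() == 0;
        }

        boolean idle() {
            return inFlight.get() == 0;
        }
    }

The tricky part is that requestCompleted() must be reached exactly once on
every completion path, otherwise the monitoring lock would either leak or
be released too early, which is essentially the problem discussed below.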
>>
>>
>>
>
> Hi Dan,
>
> Can I have the link to the job on jenkins so I can look at the logs
>
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/

From the logs, the only VDS lock that is being released twice is the
VDS_FENCE lock. I opened a BZ [1] for it and will post a fix.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
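For illustration only, one possible shape for making such a release
idempotent (hypothetical names -- not the engine's lock manager API, and not
necessarily what the fix attached to the BZ will look like): the first
caller really releases, and any later attempt becomes a no-op instead of
touching a lock that another flow may have acquired in the meantime.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical guard around a single acquired lock.
    class OneShotRelease {
        private final AtomicBoolean released = new AtomicBoolean(false);
        private final Runnable doRelease;

        OneShotRelease(Runnable doRelease) {
            this.doRelease = doRelease;
        }

        void release() {
            if (released.compareAndSet(false, true)) {
                doRelease.run();   // first call really releases the lock
            }                      // later calls are ignored
        }
    }

Both paths that currently release the lock (the refresh flow and the async
callback) would then share one such guard instead of calling the lock
manager directly.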
>
>
>> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rnori(a)redhat.com>
>> wrote:
>> > This [1] should fix the multiple release lock issue
>> >
>> > [1] https://gerrit.ovirt.org/#/c/90077/
>> >
>> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori(a)redhat.com>
>> > wrote:
>> >>
>> >> Working on a patch will post a fix
>> >>
>> >> Thanks
>> >>
>> >> Ravi
>> >>
>> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan(a)redhat.com>
>> >> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
>> >>> responsible for the mess.
>> >>>
>> >>> - 08:29:47 - engine loses connectivity to host
>> >>> 'lago-basic-suite-4-2-host-0'.
>> >>>
>> >>> - Every 3 seconds a getCapabilitiesAsync request is sent to the
>> >>>   host (unsuccessfully).
>> >>>
>> >>>         * before each "getCapabilitiesAsync" the monitoring lock
>> >>>           is taken (VdsManager.refreshImpl).
>> >>>
>> >>>         * "getCapabilitiesAsync" immediately fails and throws
>> >>>           'VDSNetworkException: java.net.ConnectException:
>> >>>           Connection refused'. The exception is caught by
>> >>>           'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand'
>> >>>           which calls 'onFailure' of the callback and re-throws the
>> >>>           exception.
>> >>>
>> >>> catch (Throwable t) {
>> >>> getParameters().getCallback().onFailure(t);
>> >>> throw t;
>> >>> }
>> >>>
>> >>>         * The 'onFailure' of the callback releases the
>> >>>           "monitoringLock" ('postProcessRefresh() ->
>> >>>           afterRefreshTreatment() -> if (!succeeded)
>> >>>           lockManager.releaseLock(monitoringLock);').
>> >>>
>> >>>         * 'VdsManager.refreshImpl' catches the network exception,
>> >>>           marks 'releaseLock = true' and tries to release the
>> >>>           already released lock.
>> >>>
>> >>> The following warning is printed to the log -
>> >>>
>> >>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
>> >>> release exclusive lock which does not exist, lock key:
>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>>
>> >>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
>> >>>
>> >>> - 08:32:55 - The failing test starts (Setup Networks for setting
>> >>>   ipv6).
>> >>>
>> >>>
>> >>> * SetupNetworks takes the monitoring lock.
>> >>>
>> >>> - 08:33:00 - ResponseTracker removes the getCapabilitiesAsync
>> >>>   requests from 4 minutes earlier from its queue and prints a
>> >>>   VDSNetworkException: Vds timeout occured.
>> >>>
>> >>>         * When the first request is removed from the queue
>> >>>           ('ResponseTracker.remove()'), 'Callback.onFailure' is
>> >>>           invoked (for the second time) -> the monitoring lock is
>> >>>           released (the lock taken by SetupNetworks!).
>> >>>
>> >>>         * The other requests removed from the queue also try to
>> >>>           release the monitoring lock, but there is nothing to
>> >>>           release.
>> >>>
>> >>> * The following warning log is printed -
>> >>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
>> >>> release exclusive lock which does not exist, lock key:
>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>> - 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it
>> >>>   started. Why? I'm not 100% sure, but I guess the root cause is
>> >>>   the late processing of 'getCapabilitiesAsync', which causes the
>> >>>   loss of the monitoring lock, together with the late and multiple
>> >>>   processing of the failure.
>> >>>
>> >>>
>> >>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and
>> >>> there are three attempts to release the lock. Please share your
>> >>> opinion regarding how it should be fixed.
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Alona.
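For what it's worth, a minimal sketch of the "treat the failure only once"
direction, in Java. The names are hypothetical -- this is not the real
callback interface used by GetCapabilitiesAsyncVDSCommand -- but it shows
the idea: let exactly one completion path (success or failure) reach the
real callback, so a late ResponseTracker timeout can no longer re-run
'onFailure' and release a monitoring lock that by then belongs to
SetupNetworks.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical one-shot wrapper around an async callback.
    class OneShotCallback<T> {
        interface Delegate<T> {
            void onSuccess(T result);
            void onFailure(Throwable t);
        }

        private final AtomicBoolean completed = new AtomicBoolean(false);
        private final Delegate<T> delegate;

        OneShotCallback(Delegate<T> delegate) {
            this.delegate = delegate;
        }

        void onSuccess(T result) {
            if (completed.compareAndSet(false, true)) {
                delegate.onSuccess(result);
            }
        }

        void onFailure(Throwable t) {
            if (completed.compareAndSet(false, true)) {
                delegate.onFailure(t);
            }
        }
    }

With something like this in place, the catch block in
executeVdsBrokerCommand and the timeout path in ResponseTracker could both
call onFailure safely, and only the first call would have any effect.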
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken(a)redhat.com>
>> >>> wrote:
>> >>>>
>> >>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas(a)redhat.com>
>> >>>> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri(a)redhat.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> >>>>>> Is it still failing?
>> >>>>>>
>> >>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren(a)redhat.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken(a)redhat.com>
>> >>>>>>> wrote:
>> >>>>>>> > No, I am afraid that we have not managed to understand why
>> >>>>>>> > setting an ipv6 address took the host off the grid. We shall
>> >>>>>>> > continue researching this next week.
>> >>>>>>> >
>> >>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks
>> >>>>>>> > old, but could it possibly be related (I really doubt that)?
>> >>>>>>> >
>> >>>>>
>> >>>>>
>> >>>>> Sorry, but I do not see how this problem is related to VDSM.
>> >>>>> There is nothing that indicates that there is a VDSM problem.
>> >>>>>
>> >>>>> Has the RPC connection between Engine and VDSM failed?
>> >>>>>
>> >>>>
>> >>>> Further up the thread, Piotr noticed that (at least on one failure
>> >>>> of this test) the Vdsm host lost connectivity to its storage, and
>> >>>> the Vdsm process was restarted. However, this does not seem to
>> >>>> happen in all cases where this test fails.
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>>
>
>
--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.