On Tue, Apr 24, 2018 at 9:24 AM, Martin Perina <mperina(a)redhat.com> wrote:
On Tue, Apr 24, 2018 at 3:17 PM, Ravi Shankar Nori <rnori(a)redhat.com>
wrote:
>
>
> On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <danken(a)redhat.com>
> wrote:
>
>> Ravi's patch is in, but a similar problem remains, and the test cannot
>> be put back into its place.
>>
>> It seems that while Vdsm was taken down, a couple of getCapsAsync
>> requests queued up. At one point, the host resumed its connection
>> before the requests had been cleared from the queue. After the host is
>> up, the following tests resume, and at a pseudorandom point in time,
>> an old getCapsAsync request times out and kills our connection.
>>
>> I believe that as long as ANY request is in flight, the monitoring
>> lock should not be released, and the host should not be declared up.
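For illustration, here is a rough sketch of that idea in Java. All the
names below are hypothetical -- this is not the actual VdsManager/monitoring
code, just the shape of "count outstanding requests and only release the
monitoring lock / declare the host Up once the count is back to zero":

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical helper, one instance per host.
    class InFlightTracker {
        private final AtomicInteger inFlight = new AtomicInteger();

        // called right before a getCapsAsync request is sent
        void requestSent() {
            inFlight.incrementAndGet();
        }

        // called on every completion path: success, immediate connection
        // failure, and ResponseTracker timeout; returns true only for the
        // last outstanding request
        boolean requestCompleted() {
            return inFlight.decrementAndGet() == 0;
        }

        boolean idle() {
            return inFlight.get() == 0;
        }
    }

The tricky part is that requestCompleted() must be reached exactly once on
every completion path, otherwise the monitoring lock would either leak or
be released too early, which is essentially the problem discussed below.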
>>
>>
>>
>
> Hi Dan,
>
> Can I have the link to the job on jenkins so I can look at the logs
>
http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/346/

From the logs, the only VDS lock that is being released twice is the
VDS_FENCE lock. I opened a BZ [1] for it and will post a fix.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1571300
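For illustration only, one possible shape for making such a release
idempotent (hypothetical names -- not the engine's lock manager API, and not
necessarily what the fix attached to the BZ will look like): the first
caller really releases, and any later attempt becomes a no-op instead of
touching a lock that another flow may have acquired in the meantime.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical guard around a single acquired lock.
    class OneShotRelease {
        private final AtomicBoolean released = new AtomicBoolean(false);
        private final Runnable doRelease;

        OneShotRelease(Runnable doRelease) {
            this.doRelease = doRelease;
        }

        void release() {
            if (released.compareAndSet(false, true)) {
                doRelease.run();   // first call really releases the lock
            }                      // later calls are ignored
        }
    }

Both paths that currently release the lock (the refresh flow and the async
callback) would then share one such guard instead of calling the lock
manager directly.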
>
>
>> On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <rnori(a)redhat.com>
>> wrote:
>> > This [1] should fix the multiple release lock issue
>> >
>> > [1] https://gerrit.ovirt.org/#/c/90077/
>> >
>> > On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <rnori(a)redhat.com>
>> > wrote:
>> >>
>> >> Working on a patch will post a fix
>> >>
>> >> Thanks
>> >>
>> >> Ravi
>> >>
>> >> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <alkaplan(a)redhat.com>
>> >> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> Looking at the log it seems that the new GetCapabilitiesAsync is
>> >>> responsible for the mess.
>> >>>
>> >>> - 08:29:47 - engine loses connectivity to host
>> >>> 'lago-basic-suite-4-2-host-0'.
>> >>>
>> >>> - Every 3 seconds a getCapabilitiesAsync request is sent to the
>> >>>   host (unsuccessfully).
>> >>>
>> >>>         * before each "getCapabilitiesAsync" the monitoring lock
>> >>>           is taken (VdsManager.refreshImpl).
>> >>>
>> >>>         * "getCapabilitiesAsync" immediately fails and throws
>> >>>           'VDSNetworkException: java.net.ConnectException:
>> >>>           Connection refused'. The exception is caught by
>> >>>           'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand'
>> >>>           which calls 'onFailure' of the callback and re-throws the
>> >>>           exception.
>> >>>
>> >>> catch (Throwable t) {
>> >>> getParameters().getCallback().onFailure(t);
>> >>> throw t;
>> >>> }
>> >>>
>> >>>         * The 'onFailure' of the callback releases the
>> >>>           "monitoringLock" ('postProcessRefresh() ->
>> >>>           afterRefreshTreatment() -> if (!succeeded)
>> >>>           lockManager.releaseLock(monitoringLock);').
>> >>>
>> >>>         * 'VdsManager.refreshImpl' catches the network exception,
>> >>>           marks 'releaseLock = true' and tries to release the
>> >>>           already released lock.
>> >>>
>> >>> The following warning is printed to the log -
>> >>>
>> >>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to
>> >>> release exclusive lock which does not exist, lock key:
>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>>
>> >>> - 08:30:51 - a successful getCapabilitiesAsync is sent.
>> >>>
>> >>> - 08:32:55 - The failing test starts (Setup Networks for setting
>> >>>   ipv6).
>> >>>
>> >>>
>> >>> * SetupNetworks takes the monitoring lock.
>> >>>
>> >>> - 08:33:00 - ResponseTracker removes the getCapabilitiesAsync
>> >>>   requests from 4 minutes earlier from its queue and prints a
>> >>>   VDSNetworkException: Vds timeout occured.
>> >>>
>> >>>         * When the first request is removed from the queue
>> >>>           ('ResponseTracker.remove()'), 'Callback.onFailure' is
>> >>>           invoked (for the second time) -> the monitoring lock is
>> >>>           released (the lock taken by SetupNetworks!).
>> >>>
>> >>>         * The other requests removed from the queue also try to
>> >>>           release the monitoring lock, but there is nothing to
>> >>>           release.
>> >>>
>> >>> * The following warning log is printed -
>> >>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>> >>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to
>> >>> release exclusive lock which does not exist, lock key:
>> >>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>> >>>
>> >>> - 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it
>> >>>   started. Why? I'm not 100% sure, but I guess the root cause is
>> >>>   the late processing of 'getCapabilitiesAsync', which causes the
>> >>>   loss of the monitoring lock, together with the late and multiple
>> >>>   processing of the failure.
>> >>>
>> >>>
>> >>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and
>> >>> there are three attempts to release the lock. Please share your
>> >>> opinion regarding how it should be fixed.
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Alona.
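For what it's worth, a minimal sketch of the "treat the failure only once"
direction, in Java. The names are hypothetical -- this is not the real
callback interface used by GetCapabilitiesAsyncVDSCommand -- but it shows
the idea: let exactly one completion path (success or failure) reach the
real callback, so a late ResponseTracker timeout can no longer re-run
'onFailure' and release a monitoring lock that by then belongs to
SetupNetworks.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical one-shot wrapper around an async callback.
    class OneShotCallback<T> {
        interface Delegate<T> {
            void onSuccess(T result);
            void onFailure(Throwable t);
        }

        private final AtomicBoolean completed = new AtomicBoolean(false);
        private final Delegate<T> delegate;

        OneShotCallback(Delegate<T> delegate) {
            this.delegate = delegate;
        }

        void onSuccess(T result) {
            if (completed.compareAndSet(false, true)) {
                delegate.onSuccess(result);
            }
        }

        void onFailure(Throwable t) {
            if (completed.compareAndSet(false, true)) {
                delegate.onFailure(t);
            }
        }
    }

With something like this in place, the catch block in
executeVdsBrokerCommand and the timeout path in ResponseTracker could both
call onFailure safely, and only the first call would have any effect.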
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken(a)redhat.com>
>> >>> wrote:
>> >>>>
>> >>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas(a)redhat.com>
>> >>>> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri(a)redhat.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>> >>>>>> Is it still failing?
>> >>>>>>
>> >>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren(a)redhat.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken(a)redhat.com>
>> >>>>>>> wrote:
>> >>>>>>> > No, I am afraid that we have not managed to understand why
>> >>>>>>> > setting an ipv6 address took the host off the grid. We shall
>> >>>>>>> > continue researching this next week.
>> >>>>>>> >
>> >>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks
>> >>>>>>> > old, but could it possibly be related (I really doubt that)?
>> >>>>>>> >
>> >>>>>
>> >>>>>
>> >>>>> Sorry, but I do not see how this problem is related to VDSM.
>> >>>>> There is nothing that indicates that there is a VDSM problem.
>> >>>>>
>> >>>>> Has the RPC connection between Engine and VDSM failed?
>> >>>>>
>> >>>>
>> >>>> Further up the thread, Piotr noticed that (at least on one failure
>> >>>> of this test) the Vdsm host lost connectivity to its storage, and
>> >>>> the Vdsm process was restarted. However, this does not seem to
>> >>>> happen in all cases where this test fails.
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>>
>
>
--
Martin Perina
Associate Manager, Software Engineering
Red Hat Czech s.r.o.