<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 24, 2018 at 7:00 AM, Dan Kenigsberg <span dir="ltr"><<a href="mailto:danken@redhat.com" target="_blank">danken@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Ravi's patch is in, but a similar problem remains, and the test cannot<br>
be put back into its place.<br>
<br>
It seems that while Vdsm was taken down, a couple of getCapsAsync<br>
requests queued up. At one point, the host resumed its connection,<br>
before the requests have been cleared of the queue. After the host is<br>
up, the following tests resume, and at a pseudorandom point in time,<br>
an old getCapsAsync request times out and kills our connection.<br>
<br>
I believe that as long as ANY request is on flight, the monitoring<br>
lock should not be released, and the host should not be declared as<br>
up.<br>
<div class="HOEnZb"><div class="h5"><br>
<br></div></div></blockquote><div><br></div><div><br></div><div>Hi Dan,</div><div><br></div><div>Can I have the link to the job on jenkins so I can look at the logs</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">
On Wed, Apr 11, 2018 at 1:04 AM, Ravi Shankar Nori <<a href="mailto:rnori@redhat.com">rnori@redhat.com</a>> wrote:<br>
> This [1] should fix the multiple release lock issue<br>
><br>
> [1] <a href="https://gerrit.ovirt.org/#/c/90077/" rel="noreferrer" target="_blank">https://gerrit.ovirt.org/#/c/<wbr>90077/</a><br>
><br>
> On Tue, Apr 10, 2018 at 3:53 PM, Ravi Shankar Nori <<a href="mailto:rnori@redhat.com">rnori@redhat.com</a>> wrote:<br>
>><br>
>> Working on a patch will post a fix<br>
>><br>
>> Thanks<br>
>><br>
>> Ravi<br>
>><br>
>> On Tue, Apr 10, 2018 at 9:14 AM, Alona Kaplan <<a href="mailto:alkaplan@redhat.com">alkaplan@redhat.com</a>> wrote:<br>
>>><br>
>>> Hi all,<br>
>>><br>
>>> Looking at the log it seems that the new GetCapabilitiesAsync is<br>
>>> responsible for the mess.<br>
>>><br>
>>> - 08:29:47 - engine loses connectivity to host<br>
>>> 'lago-basic-suite-4-2-host-0'.<br>
>>><br>
>>> - Every 3 seconds a getCapabalititiesAsync request is sent to the host<br>
>>> (unsuccessfully).<br>
>>><br>
>>> * before each "getCapabilitiesAsync" the monitoring lock is taken<br>
>>> (VdsManager,refreshImpl)<br>
>>><br>
>>> * "getCapabilitiesAsync" immediately fails and throws<br>
>>> 'VDSNetworkException: java.net.ConnectException: Connection refused'. The<br>
>>> exception is caught by<br>
>>> '<wbr>GetCapabilitiesAsyncVDSCommand<wbr>.executeVdsBrokerCommand' which calls<br>
>>> 'onFailure' of the callback and re-throws the exception.<br>
>>><br>
>>> catch (Throwable t) {<br>
>>> getParameters().getCallback().<wbr>onFailure(t);<br>
>>> throw t;<br>
>>> }<br>
>>><br>
>>> * The 'onFailure' of the callback releases the "monitoringLock"<br>
>>> ('postProcessRefresh()-><wbr>afterRefreshTreatment()-> if (!succeeded)<br>
>>> lockManager.releaseLock(<wbr>monitoringLock);')<br>
>>><br>
>>> * 'VdsManager,refreshImpl' catches the network exception, marks<br>
>>> 'releaseLock = true' and tries to release the already released lock.<br>
>>><br>
>>> The following warning is printed to the log -<br>
>>><br>
>>> WARN [org.ovirt.engine.core.bll.<wbr>lock.InMemoryLockManager]<br>
>>> (EE-ManagedThreadFactory-<wbr>engineScheduled-Thread-53) [] Trying to release<br>
>>> exclusive lock which does not exist, lock key:<br>
>>> 'ecf53d69-eb68-4b11-8df2-<wbr>c4aa4e19bd93VDS_INIT'<br>
>>><br>
>>><br>
>>> - 08:30:51 a successful getCapabilitiesAsync is sent.<br>
>>><br>
>>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).<br>
>>><br>
>>><br>
>>> * SetupNetworks takes the monitoring lock.<br>
>>><br>
>>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests<br>
>>> from 4 minutes ago from its queue and prints a VDSNetworkException: Vds<br>
>>> timeout occured.<br>
>>><br>
>>> * When the first request is removed from the queue<br>
>>> ('ResponseTracker.remove()'), the 'Callback.onFailure' is invoked (for the<br>
>>> second time) -> monitoring lock is released (the lock taken by the<br>
>>> SetupNetworks!).<br>
>>><br>
>>> * The other requests removed from the queue also try to release the<br>
>>> monitoring lock, but there is nothing to release.<br>
>>><br>
>>> * The following warning log is printed -<br>
>>> WARN [org.ovirt.engine.core.bll.<wbr>lock.InMemoryLockManager]<br>
>>> (EE-ManagedThreadFactory-<wbr>engineScheduled-Thread-14) [] Trying to release<br>
>>> exclusive lock which does not exist, lock key:<br>
>>> 'ecf53d69-eb68-4b11-8df2-<wbr>c4aa4e19bd93VDS_INIT'<br>
>>><br>
>>> - 08:33:00 - SetupNetwork fails on Timeout ~4 seconds after is started.<br>
>>> Why? I'm not 100% sure but I guess the late processing of the<br>
>>> 'getCapabilitiesAsync' that causes losing of the monitoring lock and the<br>
>>> late + mupltiple processing of failure is root cause.<br>
>>><br>
>>><br>
>>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is<br>
>>> trying to be released three times. Please share your opinion regarding how<br>
>>> it should be fixed.<br>
>>><br>
>>><br>
>>> Thanks,<br>
>>><br>
>>> Alona.<br>
>>><br>
>>><br>
>>><br>
>>><br>
>>><br>
>>><br>
>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <<a href="mailto:danken@redhat.com">danken@redhat.com</a>> wrote:<br>
>>>><br>
>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <<a href="mailto:ehaas@redhat.com">ehaas@redhat.com</a>> wrote:<br>
>>>>><br>
>>>>><br>
>>>>><br>
>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <<a href="mailto:eedri@redhat.com">eedri@redhat.com</a>> wrote:<br>
>>>>>><br>
>>>>>> Was already done by Yaniv - <a href="https://gerrit.ovirt.org/#/c/89851" rel="noreferrer" target="_blank">https://gerrit.ovirt.org/#/c/<wbr>89851</a>.<br>
>>>>>> Is it still failing?<br>
>>>>>><br>
>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <<a href="mailto:bkorren@redhat.com">bkorren@redhat.com</a>><br>
>>>>>> wrote:<br>
>>>>>>><br>
>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <<a href="mailto:danken@redhat.com">danken@redhat.com</a>> wrote:<br>
>>>>>>> > No, I am afraid that we have not managed to understand why setting<br>
>>>>>>> > and<br>
>>>>>>> > ipv6 address too the host off the grid. We shall continue<br>
>>>>>>> > researching<br>
>>>>>>> > this next week.<br>
>>>>>>> ><br>
>>>>>>> > Edy, <a href="https://gerrit.ovirt.org/#/c/88637/" rel="noreferrer" target="_blank">https://gerrit.ovirt.org/#/c/<wbr>88637/</a> is already 4 weeks old,<br>
>>>>>>> > but<br>
>>>>>>> > could it possibly be related (I really doubt that)?<br>
>>>>>>> ><br>
>>>>><br>
>>>>><br>
>>>>> Sorry, but I do not see how this problem is related to VDSM.<br>
>>>>> There is nothing that indicates that there is a VDSM problem.<br>
>>>>><br>
>>>>> Has the RPC connection between Engine and VDSM failed?<br>
>>>>><br>
>>>><br>
>>>> Further up the thread, Piotr noticed that (at least on one failure of<br>
>>>> this test) that the Vdsm host lost connectivity to its storage, and Vdsm<br>
>>>> process was restarted. However, this does not seems to happen in all cases<br>
>>>> where this test fails.<br>
>>>><br>
>>>> ______________________________<wbr>_________________<br>
>>>> Devel mailing list<br>
>>>> <a href="mailto:Devel@ovirt.org">Devel@ovirt.org</a><br>
>>>> <a href="http://lists.ovirt.org/mailman/listinfo/devel" rel="noreferrer" target="_blank">http://lists.ovirt.org/<wbr>mailman/listinfo/devel</a><br>
>>><br>
>>><br>
>><br>
><br>
</div></div></blockquote></div><br></div></div>