<div dir="ltr"><div>hi<span style="font-weight:bold">,<br><br></span></div><div>we are failing randomly on test 006_migrations.migrate_vm with what seems to be the same issue. <br></div><div><br></div><div>the vm seems to be migrated successfully but engine thinks that it failed and re-calls migration getting a response of vm already exists. <br><br></div><div>I don't think this is an issue with the test but rather a regression so I opened a bug: <br><br><a href="https://bugzilla.redhat.com/show_bug.cgi?id=1566594">https://bugzilla.redhat.com/show_bug.cgi?id=1566594</a><br><br></div><div>Thanks, <br></div><div>Dafna<br><span style="font-weight:bold"><span style="font-weight:bold"><br></span></span></div><span style="font-weight:bold"></span></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal <span dir="ltr"><<a href="mailto:mzamazal@redhat.com" target="_blank">mzamazal@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Arik Hadas <<a href="mailto:ahadas@redhat.com">ahadas@redhat.com</a>> writes:<br>

> On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
>
>> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
>>
>>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>>>
>>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>>>
>>
>> Seems like another bug. The migration failed because, for some reason, the VM
>> is already defined on the destination host.
>>
>> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
>> error=Virtual machine already exists (api:129)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
>>     ret = func(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
>>     raise exception.VMExists()
>> VMExists: Virtual machine already exists
>>
>>
> Milan, Francesco, could it be that because of [1], which appears on the
> destination host right after shutting down the VM, the VM remained defined on
> that host?

I can't see any destroy call in the logs after the successful preceding
migration from the given host. That would explain the "VMExists" error.

> [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running
> VM callback (clientIF:683)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
>     v.onLibvirtLifecycleEvent(event, detail, None)
> AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'

That means that a life-cycle event for an unknown VM has arrived, in this
case apparently a destroy event, following the destroy call after the
failed incoming migration. The reported AttributeError is a minor bug,
already fixed in master, so it's most likely unrelated to the discussed
problem.

>>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Looking at the log, it seems that the new GetCapabilitiesAsync is
>>>> responsible for the mess.
>>>>
>>>> *- 08:29:47 - The engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
>>>>
>>>> *- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
<span class="">>>>><br>
>>>> * before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager,refreshImpl)<br>
>>>><br>
>>>> * "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by '<wbr>GetCapabilitiesAsyncVDSCommand<wbr>.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.<br>
>>>><br>
>>>> catch (Throwable t) {<br>
>>>> getParameters().getCallback().<wbr>onFailure(t);<br>
>>>> throw t;<br>
>>>> }<br>
>>>><br>
>>>> * The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh()-><wbr>afterRefreshTreatment()-> if (!succeeded) lockManager.releaseLock(<wbr>monitoringLock);')<br>
>>>>
>>>> * 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and *tries to release the already released lock*.
>>>>
>>>>   The following warning is printed to the log:
>>>>
>>>>   WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
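>>>>
>>>> To illustrate the double release, here is a minimal, self-contained sketch (assumed names, not the actual engine classes): the broker command's catch block calls the callback's 'onFailure', which releases the monitoring lock, and then re-throws, so the caller's own error handling releases the same lock a second time and triggers the warning.
>>>>
>>>>     import java.util.HashSet;
>>>>     import java.util.Set;
>>>>
>>>>     public class DoubleReleaseSketch {
>>>>
>>>>         // Stand-in for InMemoryLockManager: releasing a lock that is not
>>>>         // held only logs a warning.
>>>>         static class LockManager {
>>>>             private final Set<String> locks = new HashSet<>();
>>>>
>>>>             synchronized boolean acquire(String key) { return locks.add(key); }
>>>>
>>>>             synchronized void release(String key) {
>>>>                 if (!locks.remove(key)) {
>>>>                     System.out.println("WARN Trying to release exclusive lock "
>>>>                             + "which does not exist, lock key: '" + key + "'");
>>>>                 }
>>>>             }
>>>>         }
>>>>
>>>>         interface Callback { void onFailure(Throwable t); }
>>>>
>>>>         static final LockManager lockManager = new LockManager();
>>>>         static final String MONITORING_LOCK = "VDS_INIT"; // placeholder lock key
>>>>
>>>>         // Stand-in for the broker command: fails, notifies the callback,
>>>>         // then re-throws, like the catch block quoted above.
>>>>         static void executeVdsBrokerCommand(Callback callback) {
>>>>             try {
>>>>                 throw new RuntimeException("VDSNetworkException: Connection refused");
>>>>             } catch (Throwable t) {
>>>>                 callback.onFailure(t); // first release happens inside the callback
>>>>                 throw t;               // re-thrown, so the caller sees the failure too
>>>>             }
>>>>         }
>>>>
>>>>         // Stand-in for VdsManager.refreshImpl.
>>>>         static void refreshImpl() {
>>>>             boolean releaseLock = false;
>>>>             lockManager.acquire(MONITORING_LOCK);
>>>>             try {
>>>>                 executeVdsBrokerCommand(t -> lockManager.release(MONITORING_LOCK));
>>>>             } catch (Throwable t) {
>>>>                 releaseLock = true;    // the caller also decides to release on failure
>>>>             } finally {
>>>>                 if (releaseLock) {
>>>>                     lockManager.release(MONITORING_LOCK); // second release -> warning
>>>>                 }
>>>>             }
>>>>         }
>>>>
>>>>         public static void main(String[] args) {
>>>>             refreshImpl();
>>>>         }
>>>>     }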
>>>>
>>>> *- 08:30:51 - A successful getCapabilitiesAsync is sent.*
>>>>
>>>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>>>>
>>>> * SetupNetworks takes the monitoring lock.
>>>>
>>>> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: Vds timeout occured.*
>>>>
>>>> * When the first request is removed from the queue ('ResponseTracker.remove()'), *'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by SetupNetworks!).*
>>>>
>>>> * *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
>>>>
>>>> * The following warning log is printed:
>>>>   WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>>
>>>> *- 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it started.* Why? I'm not 100% sure, but I guess the late processing of the 'getCapabilitiesAsync' failure, which causes the loss of the monitoring lock, together with the late + multiple processing of the failure, is the root cause.
<div><div class="h5">>>>><br>
>>>><br>
>>>> Ravi, 'getCapabilitiesAsync' failure is treated twice and the lock is trying to be released three times. Please share your opinion regarding how it should be fixed.<br>
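>>>>
>>>> One possible direction, sketched here only as an assumption: make the failure handling idempotent, e.g. wrap the callback so that 'onFailure' (and with it the lock release) runs at most once per request, no matter how many times the broker command, refreshImpl or the response tracker report the same failure.
>>>>
>>>>     import java.util.concurrent.atomic.AtomicBoolean;
>>>>
>>>>     public class RunOnceCallbackSketch {
>>>>
>>>>         interface Callback { void onFailure(Throwable t); }
>>>>
>>>>         // Wraps a callback so repeated failure notifications are ignored
>>>>         // after the first one.
>>>>         static Callback runOnce(Callback delegate) {
>>>>             AtomicBoolean done = new AtomicBoolean(false);
>>>>             return t -> {
>>>>                 if (done.compareAndSet(false, true)) {
>>>>                     delegate.onFailure(t);
>>>>                 }
>>>>             };
>>>>         }
>>>>
>>>>         public static void main(String[] args) {
>>>>             Callback release = runOnce(t -> System.out.println("releasing monitoring lock once"));
>>>>             release.onFailure(new RuntimeException("first failure")); // releases
>>>>             release.onFailure(new RuntimeException("stale timeout")); // ignored
>>>>         }
>>>>     }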
>>>>
>>>> Thanks,
>>>> Alona.
>>>>
>>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com> wrote:
>>>>
>>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
>>>>>
>>>>>>
>>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
>>>>>>
>>>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>>>>> Is it still failing?
>>>>>>>
>>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com> wrote:
>>>>>>>
>>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>>>>>>>> > No, I am afraid that we have not managed to understand why setting
>>>>>>>> > an ipv6 address took the host off the grid. We shall continue
>>>>>>>> > researching this next week.
>>>>>>>> >
>>>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>>>>>>>> > but could it possibly be related (I really doubt that)?
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>> Sorry, but I do not see how this problem is related to VDSM.
>>>>>> There is nothing that indicates that there is a VDSM problem.
>>>>>>
>>>>>> Has the RPC connection between Engine and VDSM failed?
>>>>>>
>>>>>>
>>>>> Further up the thread, Piotr noticed that (at least on one failure of
>>>>> this test) the Vdsm host lost connectivity to its storage and the Vdsm
>>>>> process was restarted. However, this does not seem to happen in all cases
>>>>> where this test fails.
>>>>>
>>>>
>>>
>>> --
>>> *GAL bEN HAIM*
>>> RHV DEVOPS
>>>
>>
>>
>>
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel