Hi,

We are failing randomly on test 006_migrations.migrate_vm with what seems to be the same issue.

The VM seems to be migrated successfully, but the engine thinks the migration failed and re-calls it, getting a "virtual machine already exists" response.

I don't think this is an issue with the test but rather a regression, so I opened a bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1566594

Thanks,
Dafna


On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal <mzamazal@redhat.com> wrote:
Arik Hadas <ahadas@redhat.com> writes:

> On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkaplan@redhat.com> wrote:
>
>>
>>
>> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim@redhat.com> wrote:
>>
>>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>>>
>>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>>>
>>
>> This seems like another bug. The migration failed because, for some reason, the VM
>> is already defined on the destination host.
>>
>> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create
>> error=Virtual machine already exists (api:129)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
>>     ret = func(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
>>     raise exception.VMExists()
>> VMExists: Virtual machine already exists
>>
>>
> Milan, Francesco, could it be that, because of [1], which appears on the
> destination host right after shutting down the VM, the VM remained defined on
> that host?

I can't see any destroy call in the logs after the successful preceding
migration from the given host.  That would explain the “VMExists” error.

> [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running
> VM callback (clientIF:683)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in
> dispatchLibvirtEvents
>     v.onLibvirtLifecycleEvent(event, detail, None)
> AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'

That means that a lifecycle event for an unknown VM has arrived, in this
case apparently a destroy event, following the destroy call after the
failed incoming migration.  The reported AttributeError is a minor bug,
already fixed in master, so it's most likely unrelated to the discussed
problem.

>>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan@redhat.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Looking at the log, it seems that the new GetCapabilitiesAsync is
>>>> responsible for the mess.
>>>>
>>>> *- 08:29:47 - The engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
>>>>
>>>>
>>>>
>>>> *- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).*
>>>>
>>>>      * Before each "getCapabilitiesAsync" call, the monitoring lock is taken (VdsManager.refreshImpl).
>>>>
>>>>      * "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
>>>>
>>>>          catch (Throwable t) {
>>>>             getParameters().getCallback().onFailure(t);
>>>>             throw t;
>>>>          }
>>>>
>>>>     * The callback's 'onFailure' releases the "monitoringLock" ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded) lockManager.releaseLock(monitoringLock);').
>>>>
>>>>     * 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true', and *tries to release the already released lock* (see the sketch below).
>>>>
>>>>       The following warning is printed to the log -
>>>>
>>>>       WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
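>>>>
>>>>       Below is a minimal, hypothetical Java sketch of that double release; the
>>>>       LockManager, Callback, getCapabilitiesAsync and refreshImpl names are
>>>>       illustrative stand-ins for the engine pieces mentioned above, not the
>>>>       actual engine code:
>>>>
>>>>       import java.util.HashSet;
>>>>       import java.util.Set;
>>>>
>>>>       public class DoubleReleaseSketch {
>>>>
>>>>           // Toy lock manager: releasing a key that is not held only logs a warning,
>>>>           // mirroring "Trying to release exclusive lock which does not exist".
>>>>           static class LockManager {
>>>>               private final Set<String> held = new HashSet<>();
>>>>               void acquire(String key) { held.add(key); }
>>>>               void release(String key) {
>>>>                   if (!held.remove(key)) {
>>>>                       System.out.println("WARN Trying to release exclusive lock which does not exist, lock key: " + key);
>>>>                   }
>>>>               }
>>>>           }
>>>>
>>>>           interface Callback { void onFailure(Throwable t); }
>>>>
>>>>           // Stand-in for GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand:
>>>>           // the failure is reported to the callback *and* re-thrown to the caller.
>>>>           static void getCapabilitiesAsync(Callback callback) {
>>>>               try {
>>>>                   throw new RuntimeException("VDSNetworkException: Connection refused");
>>>>               } catch (RuntimeException t) {
>>>>                   callback.onFailure(t);   // first release of the monitoring lock
>>>>                   throw t;                 // propagates to the caller
>>>>               }
>>>>           }
>>>>
>>>>           // Stand-in for VdsManager.refreshImpl: takes the monitoring lock, issues
>>>>           // the call, and on a network exception releases the lock itself as well.
>>>>           static void refreshImpl(LockManager locks, String monitoringLock) {
>>>>               locks.acquire(monitoringLock);
>>>>               try {
>>>>                   getCapabilitiesAsync(t -> locks.release(monitoringLock));
>>>>               } catch (RuntimeException e) {
>>>>                   locks.release(monitoringLock);   // second release -> the warning above
>>>>               }
>>>>           }
>>>>
>>>>           public static void main(String[] args) {
>>>>               refreshImpl(new LockManager(), "ecf53d69-...VDS_INIT");
>>>>           }
>>>>       }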
>>>>
>>>>
>>>>
>>>>
>>>> *- 08:30:51 - A successful getCapabilitiesAsync is sent.*
>>>>
>>>>
>>>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>>>>
>>>>     * SetupNetworks takes the monitoring lock.
>>>>
>>>> *- 08:33:00 - ResponseTracker removes the getCapabilitiesAsync requests from 4 minutes earlier from its queue and prints 'VDSNetworkException: Vds timeout occured'.*
>>>>
>>>>       * When the first request is removed from the queue ('ResponseTracker.remove()'), the
>>>> *'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by SetupNetworks!).*
>>>>
>>>>       * *The other requests removed from the queue also try to release the monitoring lock*, but there is nothing to release.
>>>>
>>>>       * The following warning log is printed -
>>>>         WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>>
>>>> - *08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it started*. Why? I'm not 100% sure, but I guess the root cause is the late processing of the 'getCapabilitiesAsync' failures, which causes the loss of the monitoring lock, together with the late and multiple processing of that failure (see the sketch below).
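>>>>
>>>>       A similarly hypothetical sketch of the 08:32:55 - 08:33:00 sequence; the queue
>>>>       standing in for ResponseTracker, the callbacks and the lock bookkeeping below
>>>>       are simplified assumptions, not the real engine classes:
>>>>
>>>>       import java.util.ArrayDeque;
>>>>       import java.util.HashSet;
>>>>       import java.util.Queue;
>>>>       import java.util.Set;
>>>>       import java.util.function.Consumer;
>>>>
>>>>       public class LateTimeoutSketch {
>>>>
>>>>           // Same toy semantics as before: releasing a key that is not held only logs a warning.
>>>>           static void release(Set<String> held, String key) {
>>>>               if (!held.remove(key)) {
>>>>                   System.out.println("WARN Trying to release exclusive lock which does not exist, lock key: " + key);
>>>>               }
>>>>           }
>>>>
>>>>           public static void main(String[] args) {
>>>>               Set<String> heldLocks = new HashSet<>();
>>>>               String monitoringLock = "ecf53d69-...VDS_INIT";
>>>>
>>>>               // 08:29:47 - 08:30:51: each failed getCapabilitiesAsync leaves its failure
>>>>               // callback queued in the response tracker, although onFailure already ran once.
>>>>               Queue<Consumer<Throwable>> responseTracker = new ArrayDeque<>();
>>>>               for (int i = 0; i < 3; i++) {
>>>>                   responseTracker.add(t -> release(heldLocks, monitoringLock));
>>>>               }
>>>>
>>>>               // 08:32:55: SetupNetworks takes the monitoring lock for its own operation.
>>>>               heldLocks.add(monitoringLock);
>>>>
>>>>               // 08:33:00: the stale requests time out and the callbacks run a second time.
>>>>               // The first one releases the lock now owned by SetupNetworks; the remaining
>>>>               // ones find nothing to release and only print the warning.
>>>>               RuntimeException timeout = new RuntimeException("VDSNetworkException: Vds timeout occured");
>>>>               while (!responseTracker.isEmpty()) {
>>>>                   responseTracker.poll().accept(timeout);
>>>>               }
>>>>           }
>>>>       }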
>>>>
>>>>
>>>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion on how this should be fixed.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alona.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken@redhat.com>
>>>> wrote:
>>>>
>>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas@redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri@redhat.com> wrote:
>>>>>>
>>>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>>>>> Is it still failing?
>>>>>>>
>>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren@redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken@redhat.com> wrote:
>>>>>>>> > No, I am afraid that we have not managed to understand why setting
>>>>>>>> > an ipv6 address took the host off the grid. We shall continue
>>>>>>>> > researching this next week.
>>>>>>>> >
>>>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>>>>>>>> > could it possibly be related (I really doubt that)?
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>> Sorry, but I do not see how this problem is related to VDSM.
>>>>>> There is nothing that indicates a VDSM problem.
>>>>>>
>>>>>> Has the RPC connection between Engine and VDSM failed?
>>>>>>
>>>>>>
>>>>> Further up the thread, Piotr noticed that (at least on one failure of
>>>>> this test) the Vdsm host lost connectivity to its storage and the Vdsm
>>>>> process was restarted. However, this does not seem to happen in all cases
>>>>> where this test fails.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> *Gal Ben Haim*
>>> RHV DevOps
>>>
>>
>>
>>
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel