Hi,
We are failing randomly on test 006_migrations.migrate_vm with what seems
to be the same issue.
The VM seems to be migrated successfully, but the engine thinks it failed
and re-sends the migration request, getting a "VM already exists" response.
I don't think this is an issue with the test but rather a regression, so I
opened a bug:
Thanks,
Dafna
On Wed, Apr 11, 2018 at 1:52 PM, Milan Zamazal <mzamazal(a)redhat.com> wrote:
Arik Hadas <ahadas(a)redhat.com> writes:
> On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkaplan(a)redhat.com> wrote:
>
>>
>>
>> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim(a)redhat.com> wrote:
>>
>>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>>>
>>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>>>
>>
>> Seems like another bug. The migration failed since for some reason the VM
>> is already defined on the destination host.
>>
>> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:129)
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
>>     ret = func(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
>>     raise exception.VMExists()
>> VMExists: Virtual machine already exists
>>
>>
> Milan, Francesco, could it be that, because of [1], which appears on the
> destination host right after shutting down the VM, the VM remained defined on
> that host?
I can't see any destroy call in the logs after the successful preceding
migration from the given host. That would explain the “VMExists” error.
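As a conceptual illustration of that chain (vdsm itself is Python; the class
and method names below are invented for the sketch, not real vdsm code): an
incoming migration creates the VM object on the destination, and if no destroy
ever removed the entry left behind by an earlier migration, the create fails
with "VM already exists".

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Conceptual sketch only: why a missing destroy after a previous
    // migration makes a later incoming migration fail with "VM already exists".
    public class VmRegistrySketch {
        private final Map<String, Object> runningVms = new ConcurrentHashMap<>();

        // An incoming migration starts by creating the VM object on the destination.
        void create(String vmId) {
            if (runningVms.putIfAbsent(vmId, new Object()) != null) {
                throw new IllegalStateException("Virtual machine already exists: " + vmId);
            }
        }

        // After a successful outgoing migration, a destroy call is expected to
        // remove the entry on the source. If it never arrives, the stale entry
        // makes any later incoming create fail.
        void destroy(String vmId) {
            runningVms.remove(vmId);
        }
    }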
> [1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683)
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
>     v.onLibvirtLifecycleEvent(event, detail, None)
> AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
That means that a life cycle event for an unknown VM has arrived, in this
case apparently a destroy event, following the destroy call issued after the
failed incoming migration. The reported AttributeError is a minor bug,
already fixed in master, so it's most likely unrelated to the discussed
problem.
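For completeness, the kind of guard that avoids such an AttributeError, shown
as a conceptual sketch with invented names (the actual fix in vdsm master may
look different, and vdsm itself is Python):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Conceptual sketch only: drop life cycle events for VMs that are no
    // longer tracked instead of dereferencing a missing entry.
    public class LifecycleEventDispatchSketch {

        interface Vm {
            void onLibvirtLifecycleEvent(int event, int detail);
        }

        private final Map<String, Vm> vms = new ConcurrentHashMap<>();

        void dispatchLifecycleEvent(String vmId, int event, int detail) {
            Vm vm = vms.get(vmId);
            if (vm == null) {
                // The VM was already removed, e.g. destroyed after a failed
                // incoming migration; log and ignore the late event.
                System.out.println("Ignoring life cycle event for unknown VM " + vmId);
                return;
            }
            vm.onLibvirtLifecycleEvent(event, detail);
        }
    }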
>>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan(a)redhat.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Looking at the log, it seems that the new GetCapabilitiesAsync is
>>>> responsible for the mess.
>>>>
>>>> - 08:29:47 - The engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
>>>>
>>>> - Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
>>>>
>>>>     * Before each "getCapabilitiesAsync" the monitoring lock is taken (VdsManager.refreshImpl).
>>>>
>>>>     * "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls 'onFailure' of the callback and re-throws the exception:
>>>>
>>>>           catch (Throwable t) {
>>>>               getParameters().getCallback().onFailure(t);
>>>>               throw t;
>>>>           }
>>>>
>>>>     * The 'onFailure' of the callback releases the "monitoringLock" ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded) lockManager.releaseLock(monitoringLock);').
>>>>
>>>>     * 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already released lock.
>>>>
>>>>       The following warning is printed to the log:
>>>>
>>>>       WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>>
>>>> - 08:30:51 - A successful getCapabilitiesAsync is sent.
>>>>
>>>> - 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
>>>>
>>>>     * SetupNetworks takes the monitoring lock.
>>>>
>>>> - 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests from 4 minutes ago from its queue and prints a VDSNetworkException: "Vds timeout occurred".
>>>>
>>>>     * When the first request is removed from the queue ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the second time) and the monitoring lock is released - the lock taken by SetupNetworks! (see the sketch after this timeline)
>>>>
>>>>     * The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
>>>>
>>>>     * The following warning is printed to the log:
>>>>
>>>>       WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
>>>>
>>>> - 08:33:00 - SetupNetworks fails on a timeout ~4 seconds after it started. Why? I'm not 100% sure, but I guess the root cause is the late processing of 'getCapabilitiesAsync', which causes the loss of the monitoring lock, together with the late and multiple processing of the failure.
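>>>>
>>>> To make the race concrete, here is a minimal standalone sketch (invented class names, not the real engine code) of a lock manager that tracks exclusive locks only by key, driven through the sequence above:
>>>>
>>>>     import java.util.Map;
>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>
>>>>     // Sketch: the lock is tracked only by key, so any caller can release
>>>>     // a lock that was acquired by a different flow.
>>>>     public class MonitoringLockRaceSketch {
>>>>
>>>>         static class KeyOnlyLockManager {
>>>>             private final Map<String, String> locks = new ConcurrentHashMap<>();
>>>>
>>>>             boolean acquire(String key, String owner) {
>>>>                 return locks.putIfAbsent(key, owner) == null;
>>>>             }
>>>>
>>>>             void release(String key) {
>>>>                 // Release purely by key, regardless of who acquired the lock.
>>>>                 if (locks.remove(key) == null) {
>>>>                     System.out.println("WARN Trying to release exclusive lock which does not exist, lock key: " + key);
>>>>                 }
>>>>             }
>>>>         }
>>>>
>>>>         public static void main(String[] args) {
>>>>             KeyOnlyLockManager lockManager = new KeyOnlyLockManager();
>>>>             String key = "ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT";
>>>>
>>>>             // 08:29:47 - a failed getCapabilitiesAsync cycle.
>>>>             lockManager.acquire(key, "VdsManager.refreshImpl");
>>>>             lockManager.release(key);  // onFailure releases the lock
>>>>             lockManager.release(key);  // refreshImpl releases again -> warning
>>>>
>>>>             // 08:32:55 - SetupNetworks takes the same lock key.
>>>>             lockManager.acquire(key, "SetupNetworks");
>>>>
>>>>             // 08:33:00 - ResponseTracker times out the stale requests and
>>>>             // invokes onFailure again -> SetupNetworks' lock is silently released.
>>>>             lockManager.release(key);
>>>>             lockManager.release(key);  // remaining stale requests -> warning again
>>>>         }
>>>>     }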
>>>>
>>>>
>>>> Ravi, the 'getCapabilitiesAsync' failure is handled twice, and the lock is released (or an attempt is made to release it) three times. Please share your opinion regarding how this should be fixed.
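>>>>
>>>> One possible direction (just a rough sketch with invented names, not a proposed patch) is to make the failure handling of the async command idempotent, so that no matter how many times 'onFailure' ends up being invoked, the monitoring lock is released at most once per acquisition:
>>>>
>>>>     import java.util.concurrent.atomic.AtomicBoolean;
>>>>
>>>>     // Sketch: the callback remembers whether it already completed, so
>>>>     // duplicated error handling and late ResponseTracker timeouts become no-ops.
>>>>     public class IdempotentFailureCallbackSketch {
>>>>
>>>>         interface LockManager {
>>>>             void releaseLock(String lockKey);
>>>>         }
>>>>
>>>>         static class MonitoringCallback {
>>>>             private final AtomicBoolean finished = new AtomicBoolean(false);
>>>>             private final LockManager lockManager;
>>>>             private final String monitoringLock;
>>>>
>>>>             MonitoringCallback(LockManager lockManager, String monitoringLock) {
>>>>                 this.lockManager = lockManager;
>>>>                 this.monitoringLock = monitoringLock;
>>>>             }
>>>>
>>>>             void onFailure(Throwable t) {
>>>>                 // Only the first invocation releases the lock; any later
>>>>                 // invocation of the failure path does nothing.
>>>>                 if (finished.compareAndSet(false, true)) {
>>>>                     lockManager.releaseLock(monitoringLock);
>>>>                 }
>>>>             }
>>>>         }
>>>>     }
>>>>
>>>> Checking lock ownership in the lock manager before releasing could be another angle; either way, the release should happen at most once and only by the flow that took the lock.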
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alona.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken(a)redhat.com> wrote:
>>>>
>>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas(a)redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri(a)redhat.com> wrote:
>>>>>>
>>>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>>>>> Is it still failing?
>>>>>>>
>>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren(a)redhat.com> wrote:
>>>>>>>
>>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken(a)redhat.com> wrote:
>>>>>>>> > No, I am afraid that we have not managed to understand why setting an
>>>>>>>> > ipv6 address took the host off the grid. We shall continue researching
>>>>>>>> > this next week.
>>>>>>>> >
>>>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old, but
>>>>>>>> > could it possibly be related (I really doubt that)?
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>> Sorry, but I do not see how this problem is related to VDSM.
>>>>>> There is nothing that indicates that there is a VDSM problem.
>>>>>>
>>>>>> Has the RPC connection between Engine and VDSM failed?
>>>>>>
>>>>>>
>>>>> Further up the thread, Piotr noticed that (at least on one failure of
>>>>> this test) the Vdsm host lost connectivity to its storage and the Vdsm
>>>>> process was restarted. However, this does not seem to happen in all cases
>>>>> where this test fails.
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Gal Ben Haim
>>> RHV DEVOPS
>>>
>>
>>
_______________________________________________
Devel mailing list
Devel(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel