Arik Hadas <ahadas(a)redhat.com> writes:
On Wed, Apr 11, 2018 at 12:45 PM, Alona Kaplan <alkaplan(a)redhat.com> wrote:
>
>
> On Tue, Apr 10, 2018 at 6:52 PM, Gal Ben Haim <gbenhaim(a)redhat.com> wrote:
>
>> I'm seeing the same error in [1], during 006_migrations.migrate_vm.
>>
>> [1] http://jenkins.ovirt.org/job/ovirt-4.2_change-queue-tester/1650/
>>
>
> Seems like another bug. The migration failed since for some reason the VM
> is already defined on the destination host.
>
> 2018-04-10 11:08:08,685-0400 ERROR (jsonrpc/0) [api] FINISH create error=Virtual machine already exists (api:129)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 122, in method
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/API.py", line 191, in create
>     raise exception.VMExists()
> VMExists: Virtual machine already exists
>
>
Milan, Francesco, could it be that because of [1], which appears on the
destination host right after shutting down the VM, it remained defined on
that host?
I can't see any destroy call in the logs after the preceding successful
migration from the given host. That would explain the “VMExists” error.
[1] 2018-04-10 11:01:40,005-0400 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:683)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 646, in dispatchLibvirtEvents
    v.onLibvirtLifecycleEvent(event, detail, None)
AttributeError: 'NoneType' object has no attribute 'onLibvirtLifecycleEvent'
That means that a lifecycle event arrived for an unknown VM, in this case
apparently a destroy event, following the destroy call after the failed
incoming migration. The reported AttributeError is a minor bug, already
fixed in master, so it's most likely unrelated to the discussed problem.
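For illustration only, the shape of the guard is just to look the VM up
and ignore events for VMs that are no longer tracked. The real handler is
Python in vdsm's clientIF; the sketch below uses hypothetical names and is
not the actual fix:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch only; the real handler is Python in vdsm/clientIF.py
    // and all names here are hypothetical.
    public class LifecycleDispatchSketch {

        interface Vm {
            void onLifecycleEvent(int event, int detail);
        }

        // VM id -> tracked VM object; a destroyed VM is no longer in this map.
        private final Map<String, Vm> trackedVms = new ConcurrentHashMap<>();

        void dispatchLifecycleEvent(String vmId, int event, int detail) {
            Vm vm = trackedVms.get(vmId);
            if (vm == null) {
                // Event for a VM we no longer track, e.g. the destroy event that
                // follows the destroy call after a failed incoming migration.
                // Log and ignore instead of dereferencing a missing object.
                System.out.println("Ignoring event " + event + " for unknown VM " + vmId);
                return;
            }
            vm.onLifecycleEvent(event, detail);
        }

        public static void main(String[] args) {
            // No VM registered, so the event is ignored instead of blowing up.
            new LifecycleDispatchSketch().dispatchLifecycleEvent("vm-1", 5, 0); // arbitrary event/detail ids
        }
    }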
>> On Tue, Apr 10, 2018 at 4:14 PM, Alona Kaplan <alkaplan(a)redhat.com> wrote:
>>
>>> Hi all,
>>>
>>> Looking at the log it seems that the new GetCapabilitiesAsync is
>>> responsible for the mess.
>>>
>>> - *08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.*
>>>
>>>
>>>
>>> *- Every 3 seconds a getCapabilitiesAsync request is sent to the host
>>> (unsuccessfully).*
>>>
>>> * Before each "getCapabilitiesAsync" the monitoring lock is taken
>>> (VdsManager.refreshImpl).
>>>
>>> * "getCapabilitiesAsync" immediately fails and throws
>>> 'VDSNetworkException: java.net.ConnectException: Connection refused'.
>>> The exception is caught by
>>> 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand', which calls
>>> 'onFailure' of the callback and re-throws the exception.
>>>
>>> catch (Throwable t) {
>>>     getParameters().getCallback().onFailure(t);
>>>     throw t;
>>> }
>>>
>>> * The 'onFailure' of the callback releases the "monitoringLock"
>>> ('postProcessRefresh() -> afterRefreshTreatment() ->
>>> if (!succeeded) lockManager.releaseLock(monitoringLock);')
>>>
>>> * 'VdsManager.refreshImpl' catches the network exception, marks
>>> 'releaseLock = true' and *tries to release the already released lock*.
>>>
>>> The following warning is printed to the log -
>>>
>>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release
>>> exclusive lock which does not exist, lock key:
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
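In other words, a single synchronous failure runs two cleanup paths over
the same lock: once in the callback's 'onFailure' and once in
'refreshImpl' itself. A self-contained toy sketch of that flow (simplified
stand-ins with hypothetical names, not the actual engine classes):

    import java.util.HashSet;
    import java.util.Set;

    // Toy model of the double release described above.
    public class DoubleReleaseSketch {

        // Stand-in for the lock manager: just the set of held lock keys.
        static final Set<String> HELD = new HashSet<>();

        static void acquireLock(String key) { HELD.add(key); }

        static void releaseLock(String key) {
            if (!HELD.remove(key)) {
                System.out.println("WARN Trying to release exclusive lock which does"
                        + " not exist, lock key: '" + key + "'");
            }
        }

        // Stand-in for the callback's onFailure: here it only releases the lock.
        static void onFailure(String lockKey, Throwable t) {
            releaseLock(lockKey);
        }

        // Stand-in for executeVdsBrokerCommand: the failure is reported to the
        // callback AND re-thrown to the caller, so both clean up.
        static void getCapabilitiesAsync(String lockKey) {
            try {
                throw new RuntimeException("java.net.ConnectException: Connection refused");
            } catch (RuntimeException t) {
                onFailure(lockKey, t);       // 1st release, inside the callback
                throw t;
            }
        }

        public static void main(String[] args) {
            String monitoringLock = "ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT";
            acquireLock(monitoringLock);     // refreshImpl takes the monitoring lock
            try {
                getCapabilitiesAsync(monitoringLock);
            } catch (RuntimeException t) {
                releaseLock(monitoringLock); // 2nd release, in refreshImpl -> WARN
            }
        }
    }

Running it prints the same "Trying to release exclusive lock which does
not exist" warning on the second release.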
>>>
>>>
>>>
>>>
>>> *- 08:30:51 a successful getCapabilitiesAsync is sent.*
>>>
>>>
>>> *- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).*
>>>
>>> * SetupNetworks takes the monitoring lock.
>>>
>>> *- 08:33:00 - ResponseTracker cleans the getCapabilitiesAsync requests
>>> from 4 minutes ago from its queue and prints a VDSNetworkException:
>>> Vds timeout occured.*
>>>
>>> * When the first request is removed from the queue
>>> ('ResponseTracker.remove()'), the *'Callback.onFailure' is invoked (for
>>> the second time) -> the monitoring lock is released (the lock taken by
>>> SetupNetworks!).*
>>>
>>> * *The other requests removed from the queue also try to release the
>>> monitoring lock*, but there is nothing to release.
>>>
>>> * The following warning log is printed -
>>> WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager]
>>> (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release
>>> exclusive lock which does not exist, lock key:
>>> 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
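The release apparently goes only by the lock key ('<vdsId>VDS_INIT'), with
no notion of which flow currently holds the lock, so the stale release
coming from the expired monitoring request frees the lock that
SetupNetworks has just taken. A toy sketch of the difference between
key-only and owner-checked release (hypothetical structures, not the
engine's InMemoryLockManager API):

    import java.util.HashMap;
    import java.util.Map;

    // Toy comparison of key-only vs. owner-checked release.
    public class OwnerCheckedReleaseSketch {

        // Lock key -> the flow that acquired it.
        static final Map<String, String> HELD = new HashMap<>();

        static boolean acquire(String key, String owner) {
            return HELD.putIfAbsent(key, owner) == null;
        }

        // Key-only release: a stale caller can free someone else's lock.
        static void releaseByKey(String key) {
            HELD.remove(key);
        }

        // Owner-checked release: a stale caller cannot free a lock it does not own.
        static void releaseByOwner(String key, String owner) {
            if (!HELD.remove(key, owner)) {
                System.out.println("WARN stale release of '" + key + "' by " + owner + " ignored");
            }
        }

        public static void main(String[] args) {
            String key = "ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT";

            acquire(key, "SetupNetworks");     // 08:32:55 - SetupNetworks takes the lock
            releaseByKey(key);                 // 08:33:00 - stale monitoring onFailure frees it
            System.out.println("held after key-only release: " + HELD);

            acquire(key, "SetupNetworks");
            releaseByOwner(key, "monitoring"); // the stale release is now rejected
            System.out.println("held after owner-checked release: " + HELD);
        }
    }

With owner tracking, the stale release at 08:33:00 would be rejected and
SetupNetworks would keep its lock.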
>>>
>>> - *08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started*.
>>> Why? I'm not 100% sure, but I guess the root cause is the late processing
>>> of 'getCapabilitiesAsync', which causes the loss of the monitoring lock,
>>> together with the late and multiple processing of the failure.
>>>
>>>
>>> Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are
>>> three attempts to release the lock. Please share your opinion on how it
>>> should be fixed.
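One possible direction, just to make the discussion concrete (a sketch
with hypothetical names, not a patch against the actual engine classes):
let the command's failure callback act at most once, so the late expiry in
ResponseTracker cannot trigger a second release. Making the lock release
owner-aware, as sketched above, would be a complementary option.

    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch of an "at most once" failure callback; hypothetical wrapper,
    // not existing engine code.
    public class OnceOnlyCallbackSketch {

        private final Runnable onFailureAction;  // e.g. releasing the monitoring lock
        private final AtomicBoolean failureHandled = new AtomicBoolean(false);

        OnceOnlyCallbackSketch(Runnable onFailureAction) {
            this.onFailureAction = onFailureAction;
        }

        void onFailure(Throwable t) {
            // Only the first caller (synchronous catch or later timeout) cleans up;
            // any duplicate invocation becomes a no-op.
            if (failureHandled.compareAndSet(false, true)) {
                onFailureAction.run();
            }
        }

        public static void main(String[] args) {
            OnceOnlyCallbackSketch cb =
                    new OnceOnlyCallbackSketch(() -> System.out.println("releasing monitoring lock"));
            cb.onFailure(new RuntimeException("Connection refused"));   // synchronous failure
            cb.onFailure(new RuntimeException("Vds timeout occured"));  // late expiry, now a no-op
        }
    }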
>>>
>>>
>>> Thanks,
>>>
>>> Alona.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Apr 8, 2018 at 1:21 PM, Dan Kenigsberg <danken(a)redhat.com> wrote:
>>>
>>>> On Sun, Apr 8, 2018 at 9:21 AM, Edward Haas <ehaas(a)redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Sun, Apr 8, 2018 at 9:15 AM, Eyal Edri <eedri(a)redhat.com> wrote:
>>>>>
>>>>>> Was already done by Yaniv - https://gerrit.ovirt.org/#/c/89851.
>>>>>> Is it still failing?
>>>>>>
>>>>>> On Sun, Apr 8, 2018 at 8:59 AM, Barak Korren <bkorren(a)redhat.com> wrote:
>>>>>>
>>>>>>> On 7 April 2018 at 00:30, Dan Kenigsberg <danken(a)redhat.com> wrote:
>>>>>>> > No, I am afraid that we have not managed to understand why setting
>>>>>>> > an ipv6 address took the host off the grid. We shall continue
>>>>>>> > researching this next week.
>>>>>>> >
>>>>>>> > Edy, https://gerrit.ovirt.org/#/c/88637/ is already 4 weeks old,
>>>>>>> > but could it possibly be related (I really doubt that)?
>>>>>>> >
>>>>>>>
>>>>>>
>>>>> Sorry, but I do not see how this problem is related to VDSM.
>>>>> There is nothing that indicates that there is a VDSM problem.
>>>>>
>>>>> Has the RPC connection between Engine and VDSM failed?
>>>>>
>>>>>
>>>> Further up the thread, Piotr noticed that (at least on one failure of
>>>> this test) the Vdsm host lost connectivity to its storage, and the Vdsm
>>>> process was restarted. However, this does not seem to happen in all
>>>> cases where this test fails.
>>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> *GAL BEN HAIM*
>> RHV DEVOPS
>>
>
>
> _______________________________________________
> Devel mailing list
> Devel(a)ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
>