Looking at the log, it seems that the new GetCapabilitiesAsync flow is responsible for the mess.
- 08:29:47 - engine loses connectivity to host 'lago-basic-suite-4-2-host-0'.
- Every 3 seconds a getCapabilitiesAsync request is sent to the host (unsuccessfully).
* Before each "getCapabilitiesAsync" request, the monitoring lock is taken ('VdsManager.refreshImpl').
* "getCapabilitiesAsync" immediately fails and throws 'VDSNetworkException: java.net.ConnectException: Connection refused'. The exception is caught by 'GetCapabilitiesAsyncVDSCommand.executeVdsBrokerCommand' which calls 'onFailure' of the callback and re-throws the exception.
    catch (Throwable t) {
        getParameters().getCallback().onFailure(t);
        throw t;
    }
* The callback's 'onFailure' releases the monitoring lock ('postProcessRefresh() -> afterRefreshTreatment() -> if (!succeeded) lockManager.releaseLock(monitoringLock);').
* 'VdsManager.refreshImpl' catches the network exception, marks 'releaseLock = true' and tries to release the already-released lock (a minimal sketch of this double release follows below).
The following warning is printed to the log - WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
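To make the first double release easier to follow, here is a minimal, self-contained sketch of the pattern. The names (DoubleReleaseDemo, LockManager, Callback) are illustrative stand-ins and not the real engine classes; only the order of the calls mirrors what 'VdsManager.refreshImpl' and 'GetCapabilitiesAsyncVDSCommand' do as described above:

import java.util.HashSet;
import java.util.Set;

public class DoubleReleaseDemo {

    // Simplified stand-in for the engine's in-memory lock manager; like
    // InMemoryLockManager in the log above, it warns when asked to release
    // a lock it does not hold.
    static class LockManager {
        private final Set<String> locks = new HashSet<>();

        void acquire(String key) {
            locks.add(key);
        }

        void release(String key) {
            if (!locks.remove(key)) {
                System.out.println("WARN Trying to release exclusive lock which does not exist, lock key: '" + key + "'");
            }
        }
    }

    // Stand-in for the async command's failure callback.
    interface Callback {
        void onFailure(Throwable t);
    }

    public static void main(String[] args) {
        LockManager lockManager = new LockManager();
        String monitoringLock = "<hostId>VDS_INIT";

        // The callback's onFailure releases the monitoring lock, as the
        // postProcessRefresh() -> afterRefreshTreatment() path does.
        Callback callback = t -> lockManager.release(monitoringLock);

        // refreshImpl(): take the monitoring lock, send the request,
        // release the lock again if the request throws.
        lockManager.acquire(monitoringLock);
        try {
            try {
                // The request fails immediately ("Connection refused").
                throw new RuntimeException("VDSNetworkException: Connection refused");
            } catch (Throwable t) {
                callback.onFailure(t); // first release, via the callback
                throw t;
            }
        } catch (RuntimeException e) {
            lockManager.release(monitoringLock); // second release -> "does not exist" warning
        }
    }
}

Running this prints the "Trying to release exclusive lock which does not exist" warning once, because the same lock is released both by the callback and by the caller's catch block.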
- 08:30:51 - a successful getCapabilitiesAsync request is sent.
- 08:32:55 - The failing test starts (Setup Networks for setting ipv6).
* SetupNetworks takes the monitoring lock.
- 08:33:00 - ResponseTracker removes the stale getCapabilitiesAsync requests (queued since the connectivity loss at 08:29:47) from its queue and prints a VDSNetworkException: Vds timeout occured.
* When the first request is removed from the queue ('ResponseTracker.remove()'), 'Callback.onFailure' is invoked (for the second time) -> the monitoring lock is released (the lock taken by SetupNetworks!).
* The other requests removed from the queue also try to release the monitoring lock, but there is nothing to release.
* The following warning log is printed -
WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engineScheduled-Thread-14) [] Trying to release exclusive lock which does not exist, lock key: 'ecf53d69-eb68-4b11-8df2-c4aa4e19bd93VDS_INIT'
- 08:33:00 - SetupNetworks fails on timeout ~4 seconds after it started. Why? I'm not 100% sure, but my guess is that the root cause is the late processing of the 'getCapabilitiesAsync' failure, which releases the monitoring lock out from under SetupNetworks, combined with the failure being processed late and multiple times (a sketch of this sequence follows below).
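Here is the same kind of sketch for this second, more harmful sequence: when ResponseTracker expires the old requests, the stale callback's onFailure runs again and releases the monitoring lock that SetupNetworks acquired in the meantime. As before, the names are illustrative stand-ins rather than the real engine classes:

import java.util.HashSet;
import java.util.Set;

public class LateCallbackDemo {

    // Same simplified stand-in for the in-memory lock manager as above.
    static class LockManager {
        private final Set<String> locks = new HashSet<>();

        void acquire(String key) {
            locks.add(key);
        }

        void release(String key) {
            if (!locks.remove(key)) {
                System.out.println("WARN Trying to release exclusive lock which does not exist, lock key: '" + key + "'");
            }
        }

        boolean isHeld(String key) {
            return locks.contains(key);
        }
    }

    interface Callback {
        void onFailure(Throwable t);
    }

    public static void main(String[] args) {
        LockManager lockManager = new LockManager();
        String monitoringLock = "<hostId>VDS_INIT";

        // Failure callback of the old getCapabilitiesAsync requests; its
        // onFailure already ran once at 08:29, but the requests are still
        // sitting in the response tracker's queue.
        Callback staleCallback = t -> lockManager.release(monitoringLock);

        // 08:32:55 - SetupNetworks acquires the monitoring lock.
        lockManager.acquire(monitoringLock);

        // 08:33:00 - the tracker expires the first stale request and invokes
        // onFailure a second time, releasing the lock SetupNetworks holds.
        staleCallback.onFailure(new RuntimeException("Vds timeout occured"));
        System.out.println("SetupNetworks still holds its lock? " + lockManager.isHeld(monitoringLock)); // false

        // The remaining expired requests try to release as well -> warning.
        staleCallback.onFailure(new RuntimeException("Vds timeout occured"));
    }
}

So any flow that happens to hold the monitoring lock when the stale requests expire loses it silently, which matches the SetupNetworks timeout described above.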
Ravi, the 'getCapabilitiesAsync' failure is handled twice and there are three attempts to release the lock. Please share your opinion on how this should be fixed.
Thanks,
Alona.