[ovirt-devel] VDSM stopped while adding host to newly created cluster
Christopher Pereira
kripper at imatronix.cl
Fri Mar 27 13:26:53 UTC 2015
On 27-03-2015 7:03, Dan Kenigsberg wrote:
> On Thu, Mar 26, 2015 at 06:16:24PM -0300, Christopher Pereira wrote:
>> Continuing with the 3.6 Night Builds testing...
>>
>> While hosted-engine-setup was adding the host to the newly created cluster,
>> VDSM crashed, probably because the gluster engine storage disappeared as in
>> BZ 1201355 [1]
>>
>> Facts:
>> - the engine storage (/rhev/data-center/mmt/...) was umounted during
>> this process
>> - another mount of the same volume was still mounted after the VDSM
>> crash (maybe the problem is not related with gluster)
> What exactly happened to vdsm? Did the process die? Why? Was it stopped?
> did it segfault? Did it stop responding? Can you share vdsm.log and
> /var/log/message showing what happened during the crash?
Hi Dan,
You will find the relevant logs here:
https://bugzilla.redhat.com/show_bug.cgi?id=1201355#c4
Summary:
1) During setup, VDSM receives a SIGTERM:
MainThread::DEBUG::2015-03-26
18:36:56,767::vdsm::66::vds::(sigtermHandler) Received signal 15
Maybe the activation process installs VDSM and/or restarts it.
2) Since the gluster storage is mounted from a VDSM ChildProcess, it
disappears when VDSM stops.
Thus, the VM is paused and will never resume (even after remounting the
storage, because the paused QEMU process keeps invalid file descriptors):
https://bugzilla.redhat.com/show_bug.cgi?id=1058300
https://bugzilla.redhat.com/show_bug.cgi?id=1172905
3) After VDSM has stopped, it's not possible to restart it, since you
will get an "invalid lockspace" error from sanlock.
This can be solved with hosted-engine --start-pool.
4) You can reproduce the VDSM SIGTERM with less effort (no need to
re-deploy) by accessing the engine portal and reactivating the host.
You will see that VDSM gets stopped and the storage is lost.
As a workaround to keep the storage from getting lost, you can mount it
manually so that it doesn't rely on the VDSM ChildProcess.
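To confirm when VDSM was signalled while reproducing point 4, something like the following could be used to scan vdsm.log. This is only a rough sketch: the regular expression is derived from the single sigtermHandler entry quoted in point 1 (assuming each entry sits on one line in the real file), and find_signals is a made-up helper name, not part of VDSM.

```python
import re

# Pattern built from the one entry quoted above. Assumption: in the log file
# itself each entry is a single line (the mail client wrapped it here).
SIGNAL_RE = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r"::.*\(sigtermHandler\) Received signal (?P<signum>\d+)"
)


def find_signals(lines):
    """Return (timestamp, signal number) for every sigtermHandler entry."""
    hits = []
    for line in lines:
        match = SIGNAL_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), int(match.group("signum"))))
    return hits


# Two unwrapped entries taken from the excerpts in this mail.
sample = [
    "MainThread::DEBUG::2015-03-26 18:36:56,767::vdsm::66::vds::"
    "(sigtermHandler) Received signal 15",
    "clientIFinit::INFO::2015-03-27 10:11:53,098::logUtils::48::dispatcher::"
    "(wrapper) Run and protect: getConnectedStoragePoolsList(options=None)",
]

print(find_signals(sample))
```

On a real host the same function could be fed an open log file instead of the sample list. Comparing the reported timestamps with the moment the host was reactivated in the portal should show whether the activation is what stops VDSM.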
Questions:
1) I'm afraid that by activating the host manually after an interrupted
setup I may be skipping some special configurations.
Is there any difference between activating the host manually from the
web-manager and activating the host with the setup script?
How can I complete the setup manually?
Status:
I'm still unable to activate the host manually, because the engine is
now having problems with the JsonRPC communication:
2015-03-27 10:11:54,889 INFO
[org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp
Reactor) [] Connecting to h2.imatronix.com/209.126.105.36
2015-03-27 10:11:54,893 ERROR
[org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor)
[] *Unable to process messages*
2015-03-27 10:11:54,893 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand]
(DefaultQuartzScheduler_Worker-96) [] Command
'ListVDSCommand(HostName = h2, HostId =
46d4659a-4efe-4427-aa68-a4536508fa08,
vds=Host[h2,46d4659a-4efe-4427-aa68-a4536508fa08])' execution
failed: VDSGenericException: VDSNetworkException: General SSLEngine
problem
2015-03-27 10:11:54,894 ERROR
[org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl]
(DefaultQuartzScheduler_Worker-96) [] Failed to invoke scheduled
method vmsMonitoring: null
2015-03-27 10:11:57,894 INFO
[org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp
Reactor) [] Connecting to h2.imatronix.com/209.126.105.36
2015-03-27 10:11:57,897 ERROR
[org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor)
[] *Unable to process messages*
2015-03-27 10:11:57,897 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95)
[] Command
'org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand' return
value
'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSInfoReturnForXmlRpc at 79313585'
2015-03-27 10:11:57,898 INFO
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95)
[] HostName = h2
2015-03-27 10:11:57,898 ERROR
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95)
[] Command 'GetCapabilitiesVDSCommand(HostName = h2, HostId =
46d4659a-4efe-4427-aa68-a4536508fa08,
vds=Host[h2,46d4659a-4efe-4427-aa68-a4536508fa08])' execution
failed: VDSGenericException: VDSNetworkException: *General SSLEngine
problem*
2015-03-27 10:11:57,898 ERROR
[org.ovirt.engine.core.vdsbroker.HostMonitoring]
(DefaultQuartzScheduler_Worker-95) [] Failure to refresh Vds runtime
info: VDSGenericException: VDSNetworkException: General SSLEngine
problem
2015-03-27 10:11:57,898 ERROR
[org.ovirt.engine.core.vdsbroker.HostMonitoring]
(DefaultQuartzScheduler_Worker-95) [] Exception:
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
VDSGenericException: VDSNetworkException: General SSLEngine problem
at
org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:16)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:55)
[vdsbroker.jar:]
at
org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
[dal.jar:]
at
org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:465)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:587)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:111)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:76)
[vdsbroker.jar:]
at
org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:199)
[vdsbroker.jar:]
at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown
Source) [:1.7.0_75]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[rt.jar:1.7.0_75]
at java.lang.reflect.Method.invoke(Method.java:606)
[rt.jar:1.7.0_75]
at
org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81)
[scheduler.jar:]
at
org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
[scheduler.jar:]
at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
[quartz.jar:]
at
org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
[quartz.jar:]
2015-03-27 10:11:57,899 WARN
[org.ovirt.engine.core.vdsbroker.VdsManager]
(DefaultQuartzScheduler_Worker-95) [] Failed to refresh VDS, network
error, continuing, vds='h2'(46d4659a-4efe-4427-aa68-a4536508fa08):
VDSGenericException: VDSNetworkException: *General SSLEngine problem*
[...]
On the VDSM side, we have:
clientIFinit::DEBUG::2015-03-27
10:11:53,098::task::592::Storage.TaskManager.Task::(_updateState)
Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::moving from state init
-> state preparing
clientIFinit::INFO::2015-03-27
10:11:53,098::logUtils::48::dispatcher::(wrapper) Run and protect:
getConnectedStoragePoolsList(options=None)
clientIFinit::INFO::2015-03-27
10:11:53,098::logUtils::51::dispatcher::(wrapper) Run and protect:
getConnectedStoragePoolsList, Return response: {'poollist': []}
clientIFinit::DEBUG::2015-03-27
10:11:53,098::task::1188::Storage.TaskManager.Task::(prepare)
Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::finished: {'poollist': []}
clientIFinit::DEBUG::2015-03-27
10:11:53,098::task::592::Storage.TaskManager.Task::(_updateState)
Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::moving from state
preparing -> state finished
clientIFinit::DEBUG::2015-03-27
10:11:53,098::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll)
Owner.releaseAll requests {} resources {}
clientIFinit::DEBUG::2015-03-27
10:11:53,098::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll)
Owner.cancelAll requests {}
clientIFinit::DEBUG::2015-03-27
10:11:53,098::task::990::Storage.TaskManager.Task::(_decref)
Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::ref 0 aborting False
Detector thread::DEBUG::2015-03-27
10:11:53,450::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection)
*Adding connection 209.239.124.8:54218*
Detector thread::DEBUG::2015-03-27
10:11:53,459::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake)
*Error during handshake: sslv3 alert certificate unknown*
Detector thread::DEBUG::2015-03-27
10:11:53,459::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection)
Removing connection 209.239.124.8:54218
Detector thread::DEBUG::2015-03-27
10:11:55,249::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection)
Adding connection 209.126.113.73:54119
Detector thread::DEBUG::2015-03-27
10:11:55,252::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake)
Error during handshake: unexpected eof
Detector thread::DEBUG::2015-03-27
10:11:55,252::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection)
Removing connection 209.126.113.73:54119
Detector thread::DEBUG::2015-03-27
10:11:56,582::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection)
Adding connection 209.239.124.8:39606
Detector thread::DEBUG::2015-03-27
10:11:56,629::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake)
Error during handshake: sslv3 alert certificate unknown
Detector thread::DEBUG::2015-03-27
10:11:56,629::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection)
Removing connection 209.239.124.8:39606
clientIFinit::DEBUG::2015-03-27
10:11:58,104::task::592::Storage.TaskManager.Task::(_updateState)
Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::moving from state init
-> state preparing
clientIFinit::INFO::2015-03-27
10:11:58,104::logUtils::48::dispatcher::(wrapper) Run and protect:
getConnectedStoragePoolsList(options=None)
clientIFinit::INFO::2015-03-27
10:11:58,104::logUtils::51::dispatcher::(wrapper) Run and protect:
getConnectedStoragePoolsList, Return response: {'poollist': []}
clientIFinit::DEBUG::2015-03-27
10:11:58,104::task::1188::Storage.TaskManager.Task::(prepare)
Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::finished: {'poollist': []}
clientIFinit::DEBUG::2015-03-27
10:11:58,104::task::592::Storage.TaskManager.Task::(_updateState)
Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::moving from state
preparing -> state finished
clientIFinit::DEBUG::2015-03-27
10:11:58,104::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll)
Owner.releaseAll requests {} resources {}
clientIFinit::DEBUG::2015-03-27
10:11:58,104::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll)
Owner.cancelAll requests {}
clientIFinit::DEBUG::2015-03-27
10:11:58,104::task::990::Storage.TaskManager.Task::(_decref)
Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::ref 0 aborting False
[...]
I guess this is related to an invalid certificate or some protocol
version mismatch.
How can I fix it?
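To see which peers fail the handshake and why, the protocoldetector entries from the VDSM log above can be tallied per source address. This is a sketch under two assumptions: each entry is a single line in the real log, and the detector thread logs "Adding connection" immediately before the matching "Error during handshake" (as in the excerpt), so an error is attributed to the most recently added peer. handshake_errors_by_peer is a hypothetical helper, not a VDSM function.

```python
import re
from collections import Counter

ADD_RE = re.compile(
    r"\(_add_connection\) Adding connection (?P<ip>[\d.]+):(?P<port>\d+)"
)
ERR_RE = re.compile(
    r"\(_process_handshake\) Error during handshake: (?P<reason>.+)$"
)


def handshake_errors_by_peer(lines):
    """Count handshake errors keyed by (peer IP, error reason)."""
    counts = Counter()
    last_peer = None  # assumption: errors belong to the latest added peer
    for line in lines:
        line = line.strip()
        added = ADD_RE.search(line)
        if added:
            last_peer = added.group("ip")
            continue
        error = ERR_RE.search(line)
        if error and last_peer is not None:
            counts[(last_peer, error.group("reason"))] += 1
    return counts


# Unwrapped entries from the excerpt above (emphasis markers removed).
PREFIX = "Detector thread::DEBUG::2015-03-27 "
ACCEPTOR = "vds.MultiProtocolAcceptor::"
sample = [
    PREFIX + "10:11:53,450::protocoldetector::201::" + ACCEPTOR
    + "(_add_connection) Adding connection 209.239.124.8:54218",
    PREFIX + "10:11:53,459::protocoldetector::225::" + ACCEPTOR
    + "(_process_handshake) Error during handshake: "
    + "sslv3 alert certificate unknown",
    PREFIX + "10:11:55,249::protocoldetector::201::" + ACCEPTOR
    + "(_add_connection) Adding connection 209.126.113.73:54119",
    PREFIX + "10:11:55,252::protocoldetector::225::" + ACCEPTOR
    + "(_process_handshake) Error during handshake: unexpected eof",
    PREFIX + "10:11:56,582::protocoldetector::201::" + ACCEPTOR
    + "(_add_connection) Adding connection 209.239.124.8:39606",
    PREFIX + "10:11:56,629::protocoldetector::225::" + ACCEPTOR
    + "(_process_handshake) Error during handshake: "
    + "sslv3 alert certificate unknown",
]

for (peer, reason), count in handshake_errors_by_peer(sample).most_common():
    print(peer, reason, count)
```

For the excerpt this gives two "certificate unknown" errors from 209.239.124.8 and one "unexpected eof" from 209.126.113.73, so the two peers may be failing for different reasons.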
Regards,
Christopher