[ovirt-devel] VDSM stopped while adding host to newly created cluster

Christopher Pereira kripper at imatronix.cl
Fri Mar 27 13:26:53 UTC 2015


On 27-03-2015 7:03, Dan Kenigsberg wrote:
> On Thu, Mar 26, 2015 at 06:16:24PM -0300, Christopher Pereira wrote:
>> Continuing with the 3.6 Night Builds testing...
>>
>> While hosted-engine-setup was adding the host to the newly created cluster,
>> VDSM crashed, probably because the gluster engine storage disappeared as in
>> BZ 1201355 [1]
>>
>> Facts:
>>      - the engine storage (/rhev/data-center/mmt/...) was umounted during
>> this process
>>      - another mount of the same volume was still mounted after the VDSM
>> crash (maybe the problem is not related with gluster)
> What exactly happened to vdsm? Did the process die? Why? Was it stopped?
> did it segfault? Did it stop responding? Can you share vdsm.log and
> /var/log/message showing what happened during the crash?
Hi Dan,

You will find the relevant logs here:
https://bugzilla.redhat.com/show_bug.cgi?id=1201355#c4

Summary:

1) During setup, VDSM receives a SIGTERM:
MainThread::DEBUG::2015-03-26 18:36:56,767::vdsm::66::vds::(sigtermHandler) Received signal 15

Maybe the activation process installs VDSM and/or restarts it.

2) Since the gluster storage is mounted from a VDSM ChildProcess, it 
disappears when VDSM stops.
Thus, the VM is paused and will never resume (even after remounting the 
storage, because the paused QEMU process keeps invalid file descriptors):
https://bugzilla.redhat.com/show_bug.cgi?id=1058300
https://bugzilla.redhat.com/show_bug.cgi?id=1172905

3) After VDSM has stopped, it cannot be restarted, because sanlock 
reports an "invalid lockspace" error.
This can be solved with hosted-engine --start-pool.

4) You can reproduce the VDSM SIGTERM with less effort (no 
need to re-deploy) by accessing the engine portal and reactivating the host.
You will see that VDSM gets stopped and the storage is lost.
As a workaround to keep the storage from getting lost, you can mount it 
manually so that it doesn't rely on the VDSM ChildProcess.
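
In case it helps anyone reproducing point 1, here is a minimal Python sketch for locating the signal entries in vdsm.log. It assumes the log keeps the one-line format of the entry quoted in point 1 (mail wrapping aside); the function name find_signals and the sample line are only illustrative, not part of VDSM.

```python
import re

# Matches VDSM log entries of the form quoted in point 1, e.g.
# MainThread::DEBUG::<date> <time>::vdsm::66::vds::(sigtermHandler) Received signal 15
SIGNAL_RE = re.compile(
    r"::(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::.*"
    r"\(sigtermHandler\) Received signal (?P<sig>\d+)"
)


def find_signals(lines):
    """Return a list of (timestamp, signal_number) for each sigtermHandler entry."""
    hits = []
    for line in lines:
        match = SIGNAL_RE.search(line)
        if match:
            hits.append((match.group("ts"), int(match.group("sig"))))
    return hits


# Sample input: the entry from point 1 plus an unrelated line (hypothetical).
sample_lines = [
    "MainThread::DEBUG::2015-03-26 18:36:56,767::vdsm::66::vds::"
    "(sigtermHandler) Received signal 15",
    "some unrelated log line",
]

for timestamp, signum in find_signals(sample_lines):
    print(timestamp, "signal", signum)
```

To run it against a real log, pass an open file object instead of sample_lines (the log is typically at /var/log/vdsm/vdsm.log). Comparing the timestamps with the engine's host activation time should show whether the activation is what triggers the SIGTERM.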

Questions:

1) I'm afraid that by activating the host manually after an interrupted 
setup I may be skipping some special configurations.
Is there any difference between activating the host manually from the 
web-manager and activating the host with the setup script?
How can I complete the setup manually?

Status:

I'm still unable to activate the host manually, because the engine is now 
having problems with the JsonRPC communication:

    2015-03-27 10:11:54,889 INFO
    [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp
    Reactor) [] Connecting to h2.imatronix.com/209.126.105.36
    2015-03-27 10:11:54,893 ERROR
    [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor)
    [] *Unable to process messages*
    2015-03-27 10:11:54,893 ERROR
    [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand]
    (DefaultQuartzScheduler_Worker-96) [] Command
    'ListVDSCommand(HostName = h2, HostId =
    46d4659a-4efe-4427-aa68-a4536508fa08,
    vds=Host[h2,46d4659a-4efe-4427-aa68-a4536508fa08])' execution
    failed: VDSGenericException: VDSNetworkException: General SSLEngine
    problem
    2015-03-27 10:11:54,894 ERROR
    [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl]
    (DefaultQuartzScheduler_Worker-96) [] Failed to invoke scheduled
    method vmsMonitoring: null

    2015-03-27 10:11:57,894 INFO
    [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp
    Reactor) [] Connecting to h2.imatronix.com/209.126.105.36
    2015-03-27 10:11:57,897 ERROR
    [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor)
    [] *Unable to process messages*
    2015-03-27 10:11:57,897 INFO
    [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95)
    [] Command
    'org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand' return
    value
    'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSInfoReturnForXmlRpc@79313585'
    2015-03-27 10:11:57,898 INFO
    [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95)
    [] HostName = h2
    2015-03-27 10:11:57,898 ERROR
    [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95)
    [] Command 'GetCapabilitiesVDSCommand(HostName = h2, HostId =
    46d4659a-4efe-4427-aa68-a4536508fa08,
    vds=Host[h2,46d4659a-4efe-4427-aa68-a4536508fa08])' execution
    failed: VDSGenericException: VDSNetworkException: *General SSLEngine
    problem*
    2015-03-27 10:11:57,898 ERROR
    [org.ovirt.engine.core.vdsbroker.HostMonitoring]
    (DefaultQuartzScheduler_Worker-95) [] Failure to refresh Vds runtime
    info: VDSGenericException: VDSNetworkException: General SSLEngine
    problem
    2015-03-27 10:11:57,898 ERROR
    [org.ovirt.engine.core.vdsbroker.HostMonitoring]
    (DefaultQuartzScheduler_Worker-95) [] Exception:
    org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException:
    VDSGenericException: VDSNetworkException: General SSLEngine problem
             at
    org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:16)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:55)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33)
    [dal.jar:]
             at
    org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:465)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:587)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:111)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:76)
    [vdsbroker.jar:]
             at
    org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:199)
    [vdsbroker.jar:]
             at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown
    Source) [:1.7.0_75]
             at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    [rt.jar:1.7.0_75]
             at java.lang.reflect.Method.invoke(Method.java:606)
    [rt.jar:1.7.0_75]
             at
    org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81)
    [scheduler.jar:]
             at
    org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52)
    [scheduler.jar:]
             at org.quartz.core.JobRunShell.run(JobRunShell.java:213)
    [quartz.jar:]
             at
    org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
    [quartz.jar:]

    2015-03-27 10:11:57,899 WARN
    [org.ovirt.engine.core.vdsbroker.VdsManager]
    (DefaultQuartzScheduler_Worker-95) [] Failed to refresh VDS, network
    error, continuing, vds='h2'(46d4659a-4efe-4427-aa68-a4536508fa08):
    VDSGenericException: VDSNetworkException: *General SSLEngine problem*
    [...]

On the VDSM side, we have:

    clientIFinit::DEBUG::2015-03-27
    10:11:53,098::task::592::Storage.TaskManager.Task::(_updateState)
    Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::moving from state init
    -> state preparing
    clientIFinit::INFO::2015-03-27
    10:11:53,098::logUtils::48::dispatcher::(wrapper) Run and protect:
    getConnectedStoragePoolsList(options=None)
    clientIFinit::INFO::2015-03-27
    10:11:53,098::logUtils::51::dispatcher::(wrapper) Run and protect:
    getConnectedStoragePoolsList, Return response: {'poollist': []}
    clientIFinit::DEBUG::2015-03-27
    10:11:53,098::task::1188::Storage.TaskManager.Task::(prepare)
    Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::finished: {'poollist': []}
    clientIFinit::DEBUG::2015-03-27
    10:11:53,098::task::592::Storage.TaskManager.Task::(_updateState)
    Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::moving from state
    preparing -> state finished
    clientIFinit::DEBUG::2015-03-27
    10:11:53,098::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll)
    Owner.releaseAll requests {} resources {}
    clientIFinit::DEBUG::2015-03-27
    10:11:53,098::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll)
    Owner.cancelAll requests {}
    clientIFinit::DEBUG::2015-03-27
    10:11:53,098::task::990::Storage.TaskManager.Task::(_decref)
    Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::ref 0 aborting False
    Detector thread::DEBUG::2015-03-27
    10:11:53,450::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection)
    *Adding connection 209.239.124.8:54218*
    Detector thread::DEBUG::2015-03-27
    10:11:53,459::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake)
    *Error during handshake: sslv3 alert certificate unknown*
    Detector thread::DEBUG::2015-03-27
    10:11:53,459::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection)
    Removing connection 209.239.124.8:54218
    Detector thread::DEBUG::2015-03-27
    10:11:55,249::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection)
    Adding connection 209.126.113.73:54119
    Detector thread::DEBUG::2015-03-27
    10:11:55,252::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake)
    Error during handshake: unexpected eof
    Detector thread::DEBUG::2015-03-27
    10:11:55,252::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection)
    Removing connection 209.126.113.73:54119
    Detector thread::DEBUG::2015-03-27
    10:11:56,582::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection)
    Adding connection 209.239.124.8:39606
    Detector thread::DEBUG::2015-03-27
    10:11:56,629::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake)
    Error during handshake: sslv3 alert certificate unknown
    Detector thread::DEBUG::2015-03-27
    10:11:56,629::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection)
    Removing connection 209.239.124.8:39606
    clientIFinit::DEBUG::2015-03-27
    10:11:58,104::task::592::Storage.TaskManager.Task::(_updateState)
    Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::moving from state init
    -> state preparing
    clientIFinit::INFO::2015-03-27
    10:11:58,104::logUtils::48::dispatcher::(wrapper) Run and protect:
    getConnectedStoragePoolsList(options=None)
    clientIFinit::INFO::2015-03-27
    10:11:58,104::logUtils::51::dispatcher::(wrapper) Run and protect:
    getConnectedStoragePoolsList, Return response: {'poollist': []}
    clientIFinit::DEBUG::2015-03-27
    10:11:58,104::task::1188::Storage.TaskManager.Task::(prepare)
    Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::finished: {'poollist': []}
    clientIFinit::DEBUG::2015-03-27
    10:11:58,104::task::592::Storage.TaskManager.Task::(_updateState)
    Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::moving from state
    preparing -> state finished
    clientIFinit::DEBUG::2015-03-27
    10:11:58,104::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll)
    Owner.releaseAll requests {} resources {}
    clientIFinit::DEBUG::2015-03-27
    10:11:58,104::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll)
    Owner.cancelAll requests {}
    clientIFinit::DEBUG::2015-03-27
    10:11:58,104::task::990::Storage.TaskManager.Task::(_decref)
    Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::ref 0 aborting False
    [...]

I guess this is related to an invalid certificate or some protocol 
version mismatch.
How can I fix it?
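
To see which peers fail the handshake and how often, something like the following Python sketch could summarize the VDSM log. It rests on an assumption: a handshake error line carries no address, so the sketch attributes it to the most recent "Adding connection" entry, which holds for the excerpt above but may not hold if connections interleave. The function name handshake_errors_by_peer is hypothetical, and the sample lines are the unwrapped entries from the excerpt.

```python
import re
from collections import Counter

ADD_RE = re.compile(r"\(_add_connection\) Adding connection (\S+):(\d+)")
ERR_RE = re.compile(r"\(_process_handshake\) Error during handshake: (.+)")


def handshake_errors_by_peer(lines):
    """Count handshake errors per (peer_ip, error_message).

    Assumption: each error belongs to the last connection that was added.
    """
    counts = Counter()
    last_peer = None
    for line in lines:
        added = ADD_RE.search(line)
        if added:
            last_peer = added.group(1)
            continue
        error = ERR_RE.search(line)
        if error:
            counts[(last_peer, error.group(1).strip())] += 1
    return counts


prefix = "Detector thread::DEBUG::2015-03-27 "
sample_lines = [
    prefix + "10:11:53,450::protocoldetector::201::vds.MultiProtocolAcceptor::"
    "(_add_connection) Adding connection 209.239.124.8:54218",
    prefix + "10:11:53,459::protocoldetector::225::vds.MultiProtocolAcceptor::"
    "(_process_handshake) Error during handshake: sslv3 alert certificate unknown",
    prefix + "10:11:55,249::protocoldetector::201::vds.MultiProtocolAcceptor::"
    "(_add_connection) Adding connection 209.126.113.73:54119",
    prefix + "10:11:55,252::protocoldetector::225::vds.MultiProtocolAcceptor::"
    "(_process_handshake) Error during handshake: unexpected eof",
    prefix + "10:11:56,582::protocoldetector::201::vds.MultiProtocolAcceptor::"
    "(_add_connection) Adding connection 209.239.124.8:39606",
    prefix + "10:11:56,629::protocoldetector::225::vds.MultiProtocolAcceptor::"
    "(_process_handshake) Error during handshake: sslv3 alert certificate unknown",
]

for (peer, message), count in sorted(handshake_errors_by_peer(sample_lines).items()):
    print(peer, message, count)
```

On the excerpt this separates the "certificate unknown" failures from the "unexpected eof" one by source address, which may help tell whether a single client is rejecting the certificate.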

Regards,
Christopher
