Self Hosted Engine Gluster Support

Hi, I would like to get involved in testing "Self Hosted Engine Gluster Support". Is the code available in the nightly builds? What is the status, and who can give me some initial directions? Thanks. PS: I have over 28 years of coding experience and my goal is to build a stable oVirt setup (using Gluster) and promote it to Chilean government organizations (my company is a certified government provider for data-center-related services).

On 18/03/2015 03:45, Christopher Pereira wrote:
Hi,
I would like to get involved in testing "Self Hosted Engine Gluster Support". Is the code available in the nightly builds? What is the status and who can give me some initial directions?
Hi Christopher, nice to meet you, and happy to see your interest in Self Hosted Engine Gluster Support. The feature descriptions are available in the wiki [1][2]. The external Gluster support can already be tested using the master nightly snapshot [3]. For the Hyper Converged support, a patch has been submitted [4] and builds are available in Jenkins:

http://jenkins.ovirt.org/job/ovirt-hosted-engine-setup_master_create-rpms-fc... : SUCCESS
http://jenkins.ovirt.org/job/ovirt-hosted-engine-setup_master_create-rpms-fc... : SUCCESS
http://jenkins.ovirt.org/job/ovirt-hosted-engine-setup_master_create-rpms-el... : SUCCESS
http://jenkins.ovirt.org/job/ovirt-hosted-engine-setup_master_create-rpms-el... : SUCCESS

I suggest testing it with a VDSM build that includes libgfapi support, currently experimental in Jenkins: http://jenkins.ovirt.org/search/?q=libgfapi

The last issue I encountered while testing the Hyper Converged support is Bug 1201355 - [HC] Hosted Engine storage domains disappear while running ovirt-host-deploy in Hyper Converged configuration.

Help with testing, debugging and coding is welcome!
Thanks.
PS: I have over 28 years of coding experience and my goal is to build a stable oVirt setup (using Gluster) and promote it to Chilean government organizations (my company is a certified government provider for data-center-related services).
[1] http://www.ovirt.org/Features/Self_Hosted_Engine_Gluster_Support
[2] http://www.ovirt.org/Features/Self_Hosted_Engine_Hyper_Converged_Gluster_Sup...
[3] http://www.ovirt.org/Install_nightly_snapshot
[4] https://gerrit.ovirt.org/36108

-- Sandro Bonazzola

Continuing with the 3.6 nightly builds testing...

While hosted-engine-setup was adding the host to the newly created cluster, VDSM crashed, probably because the gluster engine storage disappeared as in BZ 1201355 [1].

Facts:
- the engine storage (/rhev/data-center/mnt/...) was unmounted during this process
- another mount of the same volume was still mounted after the VDSM crash (so the problem may not be related to gluster)

After doing a "hosted-engine --connect-storage", the volume is mounted again. Now, when trying to restart VDSM, I get an "invalid lockspace":

Thread-46::ERROR::2015-03-26 19:24:31,843::vm::1237::vm.Vm::(_startUnderlyingVm) vmId=`191045ac-79e4-4ce8-aad7-52cc9af313c5`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 1185, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 2253, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 126, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3427, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: Failed to acquire lock: No space left on device
Thread-46::INFO::2015-03-26 19:24:31,844::vm::1709::vm.Vm::(setDownStatus) vmId=`191045ac-79e4-4ce8-aad7-52cc9af313c5`::Changed state to Down: Failed to acquire lock: No space left on device (code=1)
Thread-46::DEBUG::2015-03-26 19:24:31,844::vmchannels::214::vds::(unregister) Delete fileno 60 from listener.
VM Channels Listener::DEBUG::2015-03-26 19:24:32,346::vmchannels::121::vds::(_do_del_channels) fileno 60 was removed from listener.

In sanlock.log we have:

2015-03-26 19:24:30+0000 7589 [752]: cmd 9 target pid 9559 not found
2015-03-26 19:24:31+0000 7589 [764]: r7 cmd_acquire 2,8,9559 invalid lockspace found -1 failed 935819904 name 7ba46e75-51af-4648-becc-5a469cb8e9c2

(All 3 lease files are present.)

This problem is similar to BZ 1201355 reported by Sandro [1].

About the hosted-engine VM not being resumed after restarting VDSM, please check [2] and [3] (duplicates). I confirmed that QEMU does not reopen the file descriptors when resuming a paused VM, which explains those issues.

Now, how can I fix the "invalid lockspace"?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1201355
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1172905
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1058300
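For reference, a minimal diagnostic sketch of the checks described above (is the gluster engine volume still mounted, are the lease files visible, which lockspaces does sanlock actually hold). The mount root and the hosted-engine lease file names are assumptions based on a typical 3.6 setup and may differ on other installations:

    import os
    import subprocess

    # Assumed root where VDSM mounts file-based storage domains.
    MOUNT_ROOT = "/rhev/data-center/mnt"

    # 1. Is the gluster engine volume still mounted?
    with open("/proc/mounts") as f:
        gluster_mounts = [line.split()[:2] for line in f if "fuse.glusterfs" in line]
    for device, mountpoint in gluster_mounts:
        print("gluster mount: %s -> %s" % (device, mountpoint))

    # 2. Are the hosted-engine lease files visible? (file names assumed from a 3.6 setup)
    for root, dirs, files in os.walk(MOUNT_ROOT):
        for name in files:
            if name in ("hosted-engine.lockspace", "hosted-engine.metadata"):
                print("lease file present: " + os.path.join(root, name))

    # 3. Which lockspaces are actually registered with sanlock? VDSM can only acquire
    #    the VM lease if the storage domain lockspace shows up here.
    print(subprocess.check_output(["sanlock", "client", "status"]))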

On 26-03-2015 18:16, Christopher Pereira wrote:
Now, how can I fix the "invalid lockspace"?

hosted-engine --start-pool + --connect-storage + --vm-start solved the invalid lockspace problem and made it possible to restart the engine VM, but hosted-engine-setup is unable to resume:
2015-03-26 21:56:42 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in non_responsive state
2015-03-26 21:56:44 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in non_responsive state
2015-03-26 21:56:46 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in non_responsive state
[...]

Then, inside the web manager, I tried to activate the host manually, and it seems the engine stops VDSM:

Mar 26 22:01:10 h2 systemd: Stopping Virtual Desktop Server Manager...
Mar 26 22:01:10 h2 journal: vdsm IOProcessClient ERROR IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
    raise Exception("FD closed")
Exception: FD closed

And the engine storage disappeared again. Then, since QEMU doesn't reopen the (invalid) file descriptors, the VM can't be resumed (BZ 1058300).

Is it normal that 1) the engine stops VDSM during activation and 2) VDSM stops or restarts the storage during shutdown?
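For anyone hitting the same state, this is roughly the recovery sequence mentioned above, written out as a small Python wrapper around the hosted-engine CLI. It is a sketch of the manual steps, not an official procedure; the option names are the ones used in this thread on a 3.6 nightly install:

    import subprocess

    # Recovery sequence after VDSM was stopped and sanlock reported "invalid lockspace".
    # Each step shells out to the hosted-engine CLI.
    RECOVERY_STEPS = [
        ["hosted-engine", "--start-pool"],       # re-initialize the storage pool / lockspace
        ["hosted-engine", "--connect-storage"],  # remount the engine storage domain
        ["hosted-engine", "--vm-start"],         # start the engine VM again
    ]

    for step in RECOVERY_STEPS:
        print("running: " + " ".join(step))
        subprocess.check_call(step)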

On 26-03-2015 19:24, Christopher Pereira wrote:
Is it normal that 1) the engine stops VDSM during activation and 2) that VDSM stops or restarts the storage during shutdown?
Well, no wonder that the engine storage disappears, since the gluster process is a child process of the vdsmd.service:

CGroup: /system.slice/vdsmd.service
├─14895 /usr/bin/python /usr/share/vdsm/vdsm
├─14964 /usr/libexec/ioprocess --read-pipe-fd 43 --write-pipe-fd 42 --max-threads 10 --ma
├─15836 /usr/libexec/ioprocess --read-pipe-fd 49 --write-pipe-fd 48 --max-threads 10 --ma
├─15911 /usr/sbin/glusterfs --volfile-server=h2.imatronix.com --volfile-id=engine /rhev/d  <==
└─15922 /usr/libexec/ioprocess --read-pipe-fd 61 --write-pipe-fd 60 --max-threads 10 --ma

A workaround is to mount the engine storage manually, so that restarting vdsmd on the host doesn't kill the engine VM.
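As a concrete illustration of that workaround, a small sketch of mounting the engine volume outside the vdsmd cgroup. The volfile server, volume name and target path are placeholders taken from this thread and need to be adapted:

    import subprocess

    # Placeholders from this environment -- adjust to your gluster volume and mount target.
    VOLFILE_SERVER = "h2.imatronix.com"
    VOLUME = "engine"
    TARGET = "/mnt/engine-storage"

    # Mounting from a normal shell (or a systemd unit independent of vdsmd) means the
    # glusterfs client process is no longer a child of vdsmd.service, so restarting VDSM
    # does not tear down the engine storage underneath the running engine VM.
    subprocess.check_call(["mkdir", "-p", TARGET])
    subprocess.check_call([
        "mount", "-t", "glusterfs",
        "%s:/%s" % (VOLFILE_SERVER, VOLUME),
        TARGET,
    ])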

On Thu, Mar 26, 2015 at 06:16:24PM -0300, Christopher Pereira wrote:
Continuing with the 3.6 nightly builds testing...
While hosted-engine-setup was adding the host to the newly created cluster, VDSM crashed, probably because the gluster engine storage disappeared as in BZ 1201355 [1]
Facts:
- the engine storage (/rhev/data-center/mnt/...) was unmounted during this process
- another mount of the same volume was still mounted after the VDSM crash (so the problem may not be related to gluster)
What exactly happened to vdsm? Did the process die? Why? Was it stopped? Did it segfault? Did it stop responding? Can you share vdsm.log and /var/log/messages showing what happened during the crash?
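For reference, a minimal sketch of one way to pull the requested window out of vdsm.log around the crash time. The crash timestamp and the vdsm.log path are placeholders (based on the logs quoted in this thread); /var/log/messages uses a different timestamp format and would need its own pattern:

    import datetime
    import re

    # Placeholder: approximate crash time to center the log window on.
    CRASH_TIME = datetime.datetime(2015, 3, 26, 18, 37, 0)
    WINDOW = datetime.timedelta(minutes=5)

    # vdsm.log timestamps look like "2015-03-26 18:36:56,767".
    VDSM_TS = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

    def window_lines(path):
        """Yield log lines whose timestamp falls within WINDOW of CRASH_TIME."""
        with open(path) as f:
            for line in f:
                m = VDSM_TS.search(line)
                if not m:
                    continue
                ts = datetime.datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                if abs(ts - CRASH_TIME) <= WINDOW:
                    yield line.rstrip()

    for line in window_lines("/var/log/vdsm/vdsm.log"):
        print(line)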

On 27-03-2015 7:03, Dan Kenigsberg wrote:
On Thu, Mar 26, 2015 at 06:16:24PM -0300, Christopher Pereira wrote:
Continuing with the 3.6 nightly builds testing...
While hosted-engine-setup was adding the host to the newly created cluster, VDSM crashed, probably because the gluster engine storage disappeared as in BZ 1201355 [1]
Facts:
- the engine storage (/rhev/data-center/mnt/...) was unmounted during this process
- another mount of the same volume was still mounted after the VDSM crash (so the problem may not be related to gluster)

What exactly happened to vdsm? Did the process die? Why? Was it stopped? Did it segfault? Did it stop responding? Can you share vdsm.log and /var/log/messages showing what happened during the crash?

Hi Dan,
You will find the relevant logs here: https://bugzilla.redhat.com/show_bug.cgi?id=1201355#c4

Summary:

1) During setup, VDSM receives a SIGTERM:
MainThread::DEBUG::2015-03-26 18:36:56,767::vdsm::66::vds::(sigtermHandler) Received signal 15
Maybe the activation process installs VDSM and/or restarts it.

2) Since the gluster storage is mounted from a VDSM child process, it disappears when VDSM stops. Thus, the VM is paused and will never resume (even after remounting the storage, because the paused QEMU process keeps invalid file descriptors):
https://bugzilla.redhat.com/show_bug.cgi?id=1058300
https://bugzilla.redhat.com/show_bug.cgi?id=1172905

3) After VDSM has stopped, it's not possible to restart it, since you will get an "invalid lockspace" in sanlock. This can be solved with hosted-engine --start-pool.

4) You will be able to reproduce the VDSM SIGTERM with less effort (no need to re-deploy) by accessing the engine portal and reactivating the host. You will see that VDSM gets stopped and the storage is lost. As a workaround, to avoid losing the storage, you can mount it manually so that it doesn't rely on the VDSM child process.

Questions:

1) I'm afraid that by activating the host manually after an interrupted setup I may be skipping some special configuration. Is there any difference between activating the host manually from the web manager and activating the host with the setup script? How can I complete the setup manually?

Status:

I'm still unable to activate the host manually, because the engine is now having problems with the JSON-RPC communication:

2015-03-27 10:11:54,889 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to h2.imatronix.com/209.126.105.36
2015-03-27 10:11:54,893 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages
2015-03-27 10:11:54,893 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-96) [] Command 'ListVDSCommand(HostName = h2, HostId = 46d4659a-4efe-4427-aa68-a4536508fa08, vds=Host[h2,46d4659a-4efe-4427-aa68-a4536508fa08])' execution failed: VDSGenericException: VDSNetworkException: General SSLEngine problem
2015-03-27 10:11:54,894 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler_Worker-96) [] Failed to invoke scheduled method vmsMonitoring: null
2015-03-27 10:11:57,894 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to h2.imatronix.com/209.126.105.36
2015-03-27 10:11:57,897 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages
2015-03-27 10:11:57,897 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95) [] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand' return value 'org.ovirt.engine.core.vdsbroker.vdsbroker.VDSInfoReturnForXmlRpc@79313585'
2015-03-27 10:11:57,898 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95) [] HostName = h2
2015-03-27 10:11:57,898 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-95) [] Command 'GetCapabilitiesVDSCommand(HostName = h2, HostId = 46d4659a-4efe-4427-aa68-a4536508fa08, vds=Host[h2,46d4659a-4efe-4427-aa68-a4536508fa08])' execution failed: VDSGenericException: VDSNetworkException: General SSLEngine problem
2015-03-27 10:11:57,898 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-95) [] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: General SSLEngine problem
2015-03-27 10:11:57,898 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-95) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: General SSLEngine problem
    at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:16) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:101) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:55) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:465) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:587) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:111) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:76) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:199) [vdsbroker.jar:]
    at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source) [:1.7.0_75]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_75]
    at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75]
    at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
    at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
    at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
2015-03-27 10:11:57,899 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-95) [] Failed to refresh VDS, network error, continuing, vds='h2'(46d4659a-4efe-4427-aa68-a4536508fa08): VDSGenericException: VDSNetworkException: General SSLEngine problem
[...]

On the VDSM side, we have:

clientIFinit::DEBUG::2015-03-27 10:11:53,098::task::592::Storage.TaskManager.Task::(_updateState) Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::moving from state init -> state preparing
clientIFinit::INFO::2015-03-27 10:11:53,098::logUtils::48::dispatcher::(wrapper) Run and protect: getConnectedStoragePoolsList(options=None)
clientIFinit::INFO::2015-03-27 10:11:53,098::logUtils::51::dispatcher::(wrapper) Run and protect: getConnectedStoragePoolsList, Return response: {'poollist': []}
clientIFinit::DEBUG::2015-03-27 10:11:53,098::task::1188::Storage.TaskManager.Task::(prepare) Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::finished: {'poollist': []}
clientIFinit::DEBUG::2015-03-27 10:11:53,098::task::592::Storage.TaskManager.Task::(_updateState) Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::moving from state preparing -> state finished
clientIFinit::DEBUG::2015-03-27 10:11:53,098::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
clientIFinit::DEBUG::2015-03-27 10:11:53,098::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
clientIFinit::DEBUG::2015-03-27 10:11:53,098::task::990::Storage.TaskManager.Task::(_decref) Task=`87ed5b66-3abb-4edc-aec3-59f071b33276`::ref 0 aborting False
Detector thread::DEBUG::2015-03-27 10:11:53,450::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection) Adding connection 209.239.124.8:54218
Detector thread::DEBUG::2015-03-27 10:11:53,459::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake) Error during handshake: sslv3 alert certificate unknown
Detector thread::DEBUG::2015-03-27 10:11:53,459::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection) Removing connection 209.239.124.8:54218
Detector thread::DEBUG::2015-03-27 10:11:55,249::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection) Adding connection 209.126.113.73:54119
Detector thread::DEBUG::2015-03-27 10:11:55,252::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake) Error during handshake: unexpected eof
Detector thread::DEBUG::2015-03-27 10:11:55,252::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection) Removing connection 209.126.113.73:54119
Detector thread::DEBUG::2015-03-27 10:11:56,582::protocoldetector::201::vds.MultiProtocolAcceptor::(_add_connection) Adding connection 209.239.124.8:39606
Detector thread::DEBUG::2015-03-27 10:11:56,629::protocoldetector::225::vds.MultiProtocolAcceptor::(_process_handshake) Error during handshake: sslv3 alert certificate unknown
Detector thread::DEBUG::2015-03-27 10:11:56,629::protocoldetector::215::vds.MultiProtocolAcceptor::(_remove_connection) Removing connection 209.239.124.8:39606
clientIFinit::DEBUG::2015-03-27 10:11:58,104::task::592::Storage.TaskManager.Task::(_updateState) Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::moving from state init -> state preparing
clientIFinit::INFO::2015-03-27 10:11:58,104::logUtils::48::dispatcher::(wrapper) Run and protect: getConnectedStoragePoolsList(options=None)
clientIFinit::INFO::2015-03-27 10:11:58,104::logUtils::51::dispatcher::(wrapper) Run and protect: getConnectedStoragePoolsList, Return response: {'poollist': []}
clientIFinit::DEBUG::2015-03-27 10:11:58,104::task::1188::Storage.TaskManager.Task::(prepare) Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::finished: {'poollist': []}
clientIFinit::DEBUG::2015-03-27 10:11:58,104::task::592::Storage.TaskManager.Task::(_updateState) Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::moving from state preparing -> state finished
clientIFinit::DEBUG::2015-03-27 10:11:58,104::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
clientIFinit::DEBUG::2015-03-27 10:11:58,104::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
clientIFinit::DEBUG::2015-03-27 10:11:58,104::task::990::Storage.TaskManager.Task::(_decref) Task=`9e6db6dc-3ce0-4e93-8ddd-2aa1d09fa687`::ref 0 aborting False
[...]

I guess this is related to an invalid certificate or some protocol version mismatch. How can I fix it?

Regards,
Christopher
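To narrow down whether the "certificate unknown" handshake failure is a trust problem or a protocol mismatch, a minimal sketch like the following could help. The VDSM port (54321) and the engine CA path are the usual defaults and are assumptions here, as is the idea of probing without a client certificate (VDSM may additionally reject the connection for that reason, which would itself be informative):

    import socket
    import ssl

    # Assumptions: VDSM listens on 54321 and the engine CA certificate lives at
    # /etc/pki/ovirt-engine/ca.pem on the engine host. Adjust to your environment.
    HOST = "h2.imatronix.com"
    PORT = 54321
    CA_CERT = "/etc/pki/ovirt-engine/ca.pem"

    # Attempt a TLS handshake against VDSM, verifying its server certificate against
    # the engine CA. A verification failure here suggests the host and engine no
    # longer trust the same CA (e.g. certificates were regenerated on one side).
    sock = socket.create_connection((HOST, PORT), timeout=10)
    try:
        tls = ssl.wrap_socket(sock, ca_certs=CA_CERT, cert_reqs=ssl.CERT_REQUIRED)
        print("handshake OK, peer subject: %s" % (tls.getpeercert().get("subject"),))
        tls.close()
    except ssl.SSLError as exc:
        print("handshake failed: %s" % exc)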
participants (3)
- Christopher Pereira
- Dan Kenigsberg
- Sandro Bonazzola