

----- Original Message -----
From: "Sandro Bonazzola" <sbonazzo@redhat.com> To: "Yeela Kaplan" <ykaplan@redhat.com>, "Alon Bar-Lev" <alonbl@redhat.com>, "Allon Mureinik" <amureini@redhat.com>, "Yaniv Dary" <ydary@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Sahina Bose" <sabose@redhat.com>, "Yedidyah Bar David" <didi@redhat.com>, "Simone Tiraboschi" <stirabos@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com> Sent: Thursday, February 12, 2015 2:55:57 PM Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit the following issue.
ovirt-hosted-engine-setup:
[ INFO ] Engine replied: DB Up!Welcome to Health Status!
         Enter the name of the cluster to which you want to add the host (Default) [Default]:
[ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO ] Still waiting for VDSM host to become operational...
I accessed the engine; the VM and the engine are working correctly, installed using the 3.5 snapshot repositories. host-deploy started on the host (see attached logs) and got stuck at:
2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND ### Response is VALUE TIME=type:value or ABORT TIME
until several minutes later when:

2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
    format="YYYYmmddHHMMSS+0000"
  File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
    opcode, variable = self._readline().split(' ', 1)
  File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
    raise IOError(_('End of file'))
IOError: End of file
at which point the process resumed.
Around the time host-deploy got stuck, I see in the sanlock logs:
2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
Sanlock cannot access your storage. The rest of the errors are less interesting, as nothing will work without storage.

While sanlock is failing, can you access that storage from another host or from the host running sanlock?

Adding David to help debug the sanlock errors.
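For reference, the "read rv -107" above is -ENOTCONN ("Transport endpoint is not connected"), i.e. the fuse client appears to have lost its connection to the brick rather than hitting a problem with the file itself. The errno can be decoded with:

# python -c 'import os; print(os.strerror(107))'
Transport endpoint is not connected

Also, sanlock opens the lease files with direct I/O, so a plain ls or buffered read can succeed while sanlock still fails. Something like this (a sketch, using the ids path from the log above) is closer to the I/O path sanlock exercises:

# dd if=/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids of=/dev/null bs=512 count=1 iflag=direct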
And in VDSM logs:
MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2015-02-12 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
MainThread::INFO::2015-02-12 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting down MOM
MainThread::DEBUG::2015-02-12 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState) Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving from state init -> state preparing
MainThread::INFO::2015-02-12 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
ioprocess communication (16795)::ERROR::2015-02-12 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
    raise Exception("FD closed")
Exception: FD closed
ioprocess communication (16795)::DEBUG::2015-02-12 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...
Followed by:
MainThread::INFO::2015-02-12 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping domain monitors
MainThread::INFO::2015-02-12 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
Thread-94::INFO::2015-02-12 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
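If this reproduces, the EBUSY from the lockspace remove presumably means a resource was still held (or a renewal was still in flight) when the host id was being released; the state at that moment can be dumped with the standard sanlock CLI:

# sanlock client status
# sanlock client log_dump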
The supervdsm logs don't show anything interesting, but:
# service supervdsmd status
Redirecting to /bin/systemctl status supervdsmd.service
supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
   Active: inactive (dead) since gio 2015-02-12 13:08:30 CET; 27min ago
 Main PID: 15859 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/supervdsmd.service
feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service for running helper functions as root"...
feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent call last):
feb 12 13:08:30 minidell.home daemonAdapter[15859]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
feb 12 13:08:30 minidell.home daemonAdapter[15859]:     finalizer()
feb 12 13:08:30 minidell.home daemonAdapter[15859]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
feb 12 13:08:30 minidell.home daemonAdapter[15859]:     res = self._callback(*self._args, **self._kwargs)
feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for running helper functions as root".
After a while the VM became unreachable as ovirt-hosted-engine-setup shows:
2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in state
Gluster logs don't show anything special.
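The fuse client keeps its own log under /var/log/glusterfs/, named after the mount point; assuming the usual naming it should be something like:

# less /var/log/glusterfs/rhev-data-center-mnt-glusterSD-minidell.home:_hosted__engine__glusterfs.log

A disconnect message there around 13:08 would line up with the -107 renewal errors above.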
I'm guessing that the 15 minutes host-deploy spent stuck caused the issue here, but I don't see any relevant change that may have triggered it.
Attaching the tar.gz with the code used for ovirt-hosted-engine-setup; the rest is 3.5-snapshot over RHEL 7.1.
Any hint is welcome.
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com

On 12/02/2015 14:03, Nir Soffer wrote:
Sanlock cannot access your storage. The rest of the errors are less interesting, as nothing will work without storage.
While sanlock is failing, can you access that storage from another host or from the host running sanlock?
Adding David to help debug the sanlock errors.
If you mean something like:

# mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
# ll /tmp/test/
totale 4
drwxr-xr-x. 6 vdsm kvm 4096 12 feb 11.49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rwxr-xr-x. 1 vdsm kvm    0 12 feb 13.04 __DIRECT_IO_TEST__

yes, I can still access the gluster volume.

# gluster volume info hosted_engine_glusterfs

Volume Name: hosted_engine_glusterfs
Type: Distribute
Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: minidell.home:/path1
Options Reconfigured:
nfs.disable: on
storage.owner-gid: 36
storage.owner-uid: 36
network.ping-timeout: 10
auth.allow: *
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
server.allow-insecure: on
user.cifs: disable

# gluster volume status hosted_engine_glusterfs
Status of volume: hosted_engine_glusterfs
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick minidell.home:/path1                      49217   Y       16660

Task Status of Volume hosted_engine_glusterfs
------------------------------------------------------------------------------
There are no active volume tasks
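Note that a fresh mount on /tmp/test is a separate fuse client, so this shows the volume is up, but not necessarily that the mount vdsm was using is still connected; the same check through the vdsm mount point would be:

# ls -l /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/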
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com

----- Original Message -----
From: "Sandro Bonazzola" <sbonazzo@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "Yeela Kaplan" <ykaplan@redhat.com>, "Alon Bar-Lev" <alonbl@redhat.com>, "Allon Mureinik" <amureini@redhat.com>, "Yaniv Dary" <ydary@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Sahina Bose" <sabose@redhat.com>, "Yedidyah Bar David" <didi@redhat.com>, "Simone Tiraboschi" <stirabos@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "David Teigland" <teigland@redhat.com> Sent: Thursday, February 12, 2015 4:24:55 PM Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
Yes, I can still access the gluster volume.
While sanlock was failing to update the lease, or now?

Try also to write to this storage.

On 12/02/2015 15:31, Nir Soffer wrote:
While sanlock was failing to update the lease, or now?
Try also to write to this storage.
dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s

ll /tmp/test/
total 4100
-rwxr-xr-x. 1 vdsm kvm        0 Feb 12 13:04 __DIRECT_IO_TEST__
drwxr-xr-x. 6 vdsm kvm     4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test

looks like it works
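Note that this dd goes through the page cache, so it can succeed even when direct I/O is broken; a variant closer to what sanlock does (file name is just an example) would be:

# dd if=/dev/zero of=/tmp/test/direct_write_test bs=4096 count=1 oflag=direct

If that one failed with "Transport endpoint is not connected" while the buffered write worked, it would point at the direct I/O path on the fuse mount.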
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com

----- Original Message -----
From: "Sandro Bonazzola" <sbonazzo@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "Yeela Kaplan" <ykaplan@redhat.com>, "Alon Bar-Lev" <alonbl@redhat.com>, "Allon Mureinik" <amureini@redhat.com>, "Yaniv Dary" <ydary@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Sahina Bose" <sabose@redhat.com>, "Yedidyah Bar David" <didi@redhat.com>, "Simone Tiraboschi" <stirabos@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "David Teigland" <teigland@redhat.com> Sent: Thursday, February 12, 2015 4:34:48 PM Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
On 12/02/2015 15:31, Nir Soffer wrote:
----- Original Message -----
From: "Sandro Bonazzola" <sbonazzo@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "Yeela Kaplan" <ykaplan@redhat.com>, "Alon Bar-Lev" <alonbl@redhat.com>, "Allon Mureinik" <amureini@redhat.com>, "Yaniv Dary" <ydary@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Sahina Bose" <sabose@redhat.com>, "Yedidyah Bar David" <didi@redhat.com>, "Simone Tiraboschi" <stirabos@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "David Teigland" <teigland@redhat.com> Sent: Thursday, February 12, 2015 4:24:55 PM Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
Il 12/02/2015 14:03, Nir Soffer ha scritto:
----- Original Message -----
From: "Sandro Bonazzola" <sbonazzo@redhat.com> To: "Yeela Kaplan" <ykaplan@redhat.com>, "Alon Bar-Lev" <alonbl@redhat.com>, "Allon Mureinik" <amureini@redhat.com>, "Yaniv Dary" <ydary@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Sahina Bose" <sabose@redhat.com>, "Yedidyah Bar David" <didi@redhat.com>, "Simone Tiraboschi" <stirabos@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com> Sent: Thursday, February 12, 2015 2:55:57 PM Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit the following issue.
ovirt-hosted-engine-setup:
[ INFO ] Engine replied: DB Up!Welcome to Health Status! Enter the name of the cluster to which you want to add the host (Default) [Default]: [ INFO ] Waiting for the host to become operational in the engine. This may take several minutes... [ INFO ] Still waiting for VDSM host to become operational...
Accessed to the engine, VM and engine working correctly, installed using 3.5 snapshot repositories. host-deploy started on the host (see attached logs) and got stuck at:
2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND ### Response is VALUE TIME=type:value or ABORT TIME
until several minutes later when: 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception Traceback (most recent call last): File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod method['method']() File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock format="YYYYmmddHHMMSS+0000" File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue opcode, variable = self._readline().split(' ', 1) File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline raise IOError(_('End of file')) IOError: End of file
resumed the process.
Around the time when host-deploy got stucked I see in sanlock logs:
2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b0 06-68a2-4742-964c-413a3f77c0b4.lease 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
Sanlock cannot access your storage. The rest of the errors are less interesting, as nothing will work without storage.
While sanlock is failing, can you access that storage from another host or from the host running sanlock?
Adding David for helping with debugging the sanlock errors.
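For reference, sanlock itself can report its view of the lockspaces while the renewal errors are happening; a minimal check on the affected host (a sketch, assuming the stock sanlock CLI):

# sanlock client status
# sanlock client log_dump

The first command shows lockspaces and held resources as the daemon sees them; the second dumps sanlock's internal debug log.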
If you mean something like:

# mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
# ll /tmp/test/
total 4
drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rwxr-xr-x. 1 vdsm kvm    0 Feb 12 13:04 __DIRECT_IO_TEST__
yes, I can still access the gluster volume.
While sanlock fails to update the lease, or now?
?
Try also to write to this storage.
# dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s

# ll /tmp/test/
total 4100
-rwxr-xr-x. 1 vdsm kvm       0 Feb 12 13:04 __DIRECT_IO_TEST__
drwxr-xr-x. 6 vdsm kvm    4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test

Looks like it works.
# gluster volume info hosted_engine_glusterfs
Volume Name: hosted_engine_glusterfs
Type: Distribute
Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: minidell.home:/path1
Options Reconfigured:
nfs.disable: on
storage.owner-gid: 36
storage.owner-uid: 36
network.ping-timeout: 10
auth.allow: *
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
server.allow-insecure: on
user.cifs: disable
# gluster volume status hosted_engine_glusterfs
Status of volume: hosted_engine_glusterfs
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick minidell.home:/path1                              49217   Y       16660

Task Status of Volume hosted_engine_glusterfs
------------------------------------------------------------------------------
There are no active volume tasks
supervdsm logs don't show anything interesting, but:
# service supervdsmd status
Redirecting to /bin/systemctl status supervdsmd.service
supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
   Active: inactive (dead) since Thu 2015-02-12 13:08:30 CET; 27min ago
 Main PID: 15859 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/supervdsmd.service

Feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service for running helper functions as root"...
Feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent call last):
Feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
Feb 12 13:08:30 minidell.home daemonAdapter[15859]: finalizer()
Feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
Feb 12 13:08:30 minidell.home daemonAdapter[15859]: res = self._callback(*self._args, **self._kwargs)
Feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
Feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for running helper functions as root".
After a while the VM became unreachable as ovirt-hosted-engine-setup shows:
2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in state
Gluster logs don't show anything special.
I'm guessing that the 15 minutes of host-deploy being stuck caused the issue here, but I don't see any relevant change that may have caused it.
Attaching the tar.gz with the code used for ovirt-hosted-engine-setup; the rest is 3.5-snapshot over RHEL 7.1
Any hint is welcome.
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com

On 12/02/2015 15:41, Nir Soffer wrote:
While sanlock fails to update the lease, or now?
?
Monitoring the storage while trying to reproduce on a different host. Attached the messages log requested on IRC.

2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
open(2) error was EPERM
2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
open(2) error was ENOENT
2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
aio result on open file was ENOTCONN
Maybe gluster logged some errors about what happened here?
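These sanlock numbers are plain errno values with the sign flipped; a quick way to decode them (a sketch, assuming Python is available, as it is on any VDSM host):

# python -c 'import errno, os; print errno.errorcode[107], os.strerror(107)'
ENOTCONN Transport endpoint is not connected
# python -c 'import errno, os; print errno.errorcode[1], os.strerror(1)'
EPERM Operation not permitted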

From: "David Teigland" <teigland@redhat.com>, Sent: Thursday, February 12, 2015 5:13:49 PM
2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
open(2) error was EPERM
SELinux acting again? Sandro, can you also attach /var/log/audit/audit.log?
2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
open(2) error was ENOENT
2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
aio result on open file was ENOTCONN
Maybe gluster logged some errors about what happened here?
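For the gluster side, the relevant logs with default settings would be (a sketch; the exact file names derive from the mount point and the brick path):

# less /var/log/glusterfs/rhev-data-center-mnt-glusterSD-minidell.home:_hosted__engine__glusterfs.log
# less /var/log/glusterfs/bricks/path1.log

The first is the FUSE client log on the host mounting the volume; the second is the brick log on the gluster server.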

On 12/02/2015 16:20, Nir Soffer wrote:
SELinux acting again?
Sandro, can you also attach /var/log/audit/audit.log?
Sure, but other than what I already reported in https://bugzilla.redhat.com/show_bug.cgi?id=1191989 I don't think you'll find much more.

From: "Sandro Bonazzola" <sbonazzo@redhat.com>, Sent: Thursday, February 12, 2015 5:22:48 PM
Sure, but other than what I already reported in https://bugzilla.redhat.com/show_bug.cgi?id=1191989 I don't think you'll find much more.
Can you reproduce this in permissive mode?
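For example (a sketch; this assumes the host is currently enforcing, and note that in permissive mode denials are still logged to /var/log/audit/audit.log):

# getenforce
Enforcing
# setenforce 0
# getenforce
Permissive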

On 12/02/2015 16:26, Nir Soffer wrote:
Can you reproduce this in permissive mode?
Yes, it's fully reproducible on a host in permissive mode.

Also, writing to the storage works; this loop didn't show any issue while sanlock was failing:

while true; do echo starting dd `date`; dd if=/dev/zero of=/tmp/test/testfile bs=4k count=1k oflag=direct; echo dd finished `date`; rm -f /tmp/test/testfile; sleep 10; done

Attached the tar.gz of the ovirt-hosted-engine-setup I'm running and the full /var/log directory content. Note that both glusterd and vdsmd are running on the same host for the hyper-converged setup.

From: "Sandro Bonazzola" <sbonazzo@redhat.com>, Sent: Thursday, February 12, 2015 5:43:28 PM
Also writing to the storage works:
while true; do echo starting dd `date`; dd if=/dev/zero of=/tmp/test/testfile bs=4k count=1k oflag=direct ; echo dd finished `date`; rm -f /tmp/test/testfile ; sleep 10; done
What does "df -T /tmp" show? /tmp is typically tempfs, so your dd test never touched any storage. Also, copying zeros can be ignored by smart storage, so better copy "real" file data: dd if=/dev/urandom of=testfile bs=1M count=1 Now use testfile for writing to storage: dd if=testfile of=/path/to/glusterfs/mount/testfile bs=1M count=1 oflag=direct You should do something like: 1. mount your gluster volume - same as vdsm does 2. start the write test, writing to the gluster storage 3. start your hosted engine flow 4. when it fails, keep your write test running until for a while

On 12/02/2015 17:03, Nir Soffer wrote:
What does "df -T /tmp" show?
# df -T /tmp /tmp/test/
Filesystem                             Type           1K-blocks    Used Available Use% Mounted on
/dev/sda6                              ext4            40185112 5347916  32772812  15% /
192.168.1.107:/hosted_engine_glusterfs fuse.glusterfs  40185088 5347968  32772736  15% /tmp/test
If you prefer, I can also do a double test: one on /tmp/test and one on /rhev/data-center/mnt/glusterSD/minidell.home\:_hosted__engine__glusterfs/. I'll try to do the test tomorrow.

From: "Sandro Bonazzola" <sbonazzo@redhat.com>, Sent: Thursday, February 12, 2015 6:11:16 PM
Writing to /tmp/test seems OK. It would be nice to see a test using random data instead of zeros. I think David should continue to investigate this, to understand why sanlock cannot access the storage.

From: "Sandro Bonazzola" <sbonazzo@redhat.com>, Sent: Thursday, February 12, 2015 2:55:57 PM, Subject: [HC] Weird issue while deploying hosted-engine
2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND ### Response is VALUE TIME=type:value or ABORT TIME
Not sure about the rest, but this seems to be some time-checking issue. I did not check the sources, only my own host-deploy logs. My guess:

otopi (an internal plugin?) checks the system clock. It tries ntpq, which failed for you, then chronyc (present on my system; you do not seem to have it), and then just asks the engine for the time, and perhaps the engine does not know how to reply.

Try installing chrony.

--
Didi

On 12/02/2015 14:16, Yedidyah Bar David wrote:
Not sure that chrony is really needed; I have ntpd. But I've seen that by default it's not enabled on RHEL 7.1:

# service ntpd status
Redirecting to /bin/systemctl status ntpd.service
ntpd.service - Network Time Service
   Loaded: loaded (/usr/lib/systemd/system/ntpd.service; disabled)
   Active: inactive (dead)

and host-deploy doesn't check whether it's alive before calling /sbin/ntpq -c rv.

I'm not sure what the engine should send in this case, because the VM gets paused here.
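For completeness, enabling ntpd so that the ntpq query has a daemon to talk to would be (a sketch):

# systemctl enable ntpd
# systemctl start ntpd
# /sbin/ntpq -c rv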
participants (4): David Teigland, Nir Soffer, Sandro Bonazzola, Yedidyah Bar David