[ovirt-devel] [HC] Weird issue while deploying hosted-engine

12 Feb 2015

      While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit the following issue.

ovirt-hosted-engine-setup:

[ INFO  ] Engine replied: DB Up!Welcome to Health Status!
          Enter the name of the cluster to which you want to add the host (Default) [Default]:
[ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
[ INFO  ] Still waiting for VDSM host to become operational...

Accessed to the engine, VM and engine working correctly, installed using 3.5 snapshot repositories.
host-deploy started on the host (see attached logs) and got stuck at:

2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND       ### Response is VALUE TIME=type:value or ABORT TIME

until several minutes later when:
2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
Traceback (most recent call last):
  File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
    method['method']()
  File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
    format="YYYYmmddHHMMSS+0000"
  File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
    opcode, variable = self._readline().split(' ', 1)
  File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
    raise IOError(_('End of file'))
IOError: End of file

resumed the process.

Around the time when host-deploy got stucked I see in sanlock logs:

2015-02-12 13:08:25+0100 7254 [683]: open error -1
/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b0
06-68a2-4742-964c-413a3f77c0b4.lease
2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0
/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0
/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0
/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239

And in VDSM logs:

MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2015-02-12 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
MainThread::INFO::2015-02-12 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting down MOM
MainThread::DEBUG::2015-02-12 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState) Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving
from state init -> state preparing
MainThread::INFO::2015-02-12 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
ioprocess communication (16795)::ERROR::2015-02-12 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess failure
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
    raise Exception("FD closed")
Exception: FD closed
ioprocess communication (16795)::DEBUG::2015-02-12 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...

Followed by:

MainThread::INFO::2015-02-12 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping domain monitors
MainThread::INFO::2015-02-12 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring
a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for
a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
Thread-94::INFO::2015-02-12 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain
a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain
a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
    self.domain.releaseHostId(self.hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock lockspace remove failure',
'Device or resource busy'))

supervdsm logs doesn't show anything interesting but:

# service supervdsmd status
Redirecting to /bin/systemctl status  supervdsmd.service
supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
   Active: inactive (dead) since gio 2015-02-12 13:08:30 CET; 27min ago
 Main PID: 15859 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/supervdsmd.service

feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service for running helper functions as root"...
feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent call last):
feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
feb 12 13:08:30 minidell.home daemonAdapter[15859]: finalizer()
feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
feb 12 13:08:30 minidell.home daemonAdapter[15859]: res = self._callback(*self._args, **self._kwargs)
feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for running helper functions as root".

After a while the VM became unreachable as ovirt-hosted-engine-setup shows:

2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state:
[ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in  state

Gluster logs don't show anything special.

I'm guessing that the 15 minutes of host-deploy being stuck caused the issue here, but I don't see any relevant change that may have caused it.

Attaching the tar.gz with the code used for ovirt-hosted-engine-setup; the rest is 3.5-snapshot over RHEL 7.1

Any hint is welcome.

-- 
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com