[ovirt-devel] [HC] Weird issue while deploying hosted-engine

Sandro Bonazzola sbonazzo at redhat.com
Thu Feb 12 14:24:55 UTC 2015


On 12/02/2015 14:03, Nir Soffer wrote:
> ----- Original Message -----
>> From: "Sandro Bonazzola" <sbonazzo at redhat.com>
>> To: "Yeela Kaplan" <ykaplan at redhat.com>, "Alon Bar-Lev" <alonbl at redhat.com>, "Allon Mureinik" <amureini at redhat.com>,
>> "Yaniv Dary" <ydary at redhat.com>, "Federico Simoncelli" <fsimonce at redhat.com>, "Sahina Bose" <sabose at redhat.com>,
>> "Yedidyah Bar David" <didi at redhat.com>, "Simone Tiraboschi" <stirabos at redhat.com>, devel at ovirt.org, "Michal
>> Skrivanek" <mskrivan at redhat.com>
>> Sent: Thursday, February 12, 2015 2:55:57 PM
>> Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>
>> While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit the
>> following issue.
>>
>> ovirt-hosted-engine-setup:
>>
>> [ INFO  ] Engine replied: DB Up!Welcome to Health Status!
>>           Enter the name of the cluster to which you want to add the host
>>           (Default) [Default]:
>> [ INFO  ] Waiting for the host to become operational in the engine. This may
>> take several minutes...
>> [ INFO  ] Still waiting for VDSM host to become operational...
>>
>> I accessed the engine; the VM and the engine were working correctly,
>> installed using 3.5 snapshot repositories.
>> host-deploy started on the host (see attached logs) and got stuck at:
>>
>> 2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine
>> dialog.__logString:215 DIALOG:SEND       ### Response is VALUE
>> TIME=type:value or ABORT TIME
>>
>> until several minutes later when:
>> 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method
>> exception
>> Traceback (most recent call last):
>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in
>>   _executeMethod
>>     method['method']()
>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111,
>>   in _set_clock
>>     format="YYYYmmddHHMMSS+0000"
>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line
>>   235, in queryValue
>>     opcode, variable = self._readline().split(' ', 1)
>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in
>>   _readline
>>     raise IOError(_('End of file'))
>> IOError: End of file
>>
>> resumed the process.
>>
>> Around the time host-deploy got stuck, I see in the sanlock logs:
>>
>> 2015-02-12 13:08:25+0100 7254 [683]: open error -1
>> /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
>> 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0
>> 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0
>> /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0
>> last_success 7239
>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0
>> 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0
>> /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0
>> last_success 7239
>> 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0
>> 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0
>> /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0
>> last_success 7239
> 
> Sanlock cannot access your storage. The rest of the errors are less interesting,
> as nothing will work without storage.
> 
> While sanlock is failing, can you access that storage from another host or
> from the host running sanlock?
> 
> Adding David for helping with debugging the sanlock errors.


If you mean something like:
# mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
# ll /tmp/test/
total 4
drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rwxr-xr-x. 1 vdsm kvm    0 Feb 12 13:04 __DIRECT_IO_TEST__

yes, I can still access the gluster volume.
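
A plain ll only exercises cached metadata, though, while sanlock reads the
ids file with O_DIRECT; a closer check (path copied from the sanlock log
above, 512-byte read assumed to match sanlock's default sector size) would
be something like:

# dd if=/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids of=/dev/null bs=512 count=1 iflag=direct

If that also works while sanlock keeps failing, it would point at the
connection behind sanlock's already-open descriptors rather than at the
volume itself.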

# gluster volume info hosted_engine_glusterfs

Volume Name: hosted_engine_glusterfs
Type: Distribute
Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: minidell.home:/path1
Options Reconfigured:
nfs.disable: on
storage.owner-gid: 36
storage.owner-uid: 36
network.ping-timeout: 10
auth.allow: *
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
server.allow-insecure: on
user.cifs: disable

# gluster volume status hosted_engine_glusterfs
Status of volume: hosted_engine_glusterfs
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick minidell.home:/path1				49217	Y	16660

Task Status of Volume hosted_engine_glusterfs
------------------------------------------------------------------------------
There are no active volume tasks
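
For what it's worth, the "rv -107" / "result -107" values in the sanlock log
are negated errnos; assuming the standard Linux errno table, they decode to:

# python -c 'import errno, os; print errno.errorcode[107], os.strerror(107)'
ENOTCONN Transport endpoint is not connected

so the renewal reads died on a dropped connection, not on an I/O error from
the brick. With network.ping-timeout set to 10, a ~10 second stall is enough
for the gluster client to drop the brick connection; if that happened, the
fuse client log (named after the mount point; exact file name guessed here)
should show a disconnect around 13:08:

# grep -iE 'disconnect|ping' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log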



> 
>>
>>
>> And in VDSM logs:
>>
>> MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler)
>> Received signal 15
>> MainThread::DEBUG::2015-02-12
>> 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop)
>> Stopping Acceptor
>> MainThread::INFO::2015-02-12
>> 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC
>> Server
>> Detector thread::DEBUG::2015-02-12
>> 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup)
>> Cleaning Acceptor
>> MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM
>> channels listener was stopped.
>> MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting
>> down MOM
>> MainThread::DEBUG::2015-02-12
>> 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState)
>> Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving
>> from state init -> state preparing
>> MainThread::INFO::2015-02-12
>> 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect:
>> prepareForShutdown(options=None)
>> Thread-11::DEBUG::2015-02-12
>> 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections)
>> Monitoring stopped
>> ioprocess communication (16795)::ERROR::2015-02-12
>> 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess
>> failure
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in
>>   _communicate
>>     raise Exception("FD closed")
>> Exception: FD closed
>> ioprocess communication (16795)::DEBUG::2015-02-12
>> 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...
>>
>>
>> Followed by:
>>
>> MainThread::INFO::2015-02-12
>> 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping
>> domain monitors
>> MainThread::INFO::2015-02-12
>> 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring)
>> Stop monitoring
>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>> Thread-94::DEBUG::2015-02-12
>> 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop)
>> Stopping domain monitor for
>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>> Thread-94::INFO::2015-02-12
>> 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing
>> host id for domain
>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
>> Thread-94::DEBUG::2015-02-12
>> 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop)
>> Unable to release the host id 1 for domain
>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
>>     self.domain.releaseHostId(self.hostId, unused=True)
>>   File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
>>     self._clusterLock.releaseHostId(hostId, async, unused)
>>   File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
>>     raise se.ReleaseHostIdFailure(self._sdUUID, e)
>> ReleaseHostIdFailure: Cannot release host id:
>> (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock
>> lockspace remove failure',
>> 'Device or resource busy'))
>>
>> supervdsm logs don't show anything interesting, but:
>>
>> # service supervdsmd status
>> Redirecting to /bin/systemctl status  supervdsmd.service
>> supervdsmd.service - "Auxiliary vdsm service for running helper functions as
>> root"
>>    Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
>>    Active: inactive (dead) since gio 2015-02-12 13:08:30 CET; 27min ago
>>  Main PID: 15859 (code=exited, status=0/SUCCESS)
>>    CGroup: /system.slice/supervdsmd.service
>>
>> feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service
>> for running helper functions as root"...
>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent
>> call last):
>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: File
>> "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: finalizer()
>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: File
>> "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: res =
>> self._callback(*self._args, **self._kwargs)
>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No
>> such file or directory: '/var/run/vdsm/svdsm.sock'
>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
>> feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for
>> running helper functions as root".
>>
>>
>>
>> After a while, the VM became unreachable, as ovirt-hosted-engine-setup shows:
>>
>> 2015-02-12 13:10:33 DEBUG
>> otopi.plugins.ovirt_hosted_engine_setup.engine.add_host
>> add_host._wait_host_ready:183 Error fetching host state:
>> [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
>> 2015-02-12 13:10:33 DEBUG
>> otopi.plugins.ovirt_hosted_engine_setup.engine.add_host
>> add_host._wait_host_ready:189 VDSM host in  state
>>
>>
>> Gluster logs don't show anything special.
>>
>> I'm guessing that the 15 minutes of host-deploy being stuck caused the issue
>> here, but I don't see any relevant change that may have caused it.
>>
>> Attaching the tar.gz with the code used for ovirt-hosted-engine-setup; the
>> rest is 3.5-snapshot on RHEL 7.1.
>>
>> Any hint is welcome.
>>
>> --
>> Sandro Bonazzola
>> Better technology. Faster innovation. Powered by community collaboration.
>> See how it works at redhat.com
>>
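
One more data point that may help: the VDSM log above shows "Received signal
15" at 13:08:25, so something stopped vdsmd right in the middle of
host-deploy (host-deploy itself restarts vdsm while reconfiguring it, so
that's the first suspect). The journal around that moment should show who
issued the stop, e.g.:

# journalctl -u vdsmd -u supervdsmd --since '2015-02-12 13:08:20' --until '2015-02-12 13:08:40'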


-- 
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com


