----- Original Message -----
From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
To: "Nir Soffer" <nsoffer(a)redhat.com>
Cc: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon Bar-Lev"
<alonbl(a)redhat.com>, "Allon Mureinik" <amureini(a)redhat.com>,
"Yaniv Dary" <ydary(a)redhat.com>, "Federico Simoncelli"
<fsimonce(a)redhat.com>, "Sahina Bose" <sabose(a)redhat.com>,
"Yedidyah Bar David" <didi(a)redhat.com>, "Simone Tiraboschi"
<stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
Skrivanek" <mskrivan(a)redhat.com>, "David Teigland"
<teigland(a)redhat.com>
Sent: Thursday, February 12, 2015 4:34:48 PM
Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
On 12/02/2015 15:31, Nir Soffer wrote:
>
>
> ----- Original Message -----
>> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
>> To: "Nir Soffer" <nsoffer(a)redhat.com>
>> Cc: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon
Bar-Lev"
>> <alonbl(a)redhat.com>, "Allon Mureinik"
<amureini(a)redhat.com>,
>> "Yaniv Dary" <ydary(a)redhat.com>, "Federico
Simoncelli"
>> <fsimonce(a)redhat.com>, "Sahina Bose" <sabose(a)redhat.com>,
>> "Yedidyah Bar David" <didi(a)redhat.com>, "Simone
Tiraboschi"
>> <stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
>> Skrivanek" <mskrivan(a)redhat.com>, "David Teigland"
<teigland(a)redhat.com>
>> Sent: Thursday, February 12, 2015 4:24:55 PM
>> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>
>> On 12/02/2015 14:03, Nir Soffer wrote:
>>> ----- Original Message -----
>>>> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
>>>> To: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon
Bar-Lev"
>>>> <alonbl(a)redhat.com>, "Allon Mureinik"
<amureini(a)redhat.com>,
>>>> "Yaniv Dary" <ydary(a)redhat.com>, "Federico
Simoncelli"
>>>> <fsimonce(a)redhat.com>, "Sahina Bose"
<sabose(a)redhat.com>,
>>>> "Yedidyah Bar David" <didi(a)redhat.com>, "Simone
Tiraboschi"
>>>> <stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
>>>> Skrivanek" <mskrivan(a)redhat.com>
>>>> Sent: Thursday, February 12, 2015 2:55:57 PM
>>>> Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>>>
>>>> While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit
>>>> the following issue.
>>>>
>>>> ovirt-hosted-engine-setup:
>>>>
>>>> [ INFO ] Engine replied: DB Up!Welcome to Health Status!
>>>> Enter the name of the cluster to which you want to add the host (Default) [Default]:
>>>> [ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
>>>> [ INFO ] Still waiting for VDSM host to become operational...
>>>>
>>>> Accessed the engine; VM and engine are working correctly, installed using
>>>> the 3.5 snapshot repositories.
>>>> host-deploy started on the host (see attached logs) and got stuck at:
>>>>
>>>> 2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND ### Response is VALUE TIME=type:value or ABORT TIME
>>>>
>>>> until several minutes later when:
>>>> 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
>>>> Traceback (most recent call last):
>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
>>>>     method['method']()
>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
>>>>     format="YYYYmmddHHMMSS+0000"
>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
>>>>     opcode, variable = self._readline().split(' ', 1)
>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
>>>>     raise IOError(_('End of file'))
>>>> IOError: End of file
>>>>
>>>> resumed the process.
>>>>
>>>> Around the time when host-deploy got stuck, I see in the sanlock logs:
>>>>
>>>> 2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
>>>> 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>> 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>
>>> Sanlock cannot access your storage. The rest of the errors are less interesting, as nothing will work without storage.
>>>
>>> While sanlock is failing, can you access that storage from another host or from the host running sanlock?
>>>
>>> Adding David for helping with debugging the sanlock errors.
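
(A hedged aside on debugging this: the repeated "result -107" / "rv -107" in the sanlock log is -ENOTCONN, "Transport endpoint is not connected", which usually points at the gluster client mount sanlock is renewing through rather than at the volume itself. Two illustrative checks, assuming the standard sanlock CLI is available on the host: "sanlock client status" lists the lockspaces and resources sanlock currently holds, and "sanlock client log_dump" dumps sanlock's internal debug log around the renewal errors.)

# sanlock client status
# sanlock client log_dump
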
>>
>>
>> If you mean something like:
>> # mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
>> # ll /tmp/test/
>> total 4
>> drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>> -rwxr-xr-x. 1 vdsm kvm 0 Feb 12 13:04 __DIRECT_IO_TEST__
>>
>> yes, I can still access the gluster volume.
>
> While sanlock was failing to update the lease, or now?
?
>
> Try also to write to this storage.
dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s
ll /tmp/test/
total 4100
-rwxr-xr-x. 1 vdsm kvm 0 Feb 12 13:04 __DIRECT_IO_TEST__
drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test
looks like it works
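
(One caveat, as an aside: sanlock opens the lease and ids files with O_DIRECT, so a buffered dd succeeding does not necessarily prove that the path sanlock renews on is healthy, especially since this test goes through a fresh mount at /tmp/test rather than the /rhev/data-center mount vdsm and sanlock use. A direct-I/O variant would be closer to what sanlock does; a minimal sketch, assuming GNU dd, with the write target name chosen only as an example:)

dd if=/dev/zero of=/tmp/test/my_direct_write_test bs=4k count=1k oflag=direct
dd if=/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids of=/dev/null bs=1M count=1 iflag=direct
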
>
>>
>> # gluster volume info hosted_engine_glusterfs
>>
>> Volume Name: hosted_engine_glusterfs
>> Type: Distribute
>> Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
>> Status: Started
>> Number of Bricks: 1
>> Transport-type: tcp
>> Bricks:
>> Brick1: minidell.home:/path1
>> Options Reconfigured:
>> nfs.disable: on
>> storage.owner-gid: 36
>> storage.owner-uid: 36
>> network.ping-timeout: 10
>> auth.allow: *
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> network.remote-dio: enable
>> cluster.eager-lock: enable
>> performance.stat-prefetch: off
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> server.allow-insecure: on
>> user.cifs: disable
>>
>> # gluster volume status hosted_engine_glusterfs
>> Status of volume: hosted_engine_glusterfs
>> Gluster process Port Online Pid
>> ------------------------------------------------------------------------------
>> Brick minidell.home:/path1 49217 Y 16660
>>
>> Task Status of Volume hosted_engine_glusterfs
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
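
(Another illustrative check, under the assumption that the -107 renewal errors come from the fuse client used by vdsm/sanlock losing its brick connection while a fresh mount still works: grep that mount's client-side log for disconnects, and ask the brick which clients it currently sees. The log file name below follows gluster's usual convention of naming the client log after the mount point with '/' replaced by '-'; adjust it if your log name differs.)

# grep -iE 'disconnect|not connected' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-minidell.home:_hosted__engine__glusterfs.log
# gluster volume status hosted_engine_glusterfs clients
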
>>
>>
>>
>>>
>>>>
>>>>
>>>> And in VDSM logs:
>>>>
>>>> MainThread::DEBUG::2015-02-12
>>>> 13:08:25,529::vdsm::58::vds::(sigtermHandler)
>>>> Received signal 15
>>>> MainThread::DEBUG::2015-02-12
>>>> 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop)
>>>> Stopping Acceptor
>>>> MainThread::INFO::2015-02-12
>>>> 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping
>>>> JsonRPC
>>>> Server
>>>> Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
>>>> MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop)
>>>> VM
>>>> channels listener was stopped.
>>>> MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop)
>>>> Shutting
>>>> down MOM
>>>> MainThread::DEBUG::2015-02-12
>>>> 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState)
>>>> Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving
>>>> from state init -> state preparing
>>>> MainThread::INFO::2015-02-12
>>>> 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect:
>>>> prepareForShutdown(options=None)
>>>> Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
>>>> ioprocess communication (16795)::ERROR::2015-02-12
>>>> 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess
>>>> failure
>>>> Traceback (most recent call last):
>>>>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
>>>>     raise Exception("FD closed")
>>>> Exception: FD closed
>>>> ioprocess communication (16795)::DEBUG::2015-02-12
>>>> 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting
>>>> IOProcess...
>>>>
>>>>
>>>> Followed by:
>>>>
>>>> MainThread::INFO::2015-02-12
>>>> 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close)
>>>> Stopping
>>>> domain monitors
>>>> MainThread::INFO::2015-02-12 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>> Thread-94::INFO::2015-02-12
>>>> 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId)
>>>> Releasing
>>>> host id for domain
>>>> a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
>>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>> Traceback (most recent call last):
>>>>   File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
>>>>     self.domain.releaseHostId(self.hostId, unused=True)
>>>>   File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
>>>>     self._clusterLock.releaseHostId(hostId, async, unused)
>>>>   File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
>>>>     raise se.ReleaseHostIdFailure(self._sdUUID, e)
>>>> ReleaseHostIdFailure: Cannot release host id: (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
>>>>
>>>> supervdsm logs don't show anything interesting, but:
>>>>
>>>> # service supervdsmd status
>>>> Redirecting to /bin/systemctl status supervdsmd.service
>>>> supervdsmd.service - "Auxiliary vdsm service for running helper
>>>> functions
>>>> as
>>>> root"
>>>> Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
>>>> Active: inactive (dead) since gio 2015-02-12 13:08:30 CET; 27min ago
>>>> Main PID: 15859 (code=exited, status=0/SUCCESS)
>>>> CGroup: /system.slice/supervdsmd.service
>>>>
>>>> feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm
>>>> service
>>>> for running helper functions as root"...
>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most
>>>> recent
>>>> call last):
>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: File
>>>> "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in
>>>> _run_finalizers
>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: finalizer()
>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: res = self._callback(*self._args, **self._kwargs)
>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2]
>>>> No
>>>> such file or directory: '/var/run/vdsm/svdsm.sock'
>>>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech
>>>> dispose
>>>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech
>>>> dispose
>>>> feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm
>>>> service
>>>> for
>>>> running helper functions as root".
>>>>
>>>>
>>>>
>>>> After a while the VM became unreachable as ovirt-hosted-engine-setup
>>>> shows:
>>>>
>>>> 2015-02-12 13:10:33 DEBUG
>>>> otopi.plugins.ovirt_hosted_engine_setup.engine.add_host
>>>> add_host._wait_host_ready:183 Error fetching host state:
>>>> [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
>>>> 2015-02-12 13:10:33 DEBUG
>>>> otopi.plugins.ovirt_hosted_engine_setup.engine.add_host
>>>> add_host._wait_host_ready:189 VDSM host in state
>>>>
>>>>
>>>> Gluster logs don't show anything special.
>>>>
>>>> I'm guessing that the 15 minutes of host-deploy being stuck caused the issue
>>>> here, but I don't see any relevant change that may have caused it.
>>>>
>>>> Attaching the tar.gz with the code used for ovirt-hosted-engine-setup;
>>>> the
>>>> rest is 3.5-snapshot over RHEL 7.1
>>>>
>>>> Any hint is welcome.
>>>>
>>>> --
>>>> Sandro Bonazzola
>>>> Better technology. Faster innovation. Powered by community collaboration.
>>>> See how it works at redhat.com
>>>>
>>>> _______________________________________________
>>>> Devel mailing list
>>>> Devel(a)ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/devel
>>
>>
>> --
>> Sandro Bonazzola
>> Better technology. Faster innovation. Powered by community collaboration.
>> See how it works at redhat.com
>>
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com