On 12/02/2015 15:41, Nir Soffer wrote:
----- Original Message -----
> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
> To: "Nir Soffer" <nsoffer(a)redhat.com>
> Cc: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon Bar-Lev"
<alonbl(a)redhat.com>, "Allon Mureinik" <amureini(a)redhat.com>,
> "Yaniv Dary" <ydary(a)redhat.com>, "Federico Simoncelli"
<fsimonce(a)redhat.com>, "Sahina Bose" <sabose(a)redhat.com>,
> "Yedidyah Bar David" <didi(a)redhat.com>, "Simone Tiraboschi"
<stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
> Skrivanek" <mskrivan(a)redhat.com>, "David Teigland"
<teigland(a)redhat.com>
> Sent: Thursday, February 12, 2015 4:34:48 PM
> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>
> On 12/02/2015 15:31, Nir Soffer wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
>>> To: "Nir Soffer" <nsoffer(a)redhat.com>
>>> Cc: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon
Bar-Lev"
>>> <alonbl(a)redhat.com>, "Allon Mureinik"
<amureini(a)redhat.com>,
>>> "Yaniv Dary" <ydary(a)redhat.com>, "Federico
Simoncelli"
>>> <fsimonce(a)redhat.com>, "Sahina Bose"
<sabose(a)redhat.com>,
>>> "Yedidyah Bar David" <didi(a)redhat.com>, "Simone
Tiraboschi"
>>> <stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
>>> Skrivanek" <mskrivan(a)redhat.com>, "David Teigland"
<teigland(a)redhat.com>
>>> Sent: Thursday, February 12, 2015 4:24:55 PM
>>> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>>
>>> On 12/02/2015 14:03, Nir Soffer wrote:
>>>> ----- Original Message -----
>>>>> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
>>>>> To: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon
Bar-Lev"
>>>>> <alonbl(a)redhat.com>, "Allon Mureinik"
<amureini(a)redhat.com>,
>>>>> "Yaniv Dary" <ydary(a)redhat.com>, "Federico
Simoncelli"
>>>>> <fsimonce(a)redhat.com>, "Sahina Bose"
<sabose(a)redhat.com>,
>>>>> "Yedidyah Bar David" <didi(a)redhat.com>, "Simone
Tiraboschi"
>>>>> <stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
>>>>> Skrivanek" <mskrivan(a)redhat.com>
>>>>> Sent: Thursday, February 12, 2015 2:55:57 PM
>>>>> Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>>>>
>>>>> While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit the following issue.
>>>>>
>>>>> ovirt-hosted-engine-setup:
>>>>>
>>>>> [ INFO ] Engine replied: DB Up!Welcome to Health Status!
>>>>>           Enter the name of the cluster to which you want to add the host (Default) [Default]:
>>>>> [ INFO ] Waiting for the host to become operational in the engine. This may take several minutes...
>>>>> [ INFO ] Still waiting for VDSM host to become operational...
>>>>>
>>>>> I accessed the engine; the VM and the engine are working correctly, installed using the 3.5 snapshot repositories.
>>>>> host-deploy started on the host (see attached logs) and got stuck at:
>>>>>
>>>>> 2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND ### Response is VALUE TIME=type:value or ABORT TIME
>>>>>
>>>>> until several minutes later when:
>>>>> 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
>>>>> Traceback (most recent call last):
>>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
>>>>>     method['method']()
>>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
>>>>>     format="YYYYmmddHHMMSS+0000"
>>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
>>>>>     opcode, variable = self._readline().split(' ', 1)
>>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
>>>>>     raise IOError(_('End of file'))
>>>>> IOError: End of file
>>>>>
>>>>> at which point the process resumed.
>>>>>
>>>>> Around the time host-deploy got stuck, I see in the sanlock logs:
>>>>>
>>>>> 2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
>>>>> 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>>> 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>>
>>>> Sanlock cannot access your storage. The rest of the errors are less interesting, as nothing will work without storage.
>>>>
>>>> While sanlock is failing, can you access that storage from another host or from the host running sanlock?
>>>>
>>>> Adding David for helping with debugging the sanlock errors.
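>>>>
>>>> (The delta_renew lines above show "read rv -107", which looks like
>>>> -ENOTCONN, "Transport endpoint is not connected", so the gluster fuse
>>>> client may have lost its brick connection around 13:08. As a rough
>>>> sketch of what could be checked on the host while the errors are
>>>> happening - the dd options and the path, taken from the sanlock log
>>>> above, are only an example:
>>>>
>>>>   sanlock client status
>>>>   sanlock client log_dump | tail -n 50
>>>>   dd if=/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids of=/dev/null iflag=direct bs=512 count=1
>>>>
>>>> The first two show sanlock's own view of the lockspace and its recent
>>>> log; the dd attempts a direct I/O read of the ids file, which is
>>>> roughly what the lease renewal does.)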
>>>
>>>
>>> If you mean something like:
>>> # mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
>>> # ll /tmp/test/
>>> totale 4
>>> drwxr-xr-x. 6 vdsm kvm 4096 12 feb 11.49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>> -rwxr-xr-x. 1 vdsm kvm 0 12 feb 13.04 __DIRECT_IO_TEST__
>>>
>>> yes, I can still access the gluster volume.
>>
>> While sanlock was failing to update the lease, or now?
?
I'm monitoring the storage while trying to reproduce on a different host.
Attached is the messages log requested on IRC.
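Roughly something like the following loop against the ids file (path taken
from the sanlock log; the interval and dd options are just an example), to
see when direct reads start failing:

  while true; do
      date
      dd if=/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids of=/dev/null iflag=direct bs=512 count=1 2>&1 | tail -n 1
      sleep 5
  done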
>>
>> Try also to write to this storage.
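>>
>> For example, something like this (just a sketch: sanlock opens the leases
>> with O_DIRECT, so a direct write is closer to what it does; the file name
>> and block size here are only an example):
>>
>>   dd if=/dev/zero of=/tmp/test/direct_write_test oflag=direct bs=4096 count=1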
>
> dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
> 1024+0 records in
> 1024+0 records out
> 4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s
>
> ll /tmp/test/
> total 4100
> -rwxr-xr-x. 1 vdsm kvm 0 Feb 12 13:04 __DIRECT_IO_TEST__
> drwxr-xr-x. 6 vdsm kvm    4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
> -rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test
>
> looks like it works
>
>
>
>>
>>>
>>> # gluster volume info hosted_engine_glusterfs
>>>
>>> Volume Name: hosted_engine_glusterfs
>>> Type: Distribute
>>> Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
>>> Status: Started
>>> Number of Bricks: 1
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: minidell.home:/path1
>>> Options Reconfigured:
>>> nfs.disable: on
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> network.ping-timeout: 10
>>> auth.allow: *
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> network.remote-dio: enable
>>> cluster.eager-lock: enable
>>> performance.stat-prefetch: off
>>> performance.io-cache: off
>>> performance.read-ahead: off
>>> performance.quick-read: off
>>> server.allow-insecure: on
>>> user.cifs: disable
>>>
>>> # gluster volume status hosted_engine_glusterfs
>>> Status of volume: hosted_engine_glusterfs
>>> Gluster process                                         Port    Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick minidell.home:/path1                              49217   Y       16660
>>>
>>> Task Status of Volume hosted_engine_glusterfs
>>> ------------------------------------------------------------------------------
>>> There are no active volume tasks
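>>>
>>> It may also be worth checking whether the fuse client on this host still
>>> shows up as connected to the brick while sanlock is failing, e.g. (the
>>> exact output depends on the gluster version):
>>>
>>>   gluster volume status hosted_engine_glusterfs clients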
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>> And in VDSM logs:
>>>>>
>>>>> MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler) Received signal 15
>>>>> MainThread::DEBUG::2015-02-12 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
>>>>> MainThread::INFO::2015-02-12 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
>>>>> Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
>>>>> MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM channels listener was stopped.
>>>>> MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting down MOM
>>>>> MainThread::DEBUG::2015-02-12 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState) Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving from state init -> state preparing
>>>>> MainThread::INFO::2015-02-12 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
>>>>> Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
>>>>> ioprocess communication (16795)::ERROR::2015-02-12 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess failure
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
>>>>>     raise Exception("FD closed")
>>>>> Exception: FD closed
>>>>> ioprocess communication (16795)::DEBUG::2015-02-12 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...
>>>>>
>>>>>
>>>>> Followed by:
>>>>>
>>>>> MainThread::INFO::2015-02-12 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping domain monitors
>>>>> MainThread::INFO::2015-02-12 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>>> Thread-94::INFO::2015-02-12 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
>>>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
>>>>>     self.domain.releaseHostId(self.hostId, unused=True)
>>>>>   File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
>>>>>     self._clusterLock.releaseHostId(hostId, async, unused)
>>>>>   File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
>>>>>     raise se.ReleaseHostIdFailure(self._sdUUID, e)
>>>>> ReleaseHostIdFailure: Cannot release host id: (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
>>>>>
>>>>> supervdsm logs don't show anything interesting, but:
>>>>>
>>>>> # service supervdsmd status
>>>>> Redirecting to /bin/systemctl status supervdsmd.service
>>>>> supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
>>>>>    Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
>>>>>    Active: inactive (dead) since gio 2015-02-12 13:08:30 CET; 27min ago
>>>>>  Main PID: 15859 (code=exited, status=0/SUCCESS)
>>>>>    CGroup: /system.slice/supervdsmd.service
>>>>>
>>>>> feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service for running helper functions as root"...
>>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent call last):
>>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
>>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]:     finalizer()
>>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
>>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]:     res = self._callback(*self._args, **self._kwargs)
>>>>> feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
>>>>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
>>>>> feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
>>>>> feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for running helper functions as root".
>>>>>
>>>>>
>>>>>
>>>>> After a while the VM became unreachable, as ovirt-hosted-engine-setup shows:
>>>>>
>>>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
>>>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in state
>>>>>
>>>>>
>>>>> Gluster logs don't show anything special.
>>>>>
>>>>> I'm guessing that the 15 minutes of host-deploy being stuck caused the issue here, but I don't see any relevant change that may have caused it.
>>>>>
>>>>> Attaching the tar.gz with the code used for ovirt-hosted-engine-setup; the rest is 3.5-snapshot over RHEL 7.1.
>>>>>
>>>>> Any hint is welcome.
>>>>>
>>>
>>>
>>>
>
>
>
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com