[ovirt-devel] [HC] Weird issue while deploying hosted-engine

Sandro Bonazzola sbonazzo at redhat.com
Thu Feb 12 15:04:28 UTC 2015


On 12/02/2015 15:41, Nir Soffer wrote:
> 
> 
> ----- Original Message -----
>> From: "Sandro Bonazzola" <sbonazzo at redhat.com>
>> To: "Nir Soffer" <nsoffer at redhat.com>
>> Cc: "Yeela Kaplan" <ykaplan at redhat.com>, "Alon Bar-Lev" <alonbl at redhat.com>, "Allon Mureinik" <amureini at redhat.com>,
>> "Yaniv Dary" <ydary at redhat.com>, "Federico Simoncelli" <fsimonce at redhat.com>, "Sahina Bose" <sabose at redhat.com>,
>> "Yedidyah Bar David" <didi at redhat.com>, "Simone Tiraboschi" <stirabos at redhat.com>, devel at ovirt.org, "Michal
>> Skrivanek" <mskrivan at redhat.com>, "David Teigland" <teigland at redhat.com>
>> Sent: Thursday, February 12, 2015 4:34:48 PM
>> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>
>> On 12/02/2015 15:31, Nir Soffer wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Sandro Bonazzola" <sbonazzo at redhat.com>
>>>> To: "Nir Soffer" <nsoffer at redhat.com>
>>>> Cc: "Yeela Kaplan" <ykaplan at redhat.com>, "Alon Bar-Lev"
>>>> <alonbl at redhat.com>, "Allon Mureinik" <amureini at redhat.com>,
>>>> "Yaniv Dary" <ydary at redhat.com>, "Federico Simoncelli"
>>>> <fsimonce at redhat.com>, "Sahina Bose" <sabose at redhat.com>,
>>>> "Yedidyah Bar David" <didi at redhat.com>, "Simone Tiraboschi"
>>>> <stirabos at redhat.com>, devel at ovirt.org, "Michal
>>>> Skrivanek" <mskrivan at redhat.com>, "David Teigland" <teigland at redhat.com>
>>>> Sent: Thursday, February 12, 2015 4:24:55 PM
>>>> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>>>
>>>> On 12/02/2015 14:03, Nir Soffer wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Sandro Bonazzola" <sbonazzo at redhat.com>
>>>>>> To: "Yeela Kaplan" <ykaplan at redhat.com>, "Alon Bar-Lev"
>>>>>> <alonbl at redhat.com>, "Allon Mureinik" <amureini at redhat.com>,
>>>>>> "Yaniv Dary" <ydary at redhat.com>, "Federico Simoncelli"
>>>>>> <fsimonce at redhat.com>, "Sahina Bose" <sabose at redhat.com>,
>>>>>> "Yedidyah Bar David" <didi at redhat.com>, "Simone Tiraboschi"
>>>>>> <stirabos at redhat.com>, devel at ovirt.org, "Michal
>>>>>> Skrivanek" <mskrivan at redhat.com>
>>>>>> Sent: Thursday, February 12, 2015 2:55:57 PM
>>>>>> Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>>>>>
>>>>>> While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit
>>>>>> the following issue.
>>>>>>
>>>>>> ovirt-hosted-engine-setup:
>>>>>>
>>>>>> [ INFO  ] Engine replied: DB Up!Welcome to Health Status!
>>>>>>           Enter the name of the cluster to which you want to add the host
>>>>>>           (Default) [Default]:
>>>>>> [ INFO  ] Waiting for the host to become operational in the engine. This
>>>>>>           may take several minutes...
>>>>>> [ INFO  ] Still waiting for VDSM host to become operational...
>>>>>>
>>>>>> Accessed the engine; the VM and the engine are working correctly,
>>>>>> installed using the 3.5 snapshot repositories.
>>>>>> host-deploy started on the host (see attached logs) and got stuck at:
>>>>>>
>>>>>> 2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND       ### Response is VALUE TIME=type:value or ABORT TIME
>>>>>>
>>>>>> until several minutes later when:
>>>>>> 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
>>>>>> Traceback (most recent call last):
>>>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
>>>>>>     method['method']()
>>>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
>>>>>>     format="YYYYmmddHHMMSS+0000"
>>>>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
>>>>>>     opcode, variable = self._readline().split(' ', 1)
>>>>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
>>>>>>     raise IOError(_('End of file'))
>>>>>> IOError: End of file
>>>>>>
>>>>>> which resumed the process.
>>>>>>
>>>>>> Around the time host-deploy got stuck, I see in the sanlock logs:
>>>>>>
>>>>>> 2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
>>>>>> 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
>>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>>>> 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>>>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>>>>
>>>>> Sanlock cannot access your storage. The rest of the errors are less
>>>>> interesting, as nothing will work without storage.
>>>>>
>>>>> While sanlock is failing, can you access that storage from another host
>>>>> or from the host running sanlock?
>>>>>
>>>>> Adding David for helping with debugging the sanlock errors.
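
For reference, the negative numbers in the sanlock log above are errno values: -107 is ENOTCONN ("Transport endpoint is not connected", i.e. the gluster FUSE client dropped its connection to the brick) and -2 is ENOENT. They can be decoded from any shell:

# python -c "import os; print(os.strerror(107))"
Transport endpoint is not connected
# python -c "import os; print(os.strerror(2))"
No such file or directory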
>>>>
>>>>
>>>> If you mean something like:
>>>> # mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
>>>> # ll /tmp/test/
>>>> total 4
>>>> drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>> -rwxr-xr-x. 1 vdsm kvm    0 Feb 12 13:04 __DIRECT_IO_TEST__
>>>>
>>>> Yes, I can still access the gluster volume.
>>>
>>> While sanlock was failing to update the lease, or now?
> 
> ?

Monitoring the storage while trying to reproduce on a different host.

Attached the messages log requested on IRC.



> 
>>>
>>> Try also to write to this storage.
>>
>> dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
>> 1024+0 records in
>> 1024+0 records out
>> 4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s
>>
>> ll /tmp/test/
>> total 4100
>> -rwxr-xr-x. 1 vdsm kvm        0 Feb 12 13:04 __DIRECT_IO_TEST__
>> drwxr-xr-x. 6 vdsm kvm     4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>> -rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test
>>
>> Looks like it works.
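
Worth noting that the dd test above goes through the page cache, while sanlock accesses the lease area with direct I/O. A closer reproduction of what sanlock does is a direct read of the ids file from the renewal errors above, for example:

# dd if=/rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids iflag=direct bs=4096 count=1 of=/dev/null
# sanlock client status

If the direct read hangs or fails while the buffered write succeeds, the problem is in the direct-I/O path of the mount rather than in the volume as a whole; sanlock client status shows whether the daemon still holds the lockspace.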
>>
>>
>>
>>>
>>>>
>>>> # gluster volume info hosted_engine_glusterfs
>>>>
>>>> Volume Name: hosted_engine_glusterfs
>>>> Type: Distribute
>>>> Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
>>>> Status: Started
>>>> Number of Bricks: 1
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: minidell.home:/path1
>>>> Options Reconfigured:
>>>> nfs.disable: on
>>>> storage.owner-gid: 36
>>>> storage.owner-uid: 36
>>>> network.ping-timeout: 10
>>>> auth.allow: *
>>>> cluster.server-quorum-type: server
>>>> cluster.quorum-type: auto
>>>> network.remote-dio: enable
>>>> cluster.eager-lock: enable
>>>> performance.stat-prefetch: off
>>>> performance.io-cache: off
>>>> performance.read-ahead: off
>>>> performance.quick-read: off
>>>> server.allow-insecure: on
>>>> user.cifs: disable
>>>>
>>>> # gluster volume status hosted_engine_glusterfs
>>>> Status of volume: hosted_engine_glusterfs
>>>> Gluster process						Port	Online	Pid
>>>> ------------------------------------------------------------------------------
>>>> Brick minidell.home:/path1				49217	Y	16660
>>>>
>>>> Task Status of Volume hosted_engine_glusterfs
>>>> ------------------------------------------------------------------------------
>>>> There are no active volume tasks
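
Since the brick is reported online while sanlock sees -107, it may also help to check whether this host's FUSE client is still listed as connected on the gluster side:

# gluster volume status hosted_engine_glusterfs clients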
>>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> And in VDSM logs:
>>>>>>
>>>>>> MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler) Received signal 15
>>>>>> MainThread::DEBUG::2015-02-12 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
>>>>>> MainThread::INFO::2015-02-12 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
>>>>>> Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
>>>>>> MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM channels listener was stopped.
>>>>>> MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting down MOM
>>>>>> MainThread::DEBUG::2015-02-12 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState) Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving from state init -> state preparing
>>>>>> MainThread::INFO::2015-02-12 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
>>>>>> Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
>>>>>> ioprocess communication (16795)::ERROR::2015-02-12 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess failure
>>>>>> Traceback (most recent call last):
>>>>>>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
>>>>>>     raise Exception("FD closed")
>>>>>> Exception: FD closed
>>>>>> ioprocess communication (16795)::DEBUG::2015-02-12 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...
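
Note the "Received signal 15" at 13:08:25: vdsmd was being stopped (host-deploy restarts it during installation) in the same second the sanlock renewal errors start, so the shutdown and the storage errors can be lined up in the journal, e.g.:

# journalctl --since "2015-02-12 13:08:00" --until "2015-02-12 13:10:00" -u vdsmd -u supervdsmd -u sanlock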
>>>>>>
>>>>>>
>>>>>> Followed by:
>>>>>>
>>>>>> MainThread::INFO::2015-02-12 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping domain monitors
>>>>>> MainThread::INFO::2015-02-12 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>>>> Thread-94::INFO::2015-02-12 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
>>>>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>>>>> Traceback (most recent call last):
>>>>>>   File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
>>>>>>     self.domain.releaseHostId(self.hostId, unused=True)
>>>>>>   File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
>>>>>>     self._clusterLock.releaseHostId(hostId, async, unused)
>>>>>>   File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
>>>>>>     raise se.ReleaseHostIdFailure(self._sdUUID, e)
>>>>>> ReleaseHostIdFailure: Cannot release host id: (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
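
The 16 in SanlockException(16, ...) is EBUSY: sanlock refuses to remove a lockspace whose host id it still considers busy, which is the expected behaviour while lease renewal is failing against unreachable storage. Sanlock's own view of the failure can be dumped with:

# sanlock client log_dump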
>>>>>>
>>>>>> The supervdsm logs don't show anything interesting, but:
>>>>>>
>>>>>> # service supervdsmd status
>>>>>> Redirecting to /bin/systemctl status  supervdsmd.service
>>>>>> supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
>>>>>>    Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
>>>>>>    Active: inactive (dead) since Thu 2015-02-12 13:08:30 CET; 27min ago
>>>>>>  Main PID: 15859 (code=exited, status=0/SUCCESS)
>>>>>>    CGroup: /system.slice/supervdsmd.service
>>>>>>
>>>>>> Feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service for running helper functions as root"...
>>>>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent call last):
>>>>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
>>>>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: finalizer()
>>>>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
>>>>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: res = self._callback(*self._args, **self._kwargs)
>>>>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
>>>>>> Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
>>>>>> Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
>>>>>> Feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for running helper functions as root".
>>>>>>
>>>>>>
>>>>>>
>>>>>> After a while the VM became unreachable, as ovirt-hosted-engine-setup shows:
>>>>>>
>>>>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
>>>>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in  state
>>>>>>
>>>>>>
>>>>>> Gluster logs don't show anything special.
>>>>>>
>>>>>> I'm guessing that the 15 minutes of host-deploy being stuck caused the
>>>>>> issue here, but I don't see any relevant change that may have caused it.
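
Even when the brick logs look clean, the FUSE client keeps its own log under /var/log/glusterfs/ (named after the mount point), and a disconnect around 13:08 should show up there, e.g.:

# grep -iE 'disconnect|ping timeout' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-*.log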
>>>>>>
>>>>>> Attaching the tar.gz with the code used for ovirt-hosted-engine-setup;
>>>>>> the rest is 3.5-snapshot over RHEL 7.1.
>>>>>>
>>>>>> Any hint is welcome.
>>>>>>
>>>>>> --
>>>>>> Sandro Bonazzola
>>>>>> Better technology. Faster innovation. Powered by community
>>>>>> collaboration.
>>>>>> See how it works at redhat.com
>>>>>>
>>>>>> _______________________________________________
>>>>>> Devel mailing list
>>>>>> Devel at ovirt.org
>>>>>> http://lists.ovirt.org/mailman/listinfo/devel
>>>>
>>>>
>>>> --
>>>> Sandro Bonazzola
>>>> Better technology. Faster innovation. Powered by community collaboration.
>>>> See how it works at redhat.com
>>>>
>>
>>
>> --
>> Sandro Bonazzola
>> Better technology. Faster innovation. Powered by community collaboration.
>> See how it works at redhat.com
>>


-- 
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.gz
Type: application/gzip
Size: 59569 bytes
Desc: not available
URL: <http://lists.ovirt.org/pipermail/devel/attachments/20150212/e58d0791/attachment-0001.bin>

