On 12/02/2015 15:31, Nir Soffer wrote:
----- Original Message -----
> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
> To: "Nir Soffer" <nsoffer(a)redhat.com>
> Cc: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon Bar-Lev"
<alonbl(a)redhat.com>, "Allon Mureinik" <amureini(a)redhat.com>,
> "Yaniv Dary" <ydary(a)redhat.com>, "Federico Simoncelli"
<fsimonce(a)redhat.com>, "Sahina Bose" <sabose(a)redhat.com>,
> "Yedidyah Bar David" <didi(a)redhat.com>, "Simone Tiraboschi"
<stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
> Skrivanek" <mskrivan(a)redhat.com>, "David Teigland"
<teigland(a)redhat.com>
> Sent: Thursday, February 12, 2015 4:24:55 PM
> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>
> On 12/02/2015 14:03, Nir Soffer wrote:
>> ----- Original Message -----
>>> From: "Sandro Bonazzola" <sbonazzo(a)redhat.com>
>>> To: "Yeela Kaplan" <ykaplan(a)redhat.com>, "Alon
Bar-Lev"
>>> <alonbl(a)redhat.com>, "Allon Mureinik"
<amureini(a)redhat.com>,
>>> "Yaniv Dary" <ydary(a)redhat.com>, "Federico
Simoncelli"
>>> <fsimonce(a)redhat.com>, "Sahina Bose"
<sabose(a)redhat.com>,
>>> "Yedidyah Bar David" <didi(a)redhat.com>, "Simone
Tiraboschi"
>>> <stirabos(a)redhat.com>, devel(a)ovirt.org, "Michal
>>> Skrivanek" <mskrivan(a)redhat.com>
>>> Sent: Thursday, February 12, 2015 2:55:57 PM
>>> Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
>>>
>>> While deploying Hosted Engine on Hyper Converged Gluster Storage, I've hit the following issue.
>>>
>>> ovirt-hosted-engine-setup:
>>>
>>> [ INFO ] Engine replied: DB Up!Welcome to Health Status!
>>>           Enter the name of the cluster to which you want to add the host (Default) [Default]:
>>> [ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
>>> [ INFO ] Still waiting for VDSM host to become operational...
>>>
>>> I accessed the engine; the VM and the engine are working correctly, installed using the 3.5 snapshot repositories.
>>> host-deploy started on the host (see attached logs) and got stuck at:
>>>
>>> 2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND ### Response is VALUE TIME=type:value or ABORT TIME
>>>
>>> until several minutes later, when:
>>> 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
>>> Traceback (most recent call last):
>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
>>>     method['method']()
>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
>>>     format="YYYYmmddHHMMSS+0000"
>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
>>>     opcode, variable = self._readline().split(' ', 1)
>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
>>>     raise IOError(_('End of file'))
>>> IOError: End of file
>>>
>>> resumed the process.
>>>
>>> Around the time host-deploy got stuck, I see in the sanlock logs:
>>>
>>> 2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
>>> 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>> 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
>>
>> Sanlock cannot access your storage. The rest of the errors are less interesting, as nothing will work without storage.
>>
>> While sanlock is failing, can you access that storage from another host or
>> from the host running sanlock?
>>
>> Adding David to help with debugging the sanlock errors.
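
A side note on those sanlock errors: if I read sanlock's logging convention right, the negative codes are negated errno values, so the "open error -2" would be ENOENT and the "-107" renewal errors ENOTCONN ("Transport endpoint is not connected"), which would point at the glusterfs client mount dropping its connection. A quick way to decode them:

# print the error strings for errno 2 and 107 (assumes sanlock logs -errno values)
python -c 'import os; print(os.strerror(2)); print(os.strerror(107))'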
>
>
> If you mean something like:
> # mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
> # ll /tmp/test/
> total 4
> drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
> -rwxr-xr-x. 1 vdsm kvm    0 Feb 12 13:04 __DIRECT_IO_TEST__
>
> yes, I can still access the gluster volume.
While sanlock fails to update the lease, or now?
Also try writing to this storage.
dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s
ll /tmp/test/
total 4100
-rwxr-xr-x. 1 vdsm kvm 0 Feb 12 13:04 __DIRECT_IO_TEST__
drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
-rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test
Looks like it works.
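
That said, a plain dd like the one above goes through the page cache, while sanlock (as far as I know) opens the lease files with O_DIRECT, so a closer approximation of its access pattern would be something like this (the file name is just an example):

# repeat the write with direct I/O, bypassing the client-side cache
dd if=/dev/zero of=/tmp/test/my_direct_write_test bs=4k count=1k oflag=direct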
>
> # gluster volume info hosted_engine_glusterfs
>
> Volume Name: hosted_engine_glusterfs
> Type: Distribute
> Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
> Status: Started
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: minidell.home:/path1
> Options Reconfigured:
> nfs.disable: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> network.ping-timeout: 10
> auth.allow: *
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> server.allow-insecure: on
> user.cifs: disable
>
> # gluster volume status hosted_engine_glusterfs
> Status of volume: hosted_engine_glusterfs
> Gluster process                               Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick minidell.home:/path1                    49217   Y       16660
>
> Task Status of Volume hosted_engine_glusterfs
> ------------------------------------------------------------------------------
> There are no active volume tasks
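
Even though the gluster logs looked clean at first glance, it may be worth grepping the glusterfs client log for that mount for disconnects around 13:08; the log file name below is my guess, based on the usual naming under /var/log/glusterfs derived from the mount point:

# look for client-side disconnects / ENOTCONN around the failure window
grep -iE "disconnect|ENOTCONN" /var/log/glusterfs/rhev-data-center-mnt-glusterSD-minidell.home:_hosted__engine__glusterfs.log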
>
>
>
>>
>>>
>>>
>>> And in VDSM logs:
>>>
>>> MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler) Received signal 15
>>> MainThread::DEBUG::2015-02-12 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
>>> MainThread::INFO::2015-02-12 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
>>> Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
>>> MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM channels listener was stopped.
>>> MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting down MOM
>>> MainThread::DEBUG::2015-02-12 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState) Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving from state init -> state preparing
>>> MainThread::INFO::2015-02-12 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
>>> Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
>>> ioprocess communication (16795)::ERROR::2015-02-12 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess failure
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
>>>     raise Exception("FD closed")
>>> Exception: FD closed
>>> ioprocess communication (16795)::DEBUG::2015-02-12 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...
>>>
>>>
>>> Followed by:
>>>
>>> MainThread::INFO::2015-02-12 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping domain monitors
>>> MainThread::INFO::2015-02-12 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>> Thread-94::INFO::2015-02-12 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
>>>     self.domain.releaseHostId(self.hostId, unused=True)
>>>   File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
>>>     self._clusterLock.releaseHostId(hostId, async, unused)
>>>   File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
>>>     raise se.ReleaseHostIdFailure(self._sdUUID, e)
>>> ReleaseHostIdFailure: Cannot release host id: (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
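
For what it's worth, errno 16 in that SanlockException is EBUSY: the lockspace removal was refused while sanlock still considered the lockspace busy. If this reproduces, sanlock's view of the lockspaces can be dumped on the host with:

# show sanlock daemon status, lockspaces and held resources
sanlock client status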
>>>
>>> supervdsm logs don't show anything interesting, but:
>>>
>>> # service supervdsmd status
>>> Redirecting to /bin/systemctl status supervdsmd.service
>>> supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
>>>    Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
>>>    Active: inactive (dead) since Thu 2015-02-12 13:08:30 CET; 27min ago
>>>  Main PID: 15859 (code=exited, status=0/SUCCESS)
>>>    CGroup: /system.slice/supervdsmd.service
>>>
>>> Feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service for running helper functions as root"...
>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent call last):
>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]:     finalizer()
>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]:   File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]:     res = self._callback(*self._args, **self._kwargs)
>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
>>> Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
>>> Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
>>> Feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for running helper functions as root".
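
The "Received signal 15" in the vdsm log and supervdsmd stopping at 13:08:30 both look like a plain service restart, presumably host-deploy reconfiguring the daemons; if the journal reaches back that far, it should confirm what stopped them:

# show the vdsmd/supervdsmd journal entries in the failure window (times taken from the logs above)
journalctl -u vdsmd.service -u supervdsmd.service --since "2015-02-12 13:07" --until "2015-02-12 13:10"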
>>>
>>>
>>>
>>> After a while, the VM became unreachable, as ovirt-hosted-engine-setup shows:
>>>
>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in state
>>>
>>>
>>> Gluster logs don't show anything special.
>>>
>>> I'm guessing that host-deploy being stuck for 15 minutes caused the issue here, but I don't see any relevant change that may have caused it.
>>>
>>> Attaching the tar.gz with the code used for ovirt-hosted-engine-setup; the rest is 3.5-snapshot over RHEL 7.1.
>>>
>>> Any hint is welcome.
>>>
--
Sandro Bonazzola
Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com