[ovirt-devel] [HC] Weird issue while deploying hosted-engine

Nir Soffer nsoffer at redhat.com
Thu Feb 12 14:41:59 UTC 2015



----- Original Message -----
> From: "Sandro Bonazzola" <sbonazzo at redhat.com>
> To: "Nir Soffer" <nsoffer at redhat.com>
> Cc: "Yeela Kaplan" <ykaplan at redhat.com>, "Alon Bar-Lev" <alonbl at redhat.com>, "Allon Mureinik" <amureini at redhat.com>,
> "Yaniv Dary" <ydary at redhat.com>, "Federico Simoncelli" <fsimonce at redhat.com>, "Sahina Bose" <sabose at redhat.com>,
> "Yedidyah Bar David" <didi at redhat.com>, "Simone Tiraboschi" <stirabos at redhat.com>, devel at ovirt.org, "Michal
> Skrivanek" <mskrivan at redhat.com>, "David Teigland" <teigland at redhat.com>
> Sent: Thursday, February 12, 2015 4:34:48 PM
> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
> 
> On 12/02/2015 15:31, Nir Soffer wrote:
> > 
> > 
> > ----- Original Message -----
> >> From: "Sandro Bonazzola" <sbonazzo at redhat.com>
> >> To: "Nir Soffer" <nsoffer at redhat.com>
> >> Cc: "Yeela Kaplan" <ykaplan at redhat.com>, "Alon Bar-Lev"
> >> <alonbl at redhat.com>, "Allon Mureinik" <amureini at redhat.com>,
> >> "Yaniv Dary" <ydary at redhat.com>, "Federico Simoncelli"
> >> <fsimonce at redhat.com>, "Sahina Bose" <sabose at redhat.com>,
> >> "Yedidyah Bar David" <didi at redhat.com>, "Simone Tiraboschi"
> >> <stirabos at redhat.com>, devel at ovirt.org, "Michal
> >> Skrivanek" <mskrivan at redhat.com>, "David Teigland" <teigland at redhat.com>
> >> Sent: Thursday, February 12, 2015 4:24:55 PM
> >> Subject: Re: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
> >>
> >> On 12/02/2015 14:03, Nir Soffer wrote:
> >>> ----- Original Message -----
> >>>> From: "Sandro Bonazzola" <sbonazzo at redhat.com>
> >>>> To: "Yeela Kaplan" <ykaplan at redhat.com>, "Alon Bar-Lev"
> >>>> <alonbl at redhat.com>, "Allon Mureinik" <amureini at redhat.com>,
> >>>> "Yaniv Dary" <ydary at redhat.com>, "Federico Simoncelli"
> >>>> <fsimonce at redhat.com>, "Sahina Bose" <sabose at redhat.com>,
> >>>> "Yedidyah Bar David" <didi at redhat.com>, "Simone Tiraboschi"
> >>>> <stirabos at redhat.com>, devel at ovirt.org, "Michal
> >>>> Skrivanek" <mskrivan at redhat.com>
> >>>> Sent: Thursday, February 12, 2015 2:55:57 PM
> >>>> Subject: [ovirt-devel] [HC] Weird issue while deploying hosted-engine
> >>>>
> >>>> While deploying Hosted Engine on Hyper Converged Gluster Storage I've hit
> >>>> the following issue.
> >>>>
> >>>> ovirt-hosted-engine-setup:
> >>>>
> >>>> [ INFO  ] Engine replied: DB Up!Welcome to Health Status!
> >>>>           Enter the name of the cluster to which you want to add the host
> >>>>           (Default) [Default]:
> >>>> [ INFO  ] Waiting for the host to become operational in the engine. This may take several minutes...
> >>>> [ INFO  ] Still waiting for VDSM host to become operational...
> >>>>
> >>>> I accessed the engine; the VM and the engine were working correctly,
> >>>> installed using the 3.5 snapshot repositories.
> >>>> host-deploy started on the host (see attached logs) and got stuck at:
> >>>>
> >>>> 2015-02-12 13:08:33 DEBUG otopi.plugins.otopi.dialog.machine dialog.__logString:215 DIALOG:SEND       ### Response is VALUE TIME=type:value or ABORT TIME
> >>>>
> >>>> until several minutes later, when:
> >>>>
> >>>> 2015-02-12 13:24:31 DEBUG otopi.context context._executeMethod:152 method exception
> >>>> Traceback (most recent call last):
> >>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/context.py", line 142, in _executeMethod
> >>>>     method['method']()
> >>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/system/clock.py", line 111, in _set_clock
> >>>>     format="YYYYmmddHHMMSS+0000"
> >>>>   File "/tmp/ovirt-ePMfjwrlpe/otopi-plugins/otopi/dialog/machine.py", line 235, in queryValue
> >>>>     opcode, variable = self._readline().split(' ', 1)
> >>>>   File "/tmp/ovirt-ePMfjwrlpe/pythonlib/otopi/dialog.py", line 259, in _readline
> >>>>     raise IOError(_('End of file'))
> >>>> IOError: End of file
> >>>>
> >>>> resumed the process.
> >>>>
> >>>> Around the time host-deploy got stuck, I see in the sanlock logs:
> >>>>
> >>>> 2015-02-12 13:08:25+0100 7254 [683]: open error -1 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/images/51156c24-5ea1-42df-bf33-daebf2c4780c/3719b006-68a2-4742-964c-413a3f77c0b4.lease
> >>>> 2015-02-12 13:08:25+0100 7254 [683]: r4 release_token open error -2
> >>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
> >>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
> >>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
> >>>> 2015-02-12 13:08:31+0100 7260 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
> >>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
> >>>> 2015-02-12 13:08:31+0100 7260 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
> >>>> 2015-02-12 13:08:32+0100 7261 [16910]: a7dba6e3 aio collect 0 0x7f1d640008c0:0x7f1d640008d0:0x7f1d7ec52000 result -107:0 match res
> >>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 delta_renew read rv -107 offset 0 /rhev/data-center/mnt/glusterSD/minidell.home:_hosted__engine__glusterfs/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids
> >>>> 2015-02-12 13:08:32+0100 7261 [16910]: s3 renewal error -107 delta_length 0 last_success 7239
> >>>
> >>> Sanlock cannot access your storage. The rest of the errors are less
> >>> interesting, as nothing will work without storage.
> >>>
> >>> While sanlock is failing, can you access that storage from another host,
> >>> or from the host running sanlock?
> >>>
> >>> Adding David to help with debugging the sanlock errors.
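
Note: "read rv -107" should be -ENOTCONN ("Transport endpoint is not
connected"), i.e. the glusterfs fuse mount lost its connection to the brick
in the middle of the lease renewal. Easy to double-check the errno mapping
on the host:

$ python -c 'import errno, os; print errno.errorcode[107], os.strerror(107)'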
> >>
> >>
> >> If you mean something like:
> >> # mount -t glusterfs minidell.home:/hosted_engine_glusterfs /tmp/test
> >> # ll /tmp/test/
> >> total 4
> >> drwxr-xr-x. 6 vdsm kvm 4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
> >> -rwxr-xr-x. 1 vdsm kvm    0 Feb 12 13:04 __DIRECT_IO_TEST__
> >>
> >> yes, I can still access the gluster volume.
> > 
> > While sanlock fails to update the lease, or now?

?

> > 
> > Try also to write to this storage.
> 
> dd if=/dev/zero of=/tmp/test/my_write_test bs=4k count=1k
> 1024+0 records in
> 1024+0 records out
> 4194304 bytes (4.2 MB) copied, 0.0289396 s, 145 MB/s
> 
> ll /tmp/test/
> total 4100
> -rwxr-xr-x. 1 vdsm kvm        0 Feb 12 13:04 __DIRECT_IO_TEST__
> drwxr-xr-x. 6 vdsm kvm     4096 Feb 12 11:49 a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
> -rw-r--r--. 1 root root 4194304 Feb 12 15:33 my_write_test
> 
> Looks like it works.
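
Note that a plain dd goes through the page cache, while sanlock does its
lease I/O with O_DIRECT, so the test above doesn't exercise the same path.
Worth repeating with direct I/O; an untested sketch, assuming the same
/tmp/test mount:

# dd if=/dev/zero of=/tmp/test/direct_write_test bs=4k count=1k oflag=direct
# dd if=/tmp/test/a7dba6e3-09ac-430a-8d6e-eea17cf18b8f/dom_md/ids of=/dev/null bs=1M iflag=direct

If the mount only breaks intermittently, these may look fine now and still
fail during the next hiccup.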
> 
> 
> 
> > 
> >>
> >> # gluster volume info hosted_engine_glusterfs
> >>
> >> Volume Name: hosted_engine_glusterfs
> >> Type: Distribute
> >> Volume ID: 8d78571d-69ef-41fb-bc0a-34dc7d903df5
> >> Status: Started
> >> Number of Bricks: 1
> >> Transport-type: tcp
> >> Bricks:
> >> Brick1: minidell.home:/path1
> >> Options Reconfigured:
> >> nfs.disable: on
> >> storage.owner-gid: 36
> >> storage.owner-uid: 36
> >> network.ping-timeout: 10
> >> auth.allow: *
> >> cluster.server-quorum-type: server
> >> cluster.quorum-type: auto
> >> network.remote-dio: enable
> >> cluster.eager-lock: enable
> >> performance.stat-prefetch: off
> >> performance.io-cache: off
> >> performance.read-ahead: off
> >> performance.quick-read: off
> >> server.allow-insecure: on
> >> user.cifs: disable
> >>
> >> # gluster volume status hosted_engine_glusterfs
> >> Status of volume: hosted_engine_glusterfs
> >> Gluster process						Port	Online	Pid
> >> ------------------------------------------------------------------------------
> >> Brick minidell.home:/path1				49217	Y	16660
> >>
> >> Task Status of Volume hosted_engine_glusterfs
> >> ------------------------------------------------------------------------------
> >> There are no active volume tasks
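
One thing that stands out above: network.ping-timeout is 10 seconds, so any
brick or network hiccup longer than that makes the fuse client mark the
brick down and fail I/O with ENOTCONN until it reconnects. Any disconnect
should show up in the fuse client log; something like this (the log file
name is derived from the mount point, so the exact path is a guess):

# grep -iE 'disconnect|ping timer' /var/log/glusterfs/rhev-data-center-mnt-glusterSD-minidell.home:_hosted__engine__glusterfs.log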
> >>
> >>
> >>
> >>>
> >>>>
> >>>>
> >>>> And in VDSM logs:
> >>>>
> >>>> MainThread::DEBUG::2015-02-12 13:08:25,529::vdsm::58::vds::(sigtermHandler) Received signal 15
> >>>> MainThread::DEBUG::2015-02-12 13:08:25,529::protocoldetector::144::vds.MultiProtocolAcceptor::(stop) Stopping Acceptor
> >>>> MainThread::INFO::2015-02-12 13:08:25,530::__init__::565::jsonrpc.JsonRpcServer::(stop) Stopping JsonRPC Server
> >>>> Detector thread::DEBUG::2015-02-12 13:08:25,530::protocoldetector::115::vds.MultiProtocolAcceptor::(_cleanup) Cleaning Acceptor
> >>>> MainThread::INFO::2015-02-12 13:08:25,531::vmchannels::188::vds::(stop) VM channels listener was stopped.
> >>>> MainThread::INFO::2015-02-12 13:08:25,535::momIF::91::MOM::(stop) Shutting down MOM
> >>>> MainThread::DEBUG::2015-02-12 13:08:25,535::task::595::Storage.TaskManager.Task::(_updateState) Task=`c52bd13a-24f0-4d8a-97d8-f6c53ca8bddc`::moving from state init -> state preparing
> >>>> MainThread::INFO::2015-02-12 13:08:25,535::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
> >>>> Thread-11::DEBUG::2015-02-12 13:08:25,535::storageServer::706::Storage.ConnectionMonitor::(_monitorConnections) Monitoring stopped
> >>>> ioprocess communication (16795)::ERROR::2015-02-12 13:08:25,530::__init__::152::IOProcessClient::(_communicate) IOProcess failure
> >>>> Traceback (most recent call last):
> >>>>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 107, in _communicate
> >>>>     raise Exception("FD closed")
> >>>> Exception: FD closed
> >>>> ioprocess communication (16795)::DEBUG::2015-02-12 13:08:25,537::__init__::298::IOProcessClient::(_run) Starting IOProcess...
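
Note the "Received signal 15" at 13:08:25: vdsmd was being stopped
(presumably host-deploy restarting it) at the same moment sanlock started
failing to renew. The journal should show what initiated the stop; something
like:

# journalctl -u vdsmd.service --since '2015-02-12 13:08:00' --until '2015-02-12 13:09:00'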
> >>>>
> >>>>
> >>>> Followed by:
> >>>>
> >>>> MainThread::INFO::2015-02-12 13:08:25,548::domainMonitor::137::Storage.DomainMonitor::(close) Stopping domain monitors
> >>>> MainThread::INFO::2015-02-12 13:08:25,549::domainMonitor::114::Storage.DomainMonitor::(stopMonitoring) Stop monitoring a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
> >>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
> >>>> Thread-94::INFO::2015-02-12 13:08:25,549::clusterlock::245::Storage.SANLock::(releaseHostId) Releasing host id for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f (id: 1)
> >>>> Thread-94::DEBUG::2015-02-12 13:08:25,549::domainMonitor::201::Storage.DomainMonitorThread::(_monitorLoop) Unable to release the host id 1 for domain a7dba6e3-09ac-430a-8d6e-eea17cf18b8f
> >>>> Traceback (most recent call last):
> >>>>   File "/usr/share/vdsm/storage/domainMonitor.py", line 198, in _monitorLoop
> >>>>     self.domain.releaseHostId(self.hostId, unused=True)
> >>>>   File "/usr/share/vdsm/storage/sd.py", line 480, in releaseHostId
> >>>>     self._clusterLock.releaseHostId(hostId, async, unused)
> >>>>   File "/usr/share/vdsm/storage/clusterlock.py", line 252, in releaseHostId
> >>>>     raise se.ReleaseHostIdFailure(self._sdUUID, e)
> >>>> ReleaseHostIdFailure: Cannot release host id: (u'a7dba6e3-09ac-430a-8d6e-eea17cf18b8f', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
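
Error 16 in the SanlockException is EBUSY: sanlock refuses to remove a
lockspace while the host id or resources on it are still in use, which is
expected while renewal is failing and a lease is still held. While it is
stuck like this, the daemon's own view is worth dumping:

# sanlock client status
# sanlock client log_dump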
> >>>>
> >>>> supervdsm logs don't show anything interesting, but:
> >>>>
> >>>> # service supervdsmd status
> >>>> Redirecting to /bin/systemctl status  supervdsmd.service
> >>>> supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
> >>>>    Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
> >>>>    Active: inactive (dead) since Thu 2015-02-12 13:08:30 CET; 27min ago
> >>>>  Main PID: 15859 (code=exited, status=0/SUCCESS)
> >>>>    CGroup: /system.slice/supervdsmd.service
> >>>>
> >>>> Feb 12 13:08:30 minidell.home systemd[1]: Stopping "Auxiliary vdsm service for running helper functions as root"...
> >>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: Traceback (most recent call last):
> >>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 268, in _run_finalizers
> >>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: finalizer()
> >>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: File "/usr/lib64/python2.7/multiprocessing/util.py", line 201, in __call__
> >>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: res = self._callback(*self._args, **self._kwargs)
> >>>> Feb 12 13:08:30 minidell.home daemonAdapter[15859]: OSError: [Errno 2] No such file or directory: '/var/run/vdsm/svdsm.sock'
> >>>> Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 client mech dispose
> >>>> Feb 12 13:08:30 minidell.home python[15859]: DIGEST-MD5 common mech dispose
> >>>> Feb 12 13:08:30 minidell.home systemd[1]: Stopped "Auxiliary vdsm service for running helper functions as root".
> >>>>
> >>>>
> >>>>
> >>>> After a while the VM became unreachable, as ovirt-hosted-engine-setup shows:
> >>>>
> >>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:183 Error fetching host state: [ERROR]::oVirt API connection failure, [Errno 110] Connection timed out
> >>>> 2015-02-12 13:10:33 DEBUG otopi.plugins.ovirt_hosted_engine_setup.engine.add_host add_host._wait_host_ready:189 VDSM host in  state
> >>>>
> >>>>
> >>>> Gluster logs don't show anything special.
> >>>>
> >>>> I'm guessing that the 15 minutes of host-deploy being stuck caused the
> >>>> issue here, but I don't see any relevant change that may have caused it.
> >>>>
> >>>> Attaching the tar.gz with the code used for ovirt-hosted-engine-setup;
> >>>> the rest is 3.5-snapshot over RHEL 7.1.
> >>>>
> >>>> Any hint is welcome.
> >>>>
> >>>> --
> >>>> Sandro Bonazzola
> >>>> Better technology. Faster innovation. Powered by community
> >>>> collaboration.
> >>>> See how it works at redhat.com
> >>>>
> >>>> _______________________________________________
> >>>> Devel mailing list
> >>>> Devel at ovirt.org
> >>>> http://lists.ovirt.org/mailman/listinfo/devel
> >>
> >>
> >> --
> >> Sandro Bonazzola
> >> Better technology. Faster innovation. Powered by community collaboration.
> >> See how it works at redhat.com
> >>
> 
> 
> --
> Sandro Bonazzola
> Better technology. Faster innovation. Powered by community collaboration.
> See how it works at redhat.com
> 


