[Users] latest vdsm cannot read ib device speeds causing storage attach fail
Royce Lv
lvroyce at linux.vnet.ibm.com
Fri Feb 1 02:43:44 UTC 2013
On 01/31/2013 10:27 AM, Shu Ming wrote:
> Royce,
>
> It is a good trouble shooting. See my comments about the IOError in
> serve_client() below.
>
> Royce Lv:
>>
>> Hi,
>> I reproduced this issue, and I believe it's a Python bug.
>> 1. How to reproduce:
>> with the test case attached, put it under /usr/share/vdsm/tests/,
>> run #./run_tests.sh superVdsmTests.py
>> and this issue will be reproduced.
>> 2. Log analysis:
>> We notice a strange pattern in this log: connectStorageServer is
>> called twice; the first supervdsm call succeeds and the second fails
>> because of validateAccess().
>> That is because the first validateAccess() returns normally and
>> leaves a child process behind. When the second validateAccess() call
>> arrives and the multiprocessing manager is receiving the method
>> message, that is just when the first child exits and SIGCHLD arrives.
>> The signal interrupts the manager's blocking recv() system call.
>> Python's managers.py should handle EINTR and retry recv(), as we do
>> in vdsm, but it does not, so the second call raises an error.
>> >Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
>> >Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
>> >Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
>> >Traceback (most recent call last):
>> > File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
>> > conObj.connect()
>> > File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
>> > return self._mountCon.connect()
>> > File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
>> > fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
>> > File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
>> > (os.R_OK | os.X_OK))
>> > File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
>> > return callMethod()
>> > File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
>> > **kwargs)
>> > File "<string>", line 2, in validateAccess
>> > File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
>> > raise convert_to_error(kind, result)
>> The vdsm side receives a RemoteError because the supervdsm server's
>> multiprocessing manager raised an error of kind '#TRACEBACK':
>> >RemoteError:
>> The upper part is the traceback from the client side; the part that
>> follows is from the server side:
>> >---------------------------------------------------------------------------
>> >Traceback (most recent call last):
>> > File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
>> > request = recv()
>> >IOError: [Errno 4] Interrupted system call
>> >---------------------------------------------------------------------------
>>
>> Corresponding Python source code: managers.py (server side):
>> def serve_client(self, conn):
>>     '''
>>     Handle requests from the proxies in a particular process/thread
>>     '''
>>     util.debug('starting server thread to service %r',
>>                threading.current_thread().name)
>>
>>     recv = conn.recv
>>     send = conn.send
>>     id_to_obj = self.id_to_obj
>>
>>     while not self.stop:
>>
>>         try:
>>             methodname = obj = None
>>             request = recv()   <-- this line was interrupted by SIGCHLD
>>             ident, methodname, args, kwds = request
>>             obj, exposed, gettypeid = id_to_obj[ident]
>>
>>             if methodname not in exposed:
>>                 raise AttributeError(
>>                     'method %r of %r object is not in exposed=%r' %
>>                     (methodname, type(obj), exposed)
>>                     )
>>
>>             function = getattr(obj, methodname)
>>
>>             try:
>>                 res = function(*args, **kwds)
>>             except Exception, e:
>>                 msg = ('#ERROR', e)
>>             else:
>>                 typeid = gettypeid and gettypeid.get(methodname, None)
>>                 if typeid:
>>                     rident, rexposed = self.create(conn, typeid, res)
>>                     token = Token(typeid, self.address, rident)
>>                     msg = ('#PROXY', (rexposed, token))
>>                 else:
>>                     msg = ('#RETURN', res)
>>
>>         except AttributeError:
>>             if methodname is None:
>>                 msg = ('#TRACEBACK', format_exc())
>>             else:
>>                 try:
>>                     fallback_func = self.fallback_mapping[methodname]
>>                     result = fallback_func(
>>                         self, conn, ident, obj, *args, **kwds
>>                         )
>>                     msg = ('#RETURN', result)
>>                 except Exception:
>>                     msg = ('#TRACEBACK', format_exc())
>>
>>         except EOFError:
>>             util.debug('got EOF -- exiting thread serving %r',
>>                        threading.current_thread().name)
>>             sys.exit(0)
>>
>>         except Exception:   <-- does not handle IOError/EINTR here; should retry recv()
>>             msg = ('#TRACEBACK', format_exc())
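>>
>> A retry-on-EINTR wrapper around recv() -- just a sketch of the pattern
>> vdsm already uses for its own calls, not an actual patch to
>> managers.py -- would look roughly like this:
>>
>>     import errno
>>
>>     def eintr_safe(func, *args, **kwargs):
>>         # Retry the call until it completes without being interrupted
>>         # by a signal (here, the SIGCHLD from the exiting child).
>>         while True:
>>             try:
>>                 return func(*args, **kwargs)
>>             except (IOError, OSError), e:
>>                 if e.errno != errno.EINTR:
>>                     raise
>>
>>     # serve_client() would then do
>>     #     request = eintr_safe(recv)
>>     # instead of
>>     #     request = recv()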
> Not sure I understand what your opinion is here.
>
> Whenever this line was interrupted by a SIGCHLD,
>     request = recv()    <-- this line was interrupted by SIGCHLD
> the code flow should fall into
>
>     except Exception:   <-- does not handle IOError/EINTR here; should retry recv()
>         msg = ('#TRACEBACK', format_exc())
>
>
> However, the big while loop is not stopped, so it should reach this line again.
Yep, I'm not saying this error makes the BaseManager server totally
non-functional. The loop does not stop, but this call returns the error
message to vdsm and makes it raise RemoteError; see managers.py line 273
(http://hg.python.org/cpython/file/2c0197c95ec6/Lib/multiprocessing/managers.py#l273):

    try:
        send(msg)
    except Exception as e:

This makes validateAccess fail every other time.
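
The interruption itself is easy to reproduce outside vdsm; roughly
(Python 2.6, illustrative only, not the attached test case):

    import errno
    import os
    import signal
    import time
    from multiprocessing import Pipe

    def reaper(signum, frame):
        # minimal zombie reaper, like the one supervdsmServer registers
        try:
            os.waitpid(-1, os.WNOHANG)
        except OSError:
            pass

    signal.signal(signal.SIGCHLD, reaper)

    parent, child = Pipe()
    if os.fork() == 0:
        time.sleep(1)    # child exits after a second -> SIGCHLD in the parent
        os._exit(0)

    try:
        parent.recv()    # blocks, then fails when SIGCHLD interrupts the read
    except IOError, e:
        print 'recv() interrupted:', e, e.errno == errno.EINTR
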
>     try:
>         methodname = obj = None
>         request = recv()    <-- this line was interrupted by SIGCHLD
> Then the interrupted recv() will be called again and again until the
> recv() call is not interrupted. What's wrong here?
>
>>
>>
>> 3. Actions we will take:
>> (1) As a workaround, we can first remove the zombie reaper from the
>> supervdsm server (a rough sketch of the idea is below).
>> (2) I'll check whether a later Python release has a fix for this.
>> (3) Yaniv is working on changing the vdsm/svdsm communication channel
>> to a pipe and handling it ourselves; I believe we'll get rid of this
>> issue once that is properly handled.
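>>
>> Sketch of what (1) amounts to -- illustrative only, not the actual
>> supervdsm code:
>>
>>     import os
>>     import signal
>>
>>     # Leave SIGCHLD at its default disposition, so the manager's
>>     # blocking recv() is never interrupted by a reaper signal handler...
>>     signal.signal(signal.SIGCHLD, signal.SIG_DFL)
>>
>>     # ...and reap the helper children explicitly instead of from a handler.
>>     def wait_for_child(pid):
>>         try:
>>             os.waitpid(pid, 0)
>>         except OSError:
>>             pass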
>>
>>
>> On 01/25/2013 06:00 AM, Dead Horse wrote:
>>> Tried some manual edits to SD states in the dbase. The net result
>>> was that I was able to get a node active. However, when reconstruction
>>> of the master storage domain kicked in, it was unable to do so. It was
>>> also unable to recognize the other SDs, with failure modes similar to
>>> the unrecognized master above. Guessing the newer VDSM version borked
>>> things pretty good. So, since this is a test harness and the SD data
>>> is not worth saving, I just smoked all the SDs, ran engine-cleanup,
>>> started fresh, and all is well now.
>>>
>>> - DHC
>>>
>>>
>>> On Thu, Jan 24, 2013 at 11:53 AM, Dead Horse
>>> <deadhorseconsulting at gmail.com> wrote:
>>>
>>> This test harness setup consists of two servers tied to NFS
>>> storage via IB (NFS mounts are via IPoIB, NFS over RDMA is
>>> disabled). All storage domains are NFS. The issue occurs with both
>>> servers when attempting to bring them out of maintenance mode, with
>>> the end result being non-operational due to storage attach failure.
>>>
>>> The current issue now is that, with a working older commit, the
>>> master storage domain is "stuck" in state "locked", and I see a
>>> secondary issue wherein VDSM cannot seem to find or contact the
>>> master storage domain even though it is there. I can mount the
>>> master storage domain manually, and all content appears to be
>>> accounted for on either host.
>>>
>>> Here is the current contents of the master storage domain metadata:
>>> CLASS=Data
>>> DESCRIPTION=orgrimmar
>>> IOOPTIMEOUTSEC=1
>>> LEASERETRIES=3
>>> LEASETIMESEC=5
>>> LOCKPOLICY=
>>> LOCKRENEWALINTERVALSEC=5
>>> MASTER_VERSION=417
>>> POOL_DESCRIPTION=Azeroth
>>> POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
>>> POOL_SPM_ID=1
>>> POOL_SPM_LVER=6
>>> POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
>>> REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
>>> ROLE=Master
>>> SDUUID=67534cca-1327-462a-b455-a04464084b31
>>> TYPE=NFS
>>> VERSION=3
>>> _SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721
>>>
>>> So now, with the older working commit, I get the
>>> "StoragePoolMasterNotFound: Cannot find master domain" error
>>> (prior details above from when I worked backwards to that commit).
>>>
>>> This is odd as the nodes can definitely reach the master storage
>>> domain:
>>>
>>> showmount from one of the el6.3 nodes:
>>> [root at kezan ~]# showmount -e 192.168.0.1
>>> Export list for 192.168.0.1:
>>> /ovirt/orgrimmar 192.168.0.0/16
>>>
>>> mount/ls from one of the nodes:
>>> [root at kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
>>> [root at kezan ~]# ls -al
>>> /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
>>> total 1100
>>> drwxr-xr-x 2 vdsm kvm 4096 Jan 24 11:44 .
>>> drwxr-xr-x 5 vdsm kvm 4096 Oct 19 16:16 ..
>>> -rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids
>>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 inbox
>>> -rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases
>>> -rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata
>>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 outbox
>>>
>>>
>>> - DHC
>>>
>>>
>>>
>>> On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <ybronhei at redhat.com> wrote:
>>>
>>> On 01/24/2013 12:44 AM, Dead Horse wrote:
>>>
>>> I narrowed down on the commit where the originally reported issue
>>> crept in:
>>> commit fc3a44f71d2ef202cff18d7203b9e4165b546621
>>> Building and testing with this commit or subsequent commits yields
>>> the original issue.
>>>
>>> Interesting.. it might be related to this commit, and we're trying
>>> to reproduce it.
>>> Did you try to remove that code and run again? Does it work without
>>> the addition of zombieReaper?
>>> Does the connectivity to the storage work well? When you run 'ls' on
>>> the mounted folder, do you see the files without a long delay? It
>>> might be related to too long a timeout when validating access to
>>> this mount.. we are working on that.. any additional info can help.
>>>
>>> Thanks.
>>>
>>>
>>> - DHC
>>>
>>>
>>> On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
>>> <deadhorseconsulting at gmail.com> wrote:
>>>
>>> Indeed, reverting back to an older vdsm clears up the above issue.
>>> However, now the issue I see is:
>>> Thread-18::ERROR::2013-01-23 15:50:42,885::task::833::TaskManager.Task::(_setError) Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected error
>>> Traceback (most recent call last):
>>>   File "/usr/share/vdsm/storage/task.py", line 840, in _run
>>>     return fn(*args, **kargs)
>>>   File "/usr/share/vdsm/logUtils.py", line 42, in wrapper
>>>     res = f(*args, **kwargs)
>>>   File "/usr/share/vdsm/storage/hsm.py", line 923, in connectStoragePool
>>>     masterVersion, options)
>>>   File "/usr/share/vdsm/storage/hsm.py", line 970, in _connectStoragePool
>>>     res = pool.connect(hostID, scsiKey, msdUUID, masterVersion)
>>>   File "/usr/share/vdsm/storage/sp.py", line 643, in connect
>>>     self.__rebuild(msdUUID=msdUUID, masterVersion=masterVersion)
>>>   File "/usr/share/vdsm/storage/sp.py", line 1167, in __rebuild
>>>     self.masterDomain = self.getMasterDomain(msdUUID=msdUUID, masterVersion=masterVersion)
>>>   File "/usr/share/vdsm/storage/sp.py", line 1506, in getMasterDomain
>>>     raise se.StoragePoolMasterNotFound(self.spUUID, msdUUID)
>>> StoragePoolMasterNotFound: Cannot find master domain: 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280, msdUUID=67534cca-1327-462a-b455-a04464084b31'
>>> Thread-18::DEBUG::2013-01-23 15:50:42,887::task::852::TaskManager.Task::(_run) Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run: 08709e68-bcbc-40d8-843a-d69d4df40ac6 ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2, 'f90a0d1c-06ca-11e2-a05b-00151712f280', '67534cca-1327-462a-b455-a04464084b31', 433) {} failed - stopping task
>>>
>>> This is with vdsm built from commit 25a2d8572ad32352227c98a86631300fbd6523c1
>>> - DHC
>>>
>>>
>>> On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse
>>> <deadhorseconsulting at gmail.com> wrote:
>>>
>>> VDSM was built from:
>>> commit 166138e37e75767b32227746bb671b1dab9cdd5e
>>>
>>> Attached is the full vdsm log
>>>
>>> I should also note that from the engine's perspective the master
>>> storage domain is seen as locked and the others as unknown.
>>>
>>>
>>> On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <danken at redhat.com> wrote:
>>>
>>> On Tue, Jan 22, 2013 at 04:02:24PM -0600,
>>> Dead Horse wrote:
>>>
>>> Any ideas on this one? (from VDSM log):
>>> Thread-25::DEBUG::2013-01-22 15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client [3.57.111.30]::call getCapabilities with () {}
>>> Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed) cannot read ib0 speed
>>> Traceback (most recent call last):
>>>   File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 155, in speed
>>>     s = int(file('/sys/class/net/%s/speed' % dev).read())
>>> IOError: [Errno 22] Invalid argument
>>>
>>> Causes VDSM to fail to attach storage
>>>
>>>
>>> I doubt that this is the cause of the failure, as vdsm has always
>>> reported "0" for ib devices, and still does.
>>>
>>> It happens only when you call getCapabilities, so it is not related
>>> to the flow and it can't affect the storage.
>>> Dan: I guess this is not the issue, but why the IOError?
>>>
>>>
>>> Does a former version work with your Engine?
>>> Could you share more of your vdsm.log? I suppose the culprit lies
>>> in one of the storage-related commands, not in statistics retrieval.
>>>
>>>
>>> Engine side sees:
>>> ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper] (QuartzScheduler_Worker-96) [553ef26e] The connection with details 192.168.0.1:/ovirt/ds failed because of error code 100 and error message is: general exception
>>> 2013-01-22 15:35:30,160 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-96) [1ab78378] Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS
>>> 2013-01-22 15:35:30,200 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-96) [1ab78378] START, SetVdsStatusVDSCommand(HostName = kezan, HostId = 8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd
>>> 2013-01-22 15:35:30,211 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand, log id: 4af5c4cd
>>> 2013-01-22 15:35:30,242 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value: storagepoolname
>>>
>>> Engine = latest master
>>> VDSM = latest master
>>>
>>>
>>> Since "latest master" is an unstable
>>> reference by definition, I'm sure
>>> that History would thank you if you post the
>>> exact version (git hash?)
>>> of the code.
>>>
>>> node = el6
>>>
>>>
>>>
>>>
>>> --
>>> Yaniv Bronhaim.
>>> RedHat, Israel
>>> 09-7692289
>>> 054-7744187
>>>
>>>
>>>
>>
>>
>>
>>
>
>
> --
> ---
> Shu Ming
> Open Virtualization Engineering; CSTL, IBM Corp.
> Tel: 86-10-82451626 Tieline: 9051626 E-mail: shuming at cn.ibm.com or shuming at linux.vnet.ibm.com
> Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian District, Beijing 100193, PRC