[Users] latest vdsm cannot read ib device speeds causing storage attach fail
Shu Ming
shuming at linux.vnet.ibm.com
Wed Jan 30 21:27:10 EST 2013
Royce,
That is good troubleshooting. See my comments about the IOError in
serve_client() below.
Royce Lv:
>
> Hi,
> I reproduced this issue, and I believe it's a python bug.
> 1. How to reproduce:
>     With the attached test case: put it under /usr/share/vdsm/tests/,
> run # ./run_tests.sh superVdsmTests.py
> and this issue will be reproduced.
> 2. Log analysis:
>     We notice a strange pattern in this log: connectStorageServer is
> called twice; the first supervdsm call succeeds, and the second fails
> because of validateAccess().
>     That is because the first validateAccess() returns normally and
> leaves a child process behind. When the second validateAccess() call
> arrives and the multiprocessing manager is receiving the method message,
> that is just the moment the first child exits and SIGCHLD comes in. The
> signal interrupts the multiprocessing recv() system call; python's
> managers.py should handle EINTR and retry recv() like we do in vdsm,
> but it does not, so the second call raises an error.
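For reference, the retry idiom mentioned above ("like we do in vdsm") looks roughly like this. This is a sketch of the general EINTR-retry pattern, not vdsm's actual code:

```python
import errno


def retry_on_eintr(call):
    """Retry a blocking call that may fail with EINTR when a signal
    (e.g. SIGCHLD) arrives mid-syscall.

    Sketch of the idiom only; vdsm's real helper may differ.
    """
    while True:
        try:
            return call()
        except (IOError, OSError) as e:
            if e.errno != errno.EINTR:
                raise
            # Interrupted by a signal before any data arrived:
            # simply issue the call again.
```

Python 3.5 (PEP 475) eventually made the standard library retry EINTR automatically, but Python 2.6's multiprocessing does not, which is why the manager thread surfaces the raw IOError here.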
> >Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
> >Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
> >Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
> >Traceback (most recent call last):
> > File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
> > conObj.connect()
> > File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
> > return self._mountCon.connect()
> > File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
> > fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
> > File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
> > (os.R_OK | os.X_OK))
> > File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
> > return callMethod()
> > File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
> > **kwargs)
> > File "<string>", line 2, in validateAccess
> > File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
> > raise convert_to_error(kind, result)
> The vdsm side receives a RemoteError because the supervdsm server's
> multiprocessing manager raises an error with KIND='TRACEBACK'
> >RemoteError:
> The upper part is the traceback from the client side; the following
> part is from the server side:
> >---------------------------------------------------------------------------
> >Traceback (most recent call last):
> > File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
> > request = recv()
> >IOError: [Errno 4] Interrupted system call
> >---------------------------------------------------------------------------
>
> Corresponding Python source code, managers.py (server side):
>
>     def serve_client(self, conn):
>         '''
>         Handle requests from the proxies in a particular process/thread
>         '''
>         util.debug('starting server thread to service %r',
>                    threading.current_thread().name)
>
>         recv = conn.recv
>         send = conn.send
>         id_to_obj = self.id_to_obj
>
>         while not self.stop:
>
>             try:
>                 methodname = obj = None
>                 request = recv()   # <-- this line is interrupted by SIGCHLD
>                 ident, methodname, args, kwds = request
>                 obj, exposed, gettypeid = id_to_obj[ident]
>
>                 if methodname not in exposed:
>                     raise AttributeError(
>                         'method %r of %r object is not in exposed=%r' %
>                         (methodname, type(obj), exposed)
>                         )
>
>                 function = getattr(obj, methodname)
>
>                 try:
>                     res = function(*args, **kwds)
>                 except Exception, e:
>                     msg = ('#ERROR', e)
>                 else:
>                     typeid = gettypeid and gettypeid.get(methodname, None)
>                     if typeid:
>                         rident, rexposed = self.create(conn, typeid, res)
>                         token = Token(typeid, self.address, rident)
>                         msg = ('#PROXY', (rexposed, token))
>                     else:
>                         msg = ('#RETURN', res)
>
>             except AttributeError:
>                 if methodname is None:
>                     msg = ('#TRACEBACK', format_exc())
>                 else:
>                     try:
>                         fallback_func = self.fallback_mapping[methodname]
>                         result = fallback_func(
>                             self, conn, ident, obj, *args, **kwds
>                             )
>                         msg = ('#RETURN', result)
>                     except Exception:
>                         msg = ('#TRACEBACK', format_exc())
>
>             except EOFError:
>                 util.debug('got EOF -- exiting thread serving %r',
>                            threading.current_thread().name)
>                 sys.exit(0)
>
>             except Exception:   # <-- IOError/EINTR is not handled here; recv() should be retried
>                 msg = ('#TRACEBACK', format_exc())
Not sure I understand what your opinion is here.
Whenever this line is interrupted by a SIGCHLD,

    request = recv()   # <-- interrupted by SIGCHLD

the code flow falls into:

    except Exception:   # <-- IOError/EINTR is not handled here
        msg = ('#TRACEBACK', format_exc())

However, the big while loop is not stopped, so it should reach this
line again:

    try:
        methodname = obj = None
        request = recv()

Then the interrupted recv() will be called again and again until a
recv() call is not interrupted. What's wrong here?
>
>
> 3. Actions we will take:
> (1) As a workaround, we can first remove the zombie reaper from the
> supervdsm server.
> (2) I'll see whether python has a fixed version for this.
> (3) Yaniv is working on changing the vdsm/svdsm communication channel
> to a pipe and handling it ourselves; I believe we'll get rid of this
> once that is handled properly.
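On workaround (1): a SIGCHLD handler is only installed to reap zombies, so one way to avoid interrupting the manager's syscalls is to let the kernel reap children itself. This is a sketch of the general approach, not necessarily what the vdsm patch does:

```python
import signal

# With SIGCHLD set to SIG_IGN, the kernel reaps exited children
# automatically: no Python-level handler runs, so no blocking
# recv() in the manager thread gets interrupted with EINTR.
#
# Trade-off: os.wait()/Popen.wait() can no longer collect child
# exit statuses (they fail with ECHILD), so this only suits
# processes that never need their children's return codes.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)
```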
>
>
> On 01/25/2013 06:00 AM, Dead Horse wrote:
>> Tried some manual edits to SD states in the dbase. The net result was
>> I was able to get a node active. However as reconstructing the master
>> storage domain kicked in it was unable to do so. It was also not able
>> to recognize the other SD with similar failure modes to the
>> unrecognized master above. Guessing the newer VDSM version borked
>> things pretty good. So, since this is a test harness and the SD
>> data is not worth saving, I just smoked all the SDs. I ran
>> engine-cleanup, started fresh, and all is well now.
>>
>> - DHC
>>
>>
>> On Thu, Jan 24, 2013 at 11:53 AM, Dead Horse
>> <deadhorseconsulting at gmail.com> wrote:
>>
>> This test harness setup here consists of two servers tied to NFS
>> storage via IB (NFS mounts are via IPoIB, NFS over RDMA is
>> disabled) . All storage domains are NFS. The issue does occur
>> with both servers on when attempting to bring them out of
>> maintenance mode with the end result being non-operational due to
>> storage attach fail.
>>
>> The current issue is now that with a working older commit the
>> master storage domain is "stuck" in state "locked" and I see the
>> secondary issue wherein VDSM cannot seem to find or contact the
>> master storage domain even though it is there. I am can mount the
>> master storage domain manually and and all content appears to be
>> accounted for accordingly on either host.
>>
>> Here is the current contents of the master storage domain metadata:
>> CLASS=Data
>> DESCRIPTION=orgrimmar
>> IOOPTIMEOUTSEC=1
>> LEASERETRIES=3
>> LEASETIMESEC=5
>> LOCKPOLICY=
>> LOCKRENEWALINTERVALSEC=5
>> MASTER_VERSION=417
>> POOL_DESCRIPTION=Azeroth
>> POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
>> POOL_SPM_ID=1
>> POOL_SPM_LVER=6
>> POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
>> REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
>> ROLE=Master
>> SDUUID=67534cca-1327-462a-b455-a04464084b31
>> TYPE=NFS
>> VERSION=3
>> _SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721
>>
>> So now with the older working commit I now get this the
>> "StoragePoolMasterNotFound: Cannot find master domain" error
>> (prior details above when I worked backwards to that commit)
>>
>> This is odd as the nodes can definitely reach the master storage
>> domain:
>>
>> showmount from one of the el6.3 nodes:
>> [root at kezan ~]# showmount -e 192.168.0.1
>> Export list for 192.168.0.1:
>> /ovirt/orgrimmar 192.168.0.0/16
>>
>> mount/ls from one of the nodes:
>> [root at kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
>> [root at kezan ~]# ls -al
>> /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
>> total 1100
>> drwxr-xr-x 2 vdsm kvm 4096 Jan 24 11:44 .
>> drwxr-xr-x 5 vdsm kvm 4096 Oct 19 16:16 ..
>> -rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids
>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 inbox
>> -rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases
>> -rw-r--r-- 1 vdsm kvm 903 Jan 10 13:39 metadata
>> -rw-rw---- 1 vdsm kvm 0 Sep 25 00:46 outbox
>>
>>
>> - DHC
>>
>>
>>
>> On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <ybronhei at redhat.com> wrote:
>>
>> On 01/24/2013 12:44 AM, Dead Horse wrote:
>>
>> I narrowed down on the commit where the originally
>> reported issue crept in:
>> commit fc3a44f71d2ef202cff18d7203b9e4165b546621. Building and
>> testing with this commit or subsequent commits yields the
>> original issue.
>>
>> Interesting.. it might be related to this commit; we're
>> trying to reproduce it.
>>
>> Did you try to remove that code and run again? Does it work
>> without the addition of zombieReaper?
>> Does the connectivity to the storage work well? When you run
>> 'ls' on the mounted folder, do you see the files without a
>> long delay? It might be related to too long a timeout when
>> validating access to this mount..
>> We are working on that.. any additional info can help.
>>
>> Thanks.
>>
>>
>> - DHC
>>
>>
>> On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
>> <deadhorseconsulting at gmail.com> wrote:
>>
>> Indeed, reverting back to an older vdsm clears up the
>> above issue. However, now the issue I see is:
>> Thread-18::ERROR::2013-01-23
>> 15:50:42,885::task::833::TaskManager.Task::(_setError)
>> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected
>> error
>>
>> Traceback (most recent call last):
>> File "/usr/share/vdsm/storage/task.py", line 840,
>> in _run
>> return fn(*args, **kargs)
>> File "/usr/share/vdsm/logUtils.py", line 42, in
>> wrapper
>> res = f(*args, **kwargs)
>> File "/usr/share/vdsm/storage/hsm.py", line 923,
>> in connectStoragePool
>> masterVersion, options)
>> File "/usr/share/vdsm/storage/hsm.py", line 970,
>> in _connectStoragePool
>> res = pool.connect(hostID, scsiKey, msdUUID,
>> masterVersion)
>> File "/usr/share/vdsm/storage/sp.py", line 643, in
>> connect
>> self.__rebuild(msdUUID=msdUUID,
>> masterVersion=masterVersion)
>> File "/usr/share/vdsm/storage/sp.py", line 1167,
>> in __rebuild
>> self.masterDomain =
>> self.getMasterDomain(msdUUID=msdUUID,
>> masterVersion=masterVersion)
>> File "/usr/share/vdsm/storage/sp.py", line 1506,
>> in getMasterDomain
>> raise se.StoragePoolMasterNotFound(self.spUUID,
>> msdUUID)
>> StoragePoolMasterNotFound: Cannot find master domain:
>> 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
>> msdUUID=67534cca-1327-462a-b455-a04464084b31'
>> Thread-18::DEBUG::2013-01-23
>> 15:50:42,887::task::852::TaskManager.Task::(_run)
>> Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
>> 08709e68-bcbc-40d8-843a-d69d4df40ac6
>> ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
>> 'f90a0d1c-06ca-11e2-a05b-00151712f280',
>> '67534cca-1327-462a-b455-a04464084b31', 433) {}
>> failed - stopping task
>>
>> This is with vdsm built from
>> commit 25a2d8572ad32352227c98a86631300fbd6523c1
>> - DHC
>>
>>
>> On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse
>> <deadhorseconsulting at gmail.com> wrote:
>>
>> VDSM was built from:
>> commit 166138e37e75767b32227746bb671b1dab9cdd5e
>>
>> Attached is the full vdsm log
>>
>> I should also note that from engine perspective
>> it sees the master
>> storage domain as locked and the others as unknown.
>>
>>
>> On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg
>> <danken at redhat.com> wrote:
>>
>> On Tue, Jan 22, 2013 at 04:02:24PM -0600,
>> Dead Horse wrote:
>>
>> Any ideas on this one? (from VDSM log):
>> Thread-25::DEBUG::2013-01-22
>> 15:35:29,065::BindingXMLRPC::914::vds::(wrapper)
>> client
>>
>> [3.57.111.30]::call
>>
>> getCapabilities with () {}
>> Thread-25::ERROR::2013-01-22
>> 15:35:29,113::netinfo::159::root::(speed)
>> cannot read ib0 speed
>> Traceback (most recent call last):
>> File
>> "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py",
>> line 155,
>>
>> in
>>
>> speed
>> s =
>> int(file('/sys/class/net/%s/speed' %
>> dev).read())
>> IOError: [Errno 22] Invalid argument
>>
>> Causes VDSM to fail to attach storage
>>
>>
>> I doubt that this is the cause of the
>> failure, as vdsm has always
>> reported "0" for ib devices, and still is.
>>
>> it happens only when you call getCapabilities.. so it
>> isn't related to the flow, and it can't affect the storage.
>> Dan: I guess this is not the issue, but then why the IOError?
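On the IOError itself: reading /sys/class/net/<dev>/speed fails with EINVAL on ports whose driver cannot report a speed, which is what the ib0 traceback above shows. A tolerant reader would look something like this sketch (the read_file parameter is a hypothetical hook for illustration, not vdsm's netinfo API):

```python
import errno


def nic_speed(dev, read_file):
    """Return the NIC speed in Mb/s, or 0 when the kernel cannot
    report one (e.g. EINVAL on InfiniBand ports such as ib0).

    read_file is a callable taking a path and returning its text;
    it is a test hook assumed here, not part of vdsm.
    """
    try:
        return int(read_file('/sys/class/net/%s/speed' % dev))
    except (IOError, OSError) as e:
        # EINVAL: driver has no speed to report; ENOENT: no such
        # attribute for this device. Treat both as "unknown" = 0.
        if e.errno in (errno.EINVAL, errno.ENOENT):
            return 0
        raise
```

This matches the behavior Dan describes of vdsm having "always reported 0 for ib devices", which is why the speed read is unlikely to be the cause of the storage attach failure.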
>>
>>
>> Does a former version works with your Engine?
>> Could you share more of your vdsm.log? I
>> suppose the culprit lies in one
>> one of the storage-related commands, not in
>> statistics retrieval.
>>
>>
>> Engine side sees:
>> ERROR
>> [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
>> (QuartzScheduler_Worker-96) [553ef26e]
>> The connection with details
>> 192.168.0.1:/ovirt/ds failed because of
>> error code 100 and error
>>
>> message
>>
>> is: general exception
>> 2013-01-22 15:35:30,160 INFO
>> [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
>> (QuartzScheduler_Worker-96) [1ab78378]
>> Running command:
>> SetNonOperationalVdsCommand internal:
>> true. Entities affected : ID:
>> 8970b3fe-1faf-11e2-bc1f-00151712f280
>> Type: VDS
>> 2013-01-22 15:35:30,200 INFO
>> [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>> (QuartzScheduler_Worker-96) [1ab78378] START,
>> SetVdsStatusVDSCommand(HostName = kezan,
>> HostId =
>> 8970b3fe-1faf-11e2-bc1f-00151712f280,
>> status=NonOperational,
>> nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE),
>> log id: 4af5c4cd
>> 2013-01-22 15:35:30,211 INFO
>> [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>> (QuartzScheduler_Worker-96) [1ab78378]
>> FINISH, SetVdsStatusVDSCommand,
>>
>> log
>>
>> id: 4af5c4cd
>> 2013-01-22 15:35:30,242 ERROR
>> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>> (QuartzScheduler_Worker-96) [1ab78378]
>> Try to add duplicate audit log
>> values with the same name. Type:
>> VDS_SET_NONOPERATIONAL_DOMAIN. Value:
>> storagepoolname
>>
>> Engine = latest master
>> VDSM = latest master
>>
>>
>> Since "latest master" is an unstable
>> reference by definition, I'm sure
>> that History would thank you if you post the
>> exact version (git hash?)
>> of the code.
>>
>> node = el6
>>
>>
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users at ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
>>
>>
>> --
>> Yaniv Bronhaim.
>> RedHat, Israel
>> 09-7692289
>> 054-7744187
>>
>>
>>
>
>
>
>
--
Shu Ming
Open Virtualization Engineering; CSTL, IBM Corp.
Tel: 86-10-82451626 Tieline: 9051626 E-mail: shuming at cn.ibm.com or shuming at linux.vnet.ibm.com
Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian District, Beijing 100193, PRC