[Users] latest vdsm cannot read ib device speeds causing storage attach fail

Thu Jan 31 02:27:10 UTC 2013

Royce,

It is a good trouble shooting.  See my comments about the IOError in 
serve_client() below.

Royce Lv:
>
> Hi,
>    I reproduced this issue, and I believe it's a python bug.
>    1. How to reproduce:
>    with the test case attached, put it under /usr/share/vdsm/tests/, 
> run #./run_tests.sh superVdsmTests.py
>    and this issue will be reproduced.
>    2.Log analyse:
>    We notice a strange pattern in this log: connectStorageServer be 
> called twice, first supervdsm call succeed, second fails becasue of  
> validateAccess().
>    That is because for the first call validateAccess returns normally 
> and leave a child there, when the second validateAccess call arrives 
> and multirprocessing manager is receiving the method message, it is 
> just the time first child exit and SIGCHLD comming, this signal 
> interrupted multiprocessing receive system call, python managers.py 
> should handle INTR and retry recv() like we do in vdsm but it's not, 
> so the second one raise error.
> >Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
> >Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
> >Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
> >Traceback (most recent call last):
> >  File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
> >    conObj.connect()
> >  File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
> >    return self._mountCon.connect()
> >  File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
> >    fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
> >  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
> >    (os.R_OK | os.X_OK))
> >  File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
> >    return callMethod()
> >  File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
> >    **kwargs)
> >  File "<string>", line 2, in validateAccess
> >  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
> >    raise convert_to_error(kind, result)
> the vdsm side receive RemoteError because of supervdsm server 
> multiprocessing manager raise error KIND='TRACEBACK'
>   >RemoteError:
> The upper part is the trace back from the client side, the following 
> part is from server side:
> >---------------------------------------------------------------------------
> >Traceback (most recent call last):
> >  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
> >    request = recv()
> >IOError: [Errno 4] Interrupted system call
> >---------------------------------------------------------------------------
>
> Corresponding Python source code:managers.py(Server side)
>      def serve_client(self, conn):
>          '''
>          Handle requests from the proxies in a particular process/thread
>          '''
>          util.debug('starting server thread to service %r',
>                     threading.current_thread().name)
>          recv = conn.recv
>          send = conn.send
>          id_to_obj = self.id_to_obj
>          while not self.stop:
>              try:
>                  methodname = obj = None
>                  request = recv()<------------------this line been interrupted by SIGCHLD
>                  ident, methodname, args, kwds = request
>                  obj, exposed, gettypeid = id_to_obj[ident]
>                  if methodname not in exposed:
>                      raise AttributeError(
>                          'method %r of %r object is not in exposed=%r' %
>                          (methodname, type(obj), exposed)
>                          )
>                  function = getattr(obj, methodname)
>                  try:
>                      res = function(*args, **kwds)
>                  except Exception, e:
>                      msg = ('#ERROR', e)
>                  else:
>                      typeid = gettypeid and gettypeid.get(methodname, None)
>                      if typeid:
>                          rident, rexposed = self.create(conn, typeid, res)
>                          token = Token(typeid, self.address, rident)
>                          msg = ('#PROXY', (rexposed, token))
>                      else:
>                          msg = ('#RETURN', res)
>              except AttributeError:
>                  if methodname is None:
>                      msg = ('#TRACEBACK', format_exc())
>                  else:
>                      try:
>                          fallback_func = self.fallback_mapping[methodname]
>                          result = fallback_func(
>                              self, conn, ident, obj, *args, **kwds
>                              )
>                          msg = ('#RETURN', result)
>                      except Exception:
>                          msg = ('#TRACEBACK', format_exc())
>
>              except EOFError:
>                  util.debug('got EOF -- exiting thread serving %r',
>                             threading.current_thread().name)
>                  sys.exit(0)
>
>              except Exception:<------does not handle IOError,INTR here should retry recv()
>                  msg = ('#TRACEBACK', format_exc())
Not sure I under what is your opinion here.

Whenever this line was interrupted by a SIGCHLD ,

request = recv()<------------------this line been interrupted by SIGCHLD

the code flow should fall into

  except Exception:<------does not handle IOError,INTR here should retry recv()
                 msg = ('#TRACEBACK', format_exc())

However, the big while loop was not stopped, it should reach this line again.

try:

                 methodname = obj = None

                 request = recv()<------------------this line been interrupted by SIGCHLD

Then the interrupted recv() will be called again and again until the 
recv() call is not interrupted.   What's wrong here?

>
>
> 3. Actions we will take:
> (1)As a work round we can first remove the zombie reaper from 
> supervdsm server
> (2)I'll see whether python has a fixed version for this
> (3)Yaniv is working on changing vdsm/svdsm communication channel to 
> pipe and handle it ourselves, I believe we'll get rid of this with 
> that properly handled.
>
>
> On 01/25/2013 06:00 AM, Dead Horse wrote:
>> Tried some manual edits to SD states in the dbase. The net result was 
>> I was able to get a node active. However as reconstructing the master 
>> storage domain kicked in it was unable to do so. It was also not able 
>> to recognize the other SD with similar failure modes to the 
>> unrecognized master above. Guessing the newer VDSM version borked 
>> things pretty good. So being as this is a test harness and the SD 
>> data is not worth saving I just smoked the all SD. I ran 
>> engine-cleanup and started from fresh and all is well now.
>>
>> - DHC
>>
>>
>> On Thu, Jan 24, 2013 at 11:53 AM, Dead Horse 
>> <deadhorseconsulting at gmail.com 
>> <mailto:deadhorseconsulting at gmail.com>> wrote:
>>
>>     This test harness setup here consists of two servers tied to NFS
>>     storage via IB (NFS mounts are via IPoIB, NFS over RDMA is
>>     disabled) . All storage domains are NFS. The issue does occur
>>     with both servers on when attempting to bring them out of
>>     maintenance mode with the end result being non-operational due to
>>     storage attach fail.
>>
>>     The current issue is now that with a working older commit the
>>     master storage domain is "stuck" in state "locked" and I see the
>>     secondary issue wherein VDSM cannot seem to find or contact the
>>     master storage domain even though it is there. I am can mount the
>>     master storage domain manually and and all content appears to be
>>     accounted for accordingly on either host.
>>
>>     Here is the current contents of the master storage domain metadata:
>>     CLASS=Data
>>     DESCRIPTION=orgrimmar
>>     IOOPTIMEOUTSEC=1
>>     LEASERETRIES=3
>>     LEASETIMESEC=5
>>     LOCKPOLICY=
>>     LOCKRENEWALINTERVALSEC=5
>>     MASTER_VERSION=417
>>     POOL_DESCRIPTION=Azeroth
>>     POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
>>     POOL_SPM_ID=1
>>     POOL_SPM_LVER=6
>>     POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
>>     REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
>>     ROLE=Master
>>     SDUUID=67534cca-1327-462a-b455-a04464084b31
>>     TYPE=NFS
>>     VERSION=3
>>     _SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721
>>
>>     So now with the older working commit I now get this the
>>     "StoragePoolMasterNotFound: Cannot find master domain" error
>>     (prior details above when I worked backwards to that commit)
>>
>>     This is odd as the nodes can definitely reach the master storage
>>     domain:
>>
>>     showmount from one of the el6.3 nodes:
>>     [root at kezan ~]# showmount -e 192.168.0.1
>>     Export list for 192.168.0.1 <http://192.168.0.1>:
>>     /ovirt/orgrimmar 192.168.0.0/16 <http://192.168.0.0/16>
>>
>>     mount/ls from one of the nodes:
>>     [root at kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
>>     [root at kezan ~]# ls -al
>>     /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
>>     total 1100
>>     drwxr-xr-x 2 vdsm kvm    4096 Jan 24 11:44 .
>>     drwxr-xr-x 5 vdsm kvm    4096 Oct 19 16:16 ..
>>     -rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids
>>     -rw-rw---- 1 vdsm kvm       0 Sep 25 00:46 inbox
>>     -rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases
>>     -rw-r--r-- 1 vdsm kvm     903 Jan 10 13:39 metadata
>>     -rw-rw---- 1 vdsm kvm       0 Sep 25 00:46 outbox
>>
>>
>>     - DHC
>>
>>
>>
>>     On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <ybronhei at redhat.com
>>     <mailto:ybronhei at redhat.com>> wrote:
>>
>>         On 01/24/2013 12:44 AM, Dead Horse wrote:
>>
>>             I narrowed down on the commit where the originally
>>             reported issue crept in:
>>             commitfc3a44f71d2ef202cff18d7203b9e4165b546621building
>>             and testing with
>>
>>             this commit or subsequent commits yields the original issue.
>>
>>         Interesting.. it might be related to this commit and we're
>>         trying to reproduce it.
>>
>>         Did you try to remove that code and run again? does it work
>>         without the additional of zombieReaper?
>>         does the connectivity to the storage work well? when you run
>>         'ls' on the mounted folder you get see the files without a
>>         long delay ? it might related to too long timeout when
>>         validating access to this mount..
>>         we work on that.. any additional info can help
>>
>>         Thanks.
>>
>>
>>             - DHC
>>
>>
>>             On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
>>             <deadhorseconsulting at gmail.com
>>             <mailto:deadhorseconsulting at gmail.com>>wrote:
>>
>>                 Indeed reverting back to an older vdsm clears up the
>>                 above issue. However
>>                 now I the issue is see is:
>>                 Thread-18::ERROR::2013-01-23
>>                 15:50:42,885::task::833::TaskManager.Task::(_setError)
>>                 Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected
>>                 error
>>
>>                 Traceback (most recent call last):
>>                    File "/usr/share/vdsm/storage/task.py", line 840,
>>                 in _run
>>                      return fn(*args, **kargs)
>>                    File "/usr/share/vdsm/logUtils.py", line 42, in
>>                 wrapper
>>                      res = f(*args, **kwargs)
>>                    File "/usr/share/vdsm/storage/hsm.py", line 923,
>>                 in connectStoragePool
>>                      masterVersion, options)
>>                    File "/usr/share/vdsm/storage/hsm.py", line 970,
>>                 in _connectStoragePool
>>                      res = pool.connect(hostID, scsiKey, msdUUID,
>>                 masterVersion)
>>                    File "/usr/share/vdsm/storage/sp.py", line 643, in
>>                 connect
>>                      self.__rebuild(msdUUID=msdUUID,
>>                 masterVersion=masterVersion)
>>                    File "/usr/share/vdsm/storage/sp.py", line 1167,
>>                 in __rebuild
>>                      self.masterDomain =
>>                 self.getMasterDomain(msdUUID=msdUUID,
>>                 masterVersion=masterVersion)
>>                    File "/usr/share/vdsm/storage/sp.py", line 1506,
>>                 in getMasterDomain
>>                      raise se.StoragePoolMasterNotFound(self.spUUID,
>>                 msdUUID)
>>                 StoragePoolMasterNotFound: Cannot find master domain:
>>                 'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
>>                 msdUUID=67534cca-1327-462a-b455-a04464084b31'
>>                 Thread-18::DEBUG::2013-01-23
>>                 15:50:42,887::task::852::TaskManager.Task::(_run)
>>                 Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
>>                 08709e68-bcbc-40d8-843a-d69d4df40ac6
>>                 ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
>>                 'f90a0d1c-06ca-11e2-a05b-00151712f280',
>>                 '67534cca-1327-462a-b455-a04464084b31', 433) {}
>>                 failed - stopping task
>>
>>                 This is with vdsm built from
>>                 commit25a2d8572ad32352227c98a86631300fbd6523c1
>>                 - DHC
>>
>>
>>                 On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse <
>>                 deadhorseconsulting at gmail.com
>>                 <mailto:deadhorseconsulting at gmail.com>> wrote:
>>
>>                     VDSM was built from:
>>                     commit 166138e37e75767b32227746bb671b1dab9cdd5e
>>
>>                     Attached is the full vdsm log
>>
>>                     I should also note that from engine perspective
>>                     it sees the master
>>                     storage domain as locked and the others as unknown.
>>
>>
>>                     On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg
>>                     <danken at redhat.com <mailto:danken at redhat.com>>wrote:
>>
>>                         On Tue, Jan 22, 2013 at 04:02:24PM -0600,
>>                         Dead Horse wrote:
>>
>>                             Any ideas on this one? (from VDSM log):
>>                             Thread-25::DEBUG::2013-01-22
>>                             15:35:29,065::BindingXMLRPC::914::vds::(wrapper)
>>                             client
>>
>>                         [3.57.111.30]::call
>>
>>                             getCapabilities with () {}
>>                             Thread-25::ERROR::2013-01-22
>>                             15:35:29,113::netinfo::159::root::(speed)
>>                             cannot read ib0 speed
>>                             Traceback (most recent call last):
>>                                File
>>                             "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py",
>>                             line 155,
>>
>>                         in
>>
>>                             speed
>>                                  s =
>>                             int(file('/sys/class/net/%s/speed' %
>>                             dev).read())
>>                             IOError: [Errno 22] Invalid argument
>>
>>                             Causes VDSM to fail to attach storage
>>
>>
>>                         I doubt that this is the cause of the
>>                         failure, as vdsm has always
>>                         reported "0" for ib devices, and still is.
>>
>>         it happens only when you call to getCapabilities.. so it
>>         doesn't related to the flow, and it can't effect the storage.
>>         Dan: I guess this is not the issue but why is the IOError?
>>
>>
>>                         Does a former version works with your Engine?
>>                         Could you share more of your vdsm.log? I
>>                         suppose the culprit lies in one
>>                         one of the storage-related commands, not in
>>                         statistics retrieval.
>>
>>
>>                             Engine side sees:
>>                             ERROR
>>                             [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
>>                             (QuartzScheduler_Worker-96) [553ef26e]
>>                             The connection with details
>>                             192.168.0.1:/ovirt/ds failed because of
>>                             error code 100 and error
>>
>>                         message
>>
>>                             is: general exception
>>                             2013-01-22 15:35:30,160 INFO
>>                             [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
>>                             (QuartzScheduler_Worker-96) [1ab78378]
>>                             Running command:
>>                             SetNonOperationalVdsCommand internal:
>>                             true. Entities affected :  ID:
>>                             8970b3fe-1faf-11e2-bc1f-00151712f280
>>                             Type: VDS
>>                             2013-01-22 15:35:30,200 INFO
>>                             [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>                             (QuartzScheduler_Worker-96) [1ab78378] START,
>>                             SetVdsStatusVDSCommand(HostName = kezan,
>>                             HostId =
>>                             8970b3fe-1faf-11e2-bc1f-00151712f280,
>>                             status=NonOperational,
>>                             nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE),
>>                             log id: 4af5c4cd
>>                             2013-01-22 15:35:30,211 INFO
>>                             [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
>>                             (QuartzScheduler_Worker-96) [1ab78378]
>>                             FINISH, SetVdsStatusVDSCommand,
>>
>>                         log
>>
>>                             id: 4af5c4cd
>>                             2013-01-22 15:35:30,242 ERROR
>>                             [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
>>                             (QuartzScheduler_Worker-96) [1ab78378]
>>                             Try to add duplicate audit log
>>                             values with the same name. Type:
>>                             VDS_SET_NONOPERATIONAL_DOMAIN. Value:
>>                             storagepoolname
>>
>>                             Engine = latest master
>>                             VDSM = latest master
>>
>>
>>                         Since "latest master" is an unstable
>>                         reference by definition, I'm sure
>>                         that History would thank you if you post the
>>                         exact version (git hash?)
>>                         of the code.
>>
>>                             node = el6
>>
>>
>>
>>
>>
>>
>>
>>
>>             _______________________________________________
>>             Users mailing list
>>             Users at ovirt.org <mailto:Users at ovirt.org>
>>             http://lists.ovirt.org/mailman/listinfo/users
>>
>>
>>
>>         -- 
>>         Yaniv Bronhaim.
>>         RedHat, Israel
>>         09-7692289
>>         054-7744187
>>
>>
>>
>
>
>
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

-- 
---
?? Shu Ming
Open Virtualization Engineerning; CSTL, IBM Corp.
Tel: 86-10-82451626  Tieline: 9051626 E-mail: shuming at cn.ibm.com or shuming at linux.vnet.ibm.com
Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian District, Beijing 100193, PRC

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ovirt.org/pipermail/users/attachments/20130131/92547216/attachment-0001.html>