Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

25 Jan 2013


      Great work!
The default action for SIGCHLD is ignore, so there's no problems 
reported before a signal handler is installed by zombie reaper.
But I still have one problem: the python multiprocessing.manager code 
is running a new thread and according to the implementation of python's 
signal, only the main thread can receive the signal.
So how is the signal delivered to the server thread?


On Fri 25 Jan 2013 12:30:39 PM CST, Royce Lv wrote:
...
...
Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)
Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(<lambda>) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)
Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
 File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer
   conObj.connect()
 File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect
   return self._mountCon.connect()
 File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect
   fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)
 File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess
   (os.R_OK | os.X_OK))
 File "/usr/share/vdsm/supervdsm.py", line 81, in __call__
   return callMethod()
 File "/usr/share/vdsm/supervdsm.py", line 72, in <lambda>
   **kwargs)
 File "<string>", line 2, in validateAccess
 File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
   raise convert_to_error(kind, result)
...
RemoteError:
The upper part is the trace back from the client side, the following
Hi,
   I reproduced this issue, and I believe it's a python bug.
   1. How to reproduce:
   with the test case attached, put it under /usr/share/vdsm/tests/,
run #./run_tests.sh superVdsmTests.py
   and this issue will be reproduced.
   2.Log analyse:
   We notice a strange pattern in this log: connectStorageServer be
called twice, first supervdsm call succeed, second fails becasue of
validateAccess().
   That is because for the first call validateAccess returns normally
and leave a child there, when the second validateAccess call arrives
and multirprocessing manager is receiving the method message, it is
just the time first child exit and SIGCHLD comming, this signal
interrupted multiprocessing receive system call, python managers.py
should handle INTR and retry recv() like we do in vdsm but it's not,
so the second one raise error.
the vdsm side receive RemoteError because of supervdsm server
multiprocessing manager raise error KIND='TRACEBACK'
part is from server side:
...
---------------------------------------------------------------------------
Traceback (most recent call last):
 File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client
   request = recv()
IOError: [Errno 4] Interrupted system call
---------------------------------------------------------------------------
Corresponding Python source code:managers.py(Server side)
     def serve_client(self, conn):
         '''
         Handle requests from the proxies in a particular process/thread
         '''
         util.debug('starting server thread to service %r',
                    threading.current_thread().name)
         recv = conn.recv
         send = conn.send
         id_to_obj = self.id_to_obj
         while not self.stop:
             try:
                 methodname = obj = None
                 request = recv()<------------------this line been interrupted by SIGCHLD
                 ident, methodname, args, kwds = request
                 obj, exposed, gettypeid = id_to_obj[ident]
                 if methodname not in exposed:
                     raise AttributeError(
                         'method %r of %r object is not in exposed=%r' %
                         (methodname, type(obj), exposed)
                         )
                 function = getattr(obj, methodname)
                 try:
                     res = function(*args, **kwds)
                 except Exception, e:
                     msg = ('#ERROR', e)
                 else:
                     typeid = gettypeid and gettypeid.get(methodname, None)
                     if typeid:
                         rident, rexposed = self.create(conn, typeid, res)
                         token = Token(typeid, self.address, rident)
                         msg = ('#PROXY', (rexposed, token))
                     else:
                         msg = ('#RETURN', res)
             except AttributeError:
                 if methodname is None:
                     msg = ('#TRACEBACK', format_exc())
                 else:
                     try:
                         fallback_func = self.fallback_mapping[methodname]
                         result = fallback_func(
                             self, conn, ident, obj, *args, **kwds
                             )
                         msg = ('#RETURN', result)
                     except Exception:
                         msg = ('#TRACEBACK', format_exc())
except EOFError:
                 util.debug('got EOF -- exiting thread serving %r',
                            threading.current_thread().name)
                 sys.exit(0)
except Exception:<------does not handle IOError,INTR here should retry recv()
                 msg = ('#TRACEBACK', format_exc())
3. Actions we will take:
(1)As a work round we can first remove the zombie reaper from
supervdsm server
(2)I'll see whether python has a fixed version for this
(3)Yaniv is working on changing vdsm/svdsm communication channel to
pipe and handle it ourselves, I believe we'll get rid of this with
that properly handled.
On 01/25/2013 06:00 AM, Dead Horse wrote:
...
Tried some manual edits to SD states in the dbase. The net result was
I was able to get a node active. However as reconstructing the master
storage domain kicked in it was unable to do so. It was also not able
to recognize the other SD with similar failure modes to the
unrecognized master above. Guessing the newer VDSM version borked
things pretty good. So being as this is a test harness and the SD
data is not worth saving I just smoked the all SD. I ran
engine-cleanup and started from fresh and all is well now.
- DHC
On Thu, Jan 24, 2013 at 11:53 AM, Dead Horse
<deadhorseconsulting@gmail.com
<mailto:deadhorseconsulting@gmail.com>> wrote:
This test harness setup here consists of two servers tied to NFS
    storage via IB (NFS mounts are via IPoIB, NFS over RDMA is
    disabled) . All storage domains are NFS. The issue does occur
    with both servers on when attempting to bring them out of
    maintenance mode with the end result being non-operational due to
    storage attach fail.
The current issue is now that with a working older commit the
    master storage domain is "stuck" in state "locked" and I see the
    secondary issue wherein VDSM cannot seem to find or contact the
    master storage domain even though it is there. I am can mount the
    master storage domain manually and and all content appears to be
    accounted for accordingly on either host.
Here is the current contents of the master storage domain metadata:
    CLASS=Data
    DESCRIPTION=orgrimmar
    IOOPTIMEOUTSEC=1
    LEASERETRIES=3
    LEASETIMESEC=5
    LOCKPOLICY=
    LOCKRENEWALINTERVALSEC=5
    MASTER_VERSION=417
    POOL_DESCRIPTION=Azeroth
    POOL_DOMAINS=0549ee91-4498-4130-8c23-4c173b5c0959:Active,d8b55105-c90a-465d-9803-8130da9a671e:Active,67534cca-1327-462a-b455-a04464084b31:Active,c331a800-839d-4d23-9059-870a7471240a:Active,f8984825-ff8d-43d9-91db-0d0959f8bae9:Active,c434056e-96be-4702-8beb-82a408a5c8cb:Active,f7da73c7-b5fe-48b6-93a0-0c773018c94f:Active,82e3b34a-6f89-4299-8cd8-2cc8f973a3b4:Active,e615c975-6b00-469f-8fb6-ff58ae3fdb2c:Active,5bc86532-55f7-4a91-a52c-fad261f322d5:Active,1130b87a-3b34-45d6-8016-d435825c68ef:Active
    POOL_SPM_ID=1
    POOL_SPM_LVER=6
    POOL_UUID=f90a0d1c-06ca-11e2-a05b-00151712f280
    REMOTE_PATH=192.168.0.1:/ovirt/orgrimmar
    ROLE=Master
    SDUUID=67534cca-1327-462a-b455-a04464084b31
    TYPE=NFS
    VERSION=3
    _SHA_CKSUM=1442bb078fd8c9468d241ff141e9bf53839f0721
So now with the older working commit I now get this the
    "StoragePoolMasterNotFound: Cannot find master domain" error
    (prior details above when I worked backwards to that commit)
This is odd as the nodes can definitely reach the master storage
    domain:
showmount from one of the el6.3 nodes:
    [root@kezan ~]# showmount -e 192.168.0.1
    Export list for 192.168.0.1 <http://192.168.0.1>:
    /ovirt/orgrimmar 192.168.0.0/16 <http://192.168.0.0/16>
mount/ls from one of the nodes:
    [root@kezan ~]# mount 192.168.0.1:/ovirt/orgrimmar /mnt
    [root@kezan ~]# ls -al
    /mnt/67534cca-1327-462a-b455-a04464084b31/dom_md/
    total 1100
    drwxr-xr-x 2 vdsm kvm    4096 Jan 24 11:44 .
    drwxr-xr-x 5 vdsm kvm    4096 Oct 19 16:16 ..
    -rw-rw---- 1 vdsm kvm 1048576 Jan 19 22:09 ids
    -rw-rw---- 1 vdsm kvm       0 Sep 25 00:46 inbox
    -rw-rw---- 1 vdsm kvm 2097152 Jan 10 13:33 leases
    -rw-r--r-- 1 vdsm kvm     903 Jan 10 13:39 metadata
    -rw-rw---- 1 vdsm kvm       0 Sep 25 00:46 outbox
- DHC
On Thu, Jan 24, 2013 at 7:51 AM, ybronhei <ybronhei@redhat.com
    <mailto:ybronhei@redhat.com>> wrote:
On 01/24/2013 12:44 AM, Dead Horse wrote:
I narrowed down on the commit where the originally
            reported issue crept in:
            commitfc3a44f71d2ef202cff18d7203b9e4165b546621building
            and testing with
this commit or subsequent commits yields the original issue.
Interesting.. it might be related to this commit and we're
        trying to reproduce it.
Did you try to remove that code and run again? does it work
        without the additional of zombieReaper?
        does the connectivity to the storage work well? when you run
        'ls' on the mounted folder you get see the files without a
        long delay ? it might related to too long timeout when
        validating access to this mount..
        we work on that.. any additional info can help
Thanks.
- DHC
On Wed, Jan 23, 2013 at 3:56 PM, Dead Horse
            <deadhorseconsulting@gmail.com
            <mailto:deadhorseconsulting@gmail.com>>wrote:
Indeed reverting back to an older vdsm clears up the
                above issue. However
                now I the issue is see is:
                Thread-18::ERROR::2013-01-23
                15:50:42,885::task::833::TaskManager.Task::(_setError)
                Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Unexpected
                error
Traceback (most recent call last):
                   File "/usr/share/vdsm/storage/task.py", line 840,
                in _run
                     return fn(*args, **kargs)
                   File "/usr/share/vdsm/logUtils.py", line 42, in
                wrapper
                     res = f(*args, **kwargs)
                   File "/usr/share/vdsm/storage/hsm.py", line 923,
                in connectStoragePool
                     masterVersion, options)
                   File "/usr/share/vdsm/storage/hsm.py", line 970,
                in _connectStoragePool
                     res = pool.connect(hostID, scsiKey, msdUUID,
                masterVersion)
                   File "/usr/share/vdsm/storage/sp.py", line 643, in
                connect
                     self.__rebuild(msdUUID=msdUUID,
                masterVersion=masterVersion)
                   File "/usr/share/vdsm/storage/sp.py", line 1167,
                in __rebuild
                     self.masterDomain =
                self.getMasterDomain(msdUUID=msdUUID,
                masterVersion=masterVersion)
                   File "/usr/share/vdsm/storage/sp.py", line 1506,
                in getMasterDomain
                     raise se.StoragePoolMasterNotFound(self.spUUID,
                msdUUID)
                StoragePoolMasterNotFound: Cannot find master domain:
                'spUUID=f90a0d1c-06ca-11e2-a05b-00151712f280,
                msdUUID=67534cca-1327-462a-b455-a04464084b31'
                Thread-18::DEBUG::2013-01-23
                15:50:42,887::task::852::TaskManager.Task::(_run)
                Task=`08709e68-bcbc-40d8-843a-d69d4df40ac6`::Task._run:
                08709e68-bcbc-40d8-843a-d69d4df40ac6
                ('f90a0d1c-06ca-11e2-a05b-00151712f280', 2,
                'f90a0d1c-06ca-11e2-a05b-00151712f280',
                '67534cca-1327-462a-b455-a04464084b31', 433) {}
                failed - stopping task
This is with vdsm built from
                commit25a2d8572ad32352227c98a86631300fbd6523c1
                - DHC
On Wed, Jan 23, 2013 at 10:44 AM, Dead Horse <
                deadhorseconsulting@gmail.com
                <mailto:deadhorseconsulting@gmail.com>> wrote:
VDSM was built from:
                    commit 166138e37e75767b32227746bb671b1dab9cdd5e
Attached is the full vdsm log
I should also note that from engine perspective
                    it sees the master
                    storage domain as locked and the others as unknown.
On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg
                    <danken@redhat.com <mailto:danken@redhat.com>>wrote:
On Tue, Jan 22, 2013 at 04:02:24PM -0600,
                        Dead Horse wrote:
Any ideas on this one? (from VDSM log):
                            Thread-25::DEBUG::2013-01-22
                            15:35:29,065::BindingXMLRPC::914::vds::(wrapper)
                            client
[3.57.111.30]::call
getCapabilities with () {}
                            Thread-25::ERROR::2013-01-22
                            15:35:29,113::netinfo::159::root::(speed)
                            cannot read ib0 speed
                            Traceback (most recent call last):
                               File
                            "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py",
                            line 155,
in
speed
                                 s =
                            int(file('/sys/class/net/%s/speed' %
                            dev).read())
                            IOError: [Errno 22] Invalid argument
Causes VDSM to fail to attach storage
I doubt that this is the cause of the
                        failure, as vdsm has always
                        reported "0" for ib devices, and still is.
it happens only when you call to getCapabilities.. so it
        doesn't related to the flow, and it can't effect the storage.
        Dan: I guess this is not the issue but why is the IOError?
Does a former version works with your Engine?
                        Could you share more of your vdsm.log? I
                        suppose the culprit lies in one
                        one of the storage-related commands, not in
                        statistics retrieval.
Engine side sees:
                            ERROR
                            [org.ovirt.engine.core.bll.storage.NFSStorageHelper]
                            (QuartzScheduler_Worker-96) [553ef26e]
                            The connection with details
                            192.168.0.1:/ovirt/ds failed because of
                            error code 100 and error
message
is: general exception
                            2013-01-22 15:35:30,160 INFO
                            [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
                            (QuartzScheduler_Worker-96) [1ab78378]
                            Running command:
                            SetNonOperationalVdsCommand internal:
                            true. Entities affected :  ID:
                            8970b3fe-1faf-11e2-bc1f-00151712f280
                            Type: VDS
                            2013-01-22 15:35:30,200 INFO
                            [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
                            (QuartzScheduler_Worker-96) [1ab78378] START,
                            SetVdsStatusVDSCommand(HostName = kezan,
                            HostId =
                            8970b3fe-1faf-11e2-bc1f-00151712f280,
                            status=NonOperational,
                            nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE),
                            log id: 4af5c4cd
                            2013-01-22 15:35:30,211 INFO
                            [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
                            (QuartzScheduler_Worker-96) [1ab78378]
                            FINISH, SetVdsStatusVDSCommand,
log
id: 4af5c4cd
                            2013-01-22 15:35:30,242 ERROR
                            [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
                            (QuartzScheduler_Worker-96) [1ab78378]
                            Try to add duplicate audit log
                            values with the same name. Type:
                            VDS_SET_NONOPERATIONAL_DOMAIN. Value:
                            storagepoolname
Engine = latest master
                            VDSM = latest master
Since "latest master" is an unstable
                        reference by definition, I'm sure
                        that History would thank you if you post the
                        exact version (git hash?)
                        of the code.
node = el6
_______________________________________________
            Users mailing list
            Users@ovirt.org <mailto:Users@ovirt.org>
            http://lists.ovirt.org/mailman/listinfo/users
--
        Yaniv Bronhaim.
        RedHat, Israel
        09-7692289
        054-7744187
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

Re: [Users] latest vdsm cannot read ib device speeds causing storage attach fail

Mark Wu