<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Hi,<br>

    &nbsp;&nbsp; I reproduced this issue, and I believe it is a Python bug.<br>

    &nbsp;&nbsp; 1. How to reproduce:<br>

    &nbsp;&nbsp; Put the attached test case under /usr/share/vdsm/tests/ and
    run # ./run_tests.sh superVdsmTests.py<br>

    &nbsp;&nbsp; The issue is then reproduced.<br>

    &nbsp;&nbsp; 2. Log analysis:<br>

    &nbsp;&nbsp; There is a strange pattern in this log: connectStorageServer is
    called twice. The first supervdsm call succeeds; the second fails in
    validateAccess().<br>

    &nbsp;&nbsp; This happens because the first validateAccess call returns
    normally but leaves a child process behind. When the second
    validateAccess call arrives and the multiprocessing manager is receiving
    the method message, the first child exits at just that moment and
    SIGCHLD is delivered. The signal interrupts the manager's receive system
    call. Python's managers.py should handle EINTR and retry recv(), as we
    do in vdsm, but it does not, so the second call raises an error.<br>

    <pre wrap="">&gt;Thread-18::DEBUG::2013-01-22 10:41:03,570::misc::85::Storage.Misc.excCmd::(&lt;lambda&gt;) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/silvermoon /rhev/data-center/mnt/192.168.0.1:_ovirt_silvermoon' (cwd None)

&gt;Thread-18::DEBUG::2013-01-22 10:41:03,607::misc::85::Storage.Misc.excCmd::(&lt;lambda&gt;) '/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 192.168.0.1:/ovirt/undercity /rhev/data-center/mnt/192.168.0.1:_ovirt_undercity' (cwd None)

&gt;Thread-18::ERROR::2013-01-22 10:41:03,627::hsm::2215::Storage.HSM::(connectStorageServer) Could not connect to storageServer

&gt;Traceback (most recent call last):

&gt;  File "/usr/share/vdsm/storage/hsm.py", line 2211, in connectStorageServer

&gt;    conObj.connect()

&gt;  File "/usr/share/vdsm/storage/storageServer.py", line 303, in connect

&gt;    return self._mountCon.connect()

&gt;  File "/usr/share/vdsm/storage/storageServer.py", line 209, in connect

&gt;    fileSD.validateDirAccess(self.getMountObj().getRecord().fs_file)

&gt;  File "/usr/share/vdsm/storage/fileSD.py", line 55, in validateDirAccess

&gt;    (os.R_OK | os.X_OK))

&gt;  File "/usr/share/vdsm/supervdsm.py", line 81, in __call__

&gt;    return callMethod()

&gt;  File "/usr/share/vdsm/supervdsm.py", line 72, in &lt;lambda&gt;

&gt;    **kwargs)

&gt;  File "&lt;string&gt;", line 2, in validateAccess

&gt;  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod

&gt;    raise convert_to_error(kind, result) 

</pre>

    The vdsm side receives a RemoteError because the multiprocessing
    manager in the supervdsm server replies with an error of kind
    '#TRACEBACK':<br>

    <pre>&nbsp;&gt;RemoteError: 

</pre>

    The part above is the traceback from the client side; the part below
    is from the server side:<br>

    <pre>&gt;---------------------------------------------------------------------------

&gt;Traceback (most recent call last):

&gt;  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 214, in serve_client

&gt;    request = recv()

&gt;IOError: [Errno 4] Interrupted system call

&gt;---------------------------------------------------------------------------

</pre>

    Corresponding Python source code, managers.py (server side):<br>

    <pre>&nbsp;&nbsp;&nbsp; def serve_client(self, conn):</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Handle requests from the proxies in a particular process/thread</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; '''</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; util.debug('starting server thread to service %r',</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; threading.current_thread().name)</pre>

    <pre>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; recv = conn.recv</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; send = conn.send</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; id_to_obj = self.id_to_obj</pre>

    <pre>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; while not self.stop:</pre>

    <pre>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; try:</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; methodname = obj = None</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; request = recv()&lt;------------------<font color="#3366ff">this line is interrupted by SIGCHLD</font></pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ident, methodname, args, kwds = request</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; obj, exposed, gettypeid = id_to_obj[ident]</pre>

    <pre>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if methodname not in exposed:</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; raise AttributeError(</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'method %r of %r object is not in exposed=%r' %</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (methodname, type(obj), exposed)</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )</pre>

    <pre>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; function = getattr(obj, methodname)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; try:</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; res = function(*args, **kwds)</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; except Exception, e:</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg = ('#ERROR', e)</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else:</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; typeid = gettypeid and gettypeid.get(methodname, None)</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if typeid:</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; rident, rexposed = self.create(conn, typeid, res)</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; token = Token(typeid, self.address, rident)</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg = ('#PROXY', (rexposed, token))</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else:</pre>

    <pre>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg = ('#RETURN', res)

            except AttributeError:

                if methodname is None:

                    msg = ('#TRACEBACK', format_exc())

                else:

                    try:

                        fallback_func = self.fallback_mapping[methodname]

                        result = fallback_func(

                            self, conn, ident, obj, *args, **kwds

                            )

                        msg = ('#RETURN', result)

                    except Exception:

                        msg = ('#TRACEBACK', format_exc())

            except EOFError:

                util.debug('got EOF -- exiting thread serving %r',

                           threading.current_thread().name)

                sys.exit(0)

            except Exception:&lt;------<font color="#3366ff">IOError with EINTR is not handled specially; recv() should be retried here</font>

                msg = ('#TRACEBACK', format_exc())

</pre>
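    <br>
    The retry asked for in the annotation above could look roughly like the
    sketch below. It assumes a helper named retry_on_eintr, which is a
    hypothetical name and not part of managers.py or vdsm, and it only
    illustrates the idea; it is not a tested patch for the multiprocessing
    module:<br>

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func(*args, **kwargs), repeating the call while it fails with EINTR.

    retry_on_eintr is a hypothetical helper used for illustration only.
    Any error other than EINTR is re-raised unchanged.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError) as e:
            # Errno 4 (EINTR) means a signal such as SIGCHLD interrupted the
            # system call before it completed; simply issue the call again.
            if e.errno != errno.EINTR:
                raise
```

    In serve_client this would roughly mean calling retry_on_eintr(recv)
    where recv() is called now. Whether a plain retry is safe when the
    signal arrives part-way through a message is not examined here.<br>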

    <br>

    <br>

    3. Actions we will take:<br>

    (1) As a workaround, we can first remove the zombie reaper from the
    supervdsm server.<br>

    (2) I'll check whether Python has a version in which this is fixed.<br>

    (3) Yaniv is working on changing the vdsm/svdsm communication channel to
    a pipe that we handle ourselves. I believe we will get rid of this
    issue once that is handled properly.<br>

    <div class="moz-forward-container"><br>

      <br>

      -------- Original Message --------

      <table class="moz-email-headers-table" border="0" cellpadding="0"

        cellspacing="0">

        <tbody>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">Subject:

            </th>

            <td>Re: [Users] latest vdsm cannot read ib device speeds

              causing storage attach fail</td>

          </tr>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">Resent-Date:

            </th>

            <td>Thu, 24 Jan 2013 12:24:10 +0200</td>

          </tr>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">Resent-From:

            </th>

            <td>Dan Kenigsberg <a class="moz-txt-link-rfc2396E" href="mailto:danken@redhat.com">&lt;danken@redhat.com&gt;</a></td>

          </tr>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">Resent-To:

            </th>

            <td>Royce Lv <a class="moz-txt-link-rfc2396E" href="mailto:lvroyce@linux.vnet.ibm.com">&lt;lvroyce@linux.vnet.ibm.com&gt;</a></td>

          </tr>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">Date: </th>

            <td>Wed, 23 Jan 2013 10:44:57 -0600</td>

          </tr>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">From: </th>

            <td>Dead Horse <a class="moz-txt-link-rfc2396E" href="mailto:deadhorseconsulting@gmail.com">&lt;deadhorseconsulting@gmail.com&gt;</a></td>

          </tr>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">To: </th>

            <td>Dan Kenigsberg <a class="moz-txt-link-rfc2396E" href="mailto:danken@redhat.com">&lt;danken@redhat.com&gt;</a></td>

          </tr>

          <tr>

            <th align="RIGHT" nowrap="nowrap" valign="BASELINE">CC: </th>

            <td><a class="moz-txt-link-rfc2396E" href="mailto:users@ovirt.org">&lt;users@ovirt.org&gt;</a> <a class="moz-txt-link-rfc2396E" href="mailto:users@ovirt.org">&lt;users@ovirt.org&gt;</a></td>

          </tr>

        </tbody>

      </table>

      <br>

      <br>

      <pre>VDSM was built from:

commit 166138e37e75767b32227746bb671b1dab9cdd5e

Attached is the full vdsm log

I should also note that from engine perspective it sees the master storage

domain as locked and the others as unknown.

On Wed, Jan 23, 2013 at 2:49 AM, Dan Kenigsberg <a class="moz-txt-link-rfc2396E" href="mailto:danken@redhat.com">&lt;danken@redhat.com&gt;</a> wrote:

&gt; On Tue, Jan 22, 2013 at 04:02:24PM -0600, Dead Horse wrote:

&gt; &gt; Any ideas on this one? (from VDSM log):

&gt; &gt; Thread-25::DEBUG::2013-01-22

&gt; &gt; 15:35:29,065::BindingXMLRPC::914::vds::(wrapper) client

&gt; [3.57.111.30]::call

&gt; &gt; getCapabilities with () {}

&gt; &gt; Thread-25::ERROR::2013-01-22 15:35:29,113::netinfo::159::root::(speed)

&gt; &gt; cannot read ib0 speed

&gt; &gt; Traceback (most recent call last):

&gt; &gt;   File "/usr/lib64/python2.6/site-packages/vdsm/netinfo.py", line 155, in

&gt; &gt; speed

&gt; &gt;     s = int(file('/sys/class/net/%s/speed' % dev).read())

&gt; &gt; IOError: [Errno 22] Invalid argument

&gt; &gt;

&gt; &gt; Causes VDSM to fail to attach storage

&gt;

&gt; I doubt that this is the cause of the failure, as vdsm has always

&gt; reported "0" for ib devices, and still is.

&gt;

&gt; Does a former version works with your Engine?

&gt; Could you share more of your vdsm.log? I suppose the culprit lies in one

&gt; one of the storage-related commands, not in statistics retrieval.

&gt;

&gt; &gt;

&gt; &gt; Engine side sees:

&gt; &gt; ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper]

&gt; &gt; (QuartzScheduler_Worker-96) [553ef26e] The connection with details

&gt; &gt; 192.168.0.1:/ovirt/ds failed because of error code 100 and error message

&gt; &gt; is: general exception

&gt; &gt; 2013-01-22 15:35:30,160 INFO

&gt; &gt; [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]

&gt; &gt; (QuartzScheduler_Worker-96) [1ab78378] Running command:

&gt; &gt; SetNonOperationalVdsCommand internal: true. Entities affected :  ID:

&gt; &gt; 8970b3fe-1faf-11e2-bc1f-00151712f280 Type: VDS

&gt; &gt; 2013-01-22 15:35:30,200 INFO

&gt; &gt; [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]

&gt; &gt; (QuartzScheduler_Worker-96) [1ab78378] START,

&gt; &gt; SetVdsStatusVDSCommand(HostName = kezan, HostId =

&gt; &gt; 8970b3fe-1faf-11e2-bc1f-00151712f280, status=NonOperational,

&gt; &gt; nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: 4af5c4cd

&gt; &gt; 2013-01-22 15:35:30,211 INFO

&gt; &gt; [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]

&gt; &gt; (QuartzScheduler_Worker-96) [1ab78378] FINISH, SetVdsStatusVDSCommand,

&gt; log

&gt; &gt; id: 4af5c4cd

&gt; &gt; 2013-01-22 15:35:30,242 ERROR

&gt; &gt; [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]

&gt; &gt; (QuartzScheduler_Worker-96) [1ab78378] Try to add duplicate audit log

&gt; &gt; values with the same name. Type: VDS_SET_NONOPERATIONAL_DOMAIN. Value:

&gt; &gt; storagepoolname

&gt; &gt;

&gt; &gt; Engine = latest master

&gt; &gt; VDSM = latest master

&gt;

&gt; Since "latest master" is an unstable reference by definition, I'm sure

&gt; that History would thank you if you post the exact version (git hash?)

&gt; of the code.

&gt;

&gt; &gt; node = el6

&gt;

&gt;

</pre>

      <br>

      <br>

    </div>

    <br>

  </body>

</html>