<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
So I looked at the vdsm logs, and since there were multiple tests
done it was difficult to isolate which error to track down. You
mentioned a test between 14:00-14:30 CET - but the attached gluster
logs end at 11:29 UTC.<br>
<br>
Tracking down the errors from when the master domain (gluster volume
1HP12-R3A1P1) went inactive, for the period where a corresponding
gluster mount log is available, they all seem to point to the same
issue: client-quorum on the gluster volume was not met.<br>
<br>
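As a side note on what to expect here: on a replica 3 / arbiter 1 volume,
client-quorum is normally enforced with cluster.quorum-type=auto, i.e. the
client must see at least 2 of the 3 bricks of a replica set. A quick way to
confirm what is actually configured on 1HP12-R3A1P1 - a minimal sketch,
assuming the gluster CLI is available on the hypervisor and is new enough
(3.7+, which arbiter volumes need anyway) to support "volume get":<br>
<pre>
# Sketch: query the effective quorum settings of volume 1HP12-R3A1P1 via the
# gluster CLI. That the CLI is present on this host is an assumption, not
# something shown in the logs.
import subprocess

VOLUME = "1HP12-R3A1P1"

for option in ("cluster.quorum-type", "cluster.server-quorum-type"):
    out = subprocess.check_output(["gluster", "volume", "get", VOLUME, option])
    print(out.decode().strip())
</pre>
<br>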
Can you confirm whether this matches the test you performed - or provide
logs from the correct time period (both vdsm and gluster mount logs are
required, from the hypervisors where the master domain is mounted)?<br>
<br>
For the master domain:<br>
On 1hp1:<br>
vdsm.log<br>
Thread-35::ERROR::2016-03-31
13:21:27,225::monitor::276::Storage.Monitor::(_monitorDomain) Error
monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313<br>
Traceback (most recent call last):<br>
...<br>
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py",
line 454, in statvfs<br>
resdict = self._sendCommand("statvfs", {"path": path},
self.timeout)<br>
File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py",
line 427, in _sendCommand<br>
raise OSError(errcode, errstr)<br>
OSError: [Errno 107] Transport endpoint is not connected<br>
Thread-35::INFO::2016-03-31
13:21:27,267::monitor::299::Storage.Monitor::(_notifyStatusChanges)
Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID<br>
<br>
-- And I see a corresponding warning in the gluster mount log:<br>
[2016-03-31 11:21:16.027090] W [MSGID: 108001]
[afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0:
Client-quorum is not met<br>
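<br>
The Errno 107 above is consistent with the fuse mount cutting off I/O once it
can no longer reach enough bricks for client-quorum - vdsm's domain monitor
then fails its statvfs probe and flags the domain INVALID, exactly as in the
log. For illustration, a minimal sketch of such a probe; the mount path below
is an assumed example, not taken from the logs:<br>
<pre>
# Sketch: the same kind of statvfs probe the vdsm monitor thread performs,
# done directly against the master domain mount. The path below is an
# assumed example - substitute the real glusterSD mount of the master domain.
import errno
import os

MOUNT = "/rhev/data-center/mnt/glusterSD/1hp1:_1HP12-R3A1P1"   # assumed path

try:
    st = os.statvfs(MOUNT)
    print("statvfs ok: %d free blocks" % st.f_bfree)
except OSError as e:
    if e.errno == errno.ENOTCONN:   # Errno 107, as seen in the 1hp1 vdsm log
        print("mount unusable (client-quorum lost?): %s" % e)
    else:
        raise
</pre>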
<br>
jsonrpc.Executor/0::DEBUG::2016-03-31
13:23:34,110::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest)
Return 'GlusterVolume.status' in bridge with {'volumeStatus':
{'bricks': [{'status': 'OFFLINE', 'hostuuid':
'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'pid': '-1', 'rdma_port':
'N/A', 'brick': '1hp1:/STORAGES/P1/GFS', 'port': 'N/A'}, {'status':
'OFFLINE', 'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f',
'pid': '-1', 'rdma_port': 'N/A', 'brick': '1hp2:/STORAGES/P1/GFS',
'port': 'N/A'}], 'nfs': [{'status': 'OFFLINE', 'hostuuid':
'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'hostname':
'172.16.5.151/24', 'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'},
{'status': 'OFFLINE', 'hostuuid':
'8e87cf18-8958-41b7-8d24-7ee420a1ef9f', 'hostname': '1hp2', 'pid':
'-1', 'rdma_port': 'N/A', 'port': 'N/A'}], 'shd': [{'status':
'ONLINE', 'hostname': '172.16.5.151/24', 'pid': '2148', 'hostuuid':
'f6568a3b-3d65-4f4f-be9f-14a5935e37a4'}, {'status': 'ONLINE',
'hostname': '1hp2', 'pid': '2146', 'hostuuid':
'8e87cf18-8958-41b7-8d24-7ee420a1ef9f'}], 'name': '1HP12-R3A1P1'}}<br>
<br>
-- 2 bricks were offline. Also, the arbiter brick is not reported in
the xml output at all - I think this is a bug.<br>
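<br>
To make the quorum arithmetic explicit, here is a small sketch that applies
the usual 2-of-3 client-quorum rule for a replica 3 / arbiter 1 set to the
volumeStatus reply above (the dict literal is trimmed from that output; the
2-of-3 threshold is the cluster.quorum-type=auto rule, stated here as an
assumption about the volume's configuration):<br>
<pre>
# Sketch: count ONLINE bricks in the GlusterVolume.status reply above and
# compare against the 2-of-3 client-quorum threshold of a replica 3
# (arbiter 1) set. The arbiter brick is missing from the reply (the xml
# parsing bug noted above), so the count here is conservative.
volume_status = {
    "name": "1HP12-R3A1P1",
    "bricks": [
        {"brick": "1hp1:/STORAGES/P1/GFS", "status": "OFFLINE"},
        {"brick": "1hp2:/STORAGES/P1/GFS", "status": "OFFLINE"},
        # arbiter brick not reported - see the bug noted above
    ],
}

online = sum(1 for b in volume_status["bricks"] if b["status"] == "ONLINE")
replica_size = 3                     # replica 3 with arbiter 1
quorum = replica_size // 2 + 1       # = 2 for cluster.quorum-type=auto

print("%d of %d bricks ONLINE, need %d for client-quorum -> %s"
      % (online, replica_size, quorum,
         "met" if online >= quorum else "NOT met"))
</pre>
With both data bricks OFFLINE and the arbiter missing from the reply, quorum
clearly cannot be met, which matches the afr_notify warning in the mount
log.<br>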
<br>
Similarly on 1hp2:<br>
Thread-35::ERROR::2016-03-31
13:21:14,284::monitor::276::Storage.Monitor::(_monitorDomain) Error
monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313<br>
Traceback (most recent call last):<br>
...<br>
raise OSError(errcode, errstr)<br>
OSError: [Errno 2] No such file or directory<br>
Thread-35::INFO::2016-03-31
13:21:14,285::monitor::299::Storage.Monitor::(_notifyStatusChanges)
Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID<br>
<br>
Corresponding gluster mount log:<br>
[2016-03-31 11:21:16.027640] W [MSGID: 108001]
[afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0:
Client-quorum is not met<br>
<br>
<div class="moz-cite-prefix">On 04/05/2016 07:02 PM, <a class="moz-txt-link-abbreviated" href="mailto:paf1@email.cz">paf1@email.cz</a>
wrote:<br>
</div>
<blockquote cite="mid:5703BE74.1050408@email.cz" type="cite">
Hello Sahina, <br>
please see the attached logs which you requested<br>
<br>
regs.<br>
Pavel<br>
<br>
<div class="moz-cite-prefix">On 5.4.2016 14:07, Sahina Bose wrote:<br>
</div>
<blockquote cite="mid:5703AA9D.40303@redhat.com" type="cite">
<br>
<br>
<div class="moz-cite-prefix">On 03/31/2016 06:41 PM, <a
moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:paf1@email.cz">paf1@email.cz</a> wrote:<br>
</div>
<blockquote cite="mid:56FD221F.30707@email.cz" type="cite">
Hi, <br>
rest of logs:<br>
<a moz-do-not-send="true"
href="http://www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W"
style="text-decoration:none;color:#ff9c00;">www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W</a><br>
<br>
The TEST is the last big event in the logs ....<br>
TEST TIME : about 14:00-14:30 CET<br>
</blockquote>
<br>
Thank you Pavel for the interesting test report and sharing the
logs.<br>
<br>
You are right - the master domain should not go down if 2 of 3
bricks are available from volume A (1HP12-R3A1P1).<br>
<br>
I notice that host kvmarbiter was not responsive at 2016-03-31
13:27:19, but the ConnectStorageServerVDSCommand executed on the
kvmarbiter node returned success at 2016-03-31 13:27:26.<br>
<br>
Could you also share the vdsm logs from the 1hp1, 1hp2 and
kvmarbiter nodes during this time?<br>
<br>
Ravi, Krutika - could you take a look at the gluster logs? <br>
<br>
<blockquote cite="mid:56FD221F.30707@email.cz" type="cite"> <br>
regs.<br>
Pavel<br>
<br>
<div class="moz-cite-prefix">On 31.3.2016 14:30, Yaniv Kaul
wrote:<br>
</div>
<blockquote
cite="mid:CAJgorsaOUQ_42GUSPh-H1vGUgJ114JYcUHR8vHwvmcWR+w8Jmw@mail.gmail.com"
type="cite">
<div dir="ltr">Hi Pavel,
<div><br>
</div>
<div>Thanks for the report. Can you begin with a more
accurate description of your environment?</div>
<div>Begin with host, oVirt and Gluster versions. Then
continue with the exact setup (what are 'A', 'B', 'C' -
domains? Volumes? What is the mapping between domains
and volumes?).</div>
<div><br>
</div>
<div>Are there any logs you can share with us?</div>
<div><br>
</div>
<div>I'm sure with more information, we'd be happy to look
at the issue.</div>
<div>Y.</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, Mar 31, 2016 at 3:09 PM,
<a moz-do-not-send="true"
class="moz-txt-link-abbreviated"
href="mailto:paf1@email.cz">paf1@email.cz</a> <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:paf1@email.cz" target="_blank">paf1@email.cz</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000066" bgcolor="#FFFFFF"> Hello, <br>
we tried the following test - with unwanted results<br>
<br>
input:<br>
5-node gluster cluster<br>
A = replica 3 with arbiter 1 ( node1 + node2 + arbiter
on node 5 )<br>
B = replica 3 with arbiter 1 ( node3 + node4 + arbiter
on node 5 )<br>
C = distributed replica 3 with arbiter 1 ( node1+node2,
node3+node4, each arbiter on node 5 )<br>
node 5 carries only arbiter bricks ( 4x )<br>
<br>
TEST:<br>
1) directly reboot one node - OK ( it does not matter
which one - data node or arbiter node )<br>
2) directly reboot two nodes - OK ( if the nodes are
not from the same replica )<br>
3) directly reboot three nodes - yes, this is the
main problem and the source of our questions ....<br>
- rebooted all three nodes of replica "B" (
not very likely, but who knows ... )<br>
- all VMs with data on this replica were paused (
no data access ) - OK<br>
- all VMs running on the replica "B" nodes were lost (
started manually later; their data is on other replicas )
- acceptable<br>
BUT<br>
- !!! all oVirt domains went down !!! - the master
domain is on replica "A", which lost only one member
of three !!!<br>
so we did not expect all domains to go down,
especially the master with 2 live members.<br>
<br>
Results: <br>
- the whole cluster was unreachable until all
domains came back up - which depended on all nodes being up !!!<br>
- all paused VMs started back - OK<br>
- the rest of the VMs were rebooted and are running - OK<br>
<br>
Questions:<br>
1) why did all domains go down if the master domain ( on
replica "A" ) has two running members ( 2 of 3 ) ??<br>
2) how can we recover from that collapse without waiting for all
nodes to come up ? ( in the worst case, e.g. if a node has a HW error
) ??<br>
3) which oVirt cluster policy can prevent
that situation ?? ( if any )<br>
<br>
regs.<br>
Pavel<br>
<br>
<br>
</div>
<br>
_______________________________________________<br>
Users mailing list<br>
<a moz-do-not-send="true"
href="mailto:Users@ovirt.org">Users@ovirt.org</a><br>
<a moz-do-not-send="true"
href="http://lists.ovirt.org/mailman/listinfo/users"
rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman/listinfo/users</a><br>
<br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
Users mailing list
<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:Users@ovirt.org">Users@ovirt.org</a>
<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://lists.ovirt.org/mailman/listinfo/users">http://lists.ovirt.org/mailman/listinfo/users</a>
</pre>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
</body>
</html>