On 09/29/2016 05:18 PM, Sahina Bose wrote:
Yes, this is a GlusterFS problem. Adding gluster users ML
On Thu, Sep 29, 2016 at 5:11 PM, Davide Ferrari <davide@billymob.com> wrote:
Hello
maybe this is more GlusterFS than oVirt related, but since oVirt
integrates Gluster management and I'm experiencing the problem in
an oVirt cluster, I'm writing here.
The problem is simple: I have a data domain mapped on a replica 3
arbiter 1 Gluster volume with 6 bricks, like this:
Status of volume: data_ssd
Gluster process                                           TCP Port  RDMA Port  Online  Pid
--------------------------------------------------------------------------------------------
Brick vm01.storage.billy:/gluster/ssd/data/brick          49153     0          Y       19298
Brick vm02.storage.billy:/gluster/ssd/data/brick          49153     0          Y       6146
Brick vm03.storage.billy:/gluster/ssd/data/arbiter_brick  49153     0          Y       6552
Brick vm03.storage.billy:/gluster/ssd/data/brick          49154     0          Y       6559
Brick vm04.storage.billy:/gluster/ssd/data/brick          49152     0          Y       6077
Brick vm02.storage.billy:/gluster/ssd/data/arbiter_brick  49154     0          Y       6153
Self-heal Daemon on localhost                             N/A       N/A        Y       30746
Self-heal Daemon on vm01.storage.billy                    N/A       N/A        Y       196058
Self-heal Daemon on vm03.storage.billy                    N/A       N/A        Y       23205
Self-heal Daemon on vm04.storage.billy                    N/A       N/A        Y       8246
Now, I've put the vm04 host into maintenance from oVirt, ticking
the "Stop gluster" checkbox, and oVirt didn't complain about
anything. But when I tried to run a new VM it complained about a
"storage I/O problem", while the data storage domain status was always UP.
Looking in the gluster logs I can see this:
[2016-09-29 11:01:01.556908] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2016-09-29 11:02:28.124151] E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-data_ssd-replicate-1: Failing READ on gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed. [Input/output error]
[2016-09-29 11:02:28.126580] W [MSGID: 108008] [afr-read-txn.c:244:afr_read_txn] 0-data_ssd-replicate-1: Unreadable subvolume -1 found with event generation 6 for gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d. (Possible split-brain)
[2016-09-29 11:02:28.127374] E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-data_ssd-replicate-1: Failing FGETXATTR on gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed. [Input/output error]
[2016-09-29 11:02:28.128130] W [MSGID: 108027] [afr-common.c:2403:afr_discover_done] 0-data_ssd-replicate-1: no read subvols for (null)
[2016-09-29 11:02:28.129890] W [fuse-bridge.c:2228:fuse_readv_cbk] 0-glusterfs-fuse: 8201: READ => -1 gfid=bf5922b7-19f3-4ce3-98df-71e981ecca8d fd=0x7f09b749d210 (Input/output error)
[2016-09-29 11:02:28.130824] E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-data_ssd-replicate-1: Failing FSTAT on gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed. [Input/output error]
Does `gluster volume heal data_ssd info split-brain` report that the
file is in split-brain, with vm04 still being down?
If yes, could you provide the extended attributes of this gfid from all
3 bricks:
getfattr -d -m . -e hex
/path/to/brick/bf/59/bf5922b7-19f3-4ce3-98df-71e981ecca8d
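(For concreteness: the gfid-indexed hard link normally lives under the brick's
.glusterfs directory, and going by the brick order of your volume,
data_ssd-replicate-1 should be the vm03 and vm04 data bricks plus the vm02
arbiter brick; please double-check that mapping on your setup. The three
commands would then look roughly like this:

on vm03 and vm04:
  getfattr -d -m . -e hex /gluster/ssd/data/brick/.glusterfs/bf/59/bf5922b7-19f3-4ce3-98df-71e981ecca8d
on vm02:
  getfattr -d -m . -e hex /gluster/ssd/data/arbiter_brick/.glusterfs/bf/59/bf5922b7-19f3-4ce3-98df-71e981ecca8d

The trusted.afr.data_ssd-client-* values in the output are the pending-change
counters that show which brick blames which.)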
If no, then I'm guessing that it is not in an actual split-brain (hence the
'Possible split-brain' message). If the node you brought down contains
the only good copy of the file (i.e. the other data brick and the arbiter are
up, and the arbiter 'blames' that other brick), all I/O is failed with
EIO to prevent the file from getting into an actual split-brain. The heals will
happen when the good node comes back up, and I/O should be allowed again in
that case.
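A quick way to tell the two cases apart from the CLI while vm04 is still down
(standard heal commands, nothing exotic):

  gluster volume heal data_ssd info               <- entries each brick still needs to heal
  gluster volume heal data_ssd info split-brain   <- only entries in an actual split-brain

If the gfid shows up in the first list but not in the second, it is the "only
good copy is down" case described above; it should clear on its own once vm04's
brick is back and the self-heal daemon catches up, or you can kick it with
`gluster volume heal data_ssd`.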
-Ravi
[2016-09-29 11:02:28.133879] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 8202: FSTAT() /ba2bd397-9222-424d-aecc-eb652c0169d9/images/f02ac1ce-52cd-4b81-8b29-f8006d0469e0/ff4e49c6-3084-4234-80a1-18a67615c527 => -1 (Input/output error)
The message "W [MSGID: 108008] [afr-read-txn.c:244:afr_read_txn] 0-data_ssd-replicate-1: Unreadable subvolume -1 found with event generation 6 for gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d. (Possible split-brain)" repeated 11 times between [2016-09-29 11:02:28.126580] and [2016-09-29 11:02:28.517744]
[2016-09-29 11:02:28.518607] E [MSGID: 108008] [afr-read-txn.c:89:afr_read_txn_refresh_done] 0-data_ssd-replicate-1: Failing STAT on gfid bf5922b7-19f3-4ce3-98df-71e981ecca8d: split-brain observed. [Input/output error]
Now, how is it possible to have a split-brain if I stopped just
ONE server, which held just ONE of the six bricks, and it was cleanly
shut down with maintenance mode from oVirt?
I created the volume originally this way:
# gluster volume create data_ssd replica 3 arbiter 1 \
    vm01.storage.billy:/gluster/ssd/data/brick \
    vm02.storage.billy:/gluster/ssd/data/brick \
    vm03.storage.billy:/gluster/ssd/data/arbiter_brick \
    vm03.storage.billy:/gluster/ssd/data/brick \
    vm04.storage.billy:/gluster/ssd/data/brick \
    vm02.storage.billy:/gluster/ssd/data/arbiter_brick
# gluster volume set data_ssd group virt
# gluster volume set data_ssd storage.owner-uid 36 && gluster volume set data_ssd storage.owner-gid 36
# gluster volume start data_ssd
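For what it's worth, `gluster volume info data_ssd` shows both the replica
grouping that this brick order produces (bricks are grouped three at a time)
and the options that `group virt` applied; on releases that have `volume get`
(3.7 and newer, if I remember correctly) the client-quorum setting can also be
read directly:

# gluster volume info data_ssd
# gluster volume get data_ssd cluster.quorum-type

With cluster.quorum-type at "auto" a replica 3 arbiter 1 set stays writable
with one brick down, which would explain why the storage domain stayed UP even
though individual files returned EIO, if Ravi's reading above is right.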
--
Davide Ferrari
Senior Systems Engineer
_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users