Re: [ovirt-users] sanlock + gluster recovery -- RFE

23 May 2014

Vijay, I am not a member of the developer list, so my comments are at end.

On 5/23/2014 6:55 AM, Vijay Bellur wrote:
...
On 05/21/2014 10:22 PM, Federico Simoncelli wrote:
...
----- Original Message -----
...
From: "Giuseppe Ragusa" <giuseppe.ragusa@hotmail.com>
To: fsimonce@redhat.com
Cc: users@ovirt.org
Sent: Wednesday, May 21, 2014 5:15:30 PM
Subject: sanlock + gluster recovery -- RFE
Hi,
...
----- Original Message -----
...
From: "Ted Miller" <tmiller at hcjb.org>
To: "users" <users at ovirt.org>
Sent: Tuesday, May 20, 2014 11:31:42 PM
Subject: [ovirt-users] sanlock + gluster recovery -- RFE
As you are aware, there is an ongoing split-brain problem with running
sanlock on replicated gluster storage. Personally, I believe that this is
the 5th time that I have been bitten by this sanlock+gluster problem.
I believe that the following are true (if not, my entire request is
probably off base).
* ovirt uses sanlock in such a way that when the sanlock storage is on a
     replicated gluster file system, very small storage disruptions can
     result in a gluster split-brain on the sanlock space
Although this is possible (at the moment) we are working hard to avoid it.
The hardest part here is to ensure that the gluster volume is properly
configured.
The suggested configuration for a volume to be used with ovirt is:
Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
The two options ping-timeout and quorum-type are really important.
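A minimal sketch of how one might check those two options, assuming the text comes from "gluster volume info" in the layout shown above. The helper names (parse_reconfigured_options, missing_or_wrong) and the SAMPLE text are made up for illustration; on a real volume the options themselves are set with "gluster volume set <VOLNAME> <option> <value>".

```python
# Illustrative only: function names and SAMPLE are assumptions, not part of
# any gluster or ovirt tooling.
REQUIRED_OPTIONS = {
    "network.ping-timeout": "10",
    "cluster.quorum-type": "auto",
}


def parse_reconfigured_options(info_text):
    """Return {option: value} for the lines after 'Options Reconfigured:'."""
    options = {}
    in_options = False
    for raw_line in info_text.splitlines():
        line = raw_line.strip()
        if line == "Options Reconfigured:":
            in_options = True
            continue
        if in_options and ": " in line:
            key, value = line.split(": ", 1)
            options[key.strip()] = value.strip()
    return options


def missing_or_wrong(info_text, required=REQUIRED_OPTIONS):
    """Return {option: (expected, found)} for every option not set as required."""
    found = parse_reconfigured_options(info_text)
    return {
        key: (expected, found.get(key))
        for key, expected in required.items()
        if found.get(key) != expected
    }


SAMPLE = """\
Type: Replicate
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
"""

print(missing_or_wrong(SAMPLE))
```

An empty result means both options are present with the suggested values; anything else lists what was expected next to what was found.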
You would also need a build where this bug is fixed in order to avoid any
chance of a split-brain:
https://bugzilla.redhat.com/show_bug.cgi?id=1066996
It seems that the aforementioned bug is peculiar to 3-brick setups.
I understand that a 3-brick setup can allow proper quorum formation without
resorting to the "first-configured-brick-has-more-weight" convention used with
only 2 bricks and quorum "auto" (which makes one node "special", so not
properly any-single-fault tolerant).
Correct.
...
But, since we are on ovirt-users, is there a similar suggested configuration
for a 2-hosts setup oVirt+GlusterFS with oVirt-side power management
properly configured and tested-working?
I mean a configuration where "any" host can go south and oVirt (through the
other one) fences it (forcibly powering it off with confirmation from IPMI
or similar) then restarts HA-marked vms that were running there, all the
while keeping the underlying GlusterFS-based storage domains responsive and
readable/writeable (maybe apart from a lapse between detected other-node
unresponsiveness and confirmed fencing)?
We already had a discussion with gluster asking if it was possible to
add fencing to the replica 2 quorum/consistency mechanism.
The idea is that as soon as you can't replicate a write you have to
freeze all IO until either the connection is re-established or you
know that the other host has been killed.
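The freeze-until-reconnected-or-fenced idea can be sketched as a small state machine. This is a toy model under stated assumptions, not how GlusterFS or oVirt implement anything: the class and method names (ReplicaTwoWriter, IOFrozen, fencing_confirmed and so on) are hypothetical, and healing the returning peer is assumed to happen before it rejoins.

```python
from enum import Enum


class PeerState(Enum):
    CONNECTED = "connected"
    UNREACHABLE = "unreachable"  # a write could not be replicated; peer fate unknown
    FENCED = "fenced"            # peer confirmed powered off


class IOFrozen(Exception):
    """Raised when a write is refused because the peer's fate is unknown."""


class ReplicaTwoWriter:
    """Toy model of one side of a replica 2 pair (all names hypothetical)."""

    def __init__(self):
        self.peer = PeerState.CONNECTED
        self.local_writes = []      # writes applied on this side
        self.pending_for_peer = []  # writes the peer missed while fenced

    def replication_failed(self):
        # Only a connected peer can become "unreachable"; a fenced peer stays fenced.
        if self.peer is PeerState.CONNECTED:
            self.peer = PeerState.UNREACHABLE

    def fencing_confirmed(self):
        self.peer = PeerState.FENCED

    def peer_reconnected(self):
        # Assumption: the peer is healed from pending_for_peer before rejoining.
        self.peer = PeerState.CONNECTED
        self.pending_for_peer.clear()

    def write(self, data):
        if self.peer is PeerState.UNREACHABLE:
            raise IOFrozen("peer unreachable and not yet fenced")
        self.local_writes.append(data)
        if self.peer is PeerState.FENCED:
            self.pending_for_peer.append(data)
```

In this model only one side can ever accept writes while the two are apart, because the surviving side resumes IO only after it knows the other host has been killed.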
Adding Vijay.
There is a related thread on gluster-devel [1] to have a better behavior in 
GlusterFS for prevention of split brains with sanlock and 2-way replicated 
gluster volumes.
Please feel free to comment on the proposal there.
Thanks,
Vijay
[1] http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html
One quick note before my main comment: I see references to quorum being "N/2
+ 1".  Isn't it more accurate to say that quorum is "(N + 1)/2" or "N/2 + 0.5"?
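
For comparison, here is the arithmetic worked out, assuming "quorum" means a strict majority of N bricks and that "N/2" is read as integer division. The function name majority_quorum is only for illustration; this does not model the 2-brick "auto" rule discussed earlier in the thread.

```python
def majority_quorum(n):
    """Smallest whole number of bricks that is strictly more than half of n."""
    return n // 2 + 1


# Columns: N, integer N/2 + 1, and (N + 1)/2 as a plain fraction.
for n in range(1, 7):
    print(n, majority_quorum(n), (n + 1) / 2)
```

For odd N the two expressions agree (N = 3 gives 2 either way). For even N, (N + 1)/2 is not a whole number (N = 4 gives 2.5), and rounding it up gives the same 3 that integer "N/2 + 1" produces.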

Now to my main comment.

I see a case that is not being addressed.  I have no proof of how often this
use-case occurs, but I believe that it does occur.  (In theory, it could
occur in any situation where multiple bricks are writing to different parts
of the same file.)

Use-case: sanlock via fuse client.

Steps that originally produced it

    (not tested for reproducibility, because I was unable to recover the
    ovirt cluster after the occurrence and had to rebuild from scratch);
    the time frame was late 2013 or early 2014

    2 node ovirt cluster using replicated gluster storage
    ovirt cluster up and running VMs
    remove power from network switch
    restore power to network switch after a few minutes

Result

    both copies of .../dom_md/ids file accused the other of being out of sync

Hypothesis of cause

    servers (ovirt nodes and gluster bricks) are called A and B
    At the moment when network communication was lost, or just a moment after
    communication was lost

        A had written to local ids file
        A had started process to send write to B
        A had not received write confirmation from B
        and
        B had written to local ids file
        B had started process to send write to A
        B had not received write confirmation from A

    Thus, each file had a segment that had been written to the local file,
    but had not been confirmed written on the remote file.  Each file
    correctly accused the other file of being out-of-sync.  I did read and
    decipher the xattr data, and this was indeed the case, each file accused
    the other.
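
The hypothesis above can be expressed as a small model of "who accuses whom". This is a simplified sketch, not Gluster's actual logic or xattr format: each brick is reduced to the set of other bricks it holds unconfirmed writes for, and the function name classify is hypothetical.

```python
def classify(accusations):
    """Classify a replicated file from its accusation sets.

    accusations maps each brick name to the set of bricks it accuses,
    i.e. bricks for which it recorded a write that was never confirmed.
    Returns (state, clean_bricks), where clean_bricks are accused by no one.
    """
    bricks = set(accusations)
    accused = set()
    for targets in accusations.values():
        accused |= targets
    clean = sorted(bricks - accused)
    if not accused:
        return ("in-sync", clean)
    if clean:
        return ("heal", clean)      # a clean copy can serve as the heal source
    return ("split-brain", [])      # every copy is accused by some other copy


# The situation described above: A and B each wrote locally without confirmation.
print(classify({"A": {"B"}, "B": {"A"}}))
# Only A wrote while B was unreachable: A is the obvious source.
print(classify({"A": {"B"}, "B": set()}))
```

With mutual accusations there is no copy left that nobody accuses, which matches what I found in the xattr data.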

Possible solutions

    Thinking about it on a systems level, the only solution I can see is to
    route all writes through one gluster brick.  That way all the accusations
    flow from that brick to other bricks, and gluster will find the one file
    with no one accusing it, and can sync from that file to others.
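
A rough sketch of why a single write path helps, under the simplifying assumption that during a partition each writer reaches only its own local brick. The function names (accusations_after_partition, heal_sources) are illustrative and do not correspond to anything in Gluster.

```python
def accusations_after_partition(writers, bricks=("A", "B")):
    """Model a partition in which each writer reaches only its local brick.

    A write that lands on one brick but not the others makes that brick
    accuse every other brick. Returns {brick: set of bricks it accuses}.
    """
    accusations = {brick: set() for brick in bricks}
    for writer in writers:
        accusations[writer] |= {b for b in bricks if b != writer}
    return accusations


def heal_sources(accusations):
    """Bricks that no other brick accuses; an empty list means no clean copy."""
    accused = set().union(*accusations.values())
    return sorted(set(accusations) - accused)


# Both nodes write through their own brick: no clean copy remains.
print(heal_sources(accusations_after_partition(["A", "B"])))
# All writes are routed through brick A: A stays unaccused.
print(heal_sources(accusations_after_partition(["A"])))
```

The single-writer case always leaves at least the writing brick unaccused, which is the property gluster needs in order to pick a copy to sync from.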

    Within a gluster environment, the only way I know to do this currently is
    to use an nfs mount, forcing all data through that machine, BUT also
    making that machine a single point of failure.  That assumes that you do
    not do as I did (which caused a split-brain) by mounting the nfs volume
    using localhost:/engVM1, which put me back in the multiple-writer
    situation.

    In previous googling, I have seen a proposal to alter/replace the current
    replication translator so that it would do something similar, routing all
    writes through one node, but still allowing local reads, and allowing the
    chosen node to float dynamically among the available bricks.  I looked
    for it again, but have been unable to find that mailing list entry. :(

Ted Miller
Elkhart, IN, USA
