Re: [ovirt-users] sanlock + gluster recovery -- RFE

2 Jun 2014


I am sorry, this escaped my attention over the last few days.

On 05/23/2014 08:50 PM, Ted Miller wrote:
...
Vijay, I am not a member of the developer list, so my comments are at end.
On 5/23/2014 6:55 AM, Vijay Bellur wrote:
...
On 05/21/2014 10:22 PM, Federico Simoncelli wrote:
...
----- Original Message -----
...
From: "Giuseppe Ragusa" <giuseppe.ragusa@hotmail.com>
To: fsimonce@redhat.com
Cc: users@ovirt.org
Sent: Wednesday, May 21, 2014 5:15:30 PM
Subject: sanlock + gluster recovery -- RFE
Hi,
...
----- Original Message -----
...
From: "Ted Miller" <tmiller at hcjb.org>
To: "users" <users at ovirt.org>
Sent: Tuesday, May 20, 2014 11:31:42 PM
Subject: [ovirt-users] sanlock + gluster recovery -- RFE
As you are aware, there is an ongoing split-brain problem with running
sanlock on replicated gluster storage. Personally, I believe that this is
the 5th time that I have been bitten by this sanlock+gluster problem.
I believe that the following are true (if not, my entire request is
probably off base).
* ovirt uses sanlock in such a way that when the sanlock storage is on a
     replicated gluster file system, very small storage disruptions can
     result in a gluster split-brain on the sanlock space
Although this is possible (at the moment) we are working hard to
avoid it.
The hardest part here is to ensure that the gluster volume is properly
configured.
The suggested configuration for a volume to be used with ovirt is:
Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
The two options ping-timeout and quorum-type are really important.
You would also need a build where this bug is fixed in order to avoid any
chance of a split-brain:
https://bugzilla.redhat.com/show_bug.cgi?id=1066996
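For illustration only, the two options above could be applied with the standard "gluster volume set <VOLNAME> <OPTION> <VALUE>" command. The sketch below (Python, chosen for illustration) only builds the command lines and does not run them. The volume name "VOLNAME" is a placeholder and the helper name is hypothetical.

```python
from typing import Dict, List, Optional

# Option names and values are taken from the suggested configuration above.
SUGGESTED_OPTIONS: Dict[str, str] = {
    "network.ping-timeout": "10",
    "cluster.quorum-type": "auto",
}


def volume_set_commands(
    volume: str, options: Optional[Dict[str, str]] = None
) -> List[List[str]]:
    """Build one 'gluster volume set' argument list per option (hypothetical helper)."""
    opts = SUGGESTED_OPTIONS if options is None else options
    return [["gluster", "volume", "set", volume, key, value]
            for key, value in opts.items()]


if __name__ == "__main__":
    # "VOLNAME" is a placeholder, not a real volume.
    for cmd in volume_set_commands("VOLNAME"):
        print(" ".join(cmd))
```

The argument lists can be passed to subprocess.run() on a host where the gluster CLI is installed. That step is left out here on purpose.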
It seems that the aforementioned bug is peculiar to 3-bricks setups.
I understand that a 3-bricks setup can allow proper quorum formation without
resorting to "first-configured-brick-has-more-weight" convention used with
only 2 bricks and quorum "auto" (which makes one node "special", so not
properly any-single-fault tolerant).
Correct.
...
But, since we are on ovirt-users, is there a similar suggested configuration
for a 2-hosts setup oVirt+GlusterFS with oVirt-side power management
properly configured and tested-working?
I mean a configuration where "any" host can go south and oVirt (through the
other one) fences it (forcibly powering it off with confirmation from IPMI
or similar) then restarts HA-marked vms that were running there, all the
while keeping the underlying GlusterFS-based storage domains responsive and
readable/writeable (maybe apart from a lapse between detected other-node
unresponsiveness and confirmed fencing)?
We already had a discussion with gluster asking if it was possible to
add fencing to the replica 2 quorum/consistency mechanism.
The idea is that as soon as you can't replicate a write you have to
freeze all IO until either the connection is re-established or you
know that the other host has been killed.
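The idea above can be sketched as a small decision function. This is only an illustration of the proposed behavior, not anything gluster implements today. The state names and the function are hypothetical.

```python
from enum import Enum


class PeerState(Enum):
    CONNECTED = "connected"      # replication to the other host works
    UNREACHABLE = "unreachable"  # writes cannot be replicated, fate of peer unknown
    FENCED = "fenced"            # power-off of the other host has been confirmed


def write_action(peer: PeerState) -> str:
    """Decide what to do with an incoming write on a replica-2 volume."""
    if peer is PeerState.CONNECTED:
        return "replicate"    # normal case: write to both replicas
    if peer is PeerState.FENCED:
        return "write-local"  # the peer is known dead, so it cannot diverge
    return "freeze"           # hold all IO until reconnect or confirmed fencing
```

The point of the "freeze" branch is that neither side accepts writes while the state of the other side is unknown, so the two copies cannot diverge.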
Adding Vijay.
There is a related thread on gluster-devel [1] to have a better
behavior in GlusterFS for prevention of split brains with sanlock and
2-way replicated gluster volumes.
Please feel free to comment on the proposal there.
Thanks,
Vijay
[1] http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html
One quick note before my main comment: I see references to quorum being
"N/2 + 1".  Isn't it more accurate to say that quorum is "(N + 1)/2" or
"N/2 + 0.5"?
"(N + 1)/2" or  "N/2 + 0.5" is fine when N happens to be odd. For both 
odd and even cases of N, "N/2 + 1" does seem to be the more appropriate 
representation (assuming integer arithmetic).
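A quick sketch of the arithmetic (Python, with hypothetical function names) shows the difference. With integer division, "N/2 + 1" is a strict majority for every N, while "(N + 1)/2" gives exactly half of the replicas when N is even.

```python
def quorum_half_plus_one(n: int) -> int:
    """N/2 + 1 with integer (floor) division."""
    return n // 2 + 1


def quorum_n_plus_one_over_two(n: int) -> int:
    """(N + 1)/2 with integer (floor) division."""
    return (n + 1) // 2


def is_strict_majority(q: int, n: int) -> bool:
    """True when q replicas are more than half of n."""
    return 2 * q > n


if __name__ == "__main__":
    for n in range(2, 6):
        a = quorum_half_plus_one(n)
        b = quorum_n_plus_one_over_two(n)
        print(n, a, is_strict_majority(a, n), b, is_strict_majority(b, n))
```

For N = 2 the first formula requires both replicas, while the second would accept a single replica. That would let each side of a partition believe it has quorum.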
...
Now to my main comment.
I see a case that is not being addressed.  I have no proof of how often
this use-case occurs, but I believe that it does occur.  (It could
(theoretically) occur in any situation where multiple bricks are writing
to different parts of the same file.)
Use-case: sanlock via fuse client.
Steps to produce originally
(not tested for reproducibility, because I was unable to recover the
    ovirt cluster after occurrence, had to rebuild from scratch), time
    frame was late 2013 or early 2014
2 node ovirt cluster using replicated gluster storage
    ovirt cluster up and running VMs
    remove power from network switch
    restore power to network switch after a few minutes
Result
both copies of .../dom_md/ids file accused the other of being out of
    sync
This case would fall under the ambit of "1. Split-brains due to network 
partition or network split-brains" in the proposal on gluster-devel.
...
Possible solutions
Thinking about it on a systems level, the only solution I can see is
    to route all writes through one gluster brick.  That way all the
    accusations flow from that brick to other bricks, and gluster will
    find the one file with no one accusing it, and can sync from that
    file to others.
Yes, this is one possibility. The other possibility would be to increase 
the replica count for this particular file and use client quorum to 
provide network partition tolerance with higher availability too. Even 
if split brain were to happen, we can automatically select a winner by 
picking up the version of the file that has been updated on more than 
half the number of replicas.
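As a minimal sketch of that selection rule (not the actual replication translator logic), assume each replica reports some identifier for its copy of the file, for example a checksum. The function name and the identifiers below are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def pick_winner(versions: Sequence[str]) -> Optional[str]:
    """Return the version held by more than half of the replicas, else None.

    'versions' holds one identifier per replica (for example a checksum).
    """
    if not versions:
        return None
    version, count = Counter(versions).most_common(1)[0]
    # A winner needs a strict majority; a tie means no automatic choice.
    return version if 2 * count > len(versions) else None
```

With three replicas, pick_winner(["a", "a", "b"]) selects "a". With two replicas that disagree there is no majority and the function returns None, which is why a higher replica count helps here.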
...
Within a gluster environment, the only way I know to do this
    currently is to use an nfs mount, forcing all data through that
    machine, BUT also making that machine a single point of failure.
    That assumes that you do not do as I did (and caused split-brain) by
    mounting an nfs volume using localhost:/engVM1, which put me back in
    the multiple-write situation
In previous googling, I have seen a proposal to alter/replace the
    current replication translator so that it would do something
    similar, routing all writes through one node, but still allowing
    local reads, and allowing the chosen node to float dynamically among
    the available bricks.  I looked again, but have been unable to find
    that mailing list entry again. :(
I think you are referring to the New Style Replication (NSR) feature 
proposal [1]. NSR is currently being implemented and you can follow it 
here [2].

Thanks,
Vijay

[1] http://www.gluster.org/community/documentation/index.php/Features/new-style-...

[2] http://review.gluster.org/#/q/project:+glusterfs-nsr,n,z