
Hi,
----- Original Message -----
From: "Ted Miller" <tmiller at hcjb.org> To: "users" <users at ovirt.org> Sent: Tuesday=2C May 20=2C 2014 11:31:42 PM Subject: [ovirt-users] sanlock + gluster recovery -- RFE =20 As you are aware=2C there is an ongoing split-brain problem with runnin= g sanlock on replicated gluster storage. Personally=2C I believe that thi= s is the 5th time that I have been bitten by this sanlock+gluster problem. =20 I believe that the following are true (if not=2C my entire request is p= robably off base). =20 =20 * ovirt uses sanlock in such a way that when the sanlock storage is= on a replicated gluster file system=2C very small storage disruptions ca= n result in a gluster split-brain on the sanlock space =20 Although this is possible (at the moment) we are working hard to avoid it= . The hardest part here is to ensure that the gluster volume is properly configured. =20 The suggested configuration for a volume to be used with ovirt is: =20 Volume Name: (...) Type: Replicate Volume ID: (...) Status: Started Number of Bricks: 1 x 3 =3D 3 Transport-type: tcp Bricks: (...three bricks...) Options Reconfigured: network.ping-timeout: 10 cluster.quorum-type: auto =20 The two options ping-timeout and quorum-type are really important. =20 You would also need a build where this bug is fixed in order to avoid any chance of a split-brain: =20 https://bugzilla.redhat.com/show_bug.cgi?id=3D1066996
It seems that the aforementioned bug is peculiar to 3-brick setups.

I understand that a 3-brick setup can allow proper quorum formation without resorting to the "first-configured-brick-has-more-weight" convention used with only 2 bricks and quorum "auto" (which makes one node "special", so not properly any-single-fault tolerant).

But, since we are on ovirt-users, is there a similar suggested configuration for a 2-host oVirt+GlusterFS setup with oVirt-side power management properly configured and tested-working?

I mean a configuration where "any" host can go south and oVirt (through the other one) fences it (forcibly powering it off with confirmation from IPMI or similar) then restarts HA-marked VMs that were running there, all the while keeping the underlying GlusterFS-based storage domains responsive and readable/writeable (maybe apart from a lapse between detected other-node unresponsiveness and confirmed fencing)?

Furthermore: is such a suggested configuration possible in a self-hosted-engine scenario?

Regards,
Giuseppe
How did I get into this mess?

...

What I would like to see in ovirt to help me (and others like me). Alternates listed in order from most desirable (automatic) to least desirable (set of commands to type, with lots of variables to figure out).

The real solution is to avoid the split-brain altogether. At the moment it seems that using the suggested configurations and the bug fix we shouldn't hit a split-brain.

1. automagic recovery

2. recovery subcommand

3. script

4. commands

I think that the commands to resolve a split-brain should be documented. I just started a page here:

http://www.ovirt.org/Gluster_Storage_Domain_Reference

Could you add your documentation there? Thanks!

--
Federico
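For reference, a hedged sketch of the manual resolution commands such a page typically documents for a replicated volume (brick path, volume name and GFID below are placeholders; verify the procedure against your gluster version before running anything):

```shell
# On the brick whose copy you have decided to discard (the "bad" copy):

# 1. Note the file's GFID from its trusted.gfid xattr.
getfattr -n trusted.gfid -e hex /export/brick1/DOMAIN_UUID/dom_md/ids

# 2. Remove the bad copy and its GFID hard link under .glusterfs
#    (the two directory components come from the first bytes of the GFID).
rm /export/brick1/DOMAIN_UUID/dom_md/ids
rm /export/brick1/.glusterfs/xx/yy/xxyyzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz

# 3. Trigger self-heal so the surviving good copy is replicated back.
gluster volume heal VOLNAME full
```

This is the classic "pick a winner by deleting the loser" approach; it only works when you are sure which copy is good.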

----- Original Message -----
From: "Giuseppe Ragusa" <giuseppe.ragusa@hotmail.com> To: fsimonce@redhat.com Cc: users@ovirt.org Sent: Wednesday, May 21, 2014 5:15:30 PM Subject: sanlock + gluster recovery -- RFE
Hi,
----- Original Message -----
From: "Ted Miller" <tmiller at hcjb.org> To: "users" <users at ovirt.org> Sent: Tuesday, May 20, 2014 11:31:42 PM Subject: [ovirt-users] sanlock + gluster recovery -- RFE
As you are aware, there is an ongoing split-brain problem with running sanlock on replicated gluster storage. Personally, I believe that this is the 5th time that I have been bitten by this sanlock+gluster problem.
I believe that the following are true (if not, my entire request is probably off base).
* ovirt uses sanlock in such a way that when the sanlock storage is on a replicated gluster file system, very small storage disruptions can result in a gluster split-brain on the sanlock space
Although this is possible (at the moment) we are working hard to avoid it. The hardest part here is to ensure that the gluster volume is properly configured.
The suggested configuration for a volume to be used with ovirt is:
Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
The two options ping-timeout and quorum-type are really important.
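For reference, a sketch of applying those two options to an existing volume (VOLNAME is a placeholder for the actual volume name):

```shell
# Set the two options Federico calls out as really important.
gluster volume set VOLNAME network.ping-timeout 10
gluster volume set VOLNAME cluster.quorum-type auto

# Confirm they show up under "Options Reconfigured".
gluster volume info VOLNAME
```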
You would also need a build where this bug is fixed in order to avoid any chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996
It seems that the aforementioned bug is peculiar to 3-brick setups.
I understand that a 3-brick setup can allow proper quorum formation without resorting to the "first-configured-brick-has-more-weight" convention used with only 2 bricks and quorum "auto" (which makes one node "special", so not properly any-single-fault tolerant).
Correct.
But, since we are on ovirt-users, is there a similar suggested configuration for a 2-host oVirt+GlusterFS setup with oVirt-side power management properly configured and tested-working? I mean a configuration where "any" host can go south and oVirt (through the other one) fences it (forcibly powering it off with confirmation from IPMI or similar) then restarts HA-marked VMs that were running there, all the while keeping the underlying GlusterFS-based storage domains responsive and readable/writeable (maybe apart from a lapse between detected other-node unresponsiveness and confirmed fencing)?
We already had a discussion with gluster asking if it was possible to add fencing to the replica 2 quorum/consistency mechanism.

The idea is that as soon as you can't replicate a write you have to freeze all IO until either the connection is re-established or you know that the other host has been killed.

Adding Vijay.

--
Federico

On 05/21/2014 10:22 PM, Federico Simoncelli wrote:
----- Original Message -----
From: "Giuseppe Ragusa" <giuseppe.ragusa@hotmail.com> To: fsimonce@redhat.com Cc: users@ovirt.org Sent: Wednesday, May 21, 2014 5:15:30 PM Subject: sanlock + gluster recovery -- RFE
Hi,
----- Original Message -----
From: "Ted Miller" <tmiller at hcjb.org> To: "users" <users at ovirt.org> Sent: Tuesday, May 20, 2014 11:31:42 PM Subject: [ovirt-users] sanlock + gluster recovery -- RFE
As you are aware, there is an ongoing split-brain problem with running sanlock on replicated gluster storage. Personally, I believe that this is the 5th time that I have been bitten by this sanlock+gluster problem.
I believe that the following are true (if not, my entire request is probably off base).
* ovirt uses sanlock in such a way that when the sanlock storage is on a replicated gluster file system, very small storage disruptions can result in a gluster split-brain on the sanlock space
Although this is possible (at the moment) we are working hard to avoid it. The hardest part here is to ensure that the gluster volume is properly configured.
The suggested configuration for a volume to be used with ovirt is:
Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
The two options ping-timeout and quorum-type are really important.
You would also need a build where this bug is fixed in order to avoid any chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996
It seems that the aforementioned bug is peculiar to 3-brick setups.
I understand that a 3-brick setup can allow proper quorum formation without resorting to the "first-configured-brick-has-more-weight" convention used with only 2 bricks and quorum "auto" (which makes one node "special", so not properly any-single-fault tolerant).
Correct.
But, since we are on ovirt-users, is there a similar suggested configuration for a 2-host oVirt+GlusterFS setup with oVirt-side power management properly configured and tested-working? I mean a configuration where "any" host can go south and oVirt (through the other one) fences it (forcibly powering it off with confirmation from IPMI or similar) then restarts HA-marked VMs that were running there, all the while keeping the underlying GlusterFS-based storage domains responsive and readable/writeable (maybe apart from a lapse between detected other-node unresponsiveness and confirmed fencing)?
We already had a discussion with gluster asking if it was possible to add fencing to the replica 2 quorum/consistency mechanism.
The idea is that as soon as you can't replicate a write you have to freeze all IO until either the connection is re-established or you know that the other host has been killed.
Adding Vijay.
There is a related thread on gluster-devel [1] to have a better behavior in GlusterFS for prevention of split brains with sanlock and 2-way replicated gluster volumes. Please feel free to comment on the proposal there.

Thanks,
Vijay

[1] http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html

Vijay, I am not a member of the developer list, so my comments are at end.

On 5/23/2014 6:55 AM, Vijay Bellur wrote:
On 05/21/2014 10:22 PM, Federico Simoncelli wrote:
----- Original Message -----
From: "Giuseppe Ragusa" <giuseppe.ragusa@hotmail.com> To: fsimonce@redhat.com Cc: users@ovirt.org Sent: Wednesday, May 21, 2014 5:15:30 PM Subject: sanlock + gluster recovery -- RFE
Hi,
----- Original Message -----
From: "Ted Miller" <tmiller at hcjb.org> To: "users" <users at ovirt.org> Sent: Tuesday, May 20, 2014 11:31:42 PM Subject: [ovirt-users] sanlock + gluster recovery -- RFE
As you are aware, there is an ongoing split-brain problem with running sanlock on replicated gluster storage. Personally, I believe that this is the 5th time that I have been bitten by this sanlock+gluster problem.
I believe that the following are true (if not, my entire request is probably off base).
* ovirt uses sanlock in such a way that when the sanlock storage is on a replicated gluster file system, very small storage disruptions can result in a gluster split-brain on the sanlock space
Although this is possible (at the moment) we are working hard to avoid it. The hardest part here is to ensure that the gluster volume is properly configured.
The suggested configuration for a volume to be used with ovirt is:
Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
The two options ping-timeout and quorum-type are really important.
You would also need a build where this bug is fixed in order to avoid any chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996
It seems that the aforementioned bug is peculiar to 3-brick setups.
I understand that a 3-brick setup can allow proper quorum formation without resorting to the "first-configured-brick-has-more-weight" convention used with only 2 bricks and quorum "auto" (which makes one node "special", so not properly any-single-fault tolerant).
Correct.
But, since we are on ovirt-users, is there a similar suggested configuration for a 2-host oVirt+GlusterFS setup with oVirt-side power management properly configured and tested-working? I mean a configuration where "any" host can go south and oVirt (through the other one) fences it (forcibly powering it off with confirmation from IPMI or similar) then restarts HA-marked VMs that were running there, all the while keeping the underlying GlusterFS-based storage domains responsive and readable/writeable (maybe apart from a lapse between detected other-node unresponsiveness and confirmed fencing)?
We already had a discussion with gluster asking if it was possible to add fencing to the replica 2 quorum/consistency mechanism.
The idea is that as soon as you can't replicate a write you have to freeze all IO until either the connection is re-established or you know that the other host has been killed.
Adding Vijay.

There is a related thread on gluster-devel [1] to have a better behavior in GlusterFS for prevention of split brains with sanlock and 2-way replicated gluster volumes.
Please feel free to comment on the proposal there.
Thanks, Vijay
[1] http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html
One quick note before my main comment: I see references to quorum being "N/2 + 1". Isn't it more accurate to say that quorum is "(N + 1)/2" or "N/2 + 0.5"?

Now to my main comment.

I see a case that is not being addressed. I have no proof of how often this use-case occurs, but I believe that it does occur. (It could (theoretically) occur in any situation where multiple bricks are writing to different parts of the same file.)

Use-case: sanlock via fuse client.

Steps to produce originally (not tested for reproducibility, because I was unable to recover the ovirt cluster after occurrence, had to rebuild from scratch); time frame was late 2013 or early 2014:

2 node ovirt cluster using replicated gluster storage
ovirt cluster up and running VMs
remove power from network switch
restore power to network switch after a few minutes

Result: both copies of .../dom_md/ids file accused the other of being out of sync.

Hypothesis of cause: servers (ovirt nodes and gluster bricks) are called A and B. At the moment when network communication was lost, or just a moment after communication was lost:

A had written to local ids file
A had started process to send write to B
A had not received write confirmation from B

and

B had written to local ids file
B had started process to send write to A
B had not received write confirmation from A

Thus, each file had a segment that had been written to the local file, but had not been confirmed written on the remote file. Each file correctly accused the other file of being out-of-sync. I did read and decipher the xattr data, and this was indeed the case, each file accused the other.

Possible solutions: thinking about it on a systems level, the only solution I can see is to route all writes through one gluster brick. That way all the accusations flow from that brick to other bricks, and gluster will find the one file with no one accusing it, and can sync from that file to others.
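For context, the kind of xattr inspection Ted describes can be done on each brick with something like the following (the brick path is a placeholder; the volume name in the xattr key matches your volume):

```shell
# Run on each brick server against that brick's copy of the file.
# Non-zero counters in trusted.afr.<volume>-client-N mean this copy
# accuses replica N of having pending (unsynced) operations.
getfattr -d -m trusted.afr -e hex /export/brick1/DOMAIN_UUID/dom_md/ids
```

When both copies show non-zero accusations against each other, AFR has no unaccused source to heal from, which is exactly the mutual-accusation state described above.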
Within a gluster environment, the only way I know to do this currently is to use an nfs mount, forcing all data through that machine, BUT also making that machine a single point of failure. That assumes that you do not do as I did (and caused split-brain) by mounting an nfs volume using localhost:/engVM1, which put me back in the multiple-write situation.

In previous googling, I have seen a proposal to alter/replace the current replication translator so that it would do something similar, routing all writes through one node, but still allowing local reads, and allowing the chosen node to float dynamically among the available bricks. I looked again, but have been unable to find that mailing list entry again. :(

Ted Miller
Elkhart, IN, USA

I am sorry, this missed my attention over the last few days.

On 05/23/2014 08:50 PM, Ted Miller wrote:
Vijay, I am not a member of the developer list, so my comments are at end.
On 5/23/2014 6:55 AM, Vijay Bellur wrote:
On 05/21/2014 10:22 PM, Federico Simoncelli wrote:
----- Original Message -----
From: "Giuseppe Ragusa" <giuseppe.ragusa@hotmail.com> To: fsimonce@redhat.com Cc: users@ovirt.org Sent: Wednesday, May 21, 2014 5:15:30 PM Subject: sanlock + gluster recovery -- RFE
Hi,
----- Original Message -----
From: "Ted Miller" <tmiller at hcjb.org> To: "users" <users at ovirt.org> Sent: Tuesday, May 20, 2014 11:31:42 PM Subject: [ovirt-users] sanlock + gluster recovery -- RFE
As you are aware, there is an ongoing split-brain problem with running sanlock on replicated gluster storage. Personally, I believe that this is the 5th time that I have been bitten by this sanlock+gluster problem.
I believe that the following are true (if not, my entire request is probably off base).
* ovirt uses sanlock in such a way that when the sanlock storage is on a replicated gluster file system, very small storage disruptions can result in a gluster split-brain on the sanlock space
Although this is possible (at the moment) we are working hard to avoid it. The hardest part here is to ensure that the gluster volume is properly configured.
The suggested configuration for a volume to be used with ovirt is:
Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
The two options ping-timeout and quorum-type are really important.
You would also need a build where this bug is fixed in order to avoid any chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996
It seems that the aforementioned bug is peculiar to 3-brick setups.
I understand that a 3-brick setup can allow proper quorum formation without resorting to the "first-configured-brick-has-more-weight" convention used with only 2 bricks and quorum "auto" (which makes one node "special", so not properly any-single-fault tolerant).
Correct.
But, since we are on ovirt-users, is there a similar suggested configuration for a 2-host oVirt+GlusterFS setup with oVirt-side power management properly configured and tested-working? I mean a configuration where "any" host can go south and oVirt (through the other one) fences it (forcibly powering it off with confirmation from IPMI or similar) then restarts HA-marked VMs that were running there, all the while keeping the underlying GlusterFS-based storage domains responsive and readable/writeable (maybe apart from a lapse between detected other-node unresponsiveness and confirmed fencing)?
We already had a discussion with gluster asking if it was possible to add fencing to the replica 2 quorum/consistency mechanism.
The idea is that as soon as you can't replicate a write you have to freeze all IO until either the connection is re-established or you know that the other host has been killed.
Adding Vijay.

There is a related thread on gluster-devel [1] to have a better behavior in GlusterFS for prevention of split brains with sanlock and 2-way replicated gluster volumes.
Please feel free to comment on the proposal there.
Thanks, Vijay
[1] http://supercolony.gluster.org/pipermail/gluster-devel/2014-May/040751.html
One quick note before my main comment: I see references to quorum being "N/2 + 1". Isn't it more accurate to say that quorum is "(N + 1)/2" or "N/2 + 0.5"?
"(N + 1)/2" or "N/2 + 0.5" is fine when N happens to be odd. For both odd and even cases of N, "N/2 + 1" does seem to be the more appropriate representation (assuming integer arithmetic).
Now to my main comment.
I see a case that is not being addressed. I have no proof of how often this use-case occurs, but I believe that it does occur. (It could (theoretically) occur in any situation where multiple bricks are writing to different parts of the same file.)
Use-case: sanlock via fuse client.
Steps to produce originally
(not tested for reproducibility, because I was unable to recover the ovirt cluster after occurrence, had to rebuild from scratch), time frame was late 2013 or early 2014
2 node ovirt cluster using replicated gluster storage
ovirt cluster up and running VMs
remove power from network switch
restore power to network switch after a few minutes
Result
both copies of .../dom_md/ids file accused the other of being out of sync
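For reference, the state described here can be inspected from the command line; the volume name reuses the "engVM1" mentioned later in this thread, and the brick path is a made-up example:

```shell
# List files that gluster itself considers to be in split-brain
gluster volume heal engVM1 info split-brain

# On each brick (not through the mount), dump the AFR changelog xattrs
# for the ids file; non-zero pending counters in the trusted.afr.*
# entries mean this copy "accuses" the other of missing writes
getfattr -d -m . -e hex /export/brick1/dom_md/ids
```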
This case would fall under the ambit of "1. Split-brains due to network partition or network split-brains" in the proposal on gluster-devel.
Possible solutions
Thinking about it on a systems level, the only solution I can see is to route all writes through one gluster brick. That way all the accusations flow from that brick to other bricks, and gluster will find the one file with no one accusing it, and can sync from that file to others.
Yes, this is one possibility. The other possibility would be to increase the replica count for this particular file and use client quorum to provide network partition tolerance with higher availability too. Even if split brain were to happen, we can automatically select a winner by picking up the version of the file that has been updated on more than half the number of replicas.
Within a gluster environment, the only way I know to do this currently is to use an NFS mount, forcing all data through that machine, BUT also making that machine a single point of failure. That assumes that you do not do as I did (and caused a split-brain) by mounting the NFS volume as localhost:/engVM1, which put me back in the multiple-writer situation.
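The distinction can be sketched with the mount commands involved; "server1" is a placeholder hostname, and the NFSv3 option reflects that gluster's built-in NFS server speaks v3:

```shell
# What caused the split-brain: each host mounting via localhost, so
# writes enter the replica through a different brick on every host
mount -t nfs -o vers=3 localhost:/engVM1 /mnt/engVM1

# Single-writer alternative: every host mounts the same designated
# server, which then becomes the single point of failure noted above
mount -t nfs -o vers=3 server1:/engVM1 /mnt/engVM1
```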
In previous googling, I have seen a proposal to alter/replace the current replication translator so that it would do something similar, routing all writes through one node, but still allowing local reads, and allowing the chosen node to float dynamically among the available bricks. I looked again, but have been unable to find that mailing list entry again. :(
I think you are referring to the New Style Replication (NSR) feature proposal [1]. NSR is currently being implemented and you can follow it here [2]. Thanks, Vijay [1] http://www.gluster.org/community/documentation/index.php/Features/new-style-... [2] http://review.gluster.org/#/q/project:+glusterfs-nsr,n,z

Hi,
From: "Ted Miller" <tmiller at hcjb.org>
To: "users" <users at ovirt.org>
Sent: Tuesday, May 20, 2014 11:31:42 PM
Subject: [ovirt-users] sanlock + gluster recovery -- RFE
As you are aware, there is an ongoing split-brain problem with running sanlock on replicated gluster storage. Personally, I believe that this is the 5th time that I have been bitten by this sanlock+gluster problem.
I believe that the following are true (if not, my entire request is probably off base).
* ovirt uses sanlock in such a way that when the sanlock storage is on a replicated gluster file system, very small storage disruptions can result in a gluster split-brain on the sanlock space
Although this is possible (at the moment) we are working hard to avoid it. The hardest part here is to ensure that the gluster volume is properly configured.
The suggested configuration for a volume to be used with ovirt is:
Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
The two options ping-timeout and quorum-type are really important.
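A sketch of how the suggested configuration could be applied from the gluster CLI; the volume name and host/brick paths are placeholders:

```shell
# Create a 3-way replicated volume (Number of Bricks: 1 x 3 = 3)
gluster volume create myvol replica 3 \
    host1:/export/brick1 host2:/export/brick1 host3:/export/brick1

# The two options called out above
gluster volume set myvol network.ping-timeout 10
gluster volume set myvol cluster.quorum-type auto

gluster volume start myvol

# Verify the brick layout and reconfigured options
gluster volume info myvol
```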
You would also need a build where this bug is fixed in order to avoid any chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996
It seems that the aforementioned bug is peculiar to 3-bricks setups.
I understand that a 3-bricks setup can allow proper quorum formation without resorting to "first-configured-brick-has-more-weight" convention used with only 2 bricks and quorum "auto" (which makes one node "special", so not properly any-single-fault tolerant).
But, since we are on ovirt-users, is there a similar suggested configuration for a two-host oVirt+GlusterFS setup with oVirt-side power management properly configured and tested working? I mean a configuration where any host can go south and oVirt (through the other one) fences it (forcibly powering it off, with confirmation from IPMI or similar), then restarts the HA-marked VMs that were running there, all the while keeping the underlying GlusterFS-based storage domains responsive and readable/writeable (apart, maybe, from a lapse between detected other-node unresponsiveness and confirmed fencing)?
Furthermore: is such a suggested configuration possible in a self-hosted-engine scenario?
Regards, Giuseppe
How did I get into this mess?
...
What I would like to see in oVirt to help me (and others like me), with alternatives listed in order from most desirable (automatic) to least desirable (a set of commands to type, with lots of variables to figure out):
The real solution is to avoid the split-brain altogether. At the moment it seems that using the suggested configurations and the bug fix we shouldn't hit a split-brain.
1. automagic recovery
2. recovery subcommand
3. script
4. commands
I think that the commands to resolve a split-brain should be documented. I just started a page here:
http://www.ovirt.org/Gluster_Storage_Domain_Reference

On 5/21/2014 11:15 AM, Giuseppe Ragusa wrote the message above. I suggest you add these lines to the Gluster configuration, as I have seen this come up multiple times on the User list:

storage.owner-uid: 36
storage.owner-gid: 36

Ted Miller
Elkhart, IN, USA
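For completeness, the manual split-brain resolution procedure commonly documented at the time went roughly as follows; the brick path, mount point, and gfid value here are all illustrative, and the removal must only ever be run against the copy you have decided to discard:

```shell
# 1. On the brick holding the bad copy, read its gfid (work on the
#    brick directly, never through the gluster mount)
getfattr -n trusted.gfid -e hex /export/brick1/dom_md/ids

# 2. Remove the bad copy AND its .glusterfs gfid hard link; for a gfid
#    beginning 0x1234abcd... the link lives under .glusterfs/12/34/
rm /export/brick1/dom_md/ids
rm /export/brick1/.glusterfs/12/34/1234abcd-...

# 3. From a client, stat the file through the mount to trigger
#    self-heal, so the surviving good copy is replicated back
stat /mnt/engVM1/dom_md/ids
```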
participants (4)
- Federico Simoncelli
- Giuseppe Ragusa
- Ted Miller
- Vijay Bellur