sanlock + gluster recovery -- RFE

21 May 2014

Hi,
...
----- Original Message -----
...
From: "Ted Miller" <tmiller at hcjb.org>
To: "users" <users at ovirt.org>
Sent: Tuesday, May 20, 2014 11:31:42 PM
Subject: [ovirt-users] sanlock + gluster recovery -- RFE

As you are aware, there is an ongoing split-brain problem with running
sanlock on replicated gluster storage. Personally, I believe that this is
the 5th time that I have been bitten by this sanlock+gluster problem.

I believe that the following are true (if not, my entire request is probably
off base).

    * ovirt uses sanlock in such a way that when the sanlock storage is on a
    replicated gluster file system, very small storage disruptions can
    result in a gluster split-brain on the sanlock space

Although this is possible (at the moment) we are working hard to avoid it.
The hardest part here is to ensure that the gluster volume is properly
configured.

The suggested configuration for a volume to be used with ovirt is:

Volume Name: (...)
Type: Replicate
Volume ID: (...)
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
(...three bricks...)
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto

The two options ping-timeout and quorum-type are really important.
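
As a rough illustration of the configuration quoted above, the following sketch checks text shaped like that `gluster volume info` listing for the replica count and the two options called out as important. The function name, the constant names, and the sample volume name "example" are hypothetical, and the parser assumes the output layout matches the listing in this message; it is not an official oVirt or GlusterFS tool.

```python
import re

# Options the message calls "really important", with the suggested values.
REQUIRED_OPTIONS = {
    "network.ping-timeout": "10",
    "cluster.quorum-type": "auto",
}

# Hypothetical sample, shaped like the listing quoted in the message.
SAMPLE = """\
Volume Name: example
Type: Replicate
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Options Reconfigured:
network.ping-timeout: 10
cluster.quorum-type: auto
"""


def check_volume_info(text):
    """Return a list of problems found in `gluster volume info`-style text."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    problems = []
    if fields.get("Type") != "Replicate":
        problems.append("volume type is not Replicate")

    # "Number of Bricks: 1 x 3 = 3": the second factor is the replica count.
    match = re.match(r"(\d+)\s*x\s*(\d+)\s*=\s*(\d+)",
                     fields.get("Number of Bricks", ""))
    if not match or int(match.group(2)) != 3:
        problems.append("expected 3 replicas per set")

    for option, wanted in REQUIRED_OPTIONS.items():
        actual = fields.get(option)
        if actual != wanted:
            problems.append(f"{option} is {actual!r}, expected {wanted!r}")
    return problems


print(check_volume_info(SAMPLE))  # prints []
```

An empty list means the text matches the suggested settings; each string in a non-empty list names one deviation.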

You would also need a build where this bug is fixed in order to avoid any
chance of a split-brain:

https://bugzilla.redhat.com/show_bug.cgi?id=1066996

It seems that the aforementioned bug is peculiar to 3-brick setups.

I understand that a 3-brick setup allows proper quorum formation without resorting to the "first-configured-brick-has-more-weight" convention used with only 2 bricks and quorum "auto" (which makes one node "special", so the setup is not properly tolerant of any single fault).
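
The difference between the 2-brick and 3-brick cases can be sketched with a toy model of the "auto" quorum rule as described here: quorum holds with a strict majority of bricks, or with exactly half of them when the first-configured brick is among the survivors. The function name is hypothetical and this is a simplified model for illustration, not GlusterFS's actual implementation.

```python
def quorum_met(bricks_up):
    """Toy model of quorum "auto".

    bricks_up: list of booleans in brick configuration order
    (True means the brick is reachable).
    """
    total = len(bricks_up)
    alive = sum(bricks_up)
    if 2 * alive > total:
        return True  # strict majority of bricks is reachable
    # Exactly half reachable: the first-configured brick breaks the tie.
    return bool(total > 0 and 2 * alive == total and bricks_up[0])


# 3 bricks: take each brick down in turn; quorum survives every time.
three = [quorum_met([i != down for i in range(3)]) for down in range(3)]

# 2 bricks: the outcome depends on which brick is lost.
two = [quorum_met([i != down for i in range(2)]) for down in range(2)]

print(three)  # [True, True, True]
print(two)    # [False, True]
```

In this model the 2-brick volume keeps quorum only when the second brick is the one that fails, which is the "special node" asymmetry mentioned above.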

But, since we are on ovirt-users, is there a similar suggested configuration for a 2-host oVirt+GlusterFS setup with oVirt-side power management properly configured and tested as working?
I mean a configuration where "any" host can go south and oVirt (through the other one) fences it (forcibly powering it off with confirmation from IPMI or similar), then restarts the HA-marked VMs that were running there, all the while keeping the underlying GlusterFS-based storage domains responsive and readable/writable (maybe apart from a lapse between detecting that the other node is unresponsive and confirmed fencing)?

Furthermore: is such a suggested configuration possible in a self-hosted-engine scenario?

Regards,
Giuseppe
...
...
How did I get into this mess?

...

What I would like to see in ovirt to help me (and others like me). Alternates
listed in order from most desirable (automatic) to least desirable (set of
commands to type, with lots of variables to figure out).

The real solution is to avoid the split-brain altogether. At the moment it
seems that using the suggested configurations and the bug fix we shouldn't
hit a split-brain.

1. automagic recovery

2. recovery subcommand

3. script

4. commands

I think that the commands to resolve a split-brain should be documented.
I just started a page here:

http://www.ovirt.org/Gluster_Storage_Domain_Reference

Could you add your documentation there? Thanks!

-- 
Federico
...

Participants (4): Giuseppe Ragusa, Federico Simoncelli, Vijay Bellur, Ted Miller