
Hi All,

I'm experiencing huge issues when working with big VMs on Gluster volumes. Doing a snapshot or removing a big disk leads to the SPM node becoming unresponsive. Fencing then kicks in and takes the node down with a hard reset/reboot.

My setup has three nodes with 10 Gbit/s NICs for the Gluster network. The bricks are on RAID-6 with 1 GB of cache on the RAID controller, and the volumes are set up as follows:

Volume Name: data
Type: Replicate
Volume ID: c734d678-91e3-449c-8a24-d26b73bef965
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ovirt-node01-gfs.storage.lan:/gluster/brick2/data
Brick2: ovirt-node02-gfs.storage.lan:/gluster/brick2/data
Brick3: ovirt-node03-gfs.storage.lan:/gluster/brick2/data
Options Reconfigured:
features.barrier: disable
cluster.granular-entry-heal: enable
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: on
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on
server.event-threads: 4
client.event-threads: 4

It feels like the system locks up during the snapshot or the removal of a big disk, and this delay triggers things to go wrong. Is there anything that is not set up right on my Gluster, or is this behavior normal with bigger disks (50 GB+)? Is there a reliable option for caching with SSDs?
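In case it is useful for reproducing this, here is roughly how the listing above was produced and how the options were applied; the exact invocations below are illustrative, not a verbatim transcript of what I ran:

    # prints the layout and the "Options Reconfigured" list shown above
    gluster volume info data
    # also shows options that are still at their defaults
    gluster volume get data all
    # each option was set individually in this form, for example:
    gluster volume set data features.shard on
    gluster volume set data features.shard-block-size 512MB
    gluster volume set data network.ping-timeout 30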
Thank you,
Sven