[ovirt-users] Users Digest, Vol 71, Issue 37

Tue Aug 8 09:49:23 UTC 2017

It is far more complex than that. A good NIC has an offload engine (LSO, Large Segment Offload), and when it does, the NIC driver reports an MTU of 64K to the IP stack. The IP stack then sends data to the NIC as if the MTU were 64K, and the NIC cuts it down to the MTU "declared" on the interface, so PMTUD will not be effective in such a scenario. If all of this takes place inside the server, you have no problem. But if a standard router is configured to support 9K jumbo frames on one interface (e.g. the LAN connection) and 1500 on another (e.g. the WAN connection), then the router is responsible for the fragmentation, and most routers out there cannot cope with that under high traffic load.
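
To put a number on the router's fragmentation work described above, here is a minimal Python sketch of the IPv4 fragment count for one jumbo packet crossing onto a 1500-byte link. It assumes a 20-byte IPv4 header with no options; the function name ipv4_fragments is made up for illustration and is not part of any tool mentioned in this thread.

```python
import math

IPV4_HEADER = 20  # bytes; assumption: no IP options


def ipv4_fragments(packet_len: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry one packet over a link with this MTU."""
    if packet_len <= mtu:
        return 1  # fits as-is, no fragmentation
    payload = packet_len - IPV4_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


print(ipv4_fragments(9000, 1500))  # 7
print(ipv4_fragments(1500, 1500))  # 1
```

So every 9000-byte packet that leaves the jumbo-frame side becomes 7 packets on the 1500-byte side, each with its own header, and the router has to build all of them.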

Splitting the very intensive east/west traffic like disk copies, VM moves, etc. from the "service" traffic will not only prevent contention but also fix this problem with MTU.

Moacir

________________________________
From: Yaniv Kaul <ykaul at redhat.com>
Sent: Tuesday, August 8, 2017 7:35 AM
To: Moacir Ferreira
Cc: users at ovirt.org
Subject: Re: [ovirt-users] Users Digest, Vol 71, Issue 37

On Tue, Aug 8, 2017 at 12:42 AM, Moacir Ferreira <moacirferreira at hotmail.com> wrote:

Fabrice,

If you choose to have jumbo frames everywhere, then whenever traffic leaves your jumbo-frame-enabled network it has to be fragmented down to the destination MTU. Most datacenters provide services to the outside world, where the MTU is 1500 bytes. In that case performance suffers because your router is doing the fragmentation. So I would always use jumbo frames in the datacenter for east/west traffic and the standard MTU (1500 bytes) for north/south traffic.

I doubt this would happen with modern TCP/IP stacks, for TCP connections. They will most likely adjust to the path using PMTUD. Of course, this does not always work (it depends on the HW en route).
UDP packets might fail miserably too (get dropped), depending on the HW en route, but UDP traffic (and large packets specifically) is not that common these days.

Nevertheless, I don't see a huge advantage in enabling this for north-south traffic, TBH, and the mysterious, random traffic drops it may cause are not worth it.
Y.

Moacir

----------------------------------------------------------------------

Message: 1
Date: Mon, 7 Aug 2017 21:50:36 +0200
From: Fabrice Bacchella <fabrice.bacchella at orange.fr>
To: FERNANDO FREDIANI <fernando.frediani at upx.com>
Cc: users at ovirt.org
Subject: Re: [ovirt-users] Good practices
Message-ID: <4365E3F7-4C77-4FF5-8401-1CDA2F0029EE at orange.fr>
Content-Type: text/plain; charset="windows-1252"

>> Moacir: Yes! This is another reason to have separate networks for north/south and east/west. In that way I can use the standard MTU on the 10Gb NICs and jumbo frames on the file/move 40Gb NICs.

Why not jumbo frames everywhere?

------------------------------

Message: 2
Date: Mon, 7 Aug 2017 16:52:40 -0300
From: FERNANDO FREDIANI <fernando.frediani at upx.com>
To: Fabrice Bacchella <fabrice.bacchella at orange.fr>
Cc: users at ovirt.org
Subject: Re: [ovirt-users] Good practices
Message-ID: <40d044ae-a41d-082e-131a-bf5fb5503513 at upx.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

What you mentioned is a specific case and not a generic situation. The
main point is that RAID 5 or 6 hurts write performance compared with
writing to only 2 given disks at a time. That was the comparison being
made.
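
As a rough illustration of that comparison, the sketch below applies the common rule-of-thumb write penalties for small random writes (2 back-end I/Os per write for a mirror or a 2-way replica, 4 for RAID 5, 6 for RAID 6). The figure of 140 IOPS per 10K disk is an assumed round number, not a measurement, and the model deliberately ignores controller write caches, which is exactly the case Fabrice raises.

```python
# Rule-of-thumb back-end I/Os per small random write (assumption, not a benchmark).
WRITE_PENALTY = {"jbod": 1, "raid1": 2, "raid5": 4, "raid6": 6}


def random_write_iops(disks: int, iops_per_disk: float, level: str) -> float:
    """Estimated random-write IOPS of an array, ignoring any controller cache."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


DISKS = 7            # HDDs per server, as in the original question
IOPS_PER_DISK = 140  # hypothetical figure for one 10K drive

for level in ("raid1", "raid5", "raid6"):
    print(level, round(random_write_iops(DISKS, IOPS_PER_DISK, level)))
```

Large sequential, full-stripe writes do not pay this penalty, so the gap matters mostly for VM-style random I/O.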

Fernando

On 07/08/2017 16:49, Fabrice Bacchella wrote:
>
>> On 7 August 2017 at 17:41, FERNANDO FREDIANI <fernando.frediani at upx.com> wrote:
>>
>
>> Yet another downside of having a RAID (especially RAID 5 or 6) is that
>> it considerably reduces write speeds, as each group of disks ends up
>> with the write speed of a single disk, because all other disks of
>> that group have to wait for each other to write as well.
>>
>
> That's not true if you have mid- to high-range hardware RAID. For
> example, HP Smart Array controllers come with a flash cache of about
> 1 or 2 GB that hides that from the OS.


------------------------------

Message: 3
Date: Mon, 7 Aug 2017 22:05:19 +0200
From: Erekle Magradze <erekle.magradze at recogizer.de>
To: FERNANDO FREDIANI <fernando.frediani at upx.com>, users at ovirt.org
Subject: Re: [ovirt-users] Good practices
Message-ID: <bac362c7-daba-918c-f728-13e1a74d6cc9 at recogizer.de>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Hi Fernando,

So let's go with the following scenarios:

1. Let's say you have two servers (replication factor is 2), i.e. two
bricks per volume. In this case it is strongly recommended to have an
arbiter node, the metadata storage that guards against the split-brain
situation. The arbiter doesn't even need a disk with lots of space; a
tiny SSD is enough, but it must be hosted on a separate server. The
advantage of such a setup is that you don't need RAID 1 for each brick:
the metadata is stored on the arbiter node and brick replacement is
easy.

2. If you have an odd number of bricks (let's say 3, i.e. replication
factor is 3) in your volume, and you created no arbiter node and
configured no quorum, then the entire load of keeping the volume
consistent rests on all 3 servers. Each of them is important, each brick
contains key information, and they need to cross-check each other
(that's what people usually do on their first try of gluster :) ). In
this case replacing a brick is a big pain, and RAID 1 is a good option
to have. The disadvantage is losing space and not having the JBOD
option; the advantage is that you don't need an additional arbiter
node.

3. You have an odd number of bricks and a configured arbiter node. In
this case you can easily go with JBOD; however, a good practice would be
to have RAID 1 for the arbiter disks (tiny 128 GB SSDs are perfectly
sufficient for volumes tens of TB in size).

That's basically it

The rest about the reliability and setup scenarios you can find in
gluster documentation, especially look for quorum and arbiter node
configs+options.
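
To compare the space cost of the layouts discussed in this thread, here is a small Python sketch. It assumes one array (or one set of JBOD disks) per data node in a purely replicated volume, and it treats an arbiter as holding no data, so only the data replicas are counted. The function names are hypothetical, and the 7 disks per node come from Moacir's original question.

```python
def array_data_disks(disks: int, raid: str) -> int:
    """Disks' worth of usable space in one node's array."""
    if raid == "jbod":
        return disks
    if raid == "raid1":
        return disks // 2  # mirrored pairs; an odd disk is left over
    if raid == "raid5":
        return disks - 1   # one disk of parity
    if raid == "raid6":
        return disks - 2   # two disks of parity
    raise ValueError(f"unknown RAID level: {raid}")


def storage_efficiency(disks_per_node: int, raid: str, data_replicas: int) -> float:
    """Usable space divided by raw space across all data-holding nodes."""
    usable = array_data_disks(disks_per_node, raid)
    raw = disks_per_node * data_replicas
    return usable / raw


print(storage_efficiency(7, "jbod", 3))   # JBOD bricks, replica 3
print(storage_efficiency(7, "raid6", 3))  # RAID 6 bricks, replica 3
print(storage_efficiency(7, "raid6", 2))  # RAID 6 bricks, 2 replicas + arbiter
print(storage_efficiency(6, "raid1", 2))  # RAID 1 bricks, replica 2: data stored 4 times
```

The last case reproduces Fernando's point that RAID 1 under 2x replication keeps four copies of the same data, i.e. an efficiency of 0.25.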

Cheers

Erekle

P.S. What I was mentioning, regarding a good practice is mostly related
to the operations of gluster not installation or deployment, i.e. not
the conceptual understanding of gluster (conceptually it's a JBOD system).

On 08/07/2017 05:41 PM, FERNANDO FREDIANI wrote:
>
> Thanks for the clarification Erekle.
>
> However, I am surprised by this way of operating GlusterFS, as it
> adds another layer of complexity to the system (either a hardware or
> software RAID) underneath the gluster config and increases the
> system's overall cost.
>
> An important point to consider: in a RAID configuration you already
> have space 'wasted' in order to build redundancy (with RAID 1, 5, or
> 6). When you then put GlusterFS on top of several RAIDs, the data is
> replicated once more, so the same data consumes space within each
> group of disks and again across the RAIDs, depending on the Gluster
> configuration you have (with RAID 1 under 2x replication the same
> data is stored 4 times).
>
> Yet another downside of having a RAID (especially RAID 5 or 6) is that
> it considerably reduces write speeds, as each group of disks ends up
> with the write speed of a single disk, because all other disks of
> that group have to wait for each other to write as well.
>
> So if Gluster already replicates the data, why does a failed disk
> cause the big pain you mentioned, when the data is replicated
> somewhere else and can still be retrieved both to serve clients and
> to reconstruct the equivalent disk once it is replaced?
>
> Fernando
>
>
> On 07/08/2017 10:26, Erekle Magradze wrote:
>>
>> Hi Fernando,
>>
>> Here is my experience: if you use a single hard drive as a brick
>> for a gluster volume and it dies, i.e. it becomes inaccessible, it's
>> a huge hassle to discard that brick and swap in another one, since
>> gluster sometimes still tries to access the broken brick, and that
>> causes (at least it did for me) a big pain. It is therefore better
>> to have a RAID as the brick, i.e. RAID 1 (mirroring) for each brick.
>> Then, if a disk goes down, you can easily exchange it and rebuild
>> the RAID without going offline, i.e. without switching off the
>> volume, doing brick manipulations and switching it back on.
>>
>> Cheers
>>
>> Erekle
>>
>>
>> On 08/07/2017 03:04 PM, FERNANDO FREDIANI wrote:
>>>
>>> For any RAID 5 or 6 configuration I normally follow a simple golden
>>> rule that has given good results so far:
>>> - up to 4 disks: RAID 5
>>> - 5 or more disks: RAID 6
>>>
>>> However, I didn't really understand the recommendation to use any
>>> RAID with GlusterFS. I always thought that GlusterFS likes to work
>>> in JBOD mode and control the disks (bricks) directly, so you can
>>> create whatever distribution rule you wish, and if a single disk
>>> fails you just replace it and its data is obviously replicated back
>>> from another one. The only downside of using it this way is that
>>> the replication traffic flows across all servers, but that is not
>>> much of an issue.
>>>
>>> Can anyone elaborate on using RAID + GlusterFS versus JBOD + GlusterFS?
>>>
>>> Thanks
>>> Regards
>>> Fernando
>>>
>>>
>>> On 07/08/2017 03:46, Devin Acosta wrote:
>>>>
>>>> Moacir,
>>>>
>>>> I have recently installed multiple Red Hat Virtualization hosts for
>>>> several different companies, and have dealt with the Red Hat
>>>> Support Team in depth about the optimal configuration for setting
>>>> up GlusterFS most efficiently, and I wanted to share with you what
>>>> I learned.
>>>>
>>>> In general the Red Hat Virtualization team frowns upon using each
>>>> DISK of the system as just a JBOD. Sure, there is some protection
>>>> from having the data replicated; however, the recommendation is to
>>>> use RAID 6 (preferred) or RAID 5, or RAID 1 at the very least.
>>>>
>>>> Here is the direct quote from Red Hat when I asked about RAID and
>>>> Bricks:
>>>>
>>>> "A typical Gluster configuration would use RAID underneath the
>>>> bricks. RAID 6 is most typical as it gives you 2 disk failure
>>>> protection, but RAID 5 could be used too. Once you have the RAIDed
>>>> bricks, you'd then apply the desired replication on top of that.
>>>> The most popular way of doing this would be distributed replicated
>>>> with 2x replication. In general you'll get better performance with
>>>> larger bricks. 12 drives is often a sweet spot. Another option
>>>> would be to create a separate tier using all SSD's."
>>>>
>>>> In order to do SSD tiering, from my understanding you would need
>>>> 1 x NVMe drive in each server, or 4 x SSD for the hot tier (it
>>>> needs to be distributed, replicated for the hot tier if not using
>>>> NVMe). So with you only having 1 SSD drive in each server, I'd
>>>> suggest maybe looking into the NVMe option.
>>>>
>>>> Since you're using only 3 servers, what I'd probably suggest is to
>>>> do (2 Replicas + Arbiter Node). This setup actually doesn't require
>>>> the 3rd server to have big drives at all, as it only stores
>>>> metadata about the files and not an actual full copy.
>>>>
>>>> Please see the attached document that was given to me by Red Hat
>>>> to get more information on this. Hope this information helps you.
>>>>
>>>> --
>>>>
>>>> Devin Acosta, RHCA, RHVCA
>>>> Red Hat Certified Architect
>>>>
>>>> On August 6, 2017 at 7:29:29 PM, Moacir Ferreira
>>>> (moacirferreira at hotmail.com) wrote:
>>>>
>>>>> I intend to assemble an oVirt "pod" made of 3 servers, each
>>>>> with 2 CPU sockets of 12 cores, 256GB RAM, 7 HDD 10K, 1 SSD. The
>>>>> idea is to use GlusterFS to provide HA for the VMs. The 3 servers
>>>>> have a dual 40Gb NIC and a dual 10Gb NIC. So my intention is to
>>>>> create a loop like a server triangle using the 40Gb NICs for
>>>>> access to the virtualization files (VMs .qcow2) and to move VMs
>>>>> around the pod (east/west traffic), while using the 10Gb
>>>>> interfaces for giving services to the outside world (north/south
>>>>> traffic).
>>>>>
>>>>>
>>>>> That said, my main question is: how should I deploy GlusterFS in
>>>>> such an oVirt scenario? Specifically:
>>>>>
>>>>>
>>>>> 1 - Should I create 3 RAID arrays (e.g. RAID 5), one on each oVirt
>>>>> node, and then create a GlusterFS volume using them?
>>>>>
>>>>> 2 - Instead, should I create a JBOD array made of all the servers' disks?
>>>>>
>>>>> 3 - What is the best Gluster configuration to provide HA while
>>>>> not consuming too much disk space?
>>>>>
>>>>> 4 - Does an oVirt hypervisor pod like the one I am planning to
>>>>> build, and the virtualization environment, benefit from tiering
>>>>> when using an SSD disk? And if yes, will Gluster do it by default
>>>>> or do I have to configure it to do so?
>>>>>
>>>>>
>>>>> Bottom line, what is the good practice for using GlusterFS in
>>>>> small pods for enterprises?
>>>>>
>>>>>
>>>>> Your opinion/feedback will be really appreciated!
>>>>>
>>>>> Moacir
>>>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users at ovirt.org
>>>>> http://lists.ovirt.org/mailman/listinfo/users

--
Recogizer Group GmbH

Dr.rer.nat. Erekle Magradze
Lead Big Data Engineering & DevOps
Rheinwerkallee 2, 53227 Bonn
Tel: +49 228 29974555

E-Mail: erekle.magradze at recogizer.de
Web: www.recogizer.com

Recogizer on LinkedIn https://www.linkedin.com/company-beta/10039182/
Follow us on Twitter https://twitter.com/recogizer

-----------------------------------------------------------------
Recogizer Group GmbH
Managing Directors: Oliver Habisch, Carsten Kreutze
Commercial register: Amtsgericht Bonn HRB 20724
Registered office: Bonn; VAT ID: DE294195993



------------------------------


End of Users Digest, Vol 71, Issue 37
*************************************
