[ovirt-users] oVirt self-hosted with NFS on top of gluster

Abi Askushi rightkicktech at gmail.com
Fri Sep 8 08:50:45 UTC 2017


Hi all,

Seems that I am seeing light at the end of the tunnel!

I have done the below:

Added the following option to /etc/glusterfs/glusterd.vol:
option rpc-auth-allow-insecure on

restarted glusterd

Then set the volume option:
gluster volume set vms server.allow-insecure on

Then disabled the option:
gluster volume set vms performance.readdir-ahead off

which seems to have been enabled when desperately testing gluster options.
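
For reference, restarting glusterd and double-checking the two volume options
can be done with something like this (assuming a systemd host):

systemctl restart glusterd
gluster volume get vms server.allow-insecure
gluster volume get vms performance.readdir-ahead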

Then I added the storage domain again with glusterfs (not NFS).
This led to a really big performance boost on VM writes.

The dd went from 10MB/s to 60MB/s!
And the VM disk benchmarks went from a few MB/s to 80 - 100MB/s!

Seems that the two options above, which were missing, did the trick for
gluster.

When checking the VM XML I still see the disk defined as file and not
network. I am not sure that libgfapi is used by ovirt.
I will also create a new VM, just in case, to check the XML again.
Is there any way to confirm the use of libgfapi?
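
One quick way to check, I think, is to look at the disk definition of a running
VM or at the qemu process arguments; a libgfapi disk should show type='network'
with protocol='gluster'. A rough sketch (the VM name is just a placeholder):

virsh -r dumpxml <vm-name> | grep -A 3 '<disk'
ps -ef | grep qemu-kvm | grep -o 'gluster://[^, ]*'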

On Fri, Sep 8, 2017 at 10:08 AM, Abi Askushi <rightkicktech at gmail.com>
wrote:

> I don't see any other bottleneck. CPUs are quite idle. Seems that the load
> is mostly due to high latency on IO.
>
> Reading further the gluster docs:
>
> https://github.com/gluster/glusterfs-specs/blob/master/done/GlusterFS%203.5/libgfapi%20with%20qemu%20libvirt.md
>
> I see that I am missing the following options:
>
> /etc/glusterfs/glusterd.vol :
> option rpc-auth-allow-insecure on
>
> gluster volume set vms server.allow-insecure on
>
> It says that the above allows qemu to use libgfapi.
> When checking the VM XML, I don't see any gluster protocol at the disk
> drive:
>
> <disk type='file' device='disk' snapshot='no'>
>       <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
>       <source file='/rhev/data-center/00000001-0001-0001-0001-000000000311/94741028-e765-4300-a618-c3eeb7dbb7c8/images/222a1312-5efa-4731-8914-9a9d24dccba5/d691e6b3-c8e7-4820-9042-555d30c8a21b'/>
>       <target dev='sda' bus='scsi'/>
>       <serial>222a1312-5efa-4731-8914-9a9d24dccba5</serial>
>       <boot order='1'/>
>       <address type='drive' controller='0' bus='0' target='0' unit='0'/>
>     </disk>
>
>
>
> While the gluster docs mention the below type of disk:
>
> <disk type='network' device='disk'>
>        <driver name='qemu' type='raw' cache='none'/>
>        <source protocol='gluster' name='distrepvol/vm3.img'>
>             <host name='10.70.37.106' port='24007'/>
>         </source>
>        <target dev='vda' bus='virtio'/>
>        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
>     </disk>
>
>
> Does the above indicate that ovirt/qemu is not using libgfapi but FUSE only?
> This could be the reason for such slow performance.
>
>
> On Thu, Sep 7, 2017 at 1:47 PM, Yaniv Kaul <ykaul at redhat.com> wrote:
>
>>
>>
>> On Thu, Sep 7, 2017 at 12:52 PM, Abi Askushi <rightkicktech at gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Sep 7, 2017 at 10:30 AM, Yaniv Kaul <ykaul at redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Sep 7, 2017 at 10:06 AM, Yaniv Kaul <ykaul at redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 6, 2017 at 6:08 PM, Abi Askushi <rightkicktech at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> For a first idea I use:
>>>>>>
>>>>>> dd if=/dev/zero of=testfile bs=1GB count=1
>>>>>>
>>>>>
>>>>> This is an incorrect way to test performance, for various reasons:
>>>>> 1. You are not using oflag=direct, thus not using DirectIO, but using the
>>>>> cache.
>>>>> 2. It's unrealistic - it is very uncommon to write large blocks of
>>>>> zeros (sometimes during FS creation or wiping), and certainly not 1GB at a time.
>>>>> 3. It is a single thread of IO - again, unrealistic for a VM's IO.
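>>>>>
>>>>> A quick direct-IO sanity check would look more like this (still not a real
>>>>> benchmark, just a sketch that avoids the page cache and the huge block size):
>>>>>
>>>>> dd if=/dev/zero of=testfile bs=1M count=1024 oflag=direct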
>>>>>
>>> I forgot to mention that I include oflag=direct in my tests. I agree
>>> though that dd is not the correct way to test, hence I mentioned I just use
>>> it to get a first feel. More tests are done within the VM benchmarking its
>>> disk IO (with tools like IOmeter).
>>>
>>>>> I suggest using fio and such. See https://github.com/pcuzner/fio-tools
>>>>> for example.
>>>>>
>>> Do you have any recommended config file to use for VM workload?
>>>
>>
>> Desktops and Servers VMs behave quite differently, so not really. But the
>> 70/30 job is typically a good baseline.
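>>
>> As a sketch only (the parameters and the test file path below are assumptions,
>> not a recommendation from this thread), a 70/30 random read/write mix with fio
>> could look something like:
>>
>> fio --name=vm-7030 --rw=randrw --rwmixread=70 --bs=4k --iodepth=16 \
>>     --direct=1 --ioengine=libaio --size=1G --runtime=60 --time_based \
>>     --filename=/mnt/vms/fio-test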
>>
>>
>>>
>>>
>>>>>
>>>>>>
>>>>>> When testing on the gluster mount point using the above command I hardly
>>>>>> get 10MB/s. (At the same time the network traffic hardly reaches 100Mbit.)
>>>>>>
>>>>>> When testing out of gluster (for example at /root) I get 600 -
>>>>>> 700MB/s.
>>>>>>
>>>>>
>>>>> That's very fast - from 4 disks doing RAID5? Impressive (unless you
>>>>> use caching!). Are those HDDs or SSDs/NVMe?
>>>>>
>>>>>
>>>> These are SAS disks. But there is also a RAID controller with 1GB cache.
>>>
>>>
>>>>>> When I mount the gluster volume with NFS and test on it I get 90 -
>>>>>> 100 MB/s, (almost 10x from gluster results) which is the max I can get
>>>>>> considering I have only 1 Gbit network for the storage.
>>>>>>
>>>>>> Also, when using glusterfs the general VM performance is very poor
>>>>>> and disk write benchmarks show that it is at least 4 times slower than when
>>>>>> the VM is hosted on the same data store mounted over NFS.
>>>>>>
>>>>>> I don't know why I am hitting such a significant performance penalty,
>>>>>> and every possible tweak that I was able to find out there did not make any
>>>>>> difference to the performance.
>>>>>>
>>>>>> The hardware I am using is pretty decent for the purposes intended:
>>>>>> 3 nodes, each node having 32 GB of RAM, 16 physical CPU cores, and 2
>>>>>> TB of storage in RAID5 (4 disks), of which 1.5 TB are sliced for the data
>>>>>> store of ovirt where VMs are stored.
>>>>>>
>>>>>
>>>> I forgot to ask why you are using RAID 5 with 4 disks and not RAID 10?
>>>> Same usable capacity, higher performance, same protection and faster
>>>> recovery, I believe.
>>>>
>>> Correction: there are 5 disks of 600GB each. The main reason for going with
>>> RAID 5 was the capacity. With RAID 10 I can use only 4 of them and get only
>>> 1.1 TB usable, with RAID 5 I get 2.2 TB usable. I agree going with RAID 10
>>> (+ one additional drive to go with 6 drives) would be better but this is
>>> what I have now.
>>>
>>> Y.
>>>>
>>>>
>>>>> You have not mentioned your NIC speeds. Please ensure all work well,
>>>>> with 10g.
>>>>> Is the network dedicated for Gluster traffic? How are they connected?
>>>>>
>>>>>
>>> I have mentioned that I have 1 Gbit dedicated for the storage. A
>>> different network is used for this and a dedicated 1Gbit switch. The
>>> throughput has been confirmed between all nodes with iperf.
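>>>
>>> A sketch of such a check (the hostname is taken from the brick list below; the
>>> exact invocation may have differed):
>>>
>>> iperf -s                  # on one node
>>> iperf -c gluster1 -t 30   # from another node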
>>>
>>
>> Oh.... With 1Gb, you can't get more than 100+MBps...
>>
>>
>>> I know 10Gbit would be better, but when using native gluster with ovirt
>>> the network pipe was hardly reaching 100Mbps, thus the bottleneck was
>>> gluster and not the network. If I can saturate 1Gbit and I still have
>>> performance issues then I may think to go with 10Gbit. With NFS on top
>>> gluster I see traffic reaching 800Mbit when testing with dd which is much
>>> better.
>>>
>>
>> Agreed. Do you see the bottleneck elsewhere? CPU?
>>
>>
>>>
>>>
>>>>>> The gluster configuration is the following:
>>>>>>
>>>>>
>>>>> Which version of Gluster are you using?
>>>>>
>>>>>
>>> The version is 3.8.12.
>>>
>>
>> I think it's a very old release (near end of life?). I warmly suggest
>> 3.10.x or 3.12.
>> There are performance improvements (AFAIR) in both.
>> Y.
>>
>>
>>>
>>>>>> Volume Name: vms
>>>>>> Type: Replicate
>>>>>> Volume ID: 4513340d-7919-498b-bfe0-d836b5cea40b
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: gluster0:/gluster/vms/brick
>>>>>> Brick2: gluster1:/gluster/vms/brick
>>>>>> Brick3: gluster2:/gluster/vms/brick (arbiter)
>>>>>> Options Reconfigured:
>>>>>> nfs.export-volumes: on
>>>>>> nfs.disable: off
>>>>>> performance.readdir-ahead: on
>>>>>> transport.address-family: inet
>>>>>> performance.quick-read: off
>>>>>> performance.read-ahead: off
>>>>>> performance.io-cache: off
>>>>>> performance.stat-prefetch: on
>>>>>>
>>>>>
>>>>> I think this should be off.
>>>>>
>>>>>
>>>>>> performance.low-prio-threads: 32
>>>>>> network.remote-dio: off
>>>>>>
>>>>>
>>>>> I think this should be enabled.
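>>>>>
>>>>> A sketch of the commands to change these two settings (option names and the
>>>>> volume name are taken from the output; please verify before applying):
>>>>>
>>>>> gluster volume set vms performance.stat-prefetch off
>>>>> gluster volume set vms network.remote-dio enable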
>>>>>
>>>>>
>>>>>> cluster.eager-lock: off
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.data-self-heal-algorithm: full
>>>>>> cluster.locking-scheme: granular
>>>>>> cluster.shd-max-threads: 8
>>>>>> cluster.shd-wait-qlength: 10000
>>>>>> features.shard: on
>>>>>> user.cifs: off
>>>>>> storage.owner-uid: 36
>>>>>> storage.owner-gid: 36
>>>>>> network.ping-timeout: 30
>>>>>> performance.strict-o-direct: on
>>>>>> cluster.granular-entry-heal: enable
>>>>>> features.shard-block-size: 64MB
>>>>>>
>>>>>
>>>>> I'm not sure if this should not be 512MB.  I don't remember the last
>>>>> resolution on this.
>>>>> Y.
>>>>>
>>>>>
>>>>>> performance.client-io-threads: on
>>>>>> client.event-threads: 4
>>>>>> server.event-threads: 4
>>>>>> performance.write-behind-window-size: 4MB
>>>>>> performance.cache-size: 1GB
>>>>>>
>>> I have been playing with all above with very little difference on
>>> performance I was getting.
>>>
>>> In case I can provide any other details let me know.
>>>>>>
>>>>>
>>>>> What is your tuned profile?
>>>>>
>>>>>
>>> The tuned profile is virtual-host.
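>>>
>>> A quick way to check and set it, for reference (standard tuned commands):
>>>
>>> tuned-adm active
>>> tuned-adm profile virtual-host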
>>>
>>>>>> At the moment I have already switched to gluster-based NFS, but I have a
>>>>>> similar setup with 2 nodes where the data store is mounted through gluster
>>>>>> (and again relatively good hardware) where I might check any tweaks or
>>>>>> improvements on this setup.
>>>>>>
>>>>>> Thanx
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 6, 2017 at 5:32 PM, Yaniv Kaul <ykaul at redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 6, 2017 at 3:32 PM, Abi Askushi <rightkicktech at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I've been playing with the ovirt self-hosted engine setup and I even use it
>>>>>>>> in production for several VMs. The setup I have is 3 servers with gluster
>>>>>>>> storage in replica 2+1 (1 arbiter).
>>>>>>>> The data storage domain where VMs are stored is mounted with
>>>>>>>> gluster through ovirt. The performance I get for the VMs is very low and I
>>>>>>>> was thinking of switching and mounting the same storage through NFS instead of
>>>>>>>> glusterfs.
>>>>>>>>
>>>>>>>
>>>>>>> I don't see how it'll improve performance.
>>>>>>> I suggest you share the gluster configuration (as well as the
>>>>>>> storage HW) so we can understand why the performance is low.
>>>>>>> Y.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The only thing I am hesitant about is how to ensure high availability
>>>>>>>> of the storage when I lose one server. I was thinking to have in
>>>>>>>> /etc/hosts something like the below:
>>>>>>>>
>>>>>>>> 10.100.100.1 nfsmount
>>>>>>>> 10.100.100.2 nfsmount
>>>>>>>> 10.100.100.3 nfsmount
>>>>>>>>
>>>>>>>> Then use nfsmount as the server name when adding this domain
>>>>>>>> through the ovirt GUI.
>>>>>>>> Are there any other more elegant solutions? What do you do for such
>>>>>>>> cases?
>>>>>>>> Note: gluster has the backup-volfile-servers mount option, which provides a
>>>>>>>> lean way to have redundancy on the mount point, and I am using this when
>>>>>>>> mounting with glusterfs.
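>>>>>>>>
>>>>>>>> For example, a glusterfs mount using that option might look something like
>>>>>>>> the below (the mount point is just an assumption; the hostnames and volume
>>>>>>>> name are the ones from this setup):
>>>>>>>>
>>>>>>>> mount -t glusterfs -o backup-volfile-servers=gluster1:gluster2 gluster0:/vms /mnt/vms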
>>>>>>>>
>>>>>>>> Thanx
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Users mailing list
>>>>>>>> Users at ovirt.org
>>>>>>>> http://lists.ovirt.org/mailman/listinfo/users
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>