Poor gluster performances over 10Gbps network

Sorry for the double post, but I don't know if this mail has been received.

Hello everyone,

I know this issue was already treated on this mailing list, however none of the proposed solutions satisfies me.

Here is my situation: I've got 3 hyperconverged gluster ovirt nodes, with 6 network interfaces, bonded in pairs (management, VMs and gluster). The gluster network is on a dedicated bond where the 2 interfaces are directly connected to the 2 other ovirt nodes. Gluster is apparently using it:

# gluster volume status vmstore
Status of volume: vmstore
Gluster process                                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster-ov1:/gluster_bricks/vmstore/vmstore   49152     0          Y       3019
Brick gluster-ov2:/gluster_bricks/vmstore/vmstore   49152     0          Y       3009
Brick gluster-ov3:/gluster_bricks/vmstore/vmstore

where 'gluster-ov{1,2,3}' are domain names referencing nodes in the gluster network. This network has 10Gbps capabilities:

# iperf3 -c gluster-ov3
Connecting to host gluster-ov3, port 5201
[  5] local 10.20.0.50 port 46220 connected to 10.20.0.51 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.16 GBytes  9.92 Gbits/sec   17    900 KBytes
[  5]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0    900 KBytes
[  5]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    4    996 KBytes
[  5]   3.00-4.00   sec  1.15 GBytes  9.90 Gbits/sec    1    996 KBytes
[  5]   4.00-5.00   sec  1.15 GBytes  9.89 Gbits/sec    0    996 KBytes
[  5]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
[  5]   6.00-7.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
[  5]   7.00-8.00   sec  1.15 GBytes  9.91 Gbits/sec    0    996 KBytes
[  5]   8.00-9.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
[  5]   9.00-10.00  sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec   22             sender
[  5]   0.00-10.04  sec  11.5 GBytes  9.86 Gbits/sec                  receiver

iperf Done.

However, VMs stored on the vmstore gluster volume have poor write performance, oscillating between 100KBps and 30MBps. I almost always observe a write spike (180Mbps) at the beginning, until around 500MB written, then it drastically falls to 10MBps, sometimes even less (100KBps). Hypervisors have 32 threads (2 sockets, 8 cores per socket, 2 threads per core).

Here are the volume settings:

Volume Name: vmstore
Type: Replicate
Volume ID: XXX
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster-ov1:/gluster_bricks/vmstore/vmstore
Brick2: gluster-ov2:/gluster_bricks/vmstore/vmstore
Brick3: gluster-ov3:/gluster_bricks/vmstore/vmstore
Options Reconfigured:
performance.io-thread-count: 32   # was 16 by default
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 30
server.event-threads: 4
client.event-threads: 8   # was 4 by default
cluster.choose-local: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

When I naively write directly on the logical volume, which is mounted on a hardware RAID5 3-disk array, I get interesting performance:

# dd if=/dev/zero of=a bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 17.2485 s, 498 MB/s
# /dev/urandom as input gives around 200MBps

Moreover, the hypervisors have SSDs which have been configured as lvcache, but I'm unsure how to test it efficiently.

I can't find where the problem is, as every piece of the chain is apparently doing well... Thanks anyone for helping me :)

--
Mathieu Valois
téïcée - https://www.teicee.com
Bureau Caen: Quartier Kœnig - 153, rue Géraldine MOCK - 14760 Bretteville-sur-Odon
Bureau Vitré: Zone de la baratière - 12, route de Domalain - 35500 Vitré
02 72 34 13 20

Hi Mathieu,

How are you measuring the Gluster disk performance? Also, when using dd you should use oflag=dsync to avoid buffer caching.

Regards,
Paul S
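For example, something along these lines (path and size are illustrative only):

# oflag=dsync issues a synchronous write for every output block, so the
# page cache cannot absorb the writes and inflate the result.
dd if=/gluster_bricks/vmstore/ddtest of=/dev/null 2>/dev/null  # warm-up read (optional)
dd if=/dev/zero of=/gluster_bricks/vmstore/ddtest bs=1M count=1024 oflag=dsync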

On Wed, Sep 8, 2021 at 12:15 PM Mathieu Valois <mvalois@teicee.com> wrote:

> Here is my situation: I've got 3 hyperconverged gluster ovirt nodes,
> with 6 network interfaces, bonded in pairs (management, VMs and
> gluster). The gluster network is on a dedicated bond where the 2
> interfaces are directly connected to the 2 other ovirt nodes.
>
> This network has 10Gbps capabilities:
>
> # iperf3 -c gluster-ov3
> [...]
> [  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec   22   sender
> [  5]   0.00-10.04  sec  11.5 GBytes  9.86 Gbits/sec        receiver

Network seems fine.

> However, VMs stored on the vmstore gluster volume have poor write
> performance, oscillating between 100KBps and 30MBps.
>
> Volume Name: vmstore
> Type: Replicate
> Number of Bricks: 1 x 3 = 3

This looks like a replica 3 volume. In this case the VM writes everything 3 times - once per replica. The writes are done in parallel, but the data is sent over the wire 2-3 times (e.g. 2 if one of the bricks is on the local host).

You may get better performance with replica 2 + arbiter:
https://gluster.readthedocs.io/en/latest/Administrator-Guide/arbiter-volumes-and-quorum/#why-arbiter

In this case data is written only to 2 bricks, and the arbiter brick holds only metadata.
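For illustration, creating such a volume uses the documented 'replica 3 arbiter 1' form - a sketch only, with a hypothetical volume name and brick paths reused from above (converting an existing replica 3 volume in place is a separate procedure):

# Data is replicated to the first two bricks; the third (arbiter) brick
# holds only metadata, so only 2 copies of the data cross the wire.
gluster volume create vmstore-arb replica 3 arbiter 1 \
    gluster-ov1:/gluster_bricks/vmstore/vmstore \
    gluster-ov2:/gluster_bricks/vmstore/vmstore \
    gluster-ov3:/gluster_bricks/vmstore/arbiter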
> Transport-type: tcp
> Bricks:
> Brick1: gluster-ov1:/gluster_bricks/vmstore/vmstore
> Brick2: gluster-ov2:/gluster_bricks/vmstore/vmstore
> Brick3: gluster-ov3:/gluster_bricks/vmstore/vmstore
> Options Reconfigured:
> [...]
>
> When I naively write directly on the logical volume, which is mounted
> on a hardware RAID5 3-disk array, I get interesting performance:
>
> # dd if=/dev/zero of=a bs=4M count=2048
> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 17.2485 s, 498 MB/s

There are a few issues with this test:
- you don't use oflag=direct or conv=fsync, so this may test copying data to the host page cache, instead of writing data to storage
- this tests only sequential write, which is the best case for any kind of storage
- it uses synchronous I/O - every write waits for the previous write's completion
- it uses a single process
- 2g is too small, and may test your cache performance

Try to test using fio - attached is a fio script that tests sequential and random I/O with various queue depths. You can use it like this:

fio --filename=/path/to/fio.data --output=test.out bench.fio

Test both on the host and in the VM. This will give you more detailed results that may help to evaluate the issue, and it may help Gluster folks advise on tuning your storage.

Nir
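The bench.fio attachment is not reproduced in the archive. A rough sketch of a job file along the same lines (group names match the results quoted later in the thread; the engine, file size and block sizes are all assumptions):

cat > bench.fio <<'EOF'
[global]
# Assumed settings: direct I/O, async engine, 30-second time-based runs.
ioengine=libaio
direct=1
size=2g
runtime=30
time_based
group_reporting

[seq-write]
rw=write
bs=1m
numjobs=4

[seq-read]
stonewall   ; start only after the previous group finishes
rw=read
bs=1m
numjobs=4

[rand-write-qd32]
stonewall
rw=randwrite
bs=4k
iodepth=32
numjobs=4

[rand-read-qd32]
stonewall
rw=randread
bs=4k
iodepth=32
numjobs=4

[rand-write-qd1]
stonewall
rw=randwrite
bs=4k
iodepth=1

[rand-read-qd1]
stonewall
rw=randread
bs=4k
iodepth=1
EOF

fio --filename=/path/to/fio.data --output=test.out bench.fio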

You can find attached the benchmarks on the host and guest. I find the differences not so big though...

On 09/09/2021 at 13:40, Nir Soffer wrote:
> There are a few issues with this test:
> - you don't use oflag=direct or conv=fsync, so this may test copying data to the host page cache, instead of writing data to storage
> - this tests only sequential write, which is the best case for any kind of storage
> - it uses synchronous I/O - every write waits for the previous write's completion
> - it uses a single process
> - 2g is too small, and may test your cache performance
>
> Try to test using fio - attached is a fio script that tests sequential and random I/O with various queue depths. You can use it like this:
>
> fio --filename=/path/to/fio.data --output=test.out bench.fio
>
> Test both on the host and in the VM. This will give you more detailed results that may help to evaluate the issue, and it may help Gluster folks advise on tuning your storage.
> Nir

--
Mathieu Valois, téïcée

On Thu, Sep 9, 2021 at 4:12 PM Mathieu Valois <mvalois@teicee.com> wrote:
> You can find attached the benchmarks on the host and guest. I find the differences not so big though...
Host is using the gluster mount (/rhev/data-center/mnt/glusterSD/server:_path/...) or writing directly into the same filesystem used by gluster (/bricks/brick1/...)? It will help if you share the output of lsblk and the command line used to run fio on the host.

Comparing host and guest:

seq-write: (groupid=0, jobs=4): err= 0: pid=294433: Thu Sep 9 14:30:14 2021
  write: IOPS=151, BW=153MiB/s (160MB/s)(4628MiB/30280msec); 0 zone resets

I guess the underlying storage is hard disk - 150 MiB/s is not bad, but very low compared with fast SSD.

seq-read: (groupid=1, jobs=4): err= 0: pid=294778: Thu Sep 9 14:30:14 2021
  read: IOPS=7084, BW=7086MiB/s (7430MB/s)(208GiB/30016msec)

You have crazy caching (ignoring the direct I/O?), 7GiB/s read?

rand-write-qd32: (groupid=2, jobs=4): err= 0: pid=295141: Thu Sep 9 14:30:14 2021
  write: IOPS=228, BW=928KiB/s (951kB/s)(28.1MiB/30971msec); 0 zone resets

Very low, probably limited by the hard disks?

rand-read-qd32: (groupid=3, jobs=4): err= 0: pid=296094: Thu Sep 9 14:30:14 2021
  read: IOPS=552k, BW=2157MiB/s (2262MB/s)(63.2GiB/30001msec)

Very high, this is what you get from fast consumer SSD.

rand-write-qd1: (groupid=4, jobs=1): err= 0: pid=296386: Thu Sep 9 14:30:14 2021
  write: IOPS=55, BW=223KiB/s (229kB/s)(6696KiB/30002msec); 0 zone resets

Very low.

rand-read-qd1: (groupid=5, jobs=1): err= 0: pid=296633: Thu Sep 9 14:30:14 2021
  read: IOPS=39.4k, BW=154MiB/s (161MB/s)(4617MiB/30001msec)

Same caching.

If we compare host and guest:

$ grep -B1 IOPS= *.out
guest.out-seq-write: (groupid=0, jobs=4): err= 0: pid=46235: Thu Sep 9 14:18:05 2021
guest.out:   write: IOPS=57, BW=58.8MiB/s (61.6MB/s)(1792MiB/30492msec); 0 zone resets

~33% of host throughput

guest.out-rand-write-qd32: (groupid=2, jobs=4): err= 0: pid=46330: Thu Sep 9 14:18:05 2021
guest.out:   write: IOPS=299, BW=1215KiB/s (1244kB/s)(35.8MiB/30212msec); 0 zone resets

Better than host

guest.out-rand-write-qd1: (groupid=4, jobs=1): err= 0: pid=46552: Thu Sep 9 14:18:05 2021
guest.out:   write: IOPS=213, BW=854KiB/s (875kB/s)(25.0MiB/30003msec); 0 zone resets

Better than host

So you have very fast reads (seq/random), with very slow seq/random writes.

Also it would be interesting to test fsync - this benchmark does not do any fsync, but your slow yum/rpm upgrade likely does one or more fsyncs per package upgrade.

There is an example sync test script here:
https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd
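A minimal fdatasync-latency check in the spirit of that article (the size and block size below are the article's etcd-like workload suggestion; the directory is a placeholder):

# Small sequential writes with an fdatasync after every write; fio then
# reports sync latency percentiles separately from write latency.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/gluster_bricks/vmstore/synctest \
    --size=22m --bs=2300 --name=sync-test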

First of all, many thanks for your analysis!

On 09/09/2021 at 17:06, Nir Soffer wrote:
> On Thu, Sep 9, 2021 at 4:12 PM Mathieu Valois <mvalois@teicee.com> wrote:
>
> Host is using the gluster mount (/rhev/data-center/mnt/glusterSD/server:_path/...) or writing directly into the same filesystem used by gluster (/bricks/brick1/...)?
Into the brick:

# fio --filename=/gluster_bricks/vmstore/fio.data --output=/root/test.out /root/bench.fio
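(For comparison with the client path, the same job file could be pointed at the FUSE mount - the server:_vmstore component below is a placeholder for the actual mount directory:)

# Same benchmark, but through the gluster mount instead of the raw brick.
fio --filename=/rhev/data-center/mnt/glusterSD/<server>:_vmstore/fio.data \
    --output=/root/test-mount.out /root/bench.fio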
> Comparing host and guest:
>
> seq-write: (groupid=0, jobs=4): err= 0: pid=294433: Thu Sep 9 14:30:14 2021
>   write: IOPS=151, BW=153MiB/s (160MB/s)(4628MiB/30280msec); 0 zone resets
>
> I guess the underlying storage is hard disk - 150 MiB/s is not bad, but very low compared with fast SSD.
Yes, LVM is on hard disk RAID with lvcache using SSD.
> seq-read: (groupid=1, jobs=4): err= 0: pid=294778: Thu Sep 9 14:30:14 2021
>   read: IOPS=7084, BW=7086MiB/s (7430MB/s)(208GiB/30016msec)
>
> You have crazy caching (ignoring the direct I/O?), 7GiB/s read?
I've configured hyperconverged gluster with 372GB of SSD cache, but 7GiB/s bandwidth seems surprisingly huge.
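One way to peek at the lvcache is LVM's own reporting fields - a sketch, assuming an LVM2 version recent enough to expose the cache_* columns (LV names are site-specific):

# Hit/miss counters and dirty blocks for cached LVs.
lvs -a -o name,segtype,data_percent,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses,cache_dirty_blocks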
...
> So you have very fast reads (seq/random), with very slow seq/random writes.

This is what I feel too.
> Also it would be interesting to test fsync - this benchmark does not do any fsync, but your slow yum/rpm upgrade likely does one or more fsyncs per package upgrade.
Yes, this is very likely. I've analyzed the output of the gluster volume profiling tool, and the most time-costly operation is FSYNC:

# gluster volume profile vmstore info
 %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls  Fop
 ---------   -----------   -----------    -----------   ------------  ----
      0.00       0.00 us       0.00 us        0.00 us              6  FORGET
      0.00       0.00 us       0.00 us        0.00 us            210  RELEASE
      0.00       0.00 us       0.00 us        0.00 us             42  RELEASEDIR
      0.00     212.06 us      86.19 us      396.24 us              4  READDIRP
      0.00     105.13 us      10.31 us      241.91 us             14  GETXATTR
      0.00     316.95 us     196.80 us      400.89 us              6  CREATE
      0.00     234.14 us      15.50 us      805.85 us             10  READDIR
      0.00     213.87 us     137.89 us      294.45 us             12  UNLINK
      0.00      73.30 us       1.11 us      154.41 us             42  OPENDIR
      0.00     318.69 us     211.02 us      465.69 us             23  MKNOD
      0.00      46.98 us      13.80 us      140.04 us            201  FLUSH
      0.00      98.34 us      43.90 us      308.65 us            204  OPEN
      0.00      66.89 us      21.80 us      140.42 us            398  STATFS
      0.00     588.67 us      47.60 us    38236.37 us             78  FSTAT
      0.02     908.06 us      14.01 us    10798.16 us            249  ENTRYLK
      0.02     167.16 us      15.54 us      476.21 us           1787  LOOKUP
      0.15   31851.02 us      22.39 us   366250.86 us             57  INODELK
      0.16     120.31 us      32.84 us    71638.49 us          15827  FXATTROP
      0.68    1430.51 us      11.70 us  1152275.20 us           5719  FINODELK
      0.71    3211.50 us      47.48 us   191016.03 us           2667  READ
     38.44   50693.63 us     329.41 us   617663.12 us           9134  FSYNC
     59.80   21063.30 us      56.86 us   663984.57 us          34197  WRITE
      0.00       0.00 us       0.00 us        0.00 us             31  UPCALL
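(For a clean measurement window, profiling is typically cycled around the workload - roughly:)

gluster volume profile vmstore start    # begin collecting per-FOP stats
# ... reproduce the slow writes in a VM ...
gluster volume profile vmstore info     # dump latency and call counts per FOP
gluster volume profile vmstore stop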
> There is an example sync test script here:
> https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd
I'll test that script.

Did you enable libgfapi?

engine-config -s LibgfApiSupported=true

Note: power off and then power on the VM. The qemu process should not use the '/rhev' mountpoints.

Also, share your current setup:
- disks
- hw controller
- did you storage-align your block devices (hw raid only)
- tuned profile
- sysctl settings that are changed
- gluster volume options that are changed

Best Regards,
Strahil Nikolov
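A quick way to check which path qemu is using - a sketch; the process name may be qemu-kvm or qemu-system-x86_64 depending on the distro:

# With libgfapi the disk appears as a gluster:// URL on the qemu command
# line; with the FUSE mount it appears under /rhev/.
pgrep -af qemu | grep -oE 'gluster://[^ ,]+|/rhev/[^ ,]+'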

It's my understanding that libgfapi is currently unsupported in oVirt because there were a few long-standing bugs. Unfortunately, the performance improvements that users have reported on this mailing list haven't been seen or replicated by Red Hat, so recently most of these bugzilla tickets have been closed as WONTFIX:

https://bugzilla.redhat.com/show_bug.cgi?id=1633642
https://bugzilla.redhat.com/show_bug.cgi?id=1465810
https://bugzilla.redhat.com/show_bug.cgi?id=1552344

This is a shame, as I think the performance improvements from libgfapi reported in user benchmarks were impressive, and the underlying qemu/libvirt bugs are now fixed or were close to being fixed.

Guillaume Pavese
System and Network Engineer
Interactiv-Group

On Thu, Sep 9, 2021 at 7:00 PM Strahil Nikolov via Users <users@ovirt.org> wrote:
> Did you enable libgfapi?
>
> engine-config -s LibgfApiSupported=true
>
> Note: power off and then power on the VM. The qemu process should not use the '/rhev' mountpoints.
>
> Also, share your current setup:
> - disks
> - hw controller
> - did you storage-align your block devices (hw raid only)
> - tuned profile
> - sysctl settings that are changed
> - gluster volume options that are changed
>
> Best Regards,
> Strahil Nikolov

Most probably RH specialists can tune Gluster so well that libgfapi is useless ... who knows. Actually, oVirt and support are 2 completely different topics. Yes, it has its drawbacks, but it brings a significant performance gain.

Best Regards,
Strahil Nikolov

On Fri, Sep 10, 2021 at 18:19, Guillaume Pavese <guillaume.pavese@interactiv-group.com> wrote:

> It's my understanding that libgfapi is currently unsupported in oVirt because there were a few long-standing bugs. [...]
participants (5)
- Guillaume Pavese
- Mathieu Valois
- Nir Soffer
- Staniforth, Paul
- Strahil Nikolov