On Wed, Sep 8, 2021 at 12:15 PM Mathieu Valois <mvalois@teicee.com> wrote:

Sorry for the double post, but I don't know whether this mail was received.

Hello everyone,

I know this issue has already been discussed on this mailing list. However, none of the proposed solutions satisfies me.

Here is my situation: I've got 3 hyperconverged Gluster oVirt nodes, each with 6 network interfaces, bonded in pairs (management, VMs and Gluster). The Gluster network is on a dedicated bond whose 2 interfaces are directly connected to the 2 other oVirt nodes. Gluster is apparently using it:

# gluster volume status vmstore
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster-ov1:/gluster_bricks
/vmstore/vmstore                            49152     0          Y       3019
Brick gluster-ov2:/gluster_bricks
/vmstore/vmstore                            49152     0          Y       3009
Brick gluster-ov3:/gluster_bricks
/vmstore/vmstore

where 'gluster-ov{1,2,3}' are domain names referencing the nodes on the Gluster network. This network has 10 Gbps capacity:

# iperf3 -c gluster-ov3
Connecting to host gluster-ov3, port 5201
[  5] local 10.20.0.50 port 46220 connected to 10.20.0.51 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.16 GBytes  9.92 Gbits/sec   17    900 KBytes      
[  5]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0    900 KBytes      
[  5]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    4    996 KBytes      
[  5]   3.00-4.00   sec  1.15 GBytes  9.90 Gbits/sec    1    996 KBytes      
[  5]   4.00-5.00   sec  1.15 GBytes  9.89 Gbits/sec    0    996 KBytes      
[  5]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes      
[  5]   6.00-7.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes      
[  5]   7.00-8.00   sec  1.15 GBytes  9.91 Gbits/sec    0    996 KBytes      
[  5]   8.00-9.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes      
[  5]   9.00-10.00  sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes      
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec   22             sender
[  5]   0.00-10.04  sec  11.5 GBytes  9.86 Gbits/sec                  receiver

iperf Done.


Network seems fine.

However, VMs stored on the vmstore Gluster volume have poor write performance, oscillating between 100 KBps and 30 MBps. I almost always observe a write spike (180 Mbps) at the beginning, until around 500 MB has been written; then it drastically falls to 10 MBps, sometimes even less (100 KBps). The hypervisors have 32 threads (2 sockets, 8 cores per socket, 2 threads per core).

Here are the volume settings:

Volume Name: vmstore
Type: Replicate
Volume ID: XXX
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3


This looks like a replica 3 volume. In this case the VM writes everything
3 times - once per replica. The writes are done in parallel, but the data
is sent over the wire 2-3 times (e.g. 2 if one of the bricks is on the local host).
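As a back-of-the-envelope illustration of that write amplification (the numbers are illustrative assumptions, not measurements: one 10 Gbps link, one full copy sent to each remote brick):

```python
# Rough estimate of effective client write bandwidth on a replica
# volume: the client sends one copy of the data to each brick that is
# not on the local host, so the NIC bandwidth is split between the
# remote copies. Assumes the NIC, not the disks, is the bottleneck.

def effective_write_bandwidth(link_gbps: float, replica_count: int,
                              local_bricks: int = 0) -> float:
    """Gbps of useful payload per second seen by the writer."""
    remote_copies = replica_count - local_bricks
    return link_gbps / remote_copies

# replica 3, no local brick (cluster.choose-local: off): 3 copies on the wire
print(effective_write_bandwidth(10.0, 3))   # ~3.33 Gbps of payload
# replica 2 + arbiter: only 2 full data copies
print(effective_write_bandwidth(10.0, 2))   # 5.0 Gbps of payload
```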

You may get better performance with replica 2 + arbiter:
https://gluster.readthedocs.io/en/latest/Administrator-Guide/arbiter-volumes-and-quorum/#why-arbiter 

In this case data is written only to 2 bricks, and the arbiter brick holds only
metadata.
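For reference, creating such a volume from scratch uses the `arbiter` keyword; the brick paths below are placeholders, and converting an existing volume is a different procedure:

```
gluster volume create vmstore replica 3 arbiter 1 \
    gluster-ov1:/gluster_bricks/vmstore/vmstore \
    gluster-ov2:/gluster_bricks/vmstore/vmstore \
    gluster-ov3:/gluster_bricks/vmstore/arbiter
```

The third brick (the arbiter) stores only file names and metadata, so it needs very little space but still provides quorum against split-brain.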

Transport-type: tcp
Bricks:
Brick1: gluster-ov1:/gluster_bricks/vmstore/vmstore
Brick2: gluster-ov2:/gluster_bricks/vmstore/vmstore
Brick3: gluster-ov3:/gluster_bricks/vmstore/vmstore
Options Reconfigured:
performance.io-thread-count: 32 # was 16 by default.
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 30
server.event-threads: 4
client.event-threads: 8 # was 4 by default
cluster.choose-local: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

When I naively write directly to the logical volume, which is mounted on a hardware RAID5 3-disk array, I get interesting performance:

# dd if=/dev/zero of=a bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 17.2485 s, 498 MB/s #urandom gives around 200MBps

There are a few issues with this test:
- it doesn't use oflag=direct or conv=fsync, so it may be testing copying data
   to the host page cache instead of writing data to storage
- it tests only sequential writes, which is the best case for any kind of storage
- it uses synchronous I/O - every write waits for the previous write to complete
- it uses a single process
- 8g is still fairly small, and may mostly test your cache performance
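A fairer dd variant addresses the first point: conv=fsync makes dd include the final fsync in the timing, so the page cache cannot hide the real storage speed (the path and size in the commented example are placeholders):

```shell
# Wrap dd so the final fsync is included in the reported throughput.
# oflag=direct would bypass the page cache entirely, but is not
# supported on every filesystem; conv=fsync works everywhere.
write_test() {
    out=$1; nblocks=$2               # nblocks blocks of 4 MiB each
    dd if=/dev/zero of="$out" bs=4M count="$nblocks" conv=fsync
}

# Example (placeholder path, 2048 x 4 MiB = 8 GiB as in the dd above):
# write_test ./dd-test.bin 2048
```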

Try testing with fio - attached is a fio script that tests sequential and random I/O
with various queue depths.
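In case the attachment doesn't come through, a minimal job file along these lines (an illustration, not the attached script) exercises the interesting cases:

```ini
; Illustrative fio job file: sequential and random writes at two queue
; depths, with direct I/O so the page cache is bypassed. The file to
; write is supplied via --filename on the command line.
[global]
ioengine=libaio
direct=1
size=4g
runtime=60
time_based

[seq-write-qd1]
rw=write
bs=1m
iodepth=1

[rand-write-qd16]
stonewall
rw=randwrite
bs=4k
iodepth=16
```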

You can use it like this:

    fio --filename=/path/to/fio.data --output=test.out bench.fio

Test both on the host and in the VM. This will give you more detailed
results that may help evaluate the issue, and may help the Gluster
folks advise on tuning your storage.

Nir