Poor gluster performances over 10Gbps network

Sorry for the double post, but I don't know if this mail has been received.

Hello everyone,

I know this issue was already treated on this mailing list, however none of the proposed solutions satisfies me.

Here is my situation: I've got 3 hyperconverged gluster ovirt nodes, with 6 network interfaces, bonded in pairs (management, VMs and gluster). The gluster network is on a dedicated bond where the 2 interfaces are directly connected to the 2 other ovirt nodes. Gluster is apparently using it:

# gluster volume status vmstore
Status of volume: vmstore
Gluster process                                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster-ov1:/gluster_bricks/vmstore/vmstore   49152     0          Y       3019
Brick gluster-ov2:/gluster_bricks/vmstore/vmstore   49152     0          Y       3009
Brick gluster-ov3:/gluster_bricks/vmstore/vmstore

where 'gluster-ov{1,2,3}' are domain names referencing nodes in the gluster network. This network has 10Gbps capabilities:

# iperf3 -c gluster-ov3
Connecting to host gluster-ov3, port 5201
[  5] local 10.20.0.50 port 46220 connected to 10.20.0.51 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.16 GBytes  9.92 Gbits/sec   17    900 KBytes
[  5]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    0    900 KBytes
[  5]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    4    996 KBytes
[  5]   3.00-4.00   sec  1.15 GBytes  9.90 Gbits/sec    1    996 KBytes
[  5]   4.00-5.00   sec  1.15 GBytes  9.89 Gbits/sec    0    996 KBytes
[  5]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
[  5]   6.00-7.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
[  5]   7.00-8.00   sec  1.15 GBytes  9.91 Gbits/sec    0    996 KBytes
[  5]   8.00-9.00   sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
[  5]   9.00-10.00  sec  1.15 GBytes  9.90 Gbits/sec    0    996 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec   22             sender
[  5]   0.00-10.04  sec  11.5 GBytes  9.86 Gbits/sec                  receiver

iperf Done.

However, VMs stored on the vmstore gluster volume have poor write performance, oscillating between 100KBps and 30MBps. I almost always observe a write spike (180Mbps) at the beginning, until around 500MB written, then it drastically falls to 10MBps, sometimes even less (100KBps). Hypervisors have 32 threads (2 sockets, 8 cores per socket, 2 threads per core).

Here are the volume settings:

Volume Name: vmstore
Type: Replicate
Volume ID: XXX
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster-ov1:/gluster_bricks/vmstore/vmstore
Brick2: gluster-ov2:/gluster_bricks/vmstore/vmstore
Brick3: gluster-ov3:/gluster_bricks/vmstore/vmstore
Options Reconfigured:
performance.io-thread-count: 32   # was 16 by default
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
cluster.lookup-optimize: off
server.keepalive-count: 5
server.keepalive-interval: 2
server.keepalive-time: 10
server.tcp-user-timeout: 20
network.ping-timeout: 30
server.event-threads: 4
client.event-threads: 8   # was 4 by default
cluster.choose-local: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
performance.strict-o-direct: on
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
auth.allow: *
user.cifs: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on

When I naively write directly on the logical volume, which is mounted on a hardware RAID5 3-disk array, I get interesting performance:

# dd if=/dev/zero of=a bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 17.2485 s, 498 MB/s
# /dev/urandom as input gives around 200MBps

Moreover, the hypervisors have SSDs which have been configured as lvcache, but I'm unsure how to test it efficiently.

I can't find where the problem is, as every piece of the chain is apparently doing well... Thanks anyone for helping me :)

--
Mathieu Valois
téïcée - https://www.teicee.com
Bureau Caen: Quartier Kœnig - 153, rue Géraldine MOCK - 14760 Bretteville-sur-Odon
Bureau Vitré: Zone de la baratière - 12, route de Domalain - 35500 Vitré
02 72 34 13 20

Hi Mathieu,

How are you measuring the Gluster disk performance? Also, when using dd you should use oflag=dsync to avoid buffer caching.

Regards,
Paul S
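For example, something along these lines (path and size are illustrative only):

# oflag=dsync issues a synchronous write for every output block, so the
# page cache cannot absorb the writes and inflate the result.
dd if=/gluster_bricks/vmstore/ddtest of=/dev/null 2>/dev/null  # warm-up read (optional)
dd if=/dev/zero of=/gluster_bricks/vmstore/ddtest bs=1M count=1024 oflag=dsync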

On Wed, Sep 8, 2021 at 12:15 PM Mathieu Valois <mvalois@teicee.com> wrote:

> Here is my situation: I've got 3 hyperconverged gluster ovirt nodes,
> with 6 network interfaces, bonded in pairs (management, VMs and
> gluster). The gluster network is on a dedicated bond where the 2
> interfaces are directly connected to the 2 other ovirt nodes.
>
> This network has 10Gbps capabilities:
>
> # iperf3 -c gluster-ov3
> [...]
> [  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec   22   sender
> [  5]   0.00-10.04  sec  11.5 GBytes  9.86 Gbits/sec        receiver

Network seems fine.

> However, VMs stored on the vmstore gluster volume have poor write
> performance, oscillating between 100KBps and 30MBps.
>
> Volume Name: vmstore
> Type: Replicate
> Number of Bricks: 1 x 3 = 3

This looks like a replica 3 volume. In this case the VM writes everything 3 times - once per replica. The writes are done in parallel, but the data is sent over the wire 2-3 times (e.g. 2 if one of the bricks is on the local host).

You may get better performance with replica 2 + arbiter:
https://gluster.readthedocs.io/en/latest/Administrator-Guide/arbiter-volumes-and-quorum/#why-arbiter

In this case data is written only to 2 bricks, and the arbiter brick holds only metadata.
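For illustration, creating such a volume uses the documented 'replica 3 arbiter 1' form - a sketch only, with a hypothetical volume name and brick paths reused from above (converting an existing replica 3 volume in place is a separate procedure):

# Data is replicated to the first two bricks; the third (arbiter) brick
# holds only metadata, so only 2 copies of the data cross the wire.
gluster volume create vmstore-arb replica 3 arbiter 1 \
    gluster-ov1:/gluster_bricks/vmstore/vmstore \
    gluster-ov2:/gluster_bricks/vmstore/vmstore \
    gluster-ov3:/gluster_bricks/vmstore/arbiter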
> Transport-type: tcp
> Bricks:
> Brick1: gluster-ov1:/gluster_bricks/vmstore/vmstore
> Brick2: gluster-ov2:/gluster_bricks/vmstore/vmstore
> Brick3: gluster-ov3:/gluster_bricks/vmstore/vmstore
> Options Reconfigured:
> [...]
>
> When I naively write directly on the logical volume, which is mounted
> on a hardware RAID5 3-disk array, I get interesting performance:
>
> # dd if=/dev/zero of=a bs=4M count=2048
> 8589934592 bytes (8.6 GB, 8.0 GiB) copied, 17.2485 s, 498 MB/s

There are a few issues with this test:
- you don't use oflag=direct or conv=fsync, so this may test copying data to the host page cache, instead of writing data to storage
- this tests only sequential write, which is the best case for any kind of storage
- it uses synchronous I/O - every write waits for the previous write's completion
- it uses a single process
- 2g is too small, and may test your cache performance

Try to test using fio - attached is a fio script that tests sequential and random I/O with various queue depths. You can use it like this:

fio --filename=/path/to/fio.data --output=test.out bench.fio

Test both on the host and in the VM. This will give you more detailed results that may help to evaluate the issue, and it may help Gluster folks advise on tuning your storage.

Nir
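The bench.fio attachment is not reproduced in the archive. A rough sketch of a job file along the same lines (group names match the results quoted later in the thread; the engine, file size and block sizes are all assumptions):

cat > bench.fio <<'EOF'
[global]
# Assumed settings: direct I/O, async engine, 30-second time-based runs.
ioengine=libaio
direct=1
size=2g
runtime=30
time_based
group_reporting

[seq-write]
rw=write
bs=1m
numjobs=4

[seq-read]
stonewall   ; start only after the previous group finishes
rw=read
bs=1m
numjobs=4

[rand-write-qd32]
stonewall
rw=randwrite
bs=4k
iodepth=32
numjobs=4

[rand-read-qd32]
stonewall
rw=randread
bs=4k
iodepth=32
numjobs=4

[rand-write-qd1]
stonewall
rw=randwrite
bs=4k
iodepth=1

[rand-read-qd1]
stonewall
rw=randread
bs=4k
iodepth=1
EOF

fio --filename=/path/to/fio.data --output=test.out bench.fio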

You can find attached the benchmarks on the host and guest. I find the differences not so big though...

On 09/09/2021 at 13:40, Nir Soffer wrote:
> There are a few issues with this test:
> - you don't use oflag=direct or conv=fsync, so this may test copying data to the host page cache, instead of writing data to storage
> - this tests only sequential write, which is the best case for any kind of storage
> - it uses synchronous I/O - every write waits for the previous write's completion
> - it uses a single process
> - 2g is too small, and may test your cache performance
>
> Try to test using fio - attached is a fio script that tests sequential and random I/O with various queue depths. You can use it like this:
>
> fio --filename=/path/to/fio.data --output=test.out bench.fio
>
> Test both on the host and in the VM. This will give you more detailed results that may help to evaluate the issue, and it may help Gluster folks advise on tuning your storage.
> Nir

--
Mathieu Valois, téïcée

On Thu, Sep 9, 2021 at 4:12 PM Mathieu Valois <mvalois@teicee.com> wrote:
> You can find attached the benchmarks on the host and guest. I find the differences not so big though...
Host is using the gluster mount (/rhev/data-center/mnt/glusterSD/server:_path/...) or writing directly into the same filesystem used by gluster (/bricks/brick1/...)? It will help if you share the output of lsblk and the command line used to run fio on the host.

Comparing host and guest:

seq-write: (groupid=0, jobs=4): err= 0: pid=294433: Thu Sep 9 14:30:14 2021
  write: IOPS=151, BW=153MiB/s (160MB/s)(4628MiB/30280msec); 0 zone resets

I guess the underlying storage is hard disk - 150 MiB/s is not bad, but very low compared with fast SSD.

seq-read: (groupid=1, jobs=4): err= 0: pid=294778: Thu Sep 9 14:30:14 2021
  read: IOPS=7084, BW=7086MiB/s (7430MB/s)(208GiB/30016msec)

You have crazy caching (ignoring the direct I/O?), 7GiB/s read?

rand-write-qd32: (groupid=2, jobs=4): err= 0: pid=295141: Thu Sep 9 14:30:14 2021
  write: IOPS=228, BW=928KiB/s (951kB/s)(28.1MiB/30971msec); 0 zone resets

Very low, probably limited by the hard disks?

rand-read-qd32: (groupid=3, jobs=4): err= 0: pid=296094: Thu Sep 9 14:30:14 2021
  read: IOPS=552k, BW=2157MiB/s (2262MB/s)(63.2GiB/30001msec)

Very high, this is what you get from fast consumer SSD.

rand-write-qd1: (groupid=4, jobs=1): err= 0: pid=296386: Thu Sep 9 14:30:14 2021
  write: IOPS=55, BW=223KiB/s (229kB/s)(6696KiB/30002msec); 0 zone resets

Very low.

rand-read-qd1: (groupid=5, jobs=1): err= 0: pid=296633: Thu Sep 9 14:30:14 2021
  read: IOPS=39.4k, BW=154MiB/s (161MB/s)(4617MiB/30001msec)

Same caching.

If we compare host and guest:

$ grep -B1 IOPS= *.out
guest.out-seq-write: (groupid=0, jobs=4): err= 0: pid=46235: Thu Sep 9 14:18:05 2021
guest.out:   write: IOPS=57, BW=58.8MiB/s (61.6MB/s)(1792MiB/30492msec); 0 zone resets

~33% of host throughput

guest.out-rand-write-qd32: (groupid=2, jobs=4): err= 0: pid=46330: Thu Sep 9 14:18:05 2021
guest.out:   write: IOPS=299, BW=1215KiB/s (1244kB/s)(35.8MiB/30212msec); 0 zone resets

Better than host

guest.out-rand-write-qd1: (groupid=4, jobs=1): err= 0: pid=46552: Thu Sep 9 14:18:05 2021
guest.out:   write: IOPS=213, BW=854KiB/s (875kB/s)(25.0MiB/30003msec); 0 zone resets

Better than host

So you have very fast reads (seq/random), with very slow seq/random writes.

Also it would be interesting to test fsync - this benchmark does not do any fsync, but your slow yum/rpm upgrade likely does one or more fsyncs per package upgrade.

There is an example sync test script here:
https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd
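A minimal fdatasync-latency check in the spirit of that article (the size and block size below are the article's etcd-like workload suggestion; the directory is a placeholder):

# Small sequential writes with an fdatasync after every write; fio then
# reports sync latency percentiles separately from write latency.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/gluster_bricks/vmstore/synctest \
    --size=22m --bs=2300 --name=sync-test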

First of all, many thanks for your analysis!

On 09/09/2021 at 17:06, Nir Soffer wrote:
> On Thu, Sep 9, 2021 at 4:12 PM Mathieu Valois <mvalois@teicee.com> wrote:
>
> Host is using the gluster mount (/rhev/data-center/mnt/glusterSD/server:_path/...) or writing directly into the same filesystem used by gluster (/bricks/brick1/...)?
Into the brick:

# fio --filename=/gluster_bricks/vmstore/fio.data --output=/root/test.out /root/bench.fio
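(For comparison with the client path, the same job file could be pointed at the FUSE mount - the server:_vmstore component below is a placeholder for the actual mount directory:)

# Same benchmark, but through the gluster mount instead of the raw brick.
fio --filename=/rhev/data-center/mnt/glusterSD/<server>:_vmstore/fio.data \
    --output=/root/test-mount.out /root/bench.fio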
> Comparing host and guest:
>
> seq-write: (groupid=0, jobs=4): err= 0: pid=294433: Thu Sep 9 14:30:14 2021
>   write: IOPS=151, BW=153MiB/s (160MB/s)(4628MiB/30280msec); 0 zone resets
>
> I guess the underlying storage is hard disk - 150 MiB/s is not bad, but very low compared with fast SSD.
Yes, LVM is on hard disk RAID with lvcache using SSD.
> seq-read: (groupid=1, jobs=4): err= 0: pid=294778: Thu Sep 9 14:30:14 2021
>   read: IOPS=7084, BW=7086MiB/s (7430MB/s)(208GiB/30016msec)
>
> You have crazy caching (ignoring the direct I/O?), 7GiB/s read?
I've configured hyperconverged gluster with 372GB of SSD cache, but 7GiB/s bandwidth seems surprisingly huge.
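One way to peek at the lvcache is LVM's own reporting fields - a sketch, assuming an LVM2 version recent enough to expose the cache_* columns (LV names are site-specific):

# Hit/miss counters and dirty blocks for cached LVs.
lvs -a -o name,segtype,data_percent,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses,cache_dirty_blocks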
...
> So you have very fast reads (seq/random), with very slow seq/random writes.

This is what I feel too.
> Also it would be interesting to test fsync - this benchmark does not do any fsync, but your slow yum/rpm upgrade likely does one or more fsyncs per package upgrade.
Yes, this is very likely. I've analyzed the output of the gluster volume profiling tool, and the most time-costly operation is FSYNC:

# gluster volume profile vmstore info
 %-latency   Avg-latency   Min-Latency    Max-Latency   No. of calls  Fop
 ---------   -----------   -----------    -----------   ------------  ----
      0.00       0.00 us       0.00 us        0.00 us              6  FORGET
      0.00       0.00 us       0.00 us        0.00 us            210  RELEASE
      0.00       0.00 us       0.00 us        0.00 us             42  RELEASEDIR
      0.00     212.06 us      86.19 us      396.24 us              4  READDIRP
      0.00     105.13 us      10.31 us      241.91 us             14  GETXATTR
      0.00     316.95 us     196.80 us      400.89 us              6  CREATE
      0.00     234.14 us      15.50 us      805.85 us             10  READDIR
      0.00     213.87 us     137.89 us      294.45 us             12  UNLINK
      0.00      73.30 us       1.11 us      154.41 us             42  OPENDIR
      0.00     318.69 us     211.02 us      465.69 us             23  MKNOD
      0.00      46.98 us      13.80 us      140.04 us            201  FLUSH
      0.00      98.34 us      43.90 us      308.65 us            204  OPEN
      0.00      66.89 us      21.80 us      140.42 us            398  STATFS
      0.00     588.67 us      47.60 us    38236.37 us             78  FSTAT
      0.02     908.06 us      14.01 us    10798.16 us            249  ENTRYLK
      0.02     167.16 us      15.54 us      476.21 us           1787  LOOKUP
      0.15   31851.02 us      22.39 us   366250.86 us             57  INODELK
      0.16     120.31 us      32.84 us    71638.49 us          15827  FXATTROP
      0.68    1430.51 us      11.70 us  1152275.20 us           5719  FINODELK
      0.71    3211.50 us      47.48 us   191016.03 us           2667  READ
     38.44   50693.63 us     329.41 us   617663.12 us           9134  FSYNC
     59.80   21063.30 us      56.86 us   663984.57 us          34197  WRITE
      0.00       0.00 us       0.00 us        0.00 us             31  UPCALL
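(For a clean measurement window, profiling is typically cycled around the workload - roughly:)

gluster volume profile vmstore start    # begin collecting per-FOP stats
# ... reproduce the slow writes in a VM ...
gluster volume profile vmstore info     # dump latency and call counts per FOP
gluster volume profile vmstore stop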
> There is an example sync test script here:
> https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd
I'll test that script.

Did you enable libgfapi?

engine-config -s LibgfApiSupported=true

Note: power off and then power on the VM. The qemu process should not use the '/rhev' mountpoints.

Also, share your current setup:
- disks
- hw controller
- did you storage-align your block devices (hw raid only)
- tuned profile
- sysctl settings that are changed
- gluster volume options that are changed

Best Regards,
Strahil Nikolov
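A quick way to check which path qemu is using - a sketch; the process name may be qemu-kvm or qemu-system-x86_64 depending on the distro:

# With libgfapi the disk appears as a gluster:// URL on the qemu command
# line; with the FUSE mount it appears under /rhev/.
pgrep -af qemu | grep -oE 'gluster://[^ ,]+|/rhev/[^ ,]+'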

It's my understanding that libgfapi is currently unsupported in oVirt because there were a few long-standing bugs. Unfortunately, the performance improvements that users have reported on this mailing list haven't been seen or replicated by Red Hat, so recently most of these bugzilla tickets have been closed as WONTFIX:

https://bugzilla.redhat.com/show_bug.cgi?id=1633642
https://bugzilla.redhat.com/show_bug.cgi?id=1465810
https://bugzilla.redhat.com/show_bug.cgi?id=1552344

This is a shame, as I think the performance improvements from libgfapi reported in user benchmarks were impressive, and the underlying qemu/libvirt bugs are now fixed or were close to being fixed.

Guillaume Pavese
System and Network Engineer
Interactiv-Group

On Thu, Sep 9, 2021 at 7:00 PM Strahil Nikolov via Users <users@ovirt.org> wrote:
> Did you enable libgfapi?
>
> engine-config -s LibgfApiSupported=true
>
> Note: power off and then power on the VM. The qemu process should not use the '/rhev' mountpoints.
>
> Also, share your current setup:
> - disks
> - hw controller
> - did you storage-align your block devices (hw raid only)
> - tuned profile
> - sysctl settings that are changed
> - gluster volume options that are changed
>
> Best Regards,
> Strahil Nikolov

Most probably RH specialists can tune Gluster so well that libgfapi is useless ... who knows. Actually, oVirt and support are 2 completely different topics. Yes, it has its drawbacks, but it brings a significant performance gain.

Best Regards,
Strahil Nikolov

On Fri, Sep 10, 2021 at 18:19, Guillaume Pavese <guillaume.pavese@interactiv-group.com> wrote:

> It's my understanding that libgfapi is currently unsupported in oVirt because there were a few long-standing bugs. [...]
participants (5)
- Guillaume Pavese
- Mathieu Valois
- Nir Soffer
- Staniforth, Paul
- Strahil Nikolov