
Hey folks,

gluster-related question: I have SSDs in a RAID that can do 2 GB/s writes and reads (actually above, but meh) in a 3-way HCI cluster connected over 10 GBit, and things are pretty slow inside gluster. I have these settings:

Options Reconfigured:
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.shd-max-threads: 8
features.shard: on
features.shard-block-size: 64MB
server.event-threads: 8
user.cifs: off
cluster.shd-wait-qlength: 10000
cluster.locking-scheme: granular
cluster.eager-lock: enable
performance.low-prio-threads: 32
network.ping-timeout: 30
cluster.granular-entry-heal: enable
storage.owner-gid: 36
storage.owner-uid: 36
cluster.choose-local: true
client.event-threads: 16
performance.strict-o-direct: on
network.remote-dio: enable
performance.client-io-threads: on
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.readdir-optimize: on
cluster.metadata-self-heal: on
cluster.data-self-heal: on
cluster.entry-self-heal: on
cluster.data-self-heal-algorithm: full
features.uss: enable
features.show-snapshot-directory: on
features.barrier: disable
auto-delete: enable
snap-activate-on-create: enable

Writing inside /gluster_bricks yields those 2 GB/s writes; reading the same. Inside the /rhev/data-center/mnt/glusterSD/ dir, reads go down to 366 MB/s while writes plummet to 200 MB/s.

Summed up: writing into the SSD RAID in the lvm/xfs gluster brick directory is fast; writing into the mounted gluster dir is horribly slow.

The above can be seen and repeated on all 3 servers. The network can do the full 10 GBit (tested with, among others: rsync and iperf3).

Anyone with some idea on what's missing / going on here?

Thanks folks, as always stay safe and healthy!

--
with kind regards, mit freundlichen Gruessen,
Christian Reiss

I too struggle with speed issues in HCI. Latency is a big problem with writes for me, especially when dealing with small-file workloads. How are you testing exactly?

Look into enabling libgfapi and try some comparisons with that. People have been saying it's much faster, but it's not a default option and has a few bugs. Red Hat devs do not appear to be giving its implementation any priority, unfortunately.

I've been considering switching to NFS storage because I'm seeing much better performance in testing with it. I have some NVMe drives on the way and am curious how they would perform in HCI, but I'm thinking the issue is not a disk bottleneck (that appears very obvious in your case as well).

On March 24, 2020 12:08:08 AM GMT+02:00, Jayme <jaymef@gmail.com> wrote:
Hey Chris,

You got some options:

1. To speed up the reads in HCI, you can use the option cluster.choose-local: on
2. You can adjust the server and client event-threads.
3. You can use NFS Ganesha (which connects to all servers via libgfapi) as an NFS server. In that case you have to use some clustering like ctdb or pacemaker. Note: disable cluster.choose-local if you use this one.
4. You can try the built-in NFS, although it's deprecated (NFS Ganesha is fully supported).
5. Create a gluster profile during the tests. I have seen numerous improperly selected tests, so test with a real-world workload. Synthetic tests are not good.

Best Regards,
Strahil Nikolov
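For reference, the profiling mentioned in point 5 uses the standard gluster CLI; a minimal sketch, assuming a volume named 'data' (substitute your own volume name):

    # start collecting per-brick latency and FOP statistics
    gluster volume profile data start

    # ... run the real workload for a while ...

    # dump cumulative and interval stats, then stop profiling
    gluster volume profile data info
    gluster volume profile data stop

The 'info' output lists average/min/max latency per file operation (LOOKUP, WRITE, FSYNC, ...), which usually shows whether the time is lost in the network, in locking, or on the bricks themselves.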

Hey Strahil,

seems you're the go-to-guy with pretty much all my issues. I thank you for this and your continued support. Much appreciated.

200 MB/s reads, however, seem more like a broken config or a malfunctioning gluster than something requiring performance tweaks. I enabled profiling, so I have real-life data available. But seriously, even without tweaks I would like (need) 4 times those numbers. 800 MB/s write speed would be okay'ish, given that the 10 GBit backbone can be the limiting factor.

We are running BigCouch/CouchDB applications that really, really need IO. Not in throughput, but in response times. 200 MB/s is just way off.

It feels as if gluster can/should do more, natively.

-Chris.

On March 24, 2020 11:20:10 AM GMT+02:00, Christian Reiss <email@christian-reiss.de> wrote:
Hey Chris,

What type is your VM? Try with a 'High Performance' one (there is good RH documentation on that topic).

If the DB load was directly on gluster, you could use the settings in '/var/lib/glusterd/groups/db-workload' to optimize that, but I'm not sure if this will bring any performance gain on a VM.

1. Check the VM disk scheduler. Use 'noop/none' (depending on whether multiqueue is enabled) to allow the hypervisor to aggregate the I/O requests from multiple VMs. Next, set the 'noop/none' disk scheduler on the hosts - these two are the optimal ones for SSDs and NVMe disks (if I recall correctly you are using SSDs).
2. Disable C-states on the host and guest (there are a lot of articles about that).
3. Enable MTU 9000 for the hypervisor (gluster node).
4. You can try setting/unsetting the tunables in the db-workload group and run benchmarks with a real workload.
5. Some users reported that enabling TCP offload on the hosts gave a huge improvement in gluster performance - you can try that. Of course there are mixed feelings, as others report that disabling it brings performance. I guess it is workload specific.
6. You can try to tune 'performance.read-ahead' on your gluster volume.

Here are some settings of some users /from an old e-mail/:

performance.read-ahead: on
performance.stat-prefetch: on
performance.flush-behind: on
performance.client-io-threads: on
performance.write-behind-window-size: 64MB (shard size)

For a 48 cores / host:

server.event-threads: 4
client.event-threads: 8

Your event-threads seem to be too high. And yes, the documentation explains it, but without an example it becomes more confusing.

Best Regards,
Strahil Nikolov
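Points 1 and 3 can be checked and applied from a shell; a sketch, assuming a brick device sdb and a storage NIC enp5s0 (both placeholder names; neither change persists across reboots by itself):

    # show the active scheduler (the one in brackets), then switch it
    cat /sys/block/sdb/queue/scheduler
    echo none > /sys/block/sdb/queue/scheduler

    # raise the MTU on the gluster/storage interface and verify
    ip link set dev enp5s0 mtu 9000
    ip link show enp5s0 | grep mtu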

Christian,

Adding on to Strahil's notes, make sure you're using jumbo MTUs on servers and client host nodes. Making sure you're using appropriate disk schedulers on hosts and VMs is important; it's worth double-checking that it's doing what you think it is. If you are only HCI, gluster's choose-local: on is a good thing, but try

cluster.choose-local: false
cluster.read-hash-mode: 3

if you have separate servers or nodes which are not HCI, to allow it to spread reads over multiple nodes.

Test out these settings if you have lots of RAM and cores on your servers; they work well for me with 20 cores and 64GB RAM on my servers with my load:

performance.io-thread-count: 64
performance.low-prio-threads: 32

These are worth testing for your workload.

If you're running VMs with these, test out libgfapi connections; it's significantly better for IO latency than plain fuse mounts. That is, if you can tolerate the issues - the biggest one at the moment being that you can't take snapshots of the VMs with it enabled, as of March.

If you have tuned available, I use throughput-performance on my servers and guest-host on my VM nodes, throughput-performance on some HCI ones.

I'd test without the fips-rchecksum setting; that may be creating extra work for your servers.

If you mounted individual bricks, check that you disabled barriers on them at mount, if appropriate.

Hope it helps,

-Darrell
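The options above are applied per volume with 'gluster volume set'; a minimal sketch, assuming a volume named 'data' (a placeholder):

    gluster volume set data cluster.choose-local false
    gluster volume set data cluster.read-hash-mode 3
    gluster volume set data performance.io-thread-count 64
    gluster volume set data performance.low-prio-threads 32

    # confirm what is actually active
    gluster volume get data all | grep -E 'choose-local|read-hash|io-thread|low-prio'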

On March 24, 2020 7:33:16 PM GMT+02:00, Darrell Budic <budic@onholyground.com> wrote:
When talking about mounts, you can avoid SELinux lookups via the 'context=system_u:object_r:glusterd_brick_t:s0' mount option for all bricks. This way the kernel will reduce the requests to the bricks. Also, 'noatime' is a default mount option (relatime is also a good one) for HCI gluster bricks.

It seems you have a lot of checks to do :)

Best Regards,
Strahil Nikolov
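For illustration only, a brick entry in /etc/fstab combining those mount options might look like this (device path and mount point are placeholders):

    # a fixed SELinux context avoids per-file context lookups;
    # noatime avoids an inode write on every read
    /dev/gluster_vg/gluster_lv  /gluster_bricks/data  xfs  rw,noatime,context=system_u:object_r:glusterd_brick_t:s0  0 0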

Red Hat also recommends a shard size of 512MB; it's actually the only shard size they support. Also check the chunk size on the LVM thin pools running the bricks - it should be at least 2MB. Note that changing the shard size only applies to new VM disks after the change. Changing the chunk size requires making a new brick.

libgfapi brings a huge performance boost; in my opinion it's almost a necessity unless you have a ton of extra disk speed / network throughput. Just be aware of the caveats.
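A sketch of how those two values can be set and inspected, assuming a volume named 'data' and a thin pool gluster_vg/gluster_thinpool (all placeholder names); as noted above, the shard size only affects newly created images:

    # set the recommended 512MB shard size on the volume
    gluster volume set data features.shard-block-size 512MB

    # inspect the chunk size of the thin pool backing the brick
    lvs -o lv_name,chunksize gluster_vg/gluster_thinpool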

I strongly believe that the FUSE mount is the real reason for poor performance in HCI, and these minor gluster and other tweaks won't satisfy most seeking I/O performance. Enabling libgfapi is probably the best option. Red Hat has recently closed bug reports related to libgfapi citing won't fix, and one comment suggests that libgfapi was not showing good enough performance to bother with, which appears to contradict what many oVirt users are seeing. It's confusing to me why libgfapi as a default option is not being given any priority.

https://bugzilla.redhat.com/show_bug.cgi?id=1465810

"We do not plan to enable libgfapi for oVirt/RHV. We did not find enough performance improvement justification for it"
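One way to see which path a running VM actually uses is the disk element in its libvirt XML; a sketch, with 'myvm' as a placeholder domain name:

    virsh -r dumpxml myvm | grep -A2 '<disk'

    # FUSE: the disk is a plain file under the gluster mount point, e.g.
    #   <source file='/rhev/data-center/mnt/glusterSD/...'/>
    # libgfapi: the disk is a network disk using the gluster protocol, e.g.
    #   <source protocol='gluster' name='data/...'>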

Hey Jayme,

thanks for replying; sorry for the delay.

If I am understanding this right, there is no real official way to enable libgfapi. If you somehow manage to get it running, then you will lose HA capabilities, which is something we like on our production servers.

The most recent post I could find on the matter (https://www.mail-archive.com/users@ovirt.org/msg59664.html) reads like it's worth a try for hobbyists, but for production servers I am a little bit scared.

Do you maybe have any document or other source that does work with 4.3.x versions and inspires confidence? :-)

-Chris

On March 27, 2020 11:26:25 AM GMT+02:00, Christian Reiss <email@christian-reiss.de> wrote:
Hey All,

Direct libvirt access via libgfapi causes the loss of some features, but it is not the only option. You can always use NFS Ganesha, which uses libgfapi to reach the gluster servers while providing access via NFS.

Best Regards,
Strahil Nikolov

Hey Alex,

you too, thanks for writing.

I'm on 64MB as per the default for oVirt. We tried no sharding, 128MB sharding, and 64MB sharding (always with copying the disk). There was no increase or decrease in disk speed in any way.

Besides losing HA capabilities, what other caveats?

-Chris.

On 2020-03-27 05:28, Christian Reiss wrote:
You don't lose HA, you just lose live migration between separate data centers or between gluster volumes. Live migration between nodes in the same DC / gluster volume still works fine. Some people have snapshot issues; I don't, but plan for problems just in case.

A shard size of 512MB will only affect new VMs, or new VM disks to be exact. The LVM chunk size defaults to 2MB on CentOS 7.6+, but it should be a multiple of your RAID stripe size. The stripe size should be fairly large; we use 512KB stripe sizes on the bricks and 2MB chunk sizes on LVM. With that and about 90 disks we can saturate 10GbE; then we added in some SSD cache drives to LVM on the bricks, which helped a lot with random IO.
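For illustration, a thin pool with an explicit chunk size can be created like this (VG, LV names and sizes are placeholders; pick the chunk size to match your RAID stripe geometry as described above):

    # thin pool with a 2MB chunk size on volume group gluster_vg
    lvcreate --type thin-pool --thinpool gluster_thinpool \
             --size 1T --chunksize 2m gluster_vg

    # thin LV for the brick, carved from that pool
    lvcreate --type thin --thinpool gluster_vg/gluster_thinpool \
             --virtualsize 1T --name gluster_lv_data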

On 3/24/20 7:25 PM, Alex McWhirter wrote:
Regarding the chunk size, Red Hat tells me it depends on RAID or JBOD:

https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/ht...

"chunksize: An important parameter to be specified while creating a thin pool is the chunk size, which is the unit of allocation. For good performance, the chunk size for the thin pool and the parameters of the underlying hardware RAID storage should be chosen so that they work well together."

And regarding the shard size, you can fix that with storage live migration, right? Use two volumes and domains and move the disks, so they will adopt the new shard size...

Am I correct that when you change the sharding on a running volume, it only applies to new disks? Or does it also apply to extensions of a current disk?

With kind regards,

Jorick Astrego
Netbulae Virtualization Experts

On March 27, 2020 2:49:13 PM GMT+02:00, Jorick Astrego <jorick@netbulae.eu> wrote:
A shard size change is valid for new images, but this can be fixed either via storage migration between volumes or via creating a new disk and migrating within the OS (if possible).

Still, MTU is important, and you can use 'ping -s <size_of_data> -c 1 -M do <destination>' to test. Keep in mind that VLANs also take some data in the packet (I think around 8 bytes). Today I have set MTU 9100 on some servers in order to guarantee that the app will be able to transfer 9000 bytes of data, but this depends on the switches between the nodes and the NICs of the servers. You can use tracepath to detect if there is a switch that doesn't support jumbo frames.

Actually, setting up ctdb with NFS Ganesha is quite easy. You will be able to get all the 'goodies' from oVirt (snapshots, live migration, etc.) while getting higher performance via NFS Ganesha - which is like a gateway for the clients (while accessing all servers simultaneously), so it is better situated outside the gluster servers.

Best Regards,
Strahil Nikolov
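To make that test concrete: with an interface MTU of 9000, the largest ICMP payload that fits in a single packet is 8972 bytes (9000 minus 20 bytes IP header and 8 bytes ICMP header), so a check between two gluster nodes could look like this ('node2' is a placeholder):

    # -M do forbids fragmentation; this must succeed end-to-end
    ping -M do -s 8972 -c 3 node2

    # walk the path and report the MTU of every hop
    tracepath node2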

Hey,

thanks for writing. If I go for "don't choose local", my speed drops dramatically (halving). Speed between the hosts is okay (tested), but for some odd reason the MTU is still at 1500. I was sure I set it to jumbo/9k. Oh well. Not during runtime - I can hear the gluster scream if the network dies for a second :)

-Chris.

Christian,

I've been following along with interest, as I've also been trying everything I can to improve gluster performance in my HCI cluster. My issue is mostly latency related, and my workloads are typically small-file operations, which have been especially challenging. Couple of things:

1. About the MTU: did you also enable jumbo frames at switch level (if applicable)? I have jumbo frames enabled but honestly didn't see much of an impact from doing so.

2. About libgfapi: it's actually quite simple to enable it (at least if you want to do some testing). It can be enabled on the hosted engine using engine-config, i.e. engine-config -s LibgfApiSupported=true -- from my experience you can do this while VMs are running, and they won't pick up the new config until powered off/restarted. So you are able to test it out on one VM. Again, as some others have mentioned, this is not a default option in oVirt because there are known bugs with the libgfapi implementation. Some others have worked around these bugs in various ways, but like you, I am not willing to do so in a production environment. Still, I think it's very much worth doing some tests on a VM with libgfapi enabled compared to the default fuse mount.
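Spelled out, the test procedure from point 2 would look roughly like this on the engine machine (a sketch only; verify against your oVirt version before touching production):

    # enable libgfapi support in the engine configuration
    engine-config -s LibgfApiSupported=true

    # the engine must be restarted for the change to take effect
    systemctl restart ovirt-engine

    # then power a test VM off and on again so it picks up the new disk path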

Hey Strahil,

as always: thanks!

On 24/03/2020 12:23, Strahil Nikolov wrote:
Hey Chris,
What type is your VM?

CentOS 7.
Try with a 'High Performance' one (there is good RH documentation on that topic).
I was googly-eying that as well. Will try that tonight :)
1. Check the VM disk scheduler. Use 'noop/none' (depending on whether multiqueue is enabled) to allow the hypervisor to aggregate the I/O requests from multiple VMs. Next, set the 'noop/none' disk scheduler on the hosts - these two are the optimal ones for SSDs and NVMe disks (if I recall correctly you are using SSDs).
Yeah the gluster disks do have noop already.
2. Disable C-states on the host and guest (there are a lot of articles about that).
Not sure it's a CPU bottleneck in any capacity, but I'll dig into this.
3. Enable MTU 9000 for Hypervisor (gluster node).
Already done.
4. You can try setting/unsetting the tunables in the db-workload group and run benchmarks with a real workload.
Will also check!
5. Some users reported that enabling TCP offload on the hosts gave huge improvement in performance of gluster - you can try that. Of course there are mixed feelings - as others report that disabling it brings performance. I guess it is workload specific.
performance.write-behind-window-size: 64MB (shard size)
This one doubled my speed from 200 MB/s to 400 MB/s!! I think this is where the meat is at.

-Chris.

On 3/27/20 11:01 AM, Christian Reiss wrote:
performance.write-behind-window-size: 64MB (shard size)

This one doubled my speed from 200 MB/s to 400 MB/s!!
Won't this increase the risk of data loss? We have everything on dual power feeds etc., so the risk of having all or 2/3 of the gluster nodes fail at the same time is very minimal. But still, what happens when it does?

And with a shard size of 512MB, this would be performance.write-behind-window-size: 512MB?

Always tweaking ;-)

With kind regards,

Jorick Astrego
Netbulae Virtualization Experts

On Mon, Mar 23, 2020 at 11:44 PM Christian Reiss <email@christian-reiss.de> wrote:
Hey folks,
gluster related question. [...] I have these settings:

[...]
These settings mean:
performance.strict-o-direct: on
network.remote-dio: enable
That you are using direct I/O both on the client and server side.
[...]
Writing inside /gluster_bricks yields those 2 GB/s writes; reading the same.
How did you test this? Did you test reading from the storage on the server side using direct I/O? If not, you tested access through the server buffer cache, which is pretty fast.
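For example, a pair of read tests on the brick that separates cache speed from disk speed (the file path is a placeholder):

    # read through the page cache (fast, but mostly measures RAM)
    dd if=/gluster_bricks/data/testfile of=/dev/null bs=1M

    # read with direct I/O (measures the actual disks)
    dd if=/gluster_bricks/data/testfile of=/dev/null bs=1M iflag=direct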
Inside the /rhev/data-center/mnt/glusterSD/ dir, reads go down to 366 MB/s while writes plummet to 200 MB/s.
This uses direct I/O.
Summed up: writing into the SSD RAID in the lvm/xfs gluster brick directory is fast; writing into the mounted gluster dir is horribly slow.
The above can be seen and repeated on all 3 servers. The network can do full 10gbit (tested with, among others: rsync, iperf3).
Anyone with some idea on what's missing / going on here?
Please share the commands/configuration files used to perform the tests. Adding storage folks that can help with analyzing this.
Thanks folks, as always stay safe and healthy!
Nir

Hey,

thanks for writing. Sorry about the delay.

On 25/03/2020 00:25, Nir Soffer wrote:
These settings mean:
performance.strict-o-direct: on
network.remote-dio: enable
That you are using direct I/O both on the client and server side.

I changed them to off, to no avail. Yields the same results.
Writing inside /gluster_bricks yields those 2 GB/s writes; reading the same.
How did you test this?

I ran

    dd if=/dev/zero of=testfile oflag=direct bs=1M status=progress

(with varying block sizes) on

- the mounted gluster brick (/gluster_bricks...)
- the mounted gluster volume (/rhev.../mount/...)
- inside a running VM.

I also switched it around and read an image file from the gluster volume, with the same speeds.
Did you test reading from the storage on the server side using direct I/O? If not, you tested access through the server buffer cache, which is pretty fast.

Which is where oflag comes in. I can confirm that skipping it results in really, really fast IO, until the buffer is full. With oflag=direct it still shows ~2 GB/s on the RAID and 200 MB/s on the gluster volume.
Inside the /rhev/data-center/mnt/glusterSD/ dir, reads go down to 366 MB/s while writes plummet to 200 MB/s.
This uses direct I/O.

Even with direct I/O turned on (it is now off, yielding the same results), this is way too slow for direct I/O.
Please share the commands/configuration files used to perform the tests. Adding storage folks that can help with analyzing this.

I am happy to oblige and supply any required logs or profiling information, if you'd be so kind as to tell me which ones, precisely.
Stay healthy!

-Chris.
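Since the CouchDB workload described earlier is latency-bound rather than throughput-bound, a small-block latency test may be more telling than streaming dd; a sketch using fio, with the file path and job parameters as placeholders:

    # 4k random writes, direct I/O, queue depth 1 - reports latency percentiles
    fio --name=latency-test --filename=/rhev/data-center/mnt/glusterSD/testfile \
        --rw=randwrite --bs=4k --direct=1 --ioengine=libaio \
        --iodepth=1 --size=1g --runtime=60 --time_based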
participants (7)

- Alex McWhirter
- Christian Reiss
- Darrell Budic
- Jayme
- Jorick Astrego
- Nir Soffer
- Strahil Nikolov