
No worries at all about the length of the email; the details are highly appreciated. You've given me lots to look into and consider.
On Sat, Mar 7, 2020 at 10:02 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
On March 7, 2020 1:12:58 PM GMT+02:00, Jayme <jaymef@gmail.com> wrote:
Thanks again for the info. You’re probably right about the testing method. That said, the reason I’m down this path in the first place is that I’m seeing a problem in real-world workloads. Many of my VMs are used in development environments where working with small files is common, such as npm installs working with large node_modules folders and CI/CD doing lots of mixed I/O and compute operations.
I started testing some of these things by comparing side by side with a VM using the same specs, the only difference being gluster vs NFS storage. NFS-backed storage is performing about 3x better in the real world.
The Gluster version is the stock one that comes with 4.3.7. I haven’t attempted updating it outside of official oVirt updates.
I’d like to see if I could improve it to handle my workloads better. I also understand that replication adds overhead.
I do wonder how much difference in performance there would be with replica 3 vs replica 3 arbiter. I’d assume the arbiter setup would be faster, but perhaps not by a considerable margin.
I will check into C-states as well.
On Sat, Mar 7, 2020 at 2:52 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
On March 7, 2020 1:09:37 AM GMT+02:00, Jayme <jaymef@gmail.com> wrote:
Strahil,
Thanks for your suggestions. The config is pretty standard HCI setup with cockpit and hosts are oVirt node. XFS was handled by the deployment automatically. The gluster volumes were optimized for virt store.
I tried noop on the SSDs; that made zero difference in the tests I was running above. I took a look at the random-io-profile and it looks like it really only sets vm.dirty_background_ratio = 2 & vm.dirty_ratio = 5 -- my hosts already appear to have those sysctl values, and by default are using virtual-host tuned profile.
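For anyone wanting to repeat that comparison, here is a minimal sketch that checks the two sysctl values named above against what a host currently has. The function name and the optional directory argument are made up for illustration; the default path /proc/sys/vm is the standard location for these settings.

```shell
#!/bin/sh
# Compare the kernel's dirty-page thresholds with the two values mentioned
# above (vm.dirty_background_ratio = 2, vm.dirty_ratio = 5).
# The optional argument is a hypothetical override so the check can be
# pointed at a copy of the files instead of the live /proc/sys/vm.
check_dirty_ratios() {
    proc_vm="${1:-/proc/sys/vm}"
    for pair in dirty_background_ratio:2 dirty_ratio:5; do
        name="${pair%%:*}"   # part before the colon: sysctl name
        want="${pair##*:}"   # part after the colon: value from the profile
        have="$(cat "$proc_vm/$name")"
        if [ "$have" = "$want" ]; then
            echo "vm.$name=$have matches"
        else
            echo "vm.$name=$have differs (profile value: $want)"
        fi
    done
}

# Usage on a host:
# check_dirty_ratios
```

If both lines report a match, applying the profile would not change these two settings, which is consistent with seeing no difference in the tests.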
I'm curious what a test like "dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync" on one of your VMs would show for results?
I haven't done much with gluster profiling but will take a look and see if I can make sense of it. Otherwise, the setup is a pretty stock oVirt HCI deployment with SSD-backed storage and a 10GbE storage network. I'm not coming anywhere close to maxing out network throughput.
The NFS export I was testing was an export from a local server exporting a single SSD (same type as in the oVirt hosts).
I might end up switching storage to NFS and ditching gluster if performance is really this much better...
On Fri, Mar 6, 2020 at 5:06 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
On March 6, 2020 6:02:03 PM GMT+02:00, Jayme <jaymef@gmail.com> wrote:
I have a 3-server HCI with Gluster replica 3 storage (10GbE and SSD disks). Small file performance inside the VM is pretty terrible compared to a similarly spec'ed VM using an NFS mount (10GbE network, SSD disk).
VM with gluster storage:
# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 53.9616 s, 9.5 kB/s
VM with NFS:
# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 2.20059 s, 233 kB/s
This is a very big difference: 2 seconds to write 1000 blocks on the NFS VM vs. 53 seconds on the other.
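To put the two dd runs in per-write terms, here is a small sketch of the arithmetic. The function names are invented for illustration; the inputs are the elapsed times and count= value from the runs quoted above. Because oflag=dsync makes every 512-byte write wait for stable storage, elapsed time divided by count approximates the latency of one synchronous write.

```shell
#!/bin/sh
# Average time per synchronous write, in milliseconds.
# Arguments: total seconds reported by dd, number of writes (dd's count=).
per_write_ms() {
    LC_ALL=C awk -v secs="$1" -v n="$2" 'BEGIN { printf "%.2f\n", secs / n * 1000 }'
}

# How many times longer the first run took than the second.
slowdown() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Figures from the two dd runs:
# per_write_ms 53.9616 1000    # gluster-backed VM
# per_write_ms 2.20059 1000    # NFS-backed VM
# slowdown 53.9616 2.20059
```

With those inputs this works out to roughly 54 ms per write on the gluster-backed VM against about 2.2 ms on the NFS-backed one, a factor of about 24.5.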
Aside from enabling libgfapi, is there anything I can tune on the gluster or VM side to improve small file performance? I have seen some guides by Red Hat regarding small file performance, but I'm not sure what, if any, of it applies to oVirt's implementation of gluster in HCI.
You can use the rhgs-random-io tuned profile from
ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-storage-server-3.4.2.0-1.el7rhgs.src.rpm
and try with that on your hosts. In my case, I have modified it so it's a mixture between rhgs-random-io and the profile for Virtualization Host.
Also, ensure that your bricks are using XFS with the relatime/noatime mount option and that the scheduler for the SSDs is either 'noop' or 'none'. The default I/O scheduler for RHEL7 is deadline, which gives preference to reads, and your workload is definitely write-heavy.
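A minimal sketch for checking which scheduler a disk is currently using. The function name is made up, and sda is only a placeholder device name. The kernel exposes the scheduler list in sysfs and marks the active one with square brackets.

```shell
#!/bin/sh
# Print the active I/O scheduler from a sysfs scheduler file.
# The file lists all available schedulers and wraps the active one in
# square brackets, e.g. "noop [deadline] cfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Usage (sda is a placeholder, not a device named in this thread):
# active_scheduler /sys/block/sda/queue/scheduler
#
# Switching at runtime, as root (not persistent across reboots):
# echo noop > /sys/block/sda/queue/scheduler
```

A change made this way is lost on reboot, so a tuned profile or udev rule is the usual place to make it permanent.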
Ensure that the virt settings are enabled for your gluster volumes: 'gluster volume set <volname> group virt'
Also, are you running on fully allocated disks for the VM, or did you start thin? I'm asking because creation of new shards at the gluster level is a slow task.
Have you tried gluster profiling on the volume? It can clarify what is going on.
Also, are you comparing apples to apples? For example, 1 SSD mounted and exported as NFS versus a replica 3 volume of the same type of SSD? If not, the NFS can have more IOPS due to multiple disks behind it, while Gluster has to write the same thing on all nodes.
Best Regards, Strahil Nikolov
Hi Jayme,
My tests are not quite comparable, as I have a different setup:
NVMe - VDO - 4 thin LVs - XFS - 4 Gluster volumes (replica 2 arbiter 1) - 4 storage domains - striped LV in each VM
RHEL7 VM (fully stock):
[root@node1 ~]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 19.8195 s, 25.8 kB/s
[root@node1 ~]#
Brick:
[root@ovirt1 data_fast]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB) copied, 1.41192 s, 363 kB/s
As I use VDO with compression (on 1/4 of the NVMe) - I cannot expect any performance from it.
Is your app really using dsync? I have seen many times that performance testing with the wrong tools/tests causes more trouble than it should.
I would recommend that you test with a real workload before deciding to change the architecture.
I forgot to mention that you need to disable C-states on your systems if you are chasing performance. Run a gluster profile while you run a real workload in your VMs and then provide that for analysis.
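A rough sketch of collecting such a profile with the `gluster volume profile` subcommands (start, info, stop). The wrapper function, its arguments and the GLUSTER override variable are assumptions added for illustration, and the volume name is a placeholder.

```shell
#!/bin/sh
# Collect a gluster profile for one volume while a workload runs elsewhere.
# GLUSTER is a hypothetical override: set GLUSTER="echo gluster" for a dry
# run that only prints the commands instead of executing them.
profile_volume() {
    vol="$1"     # volume name (placeholder; use your own)
    secs="$2"    # how long to let the workload run, in seconds
    gluster_cmd="${GLUSTER:-gluster}"

    $gluster_cmd volume profile "$vol" start || return 1
    sleep "$secs"
    # Per-brick latency and call counts for each file operation.
    $gluster_cmd volume profile "$vol" info
    $gluster_cmd volume profile "$vol" stop
}

# Usage, started just before kicking off the real workload in the VM:
# profile_volume data 300 > profile-data.txt
```

Running it for the duration of one representative task keeps the output tied to the workload that is actually slow.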
Which version of Gluster are you using?
Best Regards, Strahil Nikolov
Hm... Then you do have a real workload scenario: pick one of the most often used tasks and use its time of completion for reference. Synthetic benchmarking is not good.
As far as I know oVirt is actually running on gluster v6.X. @Sandro, can you tell us the highest supported gluster version on oVirt? I'm running v7.0, so I'm a little bit off the track.
Jayme,
Next steps are to check:
1. Did you disable C-states? There are very good articles on this for RHEL/CentOS 7.
2. Check the firmware of your HCI nodes. I've seen numerous network/SAN issues due to old firmware, including stuck processes.
3. Check the articles for RHV and hugepages. If your VMs are memory dynamic and lots of RAM is needed, hugepages will bring more performance. Second, transparent huge pages must be disabled.
4. Create a High Performance VM for testing purposes with fully allocated disks.
5. Check if 'noatime' or 'relatime' is set for the bricks. If SELinux is in enforcing mode (I highly recommend that), you can use the mount option 'context="system_u:object_r:glusterd_brick_t:s0"', which lets the kernel skip looking up the SELinux context of every file in the brick, increasing performance.
6. Consider switching to 'noop'/'none' or tuning the 'deadline' I/O scheduler to match your needs.
7. Create a gluster profile while the VM (step 4) is being tested, in case it is needed.
8. Consider using 'Pass-through host CPU', which is enabled in the UI via VM -> Edit -> Host -> Start on specific host -> select all hosts with the same CPU -> allow manual and automatic migration -> OK. This mode makes all instructions of the host CPU available to the guest, greatly increasing performance for a lot of software.
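For step 5, a minimal sketch that reports which atime option a brick mount is using. The function name and the example brick path are hypothetical. It reads the standard /proc/mounts layout, where the second field is the mount point and the fourth is the comma-separated option list.

```shell
#!/bin/sh
# Report whether a mount point uses noatime or relatime.
# Arguments: mount point, optional mounts file (defaults to /proc/mounts).
brick_atime_option() {
    mnt="$1"
    mounts="${2:-/proc/mounts}"
    # Field 2 is the mount point, field 4 the comma-separated options.
    opts="$(awk -v mp="$mnt" '$2 == mp { print $4 }' "$mounts" | tail -n 1)"
    if [ -z "$opts" ]; then
        echo "not mounted"
        return 0
    fi
    case ",$opts," in
        *,noatime,*)  echo "noatime" ;;
        *,relatime,*) echo "relatime" ;;
        *)            echo "neither" ;;
    esac
}

# Usage (the brick path is a placeholder, not taken from this thread):
# brick_atime_option /gluster_bricks/data
```

The same pattern can be pointed at each brick mount on every host to confirm the options are consistent across the cluster.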
The difference between 'replica 3' and 'replica 3 arbiter 1' (the old name was 'replica 2 arbiter 1', but it means the same) is that the arbitrated volume requires less bandwidth, because the files on the arbiter hold 0 bytes of data; it stores only metadata to prevent split-brain. The drawback of the arbiter is that you have only 2 sources to read from, while replica 3 provides three. With glusterd 2.0 (I think it was introduced in gluster v7) the arbiter doesn't need to be local (which means higher latencies are no longer an issue), and it is only needed when one of the data bricks is down. Still, the remote arbiter is too new for prod.
Next: You can consider a clustered 2-node NFS Ganesha (with a quorum device for the third vote) as an NFS source. The good thing about NFS Ganesha is that it is the primary focus of the Gluster community and it uses libgfapi to connect to the backend (replica volume).
I think it's enough for now, but I guess other stuff could come to mind at a later stage.
Edit: This e-mail is way longer than I initially thought it would be. Sorry about that.
Best Regards, Strahil Nikolov