OK, this is even stranger. I ran the same dd test against my SSD OS/boot drives
on the oVirt Node hosts, which use the same model drive (only smaller) and the
same H310 controller; the only difference is that the OS/boot drives are in a
RAID mirror while the gluster drives are passthrough. The test completes in
<2 seconds in /tmp on the host but takes ~45 seconds in
/gluster_bricks/brick_whatever.
Is there any explanation for such a vast difference between the two tests?
Example of one of my mounts:
/dev/mapper/onn_orchard1-tmp /tmp ext4 defaults,discard 1 2
/dev/gluster_vg_sda/gluster_lv_prod_a /gluster_bricks/brick_a xfs inode64,noatime,nodiratime 0 0
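
For reference, the two runs are basically the same dd invocation as below, only against different paths:

dd if=/dev/zero of=/tmp/test4.img bs=512 count=5000 oflag=dsync                      # <2 s
dd if=/dev/zero of=/gluster_bricks/brick_a/test4.img bs=512 count=5000 oflag=dsync   # ~45 s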
On Sun, Mar 8, 2020 at 12:23 PM Jayme <jaymef(a)gmail.com> wrote:
Strahil,
I'm starting to think that my problem could be related to the use of PERC
H310 Mini RAID controllers in my oVirt hosts. The OS/boot SSDs are a RAID
mirror, but the gluster storage SSDs are in passthrough. I've read that the
queue depth of the H310 card is very low and can cause performance issues,
especially when used with flash devices.
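For what it's worth, the queue depth the OS reports can be read directly (sda here is assumed to be one of the passthrough gluster SSDs):
cat /sys/block/sda/device/queue_depth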
dd if=/dev/zero of=test4.img bs=512 count=5000 oflag=dsync on one of my
hosts' gluster bricks, /gluster_bricks/brick_a for example, takes 45 seconds
to complete.
I can perform the same operation in ~2 seconds on another server with a
better RAID controller, but with the same model SSD.
I might look at swapping out the H310s; unfortunately, I think that may
require me to wipe the gluster storage drives, since with another controller
I believe they'd need to be added as single RAID 0 arrays and would have to
be rebuilt to do so.
If I were to take one host down at a time, is there a way I can rebuild the
entire server, including wiping the gluster disks, and then add the host back
into the oVirt cluster and rebuild it along with the bricks? How would you
recommend doing such a task if I needed to wipe the gluster disks on each
host?
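For context, I'm assuming the gluster side of such a rebuild would be roughly a reset-brick per volume once the host is back up, e.g. (volume name, host and brick path are just placeholders):

gluster volume reset-brick prod_a host1:/gluster_bricks/brick_a/brick start
gluster volume reset-brick prod_a host1:/gluster_bricks/brick_a/brick host1:/gluster_bricks/brick_a/brick commit force
gluster volume heal prod_a full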
On Sat, Mar 7, 2020 at 6:24 PM Jayme <jaymef(a)gmail.com> wrote:
> No worries at all about the length of the email, the details are highly
> appreciated. You've given me lots to look into and consider.
>
>
>
> On Sat, Mar 7, 2020 at 10:02 AM Strahil Nikolov <hunter86_bg(a)yahoo.com>
> wrote:
>
>> On March 7, 2020 1:12:58 PM GMT+02:00, Jayme <jaymef(a)gmail.com> wrote:
>> >Thanks again for the info. You're probably right about the testing
>> >method.
>> >Though the reason I'm down this path in the first place is because I'm
>> >seeing a problem in real-world workloads. Many of my VMs are used in
>> >development environments where working with small files is common, such
>> >as npm installs working with large node_modules folders, and CI/CD doing
>> >lots of mixed I/O and compute operations.
>> >
>> >I started testing some of these things by comparing side by side with a
>> >VM using the same specs, the only difference being gluster vs NFS
>> >storage. NFS-backed storage is performing about 3x better in real-world
>> >use.
>> >
>> >The Gluster version is the stock one that comes with 4.3.7. I haven't
>> >attempted updating it outside of official oVirt updates.
>> >
>> >I’d like to see if I could improve it to handle my workloads better. I
>> >also
>> >understand that replication adds overhead.
>> >
>> >I do wonder how much difference in performance there would be with
>> >replica 3 vs replica 3 arbiter. I'd assume the arbiter setup would be
>> >faster, but perhaps not by a considerable margin.
>> >
>> >I will check into C-states as well.
>> >
>> >On Sat, Mar 7, 2020 at 2:52 AM Strahil Nikolov <hunter86_bg(a)yahoo.com>
>> >wrote:
>> >
>> >> On March 7, 2020 1:09:37 AM GMT+02:00, Jayme <jaymef(a)gmail.com>
>> >wrote:
>> >> >Strahil,
>> >> >
>> >> >Thanks for your suggestions. The config is a pretty standard HCI setup
>> >> >with cockpit, and the hosts are oVirt Node. XFS was handled by the
>> >> >deployment automatically. The gluster volumes were optimized for virt
>> >> >store.
>> >> >
>> >> >I tried noop on the SSDs; that made zero difference in the tests I was
>> >> >running above. I took a look at the random-io profile and it looks
>> >> >like it really only sets vm.dirty_background_ratio = 2 &
>> >> >vm.dirty_ratio = 5 -- my hosts already appear to have those sysctl
>> >> >values, and by default are using the virtual-host tuned profile.
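>> >> >
>> >> >For reference, I'm checking those with something like:
>> >> >
>> >> >sysctl vm.dirty_background_ratio vm.dirty_ratio
>> >> >tuned-adm active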
>> >> >
>> >> >I'm curious what a test like "dd if=/dev/zero of=test2.img bs=512
>> >> >count=1000 oflag=dsync" on one of your VMs would show for results?
>> >> >
>> >> >I haven't done much with gluster profiling but will take a look and
>> >> >see if I can make sense of it. Otherwise, the setup is a pretty stock
>> >> >oVirt HCI deployment with SSD-backed storage and a 10GbE storage
>> >> >network. I'm not coming anywhere close to maxing network throughput.
>> >> >
>> >> >The NFS export I was testing was an export from a local server
>> >> >exporting a single SSD (same type as in the oVirt hosts).
>> >> >
>> >> >I might end up switching storage to NFS and ditching gluster if
>> >> >performance
>> >> >is really this much better...
>> >> >
>> >> >
>> >> >On Fri, Mar 6, 2020 at 5:06 PM Strahil Nikolov <hunter86_bg(a)yahoo.com>
>> >> >wrote:
>> >> >
>> >> >> On March 6, 2020 6:02:03 PM GMT+02:00, Jayme <jaymef(a)gmail.com>
>> >> >wrote:
>> >> >> >I have a 3-server HCI setup with Gluster replica 3 storage (10GbE
>> >> >> >and SSD disks).
>> >> >> >Small-file performance inside the VMs is pretty terrible compared
>> >> >> >to a similarly spec'ed VM using an NFS mount (10GbE network, SSD
>> >> >> >disk).
>> >> >> >
>> >> >> >VM with gluster storage:
>> >> >> >
>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>> >> >> >1000+0 records in
>> >> >> >1000+0 records out
>> >> >> >512000 bytes (512 kB) copied, 53.9616 s, 9.5 kB/s
>> >> >> >
>> >> >> >VM with NFS:
>> >> >> >
>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>> >> >> >1000+0 records in
>> >> >> >1000+0 records out
>> >> >> >512000 bytes (512 kB) copied, 2.20059 s, 233 kB/s
>> >> >> >
>> >> >> >This is a very big difference: 2 seconds to copy 1000 files on the
>> >> >> >NFS VM vs 53 seconds on the other.
>> >> >> >
>> >> >> >Aside from enabling libgfapi, is there anything I can tune on the
>> >> >> >gluster or VM side to improve small-file performance? I have seen
>> >> >> >some guides by Red Hat regarding small-file performance, but I'm
>> >> >> >not sure what, if any, of it applies to oVirt's implementation of
>> >> >> >gluster in HCI.
>> >> >>
>> >> >> You can use the rhgs-random-io tuned profile from
>> >> >> ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-...
>> >> >> and try with that on your hosts.
>> >> >> In my case, I have modified it so it's a mixture between
>> >> >> rhgs-random-io and the profile for Virtualization Host.
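>> >> >>
>> >> >> Roughly, such a mix can be done as a child tuned profile that
>> >> >> includes virtual-host and adds the random-io sysctls (profile name
>> >> >> and values here are just an example):
>> >> >>
>> >> >> # /etc/tuned/rhgs-virt-mix/tuned.conf
>> >> >> [main]
>> >> >> include=virtual-host
>> >> >>
>> >> >> [sysctl]
>> >> >> vm.dirty_background_ratio=2
>> >> >> vm.dirty_ratio=5
>> >> >>
>> >> >> # then: tuned-adm profile rhgs-virt-mix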
>> >> >>
>> >> >> Also, ensure that your bricks are using XFS with the relatime/noatime
>> >> >> mount option and that your scheduler for the SSDs is either 'noop' or
>> >> >> 'none'. The default I/O scheduler for RHEL7 is deadline, which gives
>> >> >> preference to reads, and your workload is definitely 'write'.
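>> >> >>
>> >> >> Something like this, per SSD ('sda' is just an example; a udev rule
>> >> >> is the usual way to make it persistent):
>> >> >>
>> >> >> cat /sys/block/sda/queue/scheduler
>> >> >> echo noop > /sys/block/sda/queue/scheduler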
>> >> >>
>> >> >> Ensure that the virt settings are enabled for your gluster volumes:
>> >> >> 'gluster volume set <volname> group virt'
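>> >> >> (You can verify what got applied with something like
>> >> >> 'gluster volume info <volname>' - the options show up under "Options
>> >> >> Reconfigured".)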
>> >> >>
>> >> >> Also, are you running on fully allocated disks for the VM, or did
>> >> >> you start thin?
>> >> >> I'm asking because creation of new shards at the gluster level is a
>> >> >> slow task.
>> >> >>
>> >> >> Have you checked gluster profiling for the volume? It can clarify
>> >> >> what is going on.
>> >> >>
>> >> >>
>> >> >> Also, are you comparing apples to apples?
>> >> >> For example, 1 SSD mounted and exported as NFS vs a replica 3 volume
>> >> >> on the same type of SSD? If not, the NFS side can have more IOPS due
>> >> >> to multiple disks behind it, while Gluster has to write the same
>> >> >> thing on all nodes.
>> >> >>
>> >> >> Best Regards,
>> >> >> Strahil Nikolov
>> >> >>
>> >> >>
>> >>
>> >> Hi Jayme,
>> >>
>> >>
>> >> My tests are not quite comparable, as I have a different setup:
>> >>
>> >> NVMe - VDO - 4 thin LVs - XFS - 4 Gluster volumes (replica 2 arbiter 1)
>> >> - 4 storage domains - striped LV in each VM
>> >>
>> >> RHEL7 VM (fully stock):
>> >> [root@node1 ~]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>> >> 1000+0 records in
>> >> 1000+0 records out
>> >> 512000 bytes (512 kB) copied, 19.8195 s, 25.8 kB/s
>> >> [root@node1 ~]#
>> >>
>> >> Brick:
>> >> [root@ovirt1 data_fast]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>> >> 1000+0 records in
>> >> 1000+0 records out
>> >> 512000 bytes (512 kB) copied, 1.41192 s, 363 kB/s
>> >>
>> >> As I use VDO with compression (on 1/4 of the NVMe), I cannot expect any
>> >> performance from it.
>> >>
>> >>
>> >> Is your app really using dsync? I have seen many times that performance
>> >> testing with the wrong tools/tests causes more trouble than it should.
>> >>
>> >> I would recommend testing with a real workload before deciding to
>> >> change the architecture.
>> >>
>> >> I forgot to mention that you need to disable C-states on your systems
>> >> if you are chasing performance.
>> >> Run a gluster profile while you run a real workload in your VMs and
>> >> then provide that for analysis.
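>> >>
>> >> A rough outline of that (volume name is a placeholder):
>> >>
>> >> gluster volume profile <volname> start
>> >> ... run the real workload in the VM ...
>> >> gluster volume profile <volname> info > profile.txt
>> >> gluster volume profile <volname> stop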
>> >>
>> >> Which version of Gluster are you using?
>> >>
>> >> Best Regards,
>> >> Strahil Nikolov
>> >>
>>
>> Hm...
>> Then you do have a real workload scenario - pick one of the most often
>> used tasks and use its time of completion for reference.
>> Synthetic benchmarking is not good.
>>
>> As far as I know, oVirt is actually running on gluster v6.x.
>> @Sandro,
>> Can you tell us the highest supported gluster version on oVirt? I'm
>> running v7.0, so I'm a little bit off the track.
>>
>> Jayme,
>>
>> Next steps are to check:
>> 1. Did you disable C-states? There are very good articles for
>> RHEL/CentOS 7 (see the sketch after this list).
>> 2. Check the firmware of your HCI nodes - I've seen numerous network/SAN
>> issues due to old firmware, including stuck processes.
>> 3. Check the articles for RHV and hugepages. If your VMs' memory is
>> dynamic and lots of RAM is needed -> hugepages will bring more
>> performance. Second, transparent hugepages must be disabled.
>> 4. Create a High Performance VM for testing purposes with fully
>> allocated disks
>> 5. Check if 'noatime' or 'relatime' is set for the bricks. If SELinux is
>> in enforcing mode (I highly recommend that), you can use the mount option
>> context="system_u:object_r:glusterd_brick_t:s0", which lets the kernel
>> skip looking up the SELinux context of every file in the brick,
>> increasing performance.
>>
>> 6. Consider switching to the 'noop'/'none' I/O scheduler, or tuning
>> 'deadline' to match your needs.
>>
>> 7. Capture a gluster profile while the VM from step 4 is being tested,
>> if it is needed.
>>
>> 8. Consider using 'Pass-through Host CPU', which is enabled in the UI via
>> VM -> Edit -> Host -> Start on specific host -> select all hosts with the
>> same CPU -> allow manual and automatic migration -> OK.
>> This mode makes all instructions of the host CPU available to the guest,
>> greatly increasing performance for a lot of software.
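>>
>> For items 1, 3 and 5, a rough sketch of what that looks like on
>> RHEL/CentOS 7 (kernel args assume Intel CPUs; device and volume names are
>> only examples):
>>
>> # limit C-states via the kernel command line, then reboot
>> grubby --update-kernel=ALL --args="processor.max_cstate=1 intel_idle.max_cstate=1"
>> # disable transparent hugepages for the current boot
>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> # brick mount with a fixed SELinux context (fstab example)
>> /dev/gluster_vg_sda/gluster_lv_prod_a /gluster_bricks/brick_a xfs inode64,noatime,nodiratime,context="system_u:object_r:glusterd_brick_t:s0" 0 0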
>>
>>
>> The difference between 'replica 3' and 'replica 3 arbiter 1' (the old
>> name was 'replica 2 arbiter 1', but it means the same) is that the
>> arbitrated volume requires less bandwidth (because the files on the
>> arbiter hold 0 bytes of data) and stores only metadata to prevent
>> split-brain.
>> The drawback of the arbiter is that you have only 2 sources to read from,
>> while replica 3 provides three sources to read from.
>> With glusterd 2.0 (I think it was introduced in gluster v7) the arbiter
>> doesn't need to be local (which means higher latencies are no longer an
>> issue) and is only consulted when one of the data bricks is unavailable.
>> Still, the remote arbiter is too new for prod.
>>
>> Next: You can consider a clustered 2-node NFS Ganesha setup (with a
>> quorum device for the third vote) as an NFS source. The good thing about
>> NFS Ganesha is that it is a primary focus of the Gluster community and it
>> uses libgfapi to connect to the backend (replica volume).
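>>
>> For illustration, a Ganesha export over a gluster volume looks roughly
>> like this (ganesha.conf snippet; names are placeholders and the HA/quorum
>> part is configured separately):
>>
>> EXPORT {
>>     Export_Id = 1;
>>     Path = "/prod_a";
>>     Pseudo = "/prod_a";
>>     Access_Type = RW;
>>     FSAL {
>>         Name = GLUSTER;
>>         Hostname = "localhost";
>>         Volume = "prod_a";
>>     }
>> }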
>>
>> I think that's enough for now, but I guess other stuff could come to
>> mind at a later stage.
>>
>> Edit: This e-mail is way longer than I initially thought it would be.
>> Sorry about that.
>>
>>
>> Best Regards,
>> Strahil Nikolov
>>
>