Ok, this is even stranger. The same dd test against my SSD os/boot drives on the oVirt node hosts, using the same model drive (only smaller) and the same H310 controller (the only difference being that the os/boot drives are in a RAID mirror and the gluster drives are passthrough), completes in <2 seconds in /tmp on the host but takes ~45 seconds in /gluster_bricks/brick_whatever.

Is there any explanation for such a vast difference between the two tests?
An example of my mounts:

/dev/mapper/onn_orchard1-tmp /tmp ext4 defaults,discard 1 2
/dev/gluster_vg_sda/gluster_lv_prod_a /gluster_bricks/brick_a xfs inode64,noatime,nodiratime 0 0
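
For reference, this is roughly the comparison I'm running (the same command as in my earlier mail, just pointed at each mount; paths are from the fstab entries above):

# os/boot SSD (RAID mirror on the H310), ext4 -> finishes in <2 seconds
dd if=/dev/zero of=/tmp/test4.img bs=512 count=5000 oflag=dsync

# gluster SSD (passthrough on the same H310), XFS -> takes ~45 seconds
dd if=/dev/zero of=/gluster_bricks/brick_a/test4.img bs=512 count=5000 oflag=dsync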
On Sun, Mar 8, 2020 at 12:23 PM Jayme <jaymef(a)gmail.com> wrote:
> Strahil,
>
> I'm starting to think that my problem could be related to the use of Perc H310 mini RAID controllers in my oVirt hosts. The os/boot SSDs are a RAID mirror but the gluster storage is SSDs in passthrough. I've read that the queue depth of the H310 card is very low and can cause performance issues, especially when used with flash devices.
>
> dd if=/dev/zero of=test4.img bs=512 count=5000 oflag=dsync on one of my hosts' gluster bricks (/gluster_bricks/brick_a for example) takes 45 seconds to complete.
>
> I can perform the same operation in ~2 seconds on another server with a better RAID controller, but with the same model SSD.
>
> I might look at swapping out the H310s; unfortunately, I think that may require me to wipe the gluster storage drives, as with another controller I believe they'd need to be added as single RAID 0 arrays and would need to be rebuilt to do so.
>
> If I were to take one host down at a time, is there a way that I can rebuild the entire server, including wiping the gluster disks, and add the host back into the oVirt cluster and rebuild it along with the bricks? How would you recommend doing such a task if I needed to wipe the gluster disks on each host?
>
>
>
> On Sat, Mar 7, 2020 at 6:24 PM Jayme <jaymef(a)gmail.com> wrote:
>
>> No worries at all about the length of the email; the details are highly appreciated. You've given me lots to look into and consider.
>>
>>
>>
>> On Sat, Mar 7, 2020 at 10:02 AM Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
>>
>>> On March 7, 2020 1:12:58 PM GMT+02:00, Jayme <jaymef(a)gmail.com> wrote:
>>> >Thanks again for the info. You're probably right about the testing method.
>>> >Though the reason I'm down this path in the first place is that I'm seeing a problem in real-world workloads. Many of my VMs are used in development environments where working with small files is common, such as npm installs working with large node_modules folders, and CI/CD doing lots of mixed I/O and compute operations.
>>> >
>>> >I started testing some of these things by comparing side by side with a VM using the same specs, the only difference being gluster vs NFS storage. NFS-backed storage is performing about 3x better in real-world use.
>>> >
>>> >The Gluster version is the stock one that comes with 4.3.7. I haven't attempted updating it outside of official oVirt updates.
>>> >
>>> >I’d like to see if I could improve it to handle my workloads
better. I
>>> >also
>>> >understand that replication adds overhead.
>>> >
>>> >I do wonder how much difference in performance there would be between replica 3 and replica 3 arbiter. I'd assume the arbiter setup would be faster, but perhaps not by a considerable difference.
>>> >
>>> >I will check into C-states as well.
>>> >
>>> >On Sat, Mar 7, 2020 at 2:52 AM Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
>>> >
>>> >> On March 7, 2020 1:09:37 AM GMT+02:00, Jayme <jaymef(a)gmail.com> wrote:
>>> >> >Strahil,
>>> >> >
>>> >> >Thanks for your suggestions. The config is a pretty standard HCI setup with cockpit, and the hosts are oVirt Node. XFS was handled by the deployment automatically. The gluster volumes were optimized for virt store.
>>> >> >
>>> >> >I tried noop on the SSDs; that made zero difference in the tests I was running above.
>>> >> >I took a look at the random-io profile and it looks like it really only sets vm.dirty_background_ratio = 2 and vm.dirty_ratio = 5 -- my hosts already appear to have those sysctl values, and by default are using the virtual-host tuned profile.
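>>> >> >For reference, I'm verifying that with a quick check along these lines:
>>> >> >  sysctl vm.dirty_background_ratio vm.dirty_ratio
>>> >> >  tuned-adm active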
>>> >> >
>>> >> >I'm curious what a test like "dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync" on one of your VMs would show for results?
>>> >> >
>>> >> >I haven't done much with gluster profiling but will take a look and see if I can make sense of it. Otherwise, the setup is a pretty stock oVirt HCI deployment with SSD-backed storage and a 10GbE storage network. I'm not coming anywhere close to maxing network throughput.
>>> >> >
>>> >> >The NFS export I was testing was an export from a local server exporting a single SSD (the same type as in the oVirt hosts).
>>> >> >
>>> >> >I might end up switching storage to NFS and ditching gluster if performance is really this much better...
>>> >> >
>>> >> >
>>> >> >On Fri, Mar 6, 2020 at 5:06 PM Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
>>> >> >
>>> >> >> On March 6, 2020 6:02:03 PM GMT+02:00, Jayme <jaymef(a)gmail.com> wrote:
>>> >> >> >I have a 3-server HCI with Gluster replica 3 storage (10GbE and SSD disks).
>>> >> >> >Small file performance inside the VMs is pretty terrible compared to a similarly spec'ed VM using an NFS mount (10GbE network, SSD disk).
>>> >> >> >
>>> >> >> >VM with gluster storage:
>>> >> >> >
>>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> >> >1000+0 records in
>>> >> >> >1000+0 records out
>>> >> >> >512000 bytes (512 kB) copied, 53.9616 s, 9.5 kB/s
>>> >> >> >
>>> >> >> >VM with NFS:
>>> >> >> >
>>> >> >> ># dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> >> >1000+0 records in
>>> >> >> >1000+0 records out
>>> >> >> >512000 bytes (512 kB) copied, 2.20059 s, 233 kB/s
>>> >> >> >
>>> >> >> >This is a very big difference: 2 seconds for 1000 synchronous writes on the NFS VM vs. 53 seconds on the other.
>>> >> >> >
>>> >> >> >Aside from enabling libgfapi, is there anything I can tune on the gluster or VM side to improve small file performance? I have seen some guides by Red Hat regarding small file performance, but I'm not sure what, if any, of it applies to oVirt's implementation of gluster in HCI.
>>> >> >>
>>> >> >> You can use the rhgs-random-io tuned profile from
>>> >> >> ftp://ftp.redhat.com/redhat/linux/enterprise/7Server/en/RHS/SRPMS/redhat-...
>>> >> >> and try with that on your hosts.
>>> >> >> In my case, I have modified it so it's a mixture between rhgs-random-io and the profile for Virtualization Host.
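>>> >> >> A rough sketch of such a mixed profile (the profile name and contents are assumptions, based on the sysctl values the random-io profile sets):
>>> >> >> mkdir /etc/tuned/rhgs-virt-mix
>>> >> >> cat > /etc/tuned/rhgs-virt-mix/tuned.conf <<'EOF'
>>> >> >> [main]
>>> >> >> include=virtual-host
>>> >> >> [sysctl]
>>> >> >> vm.dirty_background_ratio=2
>>> >> >> vm.dirty_ratio=5
>>> >> >> EOF
>>> >> >> tuned-adm profile rhgs-virt-mix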
>>> >> >>
>>> >> >> Also, ensure that your bricks are using XFS with the relatime/noatime mount option and that your scheduler for the SSDs is either 'noop' or 'none'. The default I/O scheduler for RHEL7 is deadline, which gives preference to reads, and your workload is definitely 'write'.
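>>> >> >> For example (the device name is an assumption - check your brick devices):
>>> >> >> cat /sys/block/sdb/queue/scheduler
>>> >> >> echo noop > /sys/block/sdb/queue/scheduler   # use 'none' on blk-mq kernels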
>>> >> >>
>>> >> >> Ensure that the virt settings are enabled for your gluster volumes:
>>> >> >> 'gluster volume set <volname> group virt'
>>> >> >>
>>> >> >> Also, are you running on fully allocated disks for the VM, or did you start thin?
>>> >> >> I'm asking because the creation of new shards at the gluster level is a slow task.
>>> >> >>
>>> >> >> Have you checked gluster profiling on the volume? It can clarify what is going on.
>>> >> >>
>>> >> >>
>>> >> >> Also, are you comparing apples to apples?
>>> >> >> For example, 1 SSD mounted and exported as NFS versus a replica 3 volume on the same type of SSD? If not, the NFS side can have more IOPS due to multiple disks behind it, while Gluster has to write the same thing on all nodes.
>>> >> >>
>>> >> >> Best Regards,
>>> >> >> Strahil Nikolov
>>> >> >>
>>> >> >>
>>> >>
>>> >> Hi Jayme,
>>> >>
>>> >>
>>> >> My tests are not a good reference, as I have a different setup:
>>> >>
>>> >> NVMe - VDO - 4 thin LVs - XFS - 4 Gluster volumes (replica 2 arbiter 1) - 4 storage domains - striped LV in each VM
>>> >>
>>> >> RHEL7 VM (fully stock):
>>> >> [root@node1 ~]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> 1000+0 records in
>>> >> 1000+0 records out
>>> >> 512000 bytes (512 kB) copied, 19.8195 s, 25.8 kB/s
>>> >> [root@node1 ~]#
>>> >>
>>> >> Brick:
>>> >> [root@ovirt1 data_fast]# dd if=/dev/zero of=test2.img bs=512 count=1000 oflag=dsync
>>> >> 1000+0 records in
>>> >> 1000+0 records out
>>> >> 512000 bytes (512 kB) copied, 1.41192 s, 363 kB/s
>>> >>
>>> >> As I use VDO with compression (on 1/4 of the NVMe), I cannot expect any performance from it.
>>> >>
>>> >>
>>> >> Is your app really using dsync? I have seen many times that performance testing with the wrong tools/tests causes more trouble than it should.
>>> >>
>>> >> I would recommend you test with a real workload before deciding to change the architecture.
>>> >>
>>> >> I forgot to mention that you need to disable C-states on your systems if you are chasing performance.
>>> >> Run a gluster profile while you run a real workload in your VMs and then provide that for analysis.
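>>> >> Roughly like this (the volume name is just a placeholder):
>>> >> gluster volume profile <volname> start
>>> >> # ... run the workload in the VM ...
>>> >> gluster volume profile <volname> info
>>> >> gluster volume profile <volname> stop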
>>> >>
>>> >> Which version of Gluster are you using ?
>>> >>
>>> >> Best Regards,
>>> >> Strahil Nikolov
>>> >>
>>>
>>> Hm...
>>> Then you do have a real workload scenario - pick one of the most often used tasks and use its time of completion for reference.
>>> Synthetic benchmarking is not good.
>>>
>>> As far as I know, oVirt is actually running on gluster v6.x.
>>> @Sandro,
>>> Can you hint at the highest supported gluster version on oVirt? I'm running v7.0, so I'm a little bit off track.
>>>
>>> Jayme,
>>>
>>> Next steps are to check:
>>> 1. Did you disable C-states? There are very good articles for RHEL/CentOS 7 (see the sketch after this list).
>>> 2. Check the firmware of your HCI nodes - I've seen numerous network/SAN issues due to old firmware, including stuck processes.
>>> 3. Check the articles for RHV and hugepages. If your VMs are memory-dynamic and lots of RAM is needed -> hugepages will bring more performance. Second, transparent huge pages must be disabled.
>>> 4. Create a High Performance VM for testing purposes, with fully allocated disks.
>>> 5. Check whether 'noatime' or 'relatime' is set for the bricks. If SELinux is in enforcing mode (I highly recommend that), you can use the mount option 'context="system_u:object_r:glusterd_brick_t:s0"', which causes the kernel to skip looking up the SELinux context of every file in the brick - increasing the performance.
>>>
>>> 6. Consider switching to the 'noop'/'none' I/O scheduler, or tuning 'deadline' to match your needs.
>>>
>>> 7. Create a gluster profile while the VM from step 4 is being tested, in case it is needed.
>>>
>>> 8. Consider using 'Pass-through host cpu', which is enabled in the UI via -> VM -> Edit -> Host -> Start on specific host -> select all hosts with the same CPU -> allow manual and automatic migration -> OK.
>>> This mode makes all instructions of the host CPU available to the guest, greatly increasing performance for a lot of software.
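>>>
>>> A minimal sketch for items 1, 3 and 5 above (device names, paths and values are assumptions):
>>> # 1. C-states: append to GRUB_CMDLINE_LINUX in /etc/default/grub, rebuild grub.cfg, reboot
>>> #    intel_idle.max_cstate=0 processor.max_cstate=1
>>> # 3. Transparent hugepages off (for the running system):
>>> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>>> # 5. Brick mount with noatime and a fixed SELinux context (fstab example):
>>> /dev/gluster_vg_sda/gluster_lv_prod_a /gluster_bricks/brick_a xfs inode64,noatime,nodiratime,context="system_u:object_r:glusterd_brick_t:s0" 0 0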
>>>
>>>
>>> The difference between 'replica 3' and 'replica 3 arbiter 1' (the old name was 'replica 2 arbiter 1', but it means the same) is that the arbitrated volume requires less bandwidth (the files on the arbiter hold 0 bytes of data) and stores only metadata to prevent split-brain.
>>> The drawback of the arbiter is that you have only 2 sources to read from, while replica 3 provides three sources to read from.
>>> With glusterd 2.0 (I think it was introduced in gluster v7) the arbiter doesn't need to be local (which means higher latencies are no longer an issue), and is only consulted when one of the data bricks is unavailable. Still, the remote arbiter is too new for prod.
>>>
>>> Next: you can consider a clustered 2-node NFS-Ganesha (with a quorum device for the third vote) as an NFS source. The good thing about NFS-Ganesha is the primary focus it gets from the Gluster community, and the fact that it uses libgfapi to connect to the backend (replica volume).
>>>
>>> I think that's enough for now, but I guess other stuff could come to mind at a later stage.
>>>
>>> Edit: This e-mail is way longer than I initially intended. Sorry about that.
>>>
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>
Hey Jayme,
I'm glad you got some progress.
Check the firmware of the systems (including disks' firmware).
For HPE hardware, boot from the latest SPP and update everything. For other vendors, follow the documentation.
Once the system is up, test again:
1. Same RAID controller + another SSD (check the IOPS against the vendor's specs)
2. Same controller + same disk on another system
If in step 1 an SSD with lower specs performs better, or in step 2 the same hardware setup performs better - involve the vendor to replace the SSD(s).
About the Gluster brick replacement - the easiest and fastest approach is to replace only the gluster bricks.
Once you swap the disks, recreate the LVM (that will be the moment to make some changes in the layout - like thin LVM for gluster snapshots) and set up the mounting (either fstab or '.mount' unit files) - you will need to:
1. Create the directory that will contain the brick:
mkdir /gluster_bricks/data/data
2. Restore the SELinux context (or consider defining 'context=<full context for gluster>' in the mount options instead):
restorecon -RFvv /gluster_bricks/data
3. Then use one of the following:
   A) gluster's 'reset-brick' -> I had issues with it on gluster v3, but it should be OK now
   B) gluster's 'replace-brick'
The difference is that the first one requires the brick to be mounted at the old location, while replace-brick doesn't need that (a rough sketch of the whole flow follows below).
Once the heal is over - you can proceed with the next node.
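A rough sketch of the flow on one node (volume, VG/LV, host and brick names below are placeholders based on the earlier examples - adapt them to your layout):

# recreate LVM + XFS on the new/wiped disk
pvcreate /dev/sda
vgcreate gluster_vg_sda /dev/sda
lvcreate --type thin-pool -l 100%FREE -n gluster_thinpool gluster_vg_sda
lvcreate -V 1T --thin -n gluster_lv_prod_a gluster_vg_sda/gluster_thinpool
mkfs.xfs -i size=512 /dev/gluster_vg_sda/gluster_lv_prod_a   # 512-byte inodes, as commonly recommended for gluster bricks
# re-add the fstab entry (or .mount unit), then:
mount /gluster_bricks/brick_a
mkdir -p /gluster_bricks/brick_a/brick
restorecon -RFvv /gluster_bricks/brick_a

# A) reset-brick: same brick path, the brick must be (re)mounted there
gluster volume reset-brick prod_a host1:/gluster_bricks/brick_a/brick start
gluster volume reset-brick prod_a host1:/gluster_bricks/brick_a/brick host1:/gluster_bricks/brick_a/brick commit force

# B) replace-brick: new brick path, the old one does not need to be mounted
gluster volume replace-brick prod_a host1:/gluster_bricks/brick_a/brick host1:/gluster_bricks/brick_a_new/brick commit force

# watch the heal before moving to the next node
gluster volume heal prod_a info summary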
Don't forget -> set the node to maintenance prior to the disk/controller replacement, or you might get fenced :)
Edit: The previous recommendations are still valid. Once the bricks are replaced, you can consider updating the gluster version to the latest supported one.
I guess other community members can step in and share what version they are using.
Best Regards,
Strahil Nikolov