Yes, we did the analysis on this.
The on-disk consistency of the underlying ZFS RAID-Z file system is
sufficiently robust to meet our production needs today.
In this cluster, our workload is such that the loss of a few minutes of data
is not a business issue. Any event severe enough to cause data loss would
require re-running the jobs anyway.
Since our cluster is colocated in the power company's data center, if we
see a power loss we have much bigger problems. The mishap that started all
of this was what looked like a power loss but turned out to be a failing
motherboard. From bitter past experience, nothing protects against a
failing motherboard or RAID controller except frequent backups, and even
then you can lose up to a day's worth of data.
The long-term plan is to move to a Cinder-based system once we have the
resources for a dedicated platform administrator. NFS on TrueNAS is a
stopgap that we selected primarily for its ease of administration. After
consulting here and on the iXsystems lists, we concluded that TrueNAS is
not the right platform for iSCSI with write-intensive virtual machines.
Thank you!
On Thu, Jul 14, 2022 at 8:30 AM Jayme <jaymef(a)gmail.com> wrote:
David,
You should keep in mind that disabling sync on NFS can be dangerous. I
suggested trying it to see whether it made a difference in your tests, but
it may not be advisable for production use because it can lead to data
corruption in certain cases, such as a power failure.
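If you want to put a number on what sync writes are costing before deciding,
a crude probe like the sketch below can help. It times a run of buffered
writes against fsync-after-every-write on a file inside the NFS mount. The
path is a placeholder, and a real benchmark tool such as fio would give more
trustworthy numbers, but the gap between the two figures is telling:

    import os
    import time

    PATH = "/path/to/nfs/mount/syncprobe.bin"  # placeholder: point at the NFS-backed storage
    BLOCK = b"\0" * (128 * 1024)               # 128 KiB per write
    COUNT = 200                                # ~25 MiB total

    def timed_write(fsync_each_write):
        start = time.monotonic()
        with open(PATH, "wb") as f:
            for _ in range(COUNT):
                f.write(BLOCK)
                if fsync_each_write:
                    f.flush()
                    os.fsync(f.fileno())       # force each block to stable storage
        elapsed = time.monotonic() - start
        os.remove(PATH)
        return (len(BLOCK) * COUNT) / (1024 * 1024) / elapsed  # MiB/s

    print(f"buffered writes : {timed_write(False):8.1f} MiB/s")
    print(f"fsync per write : {timed_write(True):8.1f} MiB/s")

A large gap points at sync-write latency as the limiter (which a SLOG or
sync=disabled hides); a small gap would point somewhere else.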
On Wed, Jul 13, 2022 at 10:39 PM David Johnson <
djohnson(a)maxistechnology.com> wrote:
> I have changed the TrueNAS pool setting to synchronous writes = disabled,
> and it improved throughput enormously.
>
> I have not been able to figure out how to set the NFS mount to async.
> TrueNAS and oVirt both seem to hide the NFS settings, and I haven't found
> where either of them lets me configure this.
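> One read-only way to see what options the domain is actually mounted with
> is to look at /proc/mounts on one of the hosts (oVirt normally mounts NFS
> storage domains under /rhev/data-center/mnt; adjust the filter below if
> your layout differs). A minimal sketch:
>
>     # list NFS mounts and their effective options on a hypervisor host
>     with open("/proc/mounts") as mounts:
>         for line in mounts:
>             device, mountpoint, fstype, options = line.split()[:4]
>             if fstype.startswith("nfs"):
>                 print(mountpoint)
>                 print("    " + options)  # look for sync, rsize/wsize, vers, proto
>
> Note that the client-side sync/async mount option is a separate knob from
> the ZFS sync property on the TrueNAS side; the pool setting you changed
> only affects how the server handles the sync requests it receives.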
>
> I am still not seeing anywhere near the throughput on the disks that I
> would expect.
>
> Here is what happened creating a VM from the same template. The VM was
> created in 2 minutes instead of 30. The graph doesn't show 10x the
> throughput, but that is what I see experientially.
>
> [image: image.png]
>
> This operation did peg the storage network at 10 Gbit/s very briefly, but
> at no point did the hard drives hit as much as 10% of their rated sustained
> throughput.
>
> Do you see room for more tuning, or have I tuned this as far as is
> reasonable?
>
> Thank you
>
>
> On Tue, Jul 12, 2022 at 5:25 AM Jayme <jaymef(a)gmail.com> wrote:
>
>> David,
>>
>> I’m curious what your tests would look like if you mounted the NFS share
>> with async.
>>
>> On Tue, Jul 12, 2022 at 3:02 AM David Johnson <
>> djohnson(a)maxistechnology.com> wrote:
>>
>>> Good morning all,
>>>
>>> I am trying to get the best possible performance out of my cluster.
>>>
>>> Here are the details of what I have now:
>>>
>>> oVirt version: 4.4.10.7-1.el8
>>> Bare metal for the oVirt engine
>>> Two hosts
>>> TrueNAS cluster storage
>>> 1 NFS share
>>> 3 vdevs, 6 drives in raidz2 per vdev
>>> 2 NVMe drives for the SLOG
>>> Storage network is 10 Gbit, all static IP addresses
>>>
>>> Tonight, I built a new VM from a template. It had 5 attached disks
>>> totalling 100 GB. It took 30 minutes to deploy the new VM from the
>>> template.
>>>
>>> Global utilization was 9%.
>>> The SPM had 50% of its memory free and never showed more than 12%
>>> network utilization.
>>>
>>> 62 out of 65 TB are available on the newly created NFS backing store
>>> (no fragmentation). The TrueNAS system is probably overprovisioned for our
>>> use.
>>>
>>> There were peak throughputs of up to 4 GBytes/second (on a 10 Gbit
>>> network), but overall throughput on the NAS and the network was low.
>>> ARC hits were 95 to 100%.
>>> L2ARC hits were 0 to 70%.
>>>
>>> Here's the NFS usage stats:
>>> [image: image.png]
>>>
>>> I believe the first peak is where the SLOG buffered the initial burst
>>> of writes, followed by sustained IO as the VM volumes were built in
>>> parallel, and then finally a taper down to the one 50 GB volume that took
>>> 40 minutes to copy.
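>>> One way to confirm that reading of the graphs is to watch the per-vdev
>>> and log-device activity on the TrueNAS box while a deploy runs, for
>>> example by wrapping zpool iostat (the pool name below is a placeholder):
>>>
>>>     import subprocess
>>>
>>>     # Print per-vdev and log-device throughput every 5 seconds; with -v
>>>     # the log (SLOG) devices are listed separately from the data vdevs.
>>>     subprocess.run(["zpool", "iostat", "-v", "tank", "5"])
>>>
>>> If the log devices stay near idle while the data vdevs crawl, that would
>>> suggest the slowdown is not coming from sync-write latency.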
>>>
>>> The NFS stats graph indicates that the network performance
>>> is just fine.
>>>
>>> Here are the disk IO stats covering the same time frame, plus a bit
>>> before to show an IO spike:
>>>
>>> [image: image.png]
>>> The spike at 22:50 (10 minutes before I started building my VM) shows
>>> that the spinners briefly hit a write speed of almost 20 MBytes per
>>> second, then settled in at a sustained 3 to 4 MBytes per second. The
>>> SLOG absorbs several spikes but remains mostly idle, with activity
>>> measured in kilobytes per second.
>>>
>>> The HGST HUS726060AL5210 drives have a 12 Gbit/s SAS interface and a
>>> rated sustained throughput of 227 MB/s.
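>>> As a rough sanity check on those figures: assuming RAID-Z2 streaming
>>> writes scale roughly with the data disks per vdev (6 minus 2 parity),
>>> both the pool and the 10 Gbit network should sustain far more than the
>>> 3 to 4 MBytes per second the disk graph shows. A back-of-envelope
>>> calculation:
>>>
>>>     # crude upper bounds built from the numbers quoted above
>>>     drive_sustained = 227            # MB/s, per-drive rating
>>>     data_disks_per_vdev = 6 - 2      # raidz2: two parity disks per vdev
>>>     vdevs = 3
>>>     pool_ceiling = drive_sustained * data_disks_per_vdev * vdevs  # ~2724 MB/s
>>>     net_ceiling = 10_000 / 8         # 10 Gbit/s is roughly 1250 MB/s
>>>     observed = 4                     # sustained MB/s seen on the spinners
>>>
>>>     print(f"pool ceiling   ~{pool_ceiling} MB/s")
>>>     print(f"net ceiling    ~{net_ceiling:.0f} MB/s")
>>>     print(f"observed        {observed} MB/s ({observed / net_ceiling:.1%} of net)")
>>>
>>> Even against the tighter network ceiling the sustained rate is well under
>>> one percent, which suggests the limit is not raw disk or network
>>> bandwidth.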
>>>
>>> ------
>>> Now to the questions:
>>> 1. Am I asking on the right list? Does this look like
>>> something where tuning oVirt might make a difference, or is this more
>>> likely a configuration issue with my storage appliances?
>>>
>>> 2. Am I expecting too much? Is this well within the bounds of
>>> acceptable (expected) performance?
>>>
>>> 3. How would I go about identifying the bottleneck, should I need to
>>> dig deeper?
>>>
>>> Thanks,
>>> David Johnson