Yes, we did the analysis on this.
The on-disk consistency of the underlying ZFS RAID-Z file system is
sufficiently robust to meet our production needs today.
In this cluster, our workload is such that the loss of a few minutes of data
is not a business issue. Any event severe enough to cause data loss would
require re-running the jobs anyway.
Since our cluster is colocated in the power company's data center, if we
see a power loss we have much bigger problems. The mishap that started all
of this was what looked like a power loss but turned out to be a failing
motherboard. From bitter past experience, nothing protects against a
failing motherboard or RAID controller except frequent backups, and even
then you can lose up to a day's worth of data.
The long-term plan is to move to a Cinder-based system once we have the
resources for a dedicated platform administrator. NFS on TrueNAS is a
stopgap that we selected primarily for its ease of administration. After
consulting here and on the iXsystems lists, we concluded that TrueNAS is
not the right platform for iSCSI with write-intensive virtual machines.
Thank you!
On Thu, Jul 14, 2022 at 8:30 AM Jayme <jaymef(a)gmail.com> wrote:
David,
You should keep in mind that disabling sync on NFS can be dangerous. I
suggested trying it to see whether it made a difference in your tests, but
it may not be advisable for production use because it can lead to data
corruption in certain cases, such as a power failure.
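If you want to put a number on what sync writes are costing before deciding,
a crude probe like the sketch below can help. It times a run of buffered
writes against fsync-after-every-write on a file inside the NFS mount. The
path is a placeholder, and a real benchmark tool such as fio would give more
trustworthy numbers, but the gap between the two figures is telling:

    import os
    import time

    PATH = "/path/to/nfs/mount/syncprobe.bin"  # placeholder: point at the NFS-backed storage
    BLOCK = b"\0" * (128 * 1024)               # 128 KiB per write
    COUNT = 200                                # ~25 MiB total

    def timed_write(fsync_each_write):
        start = time.monotonic()
        with open(PATH, "wb") as f:
            for _ in range(COUNT):
                f.write(BLOCK)
                if fsync_each_write:
                    f.flush()
                    os.fsync(f.fileno())       # force each block to stable storage
        elapsed = time.monotonic() - start
        os.remove(PATH)
        return (len(BLOCK) * COUNT) / (1024 * 1024) / elapsed  # MiB/s

    print(f"buffered writes : {timed_write(False):8.1f} MiB/s")
    print(f"fsync per write : {timed_write(True):8.1f} MiB/s")

A large gap points at sync-write latency as the limiter (which a SLOG or
sync=disabled hides); a small gap would point somewhere else.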
On Wed, Jul 13, 2022 at 10:39 PM David Johnson <
djohnson(a)maxistechnology.com> wrote:
> I have changed the TrueNAS pool setting to synchronous writes = disabled,
> and it improved throughput enormously.
>
> I have not been able to figure out how to set the NFS mount to async.
> TrueNAS and oVirt both seem to hide the NFS settings, and I haven't found
> where either of them lets me configure this.
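> One read-only way to see what options the domain is actually mounted with
> is to look at /proc/mounts on one of the hosts (oVirt normally mounts NFS
> storage domains under /rhev/data-center/mnt; adjust the filter below if
> your layout differs). A minimal sketch:
>
>     # list NFS mounts and their effective options on a hypervisor host
>     with open("/proc/mounts") as mounts:
>         for line in mounts:
>             device, mountpoint, fstype, options = line.split()[:4]
>             if fstype.startswith("nfs"):
>                 print(mountpoint)
>                 print("    " + options)  # look for sync, rsize/wsize, vers, proto
>
> Note that the client-side sync/async mount option is a separate knob from
> the ZFS sync property on the TrueNAS side; the pool setting you changed
> only affects how the server handles the sync requests it receives.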
>
> I am still not seeing anywhere near the throughput on the disks that I
> would expect.
>
> Here is what happened creating a VM from the same template. The VM was
> created in 2 minutes instead of 30. The graph doesn't show 10x the
> throughput, but that is what I see experientially.
>
> [image: image.png]
>
> This operation did peg the storage network at 10 Gbit/s very briefly, but
> at no point did the hard drives hit as much as 10% of their rated sustained
> throughput.
>
> Do you see room for more tuning, or have I tuned this as far as is
> reasonable?
>
> Thank you
>
>
> On Tue, Jul 12, 2022 at 5:25 AM Jayme <jaymef(a)gmail.com> wrote:
>
>> David,
>>
>> I’m curious what your tests would look like if you mounted the NFS share
>> with async.
>>
>> On Tue, Jul 12, 2022 at 3:02 AM David Johnson <
>> djohnson(a)maxistechnology.com> wrote:
>>
>>> Good morning all,
>>>
>>> I am trying to get the best possible performance out of my cluster.
>>>
>>> Here are the details of what I have now:
>>>
>>> oVirt version: 4.4.10.7-1.el8
>>> Bare metal for the oVirt engine
>>> Two hosts
>>> TrueNAS cluster storage
>>> 1 NFS share
>>> 3 vdevs, 6 drives in raidz2 per vdev
>>> 2 NVMe drives for the SLOG
>>> Storage network is 10 Gbit, all static IP addresses
>>>
>>> Tonight, I built a new VM from a template. It had 5 attached disks
>>> totalling 100 GB. It took 30 minutes to deploy the new VM from the
>>> template.
>>>
>>> Global utilization was 9%.
>>> The SPM had 50% of its memory free and never showed more than 12%
>>> network utilization.
>>>
>>> 62 out of 65 TB are available on the newly created NFS backing store
>>> (no fragmentation). The TrueNAS system is probably overprovisioned for our
>>> use.
>>>
>>> There were peak throughputs of up to 4 GBytes/second (on a 10 Gbit
>>> network), but overall throughput on the NAS and the network was low.
>>> ARC hits were 95 to 100%.
>>> L2ARC hits were 0 to 70%.
>>>
>>> Here's the NFS usage stats:
>>> [image: image.png]
>>>
>>> I believe the first peak is where the SLOG buffered the initial burst
>>> of writes, followed by sustained IO as the VM volumes were built in
>>> parallel, and then finally a taper down to the one 50 GB volume that took
>>> 40 minutes to copy.
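>>> One way to confirm that reading of the graphs is to watch the per-vdev
>>> and log-device activity on the TrueNAS box while a deploy runs, for
>>> example by wrapping zpool iostat (the pool name below is a placeholder):
>>>
>>>     import subprocess
>>>
>>>     # Print per-vdev and log-device throughput every 5 seconds; with -v
>>>     # the log (SLOG) devices are listed separately from the data vdevs.
>>>     subprocess.run(["zpool", "iostat", "-v", "tank", "5"])
>>>
>>> If the log devices stay near idle while the data vdevs crawl, that would
>>> suggest the slowdown is not coming from sync-write latency.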
>>>
>>> The NFS stats graph indicates that the network performance
>>> is just fine.
>>>
>>> Here are the disk IO stats covering the same time frame, plus a bit
>>> before to show an IO spike:
>>>
>>> [image: image.png]
>>> The spike at 22:50 (10 minutes before I started building my VM) shows
>>> that the spinners briefly hit a write speed of almost 20 MBytes per
>>> second, then settled in at a sustained 3 to 4 MBytes per second. The
>>> SLOG absorbs several spikes but remains mostly idle, with activity
>>> measured in kilobytes per second.
>>>
>>> The HGST HUS726060AL5210 drives have a 12 Gbit/s SAS interface and a
>>> rated sustained throughput of 227 MB/s.
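>>> As a rough sanity check on those figures: assuming RAID-Z2 streaming
>>> writes scale roughly with the data disks per vdev (6 minus 2 parity),
>>> both the pool and the 10 Gbit network should sustain far more than the
>>> 3 to 4 MBytes per second the disk graph shows. A back-of-envelope
>>> calculation:
>>>
>>>     # crude upper bounds built from the numbers quoted above
>>>     drive_sustained = 227            # MB/s, per-drive rating
>>>     data_disks_per_vdev = 6 - 2      # raidz2: two parity disks per vdev
>>>     vdevs = 3
>>>     pool_ceiling = drive_sustained * data_disks_per_vdev * vdevs  # ~2724 MB/s
>>>     net_ceiling = 10_000 / 8         # 10 Gbit/s is roughly 1250 MB/s
>>>     observed = 4                     # sustained MB/s seen on the spinners
>>>
>>>     print(f"pool ceiling   ~{pool_ceiling} MB/s")
>>>     print(f"net ceiling    ~{net_ceiling:.0f} MB/s")
>>>     print(f"observed        {observed} MB/s ({observed / net_ceiling:.1%} of net)")
>>>
>>> Even against the tighter network ceiling the sustained rate is well under
>>> one percent, which suggests the limit is not raw disk or network
>>> bandwidth.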
>>>
>>> ------
>>> Now to the questions:
>>> 1. Am I asking on the right list? Does this look like
>>> something where tuning oVirt might make a difference, or is this more
>>> likely a configuration issue with my storage appliances?
>>>
>>> 2. Am I expecting too much? Is this well within the bounds of
>>> acceptable (expected) performance?
>>>
>>> 3. How would I go about identifying the bottleneck, should I need to
>>> dig deeper?
>>>
>>> Thanks,
>>> David Johnson