Question about PCI storage passthrough for a single guest VM

I have configured a host with PCI passthrough for a GPU. Using this knowledge I went ahead and configured PCI passthrough for an NVMe SSD as well. On the guest, I partitioned and mounted the SSD without any issues. Searching Google for this exact setup I only see results about "local storage", where local storage means using a disk image on the host's storage. So I have come here to find out whether there are any concerns or issues with using NVMe PCI passthrough compared to local storage.

Some more detail about the setup: I have 2 identical hosts (each with an NVIDIA GPU and an NVMe PCIe SSD). A few weeks ago, when I started researching converting one of these systems over from native Ubuntu to oVirt with GPU passthrough, I found the information about local storage. Host #1 is set up with local storage mode and the guest VM uses a disk image on that local storage. Host #2 has identical hardware but I did not configure local storage; instead, the oVirt host OS is installed on a SATA HDD and the NVMe SSD is passed through to a different guest instance.

What I notice is that host #2's disk performance is approximately 30% higher than host #1's when running simple dd write tests. So at first glance the NVMe passthrough gives better performance, which is desired, but I have not seen any oVirt documentation stating that this is supported, or any guidelines for configuring such a setup. Aside from the usual caveats of PCI passthrough, are there any other gotchas with this type of setup (NVMe SSD PCI passthrough)? I am trying to discover any unknowns before I use it for real data. I have no previous experience with this, which is my main reason for emailing the group. Any insight appreciated.

Kind regards,
Tony Pearce
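As a general sanity check for this kind of setup (a sketch only, assuming a typical Linux host; 0000:04:00.0 is a placeholder PCI address, not one from this thread), it helps to confirm on the host which PCI device the NVMe controller is and that it sits in its own IOMMU group, since the group is the smallest unit that can be handed to a guest:

# lspci -nn | grep -i 'non-volatile'
# readlink /sys/bus/pci/devices/0000:04:00.0/iommu_group
# ls /sys/bus/pci/devices/0000:04:00.0/iommu_group/devices/

If the last command lists endpoint devices other than the NVMe controller itself, they would all have to be assigned to the same guest along with it.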

You gave some different details in your other post, but here you mention use of GPU passthrough.

Any passthrough will cost you live migration, but unfortunately with GPUs that's just how it is these days: while they could in theory be migrated when the GPUs are identical (because their state is limited to VRAM size), the support code (and kernel interfaces?) simply doesn't exist today. In that scenario a passed-through storage device won't lose you anything you still have.

But you'll have to remember that PCI passthrough works only at the granularity of a whole PCI device. That's fine with an (entire) NVMe drive, because it combines "disk" and "controller"; not so fine with individual disks on a SATA or SCSI controller. And you certainly can't pass through partitions! It gets really fun with cascaded USB, and I haven't tried Thunderbolt either (mostly because I have given up on CentOS 8/oVirt 4.4).

Generally, though, the virtio-scsi interface imposes so little overhead that it only becomes noticeable when you run massive amounts of tiny I/O on NVMe. Play with the block sizes and the sync flag on your dd tests to see the differences; I've had lots of fun (and some disillusions) with that, but mostly with Gluster storage over TCP/IP on Ethernet. If that's really where your bottlenecks are coming from, you may want to look at architecture rather than passthrough.
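To illustrate the "block sizes and sync flag" suggestion, here are a few dd variations one might compare on both setups (a sketch only; /mnt/nvme/testfile is a placeholder path and the sizes are arbitrary):

*1 GiB sequential, dsync after each block (as used in this thread)*
# dd if=/dev/zero of=/mnt/nvme/testfile bs=1G count=1 oflag=dsync

*256 MiB in 4 KiB blocks, dsync per block (stresses per-I/O latency)*
# dd if=/dev/zero of=/mnt/nvme/testfile bs=4k count=65536 oflag=dsync

*1 GiB in 1 MiB blocks, page cache bypassed*
# dd if=/dev/zero of=/mnt/nvme/testfile bs=1M count=1024 oflag=direct

*1 GiB in 1 MiB blocks, single flush at the end*
# dd if=/dev/zero of=/mnt/nvme/testfile bs=1M count=1024 conv=fdatasync

The small-block dsync run is where per-I/O overhead (virtio-scsi vs. passthrough) tends to show up most clearly; the large-block runs mostly measure raw device bandwidth.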

My apologies for duplicating posts - they initially got stuck and I really wanted to reach the group with this query to try and discover unknowns.

Passing through the whole PCI NVMe device is fine, because the VM is locked to the host by the GPU passthrough anyway. I will implement a mechanism to protect the data on the single disk in both cases. I'm not exactly sure what type of disk writes are being used; it's a learning model being trained by the GPUs. I'll try to find out more.

After I finished the config I searched online for a basic throughput test for the disk. Here are the commands and results taken at that time:

*Test on host with "local storage" (using a disk image on the NVMe drive)*
# dd if=/dev/zero of=test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.92561 s, 558 MB/s

*Test on host with NVMe passthrough*
# dd if=/dev/zero of=/mnt/nvme/tmpflag bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.42554 s, 753 MB/s

In both cases the NVMe was used as a mounted additional drive; the OS boots from a different disk image located in a Storage Domain over iSCSI. I'm nothing close to a storage expert, but I understand the gist of what I find when searching about the dd parameters. Since it looks like both configurations will be fine for longevity, I'll aim to test both scenarios live and choose the one that gives the best result for the workload.

Thanks a lot for your reply and help :)
Tony Pearce
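Since the real workload is training data being read and written by the GPUs rather than single 1 GiB sequential writes, a tool like fio can approximate a mixed access pattern more closely than dd. A sketch only, with placeholder path, size and read/write mix, and assuming fio is installed in the guest:

# fio --name=trainmix --filename=/mnt/nvme/fio.test --size=4G \
      --rw=randrw --rwmixread=70 --bs=64k --ioengine=libaio --direct=1 \
      --iodepth=16 --numjobs=2 --runtime=60 --time_based --group_reporting

Running the same job against the disk-image guest and the passthrough guest gives a comparison closer to the actual access pattern than a single large dd.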

You're welcome! The machine learning team members I maintain oVirt for tend to load training data in large sequential batches, which means bandwidth is nice to have. While I give them local SSD storage on the compute nodes, I also give them lots of HDD/VDO-based Gluster file space, which may do miserably on OLTP but pipes out sequential data at rates at least similar to SATA SSDs over a 10 Gbit network. That seems to work for them, because to CUDA applications even RAM is barely faster than block storage.

PCIe 4.0 NVMe at 8 GB/s per device becomes a challenge to any block storage abstraction, inside or outside a VM. And when we are talking about NVMe storage with "native" key-value APIs, like FusionIO did back then, PCI passthrough will be a necessity, unless somebody comes up with a new hardware abstraction layer.
Participants (2):
- Thomas Hoberg
- Tony Pearce