GPU Passthrough issues with oVirt 4.5

Hello, is anyone having issues with device passthrough on oVirt 4.5? I can pass the devices through to a given VM without issue, but inside the VM not all of the devices are recognized. In my case I’ve added 4x GPUs to a VM, but only one shows up, and there are the following errors inside the VM:

[ 23.006655] nvidia 0000:0a:00.0: enabling device (0000 -> 0002)
[ 23.008026] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
             NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 23.008035] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
             NVRM: BAR2 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 23.008040] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
             NVRM: BAR3 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 23.008045] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
             NVRM: BAR4 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 23.008049] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
             NVRM: BAR5 is 0M @ 0x0 (PCI:0000:0a:00.0)
[ 23.012339] NVRM: The NVIDIA GPU 0000:0a:00.0 (PCI ID: 10de:1db1)
             NVRM: installed in this system is not supported by the
             NVRM: NVIDIA 535.54.03 driver release.
             NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
             NVRM: in this release's README, available on the operating system
             NVRM: specific graphics driver download page at www.nvidia.com.
[ 23.016175] nvidia: probe of 0000:0a:00.0 failed with error -1
[ 23.016838] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
             NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0b:00.0)
[ 23.016842] nvidia: probe of 0000:0b:00.0 failed with error -1
[ 23.017211] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
             NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0c:00.0)
[ 23.017215] nvidia: probe of 0000:0c:00.0 failed with error -1
[ 23.017248] NVRM: The NVIDIA probe routine failed for 3 device(s).
[ 23.214409] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.54.03 Tue Jun 6 22:20:39 UTC 2023
[ 23.485704] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[ 23.485708] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 1

On the host this shows up in dmesg, but it seems right:

[ 709.572845] vfio-pci 0000:1a:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 709.572877] vfio-pci 0000:1a:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 709.572883] vfio-pci 0000:1a:00.0: vfio_ecap_init: hiding ecap 0x23@0xac0
[ 710.660813] vfio-pci 0000:1d:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 710.660845] vfio-pci 0000:1d:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 710.660851] vfio-pci 0000:1d:00.0: vfio_ecap_init: hiding ecap 0x23@0xac0
[ 711.748760] vfio-pci 0000:1e:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 711.748791] vfio-pci 0000:1e:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 711.748797] vfio-pci 0000:1e:00.0: vfio_ecap_init: hiding ecap 0x23@0xac0
[ 712.836687] vfio-pci 0000:1c:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 712.836718] vfio-pci 0000:1c:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 712.836725] vfio-pci 0000:1c:00.0: vfio_ecap_init: hiding ecap 0x23@0xac0

Thanks.

There is little chance you'll get much response here, because it's probably not considered an oVirt issue. It's somewhere between your BIOS, the host kernel and KVM, so I'd start by breaking it down and passing each GPU through separately.

From the PCI ID these seem to be V100 SXM2 variants, which very likely require a host with a capable and compatible BIOS. I've only ever tried dual PCIe V100s in a single VM, and that works without any issues on Oracle's RHV 4.4 variant of oVirt.

So you need to check your BIOS and ensure that the host kernel isn't grabbing any of the GPUs, e.g. via Nouveau. Perhaps try running a manual KVM VM to validate that. But if you've already solved the problem, it would be nice to let people know here.
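
To check on the host whether the kernel has grabbed any of the GPUs, something along these lines may help. It is only a rough sketch: it assumes a Linux host with the standard sysfs layout under /sys/bus/pci/devices, it filters purely on the NVIDIA vendor ID (0x10de), and the function name bound_drivers is made up for illustration. Every GPU meant for passthrough would be expected to show vfio-pci once the VM is running, not nouveau or nvidia.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID for NVIDIA


def bound_drivers(sysfs_root="/sys/bus/pci/devices", vendor=NVIDIA_VENDOR_ID):
    """Return {pci_address: driver_name or None} for all devices of `vendor`.

    Hypothetical helper: reads only the `vendor` file and the `driver`
    symlink that sysfs exposes for each PCI device.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result  # not a Linux host, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        vendor_file = dev / "vendor"
        if not vendor_file.is_file():
            continue
        if vendor_file.read_text().strip().lower() != vendor:
            continue
        # `driver` is a symlink to the bound driver's directory;
        # it is absent when no driver is bound to the device.
        driver_link = dev / "driver"
        result[dev.name] = driver_link.resolve().name if driver_link.exists() else None
    return result


if __name__ == "__main__":
    for addr, drv in bound_drivers().items():
        note = "" if drv == "vfio-pci" else "  <-- not bound to vfio-pci"
        print(f"{addr}  {drv or 'no driver'}{note}")
```

If a device shows nouveau here, blacklisting that module on the host is the usual next step. A device with no driver before the VM starts is not necessarily a problem, since the binding to vfio-pci may only happen at VM start.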

participants (2)
- Thomas Hoberg
- Vinícius Ferrão