hi,
with the 2xT4 we haven't seen any trouble. we have no nvlink there.

did u try to disable the nvlink?



Vinícius Ferrão via Users <users@ovirt.org> schrieb am Fr., 4. Sept. 2020, 08:39:
Hello, here we go again.

I’m trying to passthrough 4x NVIDIA Tesla V100 GPUs (with NVLink) to a single VM; but things aren’t that good. Only one GPU shows up on the VM. lspci is able to show the GPUs, but three of them are unusable:

08:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

There are some errors on dmesg, regarding a misconfigured BIOS:

[   27.295972] nvidia: loading out-of-tree module taints kernel.
[   27.295980] nvidia: module license 'NVIDIA' taints kernel.
[   27.295981] Disabling lock debugging due to kernel taint
[   27.304180] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   27.364244] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[   27.579261] nvidia 0000:09:00.0: enabling device (0000 -> 0002)
[   27.579560] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR1 is 0M @ 0x0 (PCI:0000:09:00.0)
[   27.579560] NVRM: The system BIOS may have misconfigured your GPU.
[   27.579566] nvidia: probe of 0000:09:00.0 failed with error -1
[   27.580727] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0a:00.0)
[   27.580729] NVRM: The system BIOS may have misconfigured your GPU.
[   27.580734] nvidia: probe of 0000:0a:00.0 failed with error -1
[   27.581299] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
               NVRM: BAR0 is 0M @ 0x0 (PCI:0000:0b:00.0)
[   27.581300] NVRM: The system BIOS may have misconfigured your GPU.
[   27.581305] nvidia: probe of 0000:0b:00.0 failed with error -1
[   27.581333] NVRM: The NVIDIA probe routine failed for 3 device(s).
[   27.581334] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.51.06  Sun Jul 19 20:02:54 UTC 2020
[   27.649128] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  450.51.06  Sun Jul 19 20:06:42 UTC 2020

The host is Secure Intel Skylake (x86_64). VM is running with Q35 Chipset with UEFI (pc-q35-rhel8.2.0)

I’ve tried to change the I/O mapping options on the host, tried with 56TB and 12TB without success. Same results. Didn’t tried with 512GB since the machine have 768GB of system RAM.

Tried blacklisting the nouveau on the host, nothing.
Installed NVIDIA drivers on the host, nothing.

In the host I can use the 4x V100, but inside a single VM it’s impossible.

Any suggestions?



_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/73CXU27AX6ND6EXUJKBKKRWM6DJH7UL7/