I noticed this document
https://docs.nvidia.com/vgpu/16.0/grid-vgpu-release-notes-generic-linux-k...
has this to say:
In pass through mode, all GPUs connected to each other through NVLink must be assigned to
the same VM. If a subset of GPUs connected to each other through NVLink is passed through
to a VM, unrecoverable error XID 74 occurs when the VM is booted. This error corrupts the
NVLink state on the physical GPUs and, as a result, the NVLink bridge between the GPUs is
unusable.
You may need to pass through all of the GPUs that are connected to each other over NVLink to the same VM.
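
To work out which GPUs have to go to the VM together, you can look at the topology matrix that `nvidia-smi topo -m` prints on the host, where an `NV#` entry marks a pair of GPUs joined by NVLink. Below is a rough Python sketch that groups the GPUs from that matrix. The function name `nvlink_groups` and the `SAMPLE` matrix are made up for illustration, and the parser assumes the GPU columns come first and in the same order as the GPU rows, so check it against the output on your own host.

```python
import re

def nvlink_groups(topo_text):
    """Group GPUs that are linked (directly or transitively) by NVLink.

    topo_text is assumed to be the text printed by `nvidia-smi topo -m`.
    """
    rows = []
    for line in topo_text.splitlines():
        cells = line.split()
        # Data rows start with a GPU label and contain "X" (self);
        # the header row also starts with a GPU label but has no "X".
        if cells and re.fullmatch(r"GPU\d+", cells[0]) and "X" in cells:
            rows.append((cells[0], cells[1:]))

    names = [name for name, _ in rows]
    adjacent = {name: set() for name in names}
    for name, cells in rows:
        # Assumption: the first len(names) columns are the GPUs, in row order.
        for other, cell in zip(names, cells):
            if re.fullmatch(r"NV\d+", cell):
                adjacent[name].add(other)
                adjacent[other].add(name)

    # Depth-first search to collect connected components.
    groups, seen = [], set()
    for name in names:
        if name in seen:
            continue
        stack, group = [name], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.append(current)
            stack.extend(adjacent[current] - seen)
        groups.append(sorted(group, key=lambda g: int(g[3:])))
    return groups

# Hypothetical matrix for a host with two NVLink-bridged pairs.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV2     PHB     PHB     0-15
GPU1    NV2      X      PHB     PHB     0-15
GPU2    PHB     PHB      X      NV2     0-15
GPU3    PHB     PHB     NV2      X      0-15
"""

print(nvlink_groups(SAMPLE))
```

Each group it returns is a set of GPUs that, going by the release note quoted above, would need to be passed through to the same VM. On a real host the input text could come from `subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout`.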