Dear All,
I'm still having the same problems, is this a bug or something that's configured
incorrectly?
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 18 May 2018, at 13:22, Callum Smith
<callum@well.ox.ac.uk> wrote:
Yep, creating the mdev manually works, and in fact, as I said previously, the VM does
actually create an mdev successfully: you can see the UUID of the device, and it is
correctly identifiable through
/sys/class/mdev_bus/${DEVICE_ADDR}/${UUID}/mdev_type/name
To help with matching the logs: the UUID generated is consistently the same (even after
manual deletion), namely "f5dc8396-dad5-3893-9eb4-94eedf60a881".
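For reference, a minimal sketch of that check, with ${DEVICE_ADDR} standing in for the
GPU's PCI address and ${UUID} for the generated UUID (both placeholders, not literal
values from my setup):
# the mdev_type symlink points back at the type backing this mdev;
# its name attribute holds the human-readable profile name
$ cat /sys/class/mdev_bus/${DEVICE_ADDR}/${UUID}/mdev_type/name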
The VM then fails to start because of the MTU issue. Restarting the VM on the node then
produces the "device not available" issue (because the device with the previous UUID
still exists, and the type has max_instance=1). So it's the first VM start, with the MTU
issue, that needs resolving, with the added complication that the MTU (network) issue is
caused by the mdev being set. The same error does not happen when mdev is not set.
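For the record, the stale mdev can be cleared by hand through its sysfs remove
attribute, which frees the instance again; a minimal sketch, using the UUID above:
# writing 1 to the remove attribute tears the mdev down
$ echo 1 > /sys/bus/mdev/devices/f5dc8396-dad5-3893-9eb4-94eedf60a881/remove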
PS. In fact this was the guide I followed, so thank you, Martin, for writing it; without
it, getting this far would have been very difficult:
https://mpolednik.github.io/2017/09/13/vgpu-in-ovirt/
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 18 May 2018, at 13:05, Martin Polednik
<mpolednik@redhat.com> wrote:
On 18/05/18 13:42 +0200, Francesco Romani wrote:
Hi,
On 05/17/2018 10:56 AM, Callum Smith wrote:
In an attempt not to mislead you guys as well: there appears to be a
separate, vGPU-specific issue.
https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0
I've uploaded the full vdsm.log to Dropbox. Most recently I tried
removing all network devices from the VM and booting it, and I get a
different issue around the vGPU:
2018-05-17 09:48:24,806+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 09:48:24,953+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-61 is available. (hooks:110)
2018-05-17 09:48:25,069+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 09:48:25,070+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2862, in _run
    self._custom)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 153, in before_vm_start
    return _runHooksDir(domxml, 'before_vm_start', vmconf=vmconf)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 120, in _runHooksDir
    raise exception.HookError(err)
HookError: Hook Error: ('',)
Despite nvidia-61 being listed as a supported type on the GPU:
https://pastebin.com/bucw21DG
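For anyone reproducing, the supported types and their names can be enumerated with
something like the following (${DEVICE_ADDR} is again a placeholder for the GPU's
PCI address):
# one directory per profile, e.g. nvidia-53, nvidia-61, ...
$ ls /sys/class/mdev_bus/${DEVICE_ADDR}/mdev_supported_types
# human-readable profile name of a given type
$ cat /sys/class/mdev_bus/${DEVICE_ADDR}/mdev_supported_types/nvidia-61/name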
Let's tackle one issue at a time :)
From the shared logs, the VM start failed because of:
2018-05-17 10:11:12,681+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 10:11:12,837+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-53 is available.
Maybe Martin can shed some light here?
Given that the actual slice is available in sysfs (as indicated by one
of the other branches of this thread), I fear we may be facing some
weird issue with the driver itself.
Can you create the mdev manually?
$ uuidgen > \
/sys/class/mdev_bus/${DEVICE_ADDR}/mdev_supported_types/nvidia-61/create
should be enough for a test (the UUID is written into the type's create attribute).
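If the write succeeds, the new mdev should show up under the type's devices directory,
and available_instances should drop accordingly:
# mdevs currently backed by this type (symlinks named by UUID)
$ ls /sys/class/mdev_bus/${DEVICE_ADDR}/mdev_supported_types/nvidia-61/devices
# how many more instances of this type can still be created
$ cat /sys/class/mdev_bus/${DEVICE_ADDR}/mdev_supported_types/nvidia-61/available_instances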
Callum, please share the Vdsm logs showing the network failure.
Best,
--
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org