Dear All,
Some background to help identify this problem and potentially re-create it. Migrating the
VM with no mdev settings to the machine works the first time - the machine boots with all
networks attached and in good status.
Add an mdev of nvidia-xx (one of the supported types) and the machine no longer boots,
failing with the "cannot get interface MTU" network error.
If you then try to start the machine a second time on the same host, you instead get
"nvidia-xx is not available".
If you manually remove the created slice via /sys/class/mdev/*/UUID/delete and then
re-run the VM, you are back to the MTU error. So I infer the following issues are
happening:
- Assigning a GPU mdev appears to have knock-on effects on the network for some reason?
Or the error is misleading. Potentially running out of virtual PCIe lanes?
- A vGPU machine that fails to boot does not release its GPU allocation properly in the
failure scenario (a manual clean-up sketch follows below).
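When the second issue bites, I clear the leftover slice by hand before retrying. A minimal
sketch of that clean-up (assuming the standard kernel mdev sysfs layout, where each
instantiated device exposes a writable 'remove' node under /sys/bus/mdev/devices/<uuid>;
adjust the path if your driver exposes it elsewhere, e.g. the delete node mentioned above):

#!/usr/bin/env python2
# Sketch: list mediated devices left behind on the host and tear them down.
# The path is an assumption based on the standard mdev sysfs layout.
import glob
import os

def list_mdev_devices():
    # UUIDs of all mediated devices currently instantiated on this host
    return [os.path.basename(p) for p in glob.glob('/sys/bus/mdev/devices/*')]

def remove_mdev_device(uuid):
    # Writing 1 to the 'remove' node tears the mediated device down
    with open(os.path.join('/sys/bus/mdev/devices', uuid, 'remove'), 'w') as f:
        f.write('1')

if __name__ == '__main__':
    for uuid in list_mdev_devices():
        print('removing stale mdev %s' % uuid)
        remove_mdev_device(uuid)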
A reminder that logs are available here:
https://www.dropbox.com/s/jf9pwapohn5dq5p/vdsm.gpu2.log?dl=0
They are also attached this time in case Dropbox is an issue.
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 17 May 2018, at 14:28, Callum Smith <callum@well.ox.ac.uk> wrote:
Dear All,
Similar issues with a clean install:
https://www.dropbox.com/s/jf9pwapohn5dq5p/vdsm.gpu2.log?dl=0
Above is the Dropbox link to the log from the clean install. This VM has a custom
"mdev_type" of "nvidia-53", which corresponds to a specific GRID P40-24Q instance.
Looking in /sys/class/mdev_bus/*/ you can even see that a vGPU slice has been correctly
created as part of the machine's boot, but you still get this error:
2018-05-17 14:19:42,757+0100 INFO (vm/1bc9dae8) [root]
/usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type
nvidia-53 is available.
(hooks:110)
2018-05-17 14:19:42,873+0100 INFO (vm/1bc9dae8) [root]
/usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 14:19:42,874+0100 ERROR (vm/1bc9dae8) [virt.vm]
(vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed
(vm:943)
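For anyone trying to reproduce this, this is roughly how I check what the 50_vfio_mdev
hook should be seeing, i.e. how many instances of a given type each physical GPU still
reports as available (just a sketch; it assumes the usual
/sys/class/mdev_bus/<pci-addr>/mdev_supported_types/<type>/available_instances layout):

#!/usr/bin/env python2
# Sketch: report remaining capacity for an mdev type (e.g. nvidia-53) per parent GPU.
import glob
import os
import sys

def available_instances(mdev_type):
    result = {}
    pattern = ('/sys/class/mdev_bus/*/mdev_supported_types/%s/available_instances'
               % mdev_type)
    for path in glob.glob(pattern):
        parent = path.split(os.sep)[4]  # the parent device's PCI address
        with open(path) as f:
            result[parent] = int(f.read().strip())
    return result

if __name__ == '__main__':
    mdev_type = sys.argv[1] if len(sys.argv) > 1 else 'nvidia-53'
    for parent, free in sorted(available_instances(mdev_type).items()):
        print('%s: %d instance(s) of %s available' % (parent, free, mdev_type))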
Thanks all for your input.
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 17 May 2018, at 14:05, Callum Smith <callum@well.ox.ac.uk> wrote:
Dear Yaniv,
Please see my most recent response:
https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0
I'm doing a clean install of the host right now to see whether the exact same procedure
produces different results the second time around (this way lies madness, but our bosses
are excited about vGPUs on oVirt).
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 17 May 2018, at 14:02, Yaniv Kaul <ykaul@redhat.com> wrote:
It'd be easier if you could share the complete vdsm log.
Perhaps file a bug and we can investigate it?
Y.
On Thu, May 17, 2018 at 11:25 AM, Callum Smith <callum@well.ox.ac.uk> wrote:
Some log entries that appear to be from around the time the host was added to the cluster:
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X
libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist.
firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F
libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist.
firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -L
libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist.
firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o
vnet0 -j libvirt-O-vnet0' failed: Illegal target name 'libvirt-O-vnet0'.
firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -X HI-vnet0' failed:
ip6tables: No chain/target/match by that name.
firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -F HI-vnet0' failed:
ip6tables: No chain/target/match by that name.
firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -X FI-vnet0' failed:
ip6tables: No chain/target/match by that name.
firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -F FI-vnet0' failed:
ip6tables: No chain/target/match by that name.
firewalld
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 17 May 2018, at 09:20, Callum Smith <callum@well.ox.ac.uk> wrote:
PS. Some other WARNs that come up on the host:
WARN File:
/var/lib/libvirt/qemu/channels/1bc9dae8-a0ea-44b3-9103-5805100648d0.org.qemu.guest_agent.0
already removed
vdsm
WARN Attempting to remove a non existing net user:
ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0
vdsm
WARN Attempting to remove a non existing network:
ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0
vdsm
WARN File:
/var/lib/libvirt/qemu/channels/1bc9dae8-a0ea-44b3-9103-5805100648d0.ovirt-guest-agent.0
already removed
vdsm
WARN Attempting to add an existing net user:
ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0
vdsm
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 17 May 2018, at 09:16, Callum Smith <callum@well.ox.ac.uk> wrote:
The OVN network provider is used, and the node is running 4.2.3 (specifically a clean
install of build 2018051606 last night).
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
On 17 May 2018, at 07:47, Ales Musil <amusil@redhat.com> wrote:
On Thu, May 17, 2018 at 12:01 AM, Callum Smith <callum@well.ox.ac.uk> wrote:
Dear All,
Our vGPU installation is progressing, though the VM is failing to start.
2018-05-16 22:57:34,328+0100 ERROR (vm/1bc9dae8) [virt.vm]
(vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed
(vm:943)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in
_startUnderlyingVm
self._run()
File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in _run
dom.createWithFlags(flags)
File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line
130, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in
wrapper
return func(inst, *args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in
createWithFlags
if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed',
dom=self)
libvirtError: Cannot get interface MTU on '': No such device
That's the specific error; some other information: the GPU 'allocation' of a UUID against
the nvidia-xx mdev type seems to be proceeding correctly, and the device is being created
when the VM is instantiated, but the VM does not succeed in coming up, failing with this
error. Are there any other logs or information that would help diagnose this?
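In case it helps, the error reads as libvirt asking the kernel for the MTU of a host
interface whose name is empty, i.e. the vNIC's source bridge was never resolved. This is
roughly how I check that the bridge a logical network maps to actually exists on the host
(a sketch only; 'ovirtmgmt' is just a placeholder for whichever network the VM uses):

#!/usr/bin/env python2
# Sketch: confirm a host bridge exists and report its MTU via /sys/class/net.
import os
import sys

def bridge_mtu(name):
    # Returns the MTU of a host network device, or None if the device does not exist
    path = os.path.join('/sys/class/net', name, 'mtu')
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())

if __name__ == '__main__':
    bridge = sys.argv[1] if len(sys.argv) > 1 else 'ovirtmgmt'
    mtu = bridge_mtu(bridge)
    if mtu is None:
        print('%s does not exist on this host (matches the "No such device" error)' % bridge)
    else:
        print('%s exists, MTU %d' % (bridge, mtu))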
Regards,
Callum
--
Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Hi Callum,
can you share the version details of your setup?
Also, do you use the OVS switch type in the cluster?
Regards,
Ales.
--
ALES MUSIL
INTERN - rhv network
Red Hat
EMEA - https://www.redhat.com/
amusil@redhat.com    IM: amusil