Fwd: vGPU VM not starting


Ales Musil

17 May 2018

8:47 a.m.

On Thu, May 17, 2018 at 12:01 AM, Callum Smith <callum@well.ox.ac.uk> wrote:

...

Dear All,

Our vGPU installation is progressing, though the VM is failing to start.

2018-05-16 22:57:34,328+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Cannot get interface MTU on '': No such device

That is the specific error. Some other information: the GPU 'allocation' of a UUID against the nvidia-xx mdev type appears to proceed correctly, and the device is created when the VM is instantiated, but the VM fails to come up with this error. Are there any other logs or information that would help diagnose this?

Regards, Callum

--

Callum Smith
Research Computing Core
Wellcome Trust Centre for Human Genetics
University of Oxford
e. callum@well.ox.ac.uk

_______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-leave@ovirt.org

Hi Callum,

can you share your version of the setup? Also do you use OVS switch type in the cluster?

Regards,
Ales.

--
ALES MUSIL
INTERN - rhv network
Red Hat EMEA <https://www.redhat.com/>
amusil@redhat.com IM: amusil <https://red.ht/sig>
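
For anyone following along, here is a minimal sketch of how entries like the ERROR line quoted above could be pulled out of a vdsm log for a single VM. The regular expression is inferred only from the log lines shown in this thread, so other vdsm versions may format lines differently. The names VDSM_LINE and entries_for_vm are hypothetical helpers, not part of vdsm.

```python
import re

# Header layout assumed from the lines quoted in this thread:
# "<date> <time>,<ms><tz> <LEVEL> (<thread>) [<logger>] <message>"
VDSM_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}[+-]\d{4}) "
    r"(?P<level>[A-Z]+)\s+"
    r"\((?P<thread>[^)]*)\) "
    r"\[(?P<logger>[^\]]*)\] "
    r"(?P<message>.*)$"
)


def entries_for_vm(lines, vm_id, levels=("ERROR", "WARN")):
    """Return parsed entries at the given levels whose message mentions vm_id."""
    found = []
    for line in lines:
        match = VDSM_LINE.match(line.rstrip("\n"))
        if match is None:
            # Traceback bodies and other continuation lines have no header.
            continue
        if match.group("level") in levels and vm_id in match.group("message"):
            found.append(match.groupdict())
    return found
```

Passing an open log file (for example `open("/var/log/vdsm/vdsm.log")`, assuming the default location) as `lines` together with the full VM UUID would return one dictionary per matching entry. Traceback bodies are skipped, so the surrounding lines still need to be read in the log itself.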


Callum Smith

17 May

10:16 a.m.

OVN Network provider is used, and the node is running 4.2.3 (specifically 2018051606 clean install last night).

Regards, Callum

Callum Smith

10:20 a.m.

New subject: vGPU VM not starting

PS. Some other WARNs that come up on the host:

WARN File: /var/lib/libvirt/qemu/channels/1bc9dae8-a0ea-44b3-9103-5805100648d0.org.qemu.guest_agent.0 already removed vdsm
WARN Attempting to remove a non existing net user: ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0 vdsm
WARN Attempting to remove a non existing network: ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0 vdsm
WARN File: /var/lib/libvirt/qemu/channels/1bc9dae8-a0ea-44b3-9103-5805100648d0.ovirt-guest-agent.0 already removed vdsm
WARN Attempting to add an existing net user: ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0 vdsm

Regards, Callum

Callum Smith

10:25 a.m.

New subject: vGPU VM not starting

Some information that appears to be from around the time of installation to the cluster:

WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -L libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o vnet0 -j libvirt-O-vnet0' failed: Illegal target name 'libvirt-O-vnet0'. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -X HI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -F HI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -X FI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -F FI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld

Regards, Callum

Yaniv Kaul

3:02 p.m.

New subject: vGPU VM not starting

It'd be easier if you could share the complete vdsm log. Perhaps file a bug and we can investigate it?

Y.

On Thu, May 17, 2018 at 11:25 AM, Callum Smith <callum@well.ox.ac.uk> wrote:

...


Callum Smith

3:05 p.m.

New subject: vGPU VM not starting

Dear Yaniv,

Please see my most recent response: https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0

I'm doing a clean install of the host right now to see if doing the exact same procedure a second time produces different results (this way lies madness, but our bosses are excited about vGPUs on oVirt).

Regards, Callum

Callum Smith

3:28 p.m.

New subject: vGPU VM not starting

Dear All,

Similar issues with a clean install: https://www.dropbox.com/s/jf9pwapohn5dq5p/vdsm.gpu2.log?dl=0

Above is the Dropbox link to the log from the clean install. This VM has a custom "mdev_type" of "nvidia-53", which relates to a specific GRID P40-24Q instance. Even looking in /sys/class/mdev_bus/*/ you can see that a vGPU slice has been correctly created as part of the boot of the machine, but you still get this error:

2018-05-17 14:19:42,757+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-53 is available. (hooks:110)
2018-05-17 14:19:42,873+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 14:19:42,874+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)

Thanks all for your input.

Regards, Callum
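
The hook message "No device with type nvidia-53 is available" can be cross-checked against sysfs. What follows is a minimal sketch, assuming the layout described in the kernel's VFIO mediated device documentation, where each parent under /sys/class/mdev_bus/ exposes mdev_supported_types/<type>/available_instances. The function name and the bus_root parameter are hypothetical; the parameter exists only so the sketch can be pointed at a test directory.

```python
from pathlib import Path


def mdev_type_availability(mdev_type, bus_root="/sys/class/mdev_bus"):
    """Map each parent device name to its available_instances count for mdev_type."""
    availability = {}
    root = Path(bus_root)
    if not root.is_dir():
        # No parent devices with mediated-device support are registered.
        return availability
    for parent in sorted(root.iterdir()):
        counter = parent / "mdev_supported_types" / mdev_type / "available_instances"
        if counter.is_file():
            availability[parent.name] = int(counter.read_text().strip())
    return availability
```

Calling `mdev_type_availability("nvidia-53")` on the host returns a dictionary keyed by parent PCI address. A count of 0 on every parent would be consistent with a slice already existing on the card, though that is an inference and not something the logs in this thread confirm.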

Callum Smith

18 May

10:24 a.m.

New subject: vGPU VM not starting

Dear All,

Some background to help identify this problem and potentially re-create it.

Migrating the VM with no mdev settings to the machine the first time works: the machine boots with all networks attached and in good status. Add an mdev of nvidia-xx (one of the supported types) and the machine does not boot, and gives the network error about the missing MTU. If you then try to start the machine a second time on the same host, you get "nvidia-xx is not available". If you manually remove the slice via /sys/class/mdev_bus/*/UUID/remove and then re-run the VM, you go back to the MTU error.

So I infer the following issues are happening:

- Assigning a GPU mdev appears to cause knock-on effects to the network for some reason? Or the error is wrong. Potentially running out of virtual PCIe lanes?
- A vGPU machine that fails to boot is not removing its GPU allocation properly in a failure scenario.

A reminder that logs are available here: https://www.dropbox.com/s/jf9pwapohn5dq5p/vdsm.gpu2.log?dl=0

They are also attached this time in case Dropbox is an issue.

Regards, Callum

--

Callum Smith Research Computing Core Wellcome Trust Centre for Human Genetics University of Oxford e. callum@well.ox.ac.uk

On 17 May 2018, at 14:28, Callum Smith <callum@well.ox.ac.uk> wrote:

Dear All,

Similar issues with a clean install: https://www.dropbox.com/s/jf9pwapohn5dq5p/vdsm.gpu2.log?dl=0

Above is the Dropbox link to the log of the clean install. This VM has a custom "mdev_type" of "nvidia-53", which relates to a specific GRID P40-24Q instance. Even looking in /sys/class/mdev_bus/*/ you can see that a vGPU slice has been correctly created as part of the boot of the machine, but you still get this error:

2018-05-17 14:19:42,757+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-53 is available. (hooks:110)
2018-05-17 14:19:42,873+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 14:19:42,874+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)

Thanks all for your input.

Regards, Callum

On 17 May 2018, at 14:05, Callum Smith <callum@well.ox.ac.uk> wrote:

Dear Yaniv,

Please see my most recent response: https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0

I'm doing a clean install of the host right now to see if doing the exact same procedure a second time produces different results (this way lies madness, but we have excited bosses about vGPUs on oVirt).

Regards, Callum

On 17 May 2018, at 14:02, Yaniv Kaul <ykaul@redhat.com> wrote:

It'd be easier if you could share the complete vdsm log. Perhaps file a bug and we can investigate it?

Y.

On Thu, May 17, 2018 at 11:25 AM, Callum Smith <callum@well.ox.ac.uk> wrote:

Some information that appears to be from around the time of installation to the cluster:

WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -X libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -F libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -L libvirt-O-vnet0' failed: Chain 'libvirt-O-vnet0' doesn't exist. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ebtables --concurrent -t nat -D POSTROUTING -o vnet0 -j libvirt-O-vnet0' failed: Illegal target name 'libvirt-O-vnet0'. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -X HI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -F HI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -X FI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld
WARNING: COMMAND_FAILED: '/usr/sbin/ip6tables -w2 -w -F FI-vnet0' failed: ip6tables: No chain/target/match by that name. firewalld

Regards, Callum

Ales Musil

17 May

10:28 a.m.

New subject: vGPU VM not starting

Seems like some vdsm problem with XML generation.

+Francesco

On Thu, May 17, 2018 at 10:20 AM, Callum Smith <callum@well.ox.ac.uk> wrote:

...

PS. some other WARN's that come up on the host:

WARN File: /var/lib/libvirt/qemu/channels/1bc9dae8-a0ea-44b3-9103-5805100648d0.org.qemu.guest_agent.0 already removed vdsm
WARN Attempting to remove a non existing net user: ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0 vdsm
WARN Attempting to remove a non existing network: ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0 vdsm
WARN File: /var/lib/libvirt/qemu/channels/1bc9dae8-a0ea-44b3-9103-5805100648d0.ovirt-guest-agent.0 already removed vdsm
WARN Attempting to add an existing net user: ovirtmgmt/1bc9dae8-a0ea-44b3-9103-5805100648d0 vdsm

Regards, Callum

--

Callum Smith Research Computing Core Wellcome Trust Centre for Human Genetics University of Oxford e. callum@well.ox.ac.uk

On 17 May 2018, at 09:16, Callum Smith <callum@well.ox.ac.uk> wrote:

OVN Network provider is used, and the node is running 4.2.3 (specifically 2018051606 clean install last night).

Regards, Callum

--

Callum Smith Research Computing Core Wellcome Trust Centre for Human Genetics University of Oxford e. callum@well.ox.ac.uk

On 17 May 2018, at 07:47, Ales Musil <amusil@redhat.com> wrote:

On Thu, May 17, 2018 at 12:01 AM, Callum Smith <callum@well.ox.ac.uk> wrote:

...
Dear All,

Our vGPU installation is progressing, though the VM is failing to start.

2018-05-16 22:57:34,328+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2872, in _run
    dom.createWithFlags(flags)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1099, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirtError: Cannot get interface MTU on '': No such device

That's the specific error, some other information. It seems the GPU 'allocation' of uuid against the nvidia-xx mdev type is proceeding correctly, and the device is being created by the VM instantiation but the VM does not succeed in going up with this error. Any other logs or information relevant to help diagnose?

Regards, Callum

--

Callum Smith Research Computing Core Wellcome Trust Centre for Human Genetics University of Oxford e. callum@well.ox.ac.uk

_______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-leave@ovirt.org

Hi Callum,

can you share your version of the setup?

Also do you use OVS switch type in the cluster?

Regards, Ales.

-- ALES MUSIL INTERN - rhv network Red Hat EMEA <https://www.redhat.com/>

amusil@redhat.com IM: amusil <https://red.ht/sig>

_______________________________________________ Users mailing list -- users@ovirt.org To unsubscribe send an email to users-leave@ovirt.org

-- ALES MUSIL INTERN - rhv network Red Hat EMEA <https://www.redhat.com/> amusil@redhat.com IM: amusil <https://red.ht/sig>
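For readers following the XML-generation theory: the empty interface name in "Cannot get interface MTU on ''" would be consistent with an <interface> element in the generated domain XML that has no usable source name, although nothing in the thread confirms this. Below is a minimal Python sketch for checking a dumped domain XML for that condition. The function name and the sample XML are hypothetical, and the choice of attributes inspected (bridge, network, dev) is an assumption about where the empty name would appear.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second interface has an empty bridge name.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <interface type='bridge'>
      <mac address='00:1a:4a:16:01:01'/>
      <source bridge='ovirtmgmt'/>
    </interface>
    <interface type='bridge'>
      <mac address='00:1a:4a:16:01:02'/>
      <source bridge=''/>
    </interface>
  </devices>
</domain>
"""


def interfaces_without_source(domain_xml):
    """Return the MAC address ('?' if absent) of every <interface>
    whose <source> is missing or names no bridge, network or dev."""
    root = ET.fromstring(domain_xml)
    suspects = []
    for iface in root.findall("./devices/interface"):
        source = iface.find("source")
        names = []
        if source is not None:
            names = [source.get(key) for key in ("bridge", "network", "dev")]
        # any() is False for an empty list and for only None/'' values.
        if not any(names):
            mac = iface.find("mac")
            suspects.append(mac.get("address") if mac is not None else "?")
    return suspects


if __name__ == "__main__":
    print(interfaces_without_source(SAMPLE_DOMAIN_XML))
```

On a real host the XML to feed in would be the domain definition that vdsm passes to libvirt, which vdsm.log normally records at VM start.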

Callum Smith

10:56 a.m.

New subject: vGPU VM not starting

In an attempt not to mislead you guys as well, there appears to be a separate, vGPU-specific issue.

https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0

I've uploaded the full vdsm.log to Dropbox. Most recently I tried unmounting all network devices from the VM and booting it, and I get a different issue around the vGPU:

2018-05-17 09:48:24,806+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 09:48:24,953+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-61 is available. (hooks:110)
2018-05-17 09:48:25,069+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 09:48:25,070+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2862, in _run
    self._custom)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 153, in before_vm_start
    return _runHooksDir(domxml, 'before_vm_start', vmconf=vmconf)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 120, in _runHooksDir
    raise exception.HookError(err)
HookError: Hook Error: ('',)

Despite nvidia-61 being an option on the GPU: https://pastebin.com/bucw21DG

So I think we have two issues here, one relating to the network and one to the GPU.

Thanks all for your rapid and very useful help!

Regards, Callum

--

Callum Smith Research Computing Core Wellcome Trust Centre for Human Genetics University of Oxford e. callum@well.ox.ac.uk
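When scanning a long vdsm.log for the hook that aborted a VM start, it can help to pull out only the hook result lines with a non-zero rc. The following Python sketch does that for lines shaped like the excerpts quoted in this thread. The regular expression is inferred from those excerpts only and is an assumption about the log format in general; the function name is hypothetical.

```python
import os
import re

# Pattern inferred from the "(hooks:110)" lines quoted in this thread.
HOOK_LINE = re.compile(
    r"^(?P<ts>\S+ \S+) \w+\s+\((?P<thread>[^)]*)\) \[root\] "
    r"(?P<hook>/\S+): rc=(?P<rc>\d+) err=(?P<err>.*?)\s*\(hooks:\d+\)\s*$"
)


def failed_hooks(lines):
    """Return (hook_name, rc, err) for every hook line with rc != 0."""
    failures = []
    for line in lines:
        match = HOOK_LINE.match(line.strip())
        if match and int(match.group("rc")) != 0:
            failures.append(
                (
                    os.path.basename(match.group("hook")),
                    int(match.group("rc")),
                    match.group("err"),
                )
            )
    return failures


if __name__ == "__main__":
    # Path is the usual vdsm log location; adjust as needed.
    with open("/var/log/vdsm/vdsm.log") as log:
        for name, rc, err in failed_hooks(log):
            print(name, rc, err)
```

Lines that do not match the pattern (tracebacks, other loggers) are simply skipped, so the function can be fed the whole file.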

Francesco Romani

18 May

1:42 p.m.

New subject: vGPU VM not starting

Hi, On 05/17/2018 10:56 AM, Callum Smith wrote:

...

In an attempt not to mislead you guys as well, there appears to be a separate, vGPU specific, issue.

https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0

I've uploaded the full vdsm.log to Dropbox. Most recently I tried unmounting all network devices from the VM and booting it, and I get a different issue around the vGPU:

2018-05-17 09:48:24,806+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 09:48:24,953+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-61 is available. (hooks:110)
2018-05-17 09:48:25,069+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 09:48:25,070+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2862, in _run
    self._custom)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 153, in before_vm_start
    return _runHooksDir(domxml, 'before_vm_start', vmconf=vmconf)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 120, in _runHooksDir
    raise exception.HookError(err)
HookError: Hook Error: ('',)

Despite the nvidia-61 being an option on the GPU: https://pastebin.com/bucw21DG

Let's tackle one issue at a time :)

From the shared logs, the VM start failed because of

2018-05-17 10:11:12,681+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 10:11:12,837+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-53 is available.

Maybe Martin can shed some light here?

Callum, please share Vdsm logs showing the network failure.

Bests,

-- Francesco Romani Senior SW Eng., Virtualization R&D Red Hat IRC: fromani github: @fromanirh

Martin Polednik

2:05 p.m.

New subject: vGPU VM not starting

On 18/05/18 13:42 +0200, Francesco Romani wrote:

...

Hi,

On 05/17/2018 10:56 AM, Callum Smith wrote:

...
In an attempt not to mislead you guys as well, there appears to be a separate, vGPU specific, issue.

https://www.dropbox.com/s/hlymmf9d6rn12tq/vdsm.vgpu.log?dl=0

I've uploaded the full vdsm.log to Dropbox. Most recently I tried unmounting all network devices from the VM and booting it, and I get a different issue around the vGPU:

2018-05-17 09:48:24,806+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 09:48:24,953+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-61 is available. (hooks:110)
2018-05-17 09:48:25,069+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vhostmd: rc=0 err= (hooks:110)
2018-05-17 09:48:25,070+0100 ERROR (vm/1bc9dae8) [virt.vm] (vmId='1bc9dae8-a0ea-44b3-9103-5805100648d0') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2862, in _run
    self._custom)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 153, in before_vm_start
    return _runHooksDir(domxml, 'before_vm_start', vmconf=vmconf)
  File "/usr/lib/python2.7/site-packages/vdsm/common/hooks.py", line 120, in _runHooksDir
    raise exception.HookError(err)
HookError: Hook Error: ('',)

Despite the nvidia-61 being an option on the GPU: https://pastebin.com/bucw21DG

Let's tackle one issue at a time :) From the shared logs, the VM start failed because of

2018-05-17 10:11:12,681+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_hostedengine: rc=0 err= (hooks:110)
2018-05-17 10:11:12,837+0100 INFO (vm/1bc9dae8) [root] /usr/libexec/vdsm/hooks/before_vm_start/50_vfio_mdev: rc=1 err=vgpu: No device with type nvidia-53 is available.

maybe Martin can shed some light here?

Given that the actual slice is available in sysfs (as indicated by one of the other branches of this thread), I fear we may be facing some weird issue with the driver itself. Can you create the mdev manually?

$ uuidgen > /sys/class/mdev_bus/${DEVICE_ADDR}/mdev_supported_types/nvidia-61/create

should be enough for a test.
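The manual test can also be scripted. The Python sketch below assumes the kernel's mediated-device sysfs layout, in which each type directory under mdev_supported_types exposes an available_instances attribute and a create attribute that accepts a UUID. The function names are hypothetical, and the parent directory argument stands for /sys/class/mdev_bus/${DEVICE_ADDR} on the host.

```python
import os
import uuid


def read_attr(path):
    """Read a sysfs attribute and strip the trailing newline."""
    with open(path) as attr:
        return attr.read().strip()


def available_types(parent_dir):
    """Map each supported mdev type to its available_instances count."""
    types_dir = os.path.join(parent_dir, "mdev_supported_types")
    counts = {}
    for mdev_type in sorted(os.listdir(types_dir)):
        attr = os.path.join(types_dir, mdev_type, "available_instances")
        counts[mdev_type] = int(read_attr(attr))
    return counts


def create_mdev(parent_dir, mdev_type, mdev_uuid=None):
    """Create one mdev of the given type by writing a UUID to 'create'.

    Raises RuntimeError when the type reports no free instances, which
    is the situation the vdsm hook reports as 'No device with type ...
    is available'.
    """
    if available_types(parent_dir).get(mdev_type, 0) < 1:
        raise RuntimeError("no free instance of type %s" % mdev_type)
    mdev_uuid = mdev_uuid or str(uuid.uuid4())
    create_path = os.path.join(
        parent_dir, "mdev_supported_types", mdev_type, "create"
    )
    with open(create_path, "w") as attr:
        attr.write(mdev_uuid)
    return mdev_uuid
```

Checking available_instances first separates "the type is unsupported or exhausted" from "the driver refused the create", which is the distinction this part of the thread is trying to make.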

...

Callum, please share Vdsm logs showing the network failure

Bests,

-- Francesco Romani Senior SW Eng., Virtualization R&D Red Hat IRC: fromani github: @fromanirh

Callum Smith

2:22 p.m.

New subject: vGPU VM not starting

Yep, creating the mdev manually works, and in fact, as I said previously, the VM does actually create an mdev successfully, as you can see from the UUID of the device (which is correctly identifiable through /sys/class/mdev_bus/${DEVICE_ADDR}/${UUID}/mdev_type/name).

In this specific case, to help with the logs, the UUID generated is consistently the same (even after manual deletion): "f5dc8396-dad5-3893-9eb4-94eedf60a881".

The VM then fails to start because of the MTU issue. Restarting the VM on the node then produces the issue of the device not being available (because the device with the previous UUID still exists and the type has max_instance=1).

So it's the first VM start with the MTU issue that needs resolving, with the added complication that the MTU (network) issue is caused by the mdev being set. The same error does not happen when mdev is not set.

PS. In fact this was the guide I followed, so thank you Martin for writing it; without it, getting this far would have been very difficult: https://mpolednik.github.io/2017/09/13/vgpu-in-ovirt/

Regards, Callum

--

Callum Smith Research Computing Core Wellcome Trust Centre for Human Genetics University of Oxford e. callum@well.ox.ac.uk
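The leftover-device situation described here (an mdev surviving a failed VM start and blocking the next one) can be inspected with a short script. This Python sketch assumes the kernel's mdev sysfs layout, where each created device appears as a UUID-named directory under the parent, with an mdev_type symlink pointing at its type and a remove attribute that deletes it when "1" is written. The function names are hypothetical, and nothing here reflects how vdsm itself cleans up.

```python
import os
import re

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def existing_mdevs(parent_dir):
    """Map UUID -> mdev type id for devices already created under
    parent_dir (e.g. /sys/class/mdev_bus/<device address>)."""
    found = {}
    for entry in sorted(os.listdir(parent_dir)):
        if UUID_RE.match(entry):
            type_link = os.path.join(parent_dir, entry, "mdev_type")
            # The symlink target's last component is the type id,
            # for example 'nvidia-61'.
            found[entry] = os.path.basename(os.path.realpath(type_link))
    return found


def remove_mdev(parent_dir, mdev_uuid):
    """Ask the kernel to delete one mdev by writing '1' to 'remove'."""
    with open(os.path.join(parent_dir, mdev_uuid, "remove"), "w") as attr:
        attr.write("1")
```

Running existing_mdevs before and after a failed start would show whether the device from the first attempt is still holding the only instance of its type.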

Callum Smith

21 May

10:52 a.m.

New subject: vGPU VM not starting

Dear All,

I'm still having the same problems. Is this a bug, or something that's configured incorrectly?

Regards, Callum

--

Callum Smith Research Computing Core Wellcome Trust Centre for Human Genetics University of Oxford e. callum@well.ox.ac.uk