Re: [ovirt-devel] SR-IOV feature

4 Nov 2014

      Hey all,
sorry for joining a bit late...

General note: hostdev-passthrough wiki will be updated ASAP in order
to reflect ongoing progress.

----- Original Message -----
...
From: "Alona Kaplan" <alkaplan@redhat.com>
To: "Dan Kenigsberg" <danken@redhat.com>
Cc: "Eldan Hildesheim" <ehildesh@redhat.com>, devel@ovirt.org, "Nir Yechiel" <nyechiel@redhat.com>
Sent: Sunday, November 2, 2014 2:17:40 PM
Subject: Re: [ovirt-devel] SR-IOV feature
----- Original Message -----
...
From: "Dan Kenigsberg" <danken@redhat.com>
To: "Alona Kaplan" <alkaplan@redhat.com>, bazulay@redhat.com
Cc: "Itamar Heim" <iheim@redhat.com>, "Eldan Hildesheim"
<ehildesh@redhat.com>, "Nir Yechiel" <nyechiel@redhat.com>,
devel@ovirt.org
Sent: Thursday, October 30, 2014 7:47:31 PM
Subject: Re: [ovirt-devel] SR-IOV feature
On Sun, Oct 26, 2014 at 06:39:00AM -0400, Alona Kaplan wrote:
...
...
...
On 10/05/2014 07:02 AM, Alona Kaplan wrote:
...
Hi all,
Currently SR-IOV in oVirt is only supported using vdsm-hook [1].
This feature will add SR-IOV support to the oVirt management system
(including migration).
You are more than welcome to review the feature page-
http://www.ovirt.org/Feature/SR-IOV
Thanks,
Alona.
_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel
Glad to see this.
some questions:
...
Note: this feature is about exposing a virtualized (or VirtIO) vNic to the
guest, and not about exposing the PCI device to it. This restriction is
necessary for migration to be supported.
did not understand this sentence - are you hinting to macvtap?
Most likely macvtap, yes.
Additionally I think Martin Poledník is looking into direct sr-iov attachment
to VMs as part of the pci passthrough work he is doing.
...
...
add/edit profile
so i gather the implementation is at profile level, which is at logical
network level?
how does this work exactly? can this logical network be vlan tagged or
must be native? if vlan tagged who does the tagging for the passthrough
device? (I see later on vf_vlan is one of the parameters to vdsm, just
wondering how the mapping can be at host level if this is a passthrough
device)?
is this because the use of virtio (macvtap)?
The logical network can be vlan tagged.
As you mentioned the vf_vlan is one of the parameters to the vdsm (on
create verb).
Setting the vlan on the vf is done as follows-
ip link set {DEVICE} vf {NUM} [ vlan VLANID ]
It is written in the notes section.
It is not related to the use of virtio. The vlan can be set on the vf
whether it is connected to the vm via macvtap or directly.
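To make the command above concrete, here is a minimal sketch of how the argument list for it could be built. The function name vf_vlan_command is made up for illustration and is not a vdsm API; it only assembles the arguments and does not run anything.

```python
from typing import List


def vf_vlan_command(device: str, vf_num: int, vlan_id: int) -> List[str]:
    """Build the argv for `ip link set DEVICE vf NUM vlan VLANID`.

    Hypothetical helper for illustration only; it does not execute the command.
    """
    if vf_num < 0:
        raise ValueError("VF index must be non-negative")
    # A VLAN ID is a 12-bit value, so anything outside 0..4095 cannot be valid.
    if not 0 <= vlan_id <= 4095:
        raise ValueError("VLAN ID must fit in 12 bits")
    return ["ip", "link", "set", device, "vf", str(vf_num), "vlan", str(vlan_id)]


if __name__ == "__main__":
    # Prints the command that would tag VF 3 of eth0 with VLAN 100.
    print(" ".join(vf_vlan_command("eth0", 3, 100)))
```

The resulting list could be handed to subprocess.run() on the host, which needs root privileges and an SR-IOV capable PF.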
Are you sure about this? I think that when a host device is attached to
a VM, it disappears from the host, and the guest can send arbitrary
unmodified packets through the wire. But I may well be wrong.
I think you are correct for the case of mtu
(that's why I added it as an open issue- "Is applying MTU on VF supported by
libvirt?").
But as I understand from the documentation (although I didn't test it
myself), that is the purpose of ip link set {DEVICE} vf {NUM} vlan VLANID
The documentation says- "all traffic sent from the VF will be tagged with the
specified VLAN ID. Incoming traffic will be filtered for the specified VLAN ID,
and will have all VLAN tags stripped before being passed to the VF."
Note- It is also supported by libvirt. As you can read in-
http://docs.fedoraproject.org/en-US/Fedora_Draft_Documentation/0.1/html/Virt...
"type='hostdev' SR-IOV interfaces do support transparent vlan tagging of
guest traffic".
...
...
...
...
wouldn't it be better to support both macvtap and passthrough and just
flag the VM as non migratable in that case?
Martin Polednik is working on pci-passthrough-
http://www.ovirt.org/Features/hostdev_passthrough
I'm actively working on hostdev passthrough (not only PCI; currently PCI,
SCSI and USB) and part of my testing was done on an SR-IOV
capable NIC (Intel 82576 chip).
...
...
...
Maybe we should wait for his feature to be ready and then combine it with the
sr-iov feature.
As I see in his feature page he plans to attach a specific device directly
to the vm.
Hostdev passthrough is working on a VFIO granularity - that means it's reporting
to engine the whole computer bus tree (libvirt's listAllDevices()) including a few
unique device identifiers (for me that is the name of the device, such as
pci_0000_af_01_1c, OR the tuple (vendor_id, device_id)). The api is very general -
it doesn't care if we're dealing with a PF or a VF; the only restriction is that
the whole IOMMU group has to be attached (libvirt limitation) - in case of SR-IOV
NICs that presents no complications as these are in unique IOMMU groups.
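As a rough illustration of the "whole IOMMU group" restriction, the sketch below reads the group layout that Linux exposes under /sys/kernel/iommu_groups and returns every device that would have to be attached together with a given one. The function names are hypothetical, and this is not how vdsm or libvirt collect the information; it only shows the grouping idea.

```python
import os
from typing import Dict, List


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> Dict[str, List[str]]:
    """Map each IOMMU group number to the PCI addresses it contains."""
    groups: Dict[str, List[str]] = {}
    for group in sorted(os.listdir(root)):
        devices_dir = os.path.join(root, group, "devices")
        if os.path.isdir(devices_dir):
            groups[group] = sorted(os.listdir(devices_dir))
    return groups


def devices_to_attach(pci_address: str, groups: Dict[str, List[str]]) -> List[str]:
    """Return all devices sharing an IOMMU group with pci_address."""
    for members in groups.values():
        if pci_address in members:
            return members
    raise KeyError(pci_address)


if __name__ == "__main__":
    # Assumes a host with the IOMMU enabled; otherwise the directory is empty.
    for group, members in iommu_groups().items():
        print(group, members)
```

For a VF that sits alone in its group, devices_to_attach() returns a single-element list, which is the "no complications" case mentioned above.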

This is the API you should use when dealing with physical host devices;
if anything is missing feel free to bring it up and we can work it in, at least
so we don't implement the same thing twice.
...
...
...
We can combine his feature with the sr-iov feature-
1. The network profile will have a type property-
   bridge (the regular configuration we have today,
   vnic->tap->bridge->physical nic)
   virtio (in the current feature design it is called passthrough,
   vnic->macvtap->vf)
   pci-passthrough (vnic->vf)
2. Attaching a network profile with pci-passthrough type to a vnic will
mark the vm as non-migratable.
This marking can be tuned by the admin. If the admin requests migration
despite the pci-passthrough type, Vdsm can auto-unplug the PCI device
before migration, and plug it back on the destination.
That would allow some kind of migration to guests that are willing to
see a PCI device disappear and re-appear.
For NICs this can even be avoided by using bonding[1]; for other devices
we'll need to manually handle cases of
  specific device on specific bus
  specific device (any bus)
  VF belonging to specific PF
  VF (any PF)
(and possibly more, to be discussed)
...
Added it as an open issue to the feature page.
...
...
3. When running a vm with pci-passthrough vnic a free VF will be attached
to the vm with the vlan and mtu configuration of the profile/network
(same as for virtio profile, as described in the feature page).
The benefit of it is that the user won't have to choose the vf directly and
will be able to set vlan and mtu on the vf.
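The three proposed profile types and the migratability rule from point 2 can be sketched roughly as follows. All names here (ProfileType, DATA_PATH, vm_is_migratable) are invented for illustration and do not exist in the engine; the admin override models the auto-unplug idea suggested in the thread.

```python
from enum import Enum
from typing import Iterable


class ProfileType(Enum):
    BRIDGE = "bridge"
    VIRTIO = "virtio"
    PCI_PASSTHROUGH = "pci-passthrough"


# Data path of each type, as listed in point 1.
DATA_PATH = {
    ProfileType.BRIDGE: ("vnic", "tap", "bridge", "physical nic"),
    ProfileType.VIRTIO: ("vnic", "macvtap", "vf"),
    ProfileType.PCI_PASSTHROUGH: ("vnic", "vf"),
}


def vm_is_migratable(profile_types: Iterable[ProfileType],
                     admin_allows_unplug: bool = False) -> bool:
    """A VM is non-migratable if any vnic uses pci-passthrough,
    unless the admin opted in to unplug/replug around migration."""
    has_passthrough = any(t is ProfileType.PCI_PASSTHROUGH for t in profile_types)
    return admin_allows_unplug or not has_passthrough


if __name__ == "__main__":
    print(vm_is_migratable([ProfileType.BRIDGE, ProfileType.VIRTIO]))
    print(vm_is_migratable([ProfileType.PCI_PASSTHROUGH]))
```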
...
...
also (and doesn't have to be in first phase) what happens if i run out
of hosts with sr-iov (or they failed) - can i fall back to a non
pci-passthrough profile for backup (policy question at vm level if it is more
important to have sr-iov or more important it will run even without it
since it provides a critical service, with a [scheduling] preference to
run on sr-iov)?
(oh, i see this is in the "futures" section already. :)
A benefit of this "Nice to have passthrough" is that one could set it on
vNic profiles that are already used by VMs. Once they are migrated to a
new host, the passthrough-ness request would take effect.
Added this benefit to the feature page.
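One possible reading of the "nice to have passthrough" policy, as a sketch only: prefer a host with a free VF, and fall back to any host when passthrough is a preference rather than a requirement. Host and pick_host are hypothetical names, and the engine scheduler is certainly more involved than this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    free_vfs: int  # free VFs usable for the vNic's network (assumed known)


def pick_host(hosts: List[Host], passthrough_required: bool) -> Optional[Host]:
    """Prefer hosts with a free VF; fall back only if passthrough is optional."""
    with_vf = [h for h in hosts if h.free_vfs > 0]
    if with_vf:
        # Pick the host with the most free VFs (an arbitrary tie-break choice).
        return max(with_vf, key=lambda h: h.free_vfs)
    if passthrough_required or not hosts:
        return None
    return hosts[0]


if __name__ == "__main__":
    hosts = [Host("host-a", 0), Host("host-b", 2)]
    print(pick_host(hosts, passthrough_required=True))
```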
...
...
...
...
...
management, display and migration properties are not relevant for the VFs
configuration
just wondering - any technical reason we can't put the management on a
VF (not saying its a priority to do so)?
Today we mark the logical network with a role (management/display/migration)
when attaching it to the cluster.
A logical network can be attached to one physical nic (PF).
We can't use the current attachment of a role for sr-iov, since the network
can be configured as "vf allowed" on more than one nic (maybe even on all the
nics).
If the network is "vf allowed" on the nic,
a vnic with this network can be attached to a free vf on the nic.
So we can't use the logical network to mark a vf with a role.
We have to mark the vf explicitly.
Since in the current design we don't expose the vf, setting the roles was
blocked.
But if there is a requirement for setting a vf as
management/migration/display we can rethink the design for it.
We can relax this requirement by allowing the network to be attached on
one nic (be it VF or PF or legacy), and to set the "vf allowed" on a
completely disjoint set of PFs.
I'm not sure I understand your suggestion.
And I still don't understand the benefit of using a vf as
management/display/migration.
...
...
...
...
...
sr-iov host nic management - num of VFs
I assume this is for admin to define a policy on how many VFs to use,
based on the max as reported by getVdsCaps. worth stating that for
clarity.
Updated the wiki with the following-
"It is used for admin to enable this number of VFs on the nic.
Changing this value will remove all the VFs from the nic and create new
#numOFVfs VFs on the nic."
The max value reported by getVdsCaps is just the theoretical maximum value.
I think that Itamar suggests that this should be automated. An admin
could say "give me all the VFs you can", and when adding a new host,
Engine would set it seamlessly.
By the way, do you know what's the down side of asking for the maximum
number of VFs? Is it memory overhead? CPU? network performance?
I think "give me all the VFs you can" would rarely be used because in
practice this maximum is much lower, since each VF consumes resources.
Network device needs the resources to support the VF such as queues for data,
data address space, command processing, and more.
...
I wonder whether it makes sense for Vdsm to set the max on each reboot?
You're not updating the max, you're updating the number of existing
VFs on a PF.
On a reboot all the VFs are destroyed.
When the host is started, #defaultNum of VFs are created.
Updating the num of VFs via sysfs works across modules.
Since the sriov_numvfs value passed to sysfs is not persistent across reboots,
after a reboot the new value is taken from the module-specific configuration.
Each module has its own way to specify persistent default num of VFs.
For example- with Intel VT-d you should add the line- options igb max_vfs=7
to any file in /etc/modprobe.d
If the module doesn't specify the number of VFs in its configuration
the default number is 0.
So if vdsm won't set /sys/class/net/'device_name'/device/sriov_numvfs on each
reboot, the user will have to control the number manually and per module.
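For reference, a minimal sketch of what setting the number of VFs through sysfs could look like. set_num_vfs and its sysfs_root parameter are made up for this example and are not vdsm code; sriov_numvfs and sriov_totalvfs are the standard attributes under the device directory. The sketch assumes the kernel refuses to go directly from one non-zero VF count to another, so it writes 0 first, which matches the "remove all the VFs and create new ones" behaviour described in the wiki text.

```python
import os
from typing import List


def set_num_vfs(device: str, num_vfs: int,
                sysfs_root: str = "/sys/class/net") -> List[int]:
    """Set the number of VFs on a PF and return the values written, in order."""
    dev_dir = os.path.join(sysfs_root, device, "device")
    with open(os.path.join(dev_dir, "sriov_totalvfs")) as f:
        total = int(f.read())
    if not 0 <= num_vfs <= total:
        raise ValueError("requested %d VFs, device supports 0..%d" % (num_vfs, total))

    numvfs_path = os.path.join(dev_dir, "sriov_numvfs")
    with open(numvfs_path) as f:
        current = int(f.read())

    written: List[int] = []
    if current == num_vfs:
        return written
    if current != 0:
        # Existing VFs have to be destroyed before a new count is accepted.
        with open(numvfs_path, "w") as f:
            f.write("0")
        written.append(0)
    if num_vfs != 0:
        with open(numvfs_path, "w") as f:
            f.write(str(num_vfs))
        written.append(num_vfs)
    return written
```

Running this on each boot (as root) would be one way to make the value effectively persistent without touching the module-specific configuration.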
...
Another related issue, that is mentioned as an open question:
The current suggestion, of having updateSriovMaxVFs as an independent
verb has a down side: you cannot use it to updateSriovMaxVFs of the PF
that is used by the management network. If we want to support this use
case, we should probably expose the functionality within the
transactional setupNetworks verb.
Why can't it be used on the PF that is used by the management network?
AFAIK the PF doesn't lose connectivity when updating
/sys/class/net/eth0/device/sriov_numvfs
but I'm not sure about it. Added it to the open issues section.
...
...
...
...
...
User Experience - Setup networks - Option 1
in the last picture ("Edit VFs networks and labels") - why are there
labels here together with the networks (if labels appear at the PF level
in the first dialog)?
iiuc, the option 2 is re-using the setup networks, where the PF will
just be another physical interface, and networks or labels edited just
like for regular network interfaces?
(not sure where you are on this, but it sounds more straight
forward/similar to existing concepts iiuc).
As I wrote in the answer about the roles.
There are two concepts-
1. The attachment of network to physical nic (what we have today).
2. Containing the network in the "VFs management tab=>allowed networks" of
the nic.
In 1, we actually configure the host's nics and bridges according to the
setup networks.
In 2, we just specify the "allowed" list; it isn't even sent to vdsm.
It is used by the engine when it schedules a host for a vm.
The connection between networks to nics is many to many.
The same network can be part of 1 and 2 on the same nic.
And even part of 2 in other sr-iov enabled nics.
Since 2 is a completely different concept than 1, we weren't sure whether
using drag and drop, as for PFs, isn't too much in this case.
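To illustrate concept 2, here is a small sketch of the many-to-many "allowed networks" relation and how it might be consulted when looking for a nic that can serve a vnic. SriovNic and nics_for_network are hypothetical names, not engine classes, and the free VF count is assumed to be known to the caller.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SriovNic:
    name: str
    allowed_networks: Set[str] = field(default_factory=set)
    free_vfs: int = 0


def nics_for_network(nics: List[SriovNic], network: str) -> List[str]:
    """Names of nics that allow the network on their VFs and have a free VF."""
    return [n.name for n in nics
            if network in n.allowed_networks and n.free_vfs > 0]


if __name__ == "__main__":
    nics = [
        SriovNic("eth0", {"red", "blue"}, free_vfs=2),
        SriovNic("eth1", {"red"}, free_vfs=0),
    ]
    # The same network can be allowed on several nics.
    print(nics_for_network(nics, "red"))
```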
...
...
Question: any issues with hot plug/unplug or just expected to work
normally?
Expected to work (but wasn't tested yet).
[1] http://shikee.net/read/VM_OLS08.pdf