[ovirt-devel] SR-IOV feature

Martin Polednik mpolednik at redhat.com
Tue Nov 4 13:10:39 UTC 2014


Hey all,
sorry for joining a bit late...

General note: the hostdev-passthrough wiki will be updated ASAP to reflect
the ongoing progress.

----- Original Message -----
> From: "Alona Kaplan" <alkaplan at redhat.com>
> To: "Dan Kenigsberg" <danken at redhat.com>
> Cc: "Eldan Hildesheim" <ehildesh at redhat.com>, devel at ovirt.org, "Nir Yechiel" <nyechiel at redhat.com>
> Sent: Sunday, November 2, 2014 2:17:40 PM
> Subject: Re: [ovirt-devel] SR-IOV feature
> 
> 
> 
> ----- Original Message -----
> > From: "Dan Kenigsberg" <danken at redhat.com>
> > To: "Alona Kaplan" <alkaplan at redhat.com>, bazulay at redhat.com
> > Cc: "Itamar Heim" <iheim at redhat.com>, "Eldan Hildesheim"
> > <ehildesh at redhat.com>, "Nir Yechiel" <nyechiel at redhat.com>,
> > devel at ovirt.org
> > Sent: Thursday, October 30, 2014 7:47:31 PM
> > Subject: Re: [ovirt-devel] SR-IOV feature
> > 
> > On Sun, Oct 26, 2014 at 06:39:00AM -0400, Alona Kaplan wrote:
> > > 
> > > > > On 10/05/2014 07:02 AM, Alona Kaplan wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > Currently SR-IOV in oVirt is only supported using vdsm-hook [1].
> > > > > > This feature will add SR-IOV support to oVirt management system
> > > > > > (including
> > > > > > migration).
> > > > > >
> > > > > > You are more than welcome to review the feature page-
> > > > > > http://www.ovirt.org/Feature/SR-IOV
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Alona.
> > > > > > _______________________________________________
> > > > > > Devel mailing list
> > > > > > Devel at ovirt.org
> > > > > > http://lists.ovirt.org/mailman/listinfo/devel
> > > > > >
> > > > > 
> > > > > Glad to see this.
> > > > > 
> > > > > some questions:
> > > > > 
> > > > > > Note: this feature is about exposing a virtualized (or VirtIO) vNic
> > > > > > to the guest, and not about exposing the PCI device to it. This
> > > > > > restriction is necessary for migration to be supported.
> > > > > 
> > > > > did not understand this sentence - are you hinting at macvtap?
> > > > 
> > > > Most likely macvtap, yes.
> > > > 
> > > > Additionally I think Martin Poledník is looking into direct sr-iov
> > > > attachment
> > > > to VMs as part of the pci passthrough work he is doing.
> > > > 
> > > > > 
> > > > > > add/edit profile
> > > > > 
> > > > > so I gather the implementation is at the profile level, which is at the
> > > > > logical network level?
> > > > > how does this work exactly? can this logical network be vlan tagged or
> > > > > must it be native? if vlan tagged, who does the tagging for the
> > > > > passthrough device? (I see later on that vf_vlan is one of the
> > > > > parameters to vdsm, just wondering how the mapping can be at the host
> > > > > level if this is a passthrough device)?
> > > > > is this because of the use of virtio (macvtap)?
> > > 
> > > The logical network can be vlan tagged.
> > > As you mentioned the vf_vlan is one of the parameters to the vdsm (on
> > > create verb).
> > > Setting the vlan on the vf is done as follows-
> > > ip link set {DEVICE} vf {NUM} [ vlan VLANID ]
> > > It is written in the notes section.
> > > 
> > > It is not related to the use of virtio. The vlan can be set on the vf
> > > whether it is connected to the vm via macvtap or directly.
> > 
> > Are you sure about this? I think that when a host device is attached to
> > a VM, it disappears from the host, and the guest can send arbitrary
> > unmodified packets through the wire. But I may well be wrong.
> > 
> 
> I think you are correct for the case of mtu
> (that's why I added it as an open issue- "Is applying MTU on VF supported by
> libvirt?").
> But as I understand from the documentation (although I didn't test it by
> myself)-
> that is the purpose of ip link set {DEVICE} vf {NUM} vlan VLANID
> The documentation says- "all traffic sent from the VF will be tagged with the
> specified VLAN ID.
> Incoming traffic will be filtered for the specified VLAN ID, and will have all
> VLAN tags stripped before being passed to the VF."
> 
> Note- It is also supported by libvirt. As you can read in-
> http://docs.fedoraproject.org/en-US/Fedora_Draft_Documentation/0.1/html/Virtualization_Deployment_and_Administration_Guide/sub-sub-section-libvirt-dom-xml-devices-setting-vlan-tag.html
> "type='hostdev' SR-IOV interfaces do support transparent vlan tagging of
> guest traffic".
> 
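To make the two cases above a bit more concrete, roughly (an untested Python
sketch; the device name, PCI address and VLAN id below are made up, and this
is not the actual vdsm code):

    import subprocess
    import libvirt

    def set_vf_vlan(pf_name, vf_num, vlan_id):
        # macvtap ("virtio") case: tagging/stripping is done by the PF driver,
        # i.e. ip link set {DEVICE} vf {NUM} vlan {VLANID}
        subprocess.check_call(['ip', 'link', 'set', pf_name,
                               'vf', str(vf_num), 'vlan', str(vlan_id)])

    # direct VF (hostdev) case: libvirt does the transparent tagging via the
    # <vlan> element of an <interface type='hostdev'>
    HOSTDEV_IFACE_XML = """
    <interface type='hostdev' managed='yes'>
      <source>
        <address type='pci' domain='0x0000' bus='0x07' slot='0x10' function='0x1'/>
      </source>
      <vlan>
        <tag id='100'/>
      </vlan>
    </interface>
    """

    def plug_vf_with_vlan(dom):
        # hot-plug the tagged VF into a running domain (libvirt.virDomain)
        dom.attachDevice(HOSTDEV_IFACE_XML)
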
> > > > > wouldn't it be better to support both macvtap and passthrough and
> > > > > just flag the VM as non-migratable in that case?
> > > 
> > > Martin Polednik is working on pci-passthrough-
> > > http://www.ovirt.org/Features/hostdev_passthrough

I'm actively working on hostdev passthrough (not only PCI but also
SCSI and USB currently), and part of my testing was done on an SR-IOV
capable NIC (Intel 82576 chip).
 
> > > Maybe we should wait for his feature to be ready and then combine it with
> > > the sr-iov feature.
> > > As I see in his feature page, he plans to attach a specific device directly
> > > to the vm.

Hostdev passthrough works at VFIO granularity - that means it reports to
engine the whole host device tree (libvirt's listAllDevices()), including a few
unique device identifiers (for me that is the name of the device, such as
pci_0000_af_01_1c, OR the tuple (vendor_id, device_id)). The API is very general -
it doesn't care whether we're dealing with a PF or a VF; the only restriction is
that the whole IOMMU group has to be attached (a libvirt limitation). In the case
of SR-IOV NICs that presents no complications, as the VFs are in unique IOMMU groups.

This is the API you should use when dealing with physical host devices.
If anything is missing, feel free to bring it up and we can work it in, at least
so we don't implement the same thing twice.
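
For reference, the data I'm describing comes straight from libvirt and can be
listed along these lines (a minimal Python sketch, not the code vdsm actually
ships):

    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')

    # walk the whole host device tree that gets reported to engine
    for dev in conn.listAllDevices(0):
        # dev.name() is the unique identifier, e.g. pci_0000_af_01_1c;
        # dev.listCaps() tells us whether it is a pci/scsi/usb device
        print(dev.name(), dev.listCaps())
        # dev.XMLDesc(0) carries the rest: <vendor id=...>, <product id=...>
        # and the IOMMU group the device belongs to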

> > > We can combine his feature with the sr-iov feature-
> > > 1. The network profile will have a type property-
> > >    bridge (the regular configuration we have today,
> > >    vnic->tap->bridge->physical nic)
> > >    virtio (in the current feature design it is called passthrough,
> > >    vnic->macvtap->vf)
> > >    pci-passthrough (vnic->vf)
> > > 2. Attaching a network profile with pci-passthrough type to a vnic will
> > >    mark the vm as non-migratable.
> > 
> > This marking can be tuned by the admin. If the admin requests migration
> > despite the pci-passthrough type, Vdsm can auto-unplug the PCI device
> > before migration, and plug it back on the destination.
> > That would allow some kind of migration to guests that are willing to
> > see a PCI device disappear and re-appear.

For NICs this can even be avoided by using bonding [1]; for other devices
we'll need to manually handle cases of:
  - a specific device on a specific bus
  - a specific device (any bus)
  - a VF belonging to a specific PF
  - a VF (any PF)
(and possibly more, to be discussed)
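
As for the auto-unplug Dan describes, a rough sketch of what it could look like
at the libvirt level (illustrative only - the hostdev XML and PCI address are
made up and the migration itself is omitted):

    import libvirt

    HOSTDEV_XML = """
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x07' slot='0x10' function='0x1'/>
      </source>
    </hostdev>
    """

    def unplug_before_migration(dom):
        # detach the VF from the running guest; the guest sees the PCI
        # device disappear
        dom.detachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)

    def replug_on_destination(dest_dom):
        # after migration, attach an equivalent free VF on the destination host
        dest_dom.attachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)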

> 
> Added it as an open issue to the feature page.
> 
> > > 3. When running a vm with a pci-passthrough vnic, a free VF will be
> > > attached to the vm with the vlan and mtu configuration of the
> > > profile/network (same as for the virtio profile, as described in the
> > > feature page).
> > >
> > > The benefit of it is that the user won't have to choose the vf directly
> > > and will be able to set vlan and mtu on the vf.
> > > 
> > > > > 
> > > > > also (and doesn't have to be in the first phase) what happens if I run
> > > > > out of hosts with sr-iov (or they failed) - can I fall back to a
> > > > > non-pci-passthrough profile for backup? (policy question at the vm
> > > > > level: is it more important to have sr-iov, or more important that it
> > > > > will run even without it since it provides a critical service, with a
> > > > > [scheduling] preference to run on sr-iov?)
> > > > > (oh, I see this is in the "futures" section already.)
> > > 
> > > :)
> > 
> > A benefit of this "Nice to have passthrough" is that one could set it on
> > vNic profiles that are already used by VMs. Once they are migrated to a
> > new host, the passthrough-ness request would take effect.
> > 
> 
> Added this benefit to the feature page.
> 
> > > 
> > > > > 
> > > > > 
> > > > > > management, display and migration properties are not relevant for
> > > > > > the VFs configuration
> > > > >
> > > > > just wondering - any technical reason we can't put the management on
> > > > > a VF (not saying it's a priority to do so)?
> > > 
> > > Today we mark the logical network with a role
> > > (management/display/migration)
> > > when attaching it to the cluster.
> > > A logical network can be attached to one physical nic (PF).
> > > 
> > > We can't use the current attachment of a role for sr-iov, since the
> > > network can be configured as "vf allowed" on more than one nic (maybe even
> > > on all the nics).
> > > If the network is "vf allowed" on the nic,
> > > a vnic with this network can be attached to a free vf on the nic.
> > > 
> > > So we can't use the logical network to mark a vf with a role.
> > > We have to mark the vf explicitly.
> > > Since in the current design we don't expose the vf, setting the roles was
> > > blocked.
> > > But if there is a requirement for setting a vf as
> > > management/migration/display, we can re-think the design for it.
> > 
> > We can relax this requirement by allowing the network to be attached on
> > one nic (be it VF or PF or legacy), and to set it as "vf allowed" on a
> > completely disjoint set of PFs.
> >
> 
> I'm not sure I understand your suggestion.
> And still don't understand the benefit of using a vf as
> management/display/migration.
>  
> > 
> > > 
> > > > > 
> > > > > > sr-iov host nic management - num of VFs
> > > > > 
> > > > > I assume this is for the admin to define a policy on how many VFs to
> > > > > use, based on the max as reported by getVdsCaps. Worth stating that for
> > > > > clarity.
> > > > > 
> > > 
> > > Updated the wiki with the following-
> > > "It is used for admin to enable this number of VFs on the nic.
> > > Changing this value will remove all the VFs from the nic and create new
> > > #numOFVfs VFs on the nic."
> > > 
> > > The max value reported by getVdsCaps is just the theoretical maximum
> > > value.
> > 
> > I think that Itamar suggests that this should be automated. An admin
> > could say "give me all the VFs you can", and when adding a new host,
> > Engine would set it seamlessly.
> >
> > By the way, do you know what's the downside of asking for the maximum
> > number of VFs? Is it memory overhead? CPU? Network performance?
> > 
> 
> I think "give me all the VFs you can" would rarely be used because in
> practice this maximum is much lower, since each VF consumes resources.
> Network device needs the resources to support the VF such as queues for data,
> data address space, command processing, and more.
> 
> > I wonder whether it makes sense for Vdsm to set the max on each reboot?
> > 
> 
> You're not updating the max, you're updating the number of existing
> VFs on a PF.
>  
> On a reboot all the VFs are destroyed.
> When the host is started, #defaultNum of VFs are created.
> 
> Updating the num of VFs via sysfs works across modules.
> Since the sriov_numvfs value passed to sysfs is not persistent across reboots,
> after a reboot the value is taken from the module-specific configuration.
>
> Each module has its own way to specify a persistent default num of VFs.
> For example, with the Intel igb driver you should add the line
> "options igb max_vfs=7" to any file in /etc/modprobe.d.
> If the module doesn't specify the number of VFs in its configuration,
> the default number is 0.
>
> So if vdsm doesn't set /sys/class/net/'device_name'/device/sriov_numvfs on
> each reboot, the user will have to control the number manually, per module.
> 
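In other words, the per-boot step could be as simple as something like this
(an illustrative Python sketch, not actual vdsm code; note that the kernel only
accepts a new non-zero value after the count has been reset to 0):

    def set_num_vfs(device_name, num_vfs):
        path = '/sys/class/net/%s/device/sriov_numvfs' % device_name
        # existing VFs have to be removed before the number can be changed
        with open(path, 'w') as f:
            f.write('0')
        with open(path, 'w') as f:
            f.write(str(num_vfs))
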
> > Another related issue, which is mentioned as an open question:
> > the current suggestion, of having updateSriovMaxVFs as an independent
> > verb, has a downside: you cannot use it to update the number of VFs of the
> > PF that is used by the management network. If we want to support this use
> > case, we should probably expose the functionality within the
> > transactional setupNetworks verb.
> > 
> 
> Why can't it be used on the PF that is used by the management network?
> AFAIK the PF doesn't lose connectivity when updating
> /sys/class/net/eth0/device/sriov_numvfs
> but I'm not sure about it. Added it to the open issues section.
> 
> > > 
> > > 
> > > > > >  User Experience - Setup networks - Option 1
> > > > > 
> > > > > in the last picture ("Edit VFs networks and labels") - why are there
> > > > > labels here together with the networks (if labels appear at the PF
> > > > > level in the first dialog)?
> > > > >
> > > > > iiuc, option 2 is re-using setup networks, where the PF will just be
> > > > > another physical interface, and networks or labels are edited just
> > > > > like for regular network interfaces?
> > > > > (not sure where you are on this, but it sounds more
> > > > > straightforward/similar to existing concepts iiuc).
> > > > > 
> > > 
> > > As I wrote in the answer about the roles, there are two concepts-
> > > 1. The attachment of a network to a physical nic (what we have today).
> > > 2. Containing the network in the "VFs management tab=>allowed networks" of
> > > the nic.
> > > 
> > > In 1, we actually configure the host's nics and bridges according to the
> > > setup networks.
> > > In 2, we just specify the "allowed" list; it isn't even sent to
> > > vdsm.
> > > It is used by the engine when it schedules a host for a vm.
> > > 
> > > The connection between networks and nics is many-to-many.
> > > The same network can be part of 1 and 2 on the same nic,
> > > and even part of 2 on other sr-iov enabled nics.
> > >
> > > Since 2 is a completely different concept from 1, we weren't sure whether
> > > using drag and drop as for PFs isn't too much in this case.
> > > 
> > > > > Question: any issues with hot plug/unplug or just expected to work
> > > > > normally?
> > > 
> > > Expected to work (but wasn't tested yet).
> > 
> _______________________________________________
> Devel mailing list
> Devel at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel

[1] http://shikee.net/read/VM_OLS08.pdf


