[ovirt-devel] Re: [ovirt-users] Re: [=EXTERNAL=] Re: help using nvme/tcp storage with cinderlib and Managed Block Storage

27 Mar 2022

      Hi Benny,

Any update on this one?
Also, is there a way I can test this with vdsm-client without resorting to
full ovirt? We have run into some issues with getting ovirt working with
the nightlies, but vdsm and vdsm-client appear to work fine with the
patches applied, or at least, they run.

Cheers,
Muli

On Thu, Mar 3, 2022 at 6:09 PM Benny Zlotnik <bzlotnik@redhat.com> wrote:
...
Hi,
I posted draft PRs for engine[1] and vdsm[2], they are still raw and I
only tested running starting VMs with ceph. If you can apply the
changes for lightos (only vdsm should be needed) and try it out it
would be great :)
Also, if you have any suggestions/comments/etc feel free to comment on
the PRs directly
If you don't want to build ovirt-engine from source, CI generated RPMs
should be available in[3] (the job is still running while I'm writing
this email)
[1] https://github.com/oVirt/ovirt-engine/pull/104
[2] https://github.com/oVirt/vdsm/pull/89
[3] https://github.com/oVirt/ovirt-engine/actions/runs/1929008680
On Wed, Mar 2, 2022 at 4:55 PM Muli Ben-Yehuda <muli@lightbitslabs.com>
wrote:
...
Thanks for the update, Benny. How can I help? For example, would logs
from running the connector with the exact data it returns be useful?
...
Cheers,
Muli
On Tue, Mar 1, 2022 at 8:39 PM Benny Zlotnik <bzlotnik@redhat.com>
...
...
Hi,
Just by browsing the code, I can think of one issue in[1], as a result
of[2] where we only considered iscsi and rbd drivers, I suspect your
driver will go into this branch, based on the issue in the 4.3 logs I
went over:
backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/builder/vminfo/LibvirtVmXmlBuilder.java
...
} else if (managedBlockStorageDisk.getCinderVolumeDriver()
== CinderVolumeDriver.BLOCK) {
                Map<String, Object> attachment =
                        (Map<String, Object>)
managedBlockStorageDisk.getDevice().get(DeviceInfoReturn.ATTACHMENT);
                metadata = Map.of(
                        "GUID",
(String)attachment.get(DeviceInfoReturn.SCSI_WWN),
                        "managed", "true"
Which will make it go into the wrong branch in clientIF.py, appending
the empty GUID to /dev/mapper. Perhaps it is possible workaround it in
clientIF if you just want to try and get the VM started for now, by
checking if GUID is empty and deferring to:
   volPath = drive['path']
But as discussed in this thread, our attempt at constructing the
stable paths ourselves doesn't really scale. After further discussion
with Nir I started working on creating a link in vdsm in
managevolume.py#attach_volume to the path returned by the driver, and
engine will use our link to run the VMs.
This should simplify the code and resolve the live VM migration issue.
I had some preliminary success with this so I'll try to post the
patches soon
[1]
https://github.com/oVirt/vdsm/blob/d957a06a4d988489c83da171fcd9cfd254b12ca4/...
...
[2]
https://github.com/oVirt/ovirt-engine/blob/24530d17874e20581deee4b0e319146cd...
On Tue, Mar 1, 2022 at 6:12 PM Muli Ben-Yehuda <muli@lightbitslabs.com>
wrote:
...
Will this support require changes in ovirt-engine or just in vdsm? I
have started to look into vdsm's managedvolume.py and its tests and it
seems like adding support for LightOS there should be pretty simple (famous
last words...). Should this be enough or do you think it will require
changes in other parts of ovirt as well?
...
Cheers,
Muli
On Mon, Feb 28, 2022 at 9:09 AM Nir Soffer <nsoffer@redhat.com>
wrote:
...
...
On Fri, Feb 25, 2022 at 12:04 PM Gorka Eguileor <geguileo@redhat.com>
wrote:
...
...
On 24/02, Nir Soffer wrote:
> On Thu, Feb 24, 2022 at 8:46 PM Gorka Eguileor <
geguileo@redhat.com> wrote:
...
> >
> > On 24/02, Nir Soffer wrote:
> > > On Thu, Feb 24, 2022 at 6:35 PM Muli Ben-Yehuda <
muli@lightbitslabs.com> wrote:
> > > >
> > > > On Thu, Feb 24, 2022 at 6:28 PM Nir Soffer <
nsoffer@redhat.com> wrote:
> > > >>
> > > >> On Thu, Feb 24, 2022 at 6:10 PM Muli Ben-Yehuda <
muli@lightbitslabs.com> wrote:
> > > >> >
> > > >> > On Thu, Feb 24, 2022 at 3:58 PM Nir Soffer <
nsoffer@redhat.com> wrote:
> > > >> >>
> > > >> >> On Wed, Feb 23, 2022 at 6:24 PM Muli Ben-Yehuda <
muli@lightbitslabs.com> wrote:
> > > >> >> >
> > > >> >> > Thanks for the detailed instructions, Nir. I'm going
to scrounge up some hardware.
> > > >> >> > By the way, if anyone else would like to work on
NVMe/TCP support, for NVMe/TCP target you can either use Lightbits (talk to
me offline for details) or use the upstream Linux NVMe/TCP target.
Lightbits is a clustered storage system while upstream is a single target,
but the client side should be close enough for vdsm/ovirt purposes.
> > > >> >>
> > > >> >> I played with NVMe/TCP a little bit, using qemu to
create a virtual
> > > >> >> NVMe disk, and export
> > > >> >> it using the kernel on one VM, and consume it on
another VM.
> > > >> >>
https://futurewei-cloud.github.io/ARM-Datacenter/qemu/nvme-of-tcp-vms/
> > > >> >>
> > > >> >> One question about device naming - do we always get
...
...
...
...
...
> > > >> >> device in all hosts?
> > > >> >
> > > >> >
> > > >> > No, we do not, see below how we handle migration in
os_brick.
> > > >> >
> > > >> >> To support VM migration, every device must have unique
name in the cluster.
> > > >> >> With multipath we always have unique name, since we
disable "friendly names",
> > > >> >> so we always have:
> > > >> >>
> > > >> >>     /dev/mapper/{wwid}
> > > >> >>
> > > >> >> With rbd we also do not use /dev/rbdN but a unique
...
...
...
...
...
> > > >> >>
> > > >> >>     /dev/rbd/poolname/volume-vol-id
> > > >> >>
> > > >> >> How do we ensure cluster-unique device path? If
os_brick does not handle it, we
> > > >> >> can to do in ovirt, for example:
> > > >> >>
> > > >> >>     /run/vdsm/mangedvolumes/{uuid} -> /dev/nvme7n42
> > > >> >>
> > > >> >> but I think this should be handled in cinderlib, since
openstack have
> > > >> >> the same problem with migration.
> > > >> >
> > > >> >
> > > >> > Indeed. Both the Lightbits LightOS connector and the
nvmeof connector do this through the target provided namespace (LUN) UUID.
After connecting to the target, the connectors wait for the local
friendly-named device file that has the right UUID to show up, and then
return the friendly name. So different hosts will have different friendly
names, but the VMs will be attached to the right namespace since we return
...
...
...
...
...
> > > >>
> > > >> It will not work for oVirt.
> > > >>
> > > >> Migration in oVirt works like this:
> > > >>
> > > >> 1. Attach disks to destination host
> > > >> 2. Send VM XML from source host to destination host, and
start the
> > > >>    VM is paused mode
> > > >> 3. Start the migration on the source host
> > > >> 4. When migration is done, start the CPU on the
destination host
> > > >> 5. Detach the disks from the source
> > > >>
> > > >> This will break in step 2, since the source xml refer to
nvme device
> > > >> that does not exist or already used by another VM.
> > > >
> > > >
> > > > Indeed.
> > > >
> > > >> To make this work, the VM XML must use the same path,
existing on
> > > >> both hosts.
> > > >>
> > > >> The issue can be solved by libvirt hook updating the
...
...
...
...
...
> > > >> is started on the destination, but I think the right way
to handle this is to
> > > >> have the same path.
> > > >
> > > >
> > > >  You mentioned above that it can be handled in ovirt
(c.f., /run/vdsm/mangedvolumes/{uuid} -> /dev/nvme7n42), which seems like a
reasonable approach given the constraint imposed by the oVirt migration
flow you outlined above. What information does vdsm need to create and use
...
...
...
...
...
> > > >
> > > >     def connect_volume(self, connection_properties):
> > > >         device_info = {'type': 'block'}
> > > >         uuid = connection_properties['uuid']
> > > >         device_path = self._get_device_by_uuid(uuid)
> > > >         device_info['path'] = device_path
> > > >         return device_info
> > >
> > > I think we have 2 options:
> > >
> > > 1. unique path created by os_brick using the underlying uuid
> > >
> > > In this case the connector will return the uuid, and ovirt
will use
> > > it to resolve the unique path that will be stored and used
on engine
> > > side to create the vm xml.
> > >
> > > I'm not sure how the connector should return this uuid.
Looking in current
> > > vdsm code:
> > >
> > >     if vol_type in ("iscsi", "fibre_channel"):
> > >         if "multipath_id" not in attachment:
> > >             raise se.ManagedVolumeUnsupportedDevice(vol_id,
attachment)
> > >         # /dev/mapper/xxxyyy
> > >         return os.path.join(DEV_MAPPER,
attachment["multipath_id"])
> > >     elif vol_type == "rbd":
> > >         # /dev/rbd/poolname/volume-vol-id
> > >         return os.path.join(DEV_RBD,
connection_info['data']['name'])
> > >
> > > os_brick does not have a uniform way to address different
devices.
> > >
> > > Maybe Gorka can help with this.
> >
> > Hi,
> >
> > That is true, because in OpenStack we haven't had the need to
have the
> > same path on every host or even on the same host during
different
> > connections.
> >
> > For nvme a new `elif` clause could be added there, though it
will be a
> > bit trickier, because the nvme connection properties format
are a bit of
> > a mess...
> >
> > We have 2 different formats for the nvme properties, and the
wwid that
> > appears in symlink /dev/disk/by-id/nvme-<wwid> may or may not
be the
> > volume id, may be the uuid in the connection info if present
or the
> > nguid if the nvme device doesn't have uuid.
> >
> > For these reasons I would recommend not relying on the
connection
> > information and relying on the path from the attachment
instead.
> >
> > Something like this should be probably fine:
> >
> >   elif vol_type == 'nvme':
> >       device_name = os.path.basename(attachment['path'])
> >       controller = device_name.rsplit('n', 1)[0]
> >       wwid_filename =
f'/sys/class/nvme/{controller}/{device_name}/wwid'
> >       with open(wwid_filename, 'r') as f:
> >           uuid = f.read().strip()
> >       return os.path.join('/dev/disk/by-id/nvme-', uuid)
>
> Thanks Gorka!
>
> but isn't this duplicating logic already in os brick?
>
https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a80af...
>
Hi Nir,
Oh! I thought we were talking about the generic NVMe-oF connector,
didn't know this was specific about the LightOS one.
The link is used as an easy way to locate the volume, it doesn't
mean
that it is returned to the caller of the `connect_volume` method.
In
fact, we can see how that method actually returns the real path
and not
the link's path:
def _check_device_exists_using_dev_lnk(self, uuid):
        lnk_path = f"/dev/disk/by-id/nvme-uuid.{uuid}"
-->     if os.path.exists(lnk_path):
   ^^^ Check link exists
-->         devname = os.path.realpath(lnk_path)
   ^^^ Get the real path for the symlink
-->         if devname.startswith("/dev/nvme"):
   ^^^ Make extra sure it's not pointing to something crazy
LOG.info("LIGHTOS: devpath %s detected for uuid
%s",
                         devname, uuid)
-->             return devname
   ^^^ Return it
return None
> Another interesting detail is this wait:
>
https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a80af...
>
>     def _get_device_by_uuid(self, uuid):
>         endtime = time.time() + self.WAIT_DEVICE_TIMEOUT
>         while time.time() < endtime:
>             try:
>                 device =
self._check_device_exists_using_dev_lnk(uuid)
>                 if device:
>                     return device
>             except Exception as e:
>                 LOG.debug(f'LIGHTOS: {e}')
>             device =
self._check_device_exists_reading_block_class(uuid)
>             if device:
>                 return device
>
>             time.sleep(1)
>         return None
>
> The code does not explain why it tries to use the
/dev/disk/by-id link
> and fallback to sysfs on errors. Based on our experience with
udev,
> I guess that the author does not trust udev. I wonder if we can
...
...
...
...
...
> it as the stable device path.
>
In my experience udev rules (which is different from udev itself)
are
less that reliable as a way of finding devices when working "in the
wild".  They are only reliable if you have full control over the
host
system and are sure nobody (admin or distro) can break things.
For reference, at Red Hat we have an RFE to improve os-brick [1]
and
stop using symlinks at all.
While they are not 100% reliable in the wild, they are quite
reliable
once they are working on a specific system, which means that if we
confirm they are working on a system we can rely on them if no
changes
are made on the system (and if CPU is not 100% during attachment).
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1697319
> If we can trust this path, maybe os_brick can return the stable
...
...
...
...
...
> in a uniform way for all kind of devices?
I don't think this is likely to happen, because it has no real
value for
OpenStack so it's unlikely to get prioritized (for coding and
reviews).
Since we cannot get a stable path from os-brick, and stable path is
a oVirt
specific requirement, we need to handle this in oVirt, similar to
wrote:
the same name of the
path:
the friendly name on the current host that has the right UUID. Does this
also work for you?
paths before qemu
the /var/run/vdsm/managedvolumes/{uuid} link? Today the connector does
(trimmed for brevity):
trust
path
the way we
...
...
...
...
handle multipath and rbd and traditional storage.
Nir
Lightbits Labs
Lead the cloud-native data center transformation by delivering
scalable and efficient software defined storage that is easy to consume.
This message is sent in confidence for the addressee only.  It may
contain legally privileged information. The contents are not to be
disclosed to anyone other than the addressee. Unauthorized recipients are
requested to preserve this confidentiality, advise the sender immediately
of any error in transmission and delete the email from their systems.
Lightbits Labs
Lead the cloud-native data center transformation by delivering scalable
and efficient software defined storage that is easy to consume.
This message is sent in confidence for the addressee only.  It may
contain legally privileged information. The contents are not to be
disclosed to anyone other than the addressee. Unauthorized recipients are
requested to preserve this confidentiality, advise the sender immediately
of any error in transmission and delete the email from their systems.
-- 

*Lightbits Labs**
*Lead the cloud-native data center
transformation by 
delivering *scalable *and *efficient *software
defined storage that is 
*easy *to consume.

*This message is sent in confidence for the addressee 
only.  It
may contain legally privileged information. The contents are not 
to be
disclosed to anyone other than the addressee. Unauthorized recipients 
are
requested to preserve this confidentiality, advise the sender 
immediately of
any error in transmission and delete the email from their 
systems.*