On 24/02, Nir Soffer wrote:
On Thu, Feb 24, 2022 at 8:46 PM Gorka Eguileor
<geguileo(a)redhat.com> wrote:
>
> On 24/02, Nir Soffer wrote:
> > On Thu, Feb 24, 2022 at 6:35 PM Muli Ben-Yehuda <muli(a)lightbitslabs.com>
wrote:
> > >
> > > On Thu, Feb 24, 2022 at 6:28 PM Nir Soffer <nsoffer(a)redhat.com>
wrote:
> > >>
> > >> On Thu, Feb 24, 2022 at 6:10 PM Muli Ben-Yehuda
<muli(a)lightbitslabs.com> wrote:
> > >> >
> > >> > On Thu, Feb 24, 2022 at 3:58 PM Nir Soffer
<nsoffer(a)redhat.com> wrote:
> > >> >>
> > >> >> On Wed, Feb 23, 2022 at 6:24 PM Muli Ben-Yehuda
<muli(a)lightbitslabs.com> wrote:
> > >> >> >
> > >> >> > Thanks for the detailed instructions, Nir. I'm going
to scrounge up some hardware.
> > >> >> > By the way, if anyone else would like to work on
NVMe/TCP support, for NVMe/TCP target you can either use Lightbits (talk to me offline for
details) or use the upstream Linux NVMe/TCP target. Lightbits is a clustered storage
system while upstream is a single target, but the client side should be close enough for
vdsm/ovirt purposes.
> > >> >>
> > >> >> I played with NVMe/TCP a little bit, using qemu to create a
virtual
> > >> >> NVMe disk, and export
> > >> >> it using the kernel on one VM, and consume it on another VM.
> > >> >>
https://futurewei-cloud.github.io/ARM-Datacenter/qemu/nvme-of-tcp-vms/
> > >> >>
> > >> >> One question about device naming - do we always get the same
name of the
> > >> >> device in all hosts?
> > >> >
> > >> >
> > >> > No, we do not, see below how we handle migration in os_brick.
> > >> >
> > >> >> To support VM migration, every device must have unique name
in the cluster.
> > >> >> With multipath we always have unique name, since we disable
"friendly names",
> > >> >> so we always have:
> > >> >>
> > >> >> /dev/mapper/{wwid}
> > >> >>
> > >> >> With rbd we also do not use /dev/rbdN but a unique path:
> > >> >>
> > >> >> /dev/rbd/poolname/volume-vol-id
> > >> >>
> > >> >> How do we ensure cluster-unique device path? If os_brick does
not handle it, we
> > >> >> can to do in ovirt, for example:
> > >> >>
> > >> >> /run/vdsm/mangedvolumes/{uuid} -> /dev/nvme7n42
> > >> >>
> > >> >> but I think this should be handled in cinderlib, since
openstack have
> > >> >> the same problem with migration.
> > >> >
> > >> >
> > >> > Indeed. Both the Lightbits LightOS connector and the nvmeof
connector do this through the target provided namespace (LUN) UUID. After connecting to
the target, the connectors wait for the local friendly-named device file that has the
right UUID to show up, and then return the friendly name. So different hosts will have
different friendly names, but the VMs will be attached to the right namespace since we
return the friendly name on the current host that has the right UUID. Does this also work
for you?
> > >>
> > >> It will not work for oVirt.
> > >>
> > >> Migration in oVirt works like this:
> > >>
> > >> 1. Attach disks to destination host
> > >> 2. Send VM XML from source host to destination host, and start the
> > >> VM is paused mode
> > >> 3. Start the migration on the source host
> > >> 4. When migration is done, start the CPU on the destination host
> > >> 5. Detach the disks from the source
> > >>
> > >> This will break in step 2, since the source xml refer to nvme device
> > >> that does not exist or already used by another VM.
> > >
> > >
> > > Indeed.
> > >
> > >> To make this work, the VM XML must use the same path, existing on
> > >> both hosts.
> > >>
> > >> The issue can be solved by libvirt hook updating the paths before
qemu
> > >> is started on the destination, but I think the right way to handle
this is to
> > >> have the same path.
> > >
> > >
> > > You mentioned above that it can be handled in ovirt (c.f.,
/run/vdsm/mangedvolumes/{uuid} -> /dev/nvme7n42), which seems like a reasonable
approach given the constraint imposed by the oVirt migration flow you outlined above. What
information does vdsm need to create and use the /var/run/vdsm/managedvolumes/{uuid} link?
Today the connector does (trimmed for brevity):
> > >
> > > def connect_volume(self, connection_properties):
> > > device_info = {'type': 'block'}
> > > uuid = connection_properties['uuid']
> > > device_path = self._get_device_by_uuid(uuid)
> > > device_info['path'] = device_path
> > > return device_info
> >
> > I think we have 2 options:
> >
> > 1. unique path created by os_brick using the underlying uuid
> >
> > In this case the connector will return the uuid, and ovirt will use
> > it to resolve the unique path that will be stored and used on engine
> > side to create the vm xml.
> >
> > I'm not sure how the connector should return this uuid. Looking in current
> > vdsm code:
> >
> > if vol_type in ("iscsi", "fibre_channel"):
> > if "multipath_id" not in attachment:
> > raise se.ManagedVolumeUnsupportedDevice(vol_id, attachment)
> > # /dev/mapper/xxxyyy
> > return os.path.join(DEV_MAPPER, attachment["multipath_id"])
> > elif vol_type == "rbd":
> > # /dev/rbd/poolname/volume-vol-id
> > return os.path.join(DEV_RBD,
connection_info['data']['name'])
> >
> > os_brick does not have a uniform way to address different devices.
> >
> > Maybe Gorka can help with this.
>
> Hi,
>
> That is true, because in OpenStack we haven't had the need to have the
> same path on every host or even on the same host during different
> connections.
>
> For nvme a new `elif` clause could be added there, though it will be a
> bit trickier, because the nvme connection properties format are a bit of
> a mess...
>
> We have 2 different formats for the nvme properties, and the wwid that
> appears in symlink /dev/disk/by-id/nvme-<wwid> may or may not be the
> volume id, may be the uuid in the connection info if present or the
> nguid if the nvme device doesn't have uuid.
>
> For these reasons I would recommend not relying on the connection
> information and relying on the path from the attachment instead.
>
> Something like this should be probably fine:
>
> elif vol_type == 'nvme':
> device_name = os.path.basename(attachment['path'])
> controller = device_name.rsplit('n', 1)[0]
> wwid_filename = f'/sys/class/nvme/{controller}/{device_name}/wwid'
> with open(wwid_filename, 'r') as f:
> uuid = f.read().strip()
> return os.path.join('/dev/disk/by-id/nvme-', uuid)
Thanks Gorka!
but isn't this duplicating logic already in os brick?
https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a8...
Hi Nir,
Oh! I thought we were talking about the generic NVMe-oF connector,
didn't know this was specific about the LightOS one.
The link is used as an easy way to locate the volume, it doesn't mean
that it is returned to the caller of the `connect_volume` method. In
fact, we can see how that method actually returns the real path and not
the link's path:
def _check_device_exists_using_dev_lnk(self, uuid):
lnk_path = f"/dev/disk/by-id/nvme-uuid.{uuid}"
--> if os.path.exists(lnk_path):
^^^ Check link exists
--> devname = os.path.realpath(lnk_path)
^^^ Get the real path for the symlink
--> if devname.startswith("/dev/nvme"):
^^^ Make extra sure it's not pointing to something crazy
LOG.info("LIGHTOS: devpath %s detected for uuid %s",
devname, uuid)
--> return devname
^^^ Return it
return None
Another interesting detail is this wait:
https://github.com/openstack/os-brick/blob/56bf0272b55dcbbc7f5b03150973a8...
def _get_device_by_uuid(self, uuid):
endtime = time.time() + self.WAIT_DEVICE_TIMEOUT
while time.time() < endtime:
try:
device = self._check_device_exists_using_dev_lnk(uuid)
if device:
return device
except Exception as e:
LOG.debug(f'LIGHTOS: {e}')
device = self._check_device_exists_reading_block_class(uuid)
if device:
return device
time.sleep(1)
return None
The code does not explain why it tries to use the /dev/disk/by-id link
and fallback to sysfs on errors. Based on our experience with udev,
I guess that the author does not trust udev. I wonder if we can trust
it as the stable device path.
In my experience udev rules (which is different from udev itself) are
less that reliable as a way of finding devices when working "in the
wild". They are only reliable if you have full control over the host
system and are sure nobody (admin or distro) can break things.
For reference, at Red Hat we have an RFE to improve os-brick [1] and
stop using symlinks at all.
While they are not 100% reliable in the wild, they are quite reliable
once they are working on a specific system, which means that if we
confirm they are working on a system we can rely on them if no changes
are made on the system (and if CPU is not 100% during attachment).
[1]:
https://bugzilla.redhat.com/show_bug.cgi?id=1697319
If we can trust this path, maybe os_brick can return the stable path
in a uniform way for all kind of devices?
I don't think this is likely to happen, because it has no real value for
OpenStack so it's unlikely to get prioritized (for coding and reviews).
Cheers,
Gorka.
>
> Cheers,
> Gorka.
>
> >
> > 2. unique path created by oVirt
> >
> > In this case oVirt will use the disk uuid already used in
> > ManagedVolume.{attach,detach}_volume APIs:
> >
https://github.com/oVirt/vdsm/blob/500c035903dd35180d71c97791e0ce4356fb77...
> >
https://github.com/oVirt/vdsm/blob/500c035903dd35180d71c97791e0ce4356fb77...
> >
> > From oVirt point of view, using the disk uuid seems better. It makes it easy
> > to debug when you can follow the uuid in all logs on different systems and
> > locate the actual disk using the same uuid.
> >
> > Nir
> >
>