[Engine-devel] [vdsm] RFC: New Storage API

Tue Jan 15 21:37:49 UTC 2013

----- Original Message -----
> From: "Ayal Baron" <abaron at redhat.com>
> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> Cc: "engine-devel" <engine-devel at ovirt.org>, "VDSM Project Development" <vdsm-devel at lists.fedorahosted.org>, "Deepak
> C Shetty" <deepakcs at linux.vnet.ibm.com>
> Sent: Monday, January 14, 2013 6:23:32 PM
> Subject: Re: [vdsm] RFC: New Storage API
> 
> 
> 
> ----- Original Message -----
> > 
> > 
> > ----- Original Message -----
> > > From: "Ayal Baron" <abaron at redhat.com>
> > > To: "Saggi Mizrahi" <smizrahi at redhat.com>
> > > Cc: "engine-devel" <engine-devel at ovirt.org>, "VDSM Project
> > > Development" <vdsm-devel at lists.fedorahosted.org>, "Deepak
> > > C Shetty" <deepakcs at linux.vnet.ibm.com>
> > > Sent: Monday, January 14, 2013 4:56:05 PM
> > > Subject: Re: [vdsm] RFC: New Storage API
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Deepak C Shetty" <deepakcs at linux.vnet.ibm.com>
> > > > > To: "Saggi Mizrahi" <smizrahi at redhat.com>
> > > > > Cc: "Shu Ming" <shuming at linux.vnet.ibm.com>, "engine-devel"
> > > > > <engine-devel at ovirt.org>, "VDSM Project Development"
> > > > > <vdsm-devel at lists.fedorahosted.org>, "Deepak C Shetty"
> > > > > <deepakcs at linux.vnet.ibm.com>
> > > > > Sent: Sunday, December 16, 2012 11:40:01 PM
> > > > > Subject: Re: [vdsm] RFC: New Storage API
> > > > > 
> > > > > On 12/08/2012 01:23 AM, Saggi Mizrahi wrote:
> > > > > >
> > > > > > ----- Original Message -----
> > > > > >> From: "Deepak C Shetty" <deepakcs at linux.vnet.ibm.com>
> > > > > >> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> > > > > >> Cc: "Shu Ming" <shuming at linux.vnet.ibm.com>,
> > > > > >> "engine-devel"
> > > > > >> <engine-devel at ovirt.org>, "VDSM Project Development"
> > > > > >> <vdsm-devel at lists.fedorahosted.org>, "Deepak C Shetty"
> > > > > >> <deepakcs at linux.vnet.ibm.com>
> > > > > >> Sent: Friday, December 7, 2012 12:23:15 AM
> > > > > >> Subject: Re: [vdsm] RFC: New Storage API
> > > > > >>
> > > > > >> On 12/06/2012 10:22 PM, Saggi Mizrahi wrote:
> > > > > >>> ----- Original Message -----
> > > > > >>>> From: "Shu Ming" <shuming at linux.vnet.ibm.com>
> > > > > >>>> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> > > > > >>>> Cc: "VDSM Project Development"
> > > > > >>>> <vdsm-devel at lists.fedorahosted.org>, "engine-devel"
> > > > > >>>> <engine-devel at ovirt.org>
> > > > > >>>> Sent: Thursday, December 6, 2012 11:02:02 AM
> > > > > >>>> Subject: Re: [vdsm] RFC: New Storage API
> > > > > >>>>
> > > > > >>>> Saggi,
> > > > > >>>>
> > > > > >>>> Thanks for sharing your thought and I get some comments
> > > > > >>>> below.
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> Saggi Mizrahi:
> > > > > >>>>> I've been throwing a lot of bits out about the new
> > > > > >>>>> storage
> > > > > >>>>> API
> > > > > >>>>> and
> > > > > >>>>> I think it's time to talk a bit.
> > > > > >>>>> I will purposefully try and keep implementation details
> > > > > >>>>> away
> > > > > >>>>> and
> > > > > > >>>>> concentrate on how the API looks and how you use it.
> > > > > >>>>>
> > > > > > >>>>> First major change is in terminology, there is no longer
> > > > > >>>>> a
> > > > > >>>>> storage
> > > > > >>>>> domain but a storage repository.
> > > > > >>>>> This change is done because so many things are already
> > > > > >>>>> called
> > > > > >>>>> domain in the system and this will make things less
> > > > > >>>>> confusing
> > > > > >>>>> for
> > > > > > >>>>> newcomers with a libvirt background.
> > > > > >>>>>
> > > > > > >>>>> One other change is that repositories no longer have a
> > > > > >>>>> UUID.
> > > > > >>>>> The UUID was only used in the pool members manifest and
> > > > > >>>>> is
> > > > > >>>>> no
> > > > > >>>>> longer needed.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> connectStorageRepository(repoId, repoFormat,
> > > > > >>>>> connectionParameters={}):
> > > > > >>>>> repoId - is a transient name that will be used to refer
> > > > > >>>>> to
> > > > > >>>>> the
> > > > > >>>>> connected domain, it is not persisted and doesn't have
> > > > > >>>>> to
> > > > > >>>>> be
> > > > > >>>>> the
> > > > > >>>>> same across the cluster.
> > > > > >>>>> repoFormat - Similar to what used to be type (eg.
> > > > > >>>>> localfs-1.0,
> > > > > >>>>> nfs-3.4, clvm-1.2).
> > > > > > >>>>> connectionParameters - This is format specific and will
> > > > > > >>>>> be used
> > > > > >>>>> to
> > > > > >>>>> tell VDSM how to connect to the repo.
> > > > > >>>> Where does repoID come from? I think repoID doesn't
> > > > > >>>> exist
> > > > > >>>> before
> > > > > >>>> connectStorageRepository() return.  Isn't repoID a
> > > > > >>>> return
> > > > > >>>> value
> > > > > >>>> of
> > > > > >>>> connectStorageRepository()?
> > > > > >>> No, repoIDs are no longer part of the domain, they are
> > > > > >>> just
> > > > > >>> a
> > > > > >>> transient handle.
> > > > > >>> The user can put whatever it wants there as long as it
> > > > > >>> isn't
> > > > > >>> already taken by another currently connected domain.
> > > > > >> So what happens when user mistakenly gives a repoID that
> > > > > >> is
> > > > > >> in
> > > > > >> use
> > > > > >> before.. there should be something in the return value
> > > > > >> that
> > > > > >> specifies
> > > > > >> the error and/or reason for error so that user can try
> > > > > >> with
> > > > > >> a
> > > > > >> new/diff
> > > > > >> repoID ?
> > > > > > > As I said, connect fails if the repoId is in use ATM.
> > > > > >>>>> disconnectStorageRepository(self, repoId)
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> In the new API there are only images, some images are
> > > > > >>>>> mutable
> > > > > >>>>> and
> > > > > >>>>> some are not.
> > > > > >>>>> mutable images are also called VirtualDisks
> > > > > >>>>> immutable images are also called Snapshots
> > > > > >>>>>
> > > > > >>>>> There are no explicit templates, you can create as many
> > > > > >>>>> images
> > > > > >>>>> as
> > > > > >>>>> you want from any snapshot.
> > > > > >>>>>
> > > > > >>>>> There are 4 major image operations:
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> createVirtualDisk(targetRepoId, size,
> > > > > >>>>> baseSnapshotId=None,
> > > > > >>>>>                      userData={}, options={}):
> > > > > >>>>>
> > > > > >>>>> targetRepoId - ID of a connected repo where the disk
> > > > > >>>>> will
> > > > > >>>>> be
> > > > > >>>>> created
> > > > > >>>>> size - The size of the image you wish to create
> > > > > > >>>>> baseSnapshotId - the ID of the snapshot you want to
> > > > > >>>>> base
> > > > > >>>>> the
> > > > > >>>>> new
> > > > > >>>>> virtual disk on
> > > > > >>>>> userData - optional data that will be attached to the
> > > > > >>>>> new
> > > > > >>>>> VD,
> > > > > >>>>> could
> > > > > >>>>> be anything that the user desires.
> > > > > >>>>> options - options to modify VDSMs default behavior
> > > > > >> IIUC, i can use options to do storage offloads ? For eg. I
> > > > > >> can
> > > > > >> create
> > > > > >> a
> > > > > >> LUN that represents this VD on my storage array based on
> > > > > >> the
> > > > > >> 'options'
> > > > > >> parameter ? Is this the intended way to use 'options' ?
> > > > > > No, this has nothing to do with offloads.
> > > > > > > If by "offloads" you mean having other VDSM hosts do the
> > > > > > heavy
> > > > > > lifting then this is what the option autoFix=False and the
> > > > > > fix
> > > > > > mechanism is for.
> > > > > > If you are talking about advanced scsi features (ie. write
> > > > > > same)
> > > > > > they will be used automatically whenever possible.
> > > > > > In any case, how we manage LUNs (if they are even used) is
> > > > > > an
> > > > > > implementation detail.
> > > > > 
> > > > > I am a bit more interested in how storage array offloads ( by
> > > > > that
> > > > > I
> > > > > mean, offload VD creation, snapshot, clone etc to the storage
> > > > > array
> > > > > when
> > > > > available/possible) can be done from VDSM ?
> > > > > In the past there were talks of using libSM to do that. How
> > > > > does
> > > > > that
> > > > > strategy play in this new Storage API scenario ? I agree its
> > > > > implmn
> > > > > detail, but how & where does that implm sit and how it would
> > > > > be
> > > > > triggered is not very clear to me. Looking at createVD args,
> > > > > it
> > > > > sounded
> > > > > like 'options' seems to be a trigger point for deciding
> > > > > whether
> > > > > to
> > > > > use
> > > > > storage offloads or not, but you spoke otherwise :) Can you
> > > > > provide
> > > > > your
> > > > > vision on how VDSM can understand the storage array
> > > > > capabilities
> > > > > &
> > > > > > exploit storage array offloads in this New Storage API
> > > > > context
> > > > > ?
> > > > > --
> > > > > Thanks deepak
> > > > Some will be used automatically whenever possible (storage
> > > > offloading).
> > > > Features that favor a specific strategy will be activated when
> > > > the
> > > > proper strategy (space, performance) option is selected.
> > > > In cases when only the user can know whether to use a feature
> > > > or
> > > > not
> > > > we will have options to turn that on.
> > > > In any case every domain exports a capability list through
> > > > GetRepositoryCapabilities() that returns a list of repository
> > > > capabilities.
> > > > Some capabilities are VDSM specific like CLUSTERED or
> > > > REQUIRES_SRM.
> > > > Some are storage capabilities like NATIVE_SNAPSHOTS,
> > > > NATIVE_THIN_PROVISIONING, SPARSE_VOLUMES, etc...
> > > > 
> > > > We are also considering an override mechanism where you can
> > > > disable
> > > > features in storage that supports it by setting it in the
> > > > domain
> > > > options. This will be done with NO_XXXXX (eg.
> > > > NO_NATIVE_SNAPSHOTS).
> > > > This will make the domain not use or expose the capability
> > > > through
> > > > the API. I assume it will only be used for testing or in cases
> > > > where
> > > > the storage array is known to have problems with a certain
> > > > feature.
> > > > Not everything can be disabled; as an example, there is no real
> > > > way
> > > > to
> > > > disable NATIVE_THIN_PROVISIONING or SPARSE_VOLUMES.
> > > 
> > > Saggi, there are several different discussions going on here
> > > which
> > > I
> > > think require some clearing up (and perhaps splitting).
> > > what I think is missing here:
> > > 1. distinction of what we believe should be at repo level and
> > > what
> > > at
> > > disk level
> > >     e.g. pre/postZero at repo level, native snapshots as
> > >     described
> > >     above would also be repo level (not defined per disk) etc.
> > No option is inherently repo or disk based (even the ones you
> > mentioned).
> > It all depends on the VDSM version you are interacting with. I am
> > open to
> > the suggestion on having a way to query whether certain options are
> > supported by what operation and in what level dynamically.
> > 
> > As for post-zeroing, currently it is kind of insulting to our users
> > that
> > we push it as a solution to prevent new VMs from reading old VMs
> > data. It
> > is just not what it does. It will be replaced with pre-zeroing.
> > 
> > The only reason I see for post zeroing is the "wipe HDD before
> > selling it off"
> > use case but I don't see how it is valid in virtualized
> > environments
> > especially
> > with enterprise storage.
> 
> I have 2 problems with this:
> 1. you're assuming 'enterprise' storage and in cloud this is many
> times not the case at all.  On local storage it's pretty safe to
> assume that what you see is what you get (on block devices at
> least).
> 2. we have requests to in fact add additional wiping algorithms to
> really clean things up exactly for the 'wipe HDD before selling it
> off' scenario (only valid of course if we actually write on the
> relevant sectors which is something we cannot guarantee, but with
> the right disclaimers, this should be fine).
> 
> Rethinking this, at most post-zero is wasteful, but storage array
> should guarantee that next time data is read from disk then zeros
> are served (if it doesn't then it's a security flaw on the storage
> side which we shouldn't care about).
> 
> > 
> > > 2. how/where storage offload would work? is there a single
> > > implementation for repos which detects automatically storage
> > > capabilities or repo class for each storage type
> > They will be used automatically if available and supported by VDSM
> > and the
> > repo format. It doesn't matter to this section of the API whether
> > the
> > user
> > will have to manually create repositories in specific formats or if
> > it will
> > be automatic. As it seems, it will probably be automatic.
> > > 3. and biggest topic is probably - a mapping of the image
> > > operations
> > > and details about the flows (did you send something about that?)
> > > -
> > > i.e. create vdisk flow, copy, etc.
> > The flows you suggest here are basic flows supported with a single
> > API call.
> > 
> > Because of the nature of image states users need to have a
> > mechanism
> > to
> > track said image states and make sure fixes are running (if
> > desired)
> > 
> > Everyone is welcome to give it a shot and see how they can
> > implement
> > high
> > level flows using this API in their own systems. I don't mind
> > helping
> > people
> > figure it out, just ask.
> 
> Actually I meant the new images API flows here (i.e. the internals,
> not how to use it).
Well the internals are out of the scope of this discussion. I don't
think it matters. There is some stuff I put on the wiki but it is
a bit out of date at the moment, though the general principles remain
the same.

Furthermore, the main idea driving the API is separating the
implementation from the interface.

- Using intent (space vs performance) instead of implementation details
  (qcow vs raw).

- Removing explicit chains
- Allowing a copy operation against any disk instead of forcing the
  user to understand copy\move(copyOP)\move(moveOP).

- Having fixes be opaque, describing only their purpose
  (merge\clean\optimize) instead of what they actually do.

- Making properties like SPM dynamic and queried by the user instead
  of inherent to the repo format, making it easier to support new
  implementations that share similar traits and to remove\add constraints
  on the implementation between versions.

This, again, means that VDSM can implement the intent in any way it sees
fit as long as it keeps the semantics intact.
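
A minimal sketch of the "intent instead of implementation details" idea,
in Python. Everything here is hypothetical: the Strategy enum, the
choose_format() helper and the format names it returns are invented for
illustration and are not part of VDSM. Only the capability name
NATIVE_THIN_PROVISIONING comes from earlier in this thread.

```python
from enum import Enum


class Strategy(Enum):
    """Caller-visible intent (hypothetical names)."""
    SPACE = "space"
    PERFORMANCE = "performance"


def choose_format(strategy, capabilities):
    """Pick an on-disk layout for the given intent.

    The mapping is an assumption made up for this sketch. The caller
    never names qcow or raw, so the mapping can change between versions
    without changing the API.
    """
    if strategy is Strategy.SPACE:
        # Prefer the storage's own thin provisioning when it is advertised.
        if "NATIVE_THIN_PROVISIONING" in capabilities:
            return "raw"
        return "qcow2"
    # PERFORMANCE: a fully allocated raw image in this sketch.
    return "raw-preallocated"
```

The caller only ever passes Strategy.SPACE or Strategy.PERFORMANCE; which
format comes out is the implementation's business.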

To keep the burden of semantics to a minimum, the fact that images can be
created and remain in a "broken" or "degraded" state is introduced. This
means that even if a process takes a lot of time or might require specific
rare constraints, it can still be implemented through the creation of
intermediate states which leave the image in a "degraded" state, making the
VM operational but still giving room for further operations.
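
As a rough sketch of how a client might model these states, in Python.
The ImageState enum and both helpers are hypothetical. The state names
are the ones used in this mail; the assumption that a "broken" image
cannot back a VM is mine and is not something the API has committed to.

```python
from enum import Enum


class ImageState(Enum):
    """Image states named in this mail (the enum itself is hypothetical)."""
    BROKEN = "broken"
    DEGRADED = "degraded"
    OPTIMIZED = "optimized"


def is_usable(state):
    """A degraded image can still back a running VM.

    Treating BROKEN as unusable is an assumption of this sketch.
    """
    return state in (ImageState.DEGRADED, ImageState.OPTIMIZED)


def needs_fix(state):
    """Anything short of OPTIMIZED still leaves room for a fix to run."""
    return state is not ImageState.OPTIMIZED
```

A management layer could poll the state, start VMs as soon as
is_usable() holds, and keep scheduling fixes while needs_fix() holds.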

Having createXXXX() just mean "I promise to try my best to create XXXXX,
here is the object ID YYYYYYY" gives way to just about any implementation
desirable. You can even have it put a file on the disk with the ID and
the word "broken" and then send an email to a person who will manually
write the bits to another file, finishing by writing the path to the new
disk and the word "optimized" in it.
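
A toy illustration of that contract, in Python. ToyRepo and its method
names are made up for this sketch and are not the proposed API; the
point is only that create returns an ID right away and the state
catches up later, by whatever means.

```python
import uuid


class ToyRepo:
    """In-memory stand-in for a repository (hypothetical, not VDSM code)."""

    def __init__(self):
        self._states = {}

    def create_virtual_disk(self, size):
        """Promise to try to create a disk; hand back its ID immediately."""
        image_id = str(uuid.uuid4())
        # Nothing has been written yet, so the image starts out "broken".
        self._states[image_id] = "broken"
        return image_id

    def finish(self, image_id):
        """Stand-in for whoever eventually writes the bits."""
        self._states[image_id] = "optimized"

    def get_state(self, image_id):
        return self._states[image_id]
```

Whether finish() is a background thread, another host, or the person
from the email example makes no difference to the caller.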

The well known open\read\write\close APIs are implemented in many ways
with many file systems. I also hope that when my implementation is
retired because everyone finds out how horrible it is, the interface
can remain consistent.

The question I'm asking is whether the minimal set of operations given,
and their very minimal set of commitments, is enough for people to implement
higher level concepts as long as the semantics are kept by the
implementation.