[Engine-devel] [vdsm] RFC: New Storage API

Saggi Mizrahi smizrahi at redhat.com
Wed Dec 5 00:00:33 UTC 2012



----- Original Message -----
> From: "Adam Litke" <agl at us.ibm.com>
> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> Cc: "VDSM Project Development" <vdsm-devel at lists.fedorahosted.org>, "engine-devel" <engine-devel at ovirt.org>
> Sent: Tuesday, December 4, 2012 6:08:25 PM
> Subject: Re: [vdsm] RFC: New Storage API
> 
> Thanks for sharing this.  It's nice to have something a little more
> concrete to
> think about.  Just a few comments and questions inline to get some
> discussion
> flowing.
> 
> On Tue, Dec 04, 2012 at 04:52:40PM -0500, Saggi Mizrahi wrote:
> > I've been throwing a lot of bits out about the new storage API and
> > I think it's time to talk a bit.
> > I will purposefully try and keep implementation details away and
> > concentrate about how the API looks and how you use it.
> > 
> > First major change is in terminology, there is no long a storage
> > domain but a storage repository.
> > This change is done because so many things are already called
> > domain in the system and this will make things less confusing for
> > new-commers with a libvirt background.
> > 
> > One other changes is that repositories no longer have a UUID.
> > The UUID was only used in the pool members manifest and is no
> > longer needed.
> > 
> > 
> > connectStorageRepository(repoId, repoFormat,
> > connectionParameters={}):
> 
> We should probably add an options/flags parameter for extension of
> all new
> APIs.
Usually I agree but connectionParameters is already generic enough :)
> 
> > repoId - is a transient name that will be used to refer to the
> > connected domain, it is not persisted and doesn't have to be the
> > same across the cluster.
> > repoFormat - Similar to what used to be type (eg. localfs-1.0,
> > nfs-3.4, clvm-1.2).
> > connectionParameters - This is format specific and will used to
> > tell VDSM how to connect to the repo.
> > 
> > disconnectStorageRepository(self, repoId):
> 
> I assume 'self' is a mistake here.  Just want to clarify given all of
> the recent
> talk about instances vs. namespaces.
Yea, it's just pasted from my code
> 
> > In the new API there are only images, some images are mutable and
> > some are not.
> > mutable images are also called VirtualDisks
> > immutable images are also called Snapshots
> 
> By mutable you mean writable right?  Or does the word mutable imply
> more than
> that?
It's a semantic distinction due to implementation details, in general terms, yes.
> 
> > There are no explicit templates, you can create as many images as
> > you want from any snapshot.
> > 
> > There are 4 major image operations:
> > 
> > 
> > createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> >                   userData={}, options={}):
> 
> Is userdata a 'StringMap'?
currently it's a json object. We could limit it to a string map and trust the client to parse types.
We can have it be a string\blob and trust the user to serialize the data.
It's pass-through object either way.
> 
> I will reopen the argument about an options dict vs a flags
> parameter.  I oppose
> the dict for expansion because I think it causes APIs to devolve into
> a mess
> where lots of arbitrary and not well thought out overrides are packed
> into the
> dict over time.  A flags argument (in json and python it can be an
> enum array)
> limits us to really switching flags on and off instead of passing
> arbitrary
> data.
We already have strategy that we know we want to have several options.
Other stuff that have been suggested is to be able to override the img format (qcow2\qed)

The way I envision it is having an class
opts = CommandOptions()
that you add
opts.addStringOption("key", "value")
opts.addIntOption("key", 3)
opt.addBoolOption("key", True)

I know you could just as well have
strategy_space_flag and strategy_performance_flag
and fail the operation if they both exist.
Since it is a matter of personal taste I think it should be decided by a vote.
> 
> > targetRepoId - ID of a connected repo where the disk will be
> > created
> > size - The size of the image you wish to create
> > baseSnapshotId - the ID of the snapshot you want the base the new
> > virtual disk on
> > userData - optional data that will be attached to the new VD, could
> > be anything that the user desires.
> > options - options to modify VDSMs default behavior
> > 
> > returns the id of the new VD
> > 
> > createSnapshot(targetRepoId, baseVirtualDiskId,
> >                userData={}, options={}):
> > targetRepoId - The ID of a connected repo where the new sanpshot
> > will be created and the original image exists as well.
> > size - The size of the image you wish to create
> 
> Why is this needed?  Doesn't the size of a snapshot have to be equal
> to its
> base image?
oops, another copy\paste error, you can see this arg doesn't exist in the method signature.
My proof reading do need more work.
> 
> > baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
> > to snapshot
> 
> Can you snapshot a snapshot?  In that case, this parameter should be
> called
> baseImage.
You can't snapshot a sanpshot, it makes no sense as it can't change and you will get the same object.
> 
> > userData - optional data that will be attached to the new Snapshot,
> > could be anything that the user desires.
> > options - options to modify VDSMs default behavior
> > 
> > returns the id of the new Snapshot
> > 
> > copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> > options={})
> > targetRepoId - The ID of a connected repo where the new image will
> > be created
> > imageId - The image you wish to copy
> 
> Do we locate the sourceRepoId automatically based on the imageId?
I have written about that below when I talk about common options
> 
> > baseImageId - if specified, the new image will contain only the
> > diff between image and Id.
> >               If None the new image will contain all the bits of
> >               image Id. This can be used to copy partial parts of
> >               images for export.
> > userData - optional data that will be attached to the new image,
> > could be anything that the user desires.
> > options - options to modify VDSMs default behavior
> > 
> > return the Id of the new image. In case of copying an immutable
> > image the ID will be identical to the original image as they
> > contain the same data. However the user should not assume that and
> > always use the value returned from the method.
> > 
> > removeImage(repositoryId, imageId, options={}):
> > repositoryId - The ID of a connected repo where the image to delete
> > resides
> > imageId - The id of the image you wish to delete.
> > 
> > ----
> > getImageStatus(repositoryId, imageId)
> > repositoryId - The ID of a connected repo where the image to check
> > resides
> > imageId - The id of the image you wish to check.
> 
> What is in this return value?  Is it a single enum indicating whether
> the image
> is locked (being copied, etc.) or a list of detailed information
> (like
> Volume.getInfo)?  (I see some more info below...)
The description below explains everything.
> 
> > All operations return once the operations has been committed to
> > disk NOT when the operation actually completes.
> > This is done so that:
> > - operation come to a stable state as quickly as possible.
> > - In case where there is an SDM, only small portion of the
> > operation actually needs to be performed on the SDM host.
> > - No matter how many times the operation fails and on how many
> > hosts, you can always resume the operation and choose when to do
> > it.
> > - You can stop an operation at any time and remove the resulting
> > object making a distinction between "stop because the host is
> > overloaded" to "I don't want that image"
> > 
> > This means that after calling any operation that creates a new
> > image the user must then call getImageStatus() to check what is
> > the status of the image.
> > The status of the image can be either optimized, degraded, or
> > broken.
> > "Optimized" means that the image is available and you can run VMs
> > of it.
> > "Degraded" means that the image is available and will run VMs but
> > it might be a better way VDSM can represent the underlying data.
> > "Broken" means that the image can't be used at the moment, probably
> > because not all the data has been set up on the volume.
> > 
> > Apart from that VDSM will also return the last persisted status
> > information which will conatin
> > hostID - the last host to try and optimize of fix the image
> > stage - X/Y (eg. 1/10) the last persisted stage of the fix.
> 
> Do you have some examples of what the stages would be?  I think these
> should be
> defined in enums so that the user can check on what the individual
> stages mean.
> What happens when the low level implementation of an operation
> changes?  The
> meaning of the stages will change completely.
No stages can change and are implementation specific, they will be used in the UI to show progress.
I added stages because sometimes it's impossible to give percentage for multiple operations.
(If you need to do 2 copy operations is each of them 50%?)
I agree that we need some way of transferring text. Maybe having another field called current operations.
It will be something from a preset enum list:
"Copying Image" 1 
"Creating Volume" 2
"Changing Metadata" 3
That way we don't commit on number of stages or what they contain but still expose what it is we are doing to display in the UI
> 
> > percent_complete - -1 or 0-100, the last persisted completion
> > percentage of the aforementioned stage. -1 means that no progress
> > is available for that operation.
> 
> > last_error - This will only be filled if the operation failed
> > because of something other then IO or a VDSM crash for obvious
> > reasons.
> >              It will usually be set if the task was manually
> >              stopped
> > 
> > The user can either be satisfied with that information or as the
> > host specified in host ID if it is still working on that image by
> > checking it's running tasks.
> > 
> > checkStorageRepository(self, repositoryId, options={}):
> > A method to go over a storage repository and scan for any existing
> > problems. This includes degraded\broken images and deleted images
> > that have no yet been physically deleted\merged.
> > It returns a list of Fix objects.
> > Fix objects come in 4 types:
> > clean - cleans data, run them to get more space.
> > optimize - run them to optimize a degraded image
> 
> What is an example of a degraded image?
qcow snapshot when the user asked for performance driven image.
We create the QCow snapshot first because it will make the VM usable without doing a lot of IO.
Optimizing it will be to actually flatten the data and reach the desired state.
> 
> > merge - Merges two images together. Doing this sometimes
> >         makes more images ready optimizing or cleaning.
> >         The reason it is different from optimize is that
> >         unmerged images are considered optimized.
> > mend - mends a broken image
> 
> What does this mean?
The image is half copied. Snapshot that we know got corrupted and we have backups of. Whatever operation that moves an image status from broken->(optimized\degraded)
> 
> > The user can read these types and prioritize fixes. Fixes also
> > contain opaque FIX data and they should be sent as received to
> > fixStorageRepository(self, repositoryId, fix, options={}):
> > 
> > That will start a fix operation.
> 
> Could we have an automatic fix mode where vdsm just does the right
> thing (for
> most things)?
I wanted to have that, but deciding what VDSM should do automatically and when is really difficult and policy driven.
The best way to go IMHO is to have policy aware daemons monitor VDSM on the same host and auto-fix according to policy.
This shouldn't be part of VDSM as policy changes from manager to manager.
> 
> > All major operations automatically start the appropriate "Fix" to
> > bring the created object to an optimize\degraded state (the one
> > that is quicker) unless one of the options is
> > AutoFix=False. This is only useful for repos that might not be able
> > to create volumes on all hosts (SDM) but would like to have the
> > actual IO distributed in the cluster.
> > 
> > Other common options is the strategy option:
> > It has currently 2 possible values
> > space and performance - In case VDSM has 2 ways of completing the
> > same operation it will tell it to value one over the other. For
> > example, whether to copy all the data or just create a qcow based
> > of a snapshot.
> > The default is space.
> 
> I like this a lot.
:)
> 
> > You might have also noticed that it is never explicitly specified
> > where to look for existing images. This is done purposefully, VDSM
> > will always look in all connected repositories for existing
> > objects.
> > For very large setups this might be problematic. To mitigate the
> > problem you have these options:
> > participatingRepositories=[repoId, ...] which tell VDSM to narrow
> > the search to just these repositories
> > and
> > imageHints={imgId: repoId} which will force VDSM to look for those
> > image ID just in those repositories and fail if it doesn't find
> > them there.
> 
> I would like to have a better way of specifying these optional
> parameters
> without burying them in an options structure.  I will think a little
> more about
> this.  Strategy can just be a two optional flags in a 'flags'
> argument.  For the
> participatingRepositories and imageHints options, I think we need to
> use real
> parameters.
That was what I did in my original design, but then I chose to just move everything to options.
The reasoning is that without any options VDSM will try and do what it think is best for the user.
Without any options VDSM will use the combination that is most likely to work and in the shortest amount of resources (time\IO) with as little complications as possible.
This means space driven operations as they require less IO and space and "automatic source image discovery" as it has the best chance of finding the resources you need.
Whatever you put in the options essentially overrides the default behavior for cases where the user is willing to take a risk for some theoretical gain.
Using this distinction makes it clear what regular users should use and what expert users should use.
Further more, some options might be repository format specific (striping, pre-zeroing) making them part of the interface is problematic.

Also, the fact that we know about these options now doesn't mean they should get special treatment, they are still saying "override the default behavior".
> 
> --
> Adam Litke <agl at us.ibm.com>
> IBM Linux Technology Center
> 
> 



More information about the Engine-devel mailing list