[Engine-devel] [vdsm] RFC: New Storage API

Mon Jan 14 17:03:44 UTC 2013


----- Original Message -----
> From: "Itamar Heim" <iheim at redhat.com>
> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> Cc: "VDSM Project Development" <vdsm-devel at lists.fedorahosted.org>, "engine-devel" <engine-devel at ovirt.org>
> Sent: Monday, January 14, 2013 6:18:13 AM
> Subject: Re: [vdsm] RFC: New Storage API
> 
> On 12/04/2012 11:52 PM, Saggi Mizrahi wrote:
> > I've been throwing a lot of bits out about the new storage API and
> > I think it's time to talk a bit.
> > I will purposefully try and keep implementation details away and
> > concentrate about how the API looks and how you use it.
> >
> > First major change is in terminology, there is no long a storage
> > domain but a storage repository.
> > This change is done because so many things are already called
> > domain in the system and this will make things less confusing for
> > new-commers with a libvirt background.
> >
> > One other changes is that repositories no longer have a UUID.
> > The UUID was only used in the pool members manifest and is no
> > longer needed.
> >
> >
> > connectStorageRepository(repoId, repoFormat,
> > connectionParameters={}):
> > repoId - is a transient name that will be used to refer to the
> > connected domain, it is not persisted and doesn't have to be the
> > same across the cluster.
> > repoFormat - Similar to what used to be type (eg. localfs-1.0,
> > nfs-3.4, clvm-1.2).
> > connectionParameters - This is format specific and will used to
> > tell VDSM how to connect to the repo.
> >
> > disconnectStorageRepository(self, repoId):
> >
> >
> > In the new API there are only images, some images are mutable and
> > some are not.
> > mutable images are also called VirtualDisks
> > immutable images are also called Snapshots
> >
> > There are no explicit templates, you can create as many images as
> > you want from any snapshot.
> >
> > There are 4 major image operations:
> >
> >
> > createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> >                    userData={}, options={}):
> >
> > targetRepoId - ID of a connected repo where the disk will be
> > created
> > size - The size of the image you wish to create
> > baseSnapshotId - the ID of the snapshot you want the base the new
> > virtual disk on
> > userData - optional data that will be attached to the new VD, could
> > be anything that the user desires.
> > options - options to modify VDSMs default behavior
> >
> > returns the id of the new VD
> >
> > createSnapshot(targetRepoId, baseVirtualDiskId,
> >                 userData={}, options={}):
> > targetRepoId - The ID of a connected repo where the new sanpshot
> > will be created and the original image exists as well.
> > size - The size of the image you wish to create
> > baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
> > to snapshot
> > userData - optional data that will be attached to the new Snapshot,
> > could be anything that the user desires.
> > options - options to modify VDSMs default behavior
> >
> > returns the id of the new Snapshot
> >
> > copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> > options={})
> > targetRepoId - The ID of a connected repo where the new image will
> > be created
> > imageId - The image you wish to copy
> > baseImageId - if specified, the new image will contain only the
> > diff between image and Id.
> >                If None the new image will contain all the bits of
> >                image Id. This can be used to copy partial parts of
> >                images for export.
> > userData - optional data that will be attached to the new image,
> > could be anything that the user desires.
> > options - options to modify VDSMs default behavior
> >
> > return the Id of the new image. In case of copying an immutable
> > image the ID will be identical to the original image as they
> > contain the same data. However the user should not assume that and
> > always use the value returned from the method.
> >
> > removeImage(repositoryId, imageId, options={}):
> > repositoryId - The ID of a connected repo where the image to delete
> > resides
> > imageId - The id of the image you wish to delete.
> >
> >
> > ----
> > getImageStatus(repositoryId, imageId)
> > repositoryId - The ID of a connected repo where the image to check
> > resides
> > imageId - The id of the image you wish to check.
> >
> > All operations return once the operations has been committed to
> > disk NOT when the operation actually completes.
> > This is done so that:
> > - operation come to a stable state as quickly as possible.
> > - In case where there is an SDM, only small portion of the
> > operation actually needs to be performed on the SDM host.
> > - No matter how many times the operation fails and on how many
> > hosts, you can always resume the operation and choose when to do
> > it.
> > - You can stop an operation at any time and remove the resulting
> > object making a distinction between "stop because the host is
> > overloaded" to "I don't want that image"
> >
> > This means that after calling any operation that creates a new
> > image the user must then call getImageStatus() to check what is
> > the status of the image.
> > The status of the image can be either optimized, degraded, or
> > broken.
> > "Optimized" means that the image is available and you can run VMs
> > of it.
> > "Degraded" means that the image is available and will run VMs but
> > it might be a better way VDSM can represent the underlying data.
> > "Broken" means that the image can't be used at the moment, probably
> > because not all the data has been set up on the volume.
> >
> > Apart from that VDSM will also return the last persisted status
> > information which will conatin
> > hostID - the last host to try and optimize of fix the image
> > stage - X/Y (eg. 1/10) the last persisted stage of the fix.
> > percent_complete - -1 or 0-100, the last persisted completion
> > percentage of the aforementioned stage. -1 means that no progress
> > is available for that operation.
> > last_error - This will only be filled if the operation failed
> > because of something other then IO or a VDSM crash for obvious
> > reasons.
> >               It will usually be set if the task was manually
> >               stopped
> >
> > The user can either be satisfied with that information or as the
> > host specified in host ID if it is still working on that image by
> > checking it's running tasks.
> >
> > checkStorageRepository(self, repositoryId, options={}):
> > A method to go over a storage repository and scan for any existing
> > problems. This includes degraded\broken images and deleted images
> > that have no yet been physically deleted\merged.
> > It returns a list of Fix objects.
> > Fix objects come in 4 types:
> > clean - cleans data, run them to get more space.
> > optimize - run them to optimize a degraded image
> > merge - Merges two images together. Doing this sometimes
> >          makes more images ready optimizing or cleaning.
> >          The reason it is different from optimize is that
> >          unmerged images are considered optimized.
> > mend - mends a broken image
> >
> > The user can read these types and prioritize fixes. Fixes also
> > contain opaque FIX data and they should be sent as received to
> > fixStorageRepository(self, repositoryId, fix, options={}):
> >
> > That will start a fix operation.
> >
> >
> > All major operations automatically start the appropriate "Fix" to
> > bring the created object to an optimize\degraded state (the one
> > that is quicker) unless one of the options is
> > AutoFix=False. This is only useful for repos that might not be able
> > to create volumes on all hosts (SDM) but would like to have the
> > actual IO distributed in the cluster.
> >
> > Other common options is the strategy option:
> > It has currently 2 possible values
> > space and performance - In case VDSM has 2 ways of completing the
> > same operation it will tell it to value one over the other. For
> > example, whether to copy all the data or just create a qcow based
> > of a snapshot.
> > The default is space.
> >
> > You might have also noticed that it is never explicitly specified
> > where to look for existing images. This is done purposefully, VDSM
> > will always look in all connected repositories for existing
> > objects.
> > For very large setups this might be problematic. To mitigate the
> > problem you have these options:
> > participatingRepositories=[repoId, ...] which tell VDSM to narrow
> > the search to just these repositories
> > and
> > imageHints={imgId: repoId} which will force VDSM to look for those
> > image ID just in those repositories and fail if it doesn't find
> > them there.
> > _______________________________________________
> > vdsm-devel mailing list
> > vdsm-devel at lists.fedorahosted.org
> > https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> >
> 
> you are using VirtualDisk and Snapshot for mutable and immutable.
> the terminology of "image" for immutable and "volume" for mutable
> seems
> to be the one used by ec2/openstack/etc. - any thoughts on using
> similar
> terminology?
> (though i think they also differ in forcing all images to be in an
> image
> repo, and all volumes in a volume repo, while we allow to mix them in
> same repo).
> 
We have volumes internally though, I wanted to call them slabs but Ayal and Edu feel strongly about calling them volumes and the current naming system in general.