Thanks for sharing this. It's nice to have something a little more concrete to
think about. Just a few comments and questions inline to get some discussion
flowing.
On Tue, Dec 04, 2012 at 04:52:40PM -0500, Saggi Mizrahi wrote:
I've been throwing a lot of bits out about the new storage API
and I think it's time to talk a bit.
I will purposefully try to keep implementation details out of this and concentrate on how
the API looks and how you use it.
The first major change is in terminology: there is no longer a storage domain but a storage
repository.
This change is done because so many things are already called domain in the system and
this will make things less confusing for newcomers with a libvirt background.
One other change is that repositories no longer have a UUID.
The UUID was only used in the pool members manifest and is no longer needed.
connectStorageRepository(repoId, repoFormat, connectionParameters={}):
We should probably add an options/flags parameter for extension of all new
APIs.
repoId - a transient name that will be used to refer to the
connected repository; it is not persisted and doesn't have to be the same across the
cluster.
repoFormat - Similar to what used to be type (e.g. localfs-1.0, nfs-3.4, clvm-1.2).
connectionParameters - This is format specific and will be used to tell VDSM how to connect
to the repo.
disconnectStorageRepository(self, repoId):
I assume 'self' is a mistake here. Just want to clarify given all of the recent
talk about instances vs. namespaces.
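To make the repoId semantics above concrete, here is a minimal Python sketch. It is not VDSM code: RepoRegistry, isConnected and the "export" connection parameter are hypothetical names made up for illustration, and only the two method names and their arguments come from the proposal. It shows the same repository connected under different transient names on two hosts.

```python
class RepoRegistry:
    """Hypothetical in-memory stand-in for one host's table of connected repos."""

    def __init__(self):
        self._connected = {}

    def connectStorageRepository(self, repoId, repoFormat, connectionParameters=None):
        if repoId in self._connected:
            raise ValueError("repoId already in use: %r" % (repoId,))
        # Nothing is written to storage: the name lives only in this host's memory.
        self._connected[repoId] = (repoFormat, dict(connectionParameters or {}))

    def disconnectStorageRepository(self, repoId):
        # Raises KeyError if the repo was never connected on this host.
        del self._connected[repoId]

    def isConnected(self, repoId):
        return repoId in self._connected


# Assumed, format-specific parameters; the real keys are not defined in the proposal.
params = {"export": "server.example.com:/images"}

host_a = RepoRegistry()
host_b = RepoRegistry()
host_a.connectStorageRepository("images", "nfs-3.4", params)
host_b.connectStorageRepository("repo-7", "nfs-3.4", params)  # same repo, different name

host_a.disconnectStorageRepository("images")
```

Because the name is never persisted, disconnecting on one host has no effect on the name another host chose.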
In the new API there are only images, some images are mutable and
some are not.
mutable images are also called VirtualDisks
immutable images are also called Snapshots
By mutable you mean writable, right? Or does the word mutable imply more than
that?
There are no explicit templates, you can create as many images as you
want from any snapshot.
There are 4 major image operations:
createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
userData={}, options={}):
Is userdata a 'StringMap'?
I will reopen the argument about an options dict vs a flags parameter. I oppose
the dict for expansion because I think it causes APIs to devolve into a mess
where lots of arbitrary and not well thought out overrides are packed into the
dict over time. A flags argument (in json and python it can be an enum array)
limits us to really switching flags on and off instead of passing arbitrary
data.
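As a rough illustration of the "enum array" idea, here is a small Python sketch. The flag names and the parse_flags helper are hypothetical (the proposal does not define any flags); the point is only that an enum-backed list can switch behaviour on or off but cannot carry arbitrary payloads the way a dict can.

```python
import enum


class DiskFlag(enum.Enum):
    # Hypothetical flag names; the proposal does not define any.
    NO_AUTO_FIX = "noAutoFix"
    PREFER_PERFORMANCE = "preferPerformance"


def parse_flags(raw):
    """Convert a JSON-style list of strings into a set of known flags.

    Enum lookup by value raises ValueError for anything that is not a
    declared flag, so undeclared overrides are rejected up front.
    """
    return frozenset(DiskFlag(name) for name in raw)


flags = parse_flags(["noAutoFix"])

try:
    parse_flags(["noAutoFix", "someUndocumentedOverride"])
    rejected = False
except ValueError:
    rejected = True
```

With this shape, adding a new behaviour means declaring a new enum member, which leaves a visible trace in the API definition.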
targetRepoId - ID of a connected repo where the disk will be created
size - The size of the image you wish to create
baseSnapshotId - the ID of the snapshot you want to base the new virtual disk on
userData - optional data that will be attached to the new VD, could be anything that the
user desires.
options - options to modify VDSM's default behavior
returns the id of the new VD
createSnapshot(targetRepoId, baseVirtualDiskId,
userData={}, options={}):
targetRepoId - The ID of a connected repo where the new snapshot will be created and the
original image exists as well.
size - The size of the image you wish to create
Why is this needed? Doesn't the size of a snapshot have to be equal to its
base image?
baseVirtualDisk - the ID of a mutable image (Virtual Disk) you want
to snapshot
Can you snapshot a snapshot? In that case, this parameter should be called
baseImage.
userData - optional data that will be attached to the new Snapshot,
could be anything that the user desires.
options - options to modify VDSM's default behavior
returns the id of the new Snapshot
copyImage(targetRepoId, imageId, baseImageId=None, userData={}, options={})
targetRepoId - The ID of a connected repo where the new image will be created
imageId - The image you wish to copy
Do we locate the sourceRepoId automatically based on the imageId?
baseImageId - if specified, the new image will contain only the diff
between imageId and baseImageId.
If None, the new image will contain all the bits of imageId. This can be
used to copy partial images for export.
userData - optional data that will be attached to the new image, could be anything that
the user desires.
options - options to modify VDSM's default behavior
returns the ID of the new image. In case of copying an immutable image the ID will be
identical to the original image as they contain the same data. However, the user should not
assume that and should always use the value returned from the method.
removeImage(repositoryId, imageId, options={}):
repositoryId - The ID of a connected repo where the image to delete resides
imageId - The id of the image you wish to delete.
----
getImageStatus(repositoryId, imageId)
repositoryId - The ID of a connected repo where the image to check resides
imageId - The id of the image you wish to check.
What is in this return value? Is it a single enum indicating whether the image
is locked (being copied, etc.) or a list of detailed information (like
Volume.getInfo)? (I see some more info below...)
All operations return once the operation has been committed to disk,
NOT when the operation actually completes.
This is done so that:
- Operations come to a stable state as quickly as possible.
- In cases where there is an SDM, only a small portion of the operation actually needs to be
performed on the SDM host.
- No matter how many times the operation fails and on how many hosts, you can always
resume the operation and choose when to do it.
- You can stop an operation at any time and remove the resulting object, making a
distinction between "stop because the host is overloaded" and "I don't
want that image".
This means that after calling any operation that creates a new image, the user must then
call getImageStatus() to check the status of the image.
The status of the image can be either optimized, degraded, or broken.
"Optimized" means that the image is available and you can run VMs off it.
"Degraded" means that the image is available and will run VMs but there might be a
better way VDSM can represent the underlying data.
"Broken" means that the image can't be used at the moment, probably because
not all the data has been set up on the volume.
Apart from that, VDSM will also return the last persisted status information, which will
contain:
hostID - the last host to try to optimize or fix the image
stage - X/Y (e.g. 1/10) the last persisted stage of the fix.
Do you have some examples of what the stages would be? I think these should be
defined in enums so that the user can check on what the individual stages mean.
What happens when the low level implementation of an operation changes? The
meaning of the stages will change completely.
percent_complete - -1 or 0-100, the last persisted completion
percentage of the aforementioned stage. -1 means that no progress is available for that
operation.
last_error - This will only be filled if the operation failed because
of something other than IO or a VDSM crash, for obvious reasons.
It will usually be set if the task was manually stopped.
The user can either be satisfied with that information or ask the host specified in
hostID whether it is still working on that image by checking its running tasks.
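One possible shape for the getImageStatus() return value, sketched in Python under stated assumptions: the three states and the field names (hostID, stage, percent_complete, last_error) are taken from the description above, while ImageState, ImageStatus, is_usable and describe_progress are hypothetical names invented for this example.

```python
import enum
from dataclasses import dataclass
from typing import Optional


class ImageState(enum.Enum):
    OPTIMIZED = "optimized"
    DEGRADED = "degraded"
    BROKEN = "broken"


@dataclass
class ImageStatus:
    state: ImageState
    hostID: Optional[str] = None      # last host that tried to optimize or fix the image
    stage: Optional[str] = None       # "X/Y", e.g. "1/10"
    percent_complete: int = -1        # -1 means no progress information is available
    last_error: Optional[str] = None


def is_usable(status):
    """Optimized and degraded images can both run VMs; broken ones cannot."""
    return status.state in (ImageState.OPTIMIZED, ImageState.DEGRADED)


def describe_progress(status):
    if status.percent_complete == -1:
        return "stage %s, progress unknown" % status.stage
    return "stage %s, %d%% done" % (status.stage, status.percent_complete)


copying = ImageStatus(ImageState.BROKEN, hostID="host-3", stage="1/10", percent_complete=40)
ready = ImageStatus(ImageState.DEGRADED)
```

A caller would poll until is_usable() is true and, if progress stalls, ask the host named in hostID about its running tasks.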
checkStorageRepository(self, repositoryId, options={}):
A method to go over a storage repository and scan for any existing problems. This
includes degraded\broken images and deleted images that have not yet been physically
deleted\merged.
It returns a list of Fix objects.
Fix objects come in 4 types:
clean - cleans data, run them to get more space.
optimize - run them to optimize a degraded image
What is an example of a degraded image?
merge - Merges two images together. Doing this sometimes
makes more images ready for optimizing or cleaning.
The reason it is different from optimize is that
unmerged images are considered optimized.
mend - mends a broken image
What does this mean?
The user can read these types and prioritize fixes. Fixes also
contain opaque FIX data and they should be sent as received to
fixStorageRepository(self, repositoryId, fix, options={}):
That will start a fix operation.
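A sketch of how a client might read the fix types and order the work, in Python. The Fix tuple, the PRIORITY ordering and the prioritize helper are assumptions made for illustration; the proposal only says that fixes have one of the four types and carry opaque data that must be passed back unchanged.

```python
from collections import namedtuple

# Hypothetical client-side view of a Fix: its type plus the opaque data blob.
Fix = namedtuple("Fix", ["type", "data"])

# Assumed policy: repair broken images first, reclaim space last.
PRIORITY = {"mend": 0, "optimize": 1, "merge": 2, "clean": 3}


def prioritize(fixes):
    """Order fixes by the assumed policy; sorted() is stable, so fixes of
    the same type keep the order in which they were returned."""
    return sorted(fixes, key=lambda fix: PRIORITY[fix.type])


# Stand-in for the list checkStorageRepository() would return.
reported = [
    Fix("clean", b"opaque-1"),
    Fix("mend", b"opaque-2"),
    Fix("merge", b"opaque-3"),
    Fix("optimize", b"opaque-4"),
    Fix("mend", b"opaque-5"),
]

ordered = prioritize(reported)
```

Each entry of ordered would then be handed to fixStorageRepository() as received, with the data field untouched.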
Could we have an automatic fix mode where vdsm just does the right thing (for
most things)?
All major operations automatically start the appropriate
"Fix" to bring the created object to an optimized\degraded state (the one that is
quicker) unless one of the options is
AutoFix=False. This is only useful for repos that might not be able to create volumes on
all hosts (SDM) but would like to have the actual IO distributed in the cluster.
Another common option is the strategy option.
It currently has 2 possible values,
space and performance. In case VDSM has 2 ways of completing the same operation, this will
tell it to value one over the other. For example, whether to copy all the data or just
create a qcow based off a snapshot.
The default is space.
I like this a lot.
You might have also noticed that it is never explicitly specified
where to look for existing images. This is done purposefully, VDSM will always look in all
connected repositories for existing objects.
For very large setups this might be problematic. To mitigate the problem you have these
options:
participatingRepositories=[repoId, ...] which tells VDSM to narrow the search to just
these repositories
and
imageHints={imgId: repoId} which will force VDSM to look for those image IDs just in those
repositories and fail if it doesn't find them there.
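The lookup rules just described could be sketched like this in Python. It is an illustration only: locate_image and the connected mapping (repoId to the set of image IDs it holds) are hypothetical stand-ins, not part of the proposed API.

```python
def locate_image(imageId, connected, participatingRepositories=None, imageHints=None):
    """Return the repoId that holds imageId, or raise LookupError.

    connected is a hypothetical stand-in mapping repoId -> set of image IDs.
    """
    imageHints = imageHints or {}

    # A hint is binding: look only in the hinted repo and fail otherwise.
    if imageId in imageHints:
        repoId = imageHints[imageId]
        if imageId in connected.get(repoId, ()):
            return repoId
        raise LookupError("%s not found in hinted repo %s" % (imageId, repoId))

    # Otherwise search the narrowed list, or every connected repo by default.
    if participatingRepositories is None:
        candidates = list(connected)
    else:
        candidates = participatingRepositories
    for repoId in candidates:
        if imageId in connected.get(repoId, ()):
            return repoId
    raise LookupError("%s not found" % imageId)


connected = {"repoA": {"img1"}, "repoB": {"img2"}, "repoC": {"img3"}}

found_default = locate_image("img2", connected)
found_hinted = locate_image("img3", connected, imageHints={"img3": "repoC"})

try:
    locate_image("img2", connected, participatingRepositories=["repoA", "repoC"])
    narrowed_miss = False
except LookupError:
    narrowed_miss = True
```

Note the different failure modes: a narrowed search simply does not see repos outside the list, while a wrong hint fails even if the image exists elsewhere.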
I would like to have a better way of specifying these optional parameters
without burying them in an options structure. I will think a little more about
this. Strategy can just be two optional flags in a 'flags' argument. For the
participatingRepositories and imageHints options, I think we need to use real
parameters.
--
Adam Litke <agl(a)us.ibm.com>
IBM Linux Technology Center