[Engine-devel] [vdsm] RFC: New Storage API

Fri Dec 7 19:53:41 UTC 2012

----- Original Message -----
> From: "Deepak C Shetty" <deepakcs at linux.vnet.ibm.com>
> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> Cc: "Shu Ming" <shuming at linux.vnet.ibm.com>, "engine-devel" <engine-devel at ovirt.org>, "VDSM Project Development"
> <vdsm-devel at lists.fedorahosted.org>, "Deepak C Shetty" <deepakcs at linux.vnet.ibm.com>
> Sent: Friday, December 7, 2012 12:23:15 AM
> Subject: Re: [vdsm] RFC: New Storage API
> 
> On 12/06/2012 10:22 PM, Saggi Mizrahi wrote:
> >
> > ----- Original Message -----
> >> From: "Shu Ming" <shuming at linux.vnet.ibm.com>
> >> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> >> Cc: "VDSM Project Development"
> >> <vdsm-devel at lists.fedorahosted.org>, "engine-devel"
> >> <engine-devel at ovirt.org>
> >> Sent: Thursday, December 6, 2012 11:02:02 AM
> >> Subject: Re: [vdsm] RFC: New Storage API
> >>
> >> Saggi,
> >>
> >> Thanks for sharing your thoughts. I have some comments below.
> >>
> >>
> >> Saggi Mizrahi:
> >>> I've been throwing a lot of bits out about the new storage API
> >>> and
> >>> I think it's time to talk a bit.
> >>> I will purposefully try to keep implementation details away and
> >>> concentrate on how the API looks and how you use it.
> >>>
> >>> The first major change is in terminology: there is no longer a
> >>> storage domain but a storage repository.
> >>> This change is done because so many things are already called
> >>> domain in the system and this will make things less confusing for
> >>> newcomers with a libvirt background.
> >>>
> >>> Another change is that repositories no longer have a UUID.
> >>> The UUID was only used in the pool members manifest and is no
> >>> longer needed.
> >>>
> >>>
> >>> connectStorageRepository(repoId, repoFormat,
> >>> connectionParameters={}):
> >>> repoId - is a transient name that will be used to refer to the
> >>> connected domain, it is not persisted and doesn't have to be the
> >>> same across the cluster.
> >>> repoFormat - Similar to what used to be type (eg. localfs-1.0,
> >>> nfs-3.4, clvm-1.2).
> >>> connectionParameters - This is format specific and will be used to
> >>> tell VDSM how to connect to the repo.
> >>
> >> Where does repoID come from? I think repoID doesn't exist before
> >> connectStorageRepository() returns.  Isn't repoID a return value of
> >> connectStorageRepository()?
> > No, repoIDs are no longer part of the domain, they are just a
> > transient handle.
> > The user can put whatever it wants there as long as it isn't
> > already taken by another currently connected domain.
> 
> So what happens when the user mistakenly gives a repoID that is
> already in use? There should be something in the return value that
> specifies the error and/or the reason for the error so that the user
> can try again with a new/different repoID?
As I said, connect fails if the repoId is in use ATM.
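
A minimal in-memory sketch of that rule, for illustration only. The class name RepoRegistry, the exception RepoIdInUse and the helper connect_with_retry are made up here and are not VDSM names; only the connect/disconnect signatures follow the proposal above.

```python
class RepoIdInUse(Exception):
    """Hypothetical error: the requested repoId is already connected."""


class RepoRegistry:
    """Toy stand-in for a host's table of connected repositories."""

    def __init__(self):
        self._connected = {}  # repoId -> (repoFormat, connectionParameters)

    def connectStorageRepository(self, repoId, repoFormat,
                                 connectionParameters=None):
        # The handle is transient and caller-chosen; it only has to be
        # unique among the repositories currently connected on this host.
        if repoId in self._connected:
            raise RepoIdInUse(repoId)
        self._connected[repoId] = (repoFormat,
                                   dict(connectionParameters or {}))

    def disconnectStorageRepository(self, repoId):
        # After a disconnect the handle is free to be reused.
        del self._connected[repoId]


def connect_with_retry(registry, candidates, repoFormat, params):
    """Try each candidate handle in turn and return the one that worked."""
    for repoId in candidates:
        try:
            registry.connectStorageRepository(repoId, repoFormat, params)
            return repoId
        except RepoIdInUse:
            continue
    raise RepoIdInUse("all candidate handles are taken")
```

The caller keeps full control over naming: on a clash it simply picks another handle and tries again.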
> 
> >>> disconnectStorageRepository(self, repoId)
> >>>
> >>>
> >>> In the new API there are only images, some images are mutable and
> >>> some are not.
> >>> mutable images are also called VirtualDisks
> >>> immutable images are also called Snapshots
> >>>
> >>> There are no explicit templates, you can create as many images as
> >>> you want from any snapshot.
> >>>
> >>> There are 4 major image operations:
> >>>
> >>>
> >>> createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> >>>                     userData={}, options={}):
> >>>
> >>> targetRepoId - ID of a connected repo where the disk will be
> >>> created
> >>> size - The size of the image you wish to create
> >>> baseSnapshotId - the ID of the snapshot you want to base the new
> >>> virtual disk on
> >>> userData - optional data that will be attached to the new VD,
> >>> could
> >>> be anything that the user desires.
> >>> options - options to modify VDSM's default behavior
> 
> IIUC, I can use options to do storage offloads? For example, can I
> create a LUN that represents this VD on my storage array based on the
> 'options' parameter? Is this the intended way to use 'options'?
No, this has nothing to do with offloads.
If by "offloads" you mean having other VDSM hosts do the heavy lifting, then this is what the autoFix=False option and the fix mechanism are for.
If you are talking about advanced SCSI features (e.g. WRITE SAME), they will be used automatically whenever possible.
In any case, how we manage LUNs (if they are even used) is an implementation detail.
> 
> >>>
> >>> returns the id of the new VD
> >> I think we will also need a function to check if a VirtualDisk
> >> is based on a specific snapshot.
> >> Like: isSnapshotOf(virtualDiskId, baseSnapshotID):
> > No, the design is that volume dependencies are an implementation
> > detail.
> > There is no reason for you to know that an image is physically a
> > snapshot of another.
> > Logical snapshots, template information, and any other information
> > can be set by the user by using the userData field available for
> > every image.
> >>> createSnapshot(targetRepoId, baseVirtualDiskId,
> >>>                  userData={}, options={}):
> >>> targetRepoId - The ID of a connected repo where the new snapshot
> >>> will be created and the original image exists as well.
> >>> size - The size of the image you wish to create
> >>> baseVirtualDisk - the ID of a mutable image (Virtual Disk) you
> >>> want
> >>> to snapshot
> >>> userData - optional data that will be attached to the new
> >>> Snapshot,
> >>> could be anything that the user desires.
> >>> options - options to modify VDSM's default behavior
> >>>
> >>> returns the id of the new Snapshot
> >>>
> >>> copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> >>> options={})
> >>> targetRepoId - The ID of a connected repo where the new image
> >>> will
> >>> be created
> >>> imageId - The image you wish to copy
> >>> baseImageId - if specified, the new image will contain only the
> >>> diff between imageId and baseImageId.
> >>>                 If None the new image will contain all the bits
> >>>                 of imageId. This can be used to copy partial
> >>>                 parts of images for export.
> >>> userData - optional data that will be attached to the new image,
> >>> could be anything that the user desires.
> >>> options - options to modify VDSM's default behavior
> >> Does this function mean that we can copy the image from one
> >> repository
> >> to another repository? Does it cover the semantics of storage
> >> migration,
> >> storage backup, storage incremental backup?
> > Yes, the main purpose is copying to another repo, and you can even
> > do incremental backups.
> > Also the following flow
> > 1. Run a VM using imageA
> > 2. write to disk
> > 3. Stop VM
> > 4. copy imageA to repoB
> > 5. Run a VM using imageA again
> > 6. Write to disk
> > 7. Stop VM
> > 8. Copy imageA again, basing it off imageA_copy1 on repoB, creating a
> > diff on repoB without snapshotting the original image.
> >
> >>> returns the ID of the new image. In case of copying an immutable
> >>> image the ID will be identical to the original image as they
> >>> contain the same data. However the user should not assume that
> >>> and
> >>> always use the value returned from the method.
> >>>
> >>> removeImage(repositoryId, imageId, options={}):
> >>> repositoryId - The ID of a connected repo where the image to
> >>> delete
> >>> resides
> >>> imageId - The id of the image you wish to delete.
> >>>
> >>>
> >>> ----
> >>> getImageStatus(repositoryId, imageId)
> >>> repositoryId - The ID of a connected repo where the image to
> >>> check
> >>> resides
> >>> imageId - The id of the image you wish to check.
> >>>
> >>> All operations return once the operation has been committed to
> >>> disk, NOT when the operation actually completes.
> >>> This is done so that:
> >>> - Operations come to a stable state as quickly as possible.
> >>> - In case where there is an SDM, only a small portion of the
> >>> operation actually needs to be performed on the SDM host.
> >>> - No matter how many times the operation fails and on how many
> >>> hosts, you can always resume the operation and choose when to do
> >>> it.
> >>> - You can stop an operation at any time and remove the resulting
> >>> object, making a distinction between "stop because the host is
> >>> overloaded" and "I don't want that image"
> >>>
> >>> This means that after calling any operation that creates a new
> >>> image the user must then call getImageStatus() to check what is
> >>> the status of the image.
> >>> The status of the image can be either optimized, degraded, or
> >>> broken.
> >>> "Optimized" means that the image is available and you can run VMs
> >>> off it.
> >>> "Degraded" means that the image is available and will run VMs but
> >>> there might be a better way VDSM can represent the underlying data.
> 
> Calling a qcow2 based snapshot degraded (meaning it's degraded in
> perf, as it's space optimized) and calling raw images optimized
> (meaning they are optimized for perf, as they are space inefficient)
> is confusing. Degraded sounds like a bad thing when seen by the
> end-user :) I think there is scope for having some better and less
> confusing terminology here?
No, that is not what I meant. If the user asked for a space driven image then the qcow2 version is the optimized version.
I chose my words carefully when I originally wrote "a better way VDSM can represent the underlying data".
"better" can mean different things in different circumstances.
How VDSM makes it "better" is an implementation detail. It can even change between VDSM versions.
An extreme case is if we ever decide to make QED the preferred format.
That could potentially suddenly mark all images that are internally represented as qcow2 as degraded because VDSM has found a "better way to represent the data".
All the user has to know is that "degraded images" can be used but they are in what VDSM considers to be a sub-optimal state.
It's up to the user to decide whether to optimize now, later or never.
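
A rough sketch of how a management layer might act on the three statuses. The enum, the next_action() helper and its return strings are assumptions made for illustration; the proposal only names the statuses themselves.

```python
from enum import Enum


class ImageStatus(Enum):
    """The three statuses named in the proposal."""
    OPTIMIZED = "optimized"
    DEGRADED = "degraded"
    BROKEN = "broken"


def next_action(status, have_spare_resources):
    """Decide what a caller might do with an image (hypothetical policy).

    Returns one of "use", "use_and_optimize", or "fix_first".
    """
    if status is ImageStatus.BROKEN:
        # A broken image cannot run VMs until a fix completes.
        return "fix_first"
    if status is ImageStatus.DEGRADED and have_spare_resources:
        # Usable now; optimizing is optional and can happen at any time.
        return "use_and_optimize"
    # Optimized, or degraded with nothing to spare: just use it.
    return "use"
```

Note that the caller never inspects the on-disk format. It only reacts to the logical status, so a change in what VDSM considers "better" needs no change in this code.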
> 
> >> What does the "represent" mean here?
> > Anything, but mostly image format (RAW\QCOW2) when the performance
> > strategy has been selected.
> >>> "Broken" means that the image can't be used at the moment,
> >>> probably
> >>> because not all the data has been set up on the volume.
> >>>
> >>> Apart from that VDSM will also return the last persisted status
> >>> information which will contain
> >>> hostID - the last host to try and optimize or fix the image
> >> Any host can optimize the image? No need to be SDM?
> > On anything but lvm based block domains there will not even be an
> > SDM.
> > On SDM based domains we will try as hard as we can to have as many
> > operations executable on any host.
> 
> 1) Can you provide more info on why there is an exception for 'lvm
> based block domain'? It's not coming out clearly.
File based domains are responsible for syncing up object manipulation (creation\deletion).
The backend is responsible for making sure it all works either by having a single writer (NFS) or having its own locking mechanism (gluster).
In our LVM based domains VDSM is responsible for basic object manipulation.
The current design uses an approach where there is a single host responsible for object creation\deletion; it is the SRM\SDM\SPM\S?M.
If we ever find a way to make it fully clustered without a big hit in performance the S?M requirement will be removed from that type of domain.
> 2) Based on the terminology change, domain is now replaced by
> repository, SDM should now be more aptly called SRM (storage repo
> manager) so that we are consistent in the usage of terminology
OK
> 3) Can you provide some example flow / scenario to understand how
> domains with and without SDM work? Especially how the disk based lock
> is taken if there is no SDM?
This is part of the repo API, which is out of the scope of this document, but in general terms:
getStorageRepositoryContraints() will return the NEED_S?M flag which will notify the manager that it needs to elect an S?M for this repository instance.
All createVirtualDisk(), createSnapshot() and copyImage() calls and some fixes will need to be performed on the S?M; removeImage() and some of the fixes could be performed on any host.
As previously noted, sending the option autoFix=False will make VDSM not automatically start a fix operation after the operation has finished persisting the request and creating the object.
This means you could offload the actual data IO to another host.
Because what fixes can run on what host can change between versions, you will just have to try and see if VDSM rejects it on hosts other than the SPM.

There are also plans on abstracting further and making hosts that are not the S?M ask the S?M to create\delete objects for them so you can run any request on any host.
I would very much like to implement that but I am still not sure inter-VDSM communication is efficient enough to commit to that.
This can be easily implemented in the future by a new flag that marks that even though this repo requires an S?M it doesn't matter where you perform the operation.
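
The "try and see" approach could look roughly like the sketch below. Everything here is an assumption for illustration: the exception RejectedOnThisHost, the run_fix() helper and the start_fix callable are not part of the API described in this thread, which does not say how a rejection is reported.

```python
class RejectedOnThisHost(Exception):
    """Hypothetical error raised when a host refuses to run a fix."""


def run_fix(hosts, spm, fix, start_fix):
    """Try non-S?M hosts first and fall back to the S?M.

    hosts     - candidate host names, in order of preference
    spm       - name of the host currently elected as S?M
    fix       - opaque fix data, passed through untouched
    start_fix - callable(host, fix) that starts the fix or raises
                RejectedOnThisHost
    Returns the name of the host that accepted the fix.
    """
    for host in hosts:
        if host == spm:
            continue  # keep the S?M as the last resort
        try:
            start_fix(host, fix)
            return host
        except RejectedOnThisHost:
            continue
    # Nobody else would take it, so the S?M has to do the work.
    start_fix(spm, fix)
    return spm
```

Because the caller does not hard-code which fixes need the S?M, the same loop keeps working if a later VDSM version allows more fixes to run elsewhere.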
> 
> >>> stage - X/Y (eg. 1/10) the last persisted stage of the fix.
> >>> percent_complete - -1 or 0-100, the last persisted completion
> >>> percentage of the aforementioned stage. -1 means that no progress
> >>> is available for that operation.
> >>> last_error - This will only be filled if the operation failed
> >>> because of something other than IO or a VDSM crash for obvious
> >>> reasons.
> >>>                It will usually be set if the task was manually
> >>>                stopped
> >>>
> >>> The user can either be satisfied with that information or ask the
> >>> host specified in hostID if it is still working on that image by
> >>> checking its running tasks.
> >> So we need a function to know what tasks are running on the image
> > getImageStatus()
> >>> checkStorageRepository(self, repositoryId, options={}):
> >>> A method to go over a storage repository and scan for any
> >>> existing
> >>> problems. This includes degraded\broken images and deleted images
> >>> that have not yet been physically deleted\merged.
> >>> It returns a list of Fix objects.
> >>> Fix objects come in 4 types:
> >>> clean - cleans data, run them to get more space.
> >>> optimize - run them to optimize a degraded image
> >>> merge - Merges two images together. Doing this sometimes
> >>>           makes more images ready for optimizing or cleaning.
> >>>           The reason it is different from optimize is that
> >>>           unmerged images are considered optimized.
> >>> mend - mends a broken image
> >>>
> >>> The user can read these types and prioritize fixes. Fixes also
> >>> contain opaque FIX data and they should be sent as received to
> >>> fixStorageRepository(self, repositoryId, fix, options={}):
> >>>
> >>> That will start a fix operation.
> 
> It would be good if you can provide some example or flow of "fix"
> operation.
> When and Why would somebody want to do it ?
You would want to do it to get your images to a better state: broken->(degraded|optimized) or degraded->optimized.
Getting an image to a functional (degraded|optimized) state is obviously beneficial and getting to an optimized state is better by definition.
As an example:
The user called copyImage(), which created all the objects involved, persisted the command and returned.
The snapshot is not yet ready as the actual copy of the data didn't really happen.
Let's say that the method was called with autoFix=False so you now have a broken image (the new image).
You can't use it at the moment. Running a Fix will continue the actual operation, moving it to a better state.
VDSM will choose whatever is more likely to get the image to a functional state faster. This means that the image might find itself in a degraded state.
This is good because if the user wants to run the VM ASAP it can now do it for the price of performance.
If the user has resources to spare (time\IO\hosts) it can choose to perform another fix and get the image to an optimized state.
This ensures that the image is in the best state according to the user's options.
This of course doesn't mean best performance as the user might ask that VDSM value other things over performance.
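
The progression in this example can be modelled as a tiny state machine. This is a toy sketch, not VDSM code: ToyImage and the quickest_is_degraded switch are invented to stand in for VDSM's internal choice of whichever functional state it can reach faster.

```python
# States a fix may move an image to; a fix never makes things worse.
BETTER = {
    "broken": ("degraded", "optimized"),
    "degraded": ("optimized",),
    "optimized": (),
}


class ToyImage:
    """Toy model of an image created with autoFix=False."""

    def __init__(self):
        # The objects exist but the data has not been copied yet.
        self.status = "broken"

    def fix(self, quickest_is_degraded=True):
        """Run one fix and return the new status."""
        reachable = BETTER[self.status]
        if not reachable:
            return self.status  # already optimized, nothing to do
        if quickest_is_degraded and "degraded" in reachable:
            self.status = "degraded"
        else:
            self.status = "optimized"
        return self.status
```

A caller in a hurry runs one fix and starts the VM on whatever functional state comes back; a caller with resources to spare keeps running fixes until the status is "optimized".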

This system means that you can allocate and deallocate resources at will with minimal consequences.
In some cases advanced users can use this system in conjunction with live merge \ live migration to actually "optimize" while it's running.

This system means that VDSM can change its policy and behavior between versions without that breaking the interface.
We found a way to skip the degraded state for some use case, great.
We found a way to make some use case finish faster but in degraded mode, awesome.
We found a way to make some operations run on all hosts, fantastic.
The user can always handle it because it never assumes the physical result of any operation just the logical.

> 
> Does 'Fix' here mean that I move from raw to qcow2 format or
> vice-versa, or is there more to it?
Fix could literally mean anything: format conversion, compression\decompression, cleanups, merges.
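
Since the Fix payload is opaque, the only thing a caller can sort on is the type. A small sketch of that prioritization follows; the dict shape with "type" and "data" keys and the particular priority order are assumptions, as the real shape of a Fix object is not defined in this thread.

```python
# One possible priority order; a manager is free to pick its own.
PRIORITY = {"mend": 0, "merge": 1, "optimize": 2, "clean": 3}


def order_fixes(fixes):
    """Sort Fix objects by type without looking inside the opaque data.

    Each fix is assumed to be a dict with a "type" key and an opaque
    "data" key that is handed back to VDSM exactly as received.
    """
    # sorted() is stable, so fixes of the same type keep the order in
    # which they were returned.
    return sorted(fixes, key=lambda fix: PRIORITY[fix["type"]])
```

Each element of the returned list would then be passed, unchanged, to fixStorageRepository().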
> 
> >>>
> >>>
> >>> All major operations automatically start the appropriate "Fix" to
> >>> bring the created object to an optimized\degraded state (the one
> >>> that is quicker) unless one of the options is
> >>> AutoFix=False. This is only useful for repos that might not be
> >>> able
> >>> to create volumes on all hosts (SDM) but would like to have the
> >>> actual IO distributed in the cluster.
> >>>
> >>> Another common option is the strategy option:
> >>> It currently has 2 possible values,
> >>> space and performance - In case VDSM has 2 ways of completing the
> >>> same operation it will tell it to value one over the other. For
> >>> example, whether to copy all the data or just create a qcow based
> >>> off a snapshot.
> >>> The default is space.
> >>>
> >>> You might have also noticed that it is never explicitly specified
> >>> where to look for existing images. This is done purposefully,
> >>> VDSM
> >>> will always look in all connected repositories for existing
> >>> objects.
> >>> For very large setups this might be problematic. To mitigate the
> >>> problem you have these options:
> >>> participatingRepositories=[repoId, ...] which tells VDSM to narrow
> >>> the search to just these repositories
> >>> and
> >>> imageHints={imgId: repoId} which will force VDSM to look for
> >>> those image IDs just in those repositories and fail if it doesn't
> >>> find them there.
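
The lookup rules for these two options can be sketched as follows. This is a toy model under stated assumptions: locate_image(), ImageNotFound and the connected mapping are invented for illustration, and only the option names participatingRepositories and imageHints come from the text above.

```python
class ImageNotFound(Exception):
    """Hypothetical error for an image that cannot be located."""


def locate_image(imageId, connected, participatingRepositories=None,
                 imageHints=None):
    """Return the repoId holding imageId.

    connected maps repoId -> set of image IDs and stands in for the
    repositories currently connected on the host.
    """
    hints = imageHints or {}
    if imageId in hints:
        # A hint is binding: look only there and fail otherwise.
        repoId = hints[imageId]
        if imageId in connected.get(repoId, ()):
            return repoId
        raise ImageNotFound(imageId)
    # Without a hint, search the narrowed list or everything connected.
    if participatingRepositories is None:
        candidates = list(connected)
    else:
        candidates = participatingRepositories
    for repoId in candidates:
        if imageId in connected.get(repoId, ()):
            return repoId
    raise ImageNotFound(imageId)
```

The difference between the two options shows up on a miss: a narrowed search fails only after every listed repository has been checked, while a hint fails immediately if the image is not where the caller said it would be.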
> >>> _______________________________________________
> >>> vdsm-devel mailing list
> >>> vdsm-devel at lists.fedorahosted.org
> >>> https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> >>
> >> --
> >> ---
> >> 舒明 Shu Ming
> >> Open Virtualization Engineerning; CSTL, IBM Corp.
> >> Tel: 86-10-82451626  Tieline: 9051626 E-mail: shuming at cn.ibm.com
> >> or
> >> shuming at linux.vnet.ibm.com
> >> Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian
> >> District, Beijing 100193, PRC
> >>
> >>
> >>
> 
>