[Engine-devel] [vdsm] RFC: New Storage API
Saggi Mizrahi
smizrahi at redhat.com
Tue Dec 11 19:02:10 UTC 2012
----- Original Message -----
> From: "Shu Ming" <shuming at linux.vnet.ibm.com>
> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> Cc: "Adam Litke" <agl at us.ibm.com>, "engine-devel" <engine-devel at ovirt.org>, "VDSM Project Development"
> <vdsm-devel at lists.fedorahosted.org>
> Sent: Monday, December 10, 2012 10:33:23 PM
> Subject: Re: [vdsm] RFC: New Storage API
>
> 2012-12-11 4:36, Saggi Mizrahi:
> >
> > ----- Original Message -----
> >> From: "Adam Litke" <agl at us.ibm.com>
> >> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> >> Cc: "Shu Ming" <shuming at linux.vnet.ibm.com>, "engine-devel"
> >> <engine-devel at ovirt.org>, "VDSM Project Development"
> >> <vdsm-devel at lists.fedorahosted.org>
> >> Sent: Monday, December 10, 2012 1:39:51 PM
> >> Subject: Re: [vdsm] RFC: New Storage API
> >>
> >> On Thu, Dec 06, 2012 at 11:52:01AM -0500, Saggi Mizrahi wrote:
> >>>
> >>> ----- Original Message -----
> >>>> From: "Shu Ming" <shuming at linux.vnet.ibm.com>
> >>>> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> >>>> Cc: "VDSM Project Development"
> >>>> <vdsm-devel at lists.fedorahosted.org>, "engine-devel"
> >>>> <engine-devel at ovirt.org>
> >>>> Sent: Thursday, December 6, 2012 11:02:02 AM
> >>>> Subject: Re: [vdsm] RFC: New Storage API
> >>>>
> >>>> Saggi,
> >>>>
> >>>> Thanks for sharing your thoughts; I have some comments below.
> >>>>
> >>>>
> >>>> Saggi Mizrahi:
> >>>>> I've been throwing a lot of bits out about the new storage API
> >>>>> and
> >>>>> I think it's time to talk a bit.
> >>>>> I will purposefully try to keep implementation details out and
> >>>>> concentrate on how the API looks and how you use it.
> >>>>>
> >>>>> The first major change is in terminology: there is no longer a
> >>>>> storage domain but a storage repository.
> >>>>> This change was made because so many things are already called
> >>>>> "domain" in the system, and this will make things less confusing
> >>>>> for newcomers with a libvirt background.
> >>>>>
> >>>>> One other change is that repositories no longer have a UUID.
> >>>>> The UUID was only used in the pool members manifest and is no
> >>>>> longer needed.
> >>>>>
> >>>>>
> >>>>> connectStorageRepository(repoId, repoFormat,
> >>>>> connectionParameters={}):
> >>>>> repoId - is a transient name that will be used to refer to the
> >>>>> connected domain, it is not persisted and doesn't have to be
> >>>>> the
> >>>>> same across the cluster.
> >>>>> repoFormat - Similar to what used to be type (eg. localfs-1.0,
> >>>>> nfs-3.4, clvm-1.2).
> >>>>> connectionParameters - This is format specific and will be used to
> >>>>> tell VDSM how to connect to the repo.
> >>>>
> >>>> Where does repoID come from? I think repoID doesn't exist before
> >>>> connectStorageRepository() returns. Isn't repoID a return value
> >>>> of connectStorageRepository()?
> >>> No, repoIDs are no longer part of the domain, they are just a
> >>> transient handle.
> >>> The user can put whatever it wants there as long as it isn't
> >>> already taken by another currently connected domain.
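To illustrate the handle semantics, here is a minimal sketch. It is not VDSM code: `RepoRegistry` and the connection parameter keys (`export`, `path`) are made up for the example. The only rule it models is that the caller picks the repoId and it must not clash with a currently connected repo.

```python
class RepoRegistry:
    """Hypothetical in-memory stand-in for a host's table of connected repos."""

    def __init__(self):
        self._connected = {}  # repoId -> (repoFormat, connectionParameters)

    def connect(self, repo_id, repo_format, connection_parameters=None):
        # The caller picks the handle; it only has to be unused right now.
        if repo_id in self._connected:
            raise ValueError("repoId %r is already in use" % repo_id)
        self._connected[repo_id] = (repo_format,
                                    dict(connection_parameters or {}))

    def disconnect(self, repo_id):
        # Nothing is persisted, so the handle is free again after this.
        del self._connected[repo_id]

    def is_connected(self, repo_id):
        return repo_id in self._connected


registry = RepoRegistry()
registry.connect("backup", "nfs-3.4", {"export": "server:/export"})
try:
    registry.connect("backup", "localfs-1.0", {"path": "/tmp/repo"})
    clash = False
except ValueError:
    clash = True  # same handle, still connected, so it is rejected

registry.disconnect("backup")
# After disconnecting, the same name can point at a different repo.
registry.connect("backup", "localfs-1.0", {"path": "/tmp/repo"})
```

Another host could connect the same repo under a completely different repoId, since the name is never written to storage.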
> >>>>> disconnectStorageRepository(self, repoId)
> >>>>>
> >>>>>
> >>>>> In the new API there are only images, some images are mutable
> >>>>> and
> >>>>> some are not.
> >>>>> mutable images are also called VirtualDisks
> >>>>> immutable images are also called Snapshots
> >>>>>
> >>>>> There are no explicit templates, you can create as many images
> >>>>> as
> >>>>> you want from any snapshot.
> >>>>>
> >>>>> There are 4 major image operations:
> >>>>>
> >>>>>
> >>>>> createVirtualDisk(targetRepoId, size, baseSnapshotId=None,
> >>>>> userData={}, options={}):
> >>>>>
> >>>>> targetRepoId - ID of a connected repo where the disk will be
> >>>>> created
> >>>>> size - The size of the image you wish to create
> >>>>> baseSnapshotId - the ID of the snapshot you want to base the
> >>>>> new virtual disk on
> >>>>> userData - optional data that will be attached to the new VD,
> >>>>> could
> >>>>> be anything that the user desires.
> >>>>> options - options to modify VDSMs default behavior
> >>>>>
> >>>>> returns the id of the new VD
> >>>> I think we will also need a function to check if a VirtualDisk
> >>>> is based on a specific snapshot.
> >>>> Like: isSnapshotOf(virtualDiskId, baseSnapshotID):
> >>> No, the design is that volume dependencies are an implementation
> >>> detail.
> >>> There is no reason for you to know that an image is physically a
> >>> snapshot of another.
> >>> Logical snapshots, template information, and any other
> >>> information
> >>> can be set by the user by using the userData field available for
> >>> every image.
> >> Statements like this make me start to worry about your userData
> >> concept. It's a
> >> sign of a bad API if the user needs to invent a custom metadata
> >> scheme for
> >> itself. This reminds me of the abomination that is the 'custom'
> >> property in the
> >> vm definition today.
> > In one sentence: If VDSM doesn't care about it, VDSM doesn't manage
> > it.
> >
> > userData being a "void*" is quite common and I don't understand why
> > you would think it's a sign of a bad API.
> > Furthermore, giving the user the choice of how to represent its
> > own metadata and what fields it wants to keep seems reasonable to
> > me.
> > Especially given the fact that VDSM never reads it.
> >
> > The reason we are pulling away from the current system of VDSM
> > understanding the extra data is that it ties that data to
> > VDSM's on-disk format.
> > VDSM's on-disk format has to be very stable because of clusters with
> > multiple VDSM versions.
> > Furthermore, since this is actually manager data it has to be tied
> > to the manager's backward compatibility lifetime as well.
> > Having it be opaque to VDSM ties it to only one, simpler, support
> > lifetime instead of two.
>
> Making userData opaque gives flexibility to the management
> applications. To me, opaque userData can be of at least two types. The
> first is userData for runtime only. The second is userData
> expected to be persisted to the metadata disk. For the first type,
> the management applications can store their own data structures like
> temporary task states, VDSM query caches, etc. After the VDSM host is
> fenced, the userData goes away. For the second type, different
> management applications can have different userData formats, and
> persisting them to the VDSM metadata disk will very likely make them
> corrupt each other. You may argue that all the management
> applications should reach an agreement before they use it as the
> second type. But in practice, that is pretty hard to maintain.
We are talking only about the second type.
And yes, if the clients don't all agree on the format, and don't check whether they are modifying an image that carries someone else's userData, you will have problems.
This is the same problem you have if any program starts randomly messing with another program's persistent data.
Users will have to agree on a format anyway if they want to use the same storage domains, because they'll be sharing the same data.
The thing is that if you are making anything other than a manager you should develop against the engine API.
If you are writing your own manager and are also sharing the host with another manager you should really not mess with the other manager's repos.
That being said, agreeing on a set of fields and a format to get some sort of interoperability between managers is a good idea IMO.
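
As a rough illustration of "check before you modify", here is a minimal sketch of a self-describing userData blob. Everything in it is an assumption for the example: the JSON envelope, the `example-manager/1` tag and the helper names are not a format anyone has agreed on, and VDSM itself would never parse any of this.

```python
import json

FORMAT_TAG = "example-manager/1"  # hypothetical tag agreed on between managers


def encode_user_data(fields):
    """Wrap manager fields in an envelope that names its own format."""
    return json.dumps({"format": FORMAT_TAG, "fields": fields},
                      sort_keys=True)


def decode_user_data(blob):
    """Return the fields, or None if the blob belongs to someone else."""
    try:
        envelope = json.loads(blob)
    except ValueError:
        return None
    if not isinstance(envelope, dict) or envelope.get("format") != FORMAT_TAG:
        return None
    return envelope.get("fields")


def update_user_data(blob, **changes):
    """Refuse to touch userData this manager did not write."""
    fields = decode_user_data(blob)
    if fields is None:
        raise ValueError("userData is in a foreign format, not modifying it")
    fields.update(changes)
    return encode_user_data(fields)


mine = encode_user_data({"name": "web01-disk", "owner": "alice"})
renamed = update_user_data(mine, name="web02-disk")
```

A manager that finds a blob it cannot decode leaves the image alone instead of overwriting the other manager's data.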
>
>
> >
> > I guess you are implying that it will make it problematic for
> > multiple users to read userData left by another user because the
> > formats might not be compatible.
> > The solution is that all parties interested in using VDSM storage
> > agree on format, and common fields, and supportability, and all
> > the other things that choosing and supporting *something* entails.
> > This is, however, out of the scope of VDSM. When the time comes I
> > think how the userData blob is actually parsed and what fields it
> > keeps should be discussed on ovirt-devel or engine-devel.
> >
> > The crux of the issue is that VDSM manages only what it cares about
> > and the user can't modify directly.
> > This is done because everything we expose we commit to.
> > If you want any information persisted like:
> > - Human readable name (in whatever encoding)
> > - Is this a template or a snapshot
> > - What user owns this image
> >
> > You can just put it in the userData.
> > VDSM is not going to impose what encoding you use.
> > It's not going to decide if you represent your users as IDs or
> > names or ldap queries or Public Keys.
> > It's not going to decide if you have explicit templates or not.
> > It's not going to decide if you care what the logical image
> > chain is.
> > It's not going to decide anything that is out of its scope.
> > No format is future proof, and no selection of fields will be good for
> > every situation.
> > I'd much rather it be someone else's problem when any of them need
> > to be changed.
> > They have so far been VDSM's problem and it has been hell to
> > maintain.
> >
> >>>>> createSnapshot(targetRepoId, baseVirtualDiskId,
> >>>>> userData={}, options={}):
> >>>>> targetRepoId - The ID of a connected repo where the new
> >>>>> snapshot will be created; the original image must exist
> >>>>> there as well.
> >>>>> size - The size of the image you wish to create
> >>>>> baseVirtualDisk - the ID of a mutable image (Virtual Disk) you
> >>>>> want
> >>>>> to snapshot
> >>>>> userData - optional data that will be attached to the new
> >>>>> Snapshot,
> >>>>> could be anything that the user desires.
> >>>>> options - options to modify VDSMs default behavior
> >>>>>
> >>>>> returns the id of the new Snapshot
> >>>>>
> >>>>> copyImage(targetRepoId, imageId, baseImageId=None, userData={},
> >>>>> options={})
> >>>>> targetRepoId - The ID of a connected repo where the new image
> >>>>> will
> >>>>> be created
> >>>>> imageId - The image you wish to copy
> >>>>> baseImageId - if specified, the new image will contain only the
> >>>>> diff between imageId and baseImageId.
> >>>>> If None the new image will contain all the bits
> >>>>> of imageId. This can be used to copy parts
> >>>>> of images for export.
> >>>>> userData - optional data that will be attached to the new
> >>>>> image,
> >>>>> could be anything that the user desires.
> >>>>> options - options to modify VDSMs default behavior
> >>>> Does this function mean that we can copy the image from one
> >>>> repository
> >>>> to another repository? Does it cover the semantics of storage
> >>>> migration,
> >>>> storage backup, storage incremental backup?
> >>> Yes, the main purpose is copying to another repo, and you can
> >>> even do incremental backups.
> >>> Also the following flow:
> >>> 1. Run a VM using imageA
> >>> 2. write to disk
> >>> 3. Stop VM
> >>> 4. copy imageA to repoB
> >>> 5. Run a VM using imageA again
> >>> 6. Write to disk
> >>> 7. Stop VM
> >>> 8. Copy imageA again, basing it off imageA_copy1 on repoB, creating
> >>> a diff on repoB without snapshotting the original image.
> >>>
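The flow above can be sketched with a toy model. This is only an illustration under stated assumptions: images are modelled as `{block: data}` dicts, `copy_image` is a made-up stand-in for the proposed copyImage(), and the base is assumed to be a full copy. Removed blocks and real image formats are ignored.

```python
import itertools

_ids = itertools.count(1)
repos = {"repoA": {}, "repoB": {}}  # repoId -> {imageId: {block: data}}


def find_image(image_id):
    # Like the proposal, look in every connected repo.
    for images in repos.values():
        if image_id in images:
            return images[image_id]
    raise KeyError(image_id)


def copy_image(target_repo_id, image_id, base_image_id=None):
    """Toy copyImage(): full copy, or only blocks that differ from the base."""
    source = find_image(image_id)
    if base_image_id is None:
        blocks = dict(source)
    else:
        base = find_image(base_image_id)
        blocks = {k: v for k, v in source.items() if base.get(k) != v}
    new_id = "img-%d" % next(_ids)
    repos[target_repo_id][new_id] = blocks
    # Callers should always use the returned ID, never guess it.
    return new_id


repos["repoA"]["imageA"] = {0: "boot", 1: "data-v1"}
copy1 = copy_image("repoB", "imageA")         # step 4: full copy
repos["repoA"]["imageA"][1] = "data-v2"       # steps 5-7: the VM wrote again
repos["repoA"]["imageA"][2] = "logs"
copy2 = copy_image("repoB", "imageA", copy1)  # step 8: diff only
```

In this model the second copy holds only the two blocks that changed, and imageA itself was never snapshotted.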
> >>>>> return the Id of the new image. In case of copying an immutable
> >>>>> image the ID will be identical to the original image as they
> >>>>> contain the same data. However the user should not assume that
> >>>>> and
> >>>>> always use the value returned from the method.
> >>>>>
> >>>>> removeImage(repositoryId, imageId, options={}):
> >>>>> repositoryId - The ID of a connected repo where the image to
> >>>>> delete
> >>>>> resides
> >>>>> imageId - The id of the image you wish to delete.
> >>>>>
> >>>>>
> >>>>> ----
> >>>>> getImageStatus(repositoryId, imageId)
> >>>>> repositoryId - The ID of a connected repo where the image to
> >>>>> check
> >>>>> resides
> >>>>> imageId - The id of the image you wish to check.
> >>>>>
> >>>>> All operations return once the operation has been committed to
> >>>>> disk, NOT when the operation actually completes.
> >>>>> This is done so that:
> >>>>> - operations come to a stable state as quickly as possible.
> >>>>> - In cases where there is an SDM, only a small portion of the
> >>>>> operation actually needs to be performed on the SDM host.
> >>>>> - No matter how many times the operation fails and on how many
> >>>>> hosts, you can always resume the operation and choose when to
> >>>>> do it.
> >>>>> - You can stop an operation at any time and remove the
> >>>>> resulting object, making a distinction between "stop because
> >>>>> the host is overloaded" and "I don't want that image"
> >>>>>
> >>>>> This means that after calling any operation that creates a new
> >>>>> image the user must then call getImageStatus() to check what is
> >>>>> the status of the image.
> >>>>> The status of the image can be either optimized, degraded, or
> >>>>> broken.
> >>>>> "Optimized" means that the image is available and you can run
> >>>>> VMs off it.
> >>>>> "Degraded" means that the image is available and will run VMs
> >>>>> but there might be a better way VDSM can represent the
> >>>>> underlying data.
> >>>> What does the "represent" mean here?
> >>> Anything, but mostly image format (RAW\QCOW2) when the performance
> >>> strategy has been selected.
> >>>>> "Broken" means that the image can't be used at the moment,
> >>>>> probably
> >>>>> because not all the data has been set up on the volume.
> >>>>>
> >>>>> Apart from that VDSM will also return the last persisted status
> >>>>> information which will contain
> >>>>> hostID - the last host to try to optimize or fix the image
> >>>> Any host can optimize the image? No need to be SDM?
> >>> On anything but lvm based block domains there will not even be an
> >>> SDM.
> >>> On SDM based domains we will try as hard as we can to have as
> >>> many
> >>> operations executable on any host.
> >>>>> stage - X/Y (eg. 1/10) the last persisted stage of the fix.
> >>>>> percent_complete - -1 or 0-100, the last persisted completion
> >>>>> percentage of the aforementioned stage. -1 means that no
> >>>>> progress
> >>>>> is available for that operation.
> >>>>> last_error - This will only be filled if the operation failed
> >>>>> because of something other than IO or a VDSM crash, for obvious
> >>>>> reasons.
> >>>>> It will usually be set if the task was manually
> >>>>> stopped
> >>>>>
> >>>>> The user can either be satisfied with that information or ask
> >>>>> the host specified in hostID if it is still working on that
> >>>>> image by checking its running tasks.
> >>>> So we need a function to know what tasks are running on the
> >>>> image
> >>> getImageStatus()
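Since operations return before the work is done, a client ends up polling. A minimal sketch of that loop follows. The `get_image_status` callable is a hypothetical stand-in for getImageStatus(), and the dict shape with a `status` key is an assumption made for the example, not a defined return type.

```python
import time


def wait_until_usable(get_image_status, repo_id, image_id,
                      attempts=10, delay=0.0):
    """Poll until the image is no longer "broken".

    get_image_status is a hypothetical callable standing in for
    getImageStatus(); it is assumed to return a dict with a "status" key.
    """
    for _ in range(attempts):
        info = get_image_status(repo_id, image_id)
        if info["status"] in ("optimized", "degraded"):
            return info  # both states can run VMs
        time.sleep(delay)
    raise RuntimeError("image %s is still broken" % image_id)


# Fake backend for the example: broken twice, then degraded.
_replies = iter([
    {"status": "broken", "stage": "1/2", "percent_complete": 40},
    {"status": "broken", "stage": "2/2", "percent_complete": -1},
    {"status": "degraded", "stage": "2/2", "percent_complete": 100},
])
result = wait_until_usable(lambda repo, image: next(_replies),
                           "repoA", "img-1")
```

A real client would use a non-zero delay and would probably also look at the persisted hostID to decide whether anyone is still working on the image.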
> >>>>> checkStorageRepository(self, repositoryId, options={}):
> >>>>> A method to go over a storage repository and scan for any
> >>>>> existing problems. This includes degraded\broken images and
> >>>>> deleted images that have not yet been physically deleted\merged.
> >>>>> It returns a list of Fix objects.
> >>>>> Fix objects come in 4 types:
> >>>>> clean - cleans data, run them to get more space.
> >>>>> optimize - run them to optimize a degraded image
> >>>>> merge - Merges two images together. Doing this sometimes
> >>>>> makes more images ready for optimizing or cleaning.
> >>>>> The reason it is different from optimize is that
> >>>>> unmerged images are considered optimized.
> >>>>> mend - mends a broken image
> >>>>>
> >>>>> The user can read these types and prioritize fixes. Fixes also
> >>>>> contain opaque FIX data and they should be sent as received to
> >>>>> fixStorageRepository(self, repositoryId, fix, options={}):
> >>>>>
> >>>>> That will start a fix operation.
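How a client prioritizes fixes is up to the client. A minimal sketch of one possible policy is below. The ordering is an arbitrary choice for the example and not something VDSM mandates, and representing a Fix as a dict with `type` and `data` keys is an assumption.

```python
# One possible policy: repair first, reclaim space last.
PRIORITY = {"mend": 0, "merge": 1, "optimize": 2, "clean": 3}


def plan_fixes(fixes):
    """Order the Fix objects returned by checkStorageRepository() by type.

    Each fix is assumed to be a dict with a "type" key; everything else is
    opaque and must be handed back to fixStorageRepository() unchanged.
    """
    return sorted(fixes, key=lambda fix: PRIORITY[fix["type"]])


fixes = [
    {"type": "clean", "data": "opaque-1"},
    {"type": "mend", "data": "opaque-2"},
    {"type": "optimize", "data": "opaque-3"},
    {"type": "merge", "data": "opaque-4"},
]
ordered = plan_fixes(fixes)
```

The sort only reorders the list; each Fix object is passed along exactly as it was received.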
> >>>>>
> >>>>>
> >>>>> All major operations automatically start the appropriate "Fix"
> >>>>> to bring the created object to an optimized\degraded state
> >>>>> (whichever is quicker) unless one of the options is
> >>>>> AutoFix=False. This is only useful for repos that might not be
> >>>>> able
> >>>>> to create volumes on all hosts (SDM) but would like to have the
> >>>>> actual IO distributed in the cluster.
> >>>>>
> >>>>> Another common option is the strategy option:
> >>>>> It currently has 2 possible values,
> >>>>> space and performance - In case VDSM has 2 ways of completing
> >>>>> the same operation this tells it to value one over the other. For
> >>>>> example, whether to copy all the data or just create a qcow
> >>>>> based off a snapshot.
> >>>>> The default is space.
> >>>>>
> >>>>> You might have also noticed that it is never explicitly
> >>>>> specified
> >>>>> where to look for existing images. This is done purposefully,
> >>>>> VDSM
> >>>>> will always look in all connected repositories for existing
> >>>>> objects.
> >>>>> For very large setups this might be problematic. To mitigate
> >>>>> the
> >>>>> problem you have these options:
> >>>>> participatingRepositories=[repoId, ...] which tells VDSM to
> >>>>> narrow the search to just these repositories
> >>>>> and
> >>>>> imageHints={imgId: repoId} which will force VDSM to look for
> >>>>> those image IDs just in those repositories and fail if it
> >>>>> doesn't find them there.
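The difference between the two options can be shown with a toy lookup. This is a sketch only: `locate_image` and the `repos` mapping are invented for the example and say nothing about how VDSM would actually search.

```python
def locate_image(repos, image_id, participating=None, hints=None):
    """Toy lookup: repos maps repoId -> set of image IDs present there."""
    hints = hints or {}
    if image_id in hints:
        # A hint is binding: look only there and fail if it is wrong.
        repo_id = hints[image_id]
        if image_id not in repos[repo_id]:
            raise LookupError("%s is not in %s" % (image_id, repo_id))
        return repo_id
    # Without a hint, search all connected repos or just the narrowed list.
    candidates = participating if participating is not None else list(repos)
    for repo_id in candidates:
        if image_id in repos[repo_id]:
            return repo_id
    raise LookupError("%s not found" % image_id)


repos = {"repoA": {"img-1"}, "repoB": {"img-2"}, "repoC": {"img-3"}}
found = locate_image(repos, "img-2")
narrowed = locate_image(repos, "img-3", participating=["repoB", "repoC"])
```

Passing `hints={"img-3": "repoA"}` here raises `LookupError` even though img-3 exists in repoC, which is the "fail if it doesn't find them there" behavior.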
> >>>>> _______________________________________________
> >>>>> vdsm-devel mailing list
> >>>>> vdsm-devel at lists.fedorahosted.org
> >>>>> https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> >>>>
> >>>> --
> >>>> ---
> >>>> 舒明 Shu Ming
> >>>> Open Virtualization Engineerning; CSTL, IBM Corp.
> >>>> Tel: 86-10-82451626 Tieline: 9051626 E-mail: shuming at cn.ibm.com
> >>>> or
> >>>> shuming at linux.vnet.ibm.com
> >>>> Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian
> >>>> District, Beijing 100193, PRC
> >>>>
> >>>>
> >>>>
> >> --
> >> Adam Litke <agl at us.ibm.com>
> >> IBM Linux Technology Center
> >>
> >>
>
>
> --
> ---
> 舒明 Shu Ming
> Open Virtualization Engineerning; CSTL, IBM Corp.
> Tel: 86-10-82451626 Tieline: 9051626 E-mail: shuming at cn.ibm.com or
> shuming at linux.vnet.ibm.com
> Address: 3/F Ring Building, ZhongGuanCun Software Park, Haidian
> District, Beijing 100193, PRC
>
>
>