[Engine-devel] Managing async tasks

Mon Dec 17 22:44:55 UTC 2012


----- Original Message -----
> From: "Ayal Baron" <abaron at redhat.com>
> To: "Saggi Mizrahi" <smizrahi at redhat.com>
> Cc: "Dan Kenigsberg" <danken at redhat.com>, "Federico Simoncelli" <fsimonce at redhat.com>, engine-devel at ovirt.org,
> vdsm-devel at lists.fedorahosted.org, "Adam Litke" <agl at us.ibm.com>
> Sent: Monday, December 17, 2012 5:24:48 PM
> Subject: Re: Managing async tasks
> 
> 
> 
> ----- Original Message -----
> > This is an addendum to my previous email.
> > 
> > ----- Original Message -----
> > > From: "Saggi Mizrahi" <smizrahi at redhat.com>
> > > To: "Adam Litke" <agl at us.ibm.com>
> > > Cc: "Dan Kenigsberg" <danken at redhat.com>, "Ayal Baron"
> > > <abaron at redhat.com>, "Federico Simoncelli"
> > > <fsimonce at redhat.com>, engine-devel at ovirt.org,
> > > vdsm-devel at lists.fedorahosted.org
> > > Sent: Monday, December 17, 2012 2:52:06 PM
> > > Subject: Re: Managing async tasks
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > From: "Adam Litke" <agl at us.ibm.com>
> > > > To: "Saggi Mizrahi" <smizrahi at redhat.com>
> > > > Cc: "Dan Kenigsberg" <danken at redhat.com>, "Ayal Baron"
> > > > <abaron at redhat.com>, "Federico Simoncelli"
> > > > <fsimonce at redhat.com>, engine-devel at ovirt.org,
> > > > vdsm-devel at lists.fedorahosted.org
> > > > Sent: Monday, December 17, 2012 2:16:25 PM
> > > > Subject: Re: Managing async tasks
> > > > 
> > > > On Mon, Dec 17, 2012 at 12:15:08PM -0500, Saggi Mizrahi wrote:
> > > > > 
> > > > > 
> > > > > ----- Original Message -----
> > > > > > > From: "Adam Litke" <agl at us.ibm.com>
> > > > > > > To: vdsm-devel at lists.fedorahosted.org
> > > > > > > Cc: "Dan Kenigsberg" <danken at redhat.com>, "Ayal Baron"
> > > > > > > <abaron at redhat.com>, "Saggi Mizrahi" <smizrahi at redhat.com>,
> > > > > > > "Federico Simoncelli" <fsimonce at redhat.com>,
> > > > > > > engine-devel at ovirt.org
> > > > > > > Sent: Monday, December 17, 2012 12:00:49 PM
> > > > > > > Subject: Managing async tasks
> > > > > > 
> > > > > > > On today's vdsm call we had a lively discussion around how
> > > > > > > asynchronous operations should be handled in the future.  In an
> > > > > > > effort to include more people in the discussion and to better
> > > > > > > capture the resulting conversation I would like to continue that
> > > > > > > discussion here on the mailing list.
> > > > > > > 
> > > > > > > A lot of ideas were thrown around about how 'tasks' should be
> > > > > > > handled in the future.  There are a lot of ways that it can be
> > > > > > > done.  To determine how we should implement it, it's probably best
> > > > > > > if we start with a set of requirements.  If we can first agree on
> > > > > > > these, it should be easy to find a solution that meets them.  I'll
> > > > > > > take a stab at identifying a first set of POSSIBLE requirements:
> > > > > > > 
> > > > > > > - Standardized method for determining the result of an operation
> > > > > > > 
> > > > > > >   This is a big one for me because it directly affects the
> > > > > > >   consumability of the API.  If each verb has different semantics
> > > > > > >   for discovering whether it has completed successfully, then the
> > > > > > >   API will be nearly impossible to use easily.
> > > > > > Since there is no way to be sure whether some tasks completed
> > > > > > successfully or failed, especially around the murky waters of
> > > > > > storage, I say this requirement should be removed, at least in
> > > > > > the context of a task.
> > > > 
> > > > > I don't agree.  Please feel free to convince me with some examples.
> > > > > If we cannot provide feedback to a user as to whether their request
> > > > > has been satisfied or not, then we have some bigger problems to
> > > > > solve.
> > > > Say VDSM sends a write command to a storage server, and the
> > > > connection hangs up before the ACK has returned.
> > > > The operation has been committed, but VDSM has no way of knowing
> > > > that it happened. As far as VDSM is concerned, it got an ETIMEDOUT
> > > > or EIO.
> > > > This is the same problem that the engine has with VDSM.
> > > > If VDSM creates an image\VM\network\repo but the connection hangs
> > > > up before the response can be sent back, then as far as the engine
> > > > is concerned the operation timed out.
> > > This is an inherent issue with clustering.
> > > This is why I want to move away from tasks being *the* trackable
> > > objects.
> > > Tasks should be short. As short as possible.
> > > Run VM should just persist the VM information on the VDSM host
> > > and
> > > return. The rest of the tracking should be done using the VM ID.
> > > Create image should return once VDSM persisted the information
> > > about
> > > the request on the repository and created the metadata files.
> > > Tracking should be done on the repo or the imageId.
> > 
> > The thing is that I know how long a VM object should live (or an
> > Image object).
> > So tracking it is straightforward. How long a task should live is
> > very problematic and quite context specific.
> > It depends on what the task is.
> > I think it's quite confusing from an API standpoint to have every
> > task have a different scope, id requirement and life-cycle.
> > 
> > VDSM has two types of APIs:
> > 
> > CRUD objects - VM, Image, Repository, Bridge, Storage
> > Connections....
> > General transient methods - getBiosInfo(), getDeviceList()
> > 
> > The latter are quite simple to manage. They don't need any special
> > handling. If you lost a getBiosInfo() call you just send another
> > one, no harm done.
> > The same is even true with things that "change" the host like
> > getDeviceList()
> > 
> > What we are really arguing about is fitting the CRUD objects to
> > some
> > generic task oriented scheme.
> > I'm saying it's a waste of time as you can quite easily have flows
> > to
> > recover from each operation.
> > 
> > Create - Check if the object exists
> > Read - Read again
> > Update - either update again or read and update if update didn't
> > commit the first time
> > Delete - Check if object doesn't exist
> > 
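As a rough illustration of the four recovery flows listed above, here is a minimal Python sketch. Everything in it is hypothetical: FakeHost, LostResponse and the *_with_recovery helpers are made-up names for illustration only and are not part of any VDSM API. It assumes the only failure mode is a lost reply, i.e. the request itself may already have been applied on the host.

```python
class LostResponse(Exception):
    """The request may have been applied, but the reply never arrived."""


class FakeHost:
    """Hypothetical in-memory stand-in for a host (not a VDSM API)."""

    def __init__(self):
        self.objects = {}
        self.drop_next_reply = False  # set to True to simulate one lost reply

    def _reply(self, value=None):
        # The work is already done by the time the reply is lost.
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise LostResponse()
        return value

    def create(self, obj_id, data):
        self.objects[obj_id] = dict(data)
        return self._reply()

    def read(self, obj_id):
        obj = self.objects.get(obj_id)
        return self._reply(None if obj is None else dict(obj))

    def update(self, obj_id, changes):
        self.objects[obj_id].update(changes)
        return self._reply()

    def delete(self, obj_id):
        self.objects.pop(obj_id, None)
        return self._reply()

    def exists(self, obj_id):
        return self._reply(obj_id in self.objects)


def create_with_recovery(host, obj_id, data):
    try:
        host.create(obj_id, data)
    except LostResponse:
        # Create: check if the object exists, create again only if it doesn't.
        if not host.exists(obj_id):
            host.create(obj_id, data)


def read_with_recovery(host, obj_id):
    try:
        return host.read(obj_id)
    except LostResponse:
        # Read: just read again.
        return host.read(obj_id)


def update_with_recovery(host, obj_id, changes):
    try:
        host.update(obj_id, changes)
    except LostResponse:
        # Update: read, and update again only if the first one didn't commit.
        current = host.read(obj_id)
        if any(current.get(key) != value for key, value in changes.items()):
            host.update(obj_id, changes)


def delete_with_recovery(host, obj_id):
    try:
        host.delete(obj_id)
    except LostResponse:
        # Delete: check that the object no longer exists.
        if host.exists(obj_id):
            host.delete(obj_id)
```

Note that none of the helpers needs a task ID; each recovery step is keyed on the ID of the object itself, which is the point being argued for here.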
> > Each of the objects we CRUD has different life-cycle and
> > ownership semantics.
> > 
> > Danken raised the point that creation has a problem: if it fails,
> > there is no way to find out why it failed.
> > This is why Create methods should be minimal. They shouldn't create
> > the object, just the entry in the respective persistent storage.
> > Even now storage connections are persisted to disk and then the
> > operation returns and the user polls to see the state of the
> > connection.
> > The same should be done for everything. Do the minimum required to
> > create the object entry and mark it as "not usable".
> > For storage connections it's "connecting"
> > For VMs it's "preparing for launch"
> > For new images it's "broken" and in some regards "degraded"
> > 
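The "create the entry, mark it not usable, let the user poll" pattern described above could be sketched roughly as follows. This is only an illustration under stated assumptions: ConnectionRegistry, its methods, the "connected" and "failed" states and the "error" field are all hypothetical names, and only the "connecting" state comes from the text above. The dictionary stands in for whatever persistent storage would really be used.

```python
import threading


class ConnectionRegistry:
    """Hypothetical registry of storage connections (not a VDSM API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # stand-in for persistent storage

    def create(self, conn_id, connect_fn):
        # Do the minimum: record the entry as not usable yet, then return.
        with self._lock:
            self._entries[conn_id] = {"state": "connecting", "error": None}
        worker = threading.Thread(target=self._run, args=(conn_id, connect_fn))
        worker.start()
        return worker  # returned only so a caller or test can join() it

    def _run(self, conn_id, connect_fn):
        # The long-running part happens after create() has returned.
        try:
            connect_fn()
        except Exception as exc:
            outcome = {"state": "failed", "error": str(exc)}
        else:
            outcome = {"state": "connected", "error": None}
        with self._lock:
            self._entries[conn_id] = outcome

    def status(self, conn_id):
        # Polling is keyed on the object ID, not on a task ID.
        with self._lock:
            return dict(self._entries[conn_id])
```

A caller would invoke create() once and then poll status() with the same connection ID. Keeping the failure reason on the entry is one possible way to address the "why did it fail" concern, again as an assumption rather than a description of existing behaviour.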
> > I hope this makes things clearer
> 
> Saggi,
> 
> When running an async operation (not task, operation), I want an
> indication of when it finishes.  This can be either an event sent to
> me or via polling or by divine intervention, but this is basic
> information that is required.
> 
> Polling for a specific end state is wrong because there can be
> multiple end states (success, failure 1, failure 2, maybe even
> multiple options for success, etc).
> From the call I get the feeling that you do support having this, just
> not having it persisted across restarts of the service?
> If so, then let's discuss the semantics of what can be reported while
> vdsm doesn't crash.
> If not then in addition to not agreeing with you on this, I have
> additional problems.  For example, when an operation ends with
> failure, it is insufficient to know that it failed. I want to know
> *why* it failed.  Without changing something there is no reason to
> believe that trying again would succeed.  Without indication of
> reason of failure I'd just be shooting in the dark.
> 
> To keep the discussion focused I will stop here to let you comment.
I agree, I sent a different email with my suggestion on how to solve all
these problems.
> 
> > 
> > 
> > > > 
> > > > > > 
> > > > > > 
> > > > > > > Sorry.  That's my list :)  Hopefully others will be willing to add
> > > > > > > other requirements for consideration.
> > > > > > > 
> > > > > > > From my understanding, task recovery (stop, abort, rollback, etc)
> > > > > > > will not be generally supported and should not be a requirement.
> > > > > > 
> > > > 
> > > > --
> > > > Adam Litke <agl at us.ibm.com>
> > > > IBM Linux Technology Center
> > > > 
> > > > 
> > > 
> > 
>