[Engine-devel] [Draft] Task Management API

Saggi Mizrahi smizrahi at redhat.com
Mon Dec 17 22:40:20 UTC 2012


Dan rightly suggested I'd be more specific about what the task system is
instead of what the task system isn't.

The problem is that I'm not completely sure how it's going to work.
It also depends on the events mechanism.
This is my current working draft:


TaskInfo:
id string
methodName string
kwargs json-object (string keys, variant values) *filtered to remove
                                                  sensitive information

getRunningTasks(filter string, filterType enum{glob, regexp})
Returns a list of TaskInfo objects for all tasks whose IDs match the filter.


That's it, not even stopTask()
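
Just to make this concrete, a client call could look something like the
sketch below (the host handle and the output shape are made up for the
example, they are not part of the proposal):

# Hypothetical client-side usage; names are illustrative only.
tasks = host.getRunningTasks("*", "glob")   # every running task

for task in tasks:
    # each task carries the TaskInfo fields described above
    print(task["id"], task["methodName"], task["kwargs"])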

As explained, I would like to offload handling to the subsystems.
To make things easier for the clients, every subsystem can choose a
field of the object to be of type OperationInfo.
This is a generic structure that gives the user a uniform way to track tasks
across all subsystems through a reporting interface. The extraData field is
for subsystem specific data. This is where the storage subsystem would put,
for example, imageState (broken, degraded, optimized) data.

OperationInfo:
operationDescription string - something out of an agreed enum of strings
                              vaguely describing the operation at hand, for
                              example "Copying", "Merging", "Deleting",
                              "Configuring", "Stopped", "Paused", ....
                              They must be known to the client so it can in
                              turn translate them in the UI. They also have to
                              remain relatively vague, as they are part of the
                              interface, meaning that new values will break
                              old clients, so they have to be reusable.
stageDescription - similar to operationDescription in case you want more
                   granularity; optional.
stage (int, int) - (5, 10) means 5 out of 10. (1, 1) tells the UI not to
                   display stage widgets.
percentage - 0-100, -1 means unknown.
lastError - (code, message), the same errors that can be returned by regular
            calls.
extraData - json-object
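
For illustration only, the same record could be written down as a
Python-style structure (this is just to name the fields, not a proposed
implementation):

from collections import namedtuple

# Sketch only: the field names mirror the description above.
OperationInfo = namedtuple("OperationInfo", [
    "operationDescription",  # e.g. "Copying", "Paused", ...
    "stageDescription",      # optional, finer grained description
    "stage",                 # (current, total) tuple
    "percentage",            # 0-100, -1 means unknown
    "lastError",             # (code, message)
    "extraData",             # subsystem specific json-object
])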


For example, createVM will return once the object is created in VDSM.
getVmInfo() would return, amongst other things, the operation info.
For the case of preparing for launch it will be:
  {"Creating", "configuring", (2, 4), 40, (0, ""),
   {state="preparing for launch"}}
In the case of VM paused on EIO:
  {"Paused", "Paused", (1, 1), -1, (123, "Error writing to disks"),
   {state="paused"}}

Migration is a tricky one: it will be reported as a task while it's in
progress, but all the information is available on the image operationInfo.
In the case of Migration:
  {"Migration", "Configuring", (1, 3), -1, (0, ""), {status="Migrating"}}

For StorageConnection this is somewhat already the case, but in a simplified
version.

If you want to ask about any other operation I'd be more than happy to write
my suggestion for it.

Subsystems have complete freedom about how to set up the API.
For Storage you have Fixes() to start/stop operations.
Gluster is pretty autonomous once operations have been started.

Since operations return as soon as they are registered (persisted) or fail to
register, synchronous programming becomes a bit clunky.
vdsm.pauseVm(vmId) doesn't return when the VM is paused but when VDSM has
committed to trying to pause it. This means you will have to poll in order to
see if the operation finished. For gluster, as an example, this is the only
way we can check that the operation finished.
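
In code, that polling fallback could look roughly like this (the gluster call
name and the field access are assumptions made up for the sketch):

import time

def wait_for_operation(host, volumeId, poll_interval=5):
    # Poll until the operation reports its final stage.
    while True:
        opInfo = host.gluster.getVolumeInfo(volumeId).operationInfo
        current, total = opInfo.stage
        if current == total:
            return opInfo              # finished, check lastError/extraData
        time.sleep(poll_interval)      # still running, try again later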

For stuff we have a bit more control over, VDSM will fire events using
json-rpc notifications sent to the clients. They will be in the form of:
{"method": "alert", "params": {
  "alertName": <subsystem>(.<objectType>)?.<object>.(<subobject>., ...),
  "operationInfo": OperationInfo}
}
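
A concrete notification for the paused VM example above could look like this
(the JSON key spellings inside operationInfo are my guess at how the
structure maps onto the wire):

{"method": "alert",
 "params": {"alertName": "vdsm.VM.best_vm",
            "operationInfo": {"operationDescription": "Paused",
                              "stageDescription": "Paused",
                              "stage": [1, 1],
                              "percentage": -1,
                              "lastError": [0, ""],
                              "extraData": {"state": "paused"}}}}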

The user can register to receive events using a glob or a regexp.
Registering to vdsm.VM.* will fire every time any VM has changed stage.
This means that whenever the task finishes, fails or makes significant
progress, and VDSM is there to track it, an event will be sent to the client.
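
The registration call itself isn't nailed down yet; as a sketch it could be
something as simple as (registerForEvents is purely hypothetical):

# Hypothetical subscription call; the real API may end up looking different.
host.registerForEvents("vdsm.VM.*", filterType="glob")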

This means that the general flow is:
# Register operation
vmID = "best_vm"
host.VM.pauseVM(vmID)
while True:
    opInfo = None
    try:
        event = host.waitForEvent("vdsm.VM.best_vm", timeout=10)
        opInfo = event.opInfo
    except VdsmDisconnectionError:
        host.waitForReconnect()
        # We may have missed events while disconnected, so fetch the current
        # state directly instead of waiting for the next notification.
        vmInfo = host.VM.getVmInfo(vmID)
        opInfo = vmInfo.operationInfo
    except Timeout:
        # This is a long operation, poll to see that we didn't miss any event
        # but more commonly, update percentage in the UI to show progress.
        vmInfo = host.VM.getVmInfo(vmID)
        opInfo = vmInfo.operationInfo

    if opInfo.stage.number != opInfo.stage.total:
        # Operation in progress
        updateUI(opInfo)
    else:
        # Operation completed
        # Check that the state is what we expected it to be.
        if opInfo.extraData.state == "paused":
            return SUCCESS
        else:
            return opInfo.lastError


vdsm.waitForEvent(filter, timeout) is a client side libvdsm helper operation.
Clients that access the raw API need to create their own client side code to
filter out events and manage their distribution. I'm open to also defining
server side filters, but I'm not sure whether it's worth it or whether just
having it be a boolean (all events or none) is sufficient.
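
To show what that client side code amounts to, here is a rough sketch of a
glob based filter over incoming notifications (the plumbing around it is
assumed, not specified):

import fnmatch
import queue

class EventWaiter:
    # Sketch of client side event filtering; real libvdsm code may differ.
    def __init__(self):
        self._queue = queue.Queue()

    def on_notification(self, notification):
        # Called by the json-rpc layer for every incoming "alert".
        self._queue.put(notification["params"])

    def waitForEvent(self, pattern, timeout=None):
        # Return the first event whose alertName matches the glob pattern.
        # Simplified: the timeout applies per attempt, not overall, and
        # queue.Empty is raised if it expires.
        while True:
            params = self._queue.get(timeout=timeout)
            if fnmatch.fnmatch(params["alertName"], pattern):
                return params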

The flow above is a very simplified example, but the general pattern is
clear. Even if the connection is lost for 1 second or 4 days, the code still
works. Furthermore, the user can wait for multiple operations in the same
thread using:
  host.waitForEvent("vdsm.VM.(best_vm_ever|not_so_good_vm)")
This means that the client can wait for 100 VMs, or all VMs (using
wildcards), in a mechanism similar to "poll()" with minimal overhead.
The fact that operations are registered means that even if the connection is
lost due to VDSM crashing or the network crashing, the manager doesn't need
to care once the original command returns, as it knows the operation is
registered. This doesn't mean that every operation must retry forever. How
persistent each method is can and should vary between the different
operations.
It also means that a manager that didn't initiate an operation can track it
in the same way as the one that did. This makes clustered managers a lot
easier to implement, since if one goes down a second one can take over more
or less immediately with minimal extra code.
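
As an illustration of that takeover case, a standby manager that never issued
any commands could converge on the current state with something like this
(built on the same hypothetical helpers as above, plus an assumed bulk
getAllVmInfo call):

# Seed the UI/model with the current state of every VM.
for vmInfo in host.VM.getAllVmInfo():          # assumed bulk query
    updateUI(vmInfo.operationInfo)

# Then just follow events for all VMs, exactly like the initiating manager.
while True:
    event = host.waitForEvent("vdsm.VM.*")
    updateUI(event.opInfo)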


