Some thoughts on enhancing High Availability in oVirt

Sun Feb 19 15:42:39 UTC 2012

>> Absolutely.
>>
>> In this case the Cloud Application is the combination of the two
>> separate VM components (database VM and AS VM).  A CAPE (cloud
>> application policy engine) maintains the HA state of both VMs including
>> correcting for resource (db,as) or vm failures, and ensuring ordering
>> constraints even during recovery (the AS would start after the DB in
>> this model).
>>
> 
> ok, what would a flow look like to the user (oVirt user)?
> 
> - Adding new service in OE
> - Specifying for the service which VMs provide it (?)

That could work, or you could do:

1. Adding a new VM (or set of VMs) in OE
2. Adding one or more services to associate with those VMs

It just depends on which user experience is easier.  From the
perspective of pcmk-cloud, we get the same data in the end: a config
file that specifies the resources we care about (both VMs and services
on those VMs).
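
To make "the same data in the end" concrete, here is a rough sketch of
the information either flow would need to collect.  This is NOT the
actual pcmk-cloud config format; the class names, the start_after field
and the service names are all made up for illustration.  It only shows
VMs, the services on them, and the ordering constraint from the example
(the AS VM starts after the DB VM).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Service:
    name: str      # hypothetical service name
    monitor: str   # how it is monitored: "ocf", "systemd" or "sysv"


@dataclass
class VM:
    name: str
    services: list = field(default_factory=list)
    # Names of VMs that must be running before this one is started.
    start_after: list = field(default_factory=list)


def start_order(vms):
    """Return VM names in an order that respects every start_after constraint."""
    # TopologicalSorter expects {node: set of predecessors}.
    graph = {vm.name: set(vm.start_after) for vm in vms}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical two-VM application: a database VM and an application server VM.
db = VM("db-vm", services=[Service("postgresql", "systemd")])
app = VM("as-vm", services=[Service("jboss", "sysv")], start_after=["db-vm"])

# The order in which the user added the VMs does not matter.
order = start_order([app, db])
print(order)
```

Whichever UI flow is used (service first or VM first), it would end up
filling in the same structure.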

> - Specify how the service can be monitored (? how does CAPE know what
> to look for as the service heartbeat?)

For each service you would specify which of the following to use:
* An OCF resource agent (see the resource-agents package in Fedora and
  other distros)
* A systemd unit or SysV init script
* Some other custom script (which would need to be in either OCF RA or
  init script style)
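
As a rough sketch of what monitoring through each of these looks like:
an OCF RA is called with a "monitor" action and returns 0 when the
resource is running and 7 when it is cleanly stopped, "systemctl
is-active" returns 0 when the unit is active, and an LSB init script's
"status" action returns 0 when running and 3 when not running.  The
function names, the default "heartbeat" provider and the /usr/lib/ocf
location below are assumptions for illustration, not how pcmk-cloud
actually does it.

```python
import os
import subprocess

OCF_NOT_RUNNING = 7   # OCF: resource is cleanly stopped
LSB_NOT_RUNNING = 3   # LSB init script "status": program is not running


def monitor_command(kind, name, provider="heartbeat"):
    """Build the command line that checks a service of the given kind."""
    if kind == "ocf":
        # Assumed install location of OCF resource agents.
        return [f"/usr/lib/ocf/resource.d/{provider}/{name}", "monitor"]
    if kind == "systemd":
        return ["systemctl", "is-active", "--quiet", name]
    if kind == "sysv":
        return [f"/etc/init.d/{name}", "status"]
    raise ValueError(f"unknown monitor kind: {kind}")


def interpret(kind, returncode):
    """Map an exit code to 'running', 'not running' or 'failed'."""
    if returncode == 0:
        return "running"
    if kind == "systemd":
        # is-active only promises non-zero when the unit is not active.
        return "not running"
    if kind == "ocf" and returncode == OCF_NOT_RUNNING:
        return "not running"
    if kind == "sysv" and returncode == LSB_NOT_RUNNING:
        return "not running"
    return "failed"


def check(kind, name):
    """Run the monitor command locally and interpret its exit code."""
    # OCF agents expect OCF_ROOT to be set; agent-specific parameters
    # would be passed as OCF_RESKEY_* variables (none are set here).
    env = dict(os.environ, OCF_ROOT="/usr/lib/ocf")
    result = subprocess.run(monitor_command(kind, name), env=env)
    return interpret(kind, result.returncode)
```

A custom script would slot into the "ocf" or "sysv" branch depending on
which of the two conventions it follows.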

> - Marking the service as HA
> 
> What's next?
> Where can the user define the policy about this service

There would need to be a UI in OE that exposed an interface for adding
policy information.  Because the Pacemaker policy engine is very
flexible, it would make sense to only define very specific knobs in the
UI, otherwise it could get very confusing for the users.  For more
complex policies, it might be better to provide a way to manually edit
the policy file and upload it rather than trying to model everything in
the UI.

> (i.e. 'should be
> available only on Tuesdays' or 'should be available only between
> 0800-1700 CET' etc)?

For this example, what do you mean by 'should be available'?  In general
with HA, the idea is to 'keep the service running as much as possible'.

The above example seems less like an HA concern and more like a general
resource scheduling concern.  I think this should also be possible
using the Pacemaker rules engine with pcmk-cloud, but I'll let
Andrew/Steve comment further on that.

Perry