Some thoughts on enhancing High Availability in oVirt

Wed Feb 15 01:36:48 UTC 2012

> I'm not sure I agree.
> This entire thread assumes that the way to do this is to have the
> engine continuously monitor all services on all (HA) guests and
> according to varying policies reschedule VMs (services within VMs?)

That's one interpretation of what I wrote, but not the only one.

Pacemaker Cloud doesn't rely on a single process (like oVirt Engine) to
monitor all VMs and the services in those VMs.  It relies on spawning a
monitor process for each logical grouping of VMs in an 'application group'.

So the engine doesn't need to continuously monitor every VM and every
service, it delegates to the Cloud Policy Engine (CPE) which in turn
creates a daemon (DPE[1]) to monitor each application group.
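
As a rough illustration of that delegation (a sketch only, not the
actual pcmk-cloud code), the CPE can be modelled as a registry that
creates one monitor per application group.  The class and method names
below (ApplicationGroup, DPE, CPE, register, watched_vms) are made up
for the example, as are the group and VM names:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ApplicationGroup:
    """A logical grouping of VMs that together run one application."""
    name: str
    vms: list[str]


@dataclass
class DPE:
    """Hypothetical stand-in for the per-group monitor daemon."""
    group: ApplicationGroup

    def watched_vms(self) -> list[str]:
        # A DPE only looks at the VMs of its own application group.
        return list(self.group.vms)


@dataclass
class CPE:
    """Hypothetical stand-in for the Cloud Policy Engine."""
    dpes: dict[str, DPE] = field(default_factory=dict)

    def register(self, group: ApplicationGroup) -> DPE:
        # Create one monitor per group; reuse it if the group is known.
        if group.name not in self.dpes:
            self.dpes[group.name] = DPE(group)
        return self.dpes[group.name]


# The engine hands groups to the CPE instead of monitoring VMs itself.
cpe = CPE()
cpe.register(ApplicationGroup("webshop", ["db-vm", "web-vm"]))
cpe.register(ApplicationGroup("reporting", ["report-vm"]))
```

The point of the shape is that the monitoring load grows with the
number of application groups, each handled by its own monitor, rather
than piling up in one process.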

> I don't think this is scalable (and wrt drools/pacemaker, assuming
> what Andrew says is correct, drools doesn't even remotely come close
> to supporting even relatively small scales)

The way to deal with drools' poor scaling is... don't use drools :)

But you're right, having oVirt Engine be the sole entity for monitoring
every service on every VM is not scalable, which is the reason why the
Pacemaker Cloud architecture doesn't do it that way.

> Engine should decide on policy, the hosts should enforce it.

This is how Pacemaker Cloud works as well, except right now I'd restate
it as: Engine should decide on policy and the DPEs should enforce it.

In the current thinking the DPEs run co-located with the CPE, which
would run nearby (but not necessarily on the same server as) the oVirt
Engine.

However, you bring up a good point in that the DPEs could be distributed
to the hosts.  (Right now CPE/DPE communication uses IPC, but this could
be replaced with something TCP-oriented.)

Note: Not relying on anything from the host was a design constraint for
Pacemaker Cloud.  oVirt is different in that you can put things on the
hosts, so there may be optimizations we can make due to this relaxed
constraint, like putting the DPEs onto the Hosts.

> What this would translate to is a more distributed way of monitoring
> and moving around of VMs/services.  E.g. for each service, engine
> would run the VM on host A and let host B know that it is the
> failover node for this service.

That seems restrictive.  Why not allow that VM to fail over to 'any
other node in the cloud' vs. picking a specific piece of hardware?  If
you let it pick the best available node at the time, using predefined
policies, you put less focus on the individual hosts and make things
more cloud-like (abstraction of resources).
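
To make "pick the best available node at the time" concrete, here is a
minimal sketch.  The policy used (most free memory among hosts that are
up and have enough room) is only an assumed example of a predefined
policy, and Host, free_mem_mb and pick_failover_host are hypothetical
names, not oVirt or pcmk-cloud APIs:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    up: bool
    free_mem_mb: int


def pick_failover_host(
    hosts: list[Host], failed: str, needed_mem_mb: int
) -> Optional[Host]:
    """Return the best host for a VM that was running on `failed`.

    Example policy (an assumption): any host that is up, is not the
    failed one, and has enough free memory is eligible; the one with
    the most free memory wins.  Returns None if nothing is eligible.
    """
    candidates = [
        h for h in hosts
        if h.up and h.name != failed and h.free_mem_mb >= needed_mem_mb
    ]
    return max(candidates, key=lambda h: h.free_mem_mb, default=None)


hosts = [
    Host("hostA", up=True, free_mem_mb=1024),
    Host("hostB", up=True, free_mem_mb=4096),
    Host("hostC", up=False, free_mem_mb=8192),
    Host("hostD", up=True, free_mem_mb=2048),
]
target = pick_failover_host(hosts, failed="hostA", needed_mem_mb=2048)
```

Nothing here names a failover host ahead of time; the choice is made
from whatever is healthy when the failure happens.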

>  Node B would be monitoring the
> heartbeats for the services it is in charge of and take over when
> needed. In case host B crashes, engine would choose a different host
> to be the failover node (note that there can be more than 2 nodes
> with a predefined order of priority).

Agree with this... It's sort of what I said above: the DPE could run on
HostB to monitor stuff running on Hosts A and C (for the case where
there are multiple VMs across different hosts in an application group).
And if the DPE or HostB fails, then the CPE would respawn a new DPE on a
new host.
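
A sketch of that respawn step, under the assumption that the CPE keeps
a simple map of application group to the host running its DPE.  The
function name and the "least loaded healthy host" placement rule are
invented for the example:

```python
from __future__ import annotations

from collections import Counter


def respawn_dpes(
    placement: dict[str, str], healthy_hosts: list[str]
) -> dict[str, str]:
    """Return a new group -> host map with DPEs moved off dead hosts.

    DPEs on healthy hosts stay where they are.  Each orphaned DPE goes
    to the healthy host currently running the fewest DPEs (an assumed
    placement rule, picked only to keep the example short).
    """
    if not healthy_hosts:
        raise ValueError("no healthy host available to run DPEs")

    healthy = set(healthy_hosts)
    result = {g: h for g, h in placement.items() if h in healthy}
    load = Counter(result.values())

    for group in sorted(placement):
        if group in result:
            continue
        # Ties go to the first host listed in healthy_hosts.
        new_host = min(healthy_hosts, key=lambda h: load[h])
        result[group] = new_host
        load[new_host] += 1
    return result


# hostB has failed, so the DPE for "webshop" needs a new home.
before = {"webshop": "hostB", "reporting": "hostC"}
after = respawn_dpes(before, healthy_hosts=["hostA", "hostC"])
```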

I think Pacemaker Cloud could fit the paradigm you're looking for here,
but it will require a little integration work.  On the other hand, if
you are looking to keep this more Java oriented or very tightly
integrated with the oVirt codebase, then you could probably take the
concepts already implemented in pcmk-cloud and re-implement them.

Either way works.  We'd be happy to assist either with integration of
pcmk-cloud here or with general advice on HA as you work on the Java
implementation.

Perry

[1] This daemon right now is called the DPE for Deployable Policy
    Engine, since in the Aeolus terminology a Deployable was a set of
    VMs that were coordinated to run an application.  For example, 2
    VMs, one running a database and the other running a web server.

    Aeolus terminology has changed and 'Deployable' is no longer used
    to describe this.  Instead this is called an Application Set/Group.

    Because pcmk-cloud adopted Aeolus terminology and the Deployable
    term is not really well known, we're probably going to rename the
    DPE to be "Cloud Application Policy Engine" or CAPE.