Some thoughts on enhancing High Availability in oVirt

Wed Feb 15 06:32:31 UTC 2012

----- Original Message -----
> > I'm not sure I agree.
> > This entire thread assumes that the way to do this is to have the
> > engine continuously monitor all services on all (HA) guests and
> > according to varying policies reschedule VMs (services within VMs?)
> 
> That's one interpretation of what I wrote, but not the only one.
> 
> Pacemaker Cloud doesn't rely on a single process (like oVirt Engine) to
> monitor all VMs and the services in those VMs.  It relies on spawning a
> monitor process for each logical grouping of VMs in an 'application
> group'.
> 
> So the engine doesn't need to continuously monitor every VM and every
> service, it delegates to the Cloud Policy Engine (CPE) which in turn
> creates a daemon (DPE[1]) to monitor each application group.

Where is the daemon spawned? On the engine, or in a distributed fashion?
If the latter, then drools is irrelevant.  If the former, then it would
just make things worse (scalability-wise).

> 
> > I don't think this is scalable (and wrt drools/pacemaker, assuming
> > what Andrew says is correct, drools doesn't even remotely come close
> > to supporting even relatively small scales)
> 
> The way to deal with drools' poor scaling is... don't use drools :)
> 
> But you're right, having oVirt Engine be the sole entity for monitoring
> every service on every VM is not scalable, which is the reason why the
> Pacemaker Cloud architecture doesn't do it that way.
> 
> > Engine should decide on policy, the hosts should enforce it.
> 
> This is how Pacemaker Cloud works as well, except right now I'd restate
> it as: Engine should decide on policy and the DPEs should enforce it.
> 
> In the current thinking the DPEs run co-located with the CPE, which
> would run nearby (but not necessarily on the same server as) the oVirt
> Engine.
> 
> However, you bring up a good point in that the DPEs could be distributed
> to the hosts.  (Right now CPE/DPE communication uses IPC but this could
> be replaced with something TCP oriented.)
> 
> Note: Not relying on anything from the host was a design constraint for
> Pacemaker Cloud.  oVirt is different in that you can put things on the
> hosts, so there may be optimizations we can make due to this relaxed
> constraint, like putting the DPEs onto the Hosts.
> 
> > What this would translate to is a more distributed way of monitoring
> > and moving around of VMs/services.  E.g. for each service, engine
> > would run the VM on host A and let host B know that it is the
> > failover node for this service.
> 
> That seems restrictive.  Why not allow that VM to fail over to 'any
> other node in the cloud' vs. picking a specific piece of hardware?  If
> you allow it to just pick the best available node at the time using
> predefined policies, that will result in less focus on the individual
> hosts and make things more cloud-like (abstraction of resources).
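
To make the "pick the best available node at the time using predefined
policies" idea concrete, here is a minimal sketch in Python.  Everything
in it is an assumption made for illustration: the Host fields, the
pick_failover_host name and the policy itself (fewest running VMs, then
most free memory) come from neither oVirt nor pcmk-cloud.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical view of a host; not an oVirt or pcmk-cloud type.
    name: str
    up: bool
    free_mem_mb: int
    running_vms: int


def pick_failover_host(hosts, required_mem_mb, exclude=()):
    """Return the best candidate host, or None if no host qualifies."""
    candidates = [
        h for h in hosts
        if h.up and h.name not in exclude and h.free_mem_mb >= required_mem_mb
    ]
    if not candidates:
        return None
    # Assumed policy: fewest running VMs first, most free memory as tie-break.
    return min(candidates, key=lambda h: (h.running_vms, -h.free_mem_mb))


hosts = [
    Host("hostA", up=False, free_mem_mb=8192, running_vms=0),
    Host("hostB", up=True, free_mem_mb=4096, running_vms=3),
    Host("hostC", up=True, free_mem_mb=16384, running_vms=1),
]
chosen = pick_failover_host(hosts, required_mem_mb=2048, exclude={"hostA"})
print(chosen.name if chosen else "no host available")
```

The point of the sketch is that no host is named as the failover target
in advance; the choice is made from whatever is up when the failure
happens.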
> 
> > Node B would be monitoring the
> > heartbeats for the services it is in charge of and take over when
> > needed. In case host B crashes, engine would choose a different host
> > to be the failover node (note that there can be more than 2 nodes
> > with a predefined order of priority).
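
The ordered-failover scheme described above could be sketched roughly as
follows.  This is only an illustration under assumptions: the
FailoverList class, its method names and the heartbeat timeout are
hypothetical, heartbeats are tracked per host rather than per service to
keep it short, and timestamps are passed in explicitly instead of being
read from a clock.

```python
class FailoverList:
    """Hosts for one service, in a predefined order of priority."""

    def __init__(self, service, hosts_by_priority, timeout_s):
        self.service = service
        self.hosts = list(hosts_by_priority)
        self.timeout_s = timeout_s
        self.last_beat = {}  # host name -> time of last heartbeat

    def heartbeat(self, host, now):
        # Record that a heartbeat from this host arrived at time `now`.
        self.last_beat[host] = now

    def alive(self, host, now):
        seen = self.last_beat.get(host)
        return seen is not None and now - seen <= self.timeout_s

    def active_host(self, now):
        # The service belongs to the highest-priority host still alive.
        for host in self.hosts:
            if self.alive(host, now):
                return host
        return None


fl = FailoverList("db", ["hostA", "hostB", "hostC"], timeout_s=10)
for h in ("hostA", "hostB", "hostC"):
    fl.heartbeat(h, now=0)
print(fl.active_host(now=5))

# hostA goes silent; hostB and hostC keep sending heartbeats.
fl.heartbeat("hostB", now=12)
fl.heartbeat("hostC", now=12)
print(fl.active_host(now=15))
```

With more than two entries in the list, losing the current failover node
simply moves responsibility to the next live host in priority order.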
> 
> Agree with this... Sort of what I said above, the DPE could run on HostB
> to monitor stuff running on Hosts A and C (for the case where there are
> multiple VMs across different hosts in an application group).  And if
> the DPE or HostB fails, then the CPE would respawn a new DPE on a new
> host.
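
A rough sketch of the respawn behaviour described above, with one DPE
per application group and the CPE re-placing DPEs when their host fails.
This is not the pcmk-cloud API: the CloudPolicyEngine class, spawn_dpe,
host_failed and the placement rule (live host carrying the fewest DPEs)
are all assumptions for illustration.

```python
class CloudPolicyEngine:
    """Toy CPE that tracks which host runs the DPE for each group."""

    def __init__(self, hosts):
        self.live_hosts = set(hosts)
        self.dpe_host = {}  # application group -> host running its DPE

    def _dpe_count(self, host):
        return sum(1 for h in self.dpe_host.values() if h == host)

    def spawn_dpe(self, group):
        if not self.live_hosts:
            raise RuntimeError("no live host left to run a DPE")
        # Assumed placement rule: fewest DPEs, host name as tie-break.
        host = min(sorted(self.live_hosts), key=self._dpe_count)
        self.dpe_host[group] = host
        return host

    def host_failed(self, host):
        # Drop the host, then respawn every DPE that was running on it.
        self.live_hosts.discard(host)
        orphaned = [g for g, h in self.dpe_host.items() if h == host]
        return {g: self.spawn_dpe(g) for g in orphaned}


cpe = CloudPolicyEngine(["hostA", "hostB", "hostC"])
cpe.spawn_dpe("web")
cpe.spawn_dpe("db")
print(cpe.host_failed("hostB"))
```

In a real deployment the CPE itself would also need to survive failures;
the sketch ignores that and only shows the respawn step.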
> 
> I think Pacemaker Cloud could fit the paradigm you're looking for here.
> But it will require a little integration work.  On the other hand, if
> you are looking to keep this more Java oriented or very tightly
> integrated with the oVirt codebase, then you could probably take
> concepts similar to what has already been done in pcmk-cloud and
> re-implement them.
> 
> Either way works.  We'd be happy to assist either with integration of
> pcmk-cloud here or with general advice on HA as you work on the Java
> implementation.
> 
> Perry
> 
> 
> [1] This daemon right now is called the DPE for Deployable Policy
>     Engine, since in the Aeolus terminology a Deployable was a set of
>     VMs that were coordinated to run an application.  For example, 2
>     VMs, one running a database and the other running a web server.
> 
>     Aeolus terminology has changed and 'Deployable' is no longer used
>     to describe this.  Instead this is called an Application Set/Group.
> 
>     Because pcmk-cloud adopted Aeolus terminology and the Deployable
>     term is not really well known, we're probably going to rename the
>     DPE to be "Cloud Application Policy Engine" or CAPE.
>