Some thoughts on enhancing High Availability in oVirt

Steven Dake sdake at redhat.com
Wed Feb 15 15:37:49 UTC 2012


On 02/14/2012 11:32 PM, Ayal Baron wrote:
> 
> 
> ----- Original Message -----
>>> I'm not sure I agree.
>>> This entire thread assumes that the way to do this is to have the
>>> engine continuously monitor all services on all (HA) guests and
>>> according to varying policies reschedule VMs (services within VMs?)
>>
>> That's one interpretation of what I wrote, but not the only one.
>>
>> Pacemaker Cloud doesn't rely on a single process (like oVirt
>> Engine) to monitor all VMs and the services in those VMs.  It
>> relies on spawning a monitor process for each logical grouping of
>> VMs in an 'application group'.
>>
>> So the engine doesn't need to continuously monitor every VM and
>> every service; it delegates to the Cloud Policy Engine (CPE),
>> which in turn creates a daemon (DPE[1]) to monitor each
>> application group.
> 
> Where is the daemon spawned? On the engine or in a distributed fashion? If the latter, then Drools is irrelevant; if the former, it would just make things worse (scalability-wise).
> 

Ayal,

CPE (the cloud policy engine, responsible for starting/stopping cloud
application policy engines and providing an API for third party
control) runs on the same machine as the CAPE (aka DPE) processes.
Each CAPE (cloud application policy engine) is responsible for
maintaining the availability of the resources and virtual machines in
one cloud application, including recovery escalation, ordering
constraints, fault detection, fault isolation, and instantiation of
VMs.  This collection of software components could be collocated with
the engine, or run on a separate machine entirely, since the project
provides an API for third party projects.
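
To make the division of labor concrete, here is a rough sketch of a
CPE-style parent spawning one policy-engine process per cloud
application.  This is not the actual pacemaker-cloud code; the "cape"
binary name and "--app" option are placeholders I made up:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

#define MAX_APPS 1024

static pid_t cape_pids[MAX_APPS];

/* Spawn a dedicated monitor process for one cloud application. */
static int spawn_cape(int app_index, const char *app_id)
{
        pid_t pid = fork();

        if (pid < 0) {
                perror("fork");
                return -1;
        }
        if (pid == 0) {
                /* Child: becomes the per-application policy engine. */
                execlp("cape", "cape", "--app", app_id, (char *)NULL);
                perror("execlp");       /* only reached if exec fails */
                _exit(1);
        }
        cape_pids[app_index] = pid;     /* parent keeps a handle so it
                                           can stop/restart this one
                                           application independently */
        return 0;
}

int main(void)
{
        const char *apps[] = { "webtier", "dbtier" };

        for (int i = 0; i < 2; i++)
                spawn_cape(i, apps[i]);
        return 0;
}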

One thing that may not be entirely clear is that there is a separate
DPE process for each cloud application (which could be monitoring
several hundred VMs for a large application).  This converts the
inherent inability of any single policy engine to scale to large
object counts into a kernel scheduling problem and a memory
consumption problem (the kernel.org scheduler rocks, and memory is
cheap).
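
For illustration only, each per-application monitor loop has roughly
this shape.  The check_vm()/recover_vm() helpers below are stand-ins
rather than the real CAPE internals; the point is that each process
only walks its own application's VMs, never the whole cloud:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define VM_COUNT 200    /* one application's VMs, not the cloud's */

/* Placeholder health check; the real CAPE talks to its resources. */
static bool check_vm(int vm_id)
{
        (void)vm_id;
        return true;
}

/* Placeholder recovery action; escalation policy would live here. */
static void recover_vm(int vm_id)
{
        printf("restarting vm %d\n", vm_id);
}

int main(void)
{
        for (;;) {
                for (int vm = 0; vm < VM_COUNT; vm++) {
                        if (!check_vm(vm))
                                recover_vm(vm);
                }
                sleep(5);       /* mostly idle between polling passes */
        }
}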

The CAPE processes could trivially be spawned in a distributed
fashion if/when we run into scaling problems on a single node.  No
sense optimizing for a condition that may not be relevant.

One intentional aspect of our project is its focus on reliability.
Our CAPE process is approximately 2 kloc.  That very small code
footprint is designed to be easy to "get right", versus a huge
monolithic code base, which increases the possible failure scenarios.

As a short note about scalability, my laptop can run 1000 CAPE
processes with 1% total CPU utilization (measured with top) and 5 GB
of memory utilization (measured with free).  The design's upper limit
on scale is set by a) the limitations of kernel scheduling and b) the
memory consumption of the CAPE processes.
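
If you want to reproduce that kind of measurement yourself, a
throwaway sketch like the following (with idle children standing in
for real CAPE processes) lets you watch top and free while 1000
processes sit parked in the scheduler:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NPROCS 1000

int main(void)
{
        for (int i = 0; i < NPROCS; i++) {
                pid_t pid = fork();

                if (pid < 0) {
                        perror("fork");
                        exit(1);
                }
                if (pid == 0) {
                        for (;;)
                                sleep(60);      /* idle child */
                }
        }
        printf("%d children running; check top/free, Ctrl-C to stop\n",
               NPROCS);
        while (wait(NULL) > 0)
                ;
        return 0;
}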

Regards
-steve


