Some thoughts on enhancing High Availability in oVirt

Steven Dake sdake at redhat.com
Fri Feb 17 00:29:14 UTC 2012


On 02/16/2012 09:14 AM, Ayal Baron wrote:
> 
> 
> ----- Original Message -----
>> On 02/14/2012 11:32 PM, Ayal Baron wrote:
>>>
>>>
>>> ----- Original Message -----
>>>>> I'm not sure I agree.
>>>>> This entire thread assumes that the way to do this is to have the
>>>>> engine continuously monitor all services on all (HA) guests and,
>>>>> according to varying policies, reschedule VMs (services within
>>>>> VMs?).
>>>>
>>>> That's one interpretation of what I wrote, but not the only one.
>>>>
>>>> Pacemaker Cloud doesn't rely on a single process (like oVirt
>>>> Engine) to monitor all VMs and the services in those VMs.  It
>>>> relies on spawning a monitor process for each logical grouping of
>>>> VMs in an 'application group'.
>>>>
>>>> So the engine doesn't need to continuously monitor every VM and
>>>> every service; it delegates to the Cloud Policy Engine (CPE), which
>>>> in turn creates a daemon (DPE[1]) to monitor each application group.
>>>
>>> Where is the daemon spawned? On the engine or in a distributed
>>> fashion? If the latter, then Drools is irrelevant.  If the former,
>>> then it would just make things worse (scalability-wise).
>>>
>>
>> Ayal,
>>
>> CPE (cloud policy engine - responsible for starting and stopping
>> cloud application policy engines, and providing an API for third-party
>> control) runs on the same machine as the CAPE (aka DPE) (cloud
>> application policy engine - responsible for maintaining the
>> availability of the resources and virtual machines in one cloud
>> application, including recovery escalation, ordering constraints,
>> fault detection, fault isolation, and instantiation of VMs).  This
>> collection of software components could be collocated with the engine,
>> or run on a separate machine entirely, since the project provides an
>> API to third-party projects.
>>
>> One thing that may not be entirely clear is that there is a new DPE
>> process for each cloud application (which could be monitoring several
>> hundred VMs for a large application).  This converts the inherent
>> inability of any single policy engine to scale to large object counts
>> into a kernel scheduling and memory consumption problem (the
>> kernel.org scheduler rocks, and memory is cheap).
>>
>> The CAPE processes could be spawned in a distributed fashion very
>> trivially, if/when we run into scaling problems with a single node.
>> No sense optimizing for a condition that may not be relevant.
>>
>> One intentional focus of our project is reliability.  Our CAPE
>> process is approximately 2 kloc.  Its very small code footprint is
>> designed to be easy to "get right", versus a huge monolithic code
>> base, which increases the possible failure scenarios.
>>
>> As a short note about scalability, my laptop can run 1000 CAPE
>> processes with 1% total CPU utilization (measured with top) and 5 GB
>> of memory utilization (measured with free).  The design's upper limit
>> on scale is based upon (a) the limitations of kernel scheduling and
>> (b) the memory consumption of the CAPE process.
> 
> But they all schedule the services to run on the same set of resources (hosts / memory / CPU), so how do you coordinate?
> 

Ayal,

The Pacemaker Cloud model is based upon the assumption that if a VM is
requested to be started, it will start (or fail to start, in which case
recovery will be executed); in other words, resources are treated as
unlimited.  This is the model in public clouds.  We have had some
interest in scheduling VM startup on specific hosts, but we don't know
of specific APIs for gathering the information needed to decide where
to place specific VMs.  No project currently solves this scheduling
problem.  I believe part of the reason is that there is no standardized
method for gathering topology information about the VM infrastructure.
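
To make that assumption concrete, here is a rough sketch of the loop one
CAPE runs for a single application group (Python, purely for
illustration; the real CAPE is a small C program, and start_vm, stop_vm,
and is_healthy below are hypothetical stand-ins for whatever
infrastructure API actually launches and probes the VMs):

    import time

    def start_vm(vm_id):
        print("requesting start of", vm_id)
        return True          # assume the cloud can always satisfy a start request

    def stop_vm(vm_id):
        print("requesting stop of", vm_id)

    def is_healthy(vm_id):
        return True          # real code would probe the VM and its resources

    def monitor_application_group(vm_ids, poll_interval=5.0):
        """One loop per cloud application: no placement decisions, only
        fault detection, fault isolation, and recovery by re-requesting
        a start."""
        for vm in vm_ids:
            start_vm(vm)
        while True:
            for vm in vm_ids:
                if not is_healthy(vm):   # fault detection
                    stop_vm(vm)          # fault isolation
                    start_vm(vm)         # recovery is simply another start request
            time.sleep(poll_interval)

The point is that recovery is just another start request; no placement
logic appears anywhere in the loop.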

However, the objective of finding an appropriate host/memory/CPU on
which to instantiate a VM is orthogonal to the objective of providing
high availability, recovery escalation, and ordering guarantees for the
virtual machines and resources[1] that compose a cloud application.
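
To be explicit about what I mean by recovery escalation and ordering
constraints, here is an equally hypothetical sketch (illustrative only;
start_resource, the resource names, and the escalation threshold are
invented for this example, the start_vm/stop_vm stubs from the sketch
above are reused, and none of this is the project's actual policy
format):

    def start_resource(vm_id, name):
        print("starting resource", name, "in", vm_id)

    ORDERED_RESOURCES = ["database", "webserver"]    # start order; stop is reversed
    MAX_RESOURCE_RESTARTS = 3
    failure_counts = {name: 0 for name in ORDERED_RESOURCES}

    def recover(vm_id, failed_resource):
        """Restart a failed resource; escalate to a VM restart if it
        keeps failing."""
        failure_counts[failed_resource] += 1
        if failure_counts[failed_resource] > MAX_RESOURCE_RESTARTS:
            stop_vm(vm_id)                           # escalation: recycle the whole VM
            start_vm(vm_id)
            for name in ORDERED_RESOURCES:           # honour ordering on the way back up
                start_resource(vm_id, name)
            failure_counts[failed_resource] = 0
        else:
            start_resource(vm_id, failed_resource)   # try the cheap recovery first

The particular threshold is per-deployment policy; the mechanism of
escalating from resource recovery to VM recovery is what matters.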

Regards
-steve

[1] the term "resource" here indicates an individual component
     application, such as Apache httpd.



