Some thoughts on enhancing High Availability in oVirt

Wed Feb 15 06:55:46 UTC 2012

On 02/15/2012 03:41 AM, Perry Myers wrote:
>> As long as you expect the VM to enforce reliability on the raw
>> storage devices then you are going to have problems with restarting
>> HA VMs. If you switch your thinking to making the storage operations
>> HA, then all you need is a response cache.
>>
>> A restarted VM replays the operation, and the cached response is
>> retransmitted (or the operation is benignly re-applied). Without
>> defining the operations so that they can be benignly re-applied or
>> adding a response cache you will always be able to come up with some
>> order of failure that won't work. There is no cost-effective way to
>> guarantee that you snapshot the VM only when there is no in-flight
>> storage activity.
>
> How is this any different than a bare metal host crashing while writes
> are in flight either to a local disk or FC disk?  When something crashes
> (be it physical or virtual) you're always going to lose some data that
> was in flight but not committed to disk (network has same issue).  It's
> up to individual applications to be resilient to this.
>
> I think this issue is somewhat orthogonal to simply providing reduced
> MTTR by restarting failed services or VMs.

Don't you fence the other node first, to make sure it won't write after
you have started another one?
Here we are talking about moving the VM without fencing the host.
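
For reference, the response cache idea quoted above could look roughly like the sketch below. It is only an illustration under assumptions that are not in the thread: the class name CachedStorage, the operation ID format, and the in-memory dictionaries are all hypothetical, and a real cache would have to live on the storage side and survive the VM restart.

```python
class CachedStorage:
    """Toy block store that remembers the response for each operation ID.

    Hypothetical sketch: names and structure are illustrative only.
    """

    def __init__(self):
        self.blocks = {}      # block number -> data
        self.responses = {}   # operation ID -> response already sent
        self.applied = 0      # number of operations actually applied

    def write(self, op_id, block, data):
        # A replayed operation gets the cached response back;
        # the write itself is not applied a second time.
        if op_id in self.responses:
            return self.responses[op_id]
        self.blocks[block] = data
        self.applied += 1
        response = ("ok", block, len(data))
        self.responses[op_id] = response
        return response


store = CachedStorage()
first = store.write("vm1-op-7", 42, b"payload")
# The restarted VM replays the same operation after the failure.
replay = store.write("vm1-op-7", 42, b"payload")
```

Note that this only makes replay from a restarted VM safe. It does nothing about a second, unfenced copy of the VM issuing new writes of its own, which is the case I am asking about.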