Some thoughts on enhancing High Availability in oVirt

Tue Feb 14 22:28:28 UTC 2012

On 02/14/2012 08:32 PM, Itamar Heim wrote:
> On 02/14/2012 06:31 PM, Adam Litke wrote:
>> On Thu, Feb 09, 2012 at 11:45:09AM -0500, Perry Myers wrote:
>>> warning: tl;dr
>>>
>>> Right now, HA in oVirt is limited to VM level granularity. Each VM
>>> provides a heartbeat through vdsm back to the oVirt Engine. If that
>>> heartbeat is lost, the VM is terminated and (if the user has configured
>>> it) the VM is relaunched. If the host running that VM has lost its
>>> heartbeat, the host is fenced (via a remote power operation) and all HA
>>> VMs are restarted on an alternate host.
>>>
>>
>> Has anyone considered how live snapshots and live block copy will intersect HA
>> to provide a better end-user experience? For example, will we be able to handle
>> a storage connection failure without power-cycling VMs by migrating storage to a
>> failover storage domain and/or live-migrating the VM to a host with functioning
>> storage connections?

Not sure I get the scope here - if the storage is dead, the VM won't be
able to copy the storage to its new destination. There is only one
theoretical case where it could work - with shared storage, if one host
has a dead HBA/NIC/link/port, maybe some other host will still be able
to access the storage. It seems like a long shot to me. Moreover, not
all of the guest I/O will have reached the storage prior to the
migration. Even w/ O_DIRECT there are still various caches around; some
belong to the VM, some may hold metadata for image files. In most cases
we won't be able to switch to another host w/o writing that data out
first.
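
To make the condition concrete, here is a rough Python sketch of the
reasoning above. It is only an illustration: the class, its fields and
the function name are all hypothetical and are not vdsm, libvirt or QEMU
APIs.

```python
from dataclasses import dataclass


@dataclass
class GuestStorageState:
    # All fields are hypothetical, for illustration only.
    source_can_reach_storage: bool  # source host's HBA/NIC/link/port still works
    target_can_reach_storage: bool  # another host can still see the shared storage
    unflushed_guest_writes: int     # guest I/O still sitting in caches on the source


def migration_is_safe(state: GuestStorageState) -> bool:
    """Return True only if moving the VM would not lose guest data."""
    if not state.target_can_reach_storage:
        # The storage itself is dead: there is nowhere to migrate to.
        return False
    if state.unflushed_guest_writes == 0:
        # Nothing is pending, so the target would see a consistent image.
        return True
    # Pending writes must be flushed first, and that needs a working path
    # from the source host, which is what is broken in the failure case.
    return state.source_can_reach_storage
```

With a dead path on the source and any pending writes, the function
returns False, which is the common case I am worried about.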
>
> cc'ing Dor - iirc, he mentioned an issue with live migrating a guest
> post an IO error

In short, while it may be theoretically possible in rare cases, I would
rather not rely on it. It seems that the 'average' storage array/HBA is
more stable than the live migration + I/O error path...

Cheers,
Dor