[Users] Testing High Availability and Power outages
Alexandru Vladulescu
avladulescu at bfproject.ro
Fri Jan 11 12:47:38 UTC 2013
Hi,
Today I started testing the High Availability features and the fencing
mechanism on my oVirt 3.1 installation (from the dreyou repos), running
on 3 x CentOS 6.3 hypervisors.
As I reported yesterday in a previous email thread, the migration
priority queue cannot be increased in this version (bug), so I decided
to test what the official documentation says about the High
Availability cases.
This is the disaster scenario to worry about: one hypervisor suffers a
power outage or hardware failure and the VMs running on it do not
migrate to other spare resources.
The official documentation on ovirt.org states the following:
High availability

Allows critical VMs to be restarted on another host in the event of
hardware failure with three levels of priority, taking into account
resiliency policy.

* Resiliency policy to control high availability VMs at the cluster
  level.
* Supports application-level high availability with supported fencing
  agents.
As well as in the Architecture description:
High Availability - restart guest VMs from failed hosts automatically
on other hosts
So the testing went like this: one VM running a Linux box, with the
"High Available" check box ticked and "Priority for Run/Migration
queue:" set to Low. For host placement, "Any Host in Cluster" is
selected, and "Allow VM migration only upon Admin specific request" is
left unchecked.
My environment:
Configuration: 2 x hypervisors (same cluster/hardware configuration);
1 x hypervisor, also acting as a NAS (NFS) server (different
cluster/hardware configuration).
Actions: Cut the power to one of the hypervisors in the two-node
cluster while the VM was running on it. This simulates a power outage.
Results: The hypervisor node that suffered the outage shows up in the
Hosts tab with status Non Responsive, and the VM has a question mark
and cannot be powered off or managed in any way (it is stuck).
In the Log console in the GUI, I get:
Host Hyper01 is non-responsive.
VM Web-Frontend01 was set to the Unknown status.
There is nothing I could do besides clicking "Confirm Host has been
rebooted" on Hyper01, after which the VM starts on Hyper02 with a cold
reboot.
The Log console changes to:
Vm Web-Frontend01 was shut down due to Hyper01 host reboot or manual fence
All VMs' status on Non-Responsive Host Hyper01 were changed to 'Down' by
admin@internal
Manual fencing for host Hyper01 was started.
VM Web-Frontend01 was restarted on Host Hyper02
I would like your take on this problem. Reading the documentation &
features pages on the official website, I supposed this would have been
an automatic mechanism, working through some sort of vdsm & engine
fencing action. Am I missing something here?
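For completeness, the first thing I will double-check on my side is
whether power management (the fence agent) is actually enabled on the
hosts, since I assume the automatic path depends on it. A quick,
untested sketch along the same lines (again, the power_management
element layout is my assumption):

    import requests
    import xml.etree.ElementTree as ET

    ENGINE = "https://engine.example.com/api"   # placeholder engine URL
    AUTH = ("admin@internal", "password")       # placeholder credentials

    # List all hosts and report whether power management is enabled.
    resp = requests.get(ENGINE + "/hosts", auth=AUTH, verify=False)
    resp.raise_for_status()

    for host in ET.fromstring(resp.content).findall("host"):
        pm = host.find("power_management")
        # Assuming <power_management type="..."> with an <enabled> child;
        # adjust the paths if the 3.1 schema differs.
        agent = pm.get("type") if pm is not None else "-"
        enabled = (pm.findtext("enabled") if pm is not None else None) or "not set"
        print(host.findtext("name"), agent, enabled)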
Thank you for your patience reading this.
Regards,
Alex.