[Users] Testing High Availability and Power outages

Doron Fediuck dfediuck at redhat.com
Sun Jan 13 02:56:57 EST 2013


----- Original Message -----

> From: "Alexandru Vladulescu" <avladulescu at bfproject.ro>
> To: "users" <users at ovirt.org>
> Sent: Friday, January 11, 2013 2:47:38 PM
> Subject: [Users] Testing High Availability and Power outages

> Hi,

> Today I started testing the High Availability features and the
> fencing mechanism on my oVirt 3.1 installation (from the dreyou
> repos), running on 3 x CentOS 6.3 hypervisors.

> As I reported yesterday in a previous email thread, the migration
> priority queue cannot be increased in this version (a bug), so I
> decided to test what the official documentation says about the
> High Availability cases.

> This is the disaster scenario to worry about: one hypervisor
> suffers a power outage or hardware failure, and the VMs running on
> it do not migrate to other spare resources.

> The official documentation on ovirt.org states the following:
> High availability

> Allows critical VMs to be restarted on another host in the event of
> hardware failure with three levels of priority, taking into account
> resiliency policy.

> * Resiliency policy to control high availability VMs at the cluster
> level.
> * Supports application-level high availability with supported fencing
> agents.

> As well as in the Architecture description:

> High Availability - restart guest VMs from failed hosts automatically
> on other hosts

> So the testing went like this -- one VM running a Linux box, with
> the "Highly Available" check box enabled and "Priority for
> Run/Migration queue:" set to Low. Under Host, "Any Host in
> Cluster" is selected, and "Allow VM migration only upon Admin
> specific request" is not checked.

> My environment:

> Configuration: 2 x hypervisors (same cluster/hardware
> configuration); 1 x hypervisor also acting as a NAS (NFS) server
> (different cluster/hardware configuration)

> Actions: I cut the power to one of the hypervisors in the 2-node
> cluster while the VM was running on it. This translates to a
> power outage.

> Results: the hypervisor node that suffered the outage shows up in
> the Hosts tab with status Non Responsive, and the VM has a
> question mark and cannot be powered off or touched at all (it is
> stuck).

> In the log console in the GUI, I get:

> Host Hyper01 is non-responsive.
> VM Web-Frontend01 was set to the Unknown status.

> There is nothing I could do besides clicking "Confirm Host has
> been rebooted" on Hyper01, after which the VM starts on Hyper02
> with a cold reboot.

> The Log console changes to:

> Vm Web-Frontend01 was shut down due to Hyper01 host reboot or manual
> fence
> All VMs' status on Non-Responsive Host Hyper01 were changed to 'Down'
> by admin at internal
> Manual fencing for host Hyper01 was started.
> VM Web-Frontend01 was restarted on Host Hyper02

> I would like your take on this problem. Having read the
> documentation & features pages on the official website, I assumed
> this would be an automatic mechanism driven by some sort of vdsm
> & engine fencing action. Am I missing something here?

> Thank you for your patience reading this.

> Regards,
> Alex.

> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

Hi Alex, 
Can you share with us the engine's log (/var/log/ovirt-engine/engine.log on the engine host) from the relevant time period? 
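
In the meantime, note that automatic recovery depends on fencing: the engine will only restart highly available VMs elsewhere after the failed host has been fenced, which requires a power management agent (IPMI, iLO, DRAC, etc.) to be configured for that host. Without one, the engine cannot prove the host is really down, so the VMs stay in Unknown until an admin clicks "Confirm Host has been rebooted". A minimal illustrative sketch of that decision (plain Python, not oVirt code; all names are invented):

```python
# Illustrative sketch (NOT oVirt code): why HA VMs on a non-responsive
# host stay "Unknown" until the host is fenced, either automatically
# (power management configured) or manually via
# "Confirm Host has been rebooted".

def handle_non_responsive(vms, pm_configured, manual_confirm=False):
    """Return the resulting state of each VM after its host stops responding."""
    # The host counts as fenced only if the engine can power-cycle it
    # through a fence agent, or an admin asserts it is really down.
    fenced = pm_configured or manual_confirm
    if fenced:
        # Safe to restart: the old copies of the VMs cannot still be
        # writing to shared storage.
        return {vm: "restarted on another host" for vm in vms}
    # Restarting now could split-brain the VM if the host is actually
    # alive but merely unreachable, so the engine leaves the VMs in limbo.
    return {vm: "Unknown" for vm in vms}

# No power management and no manual confirmation: the VM is stuck.
print(handle_non_responsive(["Web-Frontend01"], pm_configured=False))
# After the admin clicks "Confirm Host has been rebooted":
print(handle_non_responsive(["Web-Frontend01"], pm_configured=False,
                            manual_confirm=True))
```

That is why the restart only happened after your manual confirmation: with no fence agent configured, the manual click is the only safe signal the engine has.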

Doron 

