From: "Alexandru Vladulescu" <avladulescu@bfproject.ro>
To: "Doron Fediuck" <dfediuck@redhat.com>
Cc: "users" <users@ovirt.org>
Sent: Sunday, January 13, 2013 10:46:41 AM
Subject: Re: [Users] Testing High Availability and Power outages

Dear Doron,

I haven't collected the logs from the tests, but I would gladly redo the test case and get back to you asap.

This feature is the main reason I chose oVirt over other virtualization environments in the first place.

Could you please tell me which logs I should focus on besides the engine log: the VDSM logs, perhaps, or other relevant ones?

Regards,

Alex

--

Sent from phone.

On 13.01.2013, at 09:56, Doron Fediuck <dfediuck@redhat.com> wrote:

From: "Alexandru Vladulescu" <avladulescu@bfproject.ro>
To: "users" <users@ovirt.org>
Sent: Friday, January 11, 2013 2:47:38 PM
Subject: [Users] Testing High Availability and Power outages

Hi,

Today I started testing the High Availability features and the fencing mechanism on my oVirt 3.1 installation (from the dreyou repos), which runs on 3 x CentOS 6.3 hypervisors.

Since the migration priority queue cannot be increased in the current version (a bug, as I reported yesterday in a previous email thread), I decided to test what the official documentation says about the High Availability cases.

It would be a disaster scenario if one hypervisor had a power outage or hardware problem and the VMs running on it did not migrate to other spare resources.

The official documentation on ovirt.org states the following:

High availability

Allows critical VMs to be restarted on another host in the event of hardware failure with three levels of priority, taking into account resiliency policy.

Resiliency policy to control high availability VMs at the cluster level.

Supports application-level high availability with supported fencing agents.

The Architecture description adds:

High Availability - restart guest VMs from failed hosts automatically on other hosts

The test went like this: one VM running Linux, with the "High Available" check box ticked and "Priority for Run/Migration queue:" set to Low. Under Host, "Any Host in Cluster" is selected, and "Allow VM migration only upon Admin specific request" is not checked.

My environment:

Configuration: 2 x hypervisors (same cluster/hardware configuration); 1 x hypervisor also acting as a NAS (NFS) server (different cluster/hardware configuration)

Actions: I cut the power to one of the hypervisors in the 2-node cluster while the VM was running on it. This simulates a power outage.

Results: The hypervisor node that suffered the outage shows a Status of Non Responsive in the Hosts tab, and the VM has a question mark and cannot be powered off or otherwise acted on (so it is stuck).

In the Log console in the GUI, I get:

Host Hyper01 is non-responsive.
VM Web-Frontend01 was set to the Unknown status.

There is nothing I could do besides clicking "Confirm Host has been rebooted" on Hyper01; afterwards the VM starts on Hyper02 with a cold reboot.

The Log console changes to:

Vm Web-Frontend01 was shut down due to Hyper01 host reboot or manual fence
All VMs' status on Non-Responsive Host Hyper01 were changed to 'Down' by admin@internal
Manual fencing for host Hyper01 was started.
VM Web-Frontend01 was restarted on Host Hyper02

I would like your take on this problem. From reading the documentation and feature pages on the official website, I assumed this would be an automatic mechanism driven by some sort of VDSM and engine fencing action. Am I missing something?

Thank you for your patience reading this.

Regards,
Alex.

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users

Hi Alex,
Can you share with us the engine's log from the relevant time period?

Doron

Hi Alex,
The engine log is the important one, as it will show the decision-making process.
VDSM logs should be kept in case something is unclear, but I suggest we begin with
engine.log.
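
To narrow engine.log down to the entries around the test before sharing it, a small Python sketch like the one below may help. It makes no assumption about the log's line format and only does a case-insensitive keyword match. The function names are hypothetical, and the keywords are simply the host and VM names from the GUI events quoted earlier in this thread.

```python
from typing import Iterable, Iterator, List, Sequence

# Host and VM names as they appear in the GUI events quoted in this thread.
# Adjust them to match your own environment.
KEYWORDS = ("Hyper01", "Hyper02", "Web-Frontend01")


def relevant_lines(lines: Iterable[str],
                   keywords: Sequence[str] = KEYWORDS) -> Iterator[str]:
    """Yield every line that mentions at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    for line in lines:
        text = line.lower()
        if any(k in text for k in lowered):
            yield line.rstrip("\n")


def extract(path: str, keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Read a log file and return only the lines matching the keywords."""
    # errors="replace" keeps the scan going if the file has odd bytes in it.
    with open(path, errors="replace") as handle:
        return list(relevant_lines(handle, keywords))
```

Calling extract() with the path to your copy of engine.log returns the matching lines in their original order, so the sequence of engine decisions stays readable. Stack traces that do not repeat the host or VM name will be dropped by this filter, so keep the full log at hand as well.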