Dear Doron,
I have retested the case and am writing to you with the results.
Furthermore, in case this information is useful to you, my network
setup is the following: 2 Layer 2 switches (Zyxel ES2108-G & ES2200-8)
configured with 2 VLANs (1 inside backbone network -- added as br0 to
oVirt; 1 outside network -- running on the ovirtmgmt interface for
Internet traffic to the VMs). The backbone switch is gigabit-capable,
and each host runs with a jumbo frame setup. There is one more
firewall server that routes the subnets through a trunk port and VLAN
configuration. The oVirt software has been set up on the backbone
network subnet.
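For what it's worth, here is the quick check I run on each host to
confirm the jumbo frame setup (a minimal sketch; the expected MTU of
9000 is an assumption, and checking only br0/ovirtmgmt is my choice):

#!/usr/bin/env python
# Minimal sketch: confirm that the oVirt bridges carry a jumbo-frame
# MTU on this host, by reading Linux sysfs. The expected value of 9000
# is an assumption; adjust it to whatever the switches are set to.
EXPECTED_MTU = 9000

for iface in ("br0", "ovirtmgmt"):
    try:
        with open("/sys/class/net/%s/mtu" % iface) as f:
            mtu = int(f.read().strip())
    except IOError:
        print("%s: interface not found" % iface)
        continue
    state = "OK" if mtu >= EXPECTED_MTU else "NOT JUMBO"
    print("%s: mtu=%d (%s)" % (iface, mtu, state))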
As you can guess, the network infrastructure is not the problem here.
The test case was the same as described before:
1. VM running on Hyper01, none on Hyper02. The VM had the High
Available check box enabled.
2. Hard power-off of Hyper01 by cutting mains power (no soft/manual
shutdown).
3. After a while, oVirt marks Hyper01 as Non Responsive.
4. Manually clicked on Confirm host reboot, and the VM starts on
Hyper02 after oVirt's manual fence of Hyper01.
I have attached the engine log. The Confirm Host reboot was done at
precisely 21:31:45. In the Cluster section in oVirt, I did try
changing the "Resilience Policy" attribute from "Migrate Virtual
Machines" to "Migrate only Highly Available Virtual Machines", but
with the same results.
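To save you some scrolling, this is roughly how I pulled the relevant
window out of engine.log (a minimal sketch; it assumes the default
"YYYY-MM-DD HH:MM:SS,mmm" timestamp prefix on each line, and the date
below is a placeholder -- only the 21:31:45 fence time is from the test):

#!/usr/bin/env python
# Minimal sketch: print engine.log lines within a few minutes of the
# manual fence. Assumes each line starts with the default
# "YYYY-MM-DD HH:MM:SS,mmm" timestamp; untimestamped continuation
# lines (e.g. stack traces) are kept while the window is open.
from datetime import datetime, timedelta

FENCE_TIME = datetime(2013, 1, 13, 21, 31, 45)  # date is a placeholder
WINDOW = timedelta(minutes=5)

in_window = False
for line in open("engine.log"):
    try:
        ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        if in_window:  # continuation of an in-window entry
            print(line.rstrip())
        continue
    in_window = abs(ts - FENCE_TIME) <= WINDOW
    if in_window:
        print(line.rstrip())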
Judging from the engine log, the node controller sees the Hyper01
node as having a "network fault" (no route to host), although the
host was in fact powered off.
Is this supposed to be the default behavior in this case, given that
the scenario might overlap with a real network outage?
My Regards,
Alex.
On 01/13/2013 10:54 AM, Doron Fediuck wrote:
------------------------------------------------------------------------
*From: *"Alexandru Vladulescu" <avladulescu(a)bfproject.ro>
*To: *"Doron Fediuck" <dfediuck(a)redhat.com>
*Cc: *"users" <users(a)ovirt.org>
*Sent: *Sunday, January 13, 2013 10:46:41 AM
*Subject: *Re: [Users] Testing High Availability and Power outages
Dear Doron,
I haven't collected the logs from the tests, but I would gladly
redo the case and get back to you ASAP.
This feature is the main reason I chose oVirt in the first place
over other virtualization environments.
Could you please tell me which logs I should focus on besides the
engine log -- VDSM maybe, or other relevant logs?
Regards,
Alex
--
Sent from phone.
On 13.01.2013, at 09:56, Doron Fediuck <dfediuck(a)redhat.com> wrote:
------------------------------------------------------------------------
*From: *"Alexandru Vladulescu" <avladulescu(a)bfproject.ro
<mailto:avladulescu@bfproject.ro>>
*To: *"users" <users(a)ovirt.org
<mailto:users@ovirt.org>>
*Sent: *Friday, January 11, 2013 2:47:38 PM
*Subject: *[Users] Testing High Availability and Power outages
Hi,
Today I started testing the High Availability features and the
fencing mechanism on my oVirt 3.1 installation (from the dreyou
repos) running on 3 x CentOS 6.3 hypervisors.
Yesterday I reported in a previous email thread that the migration
priority queue cannot be increased (a bug) in the current version,
so I decided to test what the official documentation says about the
High Availability cases.
This would be a disaster scenario if one hypervisor has a power
outage or hardware problem and the VMs running on it do not migrate
to other spare resources.
In the official documentation from ovirt.org <http://ovirt.org> the
following is quoted:
/High availability/

/Allows critical VMs to be restarted on another host in
the event of hardware failure with three levels of
priority, taking into account resiliency policy./

* /Resiliency policy to control high availability VMs at
the cluster level./
* /Supports application-level high availability with
supported fencing agents./
As well as in the Architecture description:
/High Availability - restart guest VMs from failed hosts
automatically on other hosts/
So the testing went like this -- one VM running a Linux box, with
the "High Available" check box ticked and "Priority for
Run/Migration queue:" set to Low. Under Host, the "Any Host in
Cluster" option is selected, without "Allow VM migration only upon
Admin specific request" checked.
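For completeness, the same check boxes can presumably also be driven
from the oVirt Python SDK; a minimal sketch, assuming the 3.x
ovirt-engine-sdk package (the engine URL and credentials are
placeholders, and the Low=1 priority mapping is my assumption):

#!/usr/bin/env python
# Minimal sketch: enable the "High Available" flag with Low priority
# on a VM via the oVirt 3.x Python SDK (ovirt-engine-sdk). Engine URL
# and credentials below are placeholders for illustration.
from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url="https://engine.example.com/api",  # placeholder
          username="admin@internal",
          password="secret",
          insecure=True)  # skip certificate check (lab setup only)

vm = api.vms.get(name="Web-Frontend01")
# priority: Low is commonly 1 (Medium 50, High 100) -- an assumption
vm.set_high_availability(params.HighAvailability(enabled=True, priority=1))
vm.update()
api.disconnect()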
My environment:
Configuration: 2 x hypervisors (same cluster/hardware configuration);
1 x hypervisor also acting as a NAS (NFS) server (different
cluster/hardware configuration).
Actions: Cut off the power to one of the hypervisors in the 2-node
cluster while the VM was running on it. This translates to a power
outage.
Results: The hypervisor node that suffered the outage shows in the
Hosts tab with status Non Responsive, and the VM has a question mark
and cannot be powered off or managed at all (therefore it's stuck).
In the Log console in the GUI, I get:
Host Hyper01 is non-responsive.
VM Web-Frontend01 was set to the Unknown status.
There is nothing I could do besides clicking on Hyper01's "Confirm
Host has been Rebooted"; afterwards the VM starts on Hyper02 with a
cold reboot of the VM.
The Log console changes to:
Vm Web-Frontend01 was shut down due to Hyper01 host reboot
or manual fence
All VMs' status on Non-Responsive Host Hyper01 were
changed to 'Down' by admin@internal
Manual fencing for host Hyper01 was started.
VM Web-Frontend01 was restarted on Host Hyper02
I would like your view on this problem. Reading the documentation &
features pages on the official website, I assumed this would be an
automatic mechanism driven by some sort of VDSM & engine fencing
action. Am I missing something here?
Thank you for your patience reading this.
Regards,
Alex.
_______________________________________________
Users mailing list
Users(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
Hi Alex,
Can you share with us the engine's log from the relevant time
period?
Doron
Hi Alex,
the engine log is the important one, as it will show the
decision-making process.
VDSM logs should be kept in case something is unclear, but I suggest
we begin with engine.log.