[Users] Testing High Availability and Power outages

Alexandru Vladulescu avladulescu at bfproject.ro
Mon Jan 14 11:50:41 UTC 2013


On 01/14/2013 10:13 AM, Doron Fediuck wrote:
>
>
> ------------------------------------------------------------------------
>
>     *From: *"Alexandru Vladulescu" <avladulescu at bfproject.ro>
>     *To: *"Doron Fediuck" <dfediuck at redhat.com>
>     *Cc: *"users" <users at ovirt.org>
>     *Sent: *Sunday, January 13, 2013 9:49:25 PM
>     *Subject: *Re: [Users] Testing High Availability and Power outages
>
>
>     Dear Doron,
>
>
>     I have now retested the case and am writing you the results.
>
>     In case this information is useful to you, my network setup is
>     the following: two Layer 2 switches (Zyxel ES2108-G & ES2200-8)
>     configured with 2 VLANs (one inside backbone network -- added as
>     br0 to oVirt; one outside network -- running on the ovirtmgmt
>     interface for Internet traffic to the VMs). The backbone switch is
>     gigabit capable, and each host runs with a jumbo frame setup.
>     There is one more firewall server that routes the subnets through
>     a trunk port and VLAN configuration. The oVirt software has been
>     set up on the backbone network subnet.
>
>     As you can guess, the network infrastructure is not the problem here.
>
>     The test case was the same as described before:
>
>     1. VM running on Hyper01, none on Hyper02; the "High Available"
>     check box was configured.
>     2. Powered off Hyper01 by hand at the power outlet (no soft/manual
>     shutdown).
>     3. After a while, oVirt marks Hyper01 as Non Responsive.
>     4. Manually clicked on "Confirm Host has been rebooted", and after
>     oVirt's manual fence of Hyper01 the VM starts on the Hyper02 host.
>
>     I have attached the engine log. The "Confirm Host has been
>     rebooted" action was done at exactly 21:31:45. In the cluster
>     section in oVirt, I did try changing the "Resilience Policy"
>     attribute from "Migrate Virtual Machines" to "Migrate only High
>     Available Virtual Machines", but with the same results.
>
>
>     As far as I can guess from the engine log, the node controller
>     sees the Hyper01 node as having a "network fault" (no route to
>     host), although the host was actually shut down.
>
>     Is this supposed to be the default behavior in this case? The
>     scenario might overlap with a real case of a network outage.
>
>
>     My Regards,
>     Alex.
>
>
>
>     On 01/13/2013 10:54 AM, Doron Fediuck wrote:
>
>
>
>         ------------------------------------------------------------------------
>
>             *From: *"Alexandru Vladulescu" <avladulescu at bfproject.ro>
>             *To: *"Doron Fediuck" <dfediuck at redhat.com>
>             *Cc: *"users" <users at ovirt.org>
>             *Sent: *Sunday, January 13, 2013 10:46:41 AM
>             *Subject: *Re: [Users] Testing High Availability and Power
>             outages
>
>             Dear Doron,
>
>             I haven't collected the logs from the tests, but I would
>             gladly redo the case and get back to you asap.
>
>             This feature is the main reason why I chose oVirt in the
>             first place over other virtualization environments.
>
>             Could you please tell me which logs I should be focusing
>             on besides the engine log -- vdsm, maybe, or other
>             relevant logs?
>
>             Regards,
>             Alex
>
>
>             --
>             Sent from phone.
>
>             On 13.01.2013, at 09:56, Doron Fediuck
>             <dfediuck at redhat.com <mailto:dfediuck at redhat.com>> wrote:
>
>
>
>                 ------------------------------------------------------------------------
>
>                     *From: *"Alexandru Vladulescu"
>                     <avladulescu at bfproject.ro
>                     <mailto:avladulescu at bfproject.ro>>
>                     *To: *"users" <users at ovirt.org
>                     <mailto:users at ovirt.org>>
>                     *Sent: *Friday, January 11, 2013 2:47:38 PM
>                     *Subject: *[Users] Testing High Availability and
>                     Power outages
>
>
>                     Hi,
>
>
>                     Today I started testing the High Availability
>                     features and the fence mechanism on my oVirt 3.1
>                     installation (from the dreyou repos) running on
>                     3 x CentOS 6.3 hypervisors.
>
>                     As I reported yesterday in a previous email
>                     thread, the migration priority queue cannot be
>                     increased in this version (bug), so I decided to
>                     test what the official documentation says about
>                     the High Availability cases.
>
>                     This would be a disaster scenario to suffer from:
>                     one hypervisor has a power outage/hardware problem
>                     and the VMs running on it do not migrate to other
>                     spare resources.
>
>
>                     The official documentation from ovirt.org
>                     <http://ovirt.org> quotes the following:
>
>
>                           /High availability/
>
>                     /Allows critical VMs to be restarted on another
>                     host in the event of hardware failure with three
>                     levels of priority, taking into account resiliency
>                     policy./
>
>                       * /Resiliency policy to control high
>                         availability VMs at the cluster level./
>                       * /Supports application-level high availability
>                         with supported fencing agents./
>
>
>                     As well as in the Architecture description:
>
>                     /High Availability - restart guest VMs from failed
>                     hosts automatically on other hosts/
>
>
>
>                     So the testing went like this -- one VM running a
>                     Linux box, with the "High Available" check box set
>                     and "Priority for Run/Migration queue:" set to
>                     Low. Under Host, the option is set to "Any Host in
>                     Cluster", without "Allow VM migration only upon
>                     Admin specific request" checked.
>
>
>
>                     My environment:
>
>
>                     Configuration: 2 x hypervisors (same
>                     cluster/hardware configuration); 1 x hypervisor
>                     also acting as a NAS (NFS) server (different
>                     cluster/hardware configuration)
>
>                     Actions: cut off the power to one of the
>                     hypervisors in the 2-node cluster while the VM was
>                     running on it. This would translate to a power
>                     outage.
>
>                     Results: the hypervisor node that suffered the
>                     outage is shown in the Hosts tab with Status "Non
>                     Responsive", and the VM has a question mark and
>                     cannot be powered off or anything else (therefore
>                     it's stuck).
>
>                     In the Log console in the GUI, I get:
>
>                     Host Hyper01 is non-responsive.
>                     VM Web-Frontend01 was set to the Unknown status.
>
>                     There is nothing I could do besides clicking on
>                     Hyper01's "Confirm Host has been rebooted";
>                     afterwards the VM starts on Hyper02 with a cold
>                     reboot of the VM.
>
>                     The Log console changes to:
>
>                     Vm Web-Frontend01 was shut down due to Hyper01
>                     host reboot or manual fence
>                     All VMs' status on Non-Responsive Host Hyper01
>                     were changed to 'Down' by admin at internal
>                     Manual fencing for host Hyper01 was started.
>                     VM Web-Frontend01 was restarted on Host Hyper02
>
>
>                     I would like your take on this problem. Reading
>                     the documentation & features pages on the official
>                     website, I supposed that this would have been an
>                     automatic mechanism relying on some sort of vdsm &
>                     engine fencing action. Am I missing something
>                     here?
>
>
>                     Thank you for your patience reading this.
>
>
>                     Regards,
>                     Alex.
>
>
>
>
>                     _______________________________________________
>                     Users mailing list
>                     Users at ovirt.org <mailto:Users at ovirt.org>
>                     http://lists.ovirt.org/mailman/listinfo/users
>
>                 Hi Alex,
>                 Can you share with us the engine's log from the
>                 relevant time period?
>
>                 Doron
>
>         Hi Alex,
>         the engine log is the important one, as it will show the
>         decision-making process.
>         The VDSM logs should be kept in case something is unclear,
>         but I suggest we begin with
>         engine.log.
>
>
> Hi Alex,
> In order to have HA working at the host level (which is what you're
> testing now) you need to configure power management for each of the
> relevant hosts (go to the Hosts main tab, right-click a host and choose
> Edit, then select the Power Management tab and you'll see it). In the
> details you gave us it's not clear how you defined power management for
> your hosts, so I can only assume it's not defined properly.
>
> The reason for this necessity is that we cannot resume a VM on a
> different host before we have verified the original host's status. If,
> for example, the VM is still running on the original host and we have
> only lost network connectivity to it, we are at risk of running the
> same VM on 2 different hosts at the same time, which will corrupt its
> disk(s). So the only way to prevent that is to reboot the original
> host, which ensures the VM is not running there. We call this reboot
> procedure fencing, and if you check your logs you'll be able to see:
>
> 2013-01-13 21:29:42,380 ERROR 
> [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] 
> (pool-3-thread-44) [a1803d1] Failed to run Fence script on 
> vds:Hyper01, VMs moved to UnKnown instead.
>
> So the only way for you to handle it is to confirm the host was
> rebooted (as you did), which will allow resuming the VM on a different
> host.
>
> Doron

Hi Doron,

Regarding your reply: I don't have such a fence mechanism through an
IMM or iLO interface, as the hardware I am using doesn't support such
IPMI technology. Seeing your response makes me consider the option of
getting an add-on card that will be able to do the basic reboot,
restart and reset functions for our hardware.
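
Once such a card is in place, my plan is to first sanity-check it from
the other hypervisor with the stock fence agent, before filling in the
Power Management tab you pointed me to. Something along the lines of
the small helper below (the address and credentials are placeholders
for the future card, and the fence_ipmilan flags should be
double-checked against the fence-agents package shipped with CentOS 6):

#!/usr/bin/env python
# Sanity check for a future IPMI fence device, before configuring it
# under Hosts -> Edit -> Power Management in oVirt.
# Address and credentials below are placeholders, not a real card.
import subprocess
import sys

BMC_ADDRESS = "192.168.1.250"   # placeholder BMC IP for the add-on card
BMC_USER = "admin"              # placeholder IPMI login
BMC_PASSWORD = "changeme"       # placeholder IPMI password

def fence_status():
    # fence_ipmilan comes from the fence-agents package that the oVirt
    # fence scripts rely on; "-o status" only queries the power state,
    # it does not reboot anything.
    cmd = ["fence_ipmilan",
           "-a", BMC_ADDRESS,
           "-l", BMC_USER,
           "-p", BMC_PASSWORD,
           "-o", "status"]
    return subprocess.call(cmd)

if __name__ == "__main__":
    rc = fence_status()
    if rc == 0:
        print("fence device answers, safe to configure it in the GUI")
    else:
        print("fence device did not answer (exit code %d)" % rc)
    sys.exit(rc)

If that reports the power status correctly, I will use the same
address, user and password in the Power Management tab.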

Thank you very much for your advice on this.

Alex
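
PS: just to check that I understood the split-brain reasoning, below is
my own simplified reading of the decision flow, written out as
illustrative Python -- obviously not the real engine code, only how I
read the behaviour you described:

# Illustrative only: my reading of the HA flow Doron described,
# not the actual oVirt engine implementation.

def handle_non_responsive_host(host, run_fence_script,
                               operator_confirmed_reboot):
    """Decide whether HA VMs from a non-responsive host may be restarted."""
    # The engine first tries to fence (reboot) the host, so that the
    # VMs cannot still be writing to their disks there.
    if run_fence_script(host):
        return "restart HA VMs on another host"

    # Fencing failed (e.g. no power management defined for the host):
    # the engine cannot tell a dead host from one that is merely
    # unreachable, so restarting the VMs elsewhere could end up with
    # the same disk written from two hosts at once.
    if operator_confirmed_reboot:
        # "Confirm Host has been rebooted" acts as manual fencing.
        return "restart HA VMs on another host"
    return "VMs left in Unknown status, wait for the operator"


def no_power_management(host):
    # Stand-in for my current setup: fencing always fails because the
    # hosts have no IPMI/iLO interface defined.
    return False


if __name__ == "__main__":
    # Without the manual confirmation the VM stays in Unknown; with it,
    # the VM is restarted on the other host -- matching what I saw.
    print(handle_non_responsive_host("Hyper01", no_power_management, False))
    print(handle_non_responsive_host("Hyper01", no_power_management, True))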

