[Users] Testing High Availability and Power outages
Doron Fediuck
dfediuck at redhat.com
Mon Jan 14 08:13:31 UTC 2013
----- Original Message -----
> From: "Alexandru Vladulescu" <avladulescu at bfproject.ro>
> To: "Doron Fediuck" <dfediuck at redhat.com>
> Cc: "users" <users at ovirt.org>
> Sent: Sunday, January 13, 2013 9:49:25 PM
> Subject: Re: [Users] Testing High Availability and Power outages
> Dear Doron,
> I had the case retested now and I am writing you the results.
> Furthermore, if this information should be useful for you, my network
> setup is the following: 2 Layer 2 (Zyxel es2108-g & ES2200-8)
> switches configured with 2 VLANs ( 1 inside backbone network --
> added br0 to Ovirt ; 1 outside network -- running on ovirtmgmt
> interface for Internet traffic to VMs). The backbone switch is a
> gigabit capable one, and each host runs on jumbo frame setup. There
> is one more firewall server that routes the subnets through trunking
> port and VLAN configuration. The Ovirt software has been setup with
> backbone network subnet.
> As you could guess the network infrastructure is not the problem
> here.
> The test case was the same as described before:
> 1. Vm running on Hyper01, none on Hyper02. Host had configured the
> High Available check box.
> 2. Power to Hyper01 cut by hand at the power supply (no soft/manual
> shutdown).
> 3. After a while, Ovirt marks Hyper01 as Non Responsive.
> 4. Manually clicked on Confirm host reboot, and the VM starts on the
> Hyper02 host after Ovirt's manual fence of Hyper01.
> I have attached the engine log. The Confirm Host reboot was done
> at precisely 21:31:45. In the cluster section in Ovirt, I did
> try changing the "Resilience Policy" attribute from "Migrate Virtual
> Machines" to "Migrate only High Available Virtual Machines", but with
> the same results.
> As far as I can tell from the engine log, the Node Controller sees the
> Hyper01 node as having a "network fault" (no route to host), although
> it was actually shut down.
> Is this supposed to be the default behavior in this case, given that
> the scenario might overlap with a real network outage?
> My Regards,
> Alex.
> On 01/13/2013 10:54 AM, Doron Fediuck wrote:
> > ----- Original Message -----
>
> > > From: "Alexandru Vladulescu" <avladulescu at bfproject.ro>
> >
>
> > > To: "Doron Fediuck" <dfediuck at redhat.com>
> >
>
> > > Cc: "users" <users at ovirt.org>
> >
>
> > > Sent: Sunday, January 13, 2013 10:46:41 AM
> >
>
> > > Subject: Re: [Users] Testing High Availability and Power outages
> >
>
> > > Dear Doron,
> >
>
> > > I haven't collected the logs from the tests, but I would gladly
> > > re-do
> > > the case and get back to you asap.
> >
>
> > > This feature is the main reason I chose to go with
> > > Ovirt in the first place, over other virt environments.
> >
>
> > > Could you please tell me which logs I should focus on besides
> > > the engine log: vdsm maybe, or other relevant logs?
> >
>
> > > Regards,
> >
>
> > > Alex
> >
>
> > > --
> >
>
> > > Sent from phone.
> >
>
> > > On 13.01.2013, at 09:56, Doron Fediuck < dfediuck at redhat.com >
> > > wrote:
> >
>
> > > > ----- Original Message -----
> > >
> >
>
> > > > > From: "Alexandru Vladulescu" < avladulescu at bfproject.ro >
> > > >
> > >
> >
>
> > > > > To: "users" < users at ovirt.org >
> > > >
> > >
> >
>
> > > > > Sent: Friday, January 11, 2013 2:47:38 PM
> > > >
> > >
> >
>
> > > > > Subject: [Users] Testing High Availability and Power outages
> > > >
> > >
> >
>
> > > > > Hi,
> > > >
> > >
> >
>
> > > > > Today, I started testing on my Ovirt 3.1 installation (from
> > > > > dreyou
> > > > > repos) running on 3 x Centos 6.3 hypervisors the High
> > > > > Availability
> > > > > features and the fence mechanism.
> > > >
> > >
> >
>
> > > > > > As I reported yesterday in a previous email thread, the
> > > > > > migration priority queue cannot be increased (bug) in the
> > > > > > current version, so I decided to test what the official
> > > > > > documentation says about the High Availability cases.
> > > >
> > >
> >
>
> > > > > > It would be a disaster scenario if one hypervisor had a
> > > > > > power outage/hardware problem and the VMs running on it
> > > > > > did not migrate to other spare resources.
> > > >
> > >
> >
>
> > > > > > The official documentation from ovirt.org states the
> > > > > > following:
> > > >
> > >
> >
>
> > > > > High availability
> > > >
> > >
> >
>
> > > > > Allows critical VMs to be restarted on another host in the
> > > > > event
> > > > > of
> > > > > hardware failure with three levels of priority, taking into
> > > > > account
> > > > > resiliency policy.
> > > >
> > >
> >
>
> > > > > * Resiliency policy to control high availability VMs at the
> > > > > cluster
> > > > > level.
> > > >
> > >
> >
>
> > > > > * Supports application-level high availability with supported
> > > > > fencing
> > > > > agents.
> > > >
> > >
> >
>
> > > > > As well as in the Architecture description:
> > > >
> > >
> >
>
> > > > > High Availability - restart guest VMs from failed hosts
> > > > > automatically
> > > > > on other hosts
> > > >
> > >
> >
>
> > > > > So the testing went like this -- One VM running a linux box,
> > > > > having
> > > > > the check box "High Available" and "Priority for
> > > > > Run/Migration
> > > > > queue:" set to Low. On Host we have the check box to "Any
> > > > > Host
> > > > > in
> > > > > Cluster", without "Allow VM migration only upon Admin
> > > > > specific
> > > > > request" checked.
> > > >
> > >
> >
>
> > > > > My environment:
> > > >
> > >
> >
>
> > > > > Configuration : 2 x Hypervisors (same cluster/hardware
> > > > > configuration)
> > > > > ; 1 x Hypervisor + acting as a NAS (NFS) server (different
> > > > > cluster/hardware configuration)
> > > >
> > >
> >
>
> > > > > > Actions: Cut off the power to one of the hypervisors in
> > > > > > the 2-node cluster while the VM was running on it. This
> > > > > > simulates a power outage.
> > > >
> > >
> >
>
> > > > > > Results: The hypervisor node that suffered the outage is
> > > > > > shown in the Hosts tab with Status Non Responsive, and the
> > > > > > VM has a question mark and cannot be powered off or
> > > > > > otherwise acted on (so it is stuck).
> > > >
> > >
> >
>
> > > > > In the Log console in GUI, I get:
> > > >
> > >
> >
>
> > > > > Host Hyper01 is non-responsive.
> > > >
> > >
> >
>
> > > > > VM Web-Frontend01 was set to the Unknown status.
> > > >
> > >
> >
>
> > > > > > There is nothing I could do besides clicking "Confirm
> > > > > > Host has been rebooted" on Hyper01; afterwards the VM
> > > > > > starts on Hyper02 with a cold reboot of the VM.
> > > >
> > >
> >
>
> > > > > The Log console changes to:
> > > >
> > >
> >
>
> > > > > Vm Web-Frontend01 was shut down due to Hyper01 host reboot or
> > > > > manual
> > > > > fence
> > > >
> > >
> >
>
> > > > > All VMs' status on Non-Responsive Host Hyper01 were changed
> > > > > to
> > > > > 'Down'
> > > > > by admin at internal
> > > >
> > >
> >
>
> > > > > Manual fencing for host Hyper01 was started.
> > > >
> > >
> >
>
> > > > > VM Web-Frontend01 was restarted on Host Hyper02
> > > >
> > >
> >
>
> > > > > > I would like your view on this problem. From reading the
> > > > > > documentation & features pages on the official website, I
> > > > > > supposed that this would be an automatic mechanism driven
> > > > > > by some sort of vdsm & engine fencing action. Am I missing
> > > > > > something here?
> > > >
> > >
> >
>
> > > > > Thank you for your patience reading this.
> > > >
> > >
> >
>
> > > > > Regards,
> > > >
> > >
> >
>
> > > > > Alex.
> > > >
> > >
> >
>
> > > > > _______________________________________________
> > > >
> > >
> >
>
> > > > > Users mailing list
> > > >
> > >
> >
>
> > > > > Users at ovirt.org
> > > >
> > >
> >
>
> > > > > http://lists.ovirt.org/mailman/listinfo/users
> > > >
> > >
> >
>
> > > > Hi Alex,
> > >
> >
>
> > > > Can you share with us the engine's log from the relevant time
> > > > period?
> > >
> >
>
> > > > Doron
> > >
> >
>
> > Hi Alex,
>
> > The engine log is the important one, as it will show the
> > decision-making process.
>
> > VDSM logs should be kept in case something is unclear, but I
> > suggest we begin with engine.log.
>
Hi Alex,
In order to have HA working at the host level (which is what you're testing now), you need to
configure power management for each of the relevant hosts (go to the Hosts main tab, right-click a
host and choose Edit, then select the Power Management tab). From the details you gave us it's not
clear how you defined power management for your hosts, so I can only assume it's not defined
properly.
The reason this is necessary is that we cannot resume a VM on a different host before we have
verified the original host's status. If, for example, the VM is still running on the original
host and we have only lost network connectivity to it, we risk running the same VM on 2 different
hosts at the same time, which will corrupt its disk(s). The only way to prevent this is to
reboot the original host, which ensures the VM is not running there. We call the reboot
procedure fencing, and if you check your logs you'll be able to see:
2013-01-13 21:29:42,380 ERROR [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand] (pool-3-thread-44) [a1803d1] Failed to run Fence script on vds:Hyper01, VMs moved to UnKnown instead.
So the only way for you to handle it is to confirm that the host was rebooted (as you did), which
will allow resuming the VM on a different host.
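To make the reasoning above concrete, here is a minimal sketch in Python of the decision
described in this thread. It is not the engine's actual code: the Host class, its fields, the
Outcome names and handle_non_responsive() are all hypothetical names chosen for illustration.
The only behavior it models is what is stated above: HA VMs may be started elsewhere only once
the original host is known to be down, either through a successful fence or through the admin
confirming the reboot.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    RESTART_ON_OTHER_HOST = "restart HA VMs on another host"
    VMS_UNKNOWN_WAIT_FOR_ADMIN = "VMs set to Unknown; wait for manual confirmation"


@dataclass
class Host:
    # All fields are hypothetical, for illustration only.
    name: str
    power_management_configured: bool
    fence_succeeds: bool = False  # would the fence agent manage to reboot the host?


def handle_non_responsive(host: Host, admin_confirmed_reboot: bool = False) -> Outcome:
    """Decide what may happen to the HA VMs of a host the engine cannot reach."""
    # The admin's confirmation is treated as proof that the host is down.
    if admin_confirmed_reboot:
        return Outcome.RESTART_ON_OTHER_HOST
    # Automatic path: only a successful fence proves the VMs are not running there.
    if host.power_management_configured and host.fence_succeeds:
        return Outcome.RESTART_ON_OTHER_HOST
    # No power management, or the fence failed: the host might still be running
    # the VMs behind a network fault, so starting them elsewhere risks disk corruption.
    return Outcome.VMS_UNKNOWN_WAIT_FOR_ADMIN


hyper01 = Host(name="Hyper01", power_management_configured=False)
print(handle_non_responsive(hyper01).value)
print(handle_non_responsive(hyper01, admin_confirmed_reboot=True).value)
```

In this sketch, a power outage and a network outage look identical until a fence succeeds or
an admin confirms the reboot, which is why the VM stays in Unknown in the test described above.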
Doron