[Users] Testing High Availability and Power outages
Alexandru Vladulescu
avladulescu at bfproject.ro
Mon Jan 14 11:50:41 UTC 2013
On 01/14/2013 10:13 AM, Doron Fediuck wrote:
>
>
> ------------------------------------------------------------------------
>
> *From: *"Alexandru Vladulescu" <avladulescu at bfproject.ro>
> *To: *"Doron Fediuck" <dfediuck at redhat.com>
> *Cc: *"users" <users at ovirt.org>
> *Sent: *Sunday, January 13, 2013 9:49:25 PM
> *Subject: *Re: [Users] Testing High Availability and Power outages
>
>
> Dear Doron,
>
>
> I have retested the case and I am writing to you with the results.
>
> Furthermore, in case this information is useful to you, my
> network setup is the following: two Layer 2 switches (Zyxel
> ES2108-G & ES2200-8) configured with 2 VLANs (one inside
> backbone network -- added as br0 to Ovirt; one outside network --
> running on the ovirtmgmt interface for Internet traffic to the
> VMs). The backbone switch is gigabit capable, and each host runs
> with a jumbo frame setup. There is also a firewall server that
> routes the subnets through a trunk port and VLAN configuration.
> Ovirt itself has been set up on the backbone network subnet.
>
> As you can guess, the network infrastructure is not the problem here.
>
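> Just to show how I verified that part (the host name below is only
> an example from my setup), I checked the bridge MTU and pushed a
> full-size frame with the don't-fragment bit set from one hypervisor
> to the other:
>
>     # the bridge should report mtu 9000 if jumbo frames are active
>     ip link show br0
>     # 8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header;
>     # -M do forbids fragmentation, so oversized frames fail loudly
>     ping -M do -s 8972 -c 3 hyper02
>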
> The test case was the same as described before:
>
> 1. VM running on Hyper01, none on Hyper02. The VM had the High
> Available check box enabled.
> 2. Hard power-off of Hyper01 at the power source (no soft/manual
> shutdown).
> 3. After a while, Ovirt marks Hyper01 as Non Responsive.
> 4. Manually clicked on "Confirm Host has been Rebooted", and after
> Ovirt's manual fence of Hyper01 the VM starts on the Hyper02 host.
>
> I have attached the engine log. The "Confirm Host has been
> Rebooted" action was done at exactly 21:31:45. In the cluster
> section in Ovirt, I also tried changing the "Resilience Policy"
> attribute from "Migrate Virtual Machines" to "Migrate only High
> Available Virtual Machines", but with the same results.
>
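> In case it helps you zero in on it (the log path is the default one
> on my engine host; adjust the timestamp to the window you need),
> the relevant lines around the manual fence can be pulled out with:
>
>     grep '2013-01-13 21:3' /var/log/ovirt-engine/engine.log \
>         | grep -Ei 'fence|NonResponsive'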
>
> As far as I can guess from the engine log, the Node Controller
> sees the Hyper01 node as having a "network fault" (no route to
> host), although the host was actually powered off.
>
> Is this supposed to be the default behavior in this case? The
> scenario might overlap with a real case of a network outage.
>
>
> My Regards,
> Alex.
>
>
>
> On 01/13/2013 10:54 AM, Doron Fediuck wrote:
>
>
>
> ------------------------------------------------------------------------
>
> *From: *"Alexandru Vladulescu" <avladulescu at bfproject.ro>
> *To: *"Doron Fediuck" <dfediuck at redhat.com>
> *Cc: *"users" <users at ovirt.org>
> *Sent: *Sunday, January 13, 2013 10:46:41 AM
> *Subject: *Re: [Users] Testing High Availability and Power
> outages
>
> Dear Doron,
>
> I haven't collected the logs from the tests, but I would
> gladly re-run the case and get back to you asap.
>
> This feature is the main reason I chose Ovirt in the first
> place over other virtualization environments.
>
> Could you please tell me which logs I should focus on besides
> the engine log -- vdsm, maybe, or other relevant logs?
>
> Regards,
> Alex
>
>
> --
> Sent from phone.
>
> On 13.01.2013, at 09:56, Doron Fediuck
> <dfediuck at redhat.com> wrote:
>
>
>
> ------------------------------------------------------------------------
>
> *From: *"Alexandru Vladulescu"
> <avladulescu at bfproject.ro
> <mailto:avladulescu at bfproject.ro>>
> *To: *"users" <users at ovirt.org
> <mailto:users at ovirt.org>>
> *Sent: *Friday, January 11, 2013 2:47:38 PM
> *Subject: *[Users] Testing High Availability and
> Power outages
>
>
> Hi,
>
>
> Today I started testing the High Availability features and
> the fence mechanism on my Ovirt 3.1 installation (from the
> dreyou repos), running on 3 x CentOS 6.3 hypervisors.
>
> As I reported yesterday in a previous email thread, the
> migration priority queue cannot be increased in this current
> version (bug), so I decided to test what the official
> documentation says about the High Availability cases.
>
> This would be a disastrous scenario to suffer from if one
> hypervisor has a power outage/hardware problem and the VMs
> running on it do not migrate to other spare resources.
>
>
> The official documentation from ovirt.org
> <http://ovirt.org> quotes the following:
>
>
> High availability
>
> Allows critical VMs to be restarted on another host in the
> event of hardware failure with three levels of priority,
> taking into account resiliency policy.
>
> * Resiliency policy to control high availability VMs at
>   the cluster level.
> * Supports application-level high availability with
>   supported fencing agents.
>
>
> As well as in the Architecture description:
>
> High Availability - restart guest VMs from failed hosts
> automatically on other hosts
>
>
>
> So the testing went like this -- one VM running a Linux box,
> with the "High Available" check box enabled and "Priority for
> Run/Migration queue:" set to Low. On the Host tab we have
> "Any Host in Cluster" selected, without "Allow VM migration
> only upon Admin specific request" checked.
>
>
>
> My environment:
>
>
> Configuration: 2 x hypervisors (same cluster/hardware
> configuration); 1 x hypervisor also acting as a NAS (NFS)
> server (different cluster/hardware configuration).
>
> Actions: Cut off the power to one of the hypervisors in the
> 2-node cluster while the VM was running on it. This
> translates to a power outage.
>
> Results: The hypervisor node that suffered the outage shows
> in the Hosts tab with status Non Responsive, and the VM has
> a question mark and cannot be powered off or otherwise
> managed (so it is stuck).
>
> In the log console in the GUI, I get:
>
> Host Hyper01 is non-responsive.
> VM Web-Frontend01 was set to the Unknown status.
>
> There is nothing I could do besides clicking "Confirm Host
> has been Rebooted" on Hyper01; afterwards the VM starts on
> Hyper02 with a cold reboot of the VM.
>
> The Log console changes to:
>
> Vm Web-Frontend01 was shut down due to Hyper01
> host reboot or manual fence
> All VMs' status on Non-Responsive Host Hyper01
> were changed to 'Down' by admin at internal
> Manual fencing for host Hyper01 was started.
> VM Web-Frontend01 was restarted on Host Hyper02
>
>
> I would like your take on this problem. Reading the
> documentation & features pages on the official website, I
> assumed this would be an automatic mechanism driven by some
> sort of vdsm & engine fencing action. Am I missing something
> here?
>
>
> Thank you for your patience reading this.
>
>
> Regards,
> Alex.
>
>
>
>
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>
> Hi Alex,
> Can you share with us the engine's log from the
> relevant time period?
>
> Doron
>
> Hi Alex,
> the engine log is the important one, as it will show the
> decision-making process. The VDSM logs should be kept in case
> something is unclear, but I suggest we begin with engine.log.
>
>
> Hi Alex,
> In order to have HA working at the host level (which is what you're
> testing now) you need to configure power management for each of the
> relevant hosts (go to the Hosts main tab, right-click a host and
> choose Edit, then select the Power Management tab and you'll see
> it). In the details you gave us it's not clear how you defined power
> management for your hosts, so I can only assume it's not defined
> properly.
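>
> If you want to sanity-check the fence device outside of the engine
> first, the agents shipped in the fence-agents package can be run by
> hand. Purely as an example (the address, user and password are
> placeholders, and the right agent depends on your hardware):
>
>     fence_ipmilan -a 192.168.1.50 -l admin -p secret -o status
>
> If that reports the correct power state, the same parameters should
> work in the Power management tab.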
>
> The reason for this necessity is that we cannot resume a VM on a
> different host before we have verified the original host's status.
> If, for example, the VM is still running on the original host and we
> have only lost network connectivity to it, we risk running the same
> VM on 2 different hosts at the same time, which would corrupt its
> disk(s). So the only way to prevent this is to reboot the original
> host, which ensures the VM is not running there. We call this reboot
> procedure fencing, and if you check your logs you'll be able to see:
>
> 2013-01-13 21:29:42,380 ERROR
> [org.ovirt.engine.core.bll.VdsNotRespondingTreatmentCommand]
> (pool-3-thread-44) [a1803d1] Failed to run Fence script on
> vds:Hyper01, VMs moved to UnKnown instead.
>
> So the only way for you to handle it is to confirm the host was
> rebooted (as you did), which then allows resuming the VM on a
> different host.
>
> Doron
Hi Doron,
Regarding your reply: I don't have such a fence mechanism through an
IMM or iLO interface, as the hardware I am using doesn't support IPMI
technology. Your response makes me consider getting an add-on card
that can handle the basic reboot, restart and reset functions for our
hardware.
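Once such a card is installed I intend to verify it from another box
before wiring it into Ovirt's Power Management tab, along these lines
(the address and credentials are only placeholders for whatever the
card will use):

    # query the management board over the network for the power state
    ipmitool -I lanplus -H 192.168.1.50 -U admin -P secret chassis power status
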
Thank you very much for your advice on this.
Alex