[Users] Testing High Availability and Power outages

Doron Fediuck dfediuck at redhat.com
Sun Jan 13 03:54:19 EST 2013


----- Original Message -----

> From: "Alexandru Vladulescu" <avladulescu at bfproject.ro>
> To: "Doron Fediuck" <dfediuck at redhat.com>
> Cc: "users" <users at ovirt.org>
> Sent: Sunday, January 13, 2013 10:46:41 AM
> Subject: Re: [Users] Testing High Availability and Power outages

> Dear Doron,

> I haven't collected the logs from the tests, but I would gladly redo
> the case and get back to you ASAP.

> This feature is the main reason I chose oVirt in the first place,
> ahead of other virtualization environments.

> Could you please tell me which logs I should focus on besides the
> engine log? vdsm, maybe, or other relevant logs?

> Regards,
> Alex

> --
> Sent from phone.

> On 13.01.2013, at 09:56, Doron Fediuck <dfediuck at redhat.com> wrote:

> > ----- Original Message -----
> > > From: "Alexandru Vladulescu" <avladulescu at bfproject.ro>
> > > To: "users" <users at ovirt.org>
> > > Sent: Friday, January 11, 2013 2:47:38 PM
> > > Subject: [Users] Testing High Availability and Power outages

> > > Hi,

> > > Today I started testing the High Availability features and the
> > > fence mechanism on my oVirt 3.1 installation (from the dreyou
> > > repos), running on 3 x CentOS 6.3 hypervisors.

> > > As I reported yesterday in a previous email thread, the migration
> > > priority queue cannot be increased (a bug) in this version, so I
> > > decided to test what the official documentation says about the
> > > High Availability cases.

> > > This is the disaster scenario to worry about: one hypervisor has a
> > > power outage or hardware problem and the VMs running on it do not
> > > migrate to other spare resources.

> > > The official documentation on ovirt.org states the following:

> > > High availability

> > > Allows critical VMs to be restarted on another host in the event
> > > of hardware failure with three levels of priority, taking into
> > > account resiliency policy.

> > > * Resiliency policy to control high availability VMs at the
> > > cluster level.
> > > * Supports application-level high availability with supported
> > > fencing agents.

> > > As well as in the Architecture description:

> > > High Availability - restart guest VMs from failed hosts
> > > automatically on other hosts

> > > So the test went like this: one VM running a Linux box, with the
> > > "High Available" check box ticked and "Priority for Run/Migration
> > > queue:" set to Low. Under Host, the option "Any Host in Cluster"
> > > is selected, and "Allow VM migration only upon Admin specific
> > > request" is left unchecked.
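
Outside the GUI, the same settings can also be read back over the engine's
REST API. The following is only a minimal sketch, assuming the oVirt 3.x API
at https://<engine>/api and the high_availability / placement_policy XML
elements; the engine URL, credentials and certificate handling below are
placeholders to adapt.

    import requests
    import xml.etree.ElementTree as ET

    # Placeholders: adjust the engine URL and credentials for your setup.
    ENGINE_API = "https://engine.example.com/api"
    AUTH = ("admin@internal", "password")

    # Look the VM up by name (the VM name below comes from this report).
    resp = requests.get(
        ENGINE_API + "/vms",
        params={"search": "name=Web-Frontend01"},
        auth=AUTH,
        verify=False,  # lab setups often use a self-signed certificate
    )
    resp.raise_for_status()

    vm = ET.fromstring(resp.content).find("vm")
    ha = vm.find("high_availability")
    placement = vm.find("placement_policy")

    # Assumed element names: high_availability/enabled, high_availability/priority,
    # placement_policy/affinity (verify against your API version).
    print("highly available  :", ha.findtext("enabled") if ha is not None else "n/a")
    print("run/migration prio:", ha.findtext("priority") if ha is not None else "n/a")
    print("placement affinity:", placement.findtext("affinity") if placement is not None else "n/a")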

> > > My environment:

> > > Configuration: 2 x hypervisors (same cluster/hardware
> > > configuration); 1 x hypervisor also acting as a NAS (NFS) server
> > > (different cluster/hardware configuration)

> > > Actions: Cut the power to one of the hypervisors in the 2-node
> > > cluster while the VM was running on it. This translates to a
> > > power outage.

> > > Results: The hypervisor node that suffered the outage shows up in
> > > the Hosts tab with status Non Responsive, and the VM has a
> > > question mark and cannot be powered off or anything else (so it
> > > is stuck).

> > > In the Log console in the GUI, I get:

> > > Host Hyper01 is non-responsive.
> > > VM Web-Frontend01 was set to the Unknown status.

> > > There is nothing I could do besides clicking "Confirm Host has
> > > been rebooted" on Hyper01; afterwards the VM starts on Hyper02
> > > with a cold reboot.

> > > The Log console changes to:

> > > Vm Web-Frontend01 was shut down due to Hyper01 host reboot or
> > > manual fence
> > > All VMs' status on Non-Responsive Host Hyper01 were changed to
> > > 'Down' by admin at internal
> > > Manual fencing for host Hyper01 was started.
> > > VM Web-Frontend01 was restarted on Host Hyper02

> > > I would like your view on this problem. Reading the documentation
> > > and features pages on the official website, I assumed this would
> > > be an automatic mechanism working through some sort of vdsm &
> > > engine fencing action. Am I missing something here?
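
One general point worth checking while the logs are collected (this is
standard oVirt behaviour, not a conclusion about this particular setup): the
engine only restarts highly available VMs automatically if it can fence the
failed host, and that requires a power management (fencing) agent to be
configured for the host; without one, the manual "Confirm Host has been
rebooted" step is expected. A minimal sketch for checking this, under the same
REST API assumptions and placeholders as above:

    import requests
    import xml.etree.ElementTree as ET

    ENGINE_API = "https://engine.example.com/api"   # placeholder engine URL
    AUTH = ("admin@internal", "password")           # placeholder credentials

    resp = requests.get(ENGINE_API + "/hosts", auth=AUTH, verify=False)
    resp.raise_for_status()

    # Assumed element name: power_management/enabled (verify against your API version).
    for host in ET.fromstring(resp.content).findall("host"):
        name = host.findtext("name")
        pm = host.find("power_management")
        enabled = pm.findtext("enabled") if pm is not None else "unknown"
        print(name, ": power management enabled =", enabled)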

> > > Thank you for your patience reading this.

> > > Regards,
> > > Alex.

> > > _______________________________________________
> > > Users mailing list
> > > Users at ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/users

> > Hi Alex,
> > Can you share with us the engine's log from the relevant time
> > period?
> > Doron

Hi Alex,
The engine log is the important one, as it shows the decision-making process.
VDSM logs should be kept in case something is unclear, but I suggest we begin
with engine.log.
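
If it helps, here is a minimal sketch for pulling just the relevant time
window out of engine.log before sending it. It assumes the default log
location /var/log/ovirt-engine/engine.log and the usual
"YYYY-MM-DD HH:MM:SS,mmm" timestamp at the start of each entry; adjust the
window to the time of the outage. A similar window can be cut from the vdsm
logs later if they turn out to be needed.

    import re
    from datetime import datetime

    LOG = "/var/log/ovirt-engine/engine.log"   # default engine log path
    START = datetime(2013, 1, 11, 14, 30)      # adjust to the outage window
    END = datetime(2013, 1, 11, 15, 30)

    stamp = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+")

    def in_window(lines):
        keep = False
        for line in lines:
            m = stamp.match(line)
            if m:
                ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                keep = START <= ts <= END
            if keep:  # untimestamped lines (stack traces) follow their entry
                yield line

    with open(LOG, errors="replace") as f:
        for line in in_window(f):
            print(line, end="")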

