[Users] Testing High Availability and Power outages
Doron Fediuck
dfediuck at redhat.com
Sun Jan 13 08:54:19 UTC 2013
----- Original Message -----
> From: "Alexandru Vladulescu" <avladulescu at bfproject.ro>
> To: "Doron Fediuck" <dfediuck at redhat.com>
> Cc: "users" <users at ovirt.org>
> Sent: Sunday, January 13, 2013 10:46:41 AM
> Subject: Re: [Users] Testing High Availability and Power outages
> Dear Doron,
> I haven't collected the logs from the tests, but I would gladly re-do
> the case and get back to you asap.
> This feature is the main reason I chose oVirt in the first place over
> other virtualization environments.
> Could you please tell me which logs I should focus on besides the
> engine log? vdsm, maybe, or other relevant logs?
> Regards,
> Alex
> --
> Sent from phone.
> On 13.01.2013, at 09:56, Doron Fediuck <dfediuck at redhat.com> wrote:
> > ----- Original Message -----
> > > From: "Alexandru Vladulescu" <avladulescu at bfproject.ro>
> > > To: "users" <users at ovirt.org>
> > > Sent: Friday, January 11, 2013 2:47:38 PM
> > > Subject: [Users] Testing High Availability and Power outages
> > >
> > > Hi,
> > >
> > > Today I started testing the High Availability features and the fence
> > > mechanism on my oVirt 3.1 installation (from the dreyou repos),
> > > running on 3 x CentOS 6.3 hypervisors.
> > >
> > > As I reported yesterday in a previous email thread, the migration
> > > priority queue cannot be increased in this version (bug), so I
> > > decided to test what the official documentation says about the High
> > > Availability cases.
> > >
> > > This is the disaster scenario you suffer from when one hypervisor
> > > has a power outage or hardware problem and the VMs running on it do
> > > not migrate to other spare resources.
> > >
> > > The official documentation on ovirt.org states the following:
> > >
> > > High availability
> > >
> > > Allows critical VMs to be restarted on another host in the event of
> > > hardware failure with three levels of priority, taking into account
> > > resiliency policy.
> > >
> > > * Resiliency policy to control high availability VMs at the cluster
> > > level.
> > > * Supports application-level high availability with supported
> > > fencing agents.
> > >
> > > As well as in the architecture description:
> > >
> > > High Availability - restart guest VMs from failed hosts
> > > automatically on other hosts
> > >
> > > So the testing went like this: one VM running a Linux box, with the
> > > "High Available" check box enabled and "Priority for Run/Migration
> > > queue:" set to Low. Under Host, "Any Host in Cluster" is selected
> > > and "Allow VM migration only upon Admin specific request" is not
> > > checked.
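
For reference, these flags can also be set outside the GUI. Below is a
minimal, untested sketch against the engine's REST API; the element names
follow the 3.1 schema as far as I recall, and the engine URL, credentials,
VM id and the numeric priority for "Low" are placeholders, so please verify
against your engine's /api?rsdl before relying on it.

# Minimal sketch (untested): set the HA flags on a VM through the REST API
# instead of the web admin GUI. Element names follow the oVirt 3.1 API
# schema as far as I recall; URL, credentials, VM id and the numeric
# priority are placeholders/assumptions.
import requests

ENGINE = "https://engine.example.com/api"        # placeholder engine URL
AUTH = ("admin@internal", "password")            # placeholder credentials
VM_ID = "00000000-0000-0000-0000-000000000000"   # placeholder VM id

body = """
<vm>
  <high_availability>
    <enabled>true</enabled>
    <priority>1</priority>             <!-- assuming 1 maps to "Low" -->
  </high_availability>
  <placement_policy>
    <affinity>migratable</affinity>    <!-- "Any Host in Cluster" -->
  </placement_policy>
</vm>
"""

resp = requests.put("%s/vms/%s" % (ENGINE, VM_ID),
                    data=body,
                    headers={"Content-Type": "application/xml"},
                    auth=AUTH,
                    verify=False)      # lab setup with a self-signed cert
print(resp.status_code)
print(resp.text)
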
> > > My environment:
> > >
> > > Configuration: 2 x hypervisors (same cluster/hardware
> > > configuration); 1 x hypervisor also acting as a NAS (NFS) server
> > > (different cluster/hardware configuration).
> > >
> > > Actions: I cut off the power to one of the hypervisors in the
> > > 2-node cluster while the VM was running on it. This translates to
> > > a power outage.
> > >
> > > Results: the hypervisor node that suffered the outage shows up in
> > > the Hosts tab with status Non Responsive, and the VM has a question
> > > mark and cannot be powered off or anything else (therefore it is
> > > stuck).
> > >
> > > In the log console in the GUI, I get:
> > >
> > > Host Hyper01 is non-responsive.
> > > VM Web-Frontend01 was set to the Unknown status.
> > >
> > > There is nothing I could do besides clicking "Confirm Host has been
> > > rebooted" on Hyper01; afterwards the VM starts on Hyper02 with a
> > > cold reboot of the VM.
> > >
> > > The log console changes to:
> > >
> > > Vm Web-Frontend01 was shut down due to Hyper01 host reboot or
> > > manual fence
> > > All VMs' status on Non-Responsive Host Hyper01 were changed to
> > > 'Down' by admin at internal
> > > Manual fencing for host Hyper01 was started.
> > > VM Web-Frontend01 was restarted on Host Hyper02
> > >
> > > I would like your take on this problem. Reading the documentation
> > > and feature pages on the official website, I assumed this would
> > > have been an automatic mechanism driven by some sort of vdsm &
> > > engine fencing action. Am I missing something here?
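
One thing worth double-checking while we wait for the logs: the engine can
only fence a host automatically when power management is configured and
enabled for it with a supported fence agent; without one, "Confirm Host has
been rebooted" remains a manual step. A rough, untested sketch of checking
this over the REST API follows; the element names are from memory, and the
engine URL, credentials and host id are placeholders.

# Rough sketch (untested): check whether power management (a fence agent)
# is configured and enabled on a host, which automatic fencing depends on.
# Element names are from memory; URL, credentials and host id are placeholders.
import requests
import xml.etree.ElementTree as ET

ENGINE = "https://engine.example.com/api"          # placeholder engine URL
AUTH = ("admin@internal", "password")              # placeholder credentials
HOST_ID = "11111111-1111-1111-1111-111111111111"   # placeholder host id

resp = requests.get("%s/hosts/%s" % (ENGINE, HOST_ID),
                    headers={"Accept": "application/xml"},
                    auth=AUTH,
                    verify=False)                  # lab setup, self-signed cert
root = ET.fromstring(resp.content)
pm = root.find("power_management")
if pm is None or (pm.findtext("enabled") or "").lower() != "true":
    print("No enabled fence agent: the engine cannot fence this host")
    print("automatically, so manual confirmation is the expected fallback.")
else:
    print("Fence agent type:   ", pm.findtext("type"))
    print("Fence agent address:", pm.findtext("address"))
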
> > > Thank you for your patience reading this.
> > >
> > > Regards,
> > > Alex.
> > >
> > > _______________________________________________
> > > Users mailing list
> > > Users at ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/users
> >
> > Hi Alex,
> > Can you share with us the engine's log from the relevant time period?
> > Doron
Hi Alex,
The engine log is the important one, as it will show the decision-making process.
Keep the VDSM logs in case something is unclear, but I suggest we begin with
engine.log.
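
If it helps, something like the snippet below can pull the relevant lines
out of engine.log before you attach the full file; the default log path and
the keywords are assumptions on my side, so adjust them to your setup.

# Quick filter (untested) to pull fencing/HA related lines out of engine.log
# before attaching it. The default path and the keyword pattern are
# assumptions; adjust them to your installation.
import re

LOG_PATH = "/var/log/ovirt-engine/engine.log"   # default location on the engine
PATTERN = re.compile(r"fence|non.responsive|Hyper01|Web-Frontend01",
                     re.IGNORECASE)

with open(LOG_PATH) as log:
    for line in log:
        if PATTERN.search(line):
            print(line.rstrip())
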