[Users] Hosted Engine recovery failure of all HA - nodes
Greg Padgett
gpadgett at redhat.com
Wed Apr 9 12:07:14 UTC 2014
Hi Andrew and Martin,
On 04/09/2014 04:07 AM, Andrew Lau wrote:
> Hi,
>
> On Apr 9, 2014 5:43 PM, "Martin Sivak" <msivak at redhat.com> wrote:
> >
> > Hi,
> >
> > > I noticed this happens too, I think the issue is after N attempts the
> > > ovirt-ha-agent process will kill itself if it believes it can't access
> > > the storage or it fails in some other way.
> >
> > If the agent can't access storage or VDSM it waits for 60 seconds and
> > tries again. After three (iirc) failed attempts it shuts down.
>
> Is there any reason it shuts down? Could it not be possible to just have
> it sleep for x minutes? Have that sleep time exponentially scale after
> each fail.
It looks like this is a side effect of a fix for a different bug,
https://bugzilla.redhat.com/show_bug.cgi?id=1008505
in which the agent would try to run when it wasn't fully configured.
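
To illustrate the alternative Andrew suggests, here is a minimal sketch of a
retry loop that sleeps progressively longer after each failure instead of
exiting after a fixed number of attempts. This is not the ovirt-ha-agent code.
The names backoff_delays and run_with_backoff, the check callable, the
60-second base, and the 15-minute cap are all assumptions chosen for
illustration.

```python
import time


def backoff_delays(base=60, factor=2, cap=900):
    """Yield sleep times in seconds: base, base*factor, ... capped at cap.

    All three defaults are illustrative assumptions, not agent settings.
    """
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def run_with_backoff(check, sleep=time.sleep, max_attempts=None):
    """Call check() until it returns True, sleeping longer after each failure.

    check is a hypothetical callable standing in for "can I reach storage
    and VDSM?". Returns the number of failed attempts before success, or
    None if max_attempts was reached first.
    """
    failures = 0
    for delay in backoff_delays():
        if check():
            return failures
        failures += 1
        if max_attempts is not None and failures >= max_attempts:
            return None
        sleep(delay)
```

With max_attempts left as None the loop never gives up, which is the behavior
asked about above. Capping the delay keeps the worst-case recovery time bounded
once storage comes back.
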
> >
> > > The ovirt-ha-broker service
> > > however still remains and continues to calculate the score.
> >
> > The broker acts only as a data link, the score is computed by the
> > agent. The broker is used to propagate it to storage (and to collect data).
>
> Thanks for clarifying, I remember seeing some reference to score in the
> broker log. Assumed incorrectly.
> >
> > > It'll be
> > > nice I guess if it could pro-actively restart the ha-agent every now
> > > and then.
> >
> > We actually have a bug that is related to this:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1030441
> >
> > Greg, are you still working on it?
Sorry, not currently looking at that one.
> >
> > > > What is the supposed procedure after a shutdown (graceful / ungraceful)
> > > > of Hosted-Engine HA nodes? Should the engine recover by itself? Should
> > > > the running VMs be restarted automatically?
> >
> > If the agent-broker pair recovers and sanlock is not preventing taking
> > the lock (which was not released properly) then the engine VM should be
> > started automatically.
> >
> > > If all the nodes come up at the same time, in my testing, it took 10
> > > minutes for the ha-agents to settle and then finally decide which host
> > > to bring up the engine.
> >
> > We set a 10 minute mandatory down time for a host when a VM start is
> > not successful. That might be because sanlock still thinks somebody is
> > running the VM. The /var/log/ovirt-hosted-engine-ha/agent.log would help here.
> >
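
The mandatory down time Martin describes can be pictured with a small sketch
like the one below. It is only an illustration of the idea, not how the agent
implements it: the HostStartTracker class and its method names are
hypothetical, and only the 10-minute figure comes from the thread.

```python
import time

DOWNTIME_SECONDS = 10 * 60  # the 10-minute figure mentioned above


class HostStartTracker:
    """Hypothetical helper: remembers when each host last failed a VM start."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the logic can be tested
        self._failed_at = {}     # host name -> time of last failed start

    def record_failed_start(self, host):
        """Note that a VM start just failed on this host."""
        self._failed_at[host] = self._clock()

    def may_start_vm(self, host):
        """Return True once the host's down time has fully elapsed."""
        failed = self._failed_at.get(host)
        if failed is None:
            return True
        return self._clock() - failed >= DOWNTIME_SECONDS
```

If every host fails its first start attempt (for example because sanlock still
holds a stale lease), none of them is eligible again until the down time has
passed, which would match the roughly 10-minute delay seen in testing.
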
> > Regards
> > --
> > Martin Sivák
> > msivak at redhat.com
> > Red Hat Czech
> > RHEV-M SLA / Brno, CZ
> >
> > ----- Original Message -----
> > > On Wed, Apr 9, 2014 at 2:09 AM, Daniel Helgenberger
> > > <daniel.helgenberger at m-box.de> wrote:
> > > > Hello,
> > > >
> > > > I have an oVirt 3.4 hosted engine lab setup which I am evaluating for
> > > > production use.
> > > >
> > > > I "simulated" an ungraceful shutdown of all HA nodes (powercut) while
> > > > the engine was running. After powering up, the system did not recover
> > > > itself (it seemed).
> > > > I had to restart the ovirt-hosted-ha service (which was in a locked
> > > > state) and then manually run 'hosted-engine --vm-start'.
> > >
> > > I noticed this happens too, I think the issue is after N attempts the
> > > ovirt-ha-agent process will kill itself if it believes it can't access
> > > the storage or it fails in some other way. The ovirt-ha-broker service
> > > however still remains and continues to calculate the score. It'll be
> > > nice I guess if it could pro-actively restart the ha-agent every now
> > > and then.
> > >
> > > >
> > > > What is the supposed procedure after a shutdown (graceful / ungraceful)
> > > > of Hosted-Engine HA nodes? Should the engine recover by itself? Should
> > > > the running VMs be restarted automatically?
> > >
> > > I don't think any other VMs get restarted automatically, this is
> > > because the engine is used to ensure that the VM hasn't been restarted
> > > on another host. This is where power management etc comes into play.
> > >
> > > If all the nodes come up at the same time, in my testing, it took 10
> > > minutes for the ha-agents to settle and then finally decide which host
> > > to bring up the engine. Then technically... (untested) any VMs which
> > > you've marked as HA should be automatically brought back up by the
> > > engine. This would be 15-20 minutes to recover which feels a little
> > > slow.. although fairly automatic.
> > >
> > > >
> > > > Thanks,
> > > > Daniel
> > > >
> > > > _______________________________________________
> > > > Users mailing list
> > > > Users at ovirt.org
> > > > http://lists.ovirt.org/mailman/listinfo/users
> > > >
> > >
>