[Users] Hosted Engine recovery failure of all HA - nodes

Andrew Lau andrew at andrewklau.com
Wed Apr 9 04:07:24 EDT 2014


Hi,

On Apr 9, 2014 5:43 PM, "Martin Sivak" <msivak at redhat.com> wrote:
>
> Hi,
>
> > I noticed this happens too. I think the issue is that after N attempts the
> > ovirt-ha-agent process will kill itself if it believes it can't access
> > the storage or it fails in some other way.
>
> If the agent can't access storage or VDSM, it waits for 60 seconds and
> tries again. After three (iirc) failed attempts it shuts down.

Is there any reason it shuts down? Would it be possible to have it sleep
for x minutes instead, with that sleep time scaling exponentially after
each failure?
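
To make the idea concrete, here is a rough sketch of the kind of retry loop
I have in mind. It is not the agent's actual code: the function names, the
doubling factor and the one-hour cap are assumptions for illustration, and
the 60-second base is simply taken from the wait time described above.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 60.0, factor: float = 2.0,
                   cap: float = 3600.0) -> Iterator[float]:
    """Yield sleep times forever: base, base*factor, ... never above cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def wait_until_healthy(check: Callable[[], bool],
                       sleep: Callable[[float], None] = time.sleep,
                       **backoff_args: float) -> int:
    """Poll check() until it returns True, sleeping longer after each failure.

    Returns the number of failed attempts seen before success. Unlike a
    fixed retry limit, this loop never gives up and exits.
    """
    failures = 0
    delays = backoff_delays(**backoff_args)
    while not check():
        failures += 1
        sleep(next(delays))
    return failures
```

Here check() stands in for whatever "can I reach storage and VDSM" test the
agent performs. The cap keeps the wait from growing without bound, so a host
that recovers after a long outage is still noticed within a known time.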
>
> > The ovirt-ha-broker service
> > however still remains and continues to calculate the score.
>
> The broker acts only as a data link, the score is computed by the agent.
> The broker is used to propagate it to storage (and to collect data).

Thanks for clarifying. I remember seeing some reference to the score in the
broker log and incorrectly assumed the broker was computing it.
>
> > It would be
> > nice, I guess, if it could proactively restart the ha-agent every now
> > and then.
>
> We actually have a bug that is related to this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1030441
>
> Greg, are you still working on it?
>
> > > What is the supposed procedure after a shutdown (graceful / ungraceful)
> > > of Hosted-Engine HA nodes? Should the engine recover by itself? Should
> > > the running VMs be restarted automatically?
>
> If the agent-broker pair recovers and sanlock is not preventing taking
> the lock (which was not released properly) then the engine VM should be
> started automatically.
>
> > If all the nodes come up at the same time, in my testing, it took 10
> > minutes for the ha-agents to settle and then finally decide which host
> > to bring up the engine.
>
> We set a 10 minute mandatory downtime for a host when a VM start is not
> successful. That might be because sanlock still thinks somebody is
> running the VM. The /var/log/ovirt-hosted-engine-ha/agent.log would help
> here.
>
> Regards
> --
> Martin Sivák
> msivak at redhat.com
> Red Hat Czech
> RHEV-M SLA / Brno, CZ
>
> ----- Original Message -----
> > On Wed, Apr 9, 2014 at 2:09 AM, Daniel Helgenberger
> > <daniel.helgenberger at m-box.de> wrote:
> > > Hello,
> > >
> > > I have an oVirt 3.4 hosted engine lab setup which I am evaluating for
> > > production use.
> > >
> > > I "simulated" an ungraceful shutdown of all HA nodes (power cut) while
> > > the engine was running. After powering up, the system did not recover
> > > itself (it seemed).
> > > I had to restart the ovirt-hosted-ha service (which was in a locked
> > > state) and then manually run 'hosted-engine --vm-start'.
> >
> > I noticed this happens too. I think the issue is that after N attempts the
> > ovirt-ha-agent process will kill itself if it believes it can't access
> > the storage or it fails in some other way. The ovirt-ha-broker service
> > however still remains and continues to calculate the score. It would be
> > nice, I guess, if it could proactively restart the ha-agent every now
> > and then.
> >
> > >
> > > What is the supposed procedure after a shutdown (graceful / ungraceful)
> > > of Hosted-Engine HA nodes? Should the engine recover by itself? Should
> > > the running VMs be restarted automatically?
> >
> > I don't think any other VMs get restarted automatically. This is
> > because the engine is used to ensure that the VM hasn't been restarted
> > on another host. This is where power management etc. comes into play.
> >
> > If all the nodes come up at the same time, in my testing, it took 10
> > minutes for the ha-agents to settle and then finally decide which host
> > to bring up the engine. Then, technically (untested), any VMs which
> > you've marked as HA should be automatically brought back up by the
> > engine. That would be 15-20 minutes to recover, which feels a little
> > slow, although fairly automatic.
> >
> > >
> > > Thanks,
> > > Daniel
> > >
> > > _______________________________________________
> > > Users mailing list
> > > Users at ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/users
> > >