[Users] Hosted Engine recovery failure of all HA - nodes

Greg Padgett gpadgett at redhat.com
Wed Apr 9 12:07:14 UTC 2014


Hi Andrew and Martin,

On 04/09/2014 04:07 AM, Andrew Lau wrote:
> Hi,
>
> On Apr 9, 2014 5:43 PM, "Martin Sivak" <msivak at redhat.com> wrote:
>  >
>  > Hi,
>  >
>  > > I noticed this happens too. I think the issue is that after N attempts the
>  > > ovirt-ha-agent process will kill itself if it believes it can't access
>  > > the storage or it fails in some other way.
>  >
>  > If the agent can't access storage or VDSM it waits for 60 seconds and
>  > tries again. After three (iirc) failed attempts it shuts down.
>
> Is there any reason it shuts down? Couldn't it just sleep for x minutes,
> with the sleep time scaling exponentially after each failure?

It looks like this is a side effect of a fix for a different bug,
https://bugzilla.redhat.com/show_bug.cgi?id=1008505
in which the agent would try to run when it wasn't fully configured.
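
For what it's worth, the backoff Andrew describes could look roughly like
the sketch below. This is not the agent's actual code: the function names,
the doubling factor, and the 30-minute cap are assumptions for illustration,
and the 60-second base is simply borrowed from Martin's description.

```python
import time


def backoff_delays(base=60, factor=2, cap=1800):
    """Yield sleep times in seconds: base, base*factor, ... never above cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def wait_until_ready(is_ready, sleep=time.sleep, max_attempts=None):
    """Poll is_ready() until it returns True, sleeping longer after each failure.

    is_ready is a hypothetical callable standing in for "storage and VDSM
    are reachable". Returns the number of failed checks before success, or
    None if max_attempts failures happened first. With max_attempts=None
    the loop never gives up.
    """
    delays = backoff_delays()
    failures = 0
    while not is_ready():
        failures += 1
        if max_attempts is not None and failures >= max_attempts:
            return None
        sleep(next(delays))
    return failures
```

With the defaults the waits are 60, 120, 240, 480 and 960 seconds, then
1800 seconds from there on, so a host that stays broken is polled about
twice an hour instead of the agent exiting.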

>  >
>  > > The ovirt-ha-broker service
>  > > however still remains and continues to calculate the score.
>  >
>  > The broker acts only as a data link; the score is computed by the
>  > agent. The broker is used to propagate it to storage (and to collect data).
>
> Thanks for clarifying. I remember seeing some reference to the score in the
> broker log and assumed incorrectly.
>  >
>  > > It would be
>  > > nice, I guess, if it could proactively restart the ha-agent every now
>  > > and then.
>  >
>  > We actually have a bug that is related to this:
>  > https://bugzilla.redhat.com/show_bug.cgi?id=1030441
>  >
>  > Greg, are you still working on it?

Sorry, not currently looking at that one.

>  >
>  > > > What is the expected procedure after a shutdown (graceful / ungraceful)
>  > > > of Hosted-Engine HA nodes? Should the engine recover by itself? Should
>  > > > the running VMs be restarted automatically?
>  >
>  > If the agent-broker pair recovers and sanlock is not preventing the lock
>  > from being taken (because it was not released properly), then the engine
>  > VM should be started automatically.
>  >
>  > > If all the nodes come up at the same time, in my testing, it took 10
>  > > minutes for the ha-agents to settle and then finally decide on which host
>  > > to bring up the engine.
>  >
>  > We set a 10 minute mandatory down time for a host when a VM start is
>  > not successful. That might be because sanlock still thinks somebody is
>  > running the VM. The /var/log/ovirt-hosted-engine-ha/agent.log would help here.
>  >
>  > Regards
>  > --
>  > Martin Sivák
>  > msivak at redhat.com
>  > Red Hat Czech
>  > RHEV-M SLA / Brno, CZ
>  >
>  > ----- Original Message -----
>  > > On Wed, Apr 9, 2014 at 2:09 AM, Daniel Helgenberger
>  > > <daniel.helgenberger at m-box.de> wrote:
>  > > > Hello,
>  > > >
>  > > > I have an oVirt 3.4 hosted engine lab setup which I am evaluating for
>  > > > production use.
>  > > >
>  > > > I "simulated" an ungraceful shutdown of all HA nodes (power cut) while
>  > > > the engine was running. After powering up, the system did not recover
>  > > > by itself (it seemed).
>  > > > I had to restart the ovirt-hosted-ha service (which was in a locked
>  > > > state) and then manually run 'hosted-engine --vm-start'.
>  > >
>  > > I noticed this happens too. I think the issue is that after N attempts the
>  > > ovirt-ha-agent process will kill itself if it believes it can't access
>  > > the storage or it fails in some other way. The ovirt-ha-broker service
>  > > however still remains and continues to calculate the score. It would be
>  > > nice, I guess, if it could proactively restart the ha-agent every now
>  > > and then.
>  > >
>  > > >
>  > > > What is the expected procedure after a shutdown (graceful / ungraceful)
>  > > > of Hosted-Engine HA nodes? Should the engine recover by itself? Should
>  > > > the running VMs be restarted automatically?
>  > >
>  > > I don't think any other VMs get restarted automatically. This is
>  > > because the engine is used to ensure that the VM hasn't been restarted
>  > > on another host. This is where power management etc. comes into play.
>  > >
>  > > If all the nodes come up at the same time, in my testing, it took 10
>  > > minutes for the ha-agents to settle and then finally decide on which host
>  > > to bring up the engine. Then, technically (untested), any VMs which
>  > > you've marked as HA should be automatically brought back up by the
>  > > engine. That would be 15-20 minutes to recover, which feels a little
>  > > slow, although fairly automatic.
>  > >
>  > > >
>  > > > Thanks,
>  > > > Daniel
>  > > >
>  > > > _______________________________________________
>  > > > Users mailing list
>  > > > Users at ovirt.org
>  > > > http://lists.ovirt.org/mailman/listinfo/users
>  > > >
>
