[Engine-devel] Autorecovery feature plan for review

Miki Kenneth mkenneth at redhat.com
Thu Feb 16 08:28:12 UTC 2012



----- Original Message -----
> From: "Moran Goldboim" <mgoldboi at redhat.com>
> To: "Yaniv Kaul" <ykaul at redhat.com>
> Cc: engine-devel at ovirt.org
> Sent: Thursday, February 16, 2012 10:01:37 AM
> Subject: Re: [Engine-devel] Autorecovery feature plan for review
> 
> On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
> > On 02/16/2012 09:29 AM, Moran Goldboim wrote:
> >> On 02/16/2012 12:38 AM, Itamar Heim wrote:
> >>> On 02/15/2012 07:02 PM, Livnat Peer wrote:
> >>>> On 15/02/12 18:28, Ayal Baron wrote:
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> Hi,
> >>>>>>
> >>>>>> A short summary from the call today; please correct me if I
> >>>>>> forgot or misunderstood something.
> >>>>>>
> >>>>>> Ayal argued against the failed host/storage domain being
> >>>>>> reactivated by a periodically executed job; he would prefer it
> >>>>>> if the engine could [try to] correct the problem right on
> >>>>>> discovery.
> >>>>>> Livnat's point was that this is hard to implement and it is OK
> >>>>>> if we
> >>>>>> move it to Nonoperational state and periodically check it
> >>>>>> again.
> >>>>>>
> >>>>>> There was a little arguing over whether we call the current
> >>>>>> behavior a bug or a missing behavior; I believe this is not
> >>>>>> that important.
> >>>>>>
> >>>>>> I did not fully understand the last few sentences from Livnat;
> >>>>>> did we manage to agree on a change in the plan?
> >>>>>
> >>>>> A couple of points that we agreed upon:
> >>>>> 1. no need for a new mechanism, just initiate this from the
> >>>>> monitoring context.
> >>>>>     Preferably, if not difficult, evaluate the monitoring data;
> >>>>>     if the host should remain in non-op then don't bother running
> >>>>>     initVdsOnUp.
> >>>>> 2. configuration of when to call initVdsOnUp is orthogonal to
> >>>>> auto-init behaviour; if introduced it should be on by default,
> >>>>> the user should be able to configure this either on or off for
> >>>>> the host in general (no lower granularity), and it can only be
> >>>>> configured via the API.
> >>>>> When disabled, initVdsOnUp would be called only when the admin
> >>>>> activates the host/storage, and any error would keep it inactive
> >>>>> (I still don't understand why this is at all needed, but whatever).
> >>>>>
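To make sure we are reading points 1 and 2 the same way, here is a minimal
sketch of the behaviour as I understand it. initVdsOnUp is the existing flow
name; the Host class, the autoRecoverable flag and the health check are all
hypothetical names I made up for illustration, not the engine's actual code:

    // Sketch only: auto-recovery driven from the existing monitoring cycle.
    public class AutoRecoverySketch {

        enum HostStatus { UP, NON_OPERATIONAL }

        static class Host {
            final String name;
            HostStatus status = HostStatus.NON_OPERATIONAL;
            // point 2: per-host switch, on by default, settable only via the API
            boolean autoRecoverable = true;
            Host(String name) { this.name = name; }
        }

        // Point 1: no new mechanism -- this runs from the monitoring context
        // whenever fresh monitoring data comes in.
        static void onMonitoringCycle(Host host, boolean monitoringLooksHealthy) {
            if (host.status != HostStatus.NON_OPERATIONAL) {
                return;                  // nothing to recover
            }
            if (!host.autoRecoverable) {
                return;                  // opted out: only a manual Activate retries
            }
            if (!monitoringLooksHealthy) {
                return;                  // host would stay non-op, skip initVdsOnUp
            }
            initVdsOnUp(host);
        }

        // Stand-in for the engine's existing init-on-up flow.
        static void initVdsOnUp(Host host) {
            host.status = HostStatus.UP;
            System.out.println("re-initialized " + host.name);
        }

        public static void main(String[] args) {
            Host h = new Host("host1");
            onMonitoringCycle(h, false); // monitoring still unhealthy -> stays non-op
            onMonitoringCycle(h, true);  // looks healthy -> initVdsOnUp runs
            System.out.println(h.name + " is now " + h.status);
        }
    }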
> >>>>
> >>>> Also, a note from Moran on the call was to check whether we can
> >>>> unify the non-operational and Error statuses of the host.
> >>>> It was mentioned on the call that the reason for having the ERROR
> >>>> state is recovery (timing out of the error state), but since we
> >>>> are about to recover from the non-operational status as well,
> >>>> there is no reason to have two different statuses.
> >>>
> >>> they are not exactly the same.
> >>> or should I say, error is supposed to be for when the reason isn't
> >>> related to the host being non-operational.
> >>>
> >>> what is the error state?
> >>> a host will go into error state if it fails to run 3 (configurable)
> >>> VMs that then succeeded running on another host on retry.
> >>> i.e., something is wrong with that host, and it is failing to
> >>> launch VMs.
> >>> as it happens, it already "auto recovers" from this state after a
> >>> certain period of time.
> >>>
> >>> why? because the host will fail to run virtual machines, and will
> >>> be
> >>> the least loaded, so it will be the first target selected to run
> >>> them, which will continue to fail.
> >>>
> >>> so there is a negative scoring mechanism on the number of errors,
> >>> until the host is taken out for a while.
> >>>
> >>> (I don't remember if the reverse is true and the VM goes into
> >>> error mode if it failed to launch on all hosts within the number of
> >>> retries. I think this wasn't needed and the user just got an error
> >>> in the audit log.)
> >>>
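Just to check that I follow the scoring mechanism described above, here is a
rough sketch of it. The threshold of 3 failures and the timeout are the
configurable values mentioned; all the class and method names below are
invented for illustration and are not the engine's actual code:

    // Illustrative sketch of the "negative scoring": a host that repeatedly
    // fails to launch VMs (which then succeed elsewhere) is moved to ERROR
    // and taken out of scheduling until a timeout expires.
    public class ErrorScoringSketch {

        static final int FAILURE_THRESHOLD = 3;           // "3 (configurable) VMs"
        static final long ERROR_TIMEOUT_MS = 30 * 60_000;  // time out of the error state

        private int launchFailures = 0;
        private long errorSince = -1;

        // Record a VM that failed to launch here but succeeded on another host.
        void onVmFailedHereSucceededElsewhere(long now) {
            launchFailures++;
            if (launchFailures >= FAILURE_THRESHOLD) {
                errorSince = now;        // host enters ERROR
            }
        }

        // Scheduler check: ERROR hosts are skipped until the timeout expires.
        boolean isSchedulable(long now) {
            if (errorSince < 0) {
                return true;
            }
            if (now - errorSince >= ERROR_TIMEOUT_MS) {
                launchFailures = 0;      // timed auto-recovery: clear the score
                errorSince = -1;
                return true;
            }
            return false;
        }

        public static void main(String[] args) {
            ErrorScoringSketch host = new ErrorScoringSketch();
            long now = System.currentTimeMillis();
            for (int i = 0; i < FAILURE_THRESHOLD; i++) {
                host.onVmFailedHereSucceededElsewhere(now);
            }
            System.out.println("schedulable right after:   " + host.isSchedulable(now));
            System.out.println("schedulable after timeout: "
                    + host.isSchedulable(now + ERROR_TIMEOUT_MS));
        }
    }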
> >>> I can see two reasons a host will go into error state:
> >>> 1. monitoring didn't detect an issue yet, and the host would
> >>> have/will/should go into non-operational mode.
> >>> if the host goes into non-operational mode and auto-recovers with
> >>> the above flow, I guess it is fine.
> >>>
> >>> 2. the cause of the failure isn't something we monitor for
> >>> (upgraded to a bad version of qemu, or qemu got corrupted).
> >>>
> >>> now, the error mode was developed quite a long time ago (August
> >>> 2007 IIRC), so it could be that it mostly compensated for the first
> >>> reason, which is now better monitored.
> >>> I wonder how often the error state is seen due to a reason which
> >>> isn't already monitored.
> >>> Moran - do you have examples of when you see the error state on
> >>> hosts?
> >>
> >> usually it happened when there was a problematic/misconfigured
> >> vdsm/libvirt which failed to run VMs (nothing we can recover from).
> >> I haven't faced the "host is too loaded" issue - that status has
> >> some other symptoms - however the behaviour in that state is very
> >> much the same: waiting for 30 min (?) and then moving it to
> >> activated.
> >> Moran.
> >
> > 'host is too loaded' is the only transient state where a temporary
> > 'error' state makes sense, but at the same time, it can also fit the
> > 'non operational' state description.
> > From my experience, the problem of a misconfigured KVM/libvirt/VDSM
> > is never temporary (= magically solved by itself, without concrete
> > user intervention). IMHO, it should move the host to an error state
> > that it would not automatically recover from.
> > Regardless, consolidating the names of the states ('inactive,
> > detached, non operational, maintenance, error, unknown' ...) would be
> > nice too. Probably can't be done for all, of course.
> > Y.
> 
> agreed, most of the causes of the ERROR state aren't transient, but
> it looks to me as if this state is redundant and could be taken care
> of as part of the other host states, since the way it's being used
> today isn't very helpful either.
> Moran.
However, I can envision an ERROR state that you don't want to keep a retry mechanism on...
which might be a different behavior from the NON-OP one.
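If the two statuses were ever unified, that distinction could still be kept as
a property of the failure rather than a separate status. A hypothetical way to
express it (all names invented here, not an existing engine API):

    // Sketch: tag each failure with whether monitoring should keep retrying it.
    public class RecoverySplitSketch {

        enum FailureKind {
            TRANSIENT_LOAD(true),         // e.g. "host is too loaded" -> retry later
            BROKEN_QEMU_OR_VDSM(false);   // misconfigured stack -> wait for the admin

            final boolean autoRetry;
            FailureKind(boolean autoRetry) { this.autoRetry = autoRetry; }
        }

        static boolean shouldMonitoringRetry(FailureKind reason) {
            return reason.autoRetry;
        }

        public static void main(String[] args) {
            for (FailureKind kind : FailureKind.values()) {
                System.out.println(kind + " -> auto retry: " + shouldMonitoringRetry(kind));
            }
        }
    }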
> 
> 
> >
> >
> >
> 
> _______________________________________________
> Engine-devel mailing list
> Engine-devel at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/engine-devel
> 


