[Engine-devel] Autorecovery feature plan for review

Itamar Heim iheim at redhat.com
Fri Feb 17 00:25:06 UTC 2012


On 02/16/2012 11:22 AM, Livnat Peer wrote:
> On 16/02/12 10:01, Moran Goldboim wrote:
>> On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
>>> On 02/16/2012 09:29 AM, Moran Goldboim wrote:
>>>> On 02/16/2012 12:38 AM, Itamar Heim wrote:
>>>>> On 02/15/2012 07:02 PM, Livnat Peer wrote:
>>>>>> On 15/02/12 18:28, Ayal Baron wrote:
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> A short summary from the call today, please correct me if I
>>>>>>>> forgot or
>>>>>>>> misunderstood something.
>>>>>>>>
>>>>>>>> Ayal argued against reactivating the failed host/storage domain
>>>>>>>> by a periodically executed job; he would prefer it if the engine
>>>>>>>> could [try to] correct the problem right on discovery.
>>>>>>>> Livnat's point was that this is hard to implement and that it is
>>>>>>>> OK if we move it to the Non-Operational state and periodically
>>>>>>>> check it again.
>>>>>>>>
>>>>>>>> There was a little arguing over whether we call the current
>>>>>>>> behavior a bug or missing behavior; I believe this is not very
>>>>>>>> important.
>>>>>>>>
>>>>>>>> I did not fully understand the last few sentences from Livnat;
>>>>>>>> did we manage to agree on a change to the plan?
>>>>>>>
>>>>>>> A couple of points that we agreed upon:
>>>>>>> 1. No need for a new mechanism; just initiate this from the
>>>>>>> monitoring context.
>>>>>>>      Preferably, if not difficult, evaluate the monitoring data;
>>>>>>> if the host should remain in non-op, then don't bother running
>>>>>>> initVdsOnUp.
>>>>>>> 2. Configuration of when to call initVdsOnUp is orthogonal to the
>>>>>>> auto-init behaviour; if introduced, it should be on by default,
>>>>>>> the user should be able to configure it either on or off for the
>>>>>>> host in general (no lower granularity), and it can only be
>>>>>>> configured via the API.
>>>>>>> When disabled, initVdsOnUp would be called only when the admin
>>>>>>> activates the host/storage, and any error would keep it inactive
>>>>>>> (I still don't understand why this is at all needed, but whatever).
>>>>>>>
>>>>>>
>>>>>> Also, a note from Moran on the call was to check whether we can
>>>>>> unify the Non-Operational and Error statuses of the host.
>>>>>> It was mentioned on the call that the reason for having the Error
>>>>>> state is recovery (timing out of the Error state), but since we are
>>>>>> about to recover from the Non-Operational status as well, there is
>>>>>> no reason to have two different statuses.
>>>>>
>>>>> They are not exactly the same.
>>>>> Or should I say, Error is supposed to be used when the reason isn't
>>>>> related to the host being non-operational.
>>>>>
>>>>> What is the Error state?
>>>>> A host will go into the Error state if it fails to run 3
>>>>> (configurable) VMs that then succeeded in running on another host on
>>>>> retry; i.e., something is wrong with that host, and it is failing to
>>>>> launch VMs.
>>>>> As it happens, it already "auto recovers" from this mode after a
>>>>> certain period of time.
>>>>>
>>>>> Why? Because the host will fail to run virtual machines and will be
>>>>> the least loaded, so it will be the first target selected to run
>>>>> them, which will continue to fail.
>>>>>
>>>>> So there is a negative scoring mechanism on the number of errors,
>>>>> until the host is taken out for a while.
>>>>>
>>>>> (I don't remember if the reverse is true and the VM goes into Error
>>>>> mode if it failed to launch on all hosts after the number of
>>>>> retries. I think this wasn't needed and the user just got an error
>>>>> in the audit log.)
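
The negative-scoring and timeout behaviour described above could look roughly like the following sketch. The threshold of 3 failures and the ~30-minute timeout are taken from this thread; the class and method names are hypothetical, not the engine's code:

```python
# Illustrative sketch (assumed names and structure, not engine code):
# after a configurable number of VM launch failures the host is moved
# to Error, taking it out of scheduling, and it auto-recovers once the
# timeout elapses.

ERROR_TIMEOUT = 30 * 60  # seconds; the thread mentions roughly 30 minutes
FAILURE_THRESHOLD = 3    # configurable number of failed VM launches

class HostScore:
    def __init__(self):
        self.failures = 0
        self.status = "Up"
        self.error_since = None

    def record_vm_launch_failure(self, now):
        # a VM failed to launch here but succeeded on another host on retry
        self.failures += 1
        if self.failures >= FAILURE_THRESHOLD:
            self.status = "Error"
            self.error_since = now

    def maybe_recover(self, now):
        # auto-recovery: leave Error once the timeout has elapsed, so the
        # host gets another chance instead of staying out forever
        if self.status == "Error" and now - self.error_since >= ERROR_TIMEOUT:
            self.status = "Up"
            self.failures = 0
```

This shows why the timeout exists: without it, the least-loaded (failing) host would keep being selected and keep failing, so it is benched for a while rather than permanently.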
>>>>>
>>>>> I can see two reasons a host will go into the Error state:
>>>>> 1. Monitoring didn't detect an issue yet, and the host would
>>>>> have gone/will/should go into Non-Operational mode.
>>>>> If the host goes into Non-Operational mode and auto recovers
>>>>> with the above flow, I guess that is fine.
>>>>>
>>>>> 2. The cause of the failure isn't something we monitor for (e.g. the
>>>>> host was upgraded to a bad version of qemu, or qemu got corrupted).
>>>>>
>>>>> Now, the Error mode was developed quite a long time ago (August 2007
>>>>> IIRC), so it could be that it mostly compensated for the first
>>>>> reason, which is now better monitored.
>>>>> I wonder how often the Error state is seen for a reason which isn't
>>>>> already monitored.
>>>>> Moran - do you have examples of when you see the Error state on
>>>>> hosts?
>>>>
>>>> Usually it happened when there was a problematic/misconfigured
>>>> vdsm/libvirt which failed to run VMs (nothing we can recover from).
>>>> I haven't faced the "host is too loaded" issue; that status has some
>>>> other symptoms. However, the behaviour in that state is very much the
>>>> same: waiting for 30 min (?) and then moving it to activated.
>>>> Moran.
>>>
>>> "Host is too loaded" is the only transient state where a temporary
>>> Error state makes sense, but at the same time it can also fit the
>>> Non-Operational state description.
>>> From my experience, the problem with a misconfigured KVM/libvirt/VDSM
>>> is never temporary (i.e., magically solved by itself, without concrete
>>> user intervention). IMHO, it should move the host to an Error state
>>> that it would not automatically recover from.
>>> Regardless, consolidating the names of the states ('inactive,
>>> detached, non operational, maintenance, error, unknown' ...) would be
>>> nice too. Probably this can't be done for all of them, of course.
>>> Y.
>>
>> Agreed, most of the causes of the Error state aren't transient, but it
>> looks to me as if this state is redundant and could be taken care of
>> as part of the other host states, since the way it's being used today
>> isn't very helpful either.
>> Moran.
>>
>
> Currently the host status is changed to Non-Operational for various
> reasons; some of them are static, like the vdsm version and CPU model,
> and some of them are (potentially) transient, like a network failure.
>
> The Error state, as Itamar detailed earlier in this thread, is
> currently used for what I would call (potentially) transient reasons.
>
> The original intention (I think) was to move a host to Non-Operational
> for reasons which are static and to Error for reasons which are
> transient, and I guess that is why there is a timeout on the Error
> state and OE tries to initialize a host after 30 minutes in the Error
> state.
>
> The problem is that as the code evolved, this is no longer the case.
> I suggest that we use the Non-Operational state for transient reasons,
> which we detect in the monitoring flow or on execution failures, and do
> the initialization retry as Laszlo suggested in the document; and that
> we use the Error state for static errors and remove the 'timeout'
> mechanism we currently have (from the Error state).
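
The split proposed here amounts to classifying the failure reason. A minimal sketch of that classification follows; the example reasons (vdsm version, CPU model, network failure) come from this thread, while the reason names, the function, and the returned tuple are purely illustrative:

```python
# Illustrative sketch of the proposed split (assumed names, not engine
# code): static reasons go to Error with no timeout, transient reasons
# go to Non-Operational with a periodic activation retry.

STATIC_REASONS = {"vdsm_version", "cpu_model"}

def next_status(reason):
    """Return (status, will_auto_retry) for a failed host."""
    if reason in STATIC_REASONS:
        # Error: no timeout mechanism; an admin must intervene
        return ("Error", False)
    # Non-Operational: the monitoring flow periodically retries activation
    return ("NonOperational", True)
```

The design point is that retrying only makes sense for reasons that can clear up on their own; a static mismatch like an incompatible CPU model will fail identically on every retry.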

We are just adding a retry mechanism where we didn't have one.
I wouldn't remove the one we have so soon, as we may get it back very
fast as a "need retry/timeout on errors" request.
It sounds like the two statuses are indeed different - but even if we
think Error covers mostly non-transient reasons, we can't be sure.


