[Engine-devel] Autorecovery feature plan for review

Thu Feb 16 08:45:41 UTC 2012

On 02/16/2012 10:28 AM, Miki Kenneth wrote:
>
> ----- Original Message -----
>> From: "Moran Goldboim"<mgoldboi at redhat.com>
>> To: "Yaniv Kaul"<ykaul at redhat.com>
>> Cc: engine-devel at ovirt.org
>> Sent: Thursday, February 16, 2012 10:01:37 AM
>> Subject: Re: [Engine-devel] Autorecovery feature plan for review
>>
>> On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
>>> On 02/16/2012 09:29 AM, Moran Goldboim wrote:
>>>> On 02/16/2012 12:38 AM, Itamar Heim wrote:
>>>>> On 02/15/2012 07:02 PM, Livnat Peer wrote:
>>>>>> On 15/02/12 18:28, Ayal Baron wrote:
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> A short summary from the call today, please correct me if I
>>>>>>>> forgot or
>>>>>>>> misunderstood something.
>>>>>>>>
>>>>>>>> Ayal argued that the failed host/storagedomain should be
>>>>>>>> reactivated
>>>>>>>> by a periodically executed job, he would prefer if the engine
>>>>>>>> could
>>>>>>>> [try to] correct the problem right on discovery.
>>>>>>>> Livnat's point was that this is hard to implement and it is OK
>>>>>>>> if we
>>>>>>>> move it to Nonoperational state and periodically check it
>>>>>>>> again.
>>>>>>>>
>>>>>>>> There was a little arguing if we call the current behavior a
>>>>>>>> bug
>>>>>>>> or a
>>>>>>>> missing behavior, I believe this is not quite important.
>>>>>>>>
>>>>>>>> I did not fully understand the last few sentences from Livnat,
>>>>>>>> did we
>>>>>>>> manage to agree on a change in the plan?
>>>>>>> A couple of points that we agreed upon:
>>>>>>> 1. no need for new mechanism, just initiate this from the
>>>>>>> monitoring context.
>>>>>>>      Preferably, if not difficult, evaluate the monitoring data,
>>>>>>>      if
>>>>>>> host should remain in non-op then don't bother running
>>>>>>> initVdsOnUp
>>>>>>> 2. configuration of when to call initVdsOnUp is orthogonal to
>>>>>>> auto-init behaviour and if introduced should be on by default
>>>>>>> and
>>>>>>> user should be able to configure this either on or off for the
>>>>>>> host in general (no lower granularity) and can only be
>>>>>>> configured
>>>>>>> via the API.
>>>>>>> When disabled initVdsOnUp would be called only when admin
>>>>>>> activates the host/storage and any error would keep it inactive
>>>>>>> (I
>>>>>>> still don't understand why this is at all needed but whatever).
>>>>>>>
>>>>>> Also a note from Moran on the call was to check if we can unify
>>>>>> the
>>>>>> non-operational and Error statuses of the host.
>>>>>> It was mentioned on the call that the reason for having ERROR
>>>>>> state is
>>>>>> for recovery (time out of the error state) but since we are
>>>>>> about to
>>>>>> recover from non-operational status as well there is no reason
>>>>>> to have
>>>>>> two different statuses.
>>>>> they are not exactly the same.
>>>>> or should i say, error is supposed to be when reason isn't
>>>>> related
>>>>> to host being non-operational.
>>>>>
>>>>> what is error state?
>>>>> a host will go into error state if it fails to run 3
>>>>> (configurable)
>>>>> VMs, that succeeded running on other host on retry.
>>>>> i.e., something is wrong with that host, failing to launch VMs.
>>>>> as it happens, it already "auto recovers" for this mode after a
>>>>> certain period of time.
>>>>>
>>>>> why? because the host will fail to run virtual machines, and will
>>>>> be
>>>>> the least loaded, so it will be the first target selected to run
>>>>> them, which will continue to fail.
>>>>>
>>>>> so there is a negative scoring mechanism on number of errors,
>>>>> till
>>>>> host is taken out for a while.
>>>>>
>>>>> (I don't remember if the reverse is true and the VM goes into
>>>>> error
>>>>> mode if the VM failed to launch on all hosts per number of
>>>>> retries.
>>>>> i think this wasn't needed and user just got an error in audit
>>>>> log)
>>>>>
>>>>> i can see two reasons a host will go into error state:
>>>>> 1. monitoring didn't detect an issue yet, and host would
>>>>> have/will/should go into non-operational mode.
>>>>> if host will go into non-operational mode, and will auto recover
>>>>> with the above flow, i guess it is fine.
>>>>>
>>>>> 2. cause for failure isn't something we monitor for (upgraded to
>>>>> a
>>>>> bad version of qemu, or qemu got corrupted).
>>>>>
>>>>> now, the error mode was developed quite a long time ago (august
>>>>> 2007
>>>>> iirc), so could be it mostly compensated for the first reason
>>>>> which
>>>>> is now better monitored.
>>>>> i wonder how often error state is seen due to a reason which
>>>>> isn't
>>>>> monitored already.
>>>>> moran - do you have examples of when you see error state of
>>>>> hosts?
>>>> usually it happened when there was a problematic/misconfigured
>>>> vdsm/libvirt which failed to run vms (nothing we can recover
>>>> from).
>>>> i haven't faced the issue of "host is too loaded"; that status has
>>>> some other symptoms, however the behaviour in that state is very
>>>> much the same: waiting for 30 min (?) and then moving it to
>>>> activated.
>>>> Moran.
>>> 'host is too loaded' is the only transient state
>>> where a
>>> temporary 'error' state makes sense, but at the same time, it can
>>> also
>>> fit the 'non operational' state description.
>>>  From my experience, the problem with KVM/libvirt/VDSM
>>> mis-configured
>>> is never temporary, (= magically solved by itself, without concrete
>>> user intervention). IMHO, it should move the host to an error state
>>> that it would not automatically recover from.
>>> Regardless, consolidating the names of the states ('inactive,
>>> detached, non operational, maintenance, error, unknown' ...) would
>>> be
>>> nice too. Probably can't be done for all, of course.
>>> Y.
>> agreed, most of the causes of ERROR state aren't transient, but looks
>> to
>> me as if this state is redundant and could be taken care of as part of
>> the
>> other host states, since the way it's being used today isn't very
>> helpful as well.
>> Moran.
> However, I can envision an ERROR state that you don't want to keep retry mechanism on...
> which might be a different behavior than the NON-OP one.

it still means that the host will be non-operational, just that you
don't want to perform retries on it. it needs to be divided into
transient/non-transient treatments (may apply to other scenarios as well,
like qemu isn't there or virt isn't enabled in bios etc)
Moran.
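
The transient/non-transient split described above can be sketched roughly
as follows. This is only an illustration in Python, not engine code: the
reason names, the TRANSIENT set, the Host fields and the functions
should_retry() and monitoring_pass() are all hypothetical, and which
reasons count as transient is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class NonOpReason(Enum):
    # Hypothetical reasons, loosely named after examples in this thread.
    HOST_TOO_LOADED = "host is too loaded"
    STORAGE_UNREACHABLE = "storage domain unreachable"
    QEMU_MISSING = "qemu isn't there"
    VIRT_DISABLED_IN_BIOS = "virt isn't enabled in bios"
    VDSM_MISCONFIGURED = "vdsm/libvirt misconfigured"


# Assumption: reasons that may clear up without admin intervention.
TRANSIENT = frozenset(
    {NonOpReason.HOST_TOO_LOADED, NonOpReason.STORAGE_UNREACHABLE}
)


@dataclass
class Host:
    name: str
    reason: Optional[NonOpReason] = None  # None means the host is up
    auto_recovery: bool = True  # assumed per-host switch, on by default


def should_retry(host: Host) -> bool:
    """Return True if a periodic monitoring pass should retry this host."""
    if host.reason is None:
        return False  # host is up, nothing to recover
    if not host.auto_recovery:
        return False  # admin disabled auto recovery for this host
    return host.reason in TRANSIENT  # non-transient reasons wait for the admin


def monitoring_pass(hosts: List[Host]) -> List[str]:
    """Return the names of hosts that would be re-initialized on this pass."""
    return [h.name for h in hosts if should_retry(h)]
```

With this shape there is a single non-operational status, and the reason
attached to it decides whether the host is retried or left for the admin to
activate by hand.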
>>
>>>
>>>> _______________________________________________
>>>> Engine-devel mailing list
>>>> Engine-devel at ovirt.org
>>>> http://lists.ovirt.org/mailman/listinfo/engine-devel