On 02/16/2012 11:22 AM, Livnat Peer wrote:
On 16/02/12 10:01, Moran Goldboim wrote:
> On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
>> On 02/16/2012 09:29 AM, Moran Goldboim wrote:
>>> On 02/16/2012 12:38 AM, Itamar Heim wrote:
>>>> On 02/15/2012 07:02 PM, Livnat Peer wrote:
>>>>> On 15/02/12 18:28, Ayal Baron wrote:
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> Hi,
>>>>>>>
>>>>>>> A short summary from the call today, please correct me if I
>>>>>>> forgot or
>>>>>>> misunderstood something.
>>>>>>>
>>>>>>> Ayal argued that the failed host/storage domain should be
>>>>>>> reactivated by a periodically executed job; he would prefer it if
>>>>>>> the engine could [try to] correct the problem right on discovery.
>>>>>>> Livnat's point was that this is hard to implement and that it is OK
>>>>>>> if we move it to the Non-Operational state and periodically check
>>>>>>> it again.
>>>>>>>
>>>>>>> There was a little argument over whether we call the current
>>>>>>> behavior a bug or missing behavior; I believe this is not that
>>>>>>> important.
>>>>>>>
>>>>>>> I did not fully understand the last few sentences from Livnat -
>>>>>>> did we manage to agree on a change in the plan?
>>>>>>
>>>>>> A couple of points that we agreed upon:
>>>>>> 1. No need for a new mechanism; just initiate this from the
>>>>>> monitoring context (sketched below).
>>>>>> Preferably, if not difficult, evaluate the monitoring data; if the
>>>>>> host should remain in non-op, then don't bother running initVdsOnUp.
>>>>>> 2. Configuration of when to call initVdsOnUp is orthogonal to the
>>>>>> auto-init behaviour; if introduced, it should be on by default, the
>>>>>> user should be able to configure it either on or off for the host in
>>>>>> general (no lower granularity), and it can only be configured via
>>>>>> the API.
>>>>>> When disabled, initVdsOnUp would be called only when the admin
>>>>>> activates the host/storage, and any error would keep it inactive (I
>>>>>> still don't understand why this is at all needed, but whatever).
>>>>>>
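>>>>>> A minimal sketch of how points 1 and 2 could fit together - all
>>>>>> names here (Host, HostStatus, MonitoringCycle) are hypothetical,
>>>>>> not the actual engine code:
>>>>>>
>>>>>> enum HostStatus { UP, NON_OPERATIONAL, ERROR }
>>>>>>
>>>>>> class Host {
>>>>>>     HostStatus status = HostStatus.NON_OPERATIONAL;
>>>>>>     // point 2: per-host flag, on by default, API-configurable only
>>>>>>     boolean autoRecoveryEnabled = true;
>>>>>> }
>>>>>>
>>>>>> class MonitoringCycle {
>>>>>>     // point 1: called from the existing monitoring context,
>>>>>>     // no new mechanism
>>>>>>     void onMonitoringData(Host host, boolean problemStillPresent) {
>>>>>>         if (host.status != HostStatus.NON_OPERATIONAL) {
>>>>>>             return; // nothing to recover
>>>>>>         }
>>>>>>         if (!host.autoRecoveryEnabled) {
>>>>>>             return; // recover only on explicit admin activation
>>>>>>         }
>>>>>>         // evaluate the monitoring data first: if the host should
>>>>>>         // remain in non-op, don't bother running initVdsOnUp
>>>>>>         if (problemStillPresent) {
>>>>>>             return;
>>>>>>         }
>>>>>>         initVdsOnUp(host);
>>>>>>     }
>>>>>>
>>>>>>     void initVdsOnUp(Host host) {
>>>>>>         // stands in for the existing activation flow; on any
>>>>>>         // error the host would simply stay non-operational
>>>>>>         host.status = HostStatus.UP;
>>>>>>     }
>>>>>> }
>>>>>>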
>>>>>
>>>>> Also a note from Moran on the call was to check if we can unify the
>>>>> non-operational and Error statuses of the host.
>>>>> It was mentioned on the call that the reason for having the Error
>>>>> state is recovery (timing out of the Error state), but since we are
>>>>> about to recover from the non-operational status as well, there is
>>>>> no reason to have two different statuses.
>>>>
>>>> They are not exactly the same.
>>>> Or should I say, Error is supposed to be used when the reason isn't
>>>> related to the host being non-operational.
>>>>
>>>> What is the Error state?
>>>> A host will go into the Error state if it fails to run 3 (configurable)
>>>> VMs that succeeded in running on another host on retry.
>>>> I.e., something is wrong with that host - it is failing to launch VMs.
>>>> As it happens, it already "auto recovers" from this mode after a
>>>> certain period of time.
>>>>
>>>> Why? Because the host will fail to run virtual machines and will be
>>>> the least loaded, so it will be the first target selected to run
>>>> them, which will continue to fail.
>>>>
>>>> So there is a negative scoring mechanism based on the number of
>>>> errors, until the host is taken out for a while.
>>>>
>>>> (I don't remember if the reverse is true and the VM goes into Error
>>>> mode if it failed to launch on all hosts after the number of retries.
>>>> I think this wasn't needed and the user just got an error in the
>>>> audit log.)
>>>>
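>>>> In rough pseudo-Java, reusing the hypothetical Host/HostStatus stubs
>>>> from the sketch above (illustrative only, not the real code):
>>>>
>>>> class ErrorScoring {
>>>>     static final int MAX_FAILED_VMS = 3; // configurable
>>>>     // "a certain period of time"; ~30 min according to Moran below
>>>>     static final long ERROR_TIMEOUT_MS = 30L * 60 * 1000;
>>>>
>>>>     int failedVms; // VMs that failed here but ran elsewhere on retry
>>>>     long erroredAtMs;
>>>>
>>>>     // a VM failed to launch on this host, then succeeded on retry
>>>>     // on another host
>>>>     void onFailedHereRanElsewhere(Host host, long nowMs) {
>>>>         if (++failedVms >= MAX_FAILED_VMS) {
>>>>             host.status = HostStatus.ERROR; // stop scheduling to it
>>>>             erroredAtMs = nowMs;
>>>>         }
>>>>     }
>>>>
>>>>     // periodic check that times the host out of the Error state
>>>>     void maybeAutoRecover(Host host, long nowMs) {
>>>>         if (host.status == HostStatus.ERROR
>>>>                 && nowMs - erroredAtMs >= ERROR_TIMEOUT_MS) {
>>>>             host.status = HostStatus.UP;
>>>>             failedVms = 0;
>>>>         }
>>>>     }
>>>> }
>>>>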
>>>> I can see two reasons a host will go into the Error state:
>>>> 1. Monitoring didn't detect an issue yet, and the host would
>>>> have/will/should go into non-operational mode.
>>>> If the host goes into non-operational mode and auto-recovers
>>>> with the above flow, I guess that is fine.
>>>>
>>>> 2. The cause of failure isn't something we monitor for (e.g., an
>>>> upgrade to a bad version of qemu, or qemu got corrupted).
>>>>
>>>> Now, the Error mode was developed quite a long time ago (August 2007
>>>> IIRC), so it could be that it mostly compensated for the first reason,
>>>> which is now better monitored.
>>>> I wonder how often the Error state is seen due to a reason which isn't
>>>> already monitored.
>>>> Moran - do you have examples of when you see hosts in the Error state?
>>>
>>> Usually it happened when there was a problematic/misconfigured
>>> vdsm/libvirt which failed to run VMs (nothing we can recover from).
>>> I haven't faced the "host is too loaded" issue; that status has
>>> some other symptoms. However, the behaviour in that state is very
>>> much the same - waiting for 30 min (?) and then moving it to activated.
>>> Moran.
>>
>> 'Host is too loaded' is the only transient state where a temporary
>> 'error' state makes sense, but at the same time, it can also fit the
>> 'non operational' state description.
>> From my experience, the problem of a misconfigured KVM/libvirt/VDSM is
>> never temporary (= magically solved by itself, without concrete user
>> intervention). IMHO, it should move the host to an error state that it
>> would not automatically recover from.
>> Regardless, consolidating the names of the states ('inactive,
>> detached, non operational, maintenance, error, unknown' ...) would be
>> nice too. Probably can't be done for all, of course.
>> Y.
>
> Agreed, most of the causes of the Error state aren't transient, but it
> looks to me as if this state is redundant and could be taken care of as
> part of the other host states, since the way it's being used today isn't
> very helpful either.
> Moran.
>
> Currently the host status is changed to non-operational for various
> reasons; some of them are static, like the vdsm version and CPU model,
> and some of them are (potentially) transient, like a network failure.
> The Error state, as Itamar detailed earlier on this thread, is currently
> used for what I would call (potentially) transient reasons.
> The original intention (I think) was to move a host to non-operational
> for reasons which are static and to Error for reasons which are
> transient, and I guess that is why there is a timeout on the Error state
> and OE tries to initialize a host after 30 minutes in the Error state.
> The problem is that as the code evolved this is no longer the case.
> I suggest that we use the non-operational state for transient reasons,
> which we detect in the monitoring flow or on execution failures, and do
> the initialization retry as Laszlo suggested in the document; and that
> we use the Error state for static errors and remove the 'timeout'
> mechanism we currently have (from the Error state).
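>
> As a hypothetical sketch of that split, reusing the stubs from the
> earlier sketches (the reason lists are examples only):
>
> enum Reason {
>     NETWORK_FAILURE(true),  // (potentially) transient
>     STORAGE_FAILURE(true),  // (potentially) transient
>     VDSM_VERSION(false),    // static
>     CPU_MODEL(false);       // static
>
>     final boolean transientReason;
>     Reason(boolean t) { this.transientReason = t; }
> }
>
> class ProposedHandling {
>     void onProblem(Host host, Reason reason) {
>         if (reason.transientReason) {
>             // detected in the monitoring flow or on execution
>             // failure: non-operational, retried periodically via
>             // initVdsOnUp
>             host.status = HostStatus.NON_OPERATIONAL;
>         } else {
>             // static: Error, with no timeout - the host stays there
>             // until the admin fixes the problem and re-activates it
>             host.status = HostStatus.ERROR;
>         }
>     }
> }
>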
We are just adding a retry mechanism where we didn't have one.
I wouldn't remove the one we have so soon, as we may get it back very
quickly as a 'need retry/timeout on errors' requirement.
It sounds like both statuses are indeed different - but even if we think
Error covers mostly non-transient reasons, we can't be sure.