On 02/16/2012 10:28 AM, Miki Kenneth wrote:
----- Original Message -----
> From: "Moran Goldboim"<mgoldboi(a)redhat.com>
> To: "Yaniv Kaul"<ykaul(a)redhat.com>
> Cc: engine-devel(a)ovirt.org
> Sent: Thursday, February 16, 2012 10:01:37 AM
> Subject: Re: [Engine-devel] Autorecovery feature plan for review
>
> On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
>> On 02/16/2012 09:29 AM, Moran Goldboim wrote:
>>> On 02/16/2012 12:38 AM, Itamar Heim wrote:
>>>> On 02/15/2012 07:02 PM, Livnat Peer wrote:
>>>>> On 15/02/12 18:28, Ayal Baron wrote:
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> Hi,
>>>>>>>
>>>>>>> A short summary from the call today, please correct me if I
>>>>>>> forgot or
>>>>>>> misunderstood something.
>>>>>>>
>>>>>>> Ayal argued that the failed host/storagedomain should be
>>>>>>> reactivated by a periodically executed job, he would prefer if
>>>>>>> the engine could [try to] correct the problem right on discovery.
>>>>>>> Livnat's point was that this is hard to implement and it is OK
>>>>>>> if we move it to Nonoperational state and periodically check it
>>>>>>> again.
>>>>>>>
>>>>>>> There was some argument over whether to call the current
>>>>>>> behavior a bug or a missing behavior, I believe this is not
>>>>>>> very important.
>>>>>>>
>>>>>>> I did not fully understand the last few sentences from Livnat,
>>>>>>> did we manage to agree on a change in the plan?
>>>>>> A couple of points that we agreed upon:
>>>>>> 1. no need for new mechanism, just initiate this from the
>>>>>> monitoring context.
>>>>>> Preferably, if not difficult, evaluate the monitoring data; if
>>>>>> host should remain in non-op then don't bother running
>>>>>> initVdsOnUp
>>>>>> 2. configuration of when to call initVdsOnUp is orthogonal to
>>>>>> auto-init behaviour and if introduced should be on by default and
>>>>>> user should be able to configure this either on or off for the
>>>>>> host in general (no lower granularity) and can only be configured
>>>>>> via the API.
>>>>>> When disabled initVdsOnUp would be called only when admin
>>>>>> activates the host/storage and any error would keep it inactive
>>>>>> (I still don't understand why this is at all needed but whatever).
>>>>>>
>>>>> Also a note from Moran on the call was to check if we can unify the
>>>>> non-operational and Error statuses of the host.
>>>>> It was mentioned on the call that the reason for having ERROR state
>>>>> is for recovery (time out of the error state) but since we are about
>>>>> to recover from non-operational status as well there is no reason to
>>>>> have two different statuses.
>>>> they are not exactly the same.
>>>> or should i say, error is supposed to be when reason isn't related
>>>> to host being non-operational.
>>>>
>>>> what is error state?
>>>> a host will go into error state if it fails to run 3 (configurable)
>>>> VMs, that succeeded running on other host on retry.
>>>> i.e., something is wrong with that host, failing to launch VMs.
>>>> as it happens, it already "auto recovers" for this mode after a
>>>> certain period of time.
>>>>
>>>> why? because the host will fail to run virtual machines, and will be
>>>> the least loaded, so it will be the first target selected to run
>>>> them, which will continue to fail.
>>>>
>>>> so there is a negative scoring mechanism on number of errors, till
>>>> host is taken out for a while.
>>>>
>>>> (I don't remember if the reverse is true and the VM goes into error
>>>> mode if the VM failed to launch on all hosts per number of retries.
>>>> i think this wasn't needed and user just got an error in audit log)
>>>>
>>>> i can see two reasons a host will go into error state:
>>>> 1. monitoring didn't detect an issue yet, and host would
>>>> have/will/should go into non-operational mode.
>>>> if host will go into non-operational mode, and will auto recover
>>>> with the above flow, i guess it is fine.
>>>>
>>>> 2. cause for failure isn't something we monitor for (upgraded to a
>>>> bad version of qemu, or qemu got corrupted).
>>>>
>>>> now, the error mode was developed quite a long time ago (august 2007
>>>> iirc), so could be it mostly compensated for the first reason which
>>>> is now better monitored.
>>>> i wonder how often error state is seen due to a reason which isn't
>>>> monitored already.
>>>> moran - do you have examples of when you see error state of hosts?
>>> usually it happened when there was a problematic/misconfigured
>>> vdsm/libvirt which failed to run vms (nothing we can recover from) -
>>> i haven't faced the issue of "host is too loaded"; that status has
>>> some other symptoms, however the behaviour on that state is very
>>> much the same - waiting for 30 min (?) and then move it to
>>> activated.
>>> Moran.
>> 'host is too loaded' is the only transient state where a temporary
>> 'error' state makes sense, but at the same time, it can also fit the
>> 'non operational' state description.
>> From my experience, the problem with KVM/libvirt/VDSM mis-configured
>> is never temporary (= magically solved by itself, without concrete
>> user intervention). IMHO, it should move the host to an error state
>> that it would not automatically recover from.
>> Regardless, consolidating the names of the states ('inactive,
>> detached, non operational, maintenance, error, unknown' ...) would be
>> nice too. Probably can't be done for all, of course.
>> Y.
> agreed, most of the causes of ERROR state aren't transient, but it
> looks to me as if this state is redundant and could be taken care of
> as part of the other host states, since the way it's being used today
> isn't very helpful either.
> Moran.
However, I can envision an ERROR state that you don't want to keep the
retry mechanism on, which might be a different behavior than the NON-OP one.
it still means that the host will be non-operational, just that you
don't want to perform retries on it. it needs to be divided into
transient/non-transient treatments (may apply to other scenarios as well,
like qemu isn't there or virt isn't enabled in bios etc)
Moran.
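
To make the transient/non-transient split concrete, here is a minimal
sketch in Python (not engine code). All names in it (Host, FailureKind,
monitoring_cycle, try_activate) are hypothetical and only illustrate the
idea: the periodic monitoring pass retries a non-operational host only
when its failure is classified as transient and auto-recovery is enabled
for that host (on by default, as discussed on the call).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class HostStatus(Enum):
    UP = "up"
    NON_OPERATIONAL = "non_operational"


class FailureKind(Enum):
    # Assumed classification, not an existing engine enum.
    TRANSIENT = "transient"          # e.g. host is temporarily too loaded
    NON_TRANSIENT = "non_transient"  # e.g. qemu missing, virt disabled in BIOS


@dataclass
class Host:
    name: str
    status: HostStatus = HostStatus.UP
    failure_kind: Optional[FailureKind] = None
    auto_recovery_enabled: bool = True  # per-host switch, on by default


def mark_failed(host: Host, kind: FailureKind) -> None:
    """Move the host to non-operational and remember why."""
    host.status = HostStatus.NON_OPERATIONAL
    host.failure_kind = kind


def should_retry(host: Host) -> bool:
    """Retry only transient failures on hosts with auto-recovery on."""
    return (
        host.status is HostStatus.NON_OPERATIONAL
        and host.auto_recovery_enabled
        and host.failure_kind is FailureKind.TRANSIENT
    )


def monitoring_cycle(hosts: List[Host],
                     try_activate: Callable[[Host], bool]) -> List[str]:
    """One pass of the periodic monitoring job.

    try_activate is a hypothetical callback standing in for whatever
    re-initialises the host; it returns True on success.
    """
    recovered = []
    for host in hosts:
        if should_retry(host) and try_activate(host):
            host.status = HostStatus.UP
            host.failure_kind = None
            recovered.append(host.name)
    return recovered
```

With this shape a host that failed for a non-transient reason stays
non-operational until an admin activates it, so there is no need for a
separate ERROR status just to turn retries off.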
>
>>
>>> _______________________________________________
>>> Engine-devel mailing list
>>> Engine-devel(a)ovirt.org
>>> http://lists.ovirt.org/mailman/listinfo/engine-devel
>