[Engine-devel] Autorecovery feature plan for review

Wed Feb 15 17:02:35 UTC 2012

On 15/02/12 18:28, Ayal Baron wrote:
> 
> 
> ----- Original Message -----
>> Hi,
>>
>> A short summary from the call today, please correct me if I forgot or
>> misunderstood something.
>>
>> Ayal argued against reactivating the failed host/storagedomain via a
>> periodically executed job; he would prefer that the engine
>> [try to] correct the problem right on discovery.
>> Livnat's point was that this is hard to implement and it is OK if we
>> move it to Nonoperational state and periodically check it again.
>>
>> There was some argument over whether to call the current behavior a
>> bug or a missing behavior; I believe this is not very important.
>>
>> I did not fully understand the last few sentences from Livnat, did we
>> manage to agree on a change in the plan?
> 
> A couple of points that we agreed upon:
> 1. no need for new mechanism, just initiate this from the monitoring context.
>    Preferably, if not difficult, evaluate the monitoring data first; if the host should remain in non-op, don't bother running initVdsOnUp.
> 2. Configuration of when to call initVdsOnUp is orthogonal to the auto-init behaviour. If introduced, it should be on by default, configurable only as on or off for the host as a whole (no lower granularity), and only via the API.
> When disabled, initVdsOnUp would be called only when the admin activates the host/storage, and any error would keep it inactive (I still don't understand why this is needed at all, but whatever).
> 
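
To make point 2 concrete, here is a rough Python sketch of the on/off decision. It is only an illustration of the rule as stated above, not engine code: HostConfig, auto_init and should_run_init are made-up names, and the two trigger strings are assumptions.

```python
from dataclasses import dataclass


@dataclass
class HostConfig:
    # Hypothetical per-host flag; the thread says it should default to on.
    auto_init: bool = True


def should_run_init(config, trigger):
    """Decide whether initialization (initVdsOnUp in the thread) should run.

    trigger is "monitoring" (periodic check) or "admin" (manual activation).
    Both trigger names are assumptions made for this sketch.
    """
    if trigger == "admin":
        # Manual activation always initializes, whatever the flag says.
        return True
    if trigger == "monitoring":
        # Automatic initialization only when the flag is on.
        return config.auto_init
    raise ValueError(f"unknown trigger: {trigger!r}")
```

With the flag off, only the admin path initializes, which matches "called only when admin activates the host/storage".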

Moran also noted on the call that we should check whether we can unify
the non-operational and Error statuses of the host.
It was mentioned on the call that the reason for having the Error state
is recovery (timing out of the error state), but since we are about to
recover from non-operational status as well, there is no reason to have
two different statuses.

> Note that going forward, what I envision is the engine pushing down the entire host configuration once, and from that point on the host would try to keep this configuration up and running. Once this happens there will be no need for initVdsOnUp at all.
> 
> 
>>
>> Anyway, I agree with Ayal that it would be very nice if the engine
>> could fix the issues right on discovery, but I also agree that this
>> feature would take a bigger effort. It would be nice to know what
>> effort it would take to get the monitoring to do this safely. Could we
>> still call it monitoring then?
>>

Basically, the monitoring flow moves the host to non-operational; what
Ayal suggests is that it will also trigger the recovery flow
(initialization flow).

I think that modeling it to be triggered from the monitoring flow will
block monitoring of the host during the initialization flow, which can
save us from races going forward.
Let's see if we can design the solution to be triggered by the monitoring.
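
A minimal Python sketch of that idea, assuming a per-host lock held for the whole monitoring pass so that initialization and monitoring of the same host cannot overlap. All names here (Host, HostStatus, monitor_cycle, the init_host callable standing in for initVdsOnUp) are hypothetical and not taken from the engine code.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum


class HostStatus(Enum):
    UP = "up"
    NON_OPERATIONAL = "non_operational"


@dataclass
class Host:
    name: str
    status: HostStatus = HostStatus.UP
    # One lock per host: a monitoring pass and the initialization it
    # triggers never overlap with another pass for the same host.
    lock: threading.Lock = field(default_factory=threading.Lock)


def monitor_cycle(host, report_ok, init_host):
    """Run one monitoring pass for `host` and return its new status.

    report_ok: whether the (assumed) monitoring data looks healthy.
    init_host: callable standing in for initVdsOnUp; returns True on success.
    """
    with host.lock:
        if not report_ok:
            # Still failing: stay non-operational and skip initialization.
            host.status = HostStatus.NON_OPERATIONAL
            return host.status
        if host.status is HostStatus.NON_OPERATIONAL:
            # Data looks healthy again: initialize inside the same pass.
            if init_host(host):
                host.status = HostStatus.UP
        return host.status
```

Because the initialization runs while the lock is held, the next monitoring pass for that host simply waits, which is the blocking behaviour described above.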

>> Laszlo
>>
>> ----- Original Message -----
>>> From: "Ayal Baron" <abaron at redhat.com>
>>> To: "Laszlo Hornyak" <lhornyak at redhat.com>
>>> Cc: engine-devel at ovirt.org, "Yaniv Kaul" <ykaul at redhat.com>
>>> Sent: Wednesday, February 15, 2012 12:46:05 PM
>>> Subject: Re: [Engine-devel] Autorecovery feature plan for review
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> Hi Ayal,
>>>>
>>>> ----- Original Message -----
>>>>> From: "Ayal Baron" <abaron at redhat.com>
>>>>> To: "Yaniv Kaul" <ykaul at redhat.com>
>>>>> Cc: engine-devel at ovirt.org
>>>>> Sent: Wednesday, February 15, 2012 12:19:48 PM
>>>>> Subject: Re: [Engine-devel] Autorecovery feature plan for
>>>>> review
>>>>>
>>>>>
>>>>>>
>>>>>> I still fail to understand why you 'punish' existing objects
>>>>>> by not giving them the new feature enabled by default.
>>>>>
>>>>> This is not a feature, it's a bug!
>>>>
>>>> Whatever we call it, it is a change in behavior. We agreed that
>>>> it
>>>> will be enabled for all existing objects by default.
>>>>
>>>> http://globalnerdy.com/wordpress/wp-content/uploads/2007/12/bug_vs_feature.gif
>>>>
>>>>> This should not be treated as a feature and this should not be
>>>>> configurable!
>>>>
>>>> I can imagine some situations when I would not like the
>>>> autorecovery
>>>> to happen, but if everyone agrees not to make it configurable, I
>>>> will just remove it from my patchset.
>>>
>>> It's not autorecovery, you're not recovering anything.  You're
>>> reflecting the fact that the resource is back to normal (not due to
>>> anything that the engine did).
>>> This is why it is a bug today.
>>> This is why it should not be configurable.
>>>
>>>>
>>>>> Today an object moves to non-operational due to state reported
>>>>> by
>>>>> vdsm.  The object should immediately return to up the moment
>>>>> vdsm
>>>>> reports the object as ok (this means that you don't stop
>>>>> monitoring
>>>>> just because there is an error).
>>>>> That's it. No db field, no nothing...
>>>>> This pertains to storage domains, network, host status,
>>>>> whatever.
>>>>>
>>>>>> Y.
>>>>>>
>>>>>>> b. In an environment to be clean installed, we have 0 existing
>>>>>>> entities; after a clean install all new entities in the system
>>>>>>> will be created with auto recoverable set to true.
>>>>>>> Will this be considered a bad behavior?
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Engine-devel mailing list
>>>>>>> Engine-devel at ovirt.org
>>>>>>> http://lists.ovirt.org/mailman/listinfo/engine-devel
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>