[Engine-devel] Autorecovery feature plan for review

Thu Feb 16 07:29:17 UTC 2012

On 02/16/2012 12:38 AM, Itamar Heim wrote:
> On 02/15/2012 07:02 PM, Livnat Peer wrote:
>> On 15/02/12 18:28, Ayal Baron wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> Hi,
>>>>
>>>> A short summary from the call today, please correct me if I forgot or
>>>> misunderstood something.
>>>>
>>>> Ayal argued that the failed host/storagedomain should be reactivated
>>>> by a periodically executed job, he would prefer if the engine could
>>>> [try to] correct the problem right on discovery.
>>>> Livnat's point was that this is hard to implement and it is OK if we
>>>> move it to Nonoperational state and periodically check it again.
>>>>
>>>> There was a little arguing if we call the current behavior a bug or a
>>>> missing behavior, I believe this is not quite important.
>>>>
>>>> I did not fully understand the last few sentences from Livnat, did we
>>>> manage to agree on a change in the plan?
>>>
>>> A couple of points that we agreed upon:
>>> 1. no need for new mechanism, just initiate this from the monitoring 
>>> context.
>>>     Preferably, if not difficult, evaluate the monitoring data, if 
>>> host should remain in non-op then don't bother running initVdsOnUp
>>> 2. configuration of when to call initVdsOnUp is orthogonal to 
>>> auto-init behaviour and if introduced should be on by default and 
>>> user should be able to configure this either on or off for the host 
>>> in general (no lower granularity) and can only be configured via the 
>>> API.
>>> When disabled initVdsOnUp would be called only when admin activates 
>>> the host/storage and any error would keep it inactive (I still don't 
>>> understand why this is at all needed but whatever).
>>>
>>
>> Also a note from Moran on the call was to check if we can unify the
>> non-operational and Error statuses of the host.
>> It was mentioned on the call that the reason for having ERROR state is
>> for recovery (time out of the error state) but since we are about to
>> recover from non-operational status as well there is no reason to have
>> two different statuses.
>
> they are not exactly the same.
> or should i say, error is supposed to be when reason isn't related to 
> host being non-operational.
>
> what is error state?
> a host will go into error state if it fails to run 3 (configurable) 
> VMs, that succeeded running on other host on retry.
> i.e., something is wrong with that host, failing to launch VMs.
> as it happens, it already "auto recovers" for this mode after a 
> certain period of time.
>
> why? because the host will fail to run virtual machines, and will be 
> the least loaded, so it will be the first target selected to run them, 
> which will continue to fail.
>
> so there is a negative scoring mechanism on number of errors, till 
> host is taken out for a while.
>
> (I don't remember if the reverse is true and the VM goes into error 
> mode if the VM failed to launch on all hosts per number of retries. i 
> think this wasn't needed and user just got an error in audit log)
>
> i can see two reasons a host will go into error state:
> 1. monitoring didn't detect an issue yet, and host would 
> have/will/should go into non-operational mode.
> if host will go into non-operational mode, and will auto recover with 
> the above flow, i guess it is fine.
>
> 2. cause for failure isn't something we monitor for (upgraded to a bad 
> version of qemu, or qemu got corrupted).
>
> now, the error mode was developed quite a long time ago (august 2007 
> iirc), so could be it mostly compensated for the first reason which is 
> now better monitored.
> i wonder how often error state is seen due to a reason which isn't 
> monitored already.
> moran - do you have examples of when you see error state of hosts?

Usually it happened when there was a problematic or misconfigured
vdsm/libvirt which failed to run VMs (nothing we can recover from). I
haven't faced the "host is too loaded" issue; that status has some
other symptoms. However, the behaviour in that state is very much the
same: wait for 30 min (?) and then move the host to activated.
Moran.
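
For readers following the thread, here is a minimal sketch of the Error
state behaviour described above: a host that fails to run a configurable
number of VMs (3 in Itamar's description) which then start fine on
another host is taken out of rotation, and it is moved back after a
timeout. This is not the engine's code. The names Host,
ErrorStateTracker, record_failed_run and recover_if_due are hypothetical,
and the 30 minute default is only the uncertain figure mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

UP = "Up"
ERROR = "Error"


@dataclass
class Host:
    """Hypothetical host record; not the engine's data model."""
    name: str
    status: str = UP
    failed_runs: int = 0
    error_since: Optional[float] = None


class ErrorStateTracker:
    """Negative scoring on failed VM launches, with timed recovery."""

    def __init__(self, max_failed_runs: int = 3,
                 recovery_seconds: float = 30 * 60):
        # Both values are assumptions taken from the thread:
        # "3 (configurable)" and "30 min (?)".
        self.max_failed_runs = max_failed_runs
        self.recovery_seconds = recovery_seconds

    def record_failed_run(self, host: Host, succeeded_elsewhere: bool,
                          now: float) -> None:
        # Only count failures where the VM ran fine on another host on
        # retry, i.e. the problem is likely with this host.
        if host.status != UP or not succeeded_elsewhere:
            return
        host.failed_runs += 1
        if host.failed_runs >= self.max_failed_runs:
            host.status = ERROR
            host.error_since = now

    def recover_if_due(self, host: Host, now: float) -> bool:
        # Meant to be called periodically; returns True if the host was
        # moved back to Up on this call.
        if host.status != ERROR or host.error_since is None:
            return False
        if now - host.error_since < self.recovery_seconds:
            return False
        host.status = UP
        host.failed_runs = 0
        host.error_since = None
        return True
```

Timestamps are passed in explicitly so the timeout can be exercised
without waiting; a periodic job would pass the current time instead.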