[Engine-devel] Autorecovery feature plan for review

Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
Thank you, Laszlo

Some comments:
1. I think the amount of time between tests should be configurable.
2. I guess some of the actions done by the autorecovery process should be monitored, so take a look at "http://www.ovirt.org/wiki/Features/TaskManagerDetailed#Job_for_System_Monito..." in order to monitor this action.
Oved
----- Original Message -----
From: "Laszlo Hornyak" <lhornyak@redhat.com> To: "engine-devel" <engine-devel@ovirt.org>, users@ovirt.org Sent: Monday, February 13, 2012 12:32:34 PM Subject: [Users] Autorecovery feature plan for review
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
Thank you, Laszlo

----- Original Message -----
From: "Oved Ourfalli" <ovedo@redhat.com> To: "Laszlo Hornyak" <lhornyak@redhat.com> Cc: "engine-devel" <engine-devel@ovirt.org>, users@ovirt.org Sent: Monday, February 13, 2012 12:31:23 PM Subject: Re: [Users] Autorecovery feature plan for review
Some comments: 1. I think the amount of time between tests should be configurable.
Agreed.
2. I guess some of the actions done by the autorecovery process should be monitored, so take a look at "http://www.ovirt.org/wiki/Features/TaskManagerDetailed#Job_for_System_Monito..." in order to monitor this action.
Oved
----- Original Message -----
From: "Laszlo Hornyak" <lhornyak@redhat.com> To: "engine-devel" <engine-devel@ovirt.org>, users@ovirt.org Sent: Monday, February 13, 2012 12:32:34 PM Subject: [Users] Autorecovery feature plan for review
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
Thank you, Laszlo

On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?

On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior. I agree that for new entities the default should be true.

On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
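As a minimal sketch of the combination described above (hypothetical class and property names, not the actual engine code): the new column would be created with a DB default of false, so rows that exist before the upgrade keep today's behavior, while the engine code defaults the property to true, so every newly created entity is auto-recoverable.

```java
// Hypothetical business entity; names are illustrative, not the actual engine classes.
public class RecoverableEntity {

    // Code-level default: every entity created through the engine starts as
    // auto-recoverable. The DB column itself would be created with DEFAULT FALSE,
    // so rows that pre-date the upgrade keep the current (non-recovering) behavior.
    private boolean autoRecoverable = true;

    public boolean isAutoRecoverable() {
        return autoRecoverable;
    }

    public void setAutoRecoverable(boolean autoRecoverable) {
        this.autoRecoverable = autoRecoverable;
    }

    public static void main(String[] args) {
        // New entity created in code -> true, regardless of the DB column default.
        RecoverableEntity newHost = new RecoverableEntity();
        System.out.println("new entity auto-recoverable: " + newHost.isAutoRecoverable());
    }
}
```

Yair's alternative below reaches the same end state the other way around: keep the column default at true and have the upgrade script explicitly flip the existing rows to false.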

On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?

----- Original Message -----
From: "Yair Zaslavsky" <yzaslavs@redhat.com> To: engine-devel@ovirt.org Sent: Tuesday, February 14, 2012 9:20:10 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
why all the trouble? i think this field should be mandatory like any other field; the user has to specify it during entity creation, right where he provides the name and any other fields for the new entity.

On 02/14/2012 09:48 AM, Omer Frenkel wrote:
----- Original Message -----
From: "Yair Zaslavsky" <yzaslavs@redhat.com> To: engine-devel@ovirt.org Sent: Tuesday, February 14, 2012 9:20:10 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
why all the trouble? i think this field should be mandatory like any other field; the user has to specify it during entity creation, right where he provides the name and any other fields for the new entity.
Fine by me.

On 02/14/2012 09:59 AM, Yair Zaslavsky wrote:
On 02/14/2012 09:48 AM, Omer Frenkel wrote:
----- Original Message -----
From: "Yair Zaslavsky"<yzaslavs@redhat.com> To: engine-devel@ovirt.org Sent: Tuesday, February 14, 2012 9:20:10 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/14/2012 08:59 AM, Itamar Heim wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
Why? Why not improve their user experience and provide them with such a feature? Current behaviour sucks - ask your system admin.
I agree that for new entities the default should be true.
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
why all the trouble? i think this field should be mandatory like any other field; the user has to specify it during entity creation, right where he provides the name and any other fields for the new entity.
Fine by me.
I'm not sure I see the reason a user would want to turn it off, on a per-object basis. If it's in the 'Advanced' settings of a host/storage, fine, but otherwise, it's just another cryptic feature to turn on/off. Y.

On 14/02/12 11:45, Yaniv Kaul wrote:
On 02/14/2012 09:59 AM, Yair Zaslavsky wrote:
On 02/14/2012 09:48 AM, Omer Frenkel wrote:
----- Original Message -----
From: "Yair Zaslavsky"<yzaslavs@redhat.com> To: engine-devel@ovirt.org Sent: Tuesday, February 14, 2012 9:20:10 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/14/2012 08:59 AM, Itamar Heim wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
Why? Why not improve their user experience and provide them with such a feature? Current behaviour sucks - ask your system admin.
I don't have objections either way. Laszlo - let's update the wiki to upgrade by default to true; if we get a good reason why not to upgrade to true then we can open it again for discussion.
I agree that for new entities the default should be true.
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
why all the trouble? i think this field should be mandatory like any other field; the user has to specify it during entity creation, right where he provides the name and any other fields for the new entity.
Fine by me.
I'm not sure I see the reason a user would want to turn it off, on a per-object basis. If it's in the 'Advanced' settings of a host/storage, fine, but otherwise, it's just another cryptic feature to turn on/off. Y.

----- Original Message -----
From: "Livnat Peer" <lpeer@redhat.com> To: "Yaniv Kaul" <ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Tuesday, February 14, 2012 1:00:35 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 14/02/12 11:45, Yaniv Kaul wrote:
On 02/14/2012 09:59 AM, Yair Zaslavsky wrote:
On 02/14/2012 09:48 AM, Omer Frenkel wrote:
----- Original Message -----
From: "Yair Zaslavsky"<yzaslavs@redhat.com> To: engine-devel@ovirt.org Sent: Tuesday, February 14, 2012 9:20:10 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/14/2012 08:59 AM, Itamar Heim wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
Why? Why not improve their user experience and provide them with such a feature? Current behaviour sucks - ask your system admin.
I don't have objections either way. Laszlo - let's update the wiki to upgrade by default to true, if we'll get good reason why not to upgrade to true then we can open it again for discussion.
So be it, I changed the wikipage.
I agree that for new entities the default should be true.
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
why all the trouble? i think this field should be mandatory like any other field; the user has to specify it during entity creation, right where he provides the name and any other fields for the new entity.
Fine by me.
I'm not sure I see the reason a user would want to turn it off, on a per-object basis. If it's in the 'Advanced' settings of a host/storage, fine, but otherwise, it's just another cryptic feature to turn on/off. Y.

On 02/14/2012 09:48 AM, Omer Frenkel wrote:
----- Original Message -----
From: "Yair Zaslavsky"<yzaslavs@redhat.com> To: engine-devel@ovirt.org Sent: Tuesday, February 14, 2012 9:20:10 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
why all the trouble? i think this field should be mandatory like any other field; the user has to specify it during entity creation, right where he provides the name and any other fields for the new entity.
because this will break the API? what happens if user doesn't pass it, like they didn't so far? i.e., you need to decide on a default.

On 02/14/2012 09:20 AM, Yair Zaslavsky wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
because upgrade and clean install are running the same scripts?

On 02/14/2012 10:03 PM, Itamar Heim wrote:
On 02/14/2012 09:20 AM, Yair Zaslavsky wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
because upgrade and clean install are running the same scripts?
I guess I still fail to understand. Scenarios (as both upgrade and clean install run the same scripts):
a. In an environment to be upgraded we have X entities that are non-recoverable - after upgrade these X entities have the boolean flag set to false. New entities in the system will be created with auto recoverable set to true.
b. In an environment to be clean installed we have 0 existing entities - after clean install all new entities in the system will be created with auto recoverable set to true.
Will this be considered bad behavior?

----- Original Message -----
On 02/14/2012 09:20 AM, Yair Zaslavsky wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
On 02/14/2012 10:03 PM, Itamar Heim wrote:
because upgrade and clean install are running the same scripts?
I guess I still fail to understand. Scenarios (as both upgrade and clean install run the same scripts):
a. In an environment to be upgraded we have X entities that are non-recoverable - after upgrade these X entities have the boolean flag set to false. New entities in the system will be created with auto recoverable set to true.
b. In an environment to be clean installed we have 0 existing entities - after clean install all new entities in the system will be created with auto recoverable set to true.
Will this be considered bad behavior?
Why is there a field in the db for this? Why is there absolutely no description in the wiki of what this feature *actually* does? Why is there a periodic process to do this?
iiuc host/storage/whatever goes into non-operational mode due to monitoring of this object: after a certain amount of time (or immediately) in which the object was reported to be in an error state, it is moved to non-operational. Monitoring of these objects should just *not* stop, and the second it is reported ok, the object should move back to up/active/whatever state. What am I missing?
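To make the argument concrete, a rough sketch of the monitoring-driven behavior being asked for here, with entirely hypothetical names - the point is only that the same monitoring pass that takes an object out of service also puts it back the moment vdsm reports it healthy:

```java
// Rough sketch of monitoring-driven recovery; all names are hypothetical.
enum ObjectStatus { UP, NON_OPERATIONAL }

class MonitoredObject {
    String name;
    ObjectStatus status = ObjectStatus.UP;
    MonitoredObject(String name) { this.name = name; }
}

class MonitoringCycle {

    // Called on every monitoring pass with the health reported by vdsm.
    void onReport(MonitoredObject obj, boolean reportedOk) {
        if (!reportedOk && obj.status == ObjectStatus.UP) {
            obj.status = ObjectStatus.NON_OPERATIONAL;   // error reported -> take it out
        } else if (reportedOk && obj.status == ObjectStatus.NON_OPERATIONAL) {
            obj.status = ObjectStatus.UP;                 // healthy again -> back up immediately
        }
        // Monitoring keeps running in both states; no separate recovery field or job.
    }
}
```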

----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Yair Zaslavsky" <yzaslavs@redhat.com> Cc: engine-devel@ovirt.org Sent: Wednesday, February 15, 2012 12:37:46 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
----- Original Message -----
On 02/14/2012 09:20 AM, Yair Zaslavsky wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
On 02/14/2012 10:03 PM, Itamar Heim wrote:
because upgrade and clean install are running the same scripts?
I guess I still fail to understand. Scenarios (as both upgrade and clean install run the same scripts):
a. In an environment to be upgraded we have X entities that are non-recoverable - after upgrade these X entities have the boolean flag set to false. New entities in the system will be created with auto recoverable set to true.
b. In an environment to be clean installed we have 0 existing entities - after clean install all new entities in the system will be created with auto recoverable set to true.
Will this be considered bad behavior?
Why is there a field in the db for this? Why is there absolutely no description in the wiki of what this feature *actually* does? Why is there a periodic process to do this?
iiuc host/storage/whatever goes into non-operational mode due to monitoring of this object: after a certain amount of time (or immediately) in which the object was reported to be in an error state, it is moved to non-operational. Monitoring of these objects should just *not* stop, and the second it is reported ok, the object should move back to up/active/whatever state. What am I missing?
Let me see if I got it right: it means I'll have one process that will go over all the "down" objects every X seconds, and will issue an "activate" action per object? This will be done sequentially, I guess... I would reduce the audit log time to 1 hour.
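For comparison, a sketch of the periodic-job reading of the plan described in this reply (the interval, query and action names are assumptions, not the actual implementation): a single scheduled task that walks the non-operational objects every X seconds and issues an activate per object, sequentially.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the periodic autorecovery job as understood above; names are hypothetical.
public class AutoRecoveryJob {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(long intervalSeconds) {
        // Interval between recovery passes - the value Oved asked to make configurable.
        scheduler.scheduleWithFixedDelay(this::recoverNonOperational, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void recoverNonOperational() {
        // One pass: go over every "down" object and try to activate it, one after another.
        for (String objectId : findNonOperationalObjects()) {
            try {
                activate(objectId);                      // roughly the existing activate action
            } catch (Exception e) {
                System.err.println("recovery of " + objectId + " failed: " + e.getMessage());
            }
        }
    }

    private List<String> findNonOperationalObjects() { return List.of(); } // placeholder query
    private void activate(String objectId) { /* placeholder for the activate action */ }
}
```

The interval passed to start() corresponds to the "amount of time between tests" Oved asked to make configurable earlier in the thread.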

On 02/14/2012 11:36 PM, Yair Zaslavsky wrote:
On 02/14/2012 10:03 PM, Itamar Heim wrote:
On 02/14/2012 09:20 AM, Yair Zaslavsky wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
because upgrade and clean install are running the same scripts?
I guess I still fail to understand. Scenarios (as both upgrade and clean install run the same scripts):
a. In an environment to be upgraded we have X entities that are non-recoverable - after upgrade these X entities have the boolean flag set to false. New entities in the system will be created with auto recoverable set to true.
I still fail to understand why you 'punish' existing objects and not giving them the new feature enabled by default. Y.
b. In environment to be clean installed -we have 0 existing entities - after clean install all new entities in the system will be create with auto recoverable set to true. Will this be considered a bad behavior?

----- Original Message -----
On 02/14/2012 11:36 PM, Yair Zaslavsky wrote:
On 02/14/2012 10:03 PM, Itamar Heim wrote:
On 02/14/2012 09:20 AM, Yair Zaslavsky wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior.
I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
because upgrade and clean install are running the same scripts?
I guess I still fail to understand. Scenarios (as both upgrade and clean install run the same scripts):
a. In an environment to be upgraded we have X entities that are non-recoverable - after upgrade these X entities have the boolean flag set to false. New entities in the system will be created with auto recoverable set to true.
I still fail to understand why you 'punish' existing objects and not giving them the new feature enabled by default.
This is not a feature, it's a bug! This should not be treated as a feature and this should not be configurable! Today an object moves to non-operational due to state reported by vdsm. The object should immediately return to up the moment vdsm reports the object as ok (this means that you don't stop monitoring just because there is an error). That's it. no db field and no nothing... This pertains to storage domains, network, host status, whatever.
Y.
b. In environment to be clean installed -we have 0 existing entities - after clean install all new entities in the system will be create with auto recoverable set to true. Will this be considered a bad behavior?

Hi Ayal,
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Yaniv Kaul" <ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Wednesday, February 15, 2012 12:19:48 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
I still fail to understand why you 'punish' existing objects and not giving them the new feature enabled by default.
This is not a feature, it's a bug!
Whatever we call it, it is a change in behavior. We agreed that it will be enabled for all existing objects by default. http://globalnerdy.com/wordpress/wp-content/uploads/2007/12/bug_vs_feature.g...
This should not be treated as a feature and this should not be configurable!
I can imagine some situations when I would not like the autorecovery to happen, but if everyone agrees not to make it configurable, I will just remove it from my patchset.
Today an object moves to non-operational due to state reported by vdsm. The object should immediately return to up the moment vdsm reports the object as ok (this means that you don't stop monitoring just because there is an error). That's it. no db field and no nothing... This pertains to storage domains, network, host status, whatever.
Y.
b. In environment to be clean installed -we have 0 existing entities - after clean install all new entities in the system will be create with auto recoverable set to true. Will this be considered a bad behavior?

----- Original Message -----
Hi Ayal,
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Yaniv Kaul" <ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Wednesday, February 15, 2012 12:19:48 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
I still fail to understand why you 'punish' existing objects and not giving them the new feature enabled by default.
This is not a feature, it's a bug!
Whatever we call it, it is a change in behavior. We agreed that it will be enabled for all existing objects by default.
http://globalnerdy.com/wordpress/wp-content/uploads/2007/12/bug_vs_feature.g...
This should not be treated as a feature and this should not be configurable!
I can imagine some situations when I would not like the autorecovery to happen, but if everyone agrees not to make it configurable, I will just remove it from my patchset.
It's not autorecovery, you're not recovering anything. You're reflecting the fact that the resource is back to normal (not due to anything that the engine did). This is why it is a bug today. This is why it should not be configurable.
Today an object moves to non-operational due to state reported by vdsm. The object should immediately return to up the moment vdsm reports the object as ok (this means that you don't stop monitoring just because there is an error). That's it. no db field and no nothing... This pertains to storage domains, network, host status, whatever.
Y.
b. In environment to be clean installed -we have 0 existing entities - after clean install all new entities in the system will be create with auto recoverable set to true. Will this be considered a bad behavior?

Hi,
A short summary from the call today; please correct me if I forgot or misunderstood something.
Ayal took issue with the failed host/storagedomain being reactivated by a periodically executed job; he would prefer it if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing over whether we call the current behavior a bug or missing behavior; I believe this is not that important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
Anyway, I agree with Ayal that it would be very nice if the engine could fix the issues right on discovery, but I also agree that this feature would take a bigger effort. It would be nice to know what effort it would take to get the monitoring to do this safely. Could we still call it monitoring then?
Laszlo
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Laszlo Hornyak" <lhornyak@redhat.com> Cc: engine-devel@ovirt.org, "Yaniv Kaul" <ykaul@redhat.com> Sent: Wednesday, February 15, 2012 12:46:05 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
----- Original Message -----
Hi Ayal,
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Yaniv Kaul" <ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Wednesday, February 15, 2012 12:19:48 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
I still fail to understand why you 'punish' existing objects and not giving them the new feature enabled by default.
This is not a feature, it's a bug!
Whatever we call it, it is a change in behavior. We agreed that it will be enabled for all existing objects by default.
http://globalnerdy.com/wordpress/wp-content/uploads/2007/12/bug_vs_feature.g...
This should not be treated as a feature and this should not be configurable!
I can imagine some situations when I would not like the autorecovery to happen, but if everyone agrees not to make it configurable, I will just remove it from my patchset.
It's not autorecovery, you're not recovering anything. You're reflecting the fact that the resource is back to normal (not due to anything that the engine did). This is why it is a bug today. This is why it should not be configurable.
Today an object moves to non-operational due to state reported by vdsm. The object should immediately return to up the moment vdsm reports the object as ok (this means that you don't stop monitoring just because there is an error). That's it. no db field and no nothing... This pertains to storage domains, network, host status, whatever.
Y.
b. In environment to be clean installed -we have 0 existing entities - after clean install all new entities in the system will be create with auto recoverable set to true. Will this be considered a bad behavior?

----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal took issue with the failed host/storagedomain being reactivated by a periodically executed job; he would prefer it if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing if we call the current behavior a bug or a missing behavior, I believe this is not quite important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon:
1. No need for a new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data; if the host should remain in non-op then don't bother running initVdsOnUp.
2. Configuration of when to call initVdsOnUp is orthogonal to the auto-init behaviour, and if introduced it should be on by default; the user should be able to configure this either on or off for the host in general (no lower granularity), and it can only be configured via the API. When disabled, initVdsOnUp would be called only when the admin activates the host/storage, and any error would keep it inactive (I still don't understand why this is at all needed, but whatever).
Note that going forward what I envision is the engine pushing down the entire host configuration once, and from that point on the host would try to keep this configuration up and running. Once this happens there will be no need for initVdsOnUp at all.
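A minimal sketch of point 1, with hypothetical names (initVdsOnUp stands for the existing initialization flow referenced above): the monitoring context itself decides, from the data it has just collected, whether running the initialization flow is worthwhile at all.

```java
// Sketch of recovery initiated from the monitoring context (point 1 above); hypothetical names.
class HostMonitoringContext {

    void onMonitoringCycle(Host host, MonitoringData data) {
        if (host.isNonOperational()) {
            // Evaluate the data we already have; if the host would stay non-op
            // (e.g. a required network is still missing), don't bother with initVdsOnUp.
            if (data.looksHealthy()) {
                initVdsOnUp(host);   // the existing initialization flow, reused as-is
            }
            return;
        }
        // ... regular monitoring handling for hosts that are already up ...
    }

    private void initVdsOnUp(Host host) { /* existing flow, not re-implemented here */ }
}

class Host {
    private boolean nonOperational;
    boolean isNonOperational() { return nonOperational; }
}

class MonitoringData {
    private boolean healthy;
    boolean looksHealthy() { return healthy; }
}
```

Triggering the attempt from inside the monitoring cycle is also what gives the serialization Livnat mentions further down: the host is not monitored again until the initialization attempt returns.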
Anyway, I agree with Ayal that it would be very nice if the engine could fix the issues right on discovery, but I also agree that this feature would take a bigger effort. It would be nice to know what effort it would take to get the monitoring do this safely. Could we still call it monitoring then?
Laszlo
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Laszlo Hornyak" <lhornyak@redhat.com> Cc: engine-devel@ovirt.org, "Yaniv Kaul" <ykaul@redhat.com> Sent: Wednesday, February 15, 2012 12:46:05 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
----- Original Message -----
Hi Ayal,
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Yaniv Kaul" <ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Wednesday, February 15, 2012 12:19:48 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
I still fail to understand why you 'punish' existing objects and not giving them the new feature enabled by default.
This is not a feature, it's a bug!
Whatever we call it, it is a change in behavior. We agreed that it will be enabled for all existing objects by default.
http://globalnerdy.com/wordpress/wp-content/uploads/2007/12/bug_vs_feature.g...
This should not be treated as a feature and this should not be configurable!
I can imagine some situations when I would not like the autorecovery to happen, but if everyone agrees not to make it configurable, I will just remove it from my patchset.
It's not autorecovery, you're not recovering anything. You're reflecting the fact that the resource is back to normal (not due to anything that the engine did). This is why it is a bug today. This is why it should not be configurable.
Today an object moves to non-operational due to state reported by vdsm. The object should immediately return to up the moment vdsm reports the object as ok (this means that you don't stop monitoring just because there is an error). That's it. no db field and no nothing... This pertains to storage domains, network, host status, whatever.
Y.
b. In environment to be clean installed -we have 0 existing entities - after clean install all new entities in the system will be create with auto recoverable set to true. Will this be considered a bad behavior?

On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job, he would prefer if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing if we call the current behavior a bug or a missing behavior, I believe this is not quite important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. no need for new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data, if host should remain in non-op then don't bother running initVdsOnUp 2. configuration of when to call initvdsonup is orthogonal to auto-init behaviour and if introduced should be on by default and user should be able to configure this either on or off for the host in general (no lower granularity) and can only be configured via the API. When disabled initVdsOnUp would be called only when admin activates the host/storage and any error would keep it inactive (I still don't understand why this is at all needed but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
Note that going forward what I envision is engine pushing down the entire host configuration once and from that point on the host would try to keep this configuration up and running. Once this happens there will be no need for initVdsOnUp at all.
Anyway, I agree with Ayal that it would be very nice if the engine could fix the issues right on discovery, but I also agree that this feature would take a bigger effort. It would be nice to know what effort it would take to get the monitoring do this safely. Could we still call it monitoring then?
Basically the monitoring flow moves the host to non-operational; what Ayal suggests is that it will also trigger the recovery flow (initialization flow). I think that modeling it to be triggered from the monitoring flow will block monitoring of the host during the initialization flow, which can save us races going forward. Let's see if we can design the solution to be triggered by the monitoring.
Laszlo
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Laszlo Hornyak" <lhornyak@redhat.com> Cc: engine-devel@ovirt.org, "Yaniv Kaul" <ykaul@redhat.com> Sent: Wednesday, February 15, 2012 12:46:05 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
----- Original Message -----
Hi Ayal,
----- Original Message -----
From: "Ayal Baron" <abaron@redhat.com> To: "Yaniv Kaul" <ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Wednesday, February 15, 2012 12:19:48 PM Subject: Re: [Engine-devel] Autorecovery feature plan for review
I still fail to understand why you 'punish' existing objects and not giving them the new feature enabled by default.
This is not a feature, it's a bug!
Whatever we call it, it is a change in behavior. We agreed that it will be enabled for all existing objects by default.
http://globalnerdy.com/wordpress/wp-content/uploads/2007/12/bug_vs_feature.g...
This should not be treated as a feature and this should not be configurable!
I can imagine some situations when I would not like the autorecovery to happen, but if everyone agrees not to make it configurable, I will just remove it from my patchset.
It's not autorecovery, you're not recovering anything. You're reflecting the fact that the resource is back to normal (not due to anything that the engine did). This is why it is a bug today. This is why it should not be configurable.
Today an object moves to non-operational due to state reported by vdsm. The object should immediately return to up the moment vdsm reports the object as ok (this means that you don't stop monitoring just because there is an error). That's it. no db field and no nothing... This pertains to storage domains, network, host status, whatever.
Y.
b. In an environment to be clean installed we have 0 existing entities - after clean install all new entities in the system will be created with auto recoverable set to true. Will this be considered bad behavior?

On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job, he would prefer if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing if we call the current behavior a bug or a missing behavior, I believe this is not quite important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. no need for new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data, if host should remain in non-op then don't bother running initVdsOnUp 2. configuration of when to call initvdsonup is orthogonal to auto-init behaviour and if introduced should be on by default and user should be able to configure this either on or off for the host in general (no lower granularity) and can only be configured via the API. When disabled initVdsOnUp would be called only when admin activates the host/storage and any error would keep it inactive (I still don't understand why this is at all needed but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when the reason isn't related to the host being non-operational.
what is the error state? a host will go into error state if it fails to run 3 (configurable) VMs that succeeded running on another host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" from this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail. so there is a negative scoring mechanism on the number of errors, till the host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and the user just got an error in the audit log)
i can see two reasons a host will go into error state:
1. monitoring didn't detect an issue yet, and the host would have/will/should go into non-operational mode. if the host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. the cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so it could be that it mostly compensated for the first reason, which is now better monitored. i wonder how often the error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
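A toy illustration of the negative scoring described above; the threshold of 3 failed VM launches is the configurable value mentioned, the 30-minute cool-down matches the rough figure Moran gives below, and everything else (names, structure) is made up for the example.

```java
import java.time.Duration;
import java.time.Instant;

// Toy model of the host ERROR state described above; names and numbers are illustrative.
class HostErrorTracker {

    private static final int FAILURE_THRESHOLD = 3;                 // configurable in the engine
    private static final Duration ERROR_COOLDOWN = Duration.ofMinutes(30);

    private int consecutiveLaunchFailures;
    private Instant errorSince;

    // A VM failed to start on this host but succeeded on another host on retry.
    void onVmLaunchFailedHereButSucceededElsewhere() {
        consecutiveLaunchFailures++;
        if (consecutiveLaunchFailures >= FAILURE_THRESHOLD && errorSince == null) {
            errorSince = Instant.now();                              // host goes to ERROR, out of scheduling
        }
    }

    void onVmLaunchSucceeded() {
        consecutiveLaunchFailures = 0;                               // healthy launches clear the score
    }

    // The scheduler asks this before picking the host as a run/migration target.
    boolean isSchedulable() {
        if (errorSince == null) {
            return true;
        }
        if (Duration.between(errorSince, Instant.now()).compareTo(ERROR_COOLDOWN) >= 0) {
            errorSince = null;                                       // "auto recovers" after the cool-down
            consecutiveLaunchFailures = 0;
            return true;
        }
        return false;
    }
}
```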

On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job, he would prefer if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing if we call the current behavior a bug or a missing behavior, I believe this is not quite important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. no need for new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data, if host should remain in non-op then don't bother running initVdsOnUp 2. configuration of when to call initvdsonup is orthogonal to auto-init behaviour and if introduced should be on by default and user should be able to configure this either on or off for the host in general (no lower granularity) and can only be configured via the API. When disabled initVdsOnUp would be called only when admin activates the host/storage and any error would keep it inactive (I still don't understand why this is at all needed but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when reason isn't related to host being non-operational.
what is error state? a host will go into error state if it fails to run 3 (configurable) VMs, that succeeded running on other host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" for this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and user just got an error in audit log)
i can see two reasons a host will go into error state: 1. monitoring didn't detect an issue yet, and host would have/will/should go into non-operational mode. if host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so could be it mostly compensated for the first reason which is now better monitored. i wonder how often error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
usually it happened when there was a problematic / misconfigured vdsm / libvirt which failed to run vms (nothing we can recover from) - i haven't faced the issue of "host is too loaded", that status has some other symptoms. however, the behaviour in that state is very much the same - waiting for 30 min (?) and then moving it back to activated. Moran.

On 02/16/2012 09:29 AM, Moran Goldboim wrote:
On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job, he would prefer if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing if we call the current behavior a bug or a missing behavior, I believe this is not quite important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. no need for new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data, if host should remain in non-op then don't bother running initVdsOnUp 2. configuration of when to call initvdsonup is orthogonal to auto-init behaviour and if introduced should be on by default and user should be able to configure this either on or off for the host in general (no lower granularity) and can only be configured via the API. When disabled initVdsOnUp would be called only when admin activates the host/storage and any error would keep it inactive (I still don't understand why this is at all needed but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when reason isn't related to host being non-operational.
what is error state? a host will go into error state if it fails to run 3 (configurable) VMs, that succeeded running on other host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" for this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and user just got an error in audit log)
i can see two reasons a host will go into error state: 1. monitoring didn't detect an issue yet, and host would have/will/should go into non-operational mode. if host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so could be it mostly compensated for the first reason which is now better monitored. i wonder how often error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
usually it happened when there were a problematic/ misconfigurated vdsm / libvirt which failed to run vms (nothing we can recover from)- i haven't faced the issue of "host it too loaded" that status has some other syndromes, however the behaviour on that state is very much the same -waiting for 30 min (?) and than move it to activated. Moran.
'host is too loaded' is the only transient state where a temporary 'error' state makes sense, but at the same time, it can also fit the 'non operational' state description. From my experience, the problem with a mis-configured KVM/libvirt/VDSM is never temporary (= magically solved by itself, without concrete user intervention). IMHO, it should move the host to an error state that it would not automatically recover from. Regardless, consolidating the names of the states ('inactive, detached, non operational, maintenance, error, unknown' ...) would be nice too. Probably can't be done for all, of course. Y.

On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
On 02/16/2012 09:29 AM, Moran Goldboim wrote:
On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job; he would prefer it if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing over whether we call the current behavior a bug or missing behavior; I believe this is not that important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. No need for a new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data; if the host should remain in non-op then don't bother running initVdsOnUp. 2. Configuration of when to call initVdsOnUp is orthogonal to the auto-init behaviour, and if introduced it should be on by default; the user should be able to configure this either on or off for the host in general (no lower granularity), and only via the API. When disabled, initVdsOnUp would be called only when the admin activates the host/storage, and any error would keep it inactive (I still don't understand why this is at all needed, but whatever).
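As a rough sketch of the two agreed points above (assuming a per-host boolean flag, here called autoRecoverable, that is on by default and settable only via the API; the type and helper names below are illustrative, not actual engine code), the recovery attempt is driven from the existing monitoring pass and skipped when the collected data says the host should stay non-operational:

    // Sketch only; Host, MonitoringData and the method names are placeholders.
    class NonOperationalHostRecovery {

        interface Host {
            boolean isNonOperational();
            boolean isAutoRecoverable();   // per-host flag, default true, configurable via the API only
        }

        interface MonitoringData {
            boolean problemStillPresent(); // evaluated from the data the monitoring cycle just collected
        }

        // Invoked from the existing monitoring context - no new mechanism needed.
        void maybeRecover(Host host, MonitoringData data) {
            if (!host.isNonOperational() || !host.isAutoRecoverable()) {
                return;
            }
            if (data.problemStillPresent()) {
                return;            // host should remain in non-op, don't bother running initVdsOnUp
            }
            initVdsOnUp(host);     // placeholder for the engine's host activation flow
        }

        private void initVdsOnUp(Host host) {
            // The real initialization/activation flow is outside the scope of this sketch.
        }
    }

When the flag is off, the same initVdsOnUp path would only run on an explicit admin activation, as described in point 2.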
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when reason isn't related to host being non-operational.
what is error state? a host will go into error state if it fails to run 3 (configurable) VMs, that succeeded running on other host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" for this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and user just got an error in audit log)
i can see two reasons a host will go into error state: 1. monitoring didn't detect an issue yet, and host would have/will/should go into non-operational mode. if host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so could be it mostly compensated for the first reason which is now better monitored. i wonder how often error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
usually it happened when there was a problematic/misconfigured vdsm/libvirt which failed to run VMs (nothing we can recover from) - i haven't faced the 'host is too loaded' issue; that status has some other syndromes, however the behaviour in that state is very much the same: waiting for 30 min (?) and then moving it back to activated. Moran.
'Host is too loaded' is the only transient state where a temporary 'error' state makes sense, but at the same time it can also fit the 'non operational' state description. From my experience, a mis-configured KVM/libvirt/VDSM is never a temporary problem (i.e., never magically solved by itself, without concrete user intervention). IMHO, it should move the host to an error state that it would not automatically recover from. Regardless, consolidating the names of the states ('inactive, detached, non operational, maintenance, error, unknown' ...) would be nice too. Probably can't be done for all, of course. Y.
Agreed, most of the causes of the ERROR state aren't transient, but it looks to me as if this state is redundant and could be taken care of as part of the other host states, since the way it's being used today isn't very helpful either. Moran.

----- Original Message -----
From: "Moran Goldboim" <mgoldboi@redhat.com> To: "Yaniv Kaul" <ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Thursday, February 16, 2012 10:01:37 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
On 02/16/2012 09:29 AM, Moran Goldboim wrote:
On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job; he would prefer it if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing over whether we call the current behavior a bug or missing behavior; I believe this is not that important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. No need for a new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data; if the host should remain in non-op then don't bother running initVdsOnUp. 2. Configuration of when to call initVdsOnUp is orthogonal to the auto-init behaviour, and if introduced it should be on by default; the user should be able to configure this either on or off for the host in general (no lower granularity), and only via the API. When disabled, initVdsOnUp would be called only when the admin activates the host/storage, and any error would keep it inactive (I still don't understand why this is at all needed, but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when reason isn't related to host being non-operational.
what is error state? a host will go into error state if it fails to run 3 (configurable) VMs, that succeeded running on other host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" for this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and user just got an error in audit log)
i can see two reasons a host will go into error state: 1. monitoring didn't detect an issue yet, and host would have/will/should go into non-operational mode. if host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so could be it mostly compensated for the first reason which is now better monitored. i wonder how often error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
usually it happened when there was a problematic/misconfigured vdsm/libvirt which failed to run VMs (nothing we can recover from) - i haven't faced the 'host is too loaded' issue; that status has some other syndromes, however the behaviour in that state is very much the same: waiting for 30 min (?) and then moving it back to activated. Moran.
'Host is too loaded' is the only transient state where a temporary 'error' state makes sense, but at the same time it can also fit the 'non operational' state description. From my experience, a mis-configured KVM/libvirt/VDSM is never a temporary problem (i.e., never magically solved by itself, without concrete user intervention). IMHO, it should move the host to an error state that it would not automatically recover from. Regardless, consolidating the names of the states ('inactive, detached, non operational, maintenance, error, unknown' ...) would be nice too. Probably can't be done for all, of course. Y.
Agreed, most of the causes of the ERROR state aren't transient, but it looks to me as if this state is redundant and could be taken care of as part of the other host states, since the way it's being used today isn't very helpful either. Moran. However, I can envision an ERROR state that you don't want to keep a retry mechanism on... which might be a different behavior from the NON-OP one.

On 02/16/2012 10:28 AM, Miki Kenneth wrote:
----- Original Message -----
From: "Moran Goldboim"<mgoldboi@redhat.com> To: "Yaniv Kaul"<ykaul@redhat.com> Cc: engine-devel@ovirt.org Sent: Thursday, February 16, 2012 10:01:37 AM Subject: Re: [Engine-devel] Autorecovery feature plan for review
On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job; he would prefer it if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing over whether we call the current behavior a bug or missing behavior; I believe this is not that important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. No need for a new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data; if the host should remain in non-op then don't bother running initVdsOnUp. 2. Configuration of when to call initVdsOnUp is orthogonal to the auto-init behaviour, and if introduced it should be on by default; the user should be able to configure this either on or off for the host in general (no lower granularity), and only via the API. When disabled, initVdsOnUp would be called only when the admin activates the host/storage, and any error would keep it inactive (I still don't understand why this is at all needed, but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state), but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when the reason isn't related to the host being non-operational.
what is error state? a host will go into error state if it fails to run 3 (configurable) VMs, that succeeded running on other host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" for this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and user just got an error in audit log)
i can see two reasons a host will go into error state: 1. monitoring didn't detect an issue yet, and host would have/will/should go into non-operational mode. if host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so it could be that it mostly compensated for the first reason, which is now better monitored. i wonder how often error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
On 02/16/2012 09:29 AM, Moran Goldboim wrote:
usually it happened when there was a problematic/misconfigured vdsm/libvirt which failed to run VMs (nothing we can recover from) - i haven't faced the 'host is too loaded' issue; that status has some other syndromes, however the behaviour in that state is very much the same: waiting for 30 min (?) and then moving it back to activated. Moran.
On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
'Host is too loaded' is the only transient state where a temporary 'error' state makes sense, but at the same time it can also fit the 'non operational' state description. From my experience, a mis-configured KVM/libvirt/VDSM is never a temporary problem (i.e., never magically solved by itself, without concrete user intervention). IMHO, it should move the host to an error state that it would not automatically recover from. Regardless, consolidating the names of the states ('inactive, detached, non operational, maintenance, error, unknown' ...) would be nice too. Probably can't be done for all, of course. Y.
Agreed, most of the causes of the ERROR state aren't transient, but it looks to me as if this state is redundant and could be taken care of as part of the other host states, since the way it's being used today isn't very helpful either. Moran.
However, I can envision an ERROR state that you don't want to keep a retry mechanism on... which might be a different behavior from the NON-OP one.
It still means that the host will be non-operational, just that you don't want to perform retries on it; it needs to be divided into transient/non-transient treatments (this may apply to other scenarios as well - like qemu isn't there, or virt isn't enabled in the BIOS, etc.). Moran.

On 16/02/12 10:01, Moran Goldboim wrote:
On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
On 02/16/2012 09:29 AM, Moran Goldboim wrote:
On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job; he would prefer it if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing over whether we call the current behavior a bug or missing behavior; I believe this is not that important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. No need for a new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data; if the host should remain in non-op then don't bother running initVdsOnUp. 2. Configuration of when to call initVdsOnUp is orthogonal to the auto-init behaviour, and if introduced it should be on by default; the user should be able to configure this either on or off for the host in general (no lower granularity), and only via the API. When disabled, initVdsOnUp would be called only when the admin activates the host/storage, and any error would keep it inactive (I still don't understand why this is at all needed, but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when reason isn't related to host being non-operational.
what is error state? a host will go into error state if it fails to run 3 (configurable) VMs, that succeeded running on other host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" for this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and user just got an error in audit log)
i can see two reasons a host will go into error state: 1. monitoring didn't detect an issue yet, and host would have/will/should go into non-operational mode. if host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so could be it mostly compensated for the first reason which is now better monitored. i wonder how often error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
usually it happened when there was a problematic/misconfigured vdsm/libvirt which failed to run VMs (nothing we can recover from) - i haven't faced the 'host is too loaded' issue; that status has some other syndromes, however the behaviour in that state is very much the same: waiting for 30 min (?) and then moving it back to activated. Moran.
'Host is too loaded' is the only transient state where a temporary 'error' state makes sense, but at the same time it can also fit the 'non operational' state description. From my experience, a mis-configured KVM/libvirt/VDSM is never a temporary problem (i.e., never magically solved by itself, without concrete user intervention). IMHO, it should move the host to an error state that it would not automatically recover from. Regardless, consolidating the names of the states ('inactive, detached, non operational, maintenance, error, unknown' ...) would be nice too. Probably can't be done for all, of course. Y.
Agreed, most of the causes of the ERROR state aren't transient, but it looks to me as if this state is redundant and could be taken care of as part of the other host states, since the way it's being used today isn't very helpful either. Moran.
Currently host status is changed to non-operational for various reasons; some of them are static, like the vdsm version or the cpu model, and some of them are (potentially) transient, like a network failure. The Error state, as Itamar detailed earlier in this thread, is currently used for what I would call (potentially) transient reasons. The original intention (I think) was to move a host to non-operational for reasons which are static and to Error for reasons which are transient, and I guess that is why there is a timeout on the Error state and OE tries to initialize a host after 30 minutes in the Error state. The problem is that as the code evolved this is not the case anymore. I suggest that we use the non-operational state for transient reasons, which we detect in the monitoring flow or on execution failures, and do the initialization retry as Laszlo suggested in the document. Use the Error state for static errors and remove the 'timeout' mechanism we currently have (from the Error state). Livnat
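To make the proposal concrete, a small sketch of the classification described above (the reason values below are examples taken from this thread, not an exhaustive or authoritative mapping): transient reasons keep the host non-operational and are retried from the monitoring flow, while static reasons move it to Error with no timeout and wait for the admin.

    import java.util.EnumSet;
    import java.util.Set;

    // Illustrative only; reason names and the mapping are examples, not engine code.
    class HostFailureClassifier {
        enum Reason {
            NETWORK_FAILURE, STORAGE_DOMAIN_INACCESSIBLE,    // (potentially) transient
            VDSM_VERSION_INCOMPATIBLE, CPU_MODEL_MISMATCH    // static
        }

        enum TargetState { NON_OPERATIONAL_WITH_RETRY, ERROR_NO_TIMEOUT }

        private static final Set<Reason> TRANSIENT =
                EnumSet.of(Reason.NETWORK_FAILURE, Reason.STORAGE_DOMAIN_INACCESSIBLE);

        static TargetState classify(Reason reason) {
            // Transient: stay non-operational and retry initialization from monitoring,
            // as suggested in the autorecovery document.
            // Static: move to Error and wait for explicit admin intervention.
            return TRANSIENT.contains(reason)
                    ? TargetState.NON_OPERATIONAL_WITH_RETRY
                    : TargetState.ERROR_NO_TIMEOUT;
        }
    }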

On 02/16/2012 11:22 AM, Livnat Peer wrote:
On 16/02/12 10:01, Moran Goldboim wrote:
On 02/16/2012 09:35 AM, Yaniv Kaul wrote:
On 02/16/2012 09:29 AM, Moran Goldboim wrote:
On 02/16/2012 12:38 AM, Itamar Heim wrote:
On 02/15/2012 07:02 PM, Livnat Peer wrote:
On 15/02/12 18:28, Ayal Baron wrote:
----- Original Message -----
Hi,
A short summary from the call today, please correct me if I forgot or misunderstood something.
Ayal argued that the failed host/storagedomain should be reactivated by a periodically executed job; he would prefer it if the engine could [try to] correct the problem right on discovery. Livnat's point was that this is hard to implement and it is OK if we move it to Nonoperational state and periodically check it again.
There was a little arguing over whether we call the current behavior a bug or missing behavior; I believe this is not that important.
I did not fully understand the last few sentences from Livnat - did we manage to agree on a change in the plan?
A couple of points that we agreed upon: 1. No need for a new mechanism, just initiate this from the monitoring context. Preferably, if not difficult, evaluate the monitoring data; if the host should remain in non-op then don't bother running initVdsOnUp. 2. Configuration of when to call initVdsOnUp is orthogonal to the auto-init behaviour, and if introduced it should be on by default; the user should be able to configure this either on or off for the host in general (no lower granularity), and only via the API. When disabled, initVdsOnUp would be called only when the admin activates the host/storage, and any error would keep it inactive (I still don't understand why this is at all needed, but whatever).
Also a note from Moran on the call was to check if we can unify the non-operational and Error statuses of the host. It was mentioned on the call that the reason for having ERROR state is for recovery (time out of the error state) but since we are about to recover from non-operational status as well there is no reason to have two different statuses.
they are not exactly the same. or should i say, error is supposed to be when reason isn't related to host being non-operational.
what is error state? a host will go into error state if it fails to run 3 (configurable) VMs, that succeeded running on other host on retry. i.e., something is wrong with that host, failing to launch VMs. as it happens, it already "auto recovers" for this mode after a certain period of time.
why? because the host will fail to run virtual machines, and will be the least loaded, so it will be the first target selected to run them, which will continue to fail.
so there is a negative scoring mechanism on number of errors, till host is taken out for a while.
(I don't remember if the reverse is true and the VM goes into error mode if the VM failed to launch on all hosts per number of retries. i think this wasn't needed and user just got an error in audit log)
i can see two reasons a host will go into error state: 1. monitoring didn't detect an issue yet, and host would have/will/should go into non-operational mode. if host will go into non-operational mode, and will auto recover with the above flow, i guess it is fine.
2. cause for failure isn't something we monitor for (upgraded to a bad version of qemu, or qemu got corrupted).
now, the error mode was developed quite a long time ago (august 2007 iirc), so could be it mostly compensated for the first reason which is now better monitored. i wonder how often error state is seen due to a reason which isn't monitored already. moran - do you have examples of when you see error state of hosts?
usually it happened when there was a problematic/misconfigured vdsm/libvirt which failed to run VMs (nothing we can recover from) - i haven't faced the 'host is too loaded' issue; that status has some other syndromes, however the behaviour in that state is very much the same: waiting for 30 min (?) and then moving it back to activated. Moran.
'Host is too loaded' is the only transient state where a temporary 'error' state makes sense, but at the same time it can also fit the 'non operational' state description. From my experience, a mis-configured KVM/libvirt/VDSM is never a temporary problem (i.e., never magically solved by itself, without concrete user intervention). IMHO, it should move the host to an error state that it would not automatically recover from. Regardless, consolidating the names of the states ('inactive, detached, non operational, maintenance, error, unknown' ...) would be nice too. Probably can't be done for all, of course. Y.
Agreed, most of the causes of the ERROR state aren't transient, but it looks to me as if this state is redundant and could be taken care of as part of the other host states, since the way it's being used today isn't very helpful either. Moran.
Currently host status is changed to non-operational for various reasons; some of them are static, like the vdsm version or the cpu model, and some of them are (potentially) transient, like a network failure.
The Error state, as Itamar detailed earlier in this thread, is currently used for what I would call (potentially) transient reasons.
The original intention (I think) was to move a host to non-operational for reasons which are static and to Error for reasons which are transient, and I guess that is why there is a timeout on the Error state and OE tries to initialize a host after 30 minutes in the Error state.
The problem is that as the code evolved this is not the case anymore. I suggest that we use the non-operational state for transient reasons, which we detect in the monitoring flow or on execution failures, and do the initialization retry as Laszlo suggested in the document. Use the Error state for static errors and remove the 'timeout' mechanism we currently have (from the Error state).
We are just adding a retry mechanism where we didn't have one. I wouldn't remove the mechanism we already have so soon, as we may find ourselves bringing it back very quickly as a 'need retry/timeout on errors' requirement. It sounds like both statuses are indeed different - but even if we think Error covers mostly non-transient causes, we can't be sure.

On 15/02/12 12:23, Yaniv Kaul wrote:
On 02/14/2012 11:36 PM, Yair Zaslavsky wrote:
On 02/14/2012 10:03 PM, Itamar Heim wrote:
On 02/14/2012 09:20 AM, Yair Zaslavsky wrote:
On 02/14/2012 08:57 AM, Livnat Peer wrote:
On 14/02/12 05:56, Itamar Heim wrote:
On 02/13/2012 12:32 PM, Laszlo Hornyak wrote:
Hi,
Please review the plan document for autorecovery. http://www.ovirt.org/wiki/Features/Autorecovery
why would we disable auto recovery by default? it sounds like the preferred behavior?
I think that by default Laszlo meant in the upgrade process to maintain current behavior. I agree that for new entities the default should be true.
On 02/14/2012 08:59 AM, Itamar Heim wrote:
i think the only combination which will allow this is for db to default to false and code to default to true for this property?
Why can't we, during the upgrade process, set the value to false for all existing entities in the DB, but still have the column defined as "default true"?
because upgrade and clean install are running the same scripts?
I guess I still fail to understand. Scenarios (as both upgrade and clean install run the same scripts): a. In an environment to be upgraded we have X entities that are non-recoverable - after the upgrade these X entities have the boolean flag set to false, and new entities in the system will be created with auto recoverable set to true. b. In an environment to be clean installed we have 0 existing entities - after the clean install all new entities in the system will be created with auto recoverable set to true. Will this be considered bad behavior?
I still fail to understand why you 'punish' existing objects and not give them the new feature enabled by default. Y.
We agreed that users will get the auto-recovery feature by default (the wiki is updated accordingly). The discussion above is theoretical, about setting different values during upgrade versus the default for new entities.
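Although the thread concluded that auto-recovery is on by default, for reference here is a sketch of the "default true in code, explicit false for upgraded rows" combination discussed above (the class, field, table and column names are made up for the example):

    // Hypothetical entity; the real field/column names may differ.
    class RecoverableEntity {
        // New entities (clean install, or hosts/storage domains added later) default to true.
        private boolean autoRecoverable = true;

        boolean isAutoRecoverable() { return autoRecoverable; }

        void setAutoRecoverable(boolean autoRecoverable) {
            this.autoRecoverable = autoRecoverable;
        }
    }
    // An upgrade script would then explicitly flip the pre-existing rows, e.g.
    //   UPDATE some_entity_table SET auto_recoverable = false;
    // so upgraded environments keep the old behaviour while the column itself is
    // defined with DEFAULT true for rows created afterwards.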
participants (10)
- Ayal Baron
- Itamar Heim
- Laszlo Hornyak
- Livnat Peer
- Miki Kenneth
- Moran Goldboim
- Omer Frenkel
- Oved Ourfalli
- Yair Zaslavsky
- Yaniv Kaul