Taking down the engine without setting global maintenance


Yedidyah Bar David

7 Mar 2019

2:26 a.m.

Hi all,

How about making this change:

Right before the engine goes down cleanly, it marks the shared storage to say that it did not crash but exited cleanly, and then HE-HA will not try to restart it on another host. Perhaps make this optional, so that users can do clean shutdowns and still test HA cleanly (or for other use cases where users might not want this).

This should help in a lot of cases where people restarted their engine for some reason, e.g. an upgrade, and forgot to set maintenance.

Makes sense?

-- Didi
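The proposal above can be sketched as a small decision function. This is only an illustration under stated assumptions: `SharedState`, `clean_shutdown_marker`, `honor_marker` and `decide` are hypothetical names invented for this sketch and are not taken from the hosted-engine HA code.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DO_NOTHING = "do_nothing"
    RESTART_ELSEWHERE = "restart_elsewhere"


@dataclass
class SharedState:
    """Hypothetical view of what an HA agent reads from shared storage."""
    engine_up: bool
    global_maintenance: bool
    clean_shutdown_marker: bool  # hypothetical flag from the proposal


def decide(state: SharedState, honor_marker: bool = True) -> Action:
    """Decide what an HA agent would do under the proposed rule."""
    # A healthy engine or an explicit global maintenance needs no action.
    if state.engine_up or state.global_maintenance:
        return Action.DO_NOTHING
    # Proposed behaviour: a clean exit is not treated as a failure.
    # honor_marker=False models the "make this optional" part, so that
    # HA can still be tested with clean shutdowns.
    if honor_marker and state.clean_shutdown_marker:
        return Action.DO_NOTHING
    return Action.RESTART_ELSEWHERE
```

With `honor_marker=False` the function behaves as if the marker did not exist, which corresponds to the optional switch suggested in the message.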


Martin Sivak

7 Mar 2019

3:29 a.m.

Hi,

there is no way to distinguish an engine that is not responsive (software or network issue) from a VM that is being powered off. The shutdown takes some time during which you just do not know.

Global maintenance informs the tooling in advance that something like this is going to happen.

Who do you expect should be touching the shared storage? The engine VM itself? That might be possible, but remember the jboss instance is just the top of the process hierarchy. There are a lot of components where something might break during shutdown (filesystem umount timeout for example).

Martin

Yedidyah Bar David

3:34 a.m.

On Thu, Mar 7, 2019 at 11:30 AM Martin Sivak <msivak@redhat.com> wrote:

...

Hi,

there is no way to distinguish an engine that is not responsive (software or network issue) from a VM that is being powered off. The shutdown takes some time during which you just do not know.

_I_ do not know, but the user might still know beforehand.

...

Global maintenance informs the tooling in advance that something like this is going to happen.

Yes. But users keep forgetting to set it. So I am trying to come up with something that will fix that :-) Perhaps instead of my original text, use something like "Right before the engine goes down, it should set global maintenance".

...

Who do you expect should be touching the shared storage? The engine VM itself? That might be possible, but remember the jboss instance is just the top of the process hierarchy. There are a lot of components where something might break during shutdown (filesystem umount timeout for example).

I did say "engine", not "engine vm". But see above for perhaps clearer text.


-- Didi

Simone Tiraboschi

3:41 a.m.

On Thu, Mar 7, 2019 at 10:34 AM Yedidyah Bar David <didi@redhat.com> wrote:

Global maintenance informs the tooling in advance that something like this is going to happen.

Yes. But users keep forgetting to set it. So I am trying to come up with something that will fix that :-)

Right now we have exactly the opposite: engine-setup already checks for global maintenance mode and exits if we are on hosted-engine and not in global maintenance mode. The check works on the engine DB, which reflects what the hosts report when polled, so there is a bit of latency here. https://github.com/oVirt/ovirt-engine/blob/master/packaging/setup/plugins/ov...


Yedidyah Bar David

4:04 a.m.

On Thu, Mar 7, 2019 at 11:41 AM Simone Tiraboschi <stirabos@redhat.com> wrote:

Right now we have exactly the opposite: engine-setup already checks for global maintenance mode and exits if we are on hosted-engine and not in global maintenance mode. The check works on the engine DB, which reflects what the hosts report when polled, so there is a bit of latency here. https://github.com/oVirt/ovirt-engine/blob/master/packaging/setup/plugins/ov...

You are right, if the engine restart was only via engine-setup. But there might be other reasons for restarting.

Martin's claim, AFAIU, is more-or-less:

When the engine goes down, it can't know if it's part of a graceful/clean reboot. It can be due to a problem, which is severe enough to take the machine down and not take it up again, but still not severe enough to prevent clean shutdown of the engine itself.

Martin - is it so?

Personally, I am not sure I agree that this flow is likely enough to make my suggestion problematic (meaning it would cause HA to leave the engine VM down when it would actually have been better to try starting it on another host). But I can see the point.

Let's say that I mainly think we should differentiate between a clean shutdown and a non-responsive engine (died via a power cut, or network problem, or whatever). If we do not want to consider this as a "global maintenance" (meaning, do nothing until the user clears it, or if we set it ourselves, until the engine starts again), perhaps at least make HA wait longer (say, 30 minutes), and/or notify a few times by email, or something like that.

I simply wonder how many times HA actually saved people from a long(er) unplanned engine downtime compared to how many times it was simply annoyingly restarting the vm in the middle of some routine maintenance...

-- Didi

Martin Sivak

5:22 a.m.

...

When the engine goes down, it can't know if it's part of a graceful/clean reboot. It can be due to a problem, which is severe enough to take the machine down and not take it up again, but still not severe enough to prevent clean shutdown of the engine itself.

That, and the fact that the engine does not have direct access to storage and has to go through vdsm, meaning the flagging mechanism might not be reliable enough. Also, there has to be an automatic way to reset it and make the VM start again, or the user will wonder why the next outage left the engine down. Figuring out the rules for the two automatic actions (entering GM and resetting GM) is not trivial.

...

perhaps at least make HA wait longer (say, 30 minutes), and/or notify a few times by email, or something like that.

The delay is at least 5 minutes before shutdown is initiated and you will get a couple of emails (well, at least one; I do not remember how often we repeat it).

...

I simply wonder how many times HA actually saved people from a long(er) unplanned engine downtime compared to how many times it was simply annoyingly restarting the vm in the middle of some routine maintenance...

You do not touch the engine that often and that is why people forget the right procedure. Engine offline "migration" was visible in many logs I reviewed during bug analyses. So this is probably working well for most people most of the time (e.g. when nothing is changing).

Martin
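The two ideas discussed in the thread, a longer wait after a clean shutdown and an automatic reset of the marker, can be combined in a rough sketch. Everything here is an assumption made for illustration: `Marker`, `restart_deadline` and `MARKER_MAX_AGE_S` are hypothetical, the 5 and 30 minute values are simply the figures mentioned in the messages, and the freshness rule is only one possible answer to the reset question, not something the thread agreed on.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_DELAY_S = 5 * 60           # "at least 5 minutes" mentioned in the thread
CLEAN_SHUTDOWN_DELAY_S = 30 * 60   # the "say, 30 minutes" suggestion
MARKER_MAX_AGE_S = 10 * 60         # hypothetical freshness window


@dataclass
class Marker:
    """Hypothetical clean-shutdown marker written before the engine stops."""
    set_at: float  # time the marker was written, in seconds


def restart_deadline(down_since: float, marker: Optional[Marker]) -> float:
    """Return the time at which HA would try to restart the engine VM."""
    # The marker only counts if it was written shortly before this outage.
    # A stale marker from an earlier shutdown is ignored, which acts as an
    # automatic reset without anyone having to clear it.
    fresh = (
        marker is not None
        and 0 <= down_since - marker.set_at <= MARKER_MAX_AGE_S
    )
    delay = CLEAN_SHUTDOWN_DELAY_S if fresh else DEFAULT_DELAY_S
    return down_since + delay
```

In this sketch a clean shutdown never blocks a restart indefinitely, it only postpones it, so a forgotten marker cannot leave the engine down after the next outage.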
