Please note that we would also like to enrich the scheduler to be
aware of max incoming migrations limit thus preventing the storms,
but it is a separate topic (no patches around yet).
This is easy to do, but please make sure you distinguish migration
from a VM start.
It might also be better to only use the number of ongoing migrations
for penalizing the host. In that case the storm would be smaller and
an overloaded host would still be a possible migration destination
when there is no better host to use (because of other constraints for
example).
There is also the consideration of what happens when Maintenance
mode is triggered. The user might want the "storm" to happen to be
able to save VMs from a compromised host before it fails completely.
This might work fine when the scoring approach is used.
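To make the scoring idea concrete, here is a minimal sketch. It is not engine code: Host, score_host, pick_destination and the penalty weight are all hypothetical names made up for illustration. Each ongoing incoming migration adds a penalty to the host instead of filtering it out, so a busy host stays eligible when nothing better is available.

```python
from dataclasses import dataclass

# Hypothetical weight: score penalty per ongoing incoming migration.
MIGRATION_PENALTY = 10


@dataclass
class Host:
    name: str
    base_score: int           # lower is better; stands in for all other policies
    incoming_migrations: int  # migrations in progress towards this host


def score_host(host):
    """Penalize a host for ongoing migrations only (VM starts are not counted)."""
    return host.base_score + MIGRATION_PENALTY * host.incoming_migrations


def pick_destination(candidates):
    """Return the lowest-scoring host, or None when there are no candidates."""
    if not candidates:
        return None
    return min(candidates, key=score_host)
```

With this shape a Maintenance run would still find a destination even if every remaining host is already receiving migrations; the penalty only changes the ordering.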
Martin
On Tue, Sep 29, 2015 at 3:35 PM, Tomas Jelinek <tjelinek(a)redhat.com> wrote:
Hi all,
as part of the effort to enhance the migration convergence [1] we are proposing a
semaphore for incoming migrations [2] (similar to outgoing).
Its purpose is to protect the destination host from migration storms where too many
migrations are coming to it from different sources.
There are basically 3 ways to do it (with pros/cons):
1: when the destination host refuses the migration, the source host tries it again later
(since no migration takes forever, after some time the migration will succeed in
starting)
(+) pros:
(+) if the engine wants to migrate to a specific host (and only to that specific host
because the user picked it) then it only sends the command and it will happen (now or
later)
(+) will not interfere with engine re-runs since the migration will fail only when
there is a real issue
(+) will be consistent with the current outgoing semaphore (since the outgoing
semaphore also waits until it has capacity and then starts the migration)
(+) VDSM is more autonomous because after the engine sends the command, VDSM will do
it even if the engine disappears at that moment
(-) cons:
(-) re-try on VDSM is not common
(-) if the user does not pick a specific destination and just wants to migrate the
machine out of the source, waiting on the destination to have capacity can be wasteful
since failing the migration and picking a different host could lead to better results
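A rough sketch of how the first option could look, under stated assumptions: this is not the code from patch [2], and IncomingMigrationSlots, migrate_with_retry and their parameters are hypothetical names. The destination refuses without blocking when it has no free slot, and the source keeps retrying after a delay.

```python
import threading
import time


class IncomingMigrationSlots:
    """Destination side: a bounded number of concurrent incoming migrations."""

    def __init__(self, limit):
        self._sem = threading.BoundedSemaphore(limit)

    def try_acquire(self):
        # Non-blocking: refuse immediately when no slot is free.
        return self._sem.acquire(blocking=False)

    def release(self):
        self._sem.release()


def migrate_with_retry(dest, do_migrate, retry_delay=1.0, max_attempts=None,
                       sleep=time.sleep):
    """Source side: ask the destination again until it accepts.

    max_attempts=None means "retry forever", which matches the assumption
    that running migrations finish eventually and free their slots.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if dest.try_acquire():
            try:
                return do_migrate()
            finally:
                dest.release()  # free the slot whether the migration worked or not
        sleep(retry_delay)
    raise RuntimeError("destination never had capacity")
```

The engine is not involved in the loop at all, which is where both the autonomy (pro) and the wasted waiting (con) come from.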
2: when the destination host refuses the migration, the source host returns to engine
"migration failed" and the engine will have to handle it somehow
(+) pros:
(+) simpler vdsm (try to migrate, if the destination does not have capacity, fail)
(+) lets the engine pick a different destination host
(-) cons:
(-) not consistent with the outgoing migration semaphore (since if there are more VMs
waiting for the outgoing migration semaphore, the migration does not fail but waits)
(-) engine would have to handle different kinds of migration failure reasons
(-) VDSM is not autonomous - if the engine disappears the migration will not be
started
(-) Here I'm not sure about the consequences for the scheduler but I think it would
have to be reworked to accommodate the different kinds of re-run. Any ideas from someone
more familiar with this? Roy, Martin?
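For comparison, the second option might be sketched as below. All names (DestinationBusy, migrate_once, engine_migrate, the free_slots dict) are hypothetical and only illustrate the control flow; the real engine re-run logic is more involved.

```python
class DestinationBusy(Exception):
    """Hypothetical failure reason: the destination has no free incoming slot."""


def migrate_once(vm, host, free_slots):
    """VDSM side (sketch): try once, fail immediately if the host is full."""
    if free_slots.get(host, 0) <= 0:
        raise DestinationBusy(host)
    free_slots[host] -= 1
    return host


def engine_migrate(vm, candidates, free_slots):
    """Engine side (sketch): on 'busy', re-run with the next candidate host."""
    for host in candidates:
        try:
            return migrate_once(vm, host, free_slots)
        except DestinationBusy:
            continue  # a distinct failure reason the engine has to recognise
    return None  # every candidate refused; the engine must decide what to do next
```

The retry loop now lives in the engine, so nothing happens if the engine goes away between attempts.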
3: (hybrid) - if the user picks a specific host, VDSM will use the first way, if the user
does not pick a specific host, VDSM will use the second option
(+) pros:
(+) works well with both cases when the intention is to migrate the machine TO A
SPECIFIC host and when the intention is just to migrate the VM out to ANY host
(-) cons:
(-) more complicated VDSM
(-) still will interfere with engine scheduling
(-) not consistent with current VDSM's outgoing semaphore
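The hybrid behaviour boils down to one branch on whether the destination was pinned by the user. A minimal sketch, with hypothetical names (migrate_hybrid, pinned, try_acquire) that do not come from any patch:

```python
import time


class DestinationBusy(Exception):
    """Hypothetical failure reason reported back to the engine."""


def migrate_hybrid(try_acquire, pinned, retry_delay=1.0, sleep=time.sleep):
    """Wait for capacity when the host was pinned by the user, else fail fast.

    try_acquire is a callable returning True once the destination accepts.
    """
    while not try_acquire():
        if not pinned:
            # Any-host migration: let the engine pick another destination.
            raise DestinationBusy()
        # Pinned migration: behave like the first option and try again later.
        sleep(retry_delay)
    return True
```

Both code paths have to exist in VDSM, which is the "more complicated VDSM" con in practice.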
The currently proposed patch [2] is the first option.
Please note that we would also like to enrich the scheduler to be aware of max incoming
migrations limit thus preventing the storms, but it is a separate topic (no patches around
yet).
The question here is how VDSM should protect itself when a storm happens.
Any ideas?
Thank you,
Tomas
[1]: www.ovirt.org/Features/Migration_Enhancements
[2]: https://gerrit.ovirt.org/#/c/45954/