[ovirt-devel] incoming migration semaphore (and possible SLA consequences)
Tomas Jelinek
tjelinek at redhat.com
Tue Sep 29 13:35:38 UTC 2015
Hi all,
as part of the effort to enhance the migration convergence [1] we are proposing a semaphore for incoming migrations [2] (similar to outgoing).
Its purpose is to protect the destination host from migration storms, where too many migrations arrive at it from different sources at once.
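To make the idea concrete, here is a minimal sketch of such a destination-side guard. It is not the code from the patch [2]; the class name, the method names and the max_incoming parameter are all assumptions made up for illustration.

```python
import threading


class IncomingMigrationLimiter:
    """Hypothetical destination-side guard, not VDSM's actual code."""

    def __init__(self, max_incoming):
        # max_incoming is an assumed per-host limit on concurrent
        # incoming migrations.
        self._slots = threading.BoundedSemaphore(max_incoming)

    def try_admit(self):
        # Non-blocking acquire: returns False immediately when every
        # slot is taken, so the destination can refuse the migration.
        return self._slots.acquire(blocking=False)

    def release(self):
        # To be called when an admitted migration finishes or fails.
        self._slots.release()


limiter = IncomingMigrationLimiter(max_incoming=2)
# Three sources ask at once; only two slots exist.
results = [limiter.try_admit() for _ in range(3)]
# One migration finishes, which frees a slot for the next request.
limiter.release()
after_release = limiter.try_admit()
```

The three options below differ only in what the source host does when try_admit() comes back False.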
There are basically three ways to do it (with pros and cons):
1: when the destination host refuses the migration, the source host tries again later (since no migration takes forever, the retried migration will eventually manage to start)
(+) pros:
(+) if the engine wants to migrate to a specific host (and only to that host, because the user picked it), then it only sends the command and the migration will happen (now or later)
(+) will not interfere with engine re-runs since the migration will fail only when there is a real issue
(+) will be consistent with the current outgoing semaphore (since the outgoing semaphore also waits until it has capacity and then starts the migration)
(+) VDSM is more autonomous, because once the engine has sent the command, VDSM will carry it out even if the engine disappears at that moment
(-) cons:
(-) retrying inside VDSM is not common
(-) if the user does not pick a specific destination and just wants to migrate the machine off the source, waiting for the destination to have capacity can be wasteful, since failing the migration and picking a different host could lead to better results
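A rough sketch of the source-side behaviour in this first option, under stated assumptions: try_start is a hypothetical callable that returns True once the destination admits the migration, and retry_delay and the injectable sleep are invented here to keep the example testable. The real patch [2] may look quite different.

```python
import time


def migrate_with_retry(try_start, retry_delay=0.0, sleep=time.sleep):
    """Keep asking the destination until it accepts the migration.

    try_start is a hypothetical callable, not a VDSM API. The return
    value is the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_start():
            return attempts
        # Destination refused: wait a bit, then ask again. The engine
        # is not involved in this loop at all.
        sleep(retry_delay)


# The destination refuses twice, then has a free slot.
answers = iter([False, False, True])
attempts = migrate_with_retry(lambda: next(answers))
```

Note that the loop never reports a failure for lack of capacity, which is exactly why this option does not interfere with engine re-runs but can keep a VM waiting on a busy host.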
2: when the destination host refuses the migration, the source host reports "migration failed" to the engine and the engine has to handle it somehow
(+) pros:
(+) simpler vdsm (try to migrate, if the destination does not have capacity, fail)
(+) lets the engine pick a different destination host
(-) cons:
(-) not consistent with the outgoing migration semaphore (if several VMs are waiting for the outgoing migration semaphore, the migration does not fail but waits)
(-) engine would have to handle different kinds of migration failure reasons
(-) VDSM is not autonomous - if the engine disappears, the migration will not be started
(-) Here I'm not sure about the consequences for the scheduler, but I think it would have to be reworked to accommodate the different kinds of re-run. Any ideas from someone more familiar with this? Roy, Martin?
3: (hybrid) - if the user picks a specific host, VDSM will use the first option; if the user does not pick a specific host, VDSM will use the second option
(+) pros:
(+) works well with both cases when the intention is to migrate the machine TO A SPECIFIC host and when the intention is just to migrate the VM out to ANY host
(-) cons:
(-) more complicated VDSM
(-) will still interfere with engine scheduling
(-) not consistent with current VDSM's outgoing semaphore
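For comparison, the hybrid could be sketched roughly as follows. Everything here is hypothetical: the DestinationBusy error, the destination_pinned flag (standing for "the user picked this host") and the try_start callable are names invented for this example, not existing VDSM or engine APIs.

```python
import time


class DestinationBusy(Exception):
    """Hypothetical error reported to the engine when the destination is full."""


def start_migration(try_start, destination_pinned, retry_delay=0.0,
                    sleep=time.sleep):
    """Wait for a user-picked destination, fail fast otherwise.

    Returns the number of attempts made before the destination accepted.
    """
    attempts = 1
    while not try_start():
        if not destination_pinned:
            # Second option: fail so the engine can pick another host.
            raise DestinationBusy("destination refused the migration")
        # First option: the user asked for this host, so wait and retry.
        sleep(retry_delay)
        attempts += 1
    return attempts


# Pinned destination: refused once, accepted on the second attempt.
answers = iter([False, True])
pinned_attempts = start_migration(lambda: next(answers),
                                  destination_pinned=True)

# Unpinned destination: the first refusal is reported as a failure.
try:
    start_migration(lambda: False, destination_pinned=False)
    failed_fast = False
except DestinationBusy:
    failed_fast = True
```

The extra branch is small, but it means the engine still has to understand the "destination busy" failure for the unpinned case, which is the scheduling interference listed in the cons above.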
The currently proposed patch [2] is the first option.
Please note that we would also like to make the scheduler aware of the maximum incoming migrations limit, so that it prevents the storms in the first place, but that is a separate topic (no patches around yet).
The question here is how VDSM should protect itself when a storm does happen.
Any ideas?
Thank you,
Tomas
[1]: www.ovirt.org/Features/Migration_Enhancements
[2]: https://gerrit.ovirt.org/#/c/45954/