Hi all,
as part of the effort to enhance the migration convergence [1] we are proposing a
semaphore for incoming migrations [2] (similar to outgoing).
Its purpose is to protect the destination host from migration storms where too many
migrations are coming to it from different sources.
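To make the idea concrete, here is a rough sketch of what such a destination-side guard could look like. It is not the code from [2]; the class name, method names and the max_incoming parameter are made up for illustration.

```python
import threading


class IncomingMigrationLimiter:
    """Hypothetical destination-side guard; not the code from patch [2]."""

    def __init__(self, max_incoming):
        self._sem = threading.BoundedSemaphore(max_incoming)

    def try_accept(self):
        # Non-blocking: return False instead of waiting when the host is full.
        return self._sem.acquire(blocking=False)

    def finished(self):
        # Call once per accepted migration when it completes or fails.
        self._sem.release()


limiter = IncomingMigrationLimiter(max_incoming=2)
results = [limiter.try_accept() for _ in range(3)]  # third request is refused
limiter.finished()                                  # one migration ends
accepted_after_release = limiter.try_accept()       # a slot is free again
```

What the source host does when try_accept() returns False is exactly the question discussed in the rest of this mail.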
There are basically three ways to do it (with pros/cons):
1: when the destination host refuses the migration, the source host tries again later
(since no migration takes forever, the migration will eventually succeed in
starting)
(+) pros:
(+) if the engine wants to migrate to a specific host (and only to that host
because the user picked it) then it only sends the command and it will happen (now or
later)
(+) will not interfere with engine re-runs since the migration will fail only when
there is a real issue
(+) will be consistent with the current outgoing semaphore (since the outgoing
semaphore also waits until it has capacity and then starts the migration)
(+) VDSM is more autonomous because after the engine sends the command, VDSM will do it
even if the engine disappears at that moment
(-) cons:
(-) retrying in VDSM is not common
(-) if the user does not pick a specific destination and just wants to migrate the
machine out of the source, waiting for the destination to have capacity can be wasteful
since failing the migration and picking a different host could lead to better results
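A minimal sketch of the source-side behaviour in this first option, under the assumption that the source can ask the destination whether it accepts the migration. The function name, the try_start callable and the retry interval are hypothetical, not taken from the patch.

```python
import time


def migrate_with_retry(try_start, retry_interval=10.0, sleep=time.sleep):
    """Keep asking the destination until it accepts; return the attempt count.

    try_start is a hypothetical callable returning True once the destination
    has accepted the migration and False while it has no capacity.
    """
    attempts = 1
    while not try_start():
        sleep(retry_interval)  # destination is busy, wait and ask again
        attempts += 1
    return attempts


# Simulated destination that refuses twice and then accepts.
answers = iter([False, False, True])
waits = []
attempts = migrate_with_retry(lambda: next(answers), retry_interval=5.0,
                              sleep=waits.append)
```

Note that nothing here reports back to the engine while waiting, which is what makes VDSM autonomous in this option and also what makes the wait potentially wasteful.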
2: when the destination host refuses the migration, the source host reports
"migration failed" to the engine and the engine has to handle it somehow
(+) pros:
(+) simpler VDSM (try to migrate, if the destination does not have capacity, fail)
(+) lets the engine pick a different destination host
(-) cons:
(-) not consistent with the outgoing migration semaphore (since if more VMs are
waiting for the outgoing migration semaphore, the migration does not fail but waits)
(-) engine would have to handle different kinds of migration failure reasons
(-) VDSM is not autonomous: if the engine disappears, the migration will not be
started
(-) I'm not sure about the consequences for the scheduler here, but I think it would
have to be reworked to accommodate the different kinds of re-run. Any ideas from someone
more familiar with this? Roy, Martin?
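For comparison, a sketch of how the engine side of the second option might look if "destination busy" were a distinct failure reason. The exception and function names are invented for illustration, and the sketch is in Python only to match the other examples in this mail.

```python
class DestinationBusy(Exception):
    """Hypothetical failure reason: the destination has no free incoming slot."""


def migrate_to_any(candidates, start_migration):
    """Engine-side re-run: try candidates in order, skipping busy hosts."""
    for host in candidates:
        try:
            start_migration(host)
            return host
        except DestinationBusy:
            continue  # capacity problem only, so another host may still work
    return None  # every candidate was busy


def fake_start(host):
    # Stand-in for the real migrate call; here only host "b" has capacity.
    if host != "b":
        raise DestinationBusy(host)


chosen = migrate_to_any(["a", "b", "c"], fake_start)
none_free = migrate_to_any(["a", "c"], fake_start)
```

The point of the separate exception is that the re-run logic has to tell "busy" apart from a real migration failure, which is the extra handling mentioned in the cons above.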
3: (hybrid) if the user picks a specific host, VDSM will use the first option; if the
user does not pick a specific host, VDSM will use the second option
(+) pros:
(+) works well in both cases: when the intention is to migrate the machine TO A
SPECIFIC host and when the intention is just to migrate the VM out to ANY host
(-) cons:
(-) more complicated VDSM
(-) still will interfere with engine scheduling
(-) not consistent with current VDSM's outgoing semaphore
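The hybrid rule itself is small; a sketch of the decision, assuming the migrate command carries a flag saying whether the user picked the destination (that flag and the names below are hypothetical):

```python
from enum import Enum


class OnBusy(Enum):
    WAIT_AND_RETRY = 1  # behave as in option 1
    FAIL_TO_ENGINE = 2  # behave as in option 2


def policy_for(user_picked_host):
    """Hybrid rule: wait for a pinned destination, fail fast otherwise."""
    return OnBusy.WAIT_AND_RETRY if user_picked_host else OnBusy.FAIL_TO_ENGINE
```

The extra complexity listed in the cons comes from VDSM having to implement both branches, not from the decision itself.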
The currently proposed patch [2] is the first option.
Please note that we would also like to make the scheduler aware of the max incoming
migrations limit, thus preventing the storms, but that is a separate topic (no patches
around yet).
The question here is how VDSM should protect itself when a storm does happen.
Any ideas?
Thank you,
Tomas
[1]: www.ovirt.org/Features/Migration_Enhancements
[2]: https://gerrit.ovirt.org/#/c/45954/