incoming migration semaphore (and possible SLA consequences)

29 Sep 2015

Hi all,

As part of the effort to improve migration convergence [1], we are proposing a semaphore for incoming migrations [2] (similar to the existing one for outgoing migrations).
Its purpose is to protect the destination host from migration storms, where too many migrations arrive at it from different sources at once.
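
To make the idea concrete, here is a minimal sketch of a destination-side limit. It is not the code from [2]; the class name, method names and the limit of 2 are assumptions chosen only for illustration.

```python
import threading


class IncomingMigrationLimiter:
    """Hypothetical destination-side guard (illustrative, not VDSM code)."""

    def __init__(self, max_incoming):
        # BoundedSemaphore raises ValueError if released more often than acquired.
        self._sem = threading.BoundedSemaphore(max_incoming)

    def try_accept(self):
        # Non-blocking: True if a slot was free, False if the host is saturated.
        return self._sem.acquire(blocking=False)

    def finished(self):
        # Call when an incoming migration ends, successfully or not.
        self._sem.release()


limiter = IncomingMigrationLimiter(max_incoming=2)
results = [limiter.try_accept() for _ in range(3)]  # third request is refused
limiter.finished()                                  # one migration completes
after_release = limiter.try_accept()                # a slot is free again
```

The three options below differ only in what the source host does when `try_accept()` would return False.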

There are basically three ways to do it, each with pros and cons:

1: when the destination host refuses the migration, the source host tries again later (since no migration runs forever, the migration will eventually succeed in starting)
 (+) pros:
   (+) if the engine wants to migrate to a specific host (and only to that host, because the user picked it), then it only sends the command and the migration will happen (now or later)
   (+) will not interfere with engine re-runs since the migration will fail only when there is a real issue
   (+) will be consistent with the current outgoing semaphore (since the outgoing semaphore also waits until it has capacity and then starts the migration)
   (+) VDSM is more autonomous: once the engine has sent the command, VDSM will carry it out even if the engine disappears at that moment
 (-) cons:
   (-) retrying on the VDSM side is not common
   (-) if the user does not pick a specific destination and just wants to migrate the machine off the source, waiting for the destination to have capacity can be wasteful, since failing the migration and picking a different host could lead to better results
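
A rough sketch of the source-side behaviour in option 1, assuming a hypothetical `try_start` callable that returns False when the destination refuses. The function name, the delay and the optional attempt cap are illustrative assumptions, not what the patch does.

```python
import time


def migrate_with_retry(try_start, retry_delay=1.0, max_attempts=None,
                       sleep=time.sleep):
    """Keep asking the destination until it accepts (hypothetical helper).

    Returns the number of attempts that were needed.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if try_start():
            return attempt
        # Destination is at its incoming limit: wait and ask again.
        sleep(retry_delay)
    raise RuntimeError("destination never had capacity")


# Simulated destination that refuses twice and then accepts.
answers = iter([False, False, True])
attempts = migrate_with_retry(lambda: next(answers), sleep=lambda s: None)
```

From the engine's point of view nothing fails here; the migration simply starts later.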

2: when the destination host refuses the migration, the source host returns "migration failed" to the engine, and the engine has to handle it somehow
 (+) pros:
   (+) simpler VDSM (try to migrate; if the destination does not have capacity, fail)
   (+) lets the engine pick a different destination host
 (-) cons:
   (-) not consistent with the outgoing migration semaphore (since if more VMs are waiting on the outgoing migration semaphore, the migration does not fail but waits)
   (-) engine would have to handle different kinds of migration failure reasons
   (-) VDSM is not autonomous: if the engine disappears, the migration will not be started
   (-) I'm not sure about the consequences for the scheduler here, but I think it would have to be reworked to accommodate the different kinds of re-run. Any ideas from someone more familiar with this? Roy, Martin?
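
For comparison, option 2 could look roughly like this. The exception name, `migrate_once` and `engine_migrate` are hypothetical; the real engine re-run logic is more involved, which is exactly the concern raised above.

```python
class DestinationBusy(Exception):
    """Hypothetical failure reason: destination is at its incoming limit."""


def migrate_once(vm, host, has_capacity):
    # VDSM side: a single attempt, fail immediately if there is no capacity.
    if not has_capacity(host):
        raise DestinationBusy(host)
    return host


def engine_migrate(vm, candidates, has_capacity):
    # Engine side: must tell "busy" apart from real failures and re-run.
    for host in candidates:
        try:
            return migrate_once(vm, host, has_capacity)
        except DestinationBusy:
            continue  # pick the next candidate host
    return None  # no host had capacity


capacity = {"host-a": False, "host-b": True}
chosen = engine_migrate("vm1", ["host-a", "host-b"], capacity.get)
```

Here the retry decision moves entirely to the engine, so nothing happens if the engine goes away after the first refusal.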

3: (hybrid) if the user picks a specific host, VDSM uses the first option; if the user does not pick a specific host, VDSM uses the second option
 (+) pros:
   (+) works well in both cases: when the intention is to migrate the machine TO A SPECIFIC host and when the intention is just to migrate the VM out to ANY host
 (-) cons:
   (-) more complicated VDSM
   (-) will still interfere with engine scheduling
   (-) not consistent with VDSM's current outgoing semaphore
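
The hybrid could be sketched as a single flag passed along with the migrate command. The `wait_for_capacity` flag and the other names are assumptions made for illustration; no such parameter exists in the proposed patch.

```python
import time


class DestinationBusy(Exception):
    """Hypothetical failure reason reported back to the engine."""


def migrate(try_start, wait_for_capacity, retry_delay=1.0, sleep=time.sleep):
    # wait_for_capacity=True  -> behave like option 1 (user picked the host)
    # wait_for_capacity=False -> behave like option 2 (any host will do)
    while True:
        if try_start():
            return "started"
        if not wait_for_capacity:
            raise DestinationBusy()
        sleep(retry_delay)


# User picked the host: refused once, then accepted.
answers = iter([False, True])
pinned_result = migrate(lambda: next(answers), True, sleep=lambda s: None)

# No specific host: the first refusal is reported to the caller.
try:
    migrate(lambda: False, False)
    unpinned_failed = False
except DestinationBusy:
    unpinned_failed = True
```

The extra branch is small in this toy form, but it shows why VDSM would end up with two code paths to maintain.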

The currently proposed patch [2] is the first option.

Please note that we would also like to make the scheduler aware of the maximum incoming migrations limit, thus preventing the storms in the first place, but that is a separate topic (no patches yet).

The question here is how VDSM should protect itself when a storm does happen.

Any ideas?

Thank you,
Tomas

[1]: www.ovirt.org/Features/Migration_Enhancements
[2]: https://gerrit.ovirt.org/#/c/45954/

Tomas Jelinek