[ovirt-devel] incoming migration semaphore (and possible SLA consequences)

Wed Sep 30 11:15:29 UTC 2015

----- Original Message -----
> From: "Martin Sivak" <msivak at redhat.com>
> To: "Tomas Jelinek" <tjelinek at redhat.com>
> Cc: "engine-devel at ovirt.org" <devel at ovirt.org>, "Martin Betak" <mbetak at redhat.com>, "Francesco Romani"
> <fromani at redhat.com>, "Martin Polednik" <mpolednik at redhat.com>, "Roy Golan" <rgolan at redhat.com>
> Sent: Wednesday, September 30, 2015 10:41:42 AM
> Subject: Re: incoming migration semaphore (and possible SLA consequences)
> 
> > Please note that we would also like to enrich the scheduler to be aware of
> > max incoming migrations
> > limit thus preventing the storms, but it is a separate topic (no patches
> > around yet).
> 
> This is easy to do, but please make sure you distinguish migration
> from a VM start.
> 
> It might also be better to only use the number of ongoing migrations
> for penalizing the host. In that case the storm would be smaller and
> an overloaded host would still be a possible migration destination
> when there is no better host to use (because of other constraints for
> example).
> 
> There is also the consideration of what happens when a Maintenance
> mode is triggered. The user might want the "storm" to happen to be
> able to save VMs from a compromised host before it fails completely.
> This might work fine when the scoring approach is used.
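
To make the scoring idea concrete, here is a rough sketch. The Host class, the penalty weight and the function names are all made up for illustration; this is not the engine scheduler API. It only assumes that a lower score is better and that ongoing incoming migrations (not VM starts) are counted:

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical view of a candidate destination host."""
    name: str
    incoming_migrations: int  # ongoing incoming migrations only, not VM starts
    base_score: int = 0       # assumed score from other policy units


# Assumed weight; a real value would need tuning.
PENALTY_PER_MIGRATION = 10


def score(host):
    """Lower is better: busy hosts are penalized, never filtered out."""
    return host.base_score + PENALTY_PER_MIGRATION * host.incoming_migrations


def pick_destination(hosts):
    """Return the best-scoring host, or None if there are no candidates."""
    return min(hosts, key=score) if hosts else None
```

Because a busy host is only penalized, it is still returned when it is the sole candidate, which is the behaviour wanted for the Maintenance case.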

The combination of retry on VDSM and a scoring approach on the engine sounds good to me: storms should not normally happen, and when they do,
they are intentional, so VDSM should perform the migration as requested by the engine.
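
For the VDSM side, a minimal sketch of the retry behaviour could look like the following. It is not the code from the proposed patch; the class, the function and the parameters are hypothetical, and the destination is modelled as a local object rather than a remote host:

```python
import threading
import time


class IncomingMigrationLimiter:
    """Hypothetical destination-side guard for incoming migrations."""

    def __init__(self, max_incoming):
        self._sem = threading.BoundedSemaphore(max_incoming)

    def try_accept(self):
        # Non-blocking: refuse immediately when there is no free slot.
        return self._sem.acquire(blocking=False)

    def finished(self):
        # Free the slot once an incoming migration has ended.
        self._sem.release()


def migrate_with_retry(limiter, start_migration, retry_delay=1.0,
                       max_attempts=None, sleep=time.sleep):
    """Source side: keep retrying until the destination accepts.

    Returns the number of attempts that were needed.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if limiter.try_accept():
            try:
                start_migration()
            finally:
                limiter.finished()
            return attempts
        # Destination refused; wait and try again later.
        sleep(retry_delay)
    raise RuntimeError("destination still has no capacity")
```

With max_attempts left at None the source waits for as long as it takes, which matches how the outgoing semaphore behaves today.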

> 
> Martin
> 
> 
> On Tue, Sep 29, 2015 at 3:35 PM, Tomas Jelinek <tjelinek at redhat.com> wrote:
> > Hi all,
> >
> > as part of the effort to enhance the migration convergence [1] we are
> > proposing a semaphore for incoming migrations [2] (similar to outgoing).
> > Its purpose is to protect the destination host from migration storms where
> > too many migrations are coming to it from different sources.
> >
> > There are basically 3 ways to do it (with pros/cons):
> >
> > 1: when the destination host refuses the migration, the source host tries
> > it again later (since no migration takes forever, the migration will
> > eventually manage to start)
> >  (+) pros:
> >    (+) if the engine wants to migrate to a specific host (and only to that
> >    specific host because the user picked it), then it only sends the command
> >    and it will happen (now or later)
> >    (+) will not interfere with engine re-runs since the migration will fail
> >    only when there is a real issue
> >    (+) will be consistent with the current outgoing semaphore (since the
> >    outgoing semaphore also waits until it has capacity and then starts the
> >    migration)
> >    (+) VDSM is more autonomous because after the engine sends the command,
> >    VDSM will do it even if the engine disappears at that moment
> >  (-) cons:
> >    (-) re-try on VDSM is not common
> >    (-) if the user does not pick a specific destination and he just wants
> >    to migrate the machine out of the source, waiting on the destination to
> >    have capacity can be wasteful since failing the migration and picking a
> >    different host could lead to better results
> >
> > 2: when the destination host refuses the migration, the source host returns
> > to engine "migration failed" and the engine will have to handle it somehow
> >  (+) pros:
> >    (+) simpler vdsm (try to migrate, if the destination does not have
> >    capacity, fail)
> >    (+) lets the engine pick a different destination host
> >  (-) cons:
> >    (-) not consistent with the outgoing migration semaphore (since if there
> >    are more VMs waiting for outgoing migrations semaphore, the migration
> >    does not fail but waits)
> >    (-) engine would have to handle different kinds of migration failure
> >    reasons
> >    (-) VDSM is not autonomous - if the engine disappears the migration will
> >    not be started
> >    (-) Here I'm not sure about the consequences to scheduler but I think it
> >    would have to be reworked to accommodate the different kinds of re-run.
> >    Any ideas from someone more familiar with this? Roy, Martin?
> >
> > 3: (hybrid) - if the user picks a specific host, VDSM will use the first
> > way; if the user does not pick a specific host, VDSM will use the second
> > option
> >  (+) pros:
> >    (+) works well with both cases when the intention is to migrate the
> >    machine TO A SPECIFIC host and when the intention is just to migrate
> >    the VM out to ANY host
> >  (-) cons:
> >    (-) more complicated VDSM
> >    (-) still will interfere with engine scheduling
> >    (-) not consistent with current VDSM's outgoing semaphore
> >
> > The currently proposed patch [2] is the first option.
> >
> > Please note that we would also like to enrich the scheduler to be aware of
> > max incoming migrations limit thus preventing the storms, but it is a
> > separate topic (no patches around yet).
> >
> > The question here is how VDSM should protect itself when a storm does
> > happen.
> >
> > Any ideas?
> >
> > Thank you,
> > Tomas
> >
> > [1]: www.ovirt.org/Features/Migration_Enhancements
> > [2]: https://gerrit.ovirt.org/#/c/45954/
> >
> >
>