[ovirt-users] Automatically migrate VM between hosts in the same cluster

Thu Sep 17 19:30:41 UTC 2015

I don't really think this is practical:

> - If the PSU failed, your UPS could alert you. If you have one...

If you have only one PSU in a host, a UPS is not going to stop you 
losing all the VMs on that host. OK, if you had N+1 PSUs, you may be 
able to monitor for this (IPMI/LOM/DRAC etc.) and use the API to put a 
host into maintenance. Also, a lot of people rely on low-cost white-box 
servers and decide that it's OK if a single PSU in a host dies because, 
well, we have HA to restart the VMs on other hosts. If they have N+1 
PSUs in the hosts, do they really have to migrate everything off? Swings 
and roundabouts really.
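
For what it's worth, the "monitor the PSUs and put the host into 
maintenance via the API" idea could look roughly like the sketch below. 
It is only a sketch under assumptions: the PSU states are taken as 
already collected (e.g. from IPMI), the engine URL and host ID are 
made-up placeholders, and the hosts/<id>/deactivate action with an 
empty <action/> body is my recollection of the oVirt REST API, so check 
it against the API docs for your engine version. Nothing is sent here; 
the request is only built.

```python
import urllib.request


def should_evacuate(psu_ok, required=1):
    """Decide whether to move a host to maintenance.

    psu_ok:   list of booleans, one per PSU (True = healthy).
    required: number of PSUs needed to carry the host's load (assumed).
    Evacuate only when a PSU has failed AND no spare is left.
    """
    healthy = sum(1 for ok in psu_ok if ok)
    any_failed = healthy < len(psu_ok)
    return any_failed and healthy <= required


def build_deactivate_request(api_base, host_id, auth_header):
    """Build (but do not send) the request that asks the engine to put
    a host into maintenance. The path is an assumption, see above."""
    url = f"{api_base.rstrip('/')}/hosts/{host_id}/deactivate"
    return urllib.request.Request(
        url,
        data=b"<action/>",
        method="POST",
        headers={
            "Content-Type": "application/xml",
            "Authorization": auth_header,
        },
    )


if __name__ == "__main__":
    # Hypothetical host with two PSUs, one of which has failed.
    if should_evacuate([True, False], required=1):
        req = build_deactivate_request(
            "https://engine.example.com/ovirt-engine/api",  # placeholder
            "HOST-ID-GOES-HERE",                            # placeholder
            "Basic PLACEHOLDER",
        )
        print(req.get_method(), req.full_url)
```

Note that with three PSUs and required=1 a single failure does not 
trigger anything, which is exactly the "do they really have to migrate 
everything off?" question: the policy is yours to pick.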

I'm also not sure I've seen any practical DC setups where a UPS can 
monitor the load for every single attached physical machine and figure 
out that one of the redundant PSUs in it has failed. I'd love to know 
if there are any, as that would be really cool.

> - If the machine is going down in an ordinary flow, surely it can be 
> done.

Isn't that what "Maintenance mode" is for?

>
>     Even if it was a network failure and the host was still up, how
>     would you live migrate a VM from a host you can't even talk to?
>
>
> It could be suspended to disk (local) - if the disk is available.
> Then the decision if it is to be resumed from local disk or not (as it 
> might be HA'ed and is running elsewhere) need to be taken later, of 
> course.

Yes, but that's not even remotely possible with oVirt right now. I was 
trying to be practical, as the OP has only just started using oVirt and I 
think it might be a bit much to ask him to start coding up what he'd like.

>
>
>     The only way you could do it was if you somehow magically knew far
>     enough in advance that the host was about to fail (!) and that
>     gave enough time to migrate the machines off. But how would you
>     ever know that "machine quux.bar.net is
>     going to fail in 7 minutes"?
>
>
> I completely agree there are situations in which you can't foresee the 
> failure.
> But in many, you can. In those cases, it makes sense for the host to 
> self-initiate 'move to maintenance' mode. The policy of what to do 
> when 'self-moving-to-maintenance-mode' could be pre-fetched from the 
> engine.
> Y.

Hmm, I would love that to be true. But I've seen so many so-called 
"corner cases" that I now think the failure area in a datacenter is a 
fractal with infinite corners. Yes, you could monitor SMART on local 
drives, pick up uncorrected ECC errors, use "sensors" to check for 
sagging voltages or high temps, but I don't think you can ever hope to 
catch everything, and you could end up triggering a migration "storm". 
I've had more than enough "Enterprise Spec" switches suddenly going 
nuts and spamming corrupt MACs all over the LAN to know you can't ever 
account for everything.
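
If someone did want to act on those health signals, the migration 
"storm" risk is the part I'd guard against first. Below is a minimal 
sketch of one way to do that, not anything oVirt provides: a host is 
only flagged after several consecutive bad readings, and only a 
limited number of hosts may be evacuated at once. The class name, the 
thresholds and the idea of a boolean "healthy" reading per poll are all 
assumptions made for illustration.

```python
from collections import defaultdict


class EvacuationGate:
    """Debounce per-host health readings and cap concurrent evacuations."""

    def __init__(self, consecutive_needed=3, max_concurrent=1):
        self.consecutive_needed = consecutive_needed
        self.max_concurrent = max_concurrent
        self._bad_streak = defaultdict(int)  # host -> consecutive bad polls
        self._evacuating = set()             # hosts currently being drained

    def report(self, host, healthy):
        """Record one reading; return True if the host should be
        evacuated now, False otherwise."""
        if healthy:
            # A single good reading resets the streak (one blip is ignored).
            self._bad_streak[host] = 0
            return False
        self._bad_streak[host] += 1
        if host in self._evacuating:
            return False  # already being handled
        if self._bad_streak[host] < self.consecutive_needed:
            return False  # not enough evidence yet
        if len(self._evacuating) >= self.max_concurrent:
            return False  # cap reached: this is what prevents a storm
        self._evacuating.add(host)
        return True

    def done(self, host):
        """Call when an evacuation has finished (or was abandoned)."""
        self._evacuating.discard(host)
        self._bad_streak[host] = 0


if __name__ == "__main__":
    gate = EvacuationGate(consecutive_needed=2, max_concurrent=1)
    for reading in (False, False):
        decision = gate.report("host-a", reading)
    print("evacuate host-a:", decision)
```

Even with this in place, a misbehaving switch can make every host look 
sick at once, so the cap only limits the damage; it doesn't tell you 
which signal to trust.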

I think it's better to adopt the model of redundancy in software and 
services, so no-one even notices if a VM host goes away, there's always 
something else to take up the slack. Just like the origins of the 
Internet - the network should be dumb and the applications should cope 
with it! Any infrastructure that can't cope with the loss of a few VMs 
for a few minutes probably needs a refresh.

Cheers

Alex
