[Engine-devel] Mom Balloon policy issue

Wed Oct 10 09:23:12 UTC 2012

Regarding qemu-kvm memory allocation behaviour, this is the clearest explanation I found:

"Unless using explicit hugepages for the guest, KVM will
*not* initially allocate all RAM from the host OS. Each
page of RAM is only allocated when the guest OS touches
that page. So if you set memory=8G and currentMemory=4G,
although you might expect resident usage of KVM on the
host to be 4 G, it can in fact be pretty much any value
between 0 and 4 GB depending on whether the guest OS
has touched the memory pages, or as much as 8 GB if the
guest OS has no balloon driver and touches all memory
immediately."[1]

And as far as I've seen, it behaves the same way after the guest is ballooned and then deflated.

Conclusion: the point at which the guest is allocated resident memory is not deterministic; it depends on usage and on the guest OS.
For example, Windows 7 uses all available memory as cache, while Fedora does not.
So an inactive guest running Windows 7 will consume all possible resident memory right away, while an inactive guest running Fedora might never grow in memory size.
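
To watch this behaviour on a host, the resident size of a qemu process can be read from /proc. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a Linux host where /proc/<pid>/status carries a "VmRSS" line in kB.

```python
import re

def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (kB) found in /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def process_rss_kib(pid):
    """Read the resident set size of a running process, e.g. a qemu-kvm PID."""
    with open("/proc/%d/status" % pid) as status_file:
        return parse_vm_rss_kib(status_file.read())

# Illustrative input only; the numbers are not measurements.
sample = "Name:\tqemu-kvm\nVmSize:\t 8912345 kB\nVmRSS:\t 2141580 kB\n"
rss = parse_vm_rss_kib(sample)
```

Sampling this value periodically for an idle guest should show whether its resident size grows on its own or stays flat.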

[1] http://www.redhat.com/archives/virt-tools-list/2011-January/msg00001.html

------------------
Noam Slomianko
Red Hat Enterprise Virtualization, SLA team

----- Original Message -----
From: "Adam Litke" <agl at us.ibm.com>
To: "Noam Slomianko" <nslomian at redhat.com>
Cc: "Doron Fediuck" <dfediuck at redhat.com>, vdsm-devel at lists.fedorahosted.org, engine-devel at ovirt.org
Sent: Tuesday, October 9, 2012 8:06:02 PM
Subject: Re: Mom Balloon policy issue

Thanks for writing this.  Some thoughts inline, below.  Also, cc'ing some lists
in case other folks want to participate in the discussion.

On Tue, Oct 09, 2012 at 01:12:30PM -0400, Noam Slomianko wrote:
> Greetings,
> 
> I've fiddled around with ballooning and wanted to raise a question for debate.
> 
> Currently, as long as the host is under memory pressure, MOM will try to reclaim memory from all guests with more free memory than a given threshold.
> 
> Main issue: Guest allocated memory is not the same as the resident (physical) memory used by qemu.
> This means that when memory is reclaimed (the balloon is inflated) we might not get back as much memory as planned (or none at all).
> 
>  *Example 1: no memory is reclaimed:
>     name | allocated memory | used by the vm | resident memory used in the host by qemu
>     Vm1  |       4G         |       4G       |                4G
>     Vm2  |       4G         |       1G       |                1G
>  - MOM will inflate the balloon in vm2 (as vm1 has no free memory) and will gain no memory

One thing to keep in mind is that VMs having less host RSS than their memory
allocation is a temporary condition.  All VMs will eventually consume their full
allocation if allowed to run.  I'd be curious to know how long this process
takes in general.

We might be able to handle this case by refusing to inflate the balloon if:
    (VM free memory - planned balloon inflation) > host RSS
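
As a rough sketch of that check in Python (not actual MOM policy code): the function name and the 256 MiB planned inflation are assumptions chosen for illustration, and all values are in KiB.

```python
GIB = 1024 * 1024  # KiB per GiB

def should_refuse_inflation(guest_free_kib, planned_inflation_kib, host_rss_kib):
    """Rule as stated above: refuse if (free - planned inflation) > host RSS."""
    return (guest_free_kib - planned_inflation_kib) > host_rss_kib

planned = GIB // 4  # hypothetical 256 MiB inflation step

# Vm2 from the example: 4G allocated, 1G used (3G free), 1G resident on the host.
refuse_vm2 = should_refuse_inflation(3 * GIB, planned, 1 * GIB)

# Same guest usage, but with the full 4G already resident on the host.
refuse_resident = should_refuse_inflation(3 * GIB, planned, 4 * GIB)

print(refuse_vm2, refuse_resident)  # True False
```

With these numbers the balloon in Vm2 would be left alone, while a guest whose allocation is fully resident would still be eligible.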

>  *Example 2: memory is reclaimed partially:
>     name | allocated memory | used by the vm | resident memory used in the host by qemu
>     Vm1  |       4G         |       4G       |                4G
>     Vm2  |       4G         |       1G       |                1G
>     Vm3  |       4G         |       1G       |                4G
>  - MOM will inflate the balloon in vm2 and vm3 slowly gaining only from vm3

The above rule extension may help here too.

> This behaviour might cause us to:
>  * spend time reclaiming memory from many guests when we can reclaim only from a subgroup
>  * be under the impression that we have more potential memory to reclaim than we do
>  * bring inactive VMs dangerously low as they are constantly reclaimed (I've had guests crashing from kernel out of memory)
> 
> 
> To address this I suggest that we collect guest memory stats from libvirt as well, so we have the option to use them in our calculations.
> This can be achieved with the command "virsh dommemstat <domain>" which returns
>     actual 3915372 (allocated)
>     rss 2141580 (resident memory used by qemu)

I would suggest adding these two fields to the VmStats that are collected by
vdsm.  Then, to try it out, add the fields to the GuestMemory Collector.  (Note:
MOM does have a collector that gathers RSS for VMs.  It's called GuestQemuProc).
You can then extend the Balloon policy to add a snippet to check if the proposed
balloon adjustment should be carried out.  You could add the logic to the
change_big_enough function.
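
For experimenting outside vdsm, the two fields can be pulled from the virsh output quoted above. This is only a sketch: the helper names are hypothetical, the values are assumed to be in KiB, and it shells out to virsh rather than using vdsm's VmStats.

```python
import subprocess

def parse_dommemstat(output):
    """Turn 'name value' lines from `virsh dommemstat` into a dict of ints."""
    stats = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank or unexpected lines instead of failing on them.
        if len(fields) >= 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats

def dommemstat(domain):
    """Run virsh for one domain; needs virsh installed and a running domain."""
    out = subprocess.check_output(["virsh", "dommemstat", domain])
    return parse_dommemstat(out.decode())

# The two values quoted in this thread, used as sample input.
sample = "actual 3915372\nrss 2141580\n"
stats = parse_dommemstat(sample)
print(stats["actual"] - stats["rss"])  # allocation not yet resident on the host
```

The difference between "actual" and "rss" is the part of the allocation that inflating the balloon cannot return to the host, because the host never handed it out.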

> additional topic:
>  * should we include per-guest config (for example a hard minimum memory cap: this vm cannot run effectively with less than 1G memory)?

Yes.  This is probably something we want to do.  There is a whole topic around
VM tagging that we should consider.  In the future we will want to be able to do
many different things in policy based on a VM's tag.  For example, some VMs may
be completely exempt from ballooning.  Others may have a minimum limit.

I want to avoid passing in the raw guest configuration because MOM needs to work
with direct libvirt VMs and with oVirt/vdsm VMs.  Therefore, we want to think
carefully about the abstractions we use when presenting VM properties to MOM.
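
One possible shape for such an abstraction, purely as a sketch: the BalloonProps name and its fields are hypothetical and do not exist in MOM. The idea is that either libvirt or vdsm could fill in the same small record, and the policy would only ever see that record.

```python
from collections import namedtuple

GIB = 1024 * 1024  # KiB per GiB

# Hypothetical, source-neutral view of a VM's ballooning properties.
BalloonProps = namedtuple("BalloonProps", ["exempt", "min_kib"])

def clamp_balloon_target(current_kib, target_kib, props):
    """Apply per-VM properties to a proposed balloon target (values in KiB)."""
    if props.exempt:
        return current_kib  # exempt VMs keep their current size
    return max(target_kib, props.min_kib)  # never go below the VM's minimum

# A VM that must keep at least 1G, asked to shrink from 4G to 512M.
limited = clamp_balloon_target(4 * GIB, GIB // 2, BalloonProps(False, 1 * GIB))

# A VM that is completely exempt from ballooning.
exempt = clamp_balloon_target(4 * GIB, GIB // 2, BalloonProps(True, 0))
```

Here the first VM would be shrunk only to 1G and the second would not be touched at all.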

-- 
Adam Litke <agl at us.ibm.com>
IBM Linux Technology Center