[ovirt-users] VDSM memory consumption

Darrell Budic budic at onholyground.com
Thu Mar 26 16:12:51 UTC 2015


> On Mar 26, 2015, at 6:42 AM, Dan Kenigsberg <danken at redhat.com> wrote:
> 
> On Wed, Mar 25, 2015 at 01:29:25PM -0500, Darrell Budic wrote:
>> 
>>> On Mar 25, 2015, at 5:34 AM, Dan Kenigsberg <danken at redhat.com> wrote:
>>> 
>>> On Tue, Mar 24, 2015 at 02:01:40PM -0500, Darrell Budic wrote:
>>>> 
>>>>> On Mar 24, 2015, at 4:33 AM, Dan Kenigsberg <danken at redhat.com> wrote:
>>>>> 
>>>>> On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
>>>>>> Chris Adams <cma at cmadams.net> writes:
>>>>>> 
>>>>>>> Once upon a time, Sven Kieske <s.kieske at mittwald.de> said:
>>>>>>>> On 13/03/15 12:29, Kapetanakis Giannis wrote:
>>>>>>>>> We also face this problem since 3.5 in two different installations...
>>>>>>>>> Hope it's fixed soon
>>>>>>>> 
>>>>>>>> Nothing will get fixed if no one bothers to
>>>>>>>> open BZs and send relevant log files to help
>>>>>>>> track down the problems.
>>>>>>> 
>>>>>>> There's already an open BZ:
>>>>>>> 
>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1158108
>>>>>>> 
>>>>>>> I'm not sure if that is exactly the same problem I'm seeing or not; my
>>>>>>> vdsm process seems to be growing faster (RSS grew 952K in a 5 minute
>>>>>>> period just now; VSZ didn't change).
>>>>>> 
>>>>>> For those following this I've added a comment on the bz [1], although in
>>>>>> my case the memory leak is, like Chris Adams, a lot more than the 300KiB/h
>>>>>> in the original bug report by Daniel Helgenberger.
>>>>>> 
>>>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108
>>>>> 
>>>>> That's interesting (and worrying).
>>>>> Could you check your suggestion by editing sampling.py so that
>>>>> _get_interfaces_and_samples() returns the empty dict immediately?
>>>>> Would this make the leak disappear?
>>>> 
>>>> Looks like you’ve got something there. Just a quick test for now, watching RSS in top. I’ll let it run this way for a while and see what it looks like in a few hours.
>>>> 
>>>> System 1: 13 VMs w/ 24 interfaces between them
>>>> 
>>>> 11:47 killed a vdsm @ 9.116G RSS (after maybe a week and a half running)
>>>> 
>>>> 11:47: 97xxx
>>>> 11:57 135544 and climbing
>>>> 12:00 136400
>>>> 
>>>> restarted with sampling.py modified to just return an empty dict:
>>>> 
>>>> def _get_interfaces_and_samples():
>>>>   links_and_samples = {}
>>>>   return links_and_samples
>>> 
>>> Thanks for the input. Just to be a little more certain that the culprit
>>> is _get_interfaces_and_samples() per se, would you please decorate it
>>> with memoized, and add a log line in the end
>>> 
>>> @utils.memoized   # add this line
>>> def _get_interfaces_and_samples():
>>>   ...
>>>   logging.debug('LINKS %s', links_and_samples)  ## and this line
>>>   return links_and_samples
>>> 
>>> I'd like to see what happens when the function is run only once, and
>>> returns a non-empty reasonable dictionary of links and samples.
>> 
>> Looks similar, I modified my second server for this test:
> 
> Thanks again. Would you be so kind as to search further?
> Does the following script leak anything on your host, when placed in your
> /usr/share/vdsm:
> 
>    #!/usr/bin/python
> 
>    from time import sleep
>    from virt.sampling import _get_interfaces_and_samples
> 
>    while True:
>        _get_interfaces_and_samples()
>        sleep(0.2)
> 
> Something that can be a bit harder would be to:
> # service vdsmd stop
> # su - vdsm -s /bin/bash
> # cd /usr/share/vdsm
> # valgrind --leak-check=full --log-file=/tmp/your.log vdsm
> 
> as suggested by Thomas on
> https://bugzilla.redhat.com/show_bug.cgi?id=1158108#c6

Yes, this script leaks quickly. It started out at an RSS of roughly 21000, was already at 26744 a minute in, and about 5 minutes later it’s at 39384 and climbing.
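
For anyone who wants to log the growth rather than watch top by hand, a minimal sketch along these lines might help. It assumes a Linux host (it reads /proc/<pid>/status) and should run on either Python 2 or 3. The helper names parse_vmrss_kib, current_rss_kib and watch_rss are hypothetical and not part of vdsm.

```python
import os
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     21000 kB".
            return int(line.split()[1])
    raise ValueError("no VmRSS line found")


def current_rss_kib(pid=None):
    """Read the resident set size of a process (default: this one). Linux only."""
    if pid is None:
        pid = os.getpid()
    with open("/proc/%d/status" % pid) as status_file:
        return parse_vmrss_kib(status_file.read())


def watch_rss(func, iterations, interval=0.2, report_every=50,
              read_rss=current_rss_kib):
    """Call func repeatedly and return a list of (iteration, rss_kib) samples.

    One sample is taken before the first call and then one every
    report_every calls. read_rss is injectable so the loop can be
    exercised without /proc.
    """
    samples = [(0, read_rss())]
    for i in range(1, iterations + 1):
        func()
        time.sleep(interval)
        if i % report_every == 0:
            samples.append((i, read_rss()))
    return samples
```

On a vdsm host, func would presumably be virt.sampling._get_interfaces_and_samples, imported from within /usr/share/vdsm as in Dan's script. A steadily rising second column in the returned samples would match the growth described above.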

I’ve been abusing a production server for these simple tests, but didn’t want to run valgrind against it right this minute. I did run it against the test.py script above, though, and got this (fpaste.org didn’t like it, too long maybe?): http://tower.onholyground.com/valgrind-test.log

To comment on some other posts in this thread: I also see leaks on my test system, which is running CentOS 6.6, but it only has 3 VMs across 2 servers and 3 configured networks, and it leaks MUCH more slowly. I suspect people don’t notice this on test systems because they don’t have many VMs/interfaces running and don’t leave them up for weeks at a time. That’s why I was running these tests on my production box: to have more VMs up.






