On Mar 25, 2015, at 5:34 AM, Dan Kenigsberg <danken(a)redhat.com> wrote:
On Tue, Mar 24, 2015 at 02:01:40PM -0500, Darrell Budic wrote:
>
>> On Mar 24, 2015, at 4:33 AM, Dan Kenigsberg <danken(a)redhat.com> wrote:
>>
>> On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
>>> Chris Adams <cma(a)cmadams.net> writes:
>>>
>>>> Once upon a time, Sven Kieske <s.kieske(a)mittwald.de> said:
>>>>> On 13/03/15 12:29, Kapetanakis Giannis wrote:
>>>>>> We also face this problem since 3.5 in two different installations...
>>>>>> Hope it's fixed soon.
>>>>>
>>>>> Nothing will get fixed if no one bothers to
>>>>> open BZs and send the relevant log files to help
>>>>> track down the problems.
>>>>
>>>> There's already an open BZ:
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1158108
>>>>
>>>> I'm not sure if that is exactly the same problem I'm seeing or not; my
>>>> vdsm process seems to be growing faster (RSS grew 952K in a 5 minute
>>>> period just now; VSZ didn't change).
>>>
>>> For those following this, I've added a comment on the BZ [1], although in
>>> my case the memory leak is, like Chris Adams's, a lot more than the 300 KiB/h
>>> in the original bug report by Daniel Helgenberger.
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108
>>
>> That's interesting (and worrying).
>> Could you check your suggestion by editing sampling.py so that
>> _get_interfaces_and_samples() returns the empty dict immediately?
>> Would this make the leak disappear?
>
> Looks like you've got something there. Just a quick test for now, watching RSS in
> top. I'll let it go this way for a while and see what it looks like in a few hours.
>
> System 1: 13 VMs w/ 24 interfaces between them
>
> 11:47 killed a vdsm @ 9.116G RSS (after maybe a week and a half running)
>
> 11:47: 97xxx
> 11:57: 135544 and climbing
> 12:00: 136400
>
> restarted with sampling.py modified to just return an empty dict:
>
> def _get_interfaces_and_samples():
>     links_and_samples = {}
>     return links_and_samples
Thanks for the input. Just to be a little more certain that the culprit
is _get_interfaces_and_samples() per se, would you please decorate it
with memoized, and add a log line at the end:
@utils.memoized  # add this line
def _get_interfaces_and_samples():
    ...
    logging.debug('LINKS %s', links_and_samples)  # and this line
    return links_and_samples
I'd like to see what happens when the function is run only once, and
returns a non-empty reasonable dictionary of links and samples.
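Aside, in case the decorator is unfamiliar: memoized caches the function's
first return value so the body only runs once per process. A minimal sketch
of the idea (the real utils.memoized in vdsm may differ in details):

import functools

def memoized(f):
    # Cache the return value per argument tuple, so f's body runs at
    # most once for each distinct set of (hashable) arguments.
    cache = {}

    @functools.wraps(f)
    def wrapper(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    return wrapper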
Looks similar. I modified my second server for this test:
12:25, still growing from yesterday: 544512
restarted with the logging and memoized mods:
stabilized @ 12:32: 114284
1:23: 115300
Thread-12::DEBUG::2015-03-25
12:28:08,080::sampling::243::root::(_get_interfaces_and_samples) LINKS {'vnet18':
<virt.sampling.InterfaceSample instance at 0x7f38c03e85f0>, 'vnet19':
<virt.sampling.InterfaceSample instance at 0x7f38b42cbcf8>, 'bond0':
<virt.sampling.InterfaceSample instance at 0x7f38b429afc8>, 'vnet13':
<virt.sampling.InterfaceSample instance at 0x7f38b42c8680>, 'vnet16':
<virt.sampling.InterfaceSample instance at 0x7f38b42cb368>, 'private':
<virt.sampling.InterfaceSample instance at 0x7f38b42b8bd8>, 'bond0.100':
<virt.sampling.InterfaceSample instance at 0x7f38b42bdd88>, 'vnet0':
<virt.sampling.InterfaceSample instance at 0x7f38b42c1f80>, 'enp3s0':
<virt.sampling.InterfaceSample instance at 0x7f38b429cef0>, 'vnet2':
<virt.sampling.InterfaceSample instance at 0x7f38b42bbbd8>, 'vnet3':
<virt.sampling.InterfaceSample instance at 0x7f38b42c37e8>, 'vnet4':
<virt.sampling.InterfaceSample instance at 0x7f38b42c5518>, 'vnet5':
<virt.sampling.InterfaceSample instance at 0x7f38b42c6ab8>, 'vnet6':
<virt.sampling.InterfaceSample instance at 0x7f38b42c7248>, 'vnet7':
<virt.sampling.InterfaceSample instance at 0x7f38c03e7a28>, 'vnet8':
<virt.sampling.InterfaceSample instance at 0x7f38b42c7c20>, 'bond0.1100':
<virt.sampling.InterfaceSample instance at 0x7f38b42be710>, 'bond0.1103':
<virt.sampling.InterfaceSample instance at 0x7f38b429dc68>, 'ovirtmgmt':
<virt.sampling.InterfaceSample instance at 0x7f38b42b16c8>, 'lo':
<virt.sampling.InterfaceSample instance at 0x7f38b429a8c0>, 'vnet22':
<virt.sampling.InterfaceSample instance at 0x7f38c03e7128>, 'vnet21':
<virt.sampling.InterfaceSample instance at 0x7f38b42cd368>, 'vnet20':
<virt.sampling.InterfaceSample instance at 0x7f38b42cc7a0>, 'internet':
<virt.sampling.InterfaceSample instance at 0x7f38b42aa098>, 'bond0.1203':
<virt.sampling.InterfaceSample instance at 0x7f38b42aa8c0>, 'bond0.1223':
<virt.sampling.InterfaceSample instance at 0x7f38b42bb128>, 'XXXXXXXXXXX':
<virt.sampling.InterfaceSample instance at 0x7f38b42bee60>, 'XXXXXXX':
<virt.sampling.InterfaceSample instance at 0x7f38b42beef0>, ';vdsmdummy;':
<virt.sampling.InterfaceSample instance at 0x7f38b42bdc20>, 'vnet14':
<virt.sampling.InterfaceSample instance at 0x7f38b42ca050>, 'mgmt':
<virt.sampling.InterfaceSample instance at 0x7f38b42be248>, 'vnet15':
<virt.sampling.InterfaceSample instance at 0x7f38b42cab00>, 'enp2s0':
<virt.sampling.InterfaceSample instance at 0x7f38b429c200>, 'bond0.1110':
<virt.sampling.InterfaceSample instance at 0x7f38b42bed40>, 'vnet1':
<virt.sampling.InterfaceSample instance at 0x7f38b42c27e8>, 'bond0.1233':
<virt.sampling.InterfaceSample instance at 0x7f38b42bedd0>, 'bond0.1213':
<virt.sampling.InterfaceSample instance at 0x7f38b42b2128>}
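For anyone skimming that wall of output: it's just one InterfaceSample per
host-side link (lo, the bonds and VLANs, each vnet, the bridges). Roughly
this shape; purely illustrative, the link list below is made up and the real
class snapshots actual kernel counters:

import time

class InterfaceSample(object):
    # Illustrative stand-in: a snapshot of one link's counters at a
    # point in time (the real class reads per-link kernel statistics).
    def __init__(self, name):
        self.name = name
        self.sampleTime = time.time()
        self.rx = 0  # placeholder counters
        self.tx = 0

def _get_interfaces_and_samples():
    # Hypothetical link list; the real function enumerates every host
    # link (lo, bonds, VLANs, vnets, ovirtmgmt, ...) from the kernel.
    links = ['lo', 'bond0', 'bond0.100', 'ovirtmgmt', 'vnet0']
    return dict((name, InterfaceSample(name)) for name in links)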
Didn't see a significant CPU use difference on this one, so I'm thinking it
was all ksmd in yesterday's tests.
Yesterday's test is still going, and RSS is still hovering around 135016 or so.
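In case anyone else wants to track this without babysitting top: a small
sketch that logs vdsm's RSS from /proc once a minute (the pid below is a
placeholder; substitute vdsm's actual pid):

import time

def vmrss_kib(pid):
    # Parse the VmRSS line out of /proc/<pid>/status; value is in KiB.
    with open('/proc/%d/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

VDSM_PID = 1234  # placeholder -- use the real vdsm pid
while True:
    print('%s %s' % (time.strftime('%H:%M'), vmrss_kib(VDSM_PID)))
    time.sleep(60)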