[ovirt-users] VDSM memory consumption

Darrell Budic budic at onholyground.com
Wed Apr 1 01:43:17 UTC 2015


Finally got a chance to implement this, so testing this on my centos7 hosts, and it looks good. I’ll keep eye on it for a couple days, but after a couple of hours, there’s no evidence of any leakage.


> On Mar 30, 2015, at 4:14 PM, John Taylor <jtt77777 at yahoo.com> wrote:
> 
> Dan Kenigsberg <danken at redhat.com> writes:
> 
>> On Sat, Mar 28, 2015 at 10:20:25AM -0400, John Taylor wrote:
>>> Daniel Helgenberger <daniel.helgenberger at m-box.de> writes:
>>> 
>>>> Hello Everyone,
>>>> 
>>>> I did create the original BZ on this. In the mean time, lab system I
>>>> used is dismantled and the production system is yet to deploy.
>>>> 
>>>> As I wrote in BZ1147148 [1], I experienced two different issues. One,
>>>> one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem
>>>> unrelated.
>>>> 
>>>> The larger leak was indeed related to SSL in some way; not necessarily
>>>> M2Crypto. However, after disabling SSL this was gone leaving the smaller
>>>> leak.
>>>> 
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1147148
>>> 
>>> 
>>> I think there are, at least for the purpose of this discussion, 3 leaks:
>>> 1. the M2Crypto leak
>>> 2. a slower leak 
>>> 3. a large leak that's not M2Crypto related that's part of sampling
>>> 
>>> My efforts have been around finding the source of my larger leak, which
>>> I think is #3.  I had disabled ssl so I knew that M2Crypto
>>> isn't/shouldn't be the problem as in bz1147148, and ssl is beside the
>>> point as it happens with a deactived host. It's part of sampling which
>>> always runs.
>>> 
>>> What I've found is, after trying to get the smallest reproducer, that
>>> it's not the netlink.iter_links that I commented on in [1] that is the
>>> problem. But in the _get_intefaces_and_samples loop is the call to
>>> create an InterfaceSample and that has getLinkSpeed() which, for vlans,
>>> ends up calling ipwrapper.getLink, and that to
>>> netlink.get_link(name)
>>> 
>>> netlink.get_link(name) *is* the source of my big leak. This is vdsm
>>> 4.16.10, so it is [2] and it's been changed in master for the removal of
>>> support for libnl v1 so it might not be a problem anymore. 
>>> 
>>> def get_link(name):
>>>    """Returns the information dictionary of the name specified link."""
>>>    with _pool.socket() as sock:
>>>        with _nl_link_cache(sock) as cache:
>>>            link = _rtnl_link_get_by_name(cache, name)
>>>            if not link:
>>>                raise IOError(errno.ENODEV, '%s is not present in the system' %
>>>                              name)
>>>            return _link_info(cache, link)
>>> 
>>> 
>>> The libnl documentation note at [3] says that for the rtnl_link_get_by_name function 
>>> "Attention
>>>    The reference counter of the returned link object will be incremented. Use rtnl_link_put() to release the reference."
>>> 
>>> So I took that hint, and made a change that does the rtnl_link_put() in
>>> get_link(name) and it looks like it works for me.
>>> 
>>> diff oldnetlink.py netlink.py
>>> 67d66
>>> <             return _link_info(cache, link)
>>> 68a68,70
>>>>            li = _link_info(cache, link)
>>>>            _rtnl_link_put(link)
>>>>            return li
>>> 333a336,337
>>>> 
>>>> _rtnl_link_put  = _none_proto(('rtnl_link_put', LIBNL_ROUTE))
>>> 
>>> Hope that helps. And if someone else could confirm that would be great.
>>> 
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108
>>> [2]
>>> https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/netlink.py;h=afae5cecb5ce701d00fb8f019ec92b3331a39036;hb=5608cfdf43db9186dabac4b2a779f9557e798968
>>> [3] http://www.infradead.org/~tgr/libnl/doc/api/group__link.html#ga1d583e4f0b43c89d854e5e681a529fad
>> 
>> Thanks, John, for a great detective work.
>> 
>> I'm afraid that with even on the master branch we keep calling
>> rtnl_link_get_link() and rtnl_link_get_by_name() without clearing the
>> reference count, so a fix is due there, too.
>> 
>> Would you consider posting a fully-fledged fix to gerrit? I still need
>> to understand what is the use of that refcount, so that we do not
>> release it too early.
>> 
>> Regards,
>> Dan.
> 
> Dan,
> 
> I'm happy to [1], although I've probably gotten something wrong with how
> it's supposed to be done :) It's for the version I'm using so it's for
> branch ovirt-3.5.
> 
> [1] https://gerrit.ovirt.org/#/c/39372/
> 
> Thanks,
> -John
> _______________________________________________
> Users mailing list
> Users at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users




More information about the Users mailing list