On Sat, Mar 28, 2015 at 10:20:25AM -0400, John Taylor wrote:
> Daniel Helgenberger <daniel.helgenberger(a)m-box.de> writes:
>
> > Hello Everyone,
> >
> > I did create the original BZ on this. In the mean time, lab system I
> > used is dismantled and the production system is yet to deploy.
> >
> > As I wrote in BZ1147148 [1], I experienced two different issues. One,
> > one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem
> > unrelated.
> >
> > The larger leak was indeed related to SSL in some way; not necessarily
> > M2Crypto. However, after disabling SSL this was gone leaving the smaller
> > leak.
> >
> > [1]
https://bugzilla.redhat.com/show_bug.cgi?id=1147148
>
>
> I think there are, at least for the purpose of this discussion, 3 leaks:
> 1. the M2Crypto leak
> 2. a slower leak
> 3. a large leak that's not M2Crypto related that's part of sampling
>
> My efforts have been around finding the source of my larger leak, which
> I think is #3. I had disabled ssl so I knew that M2Crypto
> isn't/shouldn't be the problem as in bz1147148, and ssl is beside the
> point as it happens with a deactived host. It's part of sampling which
> always runs.
>
> What I've found is, after trying to get the smallest reproducer, that
> it's not the netlink.iter_links that I commented on in [1] that is the
> problem. But in the _get_intefaces_and_samples loop is the call to
> create an InterfaceSample and that has getLinkSpeed() which, for vlans,
> ends up calling ipwrapper.getLink, and that to
> netlink.get_link(name)
>
> netlink.get_link(name) *is* the source of my big leak. This is vdsm
> 4.16.10, so it is [2] and it's been changed in master for the removal of
> support for libnl v1 so it might not be a problem anymore.
>
> def get_link(name):
> """Returns the information dictionary of the name specified
link."""
> with _pool.socket() as sock:
> with _nl_link_cache(sock) as cache:
> link = _rtnl_link_get_by_name(cache, name)
> if not link:
> raise IOError(errno.ENODEV, '%s is not present in the system'
%
> name)
> return _link_info(cache, link)
>
>
> The libnl documentation note at [3] says that for the rtnl_link_get_by_name function
> "Attention
> The reference counter of the returned link object will be incremented. Use
rtnl_link_put() to release the reference."
>
> So I took that hint, and made a change that does the rtnl_link_put() in
> get_link(name) and it looks like it works for me.
>
> diff oldnetlink.py netlink.py
> 67d66
> < return _link_info(cache, link)
> 68a68,70
> > li = _link_info(cache, link)
> > _rtnl_link_put(link)
> > return li
> 333a336,337
> >
> > _rtnl_link_put = _none_proto(('rtnl_link_put', LIBNL_ROUTE))
>
> Hope that helps. And if someone else could confirm that would be great.
>
> [1]
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
> [2]
>
https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/netlink.py;h...
> [3]
http://www.infradead.org/~tgr/libnl/doc/api/group__link.html#ga1d583e4f0b...
Thanks, John, for a great detective work.
I'm afraid that with even on the master branch we keep calling
rtnl_link_get_link() and rtnl_link_get_by_name() without clearing the
reference count, so a fix is due there, too.
Would you consider posting a fully-fledged fix to gerrit? I still need
to understand what is the use of that refcount, so that we do not
release it too early.
Regards,
Dan.
Dan,
I'm happy to [1], although I've probably gotten something wrong with how
it's supposed to be done :) It's for the version I'm using so it's for
branch ovirt-3.5.
[1]