Upgraded to oVirt 4.4.9, still have vdsmd memory leak

I have seen vdsmd leak memory for years (I've been running oVirt since version 3.5), but never been able to nail it down. I've upgraded a cluster to oVirt 4.4.9 (reloading the hosts with CentOS 8-stream), and I still see it happen. One host in the cluster, which has been up 8 days, has vdsmd with 4.3 GB resident memory. On a couple of other hosts, it's around half a gigabyte.

In the past, it seemed more likely to happen on the hosted engine hosts and/or the SPM host... but the host with the 4.3 GB vdsmd is not either of those.

I'm not sure what I do that would make my setup "special" compared to others; I loaded a pretty minimal install of CentOS 8-stream, with the only extra thing being I add the core parts of the Dell PowerEdge OpenManage tools (so I can get remote SNMP hardware monitoring).

When I run "pmap $(pidof -x vdsmd)", the bulk of the RAM use is a single anonymous block (which I'm guessing is just the python general memory allocator).

I thought maybe the switch to CentOS 8 and python 3 might clear something up, but obviously not. Any ideas?

-- 
Chris Adams <cma@cmadams.net>

On Wed, 10 Nov 2021 at 15:45, Chris Adams <cma@cmadams.net> wrote:
[...]
I guess we still have the reproducibility issue (https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/KO5SEPAZMLBWSBS...). But maybe in the meantime there's a new way to track things down. +Marcin Sobczyk <msobczyk@redhat.com>?
-- 
Sandro Bonazzola
MANAGER, SOFTWARE ENGINEERING, EMEA R&D RHV
Red Hat EMEA <https://www.redhat.com/>
sbonazzo@redhat.com
Red Hat respects your work life balance. Therefore there is no need to answer this email out of your office hours.

On Fri, 12 Nov 2021 at 09:47, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
[...]
Perhaps https://docs.python.org/3.6/library/tracemalloc.html ?

On Fri, 12 Nov 2021 at 09:50, Sandro Bonazzola <sbonazzo@redhat.com> wrote:
[...]
+David Malcolm <dmalcolm@redhat.com>, I saw your slides on Python memory-leak debugging; maybe you can give some suggestions here.

On Fri, 2021-11-12 at 09:54 +0100, Sandro Bonazzola wrote:
[...]
I haven't worked on Python itself in > 8 years, so my knowledge is out-of-date here.

Adding in Victor Stinner, who has worked on the CPython memory allocators more recently, and, in particular, implemented the tracemalloc library linked to above.

Dave

Hi,

I wrote the tracemalloc module, which is easy to use on Python 3.4 and newer. If you take tracemalloc snapshots while the memory usage is growing and comparing the snapshots doesn't show anything obvious, you can maybe suspect memory fragmentation. But you're talking about 4 GB of memory usage; I don't think that memory fragmentation can explain it.

Do you need my help to use tracemalloc? There is a quick tutorial in the official documentation: https://docs.python.org/dev/library/tracemalloc.html#compute-differences

Victor
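For readers who want to try this, here is a minimal sketch of the snapshot-diff workflow described above. The 60-second interval and the top-10 limit are arbitrary illustrative choices, not values from this thread:

---
# Minimal tracemalloc snapshot-diff loop (illustrative sketch).
import time
import tracemalloc

tracemalloc.start(25)              # keep up to 25 frames per traceback
old = tracemalloc.take_snapshot()

while True:
    time.sleep(60)                 # arbitrary sampling interval
    new = tracemalloc.take_snapshot()
    # Print the ten biggest per-line growths since the previous snapshot.
    for stat in new.compare_to(old, 'lineno')[:10]:
        print(stat)
    old = new
---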

Once upon a time, Victor Stinner <vstinner@redhat.com> said:
I wrote the tracemalloc module which is easy to use on Python 3.4 and newer. If you take tracemalloc snapshots while the memory usage is growing, and comparing snapshots don't show anything obvious, you can maybe suspect memory fragmentation. You're talking about 4 GB of memory usage, I don't think that memory fragmentation can explain it. Do you need my help to use tracemalloc?
Any tips on where I should add that to vdsm's code? It'll probably take me a little time to get this going - I'm only seeing this on my production cluster (of course), not my dev cluster. One difference is prod is iSCSI storage with multipath, while dev is Gluster (so that may be a clue to the source of the issue). -- Chris Adams <cma@cmadams.net>

Once upon a time, Victor Stinner <vstinner@redhat.com> said:
I wrote the tracemalloc module which is easy to use on Python 3.4 and newer. If you take tracemalloc snapshots while the memory usage is growing, and comparing snapshots don't show anything obvious, you can maybe suspect memory fragmentation. You're talking about 4 GB of memory usage, I don't think that memory fragmentation can explain it. Do you need my help to use tracemalloc?
My python is rudimentary at best (my programming has all been in other languages), but here's what I tried for starters: I added a USR2 signal handler to log the top users, but it doesn't seem to show anything growing like the RSS is actually doing. I made the following change:

--- /usr/lib/python3.6/site-packages/vdsm/vdsmd.py.dist~	2021-10-25 11:27:46.000000000 -0500
+++ /usr/lib/python3.6/site-packages/vdsm/vdsmd.py	2021-12-02 13:08:46.000000000 -0600
@@ -29,6 +29,7 @@
 import syslog
 import resource
 import tempfile
+import tracemalloc
 
 from logging import config as lconfig
 from vdsm import constants
@@ -82,6 +83,14 @@
             irs.spmStop(
                 irs.getConnectedStoragePoolsList()['poollist'][0])
 
+    def sigusr2Handler(signum, frame):
+        snapshot = tracemalloc.take_snapshot()
+        top_stats = snapshot.statistics('lineno')
+        lentry = 'Top memory users:\n'
+        for stat in top_stats[:10]:
+            lentry += '    ' + str(stat) + '\n'
+        log.info(lentry)
+
     def sigalrmHandler(signum, frame):
         # Used in panic.panic() when shuting down logging, must not log.
         raise RuntimeError("Alarm timeout")
@@ -89,6 +98,7 @@
     sigutils.register()
     signal.signal(signal.SIGTERM, sigtermHandler)
     signal.signal(signal.SIGUSR1, sigusr1Handler)
+    signal.signal(signal.SIGUSR2, sigusr2Handler)
     signal.signal(signal.SIGALRM, sigalrmHandler)
     zombiereaper.registerSignalHandler()

And also set a systemd override on vdsmd.service to add PYTHONTRACEMALLOC=25. That gets log entries like this:

2021-12-03 07:30:37,244-0600 INFO (MainThread) [vds] Top memory users:
    /usr/lib64/python3.6/site-packages/libvirt.py:442: size=34.0 MiB, count=630128, average=57 B
    <frozen importlib._bootstrap_external>:487: size=16.5 MiB, count=191152, average=90 B
    /usr/lib64/python3.6/json/decoder.py:355: size=14.6 MiB, count=142411, average=108 B
    /usr/lib/python3.6/site-packages/vdsm/host/stats.py:138: size=3678 KiB, count=22428, average=168 B
    <frozen importlib._bootstrap>:219: size=2027 KiB, count=17555, average=118 B
    /usr/lib/python3.6/site-packages/vdsm/api/vdsmapi.py:143: size=1724 KiB, count=23388, average=75 B
    /usr/lib/python3.6/site-packages/vdsm/virt/vmchannels.py:163: size=1502 KiB, count=24039, average=64 B
    /usr/lib64/python3.6/linecache.py:137: size=1383 KiB, count=13404, average=106 B
    /usr/lib/python3.6/site-packages/vdsm/utils.py:358: size=1305 KiB, count=8587, average=156 B
    /usr/lib64/python3.6/functools.py:67: size=1134 KiB, count=9624, average=121 B
 (vdsmd:92)

But at the time I generated that, the RSS was over 340MB. Interestingly, when I sent the signal, the RSS jumped to over 430MB (but maybe my change did that?).

-- 
Chris Adams <cma@cmadams.net>
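For anyone reproducing this, here is one way the patched handler above could be poked from another process. This is only a sketch: it assumes root privileges and that "pidof -x vdsmd" finds the daemon, as in the pmap command earlier in the thread.

---
# Send SIGUSR2 to the running vdsmd so the handler logs its tracemalloc stats.
import os
import signal
import subprocess

pid = int(subprocess.check_output(["pidof", "-x", "vdsmd"]).split()[0])
os.kill(pid, signal.SIGUSR2)
---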

Hi,

I like to compute differences between two snapshots (diff), rather than looking at a single snapshot: https://docs.python.org/dev/library/tracemalloc.html#compute-differences

You can modify your signal handler to write the snapshot into a file using pickle.dump(): https://github.com/vstinner/tracemallocqt#usage

Then use pickle.load() to reload snapshots from files. You can take multiple snapshots and compare snapshot 1 with snapshot 2, compare 1 with 3, etc. If there is a major memory increase between two snapshots, I expect a significant difference between those two snapshots.

You can configure tracemalloc to decide how many frames per traceback are stored. See -X tracemalloc=NFRAME, PYTHONTRACEMALLOC=NFRAME and the start() argument: https://docs.python.org/dev/library/tracemalloc.html#tracemalloc.start

tracemalloc only "sees" memory allocations made by Python. You can get the "current size of memory blocks traced by the tracemalloc module" with: https://docs.python.org/dev/library/tracemalloc.html#tracemalloc.get_traced_...

Note: tracemalloc itself consumes a lot of memory, which can explain why your application uses more RSS memory when tracemalloc is used.

If there is a huge difference between the RSS memory increase and what tracemalloc sees (e.g. RSS: +100 MB, tracemalloc: +1 MB), maybe you should use another tool working at the malloc/free level, like Valgrind.

Victor
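A small sketch of the pickle-based workflow described above; the dump path, the SIGUSR2 wiring, and the offline comparison at the end are illustrative assumptions, not part of vdsm:

---
# In the process being profiled: dump a pickled snapshot on SIGUSR2.
import pickle
import signal
import time
import tracemalloc

tracemalloc.start(25)              # or set PYTHONTRACEMALLOC=25 in the environment

def dump_snapshot(signum, frame):
    snapshot = tracemalloc.take_snapshot()
    traced, peak = tracemalloc.get_traced_memory()   # compare this with RSS growth
    path = '/tmp/tracemalloc-%d.pickle' % int(time.time())
    with open(path, 'wb') as f:
        pickle.dump(snapshot, f, pickle.HIGHEST_PROTOCOL)
    print('dumped %s, traced=%d bytes, peak=%d bytes' % (path, traced, peak))

signal.signal(signal.SIGUSR2, dump_snapshot)

# Later, offline: reload two dumps and diff them.
# with open('/tmp/tracemalloc-1.pickle', 'rb') as f: snap1 = pickle.load(f)
# with open('/tmp/tracemalloc-2.pickle', 'rb') as f: snap2 = pickle.load(f)
# for stat in snap2.compare_to(snap1, 'lineno')[:10]:
#     print(stat)
---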

Once upon a time, Victor Stinner <vstinner@redhat.com> said:
Then use pickle.load() to reload snapshots from files. You can take multiple snapshots and compare snapshot 1 with snapshot 2, compare 1 with 3, etc. If there is a major memory increase between two snapshots, I expect a significant difference between these two snapshots.
I tried this approach (tried Valgrind, but it caused vdsmd to run too slow, so oVirt saw timeouts and moved VMs away), and it does show a pretty big jump overnight. Below is the output of a comparison of tracemalloc dumps between yesterday afternoon and this morning.

The files in the traceback are from these RPMs:
python3-libvirt-7.6.0-1.el8s.x86_64
vdsm-common-4.40.90.4-1.el8.noarch

Looking at the code, I'm not sure what to make of it though.

Top differences
/usr/lib64/python3.6/site-packages/libvirt.py:442: size=295 MiB (+285 MiB), count=5511282 (+5312311), average=56 B
/usr/lib64/python3.6/json/decoder.py:355: size=73.9 MiB (+70.2 MiB), count=736108 (+697450), average=105 B
/usr/lib64/python3.6/logging/__init__.py:1630: size=44.2 MiB (+43.8 MiB), count=345704 (+342481), average=134 B
/usr/lib64/python3.6/site-packages/libvirt.py:5695: size=30.3 MiB (+30.0 MiB), count=190449 (+188665), average=167 B
/usr/lib/python3.6/site-packages/vdsm/host/stats.py:138: size=12.1 MiB (+11.4 MiB), count=75366 (+70991), average=168 B
/usr/lib/python3.6/site-packages/vdsm/utils.py:358: size=10.4 MiB (+9968 KiB), count=70204 (+65272), average=156 B
/usr/lib64/python3.6/site-packages/libvirt.py:537: size=7676 KiB (+7656 KiB), count=109119 (+108886), average=72 B
/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py:256: size=7813 KiB (+7505 KiB), count=125015 (+120083), average=64 B
/usr/lib64/python3.6/asyncore.py:173: size=6934 KiB (+6735 KiB), count=110941 (+107755), average=64 B
/usr/lib/python3.6/site-packages/vdsm/virt/vmchannels.py:163: size=5984 KiB (+5631 KiB), count=95744 (+90103), average=64 B

Top block
5511282 memory blocks: 302589.8 KiB
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 442
    ret = libvirtmod.virEventRunDefaultImpl()
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 69
    libvirt.virEventRunDefaultImpl()
  File "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260
    ret = func(*args, **kwargs)
  File "/usr/lib64/python3.6/threading.py", line 885
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.6/threading.py", line 937
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 905
    self._bootstrap_inner()

-- 
Chris Adams <cma@cmadams.net>

On Tue, Dec 7, 2021 at 6:12 PM Chris Adams <cma@cmadams.net> wrote:
Top differences
/usr/lib64/python3.6/site-packages/libvirt.py:442: size=295 MiB (+285 MiB), count=5511282 (+5312311), average=56 B
/usr/lib64/python3.6/json/decoder.py:355: size=73.9 MiB (+70.2 MiB), count=736108 (+697450), average=105 B
/usr/lib64/python3.6/logging/__init__.py:1630: size=44.2 MiB (+43.8 MiB), count=345704 (+342481), average=134 B
/usr/lib64/python3.6/site-packages/libvirt.py:5695: size=30.3 MiB (+30.0 MiB), count=190449 (+188665), average=167 B
/usr/lib/python3.6/site-packages/vdsm/host/stats.py:138: size=12.1 MiB (+11.4 MiB), count=75366 (+70991), average=168 B
/usr/lib/python3.6/site-packages/vdsm/utils.py:358: size=10.4 MiB (+9968 KiB), count=70204 (+65272), average=156 B
That's quite significant!
Top block
5511282 memory blocks: 302589.8 KiB
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 442
    ret = libvirtmod.virEventRunDefaultImpl()
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 69
    libvirt.virEventRunDefaultImpl()
  File "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260
    ret = func(*args, **kwargs)
You should check where these "ret" objects (of libvirt.py:442) are stored: 5,511,282 is a lot of small objects (average: 56 bytes)! Maybe they are stored in a list and never destroyed.

Maybe it's a reference leak in the libvirtmod.virEventRunDefaultImpl() function of the "libvirtmod" C extension: a missing Py_DECREF() somewhere.

Or something somehow prevents these objects from being deleted. For example, an exception is stored somewhere which keeps all variables alive (in Python 3, an exception stores a traceback object which keeps all variables of all frames alive).

On GitHub and GitLab, I found the following code. Maybe there are minor differences in the versions that you are using.

https://gitlab.com/libvirt/libvirt-python (I built the code locally to get build/libvirt.py)

build/libvirt.c:
---
PyObject *
libvirt_intWrap(int val)
{
    return PyLong_FromLong((long) val);
}

PyObject *
libvirt_virEventRunDefaultImpl(PyObject *self ATTRIBUTE_UNUSED,
                               PyObject *args ATTRIBUTE_UNUSED)
{
    PyObject *py_retval;
    int c_retval;
    LIBVIRT_BEGIN_ALLOW_THREADS;
    c_retval = virEventRunDefaultImpl();
    LIBVIRT_END_ALLOW_THREADS;
    py_retval = libvirt_intWrap((int) c_retval);
    return py_retval;
}

static PyMethodDef libvirtMethods[] = {
    { (char *)"virEventRunDefaultImpl", libvirt_virEventRunDefaultImpl, METH_VARARGS, NULL },
    ...
    {NULL, NULL, 0, NULL}
};
---

This code looks correct and straightforward. Is it possible that internally virEventRunDefaultImpl() calls a Python memory allocator?

build/libvirt.py:
---
def virEventRunDefaultImpl():
    ret = libvirtmod.virEventRunDefaultImpl()
    if ret == -1:
        raise libvirtError('virEventRunDefaultImpl() failed')
    return ret
---

Again, this code looks correct and straightforward.

https://github.com/oVirt/vdsm/blob/37ed5c279c2dd9c9bb06329d674882e0f98f34d6/...

vdsm/common/libvirtconnection.py:
---
def __run(self):
    try:
        libvirt.virEventRegisterDefaultImpl()
        while self.run:
            libvirt.virEventRunDefaultImpl()
    finally:
        self.run = False
---

The libvirt.virEventRunDefaultImpl() result is ignored, so I don't see anything obvious here which would explain a leak.

Sometimes, looking at the top function is misleading, since the explanation can be found in one of the caller functions. For example, which function creates 70.2 MiB of objects from a JSON document? What calls json/decoder.py:355?

Victor
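One way to answer that last question from a saved snapshot is to filter it down to json/decoder.py:355 and group the statistics by full traceback. This is a sketch: the pickle path is a hypothetical example, and enough frames must have been recorded (e.g. PYTHONTRACEMALLOC=25) for the caller chain to show up.

---
# Show which callers allocate via json/decoder.py:355 in a saved snapshot.
import pickle
import tracemalloc

with open('/tmp/tracemalloc-2.pickle', 'rb') as f:     # hypothetical dump file
    snapshot = pickle.load(f)

decoder_only = snapshot.filter_traces([
    tracemalloc.Filter(True, '*/json/decoder.py', lineno=355),
])

# Group by full traceback and print the largest caller chains.
for stat in decoder_only.statistics('traceback')[:3]:
    print('%d blocks, %.1f KiB' % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)
---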

@Jiri Denemark <jdenemar@redhat.com> @Eduardo Lima <etrunko@redhat.com> can you please have a look at the libvirt side?

@Martin Perina <mperina@redhat.com> the host/stats part within vdsm was handled by people who are no longer working on the oVirt project; perhaps someone from infra can have a look?

On Thu, 9 Dec 2021 at 11:20, Victor Stinner <vstinner@redhat.com> wrote:
[...]

Once upon a time, Victor Stinner <vstinner@redhat.com> said:
Or something somehow prevents to delete these projects object. For example, an exception is stored somewhere which keeps all variables alive (in Python 3, an exception stores a traceback object which keeps all variables of all frames alive).
I think I found the cause, if not the actual code issue... due to a long-standing local config typo (how embarrassing), these servers had the vdsm port (TCP 54321) open to the world. It appears that something is leaking memory on bad connections (like from port scans, I expect). I blocked the outside access, and the vdsmd processes have not grown since then.

It'd probably be good to handle this better (and now knowing a probable cause may help someone track it down), but I think I've solved my immediate problem.

-- 
Chris Adams <cma@cmadams.net>

On Wed, Nov 10, 2021 at 4:46 PM Chris Adams <cma@cmadams.net> wrote:
I have seen vdsmd leak memory for years (I've been running oVirt since version 3.5), but never been able to nail it down. I've upgraded a cluster to oVirt 4.4.9 (reloading the hosts with CentOS 8-stream), and I still see it happen. One host in the cluster, which has been up 8 days, has vdsmd with 4.3 GB resident memory. On a couple of other hosts, it's around half a gigabyte.
Can you share vdsm logs from the time vdsm started? We have these logs:

2021-11-14 15:16:32,956+0200 DEBUG (health) [health] Checking health (health:93)
2021-11-14 15:16:32,977+0200 DEBUG (health) [health] Collected 5001 objects (health:101)
2021-11-14 15:16:32,977+0200 DEBUG (health) [health] user=2.46%, sys=0.74%, rss=108068 kB (-376), threads=47 (health:126)
2021-11-14 15:16:32,977+0200 INFO (health) [health] LVM cache hit ratio: 97.64% (hits: 5431 misses: 131) (health:131)

They may provide useful info on the leak. You need to enable DEBUG logs for the root logger in /etc/vdsm/logger.conf:

[logger_root]
level=DEBUG
handlers=syslog,logthread
propagate=0

and restart the vdsmd service.

Nir
participants (5)
- Chris Adams
- David Malcolm
- Nir Soffer
- Sandro Bonazzola
- Victor Stinner