
Hello: I am experiencing trouble with VDSM memory consumption. I am running

Engine: oVirt 3.5.1

Nodes:
CentOS 6.6
VDSM 4.16.10-8
Libvirt: libvirt-0.10.2-46
Kernel: 2.6.32

When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again. I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug. Any help? If you need more information, I can provide it. Thank you
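One easy way to quantify this kind of growth is to log the daemon's resident set size from /proc at a fixed interval and plot it later. A minimal stand-alone sketch (the script name here is made up; pass vdsm's PID on the command line, e.g. as found with pgrep or ps):

#!/usr/bin/env python
# vdsm_rss_watch.py -- append one "timestamp RSS-in-KiB" line per minute.
# Usage: python vdsm_rss_watch.py <vdsm-pid>
import sys
import time

def rss_kib(pid):
    """Return VmRSS in KiB as reported by /proc/<pid>/status."""
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    raise RuntimeError('VmRSS not found for pid %s' % pid)

if __name__ == '__main__':
    pid = sys.argv[1]
    while True:
        print('%s %d' % (time.strftime('%Y-%m-%d %H:%M:%S'), rss_kib(pid)))
        sys.stdout.flush()
        time.sleep(60)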

Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
I am experiencing trouble with VDSM memory consumption.
I am running
Engine: ovirt 3.5.1
Nodes:
Centos 6.6 VDSM 4.16.10-8 Libvirt: libvirt-0.10.2-46 Kernel: 2.6.32
When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again.
I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>

I believe the supervdsm leak was fixed, but the 3.5.1 versions of vdsmd still leak slowly, ~300k/hr, yes. https://bugzilla.redhat.com/show_bug.cgi?id=1158108
On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
I am experiencing trouble with VDSM memory consumption.
I am running
Engine: ovirt 3.5.1
Nodes:
Centos 6.6 VDSM 4.16.10-8 Libvirt: libvirt-0.10.2-46 Kernel: 2.6.32
When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again.
I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>

On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
I believe the supervdsm leak was fixed, but 3.5.1 versions of vdsmd still leaks slowly, ~300k/hr, yes.
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
I am experiencing trouble with VDSM memory consumption.
I am running
Engine: ovirt 3.5.1
Nodes:
Centos 6.6 VDSM 4.16.10-8 Libvirt: libvirt-0.10.2-46 Kernel: 2.6.32
When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again.
I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.

On Mar 9, 2015, at 4:51 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
I believe the supervdsm leak was fixed, but 3.5.1 versions of vdsmd still leaks slowly, ~300k/hr, yes.
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
I am experiencing trouble with VDSM memory consumption.
I am running
Engine: ovirt 3.5.1
Nodes:
Centos 6.6 VDSM 4.16.10-8 Libvirt: libvirt-0.10.2-46 Kernel: 2.6.32
When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again.
I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.
I don’t think this is crypto related, but I could try that if you still need some confirmation (and point me at a quick doc on switching to plaintext?).
This is from #ovirt around November 18th I think, Saggi thought he’d found something related:
9:58:43 AM saggi: YamakasY: Found the leak
9:58:48 AM saggi: YamakasY: Or at least the flow
9:58:57 AM saggi: YamakasY: The good news is that I can reproduce
9:59:20 AM YamakasY: saggi: that's kewl!
9:59:25 AM YamakasY: saggi: what happens ?
9:59:41 AM YamakasY: I know from Telsin (ping ping!) that he sees it going faster on gluster usage
10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
10:02:46 AM saggi: YamakasY: horizontal is time since epoch and vertical is RSS in bytes
10:03:52 AM YamakasY: saggi: I have seen that line soooo much!
10:04:11 AM YamakasY: I think I even made a mailing about it
10:04:18 AM YamakasY: at least asked here
10:04:32 AM YamakasY: no-one knew, but those lines are almost blowing you away
10:04:35 AM YamakasY: can we patch it ?
10:04:59 AM YamakasY: wow, nice one to catch
10:05:28 AM saggi: YamakasY: I now have a smaller part of the code to scan through and a way to reproduce so hopefully I'll have a patch soon
Was that ever followed up on?
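Saggi's getCapabilities() observation is easy to re-check on a test host: call the verb in a loop while sampling vdsm's RSS, and the curve should climb the way his graph does. A rough sketch of such a loop, assuming the vdsClient CLI and its getVdsCaps verb are available on the host and that vdsm's PID is passed as the first argument:

# Call getVdsCaps repeatedly and print "epoch-seconds RSS-KiB" pairs,
# mirroring the axes of the RSS graph linked above. Sketch only.
import subprocess
import sys
import time

def rss_kib(pid):
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

if __name__ == '__main__':
    pid = sys.argv[1]
    devnull = open('/dev/null', 'w')
    for _ in range(600):
        # '-s' means use SSL; drop it if the host runs plaintext transport
        subprocess.call(['vdsClient', '-s', '0', 'getVdsCaps'], stdout=devnull)
        print('%d %d' % (int(time.time()), rss_kib(pid)))
        time.sleep(1)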

On Mon, Mar 09, 2015 at 10:40:51AM -0500, Darrell Budic wrote:
On Mar 9, 2015, at 4:51 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
I believe the supervdsm leak was fixed, but 3.5.1 versions of vdsmd still leaks slowly, ~300k/hr, yes.
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
I am experiencing trouble with VDSM memory consumption.
I am running
Engine: ovirt 3.5.1
Nodes:
Centos 6.6 VDSM 4.16.10-8 Libvirt: libvirt-0.10.2-46 Kernel: 2.6.32
When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again.
I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.
I don’t think this is crypto related, but I could try that if you still need some confirmation (and point me at a quick doc on switching to plaintext?).
This is from #ovirt around November 18th I think, Saggi thought he’d found something related:
9:58:43 AM saggi: YamakasY: Found the leak 9:58:48 AM saggi: YamakasY: Or at least the flow 9:58:57 AM saggi: YamakasY: The good news is that I can reproduce 9:59:20 AM YamakasY: saggi: that's kewl! 9:59:25 AM YamakasY: saggi: what happens ? 9:59:41 AM YamakasY: I know from Telsin (ping ping!) that he sees it going faster on gluster usage tdosek left the room (quit: Ping timeout: 480 seconds). (10:00:02 AM) djasa left the room (quit: Quit: Leaving). (10:00:24 AM) mlipchuk left the room (quit: Quit: Leaving.). (10:00:29 AM) laravot left the room (quit: Quit: Leaving.). (10:01:19 AM) 10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
I do recall what is the issue Saggi and YamakasY were discussing (CCing the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speaks about a leak in a normal working state, with no getCapabilities calls.

Hi,
I also see this on the latest 3.5 version; I'm thinking about setting up a cron job to restart vdsm every night.
I cannot believe that people say they don't have this issue.
Can one of the devs dive in, maybe?
Thanks!
Matt
2015-03-09 23:29 GMT+01:00 Dan Kenigsberg <danken@redhat.com>:
On Mon, Mar 09, 2015 at 10:40:51AM -0500, Darrell Budic wrote:
On Mar 9, 2015, at 4:51 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
I believe the supervdsm leak was fixed, but 3.5.1 versions of vdsmd still leaks slowly, ~300k/hr, yes.
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
I am experiencing trouble with VDSM memory consumption.
I am running
Engine: ovirt 3.5.1
Nodes:
Centos 6.6 VDSM 4.16.10-8 Libvirt: libvirt-0.10.2-46 Kernel: 2.6.32
When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again.
I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.
I don’t think this is crypto related, but I could try that if you still need some confirmation (and point me at a quick doc on switching to plaintext?).
This is from #ovirt around November 18th I think, Saggi thought he’d found something related:
9:58:43 AM saggi: YamakasY: Found the leak 9:58:48 AM saggi: YamakasY: Or at least the flow 9:58:57 AM saggi: YamakasY: The good news is that I can reproduce 9:59:20 AM YamakasY: saggi: that's kewl! 9:59:25 AM YamakasY: saggi: what happens ? 9:59:41 AM YamakasY: I know from Telsin (ping ping!) that he sees it going faster on gluster usage tdosek left the room (quit: Ping timeout: 480 seconds). (10:00:02 AM) djasa left the room (quit: Quit: Leaving). (10:00:24 AM) mlipchuk left the room (quit: Quit: Leaving.). (10:00:29 AM) laravot left the room (quit: Quit: Leaving.). (10:01:19 AM) 10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
I do recall what is the issue Saggi and YamakasY were discussing (CCing the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speaks about a leak in a normal working state, with no getCapabilities calls.

Oh, sorry Dan, you are one of them ;) Nice to have you on this :)
2015-03-09 23:49 GMT+01:00 Matt . <yamakasi.014@gmail.com>:
Hi,
I also see this on the latest 3.5 version, I'm thinking about setting up a cronjob to restart vdsm every night.
I cannot believe that people say they don't have this issue.
Can someone of the devs dive in maybe ?
Thanks!
Matt
2015-03-09 23:29 GMT+01:00 Dan Kenigsberg <danken@redhat.com>:
On Mon, Mar 09, 2015 at 10:40:51AM -0500, Darrell Budic wrote:
On Mar 9, 2015, at 4:51 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
I believe the supervdsm leak was fixed, but 3.5.1 versions of vdsmd still leaks slowly, ~300k/hr, yes.
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
> I am experiencing trouble with VDSM memory consumption.
>
> I am running
>
> Engine: oVirt 3.5.1
>
> Nodes:
>
> CentOS 6.6
> VDSM 4.16.10-8
> Libvirt: libvirt-0.10.2-46
> Kernel: 2.6.32
>
> When the host boots, memory consumption is normal, but after 2 or 3
> days running, VDSM's memory consumption grows until it consumes more
> memory than all the VMs running on the host. If I restart the vdsm
> service, memory consumption normalizes, but then it starts growing
> again.
>
> I have seen some BZs about memory leaks in vdsm and supervdsm, but
> I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.
I don’t think this is crypto related, but I could try that if you still need some confirmation (and point me at a quick doc on switching to plaintext?).
This is from #ovirt around November 18th I think, Saggi thought he’d found something related:
9:58:43 AM saggi: YamakasY: Found the leak 9:58:48 AM saggi: YamakasY: Or at least the flow 9:58:57 AM saggi: YamakasY: The good news is that I can reproduce 9:59:20 AM YamakasY: saggi: that's kewl! 9:59:25 AM YamakasY: saggi: what happens ? 9:59:41 AM YamakasY: I know from Telsin (ping ping!) that he sees it going faster on gluster usage tdosek left the room (quit: Ping timeout: 480 seconds). (10:00:02 AM) djasa left the room (quit: Quit: Leaving). (10:00:24 AM) mlipchuk left the room (quit: Quit: Leaving.). (10:00:29 AM) laravot left the room (quit: Quit: Leaving.). (10:01:19 AM) 10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
I do recall what is the issue Saggi and YamakasY were discussing (CCing the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speaks about a leak in a normal working state, with no getCapabilities calls.

On Mon, Mar 09, 2015 at 11:49:01PM +0100, Matt . wrote:
Hi,
I also see this on the latest 3.5 version, I'm thinking about setting up a cronjob to restart vdsm every night.
I cannot believe that people say they don't have this issue.
Can someone of the devs dive in maybe ?
10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
I do ***NOT*** recall what is the issue Saggi and YamakasY were discussing (CCing the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speaks about a leak in a normal working state, with no getCapabilities calls.
Please notice an important word that fell off my text. Do YOU recall if a fix was posted?

NO! The fix that should have fixed it didn't change a thing... we lost track there as some devs were going to look at it.
2015-03-10 11:47 GMT+01:00 Dan Kenigsberg <danken@redhat.com>:
On Mon, Mar 09, 2015 at 11:49:01PM +0100, Matt . wrote:
Hi,
I also see this on the latest 3.5 version, I'm thinking about setting up a cronjob to restart vdsm every night.
I cannot believe that people say they don't have this issue.
Can someone of the devs dive in maybe ?
10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
I do ***NOT*** recall what is the issue Saggi and YamakasY were discussing (CCing the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speaks about a leak in a normal working state, with no getCapabilities calls.
Please notice an important word that fell off my text. Do YOU recall if a fix was posted?

Hello everyone,
I did create the original BZ on this. In the meantime, the lab system I used has been dismantled and the production system is yet to be deployed.
As I wrote in BZ 1147148 [1], I experienced two different issues: one big memory leak of about 15 MiB/h and a smaller one of ~300 KiB/h. These seem unrelated.
The larger leak was indeed related to SSL in some way, not necessarily M2Crypto. After disabling SSL it was gone, leaving the smaller leak.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1147148
On Mon, 2015-03-09 at 23:49 +0100, Matt . wrote:
Hi,
> I also see this on the latest 3.5 version, I'm thinking about setting up a cronjob to restart vdsm every night.
I did the same thing. In general, it seems to be a bad idea, as it compromised system stability in the long run. While VMs seem to be fine, the engine does not like this very much.
> I cannot believe that people say they don't have this issue.
This was hard for me to accept as well. I know of Markus Stockhausen and Sven Kieske; both confirmed the small leak. It might also be some other special service, though I started out with a minimal install of CentOS 6.
Can someone of the devs dive in maybe ?
Thanks!
Matt
2015-03-09 23:29 GMT+01:00 Dan Kenigsberg <danken@redhat.com>:
On Mon, Mar 09, 2015 at 10:40:51AM -0500, Darrell Budic wrote:
On Mar 9, 2015, at 4:51 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
I believe the supervdsm leak was fixed, but 3.5.1 versions of vdsmd still leaks slowly, ~300k/hr, yes.
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
> I am experiencing trouble with VDSM memory consumption.
>
> I am running
>
> Engine: oVirt 3.5.1
>
> Nodes:
>
> CentOS 6.6
> VDSM 4.16.10-8
> Libvirt: libvirt-0.10.2-46
> Kernel: 2.6.32
>
> When the host boots, memory consumption is normal, but after 2 or 3
> days running, VDSM's memory consumption grows until it consumes more
> memory than all the VMs running on the host. If I restart the vdsm
> service, memory consumption normalizes, but then it starts growing
> again.
>
> I have seen some BZs about memory leaks in vdsm and supervdsm, but
> I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Can't help, but I see the same thing with CentOS 7 nodes and the same version of vdsm. -- Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.
I don’t think this is crypto related, but I could try that if you still need some confirmation (and point me at a quick doc on switching to plaintext?).
This is from #ovirt around November 18th I think, Saggi thought he’d found something related:
9:58:43 AM saggi: YamakasY: Found the leak 9:58:48 AM saggi: YamakasY: Or at least the flow 9:58:57 AM saggi: YamakasY: The good news is that I can reproduce 9:59:20 AM YamakasY: saggi: that's kewl! 9:59:25 AM YamakasY: saggi: what happens ? 9:59:41 AM YamakasY: I know from Telsin (ping ping!) that he sees it going faster on gluster usage tdosek left the room (quit: Ping timeout: 480 seconds). (10:00:02 AM) djasa left the room (quit: Quit: Leaving). (10:00:24 AM) mlipchuk left the room (quit: Quit: Leaving.). (10:00:29 AM) laravot left the room (quit: Quit: Leaving.). (10:01:19 AM) 10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
I do recall what is the issue Saggi and YamakasY were discussing (CCing the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speaks about a leak in a normal working state, with no getCapabilities calls.
--
Daniel Helgenberger
m box bewegtbild GmbH
P: +49/30/2408781-22 F: +49/30/2408781-10
ACKERSTR. 19 D-10115 BERLIN
www.m-box.de www.monkeymen.tv
Geschäftsführer: Martin Retschitzegger / Michaela Göllner
Handelsregister: Amtsgericht Charlottenburg / HRB 112767

Hi Daniel,
Great! Thanks.
I only see this issue happening on CentOS 7; Joop van de Wege also confirmed he didn't see it on CentOS 6.
Cheers,
Matt
2015-03-26 13:33 GMT+01:00 Daniel Helgenberger <daniel.helgenberger@m-box.de>:
Hello Everyone,
I did create the original BZ on this. In the mean time, lab system I used is dismantled and the production system is yet to deploy.
As I wrote in BZ1147148 [1], I experienced two different issues. One, one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem unrelated.
The larger leak was indeed related to SSL in some way; not necessarily M2Crypto. However, after disabling SSL this was gone leaving the smaller leak.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1147148 On Mo, 2015-03-09 at 23:49 +0100, Matt . wrote:
Hi,
I also see this on the latest 3.5 version, I'm thinking about setting up a cronjob to restart vdsm every night. I did the same thing. In general, it seems to be a bad idea as it compromised system stability on the long run. While VMs seem to be fine, engine does not like this very much.
I cannot believe that people say they don't have this issue. This was hard for me to accept as well. I know of Markus Stockhausen and Seven Kieske, both confirmed the small leak. This might also be some special other service; though I started out with a minimal install of Centos 6.
Can someone of the devs dive in maybe ?
Thanks!
Matt
2015-03-09 23:29 GMT+01:00 Dan Kenigsberg <danken@redhat.com>:
On Mon, Mar 09, 2015 at 10:40:51AM -0500, Darrell Budic wrote:
On Mar 9, 2015, at 4:51 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
I believe the supervdsm leak was fixed, but 3.5.1 versions of vdsmd still leaks slowly, ~300k/hr, yes.
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
> On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
>
> Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
>> I am experiencing trouble with VDSM memory consumption.
>>
>> I am running
>>
>> Engine: oVirt 3.5.1
>>
>> Nodes:
>>
>> CentOS 6.6
>> VDSM 4.16.10-8
>> Libvirt: libvirt-0.10.2-46
>> Kernel: 2.6.32
>>
>> When the host boots, memory consumption is normal, but after 2 or 3
>> days running, VDSM's memory consumption grows until it consumes more
>> memory than all the VMs running on the host. If I restart the vdsm
>> service, memory consumption normalizes, but then it starts growing
>> again.
>>
>> I have seen some BZs about memory leaks in vdsm and supervdsm, but
>> I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
>
> Can't help, but I see the same thing with CentOS 7 nodes and the same
> version of vdsm.
> --
> Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.
I don’t think this is crypto related, but I could try that if you still need some confirmation (and point me at a quick doc on switching to plaintext?).
This is from #ovirt around November 18th I think, Saggi thought he’d found something related:
9:58:43 AM saggi: YamakasY: Found the leak 9:58:48 AM saggi: YamakasY: Or at least the flow 9:58:57 AM saggi: YamakasY: The good news is that I can reproduce 9:59:20 AM YamakasY: saggi: that's kewl! 9:59:25 AM YamakasY: saggi: what happens ? 9:59:41 AM YamakasY: I know from Telsin (ping ping!) that he sees it going faster on gluster usage tdosek left the room (quit: Ping timeout: 480 seconds). (10:00:02 AM) djasa left the room (quit: Quit: Leaving). (10:00:24 AM) mlipchuk left the room (quit: Quit: Leaving.). (10:00:29 AM) laravot left the room (quit: Quit: Leaving.). (10:01:19 AM) 10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png
I do recall what is the issue Saggi and YamakasY were discussing (CCing the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speaks about a leak in a normal working state, with no getCapabilities calls.
-- Daniel Helgenberger m box bewegtbild GmbH
P: +49/30/2408781-22 F: +49/30/2408781-10
ACKERSTR. 19 D-10115 BERLIN
www.m-box.de www.monkeymen.tv
Geschäftsführer: Martin Retschitzegger / Michaela Göllner Handeslregister: Amtsgericht Charlottenburg / HRB 112767

On 26/03/15 09:43, Matt . wrote:
Hi Daniel,
Great! Thanks.
I only see this issue happening on CentOS 7, Joop van de Wege also confirmed he didn't see it on CentOS 6.
Cheers,
Matt
I have experienced the same issue on CentOS 6.6 and CentOS 7, both managed by the same engine.
Cheers,
Federico
2015-03-26 13:33 GMT+01:00 Daniel Helgenberger <daniel.helgenberger@m-box.de>:
Hello Everyone,
I did create the original BZ on this. In the mean time, lab system I used is dismantled and the production system is yet to deploy.
As I wrote in BZ1147148 [1], I experienced two different issues. One, one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem unrelated.
The larger leak was indeed related to SSL in some way; not necessarily M2Crypto. However, after disabling SSL this was gone leaving the smaller leak.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1147148 On Mo, 2015-03-09 at 23:49 +0100, Matt . wrote:
Hi,
I also see this on the latest 3.5 version, I'm thinking about setting up a cronjob to restart vdsm every night. I did the same thing. In general, it seems to be a bad idea as it compromised system stability on the long run. While VMs seem to be fine, engine does not like this very much.
I cannot believe that people say they don't have this issue. This was hard for me to accept as well. I know of Markus Stockhausen and Seven Kieske, both confirmed the small leak. This might also be some special other service; though I started out with a minimal install of Centos 6. Can someone of the devs dive in maybe ?
Thanks!
Matt
2015-03-09 23:29 GMT+01:00 Dan Kenigsberg <danken@redhat.com>:
On Mar 9, 2015, at 4:51 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Fri, Mar 06, 2015 at 10:58:53AM -0600, Darrell Budic wrote:
> I believe the supervdsm leak was fixed, but the 3.5.1 versions of vdsmd still leak slowly, ~300k/hr, yes.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1158108
>
>> On Mar 6, 2015, at 10:23 AM, Chris Adams <cma@cmadams.net> wrote:
>>
>> Once upon a time, Federico Alberto Sayd <fsayd@uncu.edu.ar> said:
>>> I am experiencing trouble with VDSM memory consumption.
>>>
>>> I am running
>>>
>>> Engine: oVirt 3.5.1
>>>
>>> Nodes:
>>>
>>> CentOS 6.6
>>> VDSM 4.16.10-8
>>> Libvirt: libvirt-0.10.2-46
>>> Kernel: 2.6.32
>>>
>>> When the host boots, memory consumption is normal, but after 2 or 3
>>> days running, VDSM's memory consumption grows until it consumes more
>>> memory than all the VMs running on the host. If I restart the vdsm
>>> service, memory consumption normalizes, but then it starts growing
>>> again.
>>>
>>> I have seen some BZs about memory leaks in vdsm and supervdsm, but
>>> I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
>>
>> Can't help, but I see the same thing with CentOS 7 nodes and the same
>> version of vdsm.
>> --
>> Chris Adams <cma@cmadams.net>
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
Regards, Dan.
I don’t think this is crypto related, but I could try that if you still need some confirmation (and point me at a quick doc on switching to plaintext?).
This is from #ovirt around November 18th I think, Saggi thought he’d found something related:
9:58:43 AM saggi: YamakasY: Found the leak 9:58:48 AM saggi: YamakasY: Or at least the flow 9:58:57 AM saggi: YamakasY: The good news is that I can reproduce 9:59:20 AM YamakasY: saggi: that's kewl! 9:59:25 AM YamakasY: saggi: what happens ? 9:59:41 AM YamakasY: I know from Telsin (ping ping!) that he sees it going faster on gluster usage tdosek left the room (quit: Ping timeout: 480 seconds). (10:00:02 AM) djasa left the room (quit: Quit: Leaving). (10:00:24 AM) mlipchuk left the room (quit: Quit: Leaving.). (10:00:29 AM) laravot left the room (quit: Quit: Leaving.). (10:01:19 AM) 10:01:54 AM saggi: YamakasY: it's in getCapabilities(). Here is the RSS graph. The flatlines are when I stopped calling it and called other verbs. http://i.imgur.com/CLm0Q75.png I do recall what is the issue Saggi and YamakasY were dicussing (CCing
On Mon, Mar 09, 2015 at 10:40:51AM -0500, Darrell Budic wrote: the pair), or if it reached fruition as a patch. It is certainly something other than Bug 1158108, as the latter speak about a leak in a normal working state, with no getCapabilities calls.
--
Daniel Helgenberger
m box bewegtbild GmbH
P: +49/30/2408781-22 F: +49/30/2408781-10
ACKERSTR. 19 D-10115 BERLIN
www.m-box.de www.monkeymen.tv
Geschäftsführer: Martin Retschitzegger / Michaela Göllner Handeslregister: Amtsgericht Charlottenburg / HRB 112767

Daniel Helgenberger <daniel.helgenberger@m-box.de> writes:
Hello Everyone,
I did create the original BZ on this. In the mean time, lab system I used is dismantled and the production system is yet to deploy.
As I wrote in BZ1147148 [1], I experienced two different issues. One, one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem unrelated.
The larger leak was indeed related to SSL in some way; not necessarily M2Crypto. However, after disabling SSL this was gone leaving the smaller leak.
I think there are, at least for the purpose of this discussion, 3 leaks:
1. the M2Crypto leak
2. a slower leak
3. a large leak that's not M2Crypto related and that's part of sampling

My efforts have been around finding the source of my larger leak, which I think is #3. I had disabled SSL, so I knew that M2Crypto isn't/shouldn't be the problem as in bz1147148, and SSL is beside the point, as the leak happens with a deactivated host. It's part of sampling, which always runs.

What I've found, after trying to get the smallest reproducer, is that it's not the netlink.iter_links that I commented on in [1] that is the problem. But in the _get_interfaces_and_samples loop there is the call to create an InterfaceSample, and that has getLinkSpeed() which, for vlans, ends up calling ipwrapper.getLink, and that in turn calls netlink.get_link(name).

netlink.get_link(name) *is* the source of my big leak. This is vdsm 4.16.10, so it is [2], and it's been changed in master for the removal of support for libnl v1, so it might not be a problem anymore.

def get_link(name):
    """Returns the information dictionary of the name specified link."""
    with _pool.socket() as sock:
        with _nl_link_cache(sock) as cache:
            link = _rtnl_link_get_by_name(cache, name)
            if not link:
                raise IOError(errno.ENODEV,
                              '%s is not present in the system' % name)
            return _link_info(cache, link)

The libnl documentation note at [3] says, for the rtnl_link_get_by_name function: "Attention: The reference counter of the returned link object will be incremented. Use rtnl_link_put() to release the reference."

So I took that hint and made a change that does the rtnl_link_put() in get_link(name), and it looks like it works for me.

diff oldnetlink.py netlink.py
67d66
<             return _link_info(cache, link)
68a68,70
>             li = _link_info(cache, link)
>             _rtnl_link_put(link)
>             return li
333a336,337
>
> _rtnl_link_put = _none_proto(('rtnl_link_put', LIBNL_ROUTE))

Hope that helps. And if someone else could confirm, that would be great.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108
[2] https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/netlink.py;h=af...
[3] http://www.infradead.org/~tgr/libnl/doc/api/group__link.html#ga1d583e4f0b43c...

-John
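Putting the quoted function together with the diff, the patched get_link() comes out roughly as below. The module-level helpers (_pool, _nl_link_cache, _rtnl_link_get_by_name, _link_info, _none_proto, LIBNL_ROUTE, errno) are the ones already defined in lib/vdsm/netlink.py, so treat this as a sketch of the result rather than the exact committed patch:

_rtnl_link_put = _none_proto(('rtnl_link_put', LIBNL_ROUTE))  # new binding, as in the diff

def get_link(name):
    """Returns the information dictionary of the name specified link."""
    with _pool.socket() as sock:
        with _nl_link_cache(sock) as cache:
            link = _rtnl_link_get_by_name(cache, name)
            if not link:
                raise IOError(errno.ENODEV,
                              '%s is not present in the system' % name)
            li = _link_info(cache, link)
            _rtnl_link_put(link)  # release the extra reference libnl took
            return li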

On Sat, Mar 28, 2015 at 10:20:25AM -0400, John Taylor wrote:
Daniel Helgenberger <daniel.helgenberger@m-box.de> writes:
Hello Everyone,
I did create the original BZ on this. In the mean time, lab system I used is dismantled and the production system is yet to deploy.
As I wrote in BZ1147148 [1], I experienced two different issues. One, one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem unrelated.
The larger leak was indeed related to SSL in some way; not necessarily M2Crypto. However, after disabling SSL this was gone leaving the smaller leak.
I think there are, at least for the purpose of this discussion, 3 leaks: 1. the M2Crypto leak 2. a slower leak 3. a large leak that's not M2Crypto related that's part of sampling
My efforts have been around finding the source of my larger leak, which I think is #3. I had disabled ssl so I knew that M2Crypto isn't/shouldn't be the problem as in bz1147148, and ssl is beside the point as it happens with a deactived host. It's part of sampling which always runs.
What I've found is, after trying to get the smallest reproducer, that it's not the netlink.iter_links that I commented on in [1] that is the problem. But in the _get_intefaces_and_samples loop is the call to create an InterfaceSample and that has getLinkSpeed() which, for vlans, ends up calling ipwrapper.getLink, and that to netlink.get_link(name)
netlink.get_link(name) *is* the source of my big leak. This is vdsm 4.16.10, so it is [2] and it's been changed in master for the removal of support for libnl v1 so it might not be a problem anymore.
def get_link(name):
    """Returns the information dictionary of the name specified link."""
    with _pool.socket() as sock:
        with _nl_link_cache(sock) as cache:
            link = _rtnl_link_get_by_name(cache, name)
            if not link:
                raise IOError(errno.ENODEV,
                              '%s is not present in the system' % name)
            return _link_info(cache, link)
The libnl documentation note at [3] says that for the rtnl_link_get_by_name function "Attention The reference counter of the returned link object will be incremented. Use rtnl_link_put() to release the reference."
So I took that hint, and made a change that does the rtnl_link_put() in get_link(name) and it looks like it works for me.
diff oldnetlink.py netlink.py
67d66
<             return _link_info(cache, link)
68a68,70
>             li = _link_info(cache, link)
>             _rtnl_link_put(link)
>             return li
333a336,337
>
> _rtnl_link_put = _none_proto(('rtnl_link_put', LIBNL_ROUTE))
Hope that helps. And if someone else could confirm that would be great.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108 [2] https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/netlink.py;h=af... [3] http://www.infradead.org/~tgr/libnl/doc/api/group__link.html#ga1d583e4f0b43c...
Thanks, John, for the great detective work.
I'm afraid that even on the master branch we keep calling rtnl_link_get_link() and rtnl_link_get_by_name() without clearing the reference count, so a fix is due there, too.
Would you consider posting a fully-fledged fix to gerrit? I still need to understand what the use of that refcount is, so that we do not release it too early.
Regards, Dan.

Dan Kenigsberg <danken@redhat.com> writes:
On Sat, Mar 28, 2015 at 10:20:25AM -0400, John Taylor wrote:
Daniel Helgenberger <daniel.helgenberger@m-box.de> writes:
Hello Everyone,
I did create the original BZ on this. In the mean time, lab system I used is dismantled and the production system is yet to deploy.
As I wrote in BZ1147148 [1], I experienced two different issues. One, one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem unrelated.
The larger leak was indeed related to SSL in some way; not necessarily M2Crypto. However, after disabling SSL this was gone leaving the smaller leak.
I think there are, at least for the purpose of this discussion, 3 leaks: 1. the M2Crypto leak 2. a slower leak 3. a large leak that's not M2Crypto related that's part of sampling
My efforts have been around finding the source of my larger leak, which I think is #3. I had disabled ssl so I knew that M2Crypto isn't/shouldn't be the problem as in bz1147148, and ssl is beside the point as it happens with a deactived host. It's part of sampling which always runs.
What I've found is, after trying to get the smallest reproducer, that it's not the netlink.iter_links that I commented on in [1] that is the problem. But in the _get_intefaces_and_samples loop is the call to create an InterfaceSample and that has getLinkSpeed() which, for vlans, ends up calling ipwrapper.getLink, and that to netlink.get_link(name)
netlink.get_link(name) *is* the source of my big leak. This is vdsm 4.16.10, so it is [2] and it's been changed in master for the removal of support for libnl v1 so it might not be a problem anymore.
def get_link(name): """Returns the information dictionary of the name specified link.""" with _pool.socket() as sock: with _nl_link_cache(sock) as cache: link = _rtnl_link_get_by_name(cache, name) if not link: raise IOError(errno.ENODEV, '%s is not present in the system' % name) return _link_info(cache, link)
The libnl documentation note at [3] says that for the rtnl_link_get_by_name function "Attention The reference counter of the returned link object will be incremented. Use rtnl_link_put() to release the reference."
So I took that hint, and made a change that does the rtnl_link_put() in get_link(name) and it looks like it works for me.
diff oldnetlink.py netlink.py 67d66 < return _link_info(cache, link) 68a68,70
li = _link_info(cache, link) _rtnl_link_put(link) return li
333a336,337
_rtnl_link_put = _none_proto(('rtnl_link_put', LIBNL_ROUTE))
Hope that helps. And if someone else could confirm that would be great.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108 [2] https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/netlink.py;h=af... [3] http://www.infradead.org/~tgr/libnl/doc/api/group__link.html#ga1d583e4f0b43c...
Thanks, John, for a great detective work.
I'm afraid that with even on the master branch we keep calling rtnl_link_get_link() and rtnl_link_get_by_name() without clearing the reference count, so a fix is due there, too.
Would you consider posting a fully-fledged fix to gerrit? I still need to understand what is the use of that refcount, so that we do not release it too early.
Regards, Dan.
Dan,
I'm happy to [1], although I've probably gotten something wrong with how it's supposed to be done :) It's for the version I'm using, so it's against the ovirt-3.5 branch.
[1] https://gerrit.ovirt.org/#/c/39372/
Thanks,
-John

Finally got a chance to implement this, so I'm testing it on my CentOS 7 hosts, and it looks good. I’ll keep an eye on it for a couple of days, but after a couple of hours, there’s no evidence of any leakage.
On Mar 30, 2015, at 4:14 PM, John Taylor <jtt77777@yahoo.com> wrote:
Dan Kenigsberg <danken@redhat.com> writes:
On Sat, Mar 28, 2015 at 10:20:25AM -0400, John Taylor wrote:
Daniel Helgenberger <daniel.helgenberger@m-box.de> writes:
Hello Everyone,
I did create the original BZ on this. In the mean time, lab system I used is dismantled and the production system is yet to deploy.
As I wrote in BZ1147148 [1], I experienced two different issues. One, one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem unrelated.
The larger leak was indeed related to SSL in some way; not necessarily M2Crypto. However, after disabling SSL this was gone leaving the smaller leak.
I think there are, at least for the purpose of this discussion, 3 leaks: 1. the M2Crypto leak 2. a slower leak 3. a large leak that's not M2Crypto related that's part of sampling
My efforts have been around finding the source of my larger leak, which I think is #3. I had disabled ssl so I knew that M2Crypto isn't/shouldn't be the problem as in bz1147148, and ssl is beside the point as it happens with a deactived host. It's part of sampling which always runs.
What I've found is, after trying to get the smallest reproducer, that it's not the netlink.iter_links that I commented on in [1] that is the problem. But in the _get_intefaces_and_samples loop is the call to create an InterfaceSample and that has getLinkSpeed() which, for vlans, ends up calling ipwrapper.getLink, and that to netlink.get_link(name)
netlink.get_link(name) *is* the source of my big leak. This is vdsm 4.16.10, so it is [2] and it's been changed in master for the removal of support for libnl v1 so it might not be a problem anymore.
def get_link(name): """Returns the information dictionary of the name specified link.""" with _pool.socket() as sock: with _nl_link_cache(sock) as cache: link = _rtnl_link_get_by_name(cache, name) if not link: raise IOError(errno.ENODEV, '%s is not present in the system' % name) return _link_info(cache, link)
The libnl documentation note at [3] says that for the rtnl_link_get_by_name function "Attention The reference counter of the returned link object will be incremented. Use rtnl_link_put() to release the reference."
So I took that hint, and made a change that does the rtnl_link_put() in get_link(name) and it looks like it works for me.
diff oldnetlink.py netlink.py 67d66 < return _link_info(cache, link) 68a68,70
li = _link_info(cache, link) _rtnl_link_put(link) return li
333a336,337
_rtnl_link_put = _none_proto(('rtnl_link_put', LIBNL_ROUTE))
Hope that helps. And if someone else could confirm that would be great.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108 [2] https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/netlink.py;h=af... [3] http://www.infradead.org/~tgr/libnl/doc/api/group__link.html#ga1d583e4f0b43c...
Thanks, John, for a great detective work.
I'm afraid that with even on the master branch we keep calling rtnl_link_get_link() and rtnl_link_get_by_name() without clearing the reference count, so a fix is due there, too.
Would you consider posting a fully-fledged fix to gerrit? I still need to understand what is the use of that refcount, so that we do not release it too early.
Regards, Dan.
Dan,
I'm happy to [1], although I've probably gotten something wrong with how it's supposed to be done :) It's for the version I'm using so it's for branch ovirt-3.5.
[1] https://gerrit.ovirt.org/#/c/39372/
Thanks, -John

Once upon a time, Dan Kenigsberg <danken@redhat.com> said:
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
So, to confirm, it looks like the steps to do that would be:
- In the [vars] section of /etc/vdsm/vdsm.conf, set "ssl = false".
- Restart the vdsmd service.
Is that all that is needed? Is it safe to restart vdsmd on a node with active VMs?
-- Chris Adams <cma@cmadams.net>

On Mon, Mar 09, 2015 at 12:17:00PM -0500, Chris Adams wrote:
Once upon a time, Dan Kenigsberg <danken@redhat.com> said:
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
So, to confirm, it looks like to do that, the steps would be:
- In the [vars] section of /etc/vdsm/vdsm.conf, set "ssl = false". - Restart the vdsmd service.
Is that all that is needed?
No. You'd have to reconfigure libvirtd to work in plaintext:
vdsm-tool configure --force
and also set your Engine to work in plaintext (unfortunately, I don't recall how that's done; surely Yaniv does).
Is it safe to restart vdsmd on a node with active VMs?
It's safe in the sense that I have not heard of a single failure to reconnect to already-running VMs in years. However, this is still not recommended for a production environment, and particularly not if one of the VMs is defined as highly available. This can end up with your host being fenced and all your VMs dead.
Dan.
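To make the host-side part of this concrete, here is a rough Python sketch of the switch Dan describes. The config path, the [vars] section and the service names are assumptions for a stock CentOS node, the engine side still has to be changed separately, and if the ssl key is missing from vdsm.conf it would need to be added by hand:

# Sketch only: flip one host's vdsm to plaintext, then re-run vdsm-tool.
import subprocess

VDSM_CONF = '/etc/vdsm/vdsm.conf'

def set_ssl_false():
    """Rewrite 'ssl = ...' under [vars] to 'ssl = false' (naive in-place edit)."""
    with open(VDSM_CONF) as f:
        lines = f.readlines()
    in_vars = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith('['):
            in_vars = (stripped == '[vars]')
        elif in_vars and stripped.replace(' ', '').startswith('ssl='):
            lines[i] = 'ssl = false\n'
    with open(VDSM_CONF, 'w') as f:
        f.writelines(lines)

if __name__ == '__main__':
    set_ssl_false()
    subprocess.check_call(['vdsm-tool', 'configure', '--force'])
    subprocess.check_call(['service', 'libvirtd', 'restart'])
    subprocess.check_call(['service', 'vdsmd', 'restart'])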

On 03/10/2015 12:19 AM, Dan Kenigsberg wrote:
On Mon, Mar 09, 2015 at 12:17:00PM -0500, Chris Adams wrote:
Once upon a time, Dan Kenigsberg <danken@redhat.com> said:
I'm afraid that we have yet to find a solution for this issue, which is completely different from the horrible leak of supervdsm < 4.16.7.
Could you corroborate the claim of Bug 1147148 - "M2Crypto usage in vdsm leaks memory"? Does the leak disappear once you start using plaintext transport?
So, to confirm, it looks like to do that, the steps would be:
- In the [vars] section of /etc/vdsm/vdsm.conf, set "ssl = false". - Restart the vdsmd service.
Is that all that is needed?
No. You'd have to reconfigure libvirtd to work in plaintext
vdsm-tool configure --force
and also set your Engine to work in plaintext (unfortunately, I don't recall how that's done; surely Yaniv does)
If the host is already managed by the engine, you can move it to maintenance and set this directly in the vdc_options table with a psql client against your DB: update the values of the 'EncryptHostCommunication' and 'SSLEnabled' options to False, then restart ovirt-engine. Besides the engine side, also run the changes on the host (ssl=False and configure --force, as Dan mentions above) and reactivate the host.
Is it safe to restart vdsmd on a node with active VMs?
It's safe in the sense that I have not heard of a single failure to reconnected to already-running VMs in years. However, this is still not recommended for production environment, and particularly not if one of the VMs is defined as highly-available. This can end up with your host being fenced and all your VMs dead.
Dan.
-- Yaniv Bronhaim.

On 06/03/15 18:12, Federico Alberto Sayd wrote:
Hello:
I am experiencing trouble with VDSM memory consumption.
I am running
Engine: ovirt 3.5.1
Nodes:
Centos 6.6 VDSM 4.16.10-8 Libvirt: libvirt-0.10.2-46 Kernel: 2.6.32
When the host boots, memory consumption is normal, but after 2 or 3 days running, VDSM's memory consumption grows until it consumes more memory than all the VMs running on the host. If I restart the vdsm service, memory consumption normalizes, but then it starts growing again.
I have seen some BZs about memory leaks in vdsm and supervdsm, but I don't know whether VDSM 4.16.10-8 is still affected by a related bug.
Any help? If you need, I can provide more information
Thank you
We have also faced this problem since 3.5 in two different installations... Hope it's fixed soon.
G

On 13/03/15 12:29, Kapetanakis Giannis wrote:
We also face this problem since 3.5 in two different installations... Hope it's fixed soon
Nothing will get fixed if no one bothers to open BZs and send relevant log files to help track down the problems.
--
Mit freundlichen Grüßen / Regards
Sven Kieske
Systemadministrator
Mittwald CM Service GmbH & Co. KG
Königsberger Straße 6
32339 Espelkamp
T: +49-5772-293-100
F: +49-5772-293-333
https://www.mittwald.de
Geschäftsführer: Robert Meyer
St.Nr.: 331/5721/1033, USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen
Komplementärin: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen

Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
On 13/03/15 12:29, Kapetanakis Giannis wrote:
We also face this problem since 3.5 in two different installations... Hope it's fixed soon
Nothing will get fixed if no one bothers to open BZs and send relevants log files to help track down the problems.
There's already an open BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5-minute period just now; VSZ didn't change).
-- Chris Adams <cma@cmadams.net>

Chris Adams <cma@cmadams.net> writes:
Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
On 13/03/15 12:29, Kapetanakis Giannis wrote:
We also face this problem since 3.5 in two different installations... Hope it's fixed soon
Nothing will get fixed if no one bothers to open BZs and send relevants log files to help track down the problems.
There's already an open BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5 minute period just now; VSZ didn't change).
For those following this, I've added a comment on the BZ [1], although in my case the memory leak is, like Chris Adams's, a lot more than the ~300 KiB/h in the original bug report by Daniel Helgenberger.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108
-John

On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
Chris Adams <cma@cmadams.net> writes:
Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
On 13/03/15 12:29, Kapetanakis Giannis wrote:
We also face this problem since 3.5 in two different installations... Hope it's fixed soon
Nothing will get fixed if no one bothers to open BZs and send relevants log files to help track down the problems.
There's already an open BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5 minute period just now; VSZ didn't change).
For those following this I've added a comment on the bz [1], although in my case the memory leak is, like Chris Adams, a lot more than the 300KiB/h in the original bug report by Daniel Helgenberger .
That's interesting (and worrying). Could you check your suggestion by editing sampling.py so that _get_interfaces_and_samples() returns an empty dict immediately? Would this make the leak disappear?
Dan.

On Mar 24, 2015, at 4:33 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
Chris Adams <cma@cmadams.net> writes:
Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
On 13/03/15 12:29, Kapetanakis Giannis wrote:
We also face this problem since 3.5 in two different installations... Hope it's fixed soon
Nothing will get fixed if no one bothers to open BZs and send relevants log files to help track down the problems.
There's already an open BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5 minute period just now; VSZ didn't change).
For those following this I've added a comment on the bz [1], although in my case the memory leak is, like Chris Adams, a lot more than the 300KiB/h in the original bug report by Daniel Helgenberger .
That's interesting (and worrying). Could you check your suggestion by editing sampling.py so that _get_interfaces_and_samples() returns the empty dict immediately? Would this make the leak disappear?
Looks like you’ve got something there. Just a quick test for now, watching RSS in top. I’ll let it go this way for a while and see what it looks like in a few hours.

System 1: 13 VMs w/ 24 interfaces between them
11:47 killed a vdsm @ 9.116G RSS (after maybe a week and a half running)
11:47: 97xxx
11:57: 135544 and climbing
12:00: 136400

Restarted with sampling.py modified to just return an empty dict:

def _get_interfaces_and_samples():
    links_and_samples = {}
    return links_and_samples

12:02 quickly grew to 127694
12:13: 133352
12:20: 132476
12:31: 132732
12:40: 132656
12:50: 132800
1:30: 133928
1:40: 133136
1:50: 133116
2:00: 133128

Interestingly, it looks like overall system load dropped significantly (from ~40-45% to 10% reported), mostly ksmd getting out of the way after freeing 9G, but it feels like more than that. (This is a 6-core system; I usually saw ksmd using ~80% of a single CPU, roughly 15% of the total available.)

Second system, 10 VMs w/ 17 interfaces
vdsmd @ 5.027G RSS (slightly less uptime than the previous host)
Freeing this RAM caused a ~16% utilization drop as ksmd stopped running as hard.
Restarted at 12:10
12:10: 106224
12:20: 111220
12:31: 114616
12:40: 117500
12:50: 120504
1:30: 133040
1:40: 136140
1:50: 139032
2:00: 142292

On Tue, Mar 24, 2015 at 02:01:40PM -0500, Darrell Budic wrote:
On Mar 24, 2015, at 4:33 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
Chris Adams <cma@cmadams.net> writes:
Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
On 13/03/15 12:29, Kapetanakis Giannis wrote:
We also face this problem since 3.5 in two different installations... Hope it's fixed soon
Nothing will get fixed if no one bothers to open BZs and send relevants log files to help track down the problems.
There's already an open BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5 minute period just now; VSZ didn't change).
For those following this I've added a comment on the bz [1], although in my case the memory leak is, like Chris Adams, a lot more than the 300KiB/h in the original bug report by Daniel Helgenberger .
That's interesting (and worrying). Could you check your suggestion by editing sampling.py so that _get_interfaces_and_samples() returns the empty dict immediately? Would this make the leak disappear?
Looks like you’ve got something there. Just a quick test for now, watching RSS in top. I’ll let it go this way for a while and see what it looks like in a few hours.
System 1: 13 VMs w/ 24 interfaces between them
11:47 killed a vdsm @ 9.116G RSS (after maybe a week and a half running)
11:47: 97xxx
11:57: 135544 and climbing
12:00: 136400
restarted with sampling.py modified to just return empty set:
def _get_interfaces_and_samples():
    links_and_samples = {}
    return links_and_samples
Thanks for the input. Just to be a little more certain that the culprit is _get_interfaces_and_samples() per se, would you please decorate it with memoized, and add a log line in the end:

@utils.memoized  # add this line
def _get_interfaces_and_samples():
    ...
    logging.debug('LINKS %s', links_and_samples)  ## and this line
    return links_and_samples

I'd like to see what happens when the function is run only once, and returns a non-empty reasonable dictionary of links and samples.
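For readers who have not looked at vdsm's utils module: the point of the memoized decorator here is that the decorated no-argument function computes its result once, and every later call returns the cached value, so the sampling thread keeps calling it but the link-walking code runs a single time. The following is only a simplified stand-in to illustrate the idea, not vdsm's actual implementation:

import functools

def memoized(func):
    # Simplified stand-in: cache results keyed by the positional arguments,
    # so a function taking no arguments executes its body only on the first call.
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoized
def _get_interfaces_and_samples():
    # The real function walks the host's links and reads their counters;
    # a constant dict is enough to show that repeated calls reuse one result.
    return {'lo': None}

print(_get_interfaces_and_samples() is _get_interfaces_and_samples())  # True: same cached object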

On Mar 25, 2015, at 5:34 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Mar 24, 2015 at 02:01:40PM -0500, Darrell Budic wrote:
On Mar 24, 2015, at 4:33 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
Chris Adams <cma@cmadams.net> writes:
Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
On 13/03/15 12:29, Kapetanakis Giannis wrote:
> We have also been facing this problem since 3.5 in two different installations... Hope it's fixed soon
Nothing will get fixed if no one bothers to open BZs and send relevant log files to help track down the problems.
There's already an open BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5 minute period just now; VSZ didn't change).
For those following this, I've added a comment on the bz [1], although in my case the memory leak is, like Chris Adams's, a lot more than the 300KiB/h in the original bug report by Daniel Helgenberger.
That's interesting (and worrying). Could you check your suggestion by editing sampling.py so that _get_interfaces_and_samples() returns the empty dict immediately? Would this make the leak disappear?
Looks like you’ve got something there. Just a quick test for now, watching RSS in top. I’ll let it go this way for a while and see what it looks like in a few hours.
System 1: 13 VMs w/ 24 interfaces between them
11:47 killed a vdsm @ 9.116G RSS (after maybe a week and a half running)
11:47: 97xxx
11:57: 135544 and climbing
12:00: 136400
restarted with sampling.py modified to just return empty set:
def _get_interfaces_and_samples():
    links_and_samples = {}
    return links_and_samples
Thanks for the input. Just to be a little more certain that the culprit is _get_interfaces_and_samples() per se, would you please decorate it with memoized, and add a log line in the end
@utils.memoized  # add this line
def _get_interfaces_and_samples():
    ...
    logging.debug('LINKS %s', links_and_samples)  ## and this line
    return links_and_samples
I'd like to see what happens when the function is run only once, and returns a non-empty reasonable dictionary of links and samples.
Looks similar, I modified my second server for this test:

12:25, still growing from yesterday: 544512
restarted with mods for logging and memoize:
stabilized @ 12:32: 114284
1:23: 115300

Thread-12::DEBUG::2015-03-25 12:28:08,080::sampling::243::root::(_get_interfaces_and_samples) LINKS {'vnet18': <virt.sampling.InterfaceSample instance at 0x7f38c03e85f0>, 'vnet19': <virt.sampling.InterfaceSample instance at 0x7f38b42cbcf8>, 'bond0': <virt.sampling.InterfaceSample instance at 0x7f38b429afc8>, 'vnet13': <virt.sampling.InterfaceSample instance at 0x7f38b42c8680>, 'vnet16': <virt.sampling.InterfaceSample instance at 0x7f38b42cb368>, 'private': <virt.sampling.InterfaceSample instance at 0x7f38b42b8bd8>, 'bond0.100': <virt.sampling.InterfaceSample instance at 0x7f38b42bdd88>, 'vnet0': <virt.sampling.InterfaceSample instance at 0x7f38b42c1f80>, 'enp3s0': <virt.sampling.InterfaceSample instance at 0x7f38b429cef0>, 'vnet2': <virt.sampling.InterfaceSample instance at 0x7f38b42bbbd8>, 'vnet3': <virt.sampling.InterfaceSample instance at 0x7f38b42c37e8>, 'vnet4': <virt.sampling.InterfaceSample instance at 0x7f38b42c5518>, 'vnet5': <virt.sampling.InterfaceSample instance at 0x7f38b42c6ab8>, 'vnet6': <virt.sampling.InterfaceSample instance at 0x7f38b42c7248>, 'vnet7': <virt.sampling.InterfaceSample instance at 0x7f38c03e7a28>, 'vnet8': <virt.sampling.InterfaceSample instance at 0x7f38b42c7c20>, 'bond0.1100': <virt.sampling.InterfaceSample instance at 0x7f38b42be710>, 'bond0.1103': <virt.sampling.InterfaceSample instance at 0x7f38b429dc68>, 'ovirtmgmt': <virt.sampling.InterfaceSample instance at 0x7f38b42b16c8>, 'lo': <virt.sampling.InterfaceSample instance at 0x7f38b429a8c0>, 'vnet22': <virt.sampling.InterfaceSample instance at 0x7f38c03e7128>, 'vnet21': <virt.sampling.InterfaceSample instance at 0x7f38b42cd368>, 'vnet20': <virt.sampling.InterfaceSample instance at 0x7f38b42cc7a0>, 'internet': <virt.sampling.InterfaceSample instance at 0x7f38b42aa098>, 'bond0.1203': <virt.sampling.InterfaceSample instance at 0x7f38b42aa8c0>, 'bond0.1223': <virt.sampling.InterfaceSample instance at 0x7f38b42bb128>, 'XXXXXXXXXXX': <virt.sampling.InterfaceSample instance at 0x7f38b42bee60>, 'XXXXXXX': <virt.sampling.InterfaceSample instance at 0x7f38b42beef0>, ';vdsmdummy;': <virt.sampling.InterfaceSample instance at 0x7f38b42bdc20>, 'vnet14': <virt.sampling.InterfaceSample instance at 0x7f38b42ca050>, 'mgmt': <virt.sampling.InterfaceSample instance at 0x7f38b42be248>, 'vnet15': <virt.sampling.InterfaceSample instance at 0x7f38b42cab00>, 'enp2s0': <virt.sampling.InterfaceSample instance at 0x7f38b429c200>, 'bond0.1110': <virt.sampling.InterfaceSample instance at 0x7f38b42bed40>, 'vnet1': <virt.sampling.InterfaceSample instance at 0x7f38b42c27e8>, 'bond0.1233': <virt.sampling.InterfaceSample instance at 0x7f38b42bedd0>, 'bond0.1213': <virt.sampling.InterfaceSample instance at 0x7f38b42b2128>}

Didn’t see the significant CPU use difference on this one, so I'm thinking it was all ksmd on yesterday’s tests. Yesterday’s test is still going, and still hovering around 135016 or so.

On Wed, Mar 25, 2015 at 01:29:25PM -0500, Darrell Budic wrote:
On Mar 25, 2015, at 5:34 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Mar 24, 2015 at 02:01:40PM -0500, Darrell Budic wrote:
On Mar 24, 2015, at 4:33 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
Chris Adams <cma@cmadams.net> writes:
Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
> On 13/03/15 12:29, Kapetanakis Giannis wrote:
>> We have also been facing this problem since 3.5 in two different installations... Hope it's fixed soon
>
> Nothing will get fixed if no one bothers to open BZs and send relevant log files to help track down the problems.
There's already an open BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1158108
I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5 minute period just now; VSZ didn't change).
For those following this, I've added a comment on the bz [1], although in my case the memory leak is, like Chris Adams's, a lot more than the 300KiB/h in the original bug report by Daniel Helgenberger.
That's interesting (and worrying). Could you check your suggestion by editing sampling.py so that _get_interfaces_and_samples() returns the empty dict immediately? Would this make the leak disappear?
Looks like you’ve got something there. Just a quick test for now, watching RSS in top. I’ll let it go this way for a while and see what it looks like in a few hours.
System 1: 13 VMs w/ 24 interfaces between them
11:47 killed a vdsm @ 9.116G RSS (after maybe a week and a half running)
11:47: 97xxx
11:57: 135544 and climbing
12:00: 136400
restarted with sampling.py modified to just return empty set:
def _get_interfaces_and_samples():
    links_and_samples = {}
    return links_and_samples
Thanks for the input. Just to be a little more certain that the culprit is _get_interfaces_and_samples() per se, would you please decorate it with memoized, and add a log line in the end
@utils.memoized  # add this line
def _get_interfaces_and_samples():
    ...
    logging.debug('LINKS %s', links_and_samples)  ## and this line
    return links_and_samples
I'd like to see what happens when the function is run only once, and returns a non-empty reasonable dictionary of links and samples.
Looks similar, I modified my second server for this test:
Thanks again. Would you be kind to search further? Does the following script leak anything on your host, when placed in your /usr/share/vdsm:

#!/usr/bin/python
from time import sleep
from virt.sampling import _get_interfaces_and_samples

while True:
    _get_interfaces_and_samples()
    sleep(0.2)

Something that can be a bit harder would be to:

# service vdsmd stop
# su - vdsm -s /bin/bash
# cd /usr/share/vdsm
# valgrind --leak-check=full --log-file=/tmp/your.log vdsm

as suggested by Thomas on https://bugzilla.redhat.com/show_bug.cgi?id=1158108#c6

Regards, Dan.
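A variant of the script above that also prints its own RSS every so often makes the leak rate easy to quantify without a second terminal. This is only a sketch under the same assumptions as the script it extends (it must live in /usr/share/vdsm so the virt.sampling import resolves, and reading VmRSS from /proc/self/status is Linux-specific):

#!/usr/bin/python
# Sketch: same loop as the test script above, but print this process's
# resident set size every 50 iterations so growth can be measured directly.
from time import sleep
from virt.sampling import _get_interfaces_and_samples

def my_rss_kib():
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # kB
    return -1

iteration = 0
while True:
    _get_interfaces_and_samples()
    iteration += 1
    if iteration % 50 == 0:
        print('iteration %d: RSS %d kB' % (iteration, my_rss_kib()))
    sleep(0.2)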

On Mar 26, 2015, at 6:42 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Wed, Mar 25, 2015 at 01:29:25PM -0500, Darrell Budic wrote:
On Mar 25, 2015, at 5:34 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Tue, Mar 24, 2015 at 02:01:40PM -0500, Darrell Budic wrote:
On Mar 24, 2015, at 4:33 AM, Dan Kenigsberg <danken@redhat.com> wrote:
On Mon, Mar 23, 2015 at 04:00:14PM -0400, John Taylor wrote:
Chris Adams <cma@cmadams.net> writes:
> Once upon a time, Sven Kieske <s.kieske@mittwald.de> said:
>> On 13/03/15 12:29, Kapetanakis Giannis wrote:
>>> We have also been facing this problem since 3.5 in two different installations... Hope it's fixed soon
>>
>> Nothing will get fixed if no one bothers to open BZs and send relevant log files to help track down the problems.
>
> There's already an open BZ:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1158108
>
> I'm not sure if that is exactly the same problem I'm seeing or not; my vdsm process seems to be growing faster (RSS grew 952K in a 5 minute period just now; VSZ didn't change).
For those following this, I've added a comment on the bz [1], although in my case the memory leak is, like Chris Adams's, a lot more than the 300KiB/h in the original bug report by Daniel Helgenberger.
That's interesting (and worrying). Could you check your suggestion by editing sampling.py so that _get_interfaces_and_samples() returns the empty dict immediately? Would this make the leak disappear?
Looks like you’ve got something there. Just a quick test for now, watching RSS in top. I’ll let it go this way for a while and see what it looks like in a few hours.
System 1: 13 VMs w/ 24 interfaces between them
11:47 killed a vdsm @ 9.116G RSS (after maybe a week and a half running)
11:47: 97xxx
11:57: 135544 and climbing
12:00: 136400
restarted with sampling.py modified to just return empty set:
def _get_interfaces_and_samples():
    links_and_samples = {}
    return links_and_samples
Thanks for the input. Just to be a little more certain that the culprit is _get_interfaces_and_samples() per se, would you please decorate it with memoized, and add a log line in the end
@utils.memoized  # add this line
def _get_interfaces_and_samples():
    ...
    logging.debug('LINKS %s', links_and_samples)  ## and this line
    return links_and_samples
I'd like to see what happens when the function is run only once, and returns a non-empty reasonable dictionary of links and samples.
Looks similar, I modified my second server for this test:
Thanks again. Would you be kind to search further? Does the following script leak anything on your host, when placed in your /usr/share/vdsm:
#!/usr/bin/python
from time import sleep
from virt.sampling import _get_interfaces_and_samples

while True:
    _get_interfaces_and_samples()
    sleep(0.2)
Something that can be a bit harder would be to:

# service vdsmd stop
# su - vdsm -s /bin/bash
# cd /usr/share/vdsm
# valgrind --leak-check=full --log-file=/tmp/your.log vdsm
as suggested by Thomas on https://bugzilla.redhat.com/show_bug.cgi?id=1158108#c6
Yes, this script leaks quickly. Started out at a RSS of 21000ish, already at 26744 a minute in, about 5 minutes later it’s at 39384 and climbing.

Been abusing a production server for those simple tests, but didn’t want to run valgrind against it right this minute. Did run it against the test.py script above though, got this (fpaste.org didn’t like, too long maybe?): http://tower.onholyground.com/valgrind-test.log

To comment on some other posts in this thread, I also see leaks on my test system which is running Centos 6.6, but it only has 3 VMs across 2 servers and 3 configured networks and it leaks MUCH slower. I suspect people don’t notice this on test systems because they don’t have a lot of VMs/interfaces running, and don’t leave them up for weeks at a time. That’s why I was running these tests on my production box, to have more VMs up.
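For anyone skimming a large valgrind log like the one linked above, a rough summary of the "definitely lost" records is often enough to see where the bulk of the loss sits. A sketch only; the regular expression matches the usual valgrind wording, which can vary between versions:

#!/usr/bin/python
# Sketch: sum the "definitely lost" records in a valgrind --leak-check=full
# log, e.g. "python summarize.py valgrind-test.log".
import re
import sys

LOST = re.compile(r'==\d+==\s+([\d,]+) bytes in [\d,]+ blocks are definitely lost')

total_bytes = 0
records = 0
with open(sys.argv[1]) as log:
    for line in log:
        match = LOST.search(line)
        if match:
            total_bytes += int(match.group(1).replace(',', ''))
            records += 1

print('%d "definitely lost" records, %d bytes in total' % (records, total_bytes))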

On 26/03/15 18:12, Darrell Budic wrote:
Yes, this script leaks quickly. Started out at a RSS of 21000ish, already at 26744 a minute in, about 5 minutes later it’s at 39384 and climbing.
Been abusing a production server for those simple tests, but didn’t want to run valgrind against it right this minute. Did run it against the test.py script above though, got this (fpaste.org didn’t like, too long maybe?): http://tower.onholyground.com/valgrind-test.log
To comment on some other posts in this thread, I also see leaks on my test system which is running Centos 6.6, but it only has 3 VMs across 2 servers and 3 configured networks and it leaks MUCH slower. I suspect people don’t notice this on test systems because they don’t have a lot of VMs/interfaces running, and don’t leave them up for weeks at a time. That’s why I was running these tests on my production box, to have more VMs up.
I don't think it's related directly to the number of VMs running. Maybe indirectly if it's related to the number of network interfaces (so vm interfaces add to the leak).

We've seen the leak on nodes under maintenance...

G
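If the leak really does scale with the number of links rather than with the number of VMs, it would help to record the interface count next to each RSS sample. Something as small as this gives the count on a host (a sketch; /sys/class/net may not match vdsm's netlink view exactly, but it is close enough to compare hosts):

#!/usr/bin/python
# Sketch: count the network links on this host (bonds, VLANs, bridges and
# vnet* tap devices all appear here) to correlate with the observed leak rate.
import os

links = sorted(os.listdir('/sys/class/net'))
print('%d links: %s' % (len(links), ', '.join(links)))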

Just to add that I am also affected, whatever the host (el7 or el6), and I have many VMs running on a single host (up to 15) and many networks (up to 10). It is always the same: when vdsmd has finished consuming all of the memory, the host becomes unreachable and VMs begin to migrate. The only way to stop this is to restart vdsmd. On 30/03/2015 15:40, Kapetanakis Giannis wrote:
On 26/03/15 18:12, Darrell Budic wrote:
Yes, this script leaks quickly. Started out at a RSS of 21000ish, already at 26744 a minute in, about 5 minutes later it’s at 39384 and climbing.
Been abusing a production server for those simple tests, but didn’t want to run valgrind against it right this minute. Did run it against the test.py script above though, got this (fpaste.org didn’t like, too long maybe?): http://tower.onholyground.com/valgrind-test.log
To comment on some other posts in this thread, I also see leaks on my test system which is running Centos 6.6, but it only has 3 VMs across 2 servers and 3 configured networks and it leaks MUCH slower. I suspect people don’t notice this on test systems because they don’t have a lot of VMs/interfaces running, and don’t leave them up for weeks at a time. That’s why I was running these tests on my production box, to have more VMs up.
I don't think it's related directly to the number of VMs running. Maybe indirectly if it's related to the number of network interfaces (so vm interfaces add to the leak).
We've seen the leak on nodes under maintenance...
G
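Until the fix is released, a crude stopgap for the unreachable-host scenario described above is to restart vdsmd automatically once its RSS crosses a threshold. The sketch below is only an illustration and entirely an assumption on my part: the 4 GiB limit, the pgrep pattern and the sysvinit-style service call all need adapting, and restarting vdsmd briefly interrupts engine communication even though it should not touch the running VMs.

#!/usr/bin/python
# Sketch: run from cron; restart vdsmd if its resident size exceeds LIMIT_KIB.
# All thresholds and process-matching details below are assumptions to adapt.
import subprocess
import sys

LIMIT_KIB = 4 * 1024 * 1024  # 4 GiB, expressed in kB

def vdsm_pid():
    proc = subprocess.Popen(['pgrep', '-f', 'vdsm/vdsm'], stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    if not out.split():
        sys.exit(0)  # vdsm not running; nothing to do
    return int(out.split()[0])

def rss_kib(pid):
    with open('/proc/%d/status' % pid) as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return 0

if rss_kib(vdsm_pid()) > LIMIT_KIB:
    subprocess.call(['service', 'vdsmd', 'restart'])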
participants (11)
- Chris Adams
- Dan Kenigsberg
- Daniel Helgenberger
- Darrell Budic
- Federico Alberto Sayd
- John Taylor
- Kapetanakis Giannis
- Matt .
- Nathanaël Blanchet
- Sven Kieske
- ybronhei