On Tue, May 8, 2018 at 9:16 AM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, May 7, 2018 at 6:48 PM Chris Adams <cma@cmadams.net> wrote:
I have a problem with a memory leak in vdsm.  I have a dev cluster that
right now is:

- two nodes
- CentOS 7.4 (up to date)
- oVirt 4.2.2 (installed as 3.5 and upgraded version by version)
- hosted engine (no other running VM at the moment)
- iSCSI storage

I have a script that writes the vdsm RSS to a file every five minutes,
and on the node holding the hosted engine, vdsm RSS grows around
300-1500KB every snapshot.

I maintain several oVirt clusters for others, and they all seem to have
this problem.  The production clusters are all still on oVirt 4.1, but
they all have this problem too, so I guess it is something about how I
set them up?  On a couple I just checked, the vdsm RSS is over 1G.

Any tips on instrumenting vdsm to track this down?  I am unfortunately
only passingly familiar with python (I can make small changes, but not
knowledgeable enough to figure this out).

To debug these issues, you should enable the health monitor by creating
a drop-in configuration file:

$ cat /etc/vdsm/vdsm.conf.d/health.conf
[devel]
health_monitor_enable = true

And restart vdsm to start the health monitor.

The health logs are using DEBUG level so you need to enable
DEBUG level for the "health" logger. You can do this with:

    $ vdsm-client Host setLogLevel level=DEBUG name=health

Or by adding new logger configuration to /etc/vdsm/logger.conf:

1. add health logger to [loggers]

[loggers]
keys=root,vds,storage,virt,ovirt_hosted_engine_ha,ovirt_hosted_engine_ha_config,IOProcess,devel,health

2. add [logger_health] section

[logger_health]
level=DEBUG
handlers=logthread
qualname=health
propagate=0
 
Finally post here the [health] logs from vdsm.log.                                                                                 

Here are example logs on a host running 4.2.3 without hosted engine:

# grep '(health)' /var/log/vdsm/vdsm.log
2018-05-08 10:54:29,424+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 10:54:29,451+0300 DEBUG (health) [health] Collected 4710 objects (health:97)
2018-05-08 10:54:29,451+0300 DEBUG (health) [health] user=2.22%, sys=0.92%, rss=84192 kB (+23436), threads=70 (health:122)
2018-05-08 10:55:29,451+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 10:55:29,472+0300 DEBUG (health) [health] Collected 43 objects (health:97)
2018-05-08 10:55:29,474+0300 DEBUG (health) [health] user=0.60%, sys=0.38%, rss=84236 kB (+44), threads=62 (health:122)
2018-05-08 10:56:29,475+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 10:56:29,496+0300 DEBUG (health) [health] Collected 48 objects (health:97)
2018-05-08 10:56:29,496+0300 DEBUG (health) [health] user=0.58%, sys=0.42%, rss=84444 kB (+208), threads=62 (health:122)
2018-05-08 10:57:29,497+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 10:57:29,521+0300 DEBUG (health) [health] Collected 43 objects (health:97)
2018-05-08 10:57:29,521+0300 DEBUG (health) [health] user=0.58%, sys=0.42%, rss=84452 kB (+8), threads=62 (health:122)
2018-05-08 10:58:29,522+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 10:58:29,545+0300 DEBUG (health) [health] Collected 43 objects (health:97)
2018-05-08 10:58:29,545+0300 DEBUG (health) [health] user=0.60%, sys=0.42%, rss=84500 kB (+48), threads=62 (health:122)
2018-05-08 10:59:29,546+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 10:59:29,569+0300 DEBUG (health) [health] Collected 137 objects (health:97)
2018-05-08 10:59:29,570+0300 DEBUG (health) [health] user=0.70%, sys=0.48%, rss=84876 kB (+376), threads=69 (health:122)
2018-05-08 11:00:29,570+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 11:00:29,593+0300 DEBUG (health) [health] Collected 43 objects (health:97)
2018-05-08 11:00:29,594+0300 DEBUG (health) [health] user=0.60%, sys=0.43%, rss=84744 kB (-132), threads=62 (health:122)
2018-05-08 11:01:29,594+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 11:01:29,617+0300 DEBUG (health) [health] Collected 43 objects (health:97)
2018-05-08 11:01:29,617+0300 DEBUG (health) [health] user=0.65%, sys=0.40%, rss=84748 kB (+4), threads=62 (health:122)
2018-05-08 11:02:29,618+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 11:02:29,641+0300 DEBUG (health) [health] Collected 43 objects (health:97)
2018-05-08 11:02:29,641+0300 DEBUG (health) [health] user=0.57%, sys=0.40%, rss=84748 kB (+0), threads=62 (health:122)
2018-05-08 11:03:29,642+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 11:03:29,665+0300 DEBUG (health) [health] Collected 43 objects (health:97)
2018-05-08 11:03:29,665+0300 DEBUG (health) [health] user=0.60%, sys=0.43%, rss=84812 kB (+64), threads=62 (health:122)
2018-05-08 11:04:29,666+0300 DEBUG (health) [health] Checking health (health:90)
2018-05-08 11:04:29,690+0300 DEBUG (health) [health] Collected 137 objects (health:97)
2018-05-08 11:04:29,690+0300 DEBUG (health) [health] user=0.73%, sys=0.47%, rss=85008 kB (+196), threads=69 (health:122)

Health stats are also reported using the configured metrics collector, 
(see [metrics] section in vdsm.conf) using these names:

hosts.vdsm.gc.uncollectable
hosts.vdsm.cpu.user_pct
hosts.vdsm.cpu.sys_pct
hosts.vdsm.cpu.memory.rss
hosts.vdsm.threads_count

Nir