[ovirt-devel] [VDSM] health: Introduce Vdsm health monitoring

Nir Soffer nsoffer at redhat.com
Fri Jan 15 15:09:25 UTC 2016


On Fri, Jan 15, 2016 at 11:22 AM, Francesco Romani <fromani at redhat.com> wrote:
> ----- Original Message -----
>> From: "Nir Soffer" <nsoffer at redhat.com>
>> To: "devel" <devel at ovirt.org>, "Francesco Romani" <fromani at redhat.com>
>> Sent: Friday, January 15, 2016 9:52:13 AM
>> Subject: Re: [VDSM] health: Introduce Vdsm health monitoring
>>
>> I tested with https://gerrit.ovirt.org/#/c/51630
>>
>> libvirt.virDomain object does not leak any more, but pthreading object
>> are still leaking,
>> and the number of uncollectible objects is still very high (12,915).
>>
>> Each time we start and stop a vm, we leak about 6000 objects which are part
>> of a reference cycle, and each time we create and leak 27 pthreading
>> locks/conditions
>> which are probably the reason for this leak (because they implement
>> __del__ or reference object
>> implementing __del__.
>
> *Surely* this deserve more investigation, *most likley* there is more to be fixed,
> and *surely* I'll invest more energy in research, but I think we should also apply
> some grains of salt to the raw results gc gives us.
> I've been running a test overnight like
>
> - start 20 vms
> - sleep 2m
> - stop 20 vms
> - sleep 30s
>
> repeated 100 times, monitoring VmRSS (/proc/$VDSM_PID/status) and I didn't observed
> any _significant_ leak.
> The VmRSS growth was pretty low, almost flat. So something doesn't click here.
>
> I mean, 6000 objects (at least) per vm each time *should* be noticeable after all!
>
> Once 51630 is in good shape I'll broad the net and go hunt for more virt leaks.

After removing the gc.DEBUG_SAVEALL, we see now only the libvirt.virDomain leak.

Turns out that the pthreading objects leak was caused by saving all
the unreachable
objects when using gc.DEBUG_SAVEALL.

Here is a run with https://gerrit.ovirt.org/51708, but without
https://gerrit.ovirt.org/51630

# tail -f /var/log/vdsm/vdsm.log | grep health
MainThread::INFO::2016-01-15 16:14:14,789::health::58::health::(start)
Starting health monitor (interval=60)
Thread-15::DEBUG::2016-01-15 16:14:14,793::health::70::health::(_run)
Health monitor started

Thread-15::DEBUG::2016-01-15
16:15:14,794::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:15:14,815::health::86::health::(_check) Collected 54 objects

activating host...

Thread-15::DEBUG::2016-01-15
16:16:14,815::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:16:14,837::health::86::health::(_check) Collected 299 objects

Thread-15::DEBUG::2016-01-15
16:17:14,838::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:17:14,858::health::86::health::(_check) Collected 128 objects

starting vm...

Thread-15::DEBUG::2016-01-15
16:18:14,858::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:18:14,883::health::86::health::(_check) Collected 1771 objects

Thread-15::DEBUG::2016-01-15
16:19:14,884::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:19:14,909::health::86::health::(_check) Collected 167 objects

stopping vm...

Thread-15::DEBUG::2016-01-15
16:20:14,909::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:20:14,934::health::86::health::(_check) Collected 179 objects

vm is down...

Thread-15::DEBUG::2016-01-15
16:21:14,934::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:21:14,960::health::86::health::(_check) Collected 2284 objects
Thread-15::WARNING::2016-01-15
16:21:14,960::health::94::health::(_check) Found 1 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>']

starting vm...

Thread-15::DEBUG::2016-01-15
16:22:14,961::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:22:14,987::health::86::health::(_check) Collected 1721 objects
Thread-15::WARNING::2016-01-15
16:22:14,987::health::94::health::(_check) Found 1 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>']

stopping vm...

Thread-15::DEBUG::2016-01-15
16:23:14,988::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:23:15,014::health::86::health::(_check) Collected 189 objects
Thread-15::WARNING::2016-01-15
16:23:15,014::health::94::health::(_check) Found 1 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>']

Thread-15::DEBUG::2016-01-15
16:24:15,014::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:24:15,040::health::86::health::(_check) Collected 2240 objects
Thread-15::WARNING::2016-01-15
16:24:15,041::health::94::health::(_check) Found 2 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>',
'<libvirt.virDomain object at 0x7f79ec10cf90>']

vm is down here (probbaly was done before the log)

Thread-15::DEBUG::2016-01-15
16:25:15,041::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:25:15,065::health::86::health::(_check) Collected 128 objects
Thread-15::WARNING::2016-01-15
16:25:15,065::health::94::health::(_check) Found 2 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>',
'<libvirt.virDomain object at 0x7f79ec10cf90>']

starting vm...

Thread-15::DEBUG::2016-01-15
16:26:15,067::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:26:15,094::health::86::health::(_check) Collected 1749 objects
Thread-15::WARNING::2016-01-15
16:26:15,094::health::94::health::(_check) Found 2 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>',
'<libvirt.virDomain object at 0x7f79ec10cf90>']

stopping vm...

Thread-15::DEBUG::2016-01-15
16:27:15,096::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:27:15,119::health::86::health::(_check) Collected 167 objects
Thread-15::WARNING::2016-01-15
16:27:15,120::health::94::health::(_check) Found 2 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>',
'<libvirt.virDomain object at 0x7f79ec10cf90>']

vm is shutting down...

Thread-15::DEBUG::2016-01-15
16:28:15,120::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:28:15,144::health::86::health::(_check) Collected 2262 objects
Thread-15::WARNING::2016-01-15
16:28:15,144::health::94::health::(_check) Found 3 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>',
'<libvirt.virDomain object at 0x7f79ec10cf90>', '<libvirt.virDomain
object at 0x1f0a2d0>']

vm is down...

Thread-15::DEBUG::2016-01-15
16:29:15,145::health::84::health::(_check) Checking health
Thread-15::DEBUG::2016-01-15
16:29:15,167::health::86::health::(_check) Collected 128 objects
Thread-15::WARNING::2016-01-15
16:29:15,167::health::94::health::(_check) Found 3 uncollectable
objects: ['<libvirt.virDomain object at 0x1ef6550>',
'<libvirt.virDomain object at 0x7f79ec10cf90>', '<libvirt.virDomain
object at 0x1f0a2d0>']

When running with both https://gerrit.ovirt.org/51708 and
https://gerrit.ovirt.org/51630,
no object leak because of reference cycles and __del__.

I'll abandon https://gerrit.ovirt.org/51868 since it is not needed.

Nir



More information about the Devel mailing list