Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls

9 Jul 2014

      ----- Original Message -----
...
From: "Francesco Romani" <fromani@redhat.com>
To: "Nir Soffer" <nsoffer@redhat.com>
Cc: devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com>
Sent: Wednesday, July 9, 2014 2:22:05 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
...
From: "Nir Soffer" <nsoffer@redhat.com>
To: "Francesco Romani" <fromani@redhat.com>
Cc: devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com>
Sent: Monday, July 7, 2014 4:53:29 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
...
...
If one vm may stop responding, causing all libvirt calls for this vm to
block, then a thread pool with one connection per worker thread can lead
to a failure when all connections happen to run a request that blocks
the thread. If this is the case, then each task related to one vm must
depend on other tasks and should not be skipped until the previous task
returned, simulating the current threading model without creating hundreds
of threads.
Agreed, we should introduce this concept; it is lacking in my threadpool
proposal.
So basically the current threading model is the behavior we want?
If some call gets stuck, stop sampling this vm. Continue when the
call returns.
Michal? Federico?
Yep - but with fewer threads, and surely with a constant number of them.
Your schedule library (review in my queue at very high priority) is indeed
a nice step in this direction.
Waiting for Federico's ack.
That looks good. Now I would like to summarize a few things.

We know that when a request gets stuck on a vm, the subsequent ones will
also get stuck (at least until their timeout is up, except for the first one,
which could stay there forever).

We want a limited number of threads polling the statistics (trying to match
the number of threads that libvirt has).

Given those two assumptions we want a thread pool of workers that are picking
up jobs *per-vm*. The jobs should be smart enough to:

- understand what samples they have to take in that cycle (cpu? network? etc.)
- resubmit themselves in the queue

This ensures that the queue holds at most one job per vm, and that a job
which gets stuck is not re-submitted (so no other worker will get stuck on it).
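
To make the idea concrete, here is a minimal Python sketch of such a pool.
It is not VDSM code: VmSamplingJob, worker, start_pool and demo are
hypothetical names, the samplers are plain functions standing in for libvirt
calls, and a job resubmits itself immediately instead of waiting for its next
sampling interval. It only illustrates that a job which is stuck never
re-enters the queue, so it occupies a single worker.

```python
import queue
import threading


class VmSamplingJob:
    """One job per VM (hypothetical sketch, not the VDSM implementation)."""

    def __init__(self, vm_id, samplers, max_cycles):
        # samplers maps a sample name to (period_in_cycles, callable).
        self.vm_id = vm_id
        self.samplers = samplers
        self.max_cycles = max_cycles
        self.cycle = 0
        self.results = []
        self.done = threading.Event()

    def due_samples(self):
        # Decide which samples (cpu? network? ...) to take in this cycle.
        return [name for name, (period, _) in sorted(self.samplers.items())
                if self.cycle % period == 0]

    def run(self, jobs):
        for name in self.due_samples():
            func = self.samplers[name][1]
            # This call may block; if it does, the job is never resubmitted.
            self.results.append((self.cycle, name, func(self.vm_id)))
        self.cycle += 1
        if self.cycle < self.max_cycles:
            jobs.put(self)  # resubmit only after every call has returned
        else:
            self.done.set()


def worker(jobs):
    while True:
        job = jobs.get()
        if job is None:  # sentinel: stop this worker
            return
        job.run(jobs)


def start_pool(jobs, size):
    # Constant number of worker threads, regardless of the number of VMs.
    threads = [threading.Thread(target=worker, args=(jobs,), daemon=True)
               for _ in range(size)]
    for thread in threads:
        thread.start()
    return threads


def demo():
    jobs = queue.Queue()
    unblock = threading.Event()

    def cpu(vm_id):
        return "cpu:" + vm_id

    def net(vm_id):
        return "net:" + vm_id

    def stuck_cpu(vm_id):
        # Simulates a call that hangs until the VM responds again.
        unblock.wait(timeout=5)
        return "cpu:" + vm_id

    healthy = [VmSamplingJob(vm, {"cpu": (1, cpu), "net": (2, net)}, 4)
               for vm in ("vm-a", "vm-b")]
    stuck = VmSamplingJob("vm-stuck", {"cpu": (1, stuck_cpu)}, 1)

    jobs.put(stuck)
    for job in healthy:
        jobs.put(job)
    threads = start_pool(jobs, 2)

    # One worker is blocked on vm-stuck; the other one serves both healthy VMs.
    finished = all(job.done.wait(timeout=5) for job in healthy)
    results_while_blocked = list(stuck.results)

    unblock.set()  # the stuck call returns, sampling of that VM continues
    stuck.done.wait(timeout=5)
    for _ in threads:
        jobs.put(None)
    return finished, healthy, results_while_blocked, stuck
```

Calling demo() runs two workers against three VMs, one of which hangs. The
two healthy VMs are sampled to completion while the stuck job holds exactly
one worker and produces its sample only once the blocked call returns.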

Additionally I think someone mentioned re-connection to libvirt in case of
stuck threads. I actually want to discourage (or really minimize) this
behavior because I can't think of a case where it would improve the situation
(it may just end up generating a large number of zombie threads on the libvirt
side).

-- 
Federico