[ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls

Nir Soffer nsoffer at redhat.com
Fri Jul 4 20:30:24 UTC 2014


----- Original Message -----
> From: "Francesco Romani" <fromani at redhat.com>
> To: devel at ovirt.org
> Sent: Friday, July 4, 2014 5:48:59 PM
> Subject: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
> 
> Hi,
> 
> Nir has begun reviewing my draft patches about the thread pool and sampling
> refactoring (thanks!),
> and already suggested quite a few improvements, which I'd like to summarize
> 
> Quick links to the ongoing discussion:
> http://gerrit.ovirt.org/#/c/29191/8/lib/threadpool/worker.py,cm
> http://gerrit.ovirt.org/#/c/29190/4/lib/threadpool/README.rst,cm
> 
> Quick summary of the discussion on gerrit so far:
> 1. extract the scheduling logic from the thread pool. Either add a separate
> scheduler class
>    or let the sampling tasks reschedule themselves after a successful
>    completion.
>    In any case, the concept of a 'periodic task', and the added complexity,
>    isn't needed.

This will also allow tasks to change the sampling policy depending on the
results. If some call always fails, maybe we can run it less often.
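
For example, a rough sketch - it assumes a scheduler object with a
schedule(delay, callable) method; the names and the backoff policy are
made up:

    MAX_INTERVAL = 60  # made-up cap, in seconds

    class SamplingTask(object):
        """Reschedules itself after each run, backing off on failures."""

        def __init__(self, scheduler, sample_func, interval):
            self._scheduler = scheduler
            self._sample = sample_func
            self._interval = interval

        def __call__(self):
            try:
                self._sample()
            except Exception:
                # this call keeps failing - run it less often
                self._interval = min(self._interval * 2, MAX_INTERVAL)
            self._scheduler.schedule(self._interval, self)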

> 
> 2. drop all the *queue classes I've added, thus making the package simpler.
>    They are no longer needed since we removed the concept of a periodic task.
> 
> 3. have per-task timeouts, move the stuck task detection elsewhere, like in
> the worker thread, or
>    maybe better in the aforementioned scheduler.

The scheduler should not care about task status or results; it should only be
responsible for starting a task when it is due to run.
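
A minimal sketch of what I have in mind - the interface is made up, and a
real implementation would hand the callable to the thread pool instead of
running it inline:

    import heapq
    import itertools
    import threading
    import time

    class Scheduler(object):
        """Starts tasks when they should run; it does not know or care
        about task status or results."""

        def __init__(self):
            self._cond = threading.Condition()
            self._calls = []
            self._counter = itertools.count()
            thread = threading.Thread(target=self._loop)
            thread.daemon = True
            thread.start()

        def schedule(self, delay, func):
            with self._cond:
                heapq.heappush(self._calls,
                               (time.time() + delay, next(self._counter), func))
                self._cond.notify()

        def _loop(self):
            while True:
                with self._cond:
                    while not self._calls or self._calls[0][0] > time.time():
                        if self._calls:
                            self._cond.wait(self._calls[0][0] - time.time())
                        else:
                            self._cond.wait()
                    deadline, _, func = heapq.heappop(self._calls)
                # Run outside the lock; a real scheduler would dispatch
                # func to the thread pool instead of calling it here.
                func()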

>    If the scheduler finds that a task started in the previous pass (or even
>    before!)
>    has not yet completed, there is no point in keeping this task alive, and
>    it should be cancelled.

It should not do this. If we reschedule a task only after it has finished, we
can never get into the state where a new task starts while the previous one is
still running.

> 
> 4. the sampling task (or maybe the scheduler) can be smarter and halt the
> sampling in the presence of
>    non-responding calls for a given VM, granted the VM reports its
>    'health'/responsiveness.

The scheduler should not do this. The object that scheduled the tasks should
cancel them if the VM is not responding.

The current model gets this right because all calls related to a specific VM
run on the same VM thread, so a stuck call will prevent all other calls from
running.

This seems to be an important requirement, if we indeed have this issue - that
all calls related to a specific VM may block forever. Do we know that this
happens?

If one VM may stop responding, causing all libvirt calls for this VM to block,
then a thread pool with one connection per worker thread can fail when every
connection happens to run a request that blocks its thread. If this is the
case, then each task related to one VM must depend on the previous task for
that VM, and should not start until the previous task has returned, simulating
the current threading model without creating hundreds of threads.
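
A sketch of what this dependency could look like - the names are made up,
and such a gate would sit between the scheduler and the pool:

    import threading

    class VmTaskGate(object):
        """Skip a new call for a vm while the previous call for the
        same vm has not returned yet."""

        def __init__(self):
            self._lock = threading.Lock()
            self._busy = set()

        def run(self, vm_id, func):
            with self._lock:
                if vm_id in self._busy:
                    return  # previous call is still running (or stuck)
                self._busy.add(vm_id)
            try:
                func()
            finally:
                with self._lock:
                    self._busy.discard(vm_id)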

> 
> (Hopefully I haven't forgot anything big)
> 
> In the draft currently published, I reluctantly added the *queue classes and
> I agree the periodic
> task implementation is messy, so I'll be very happy to drop them.

Just to make it clearer: I think we should build simple components that do one
thing well, not magic components that do everything, like a thread pool that
also does scheduling and other fancy stuff.

> 
> However, a core question still holds: what to do in the presence of a stuck
> task?
> 
> I think it is worth discussing this topic on a medium friendlier than gerrit,
> as it is the single
> most important decision to make in the sampling refactoring.
> 
> It all boils down to:
> Should we just keep stuck threads around somewhere and wait? Should we cancel
> stuck tasks?
> 
> A. Let's cancel the stuck tasks.
> If we move toward a libvirt connection pool, and we give each worker thread
> in the sampling pool
> a separate libvirt connection, hopefully read-only, 

Why read-only?

> then we should be able to
> cancel a stuck task by
> killing the worker's libvirt connection. We'll still need a (probably much
> simpler) watchman/supervisor,
> but no big deal here.
> Libvirt allows closing a connection from a different thread.
> I haven't actually tried to unstick a blocked thread this way, but I have no
> reason to believe it
> will not work.

Let's try it?

You can block access to storage using iptables, which may cause the
block-related calls to get stuck, and then try to close the connection from
another thread after a few seconds.
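
A rough sketch of such an experiment - the domain name, disk name and timing
are made up, and blockInfo is just one example of a call that may block on
inaccessible storage:

    import threading
    import libvirt

    def try_to_unstick():
        conn = libvirt.open('qemu:///system')
        dom = conn.lookupByName('test-vm')  # made-up domain name

        # Close the connection from another thread while the main
        # thread is (hopefully) stuck in the call below.
        threading.Timer(5.0, conn.close).start()

        try:
            # May block when the domain's storage is not accessible,
            # e.g. after blocking it with iptables.
            print(dom.blockInfo('vda', 0))
        except libvirt.libvirtError as e:
            print('call aborted: %s' % e)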

> 
> B. Let's keep blocked threads around
> The code as it is just leaves a blocked libvirt call and the worker thread
> that carried it frozen.
> The stuck worker thread can be replaced up to a cap of frozen threads.
> In this worst case scenario, we end up with one (blocked!) thread per VM, as
> it is today, and with
> no sampling data.

In the worst case, one VM can cause all threads to get stuck on calls related
to it, since calls can run on any thread in the pool.

> 
> I believe that #A has some drawbacks which we risk overlooking, and at the
> same time #B has some merits.
> 
> Let me explain:
> The hardest case is a call blocked in the kernel in D state. Libvirt has no
> more room than VDSM
> to unblock it; and libvirt itself *has* a pool of resources (threads in this
> case) which can be depleted
> by stuck calls. Actually, retrying a failed task may deplete its pool
> even faster[1].
> 
> I'm not happy to just push this problem down the stack, as it looks to me
> that we gain
> very little by doing so. VDSM itself surely stays cleaner, but the
> VDS/hypervisor host on the whole
> improves just a bit: libvirt scales better, and that gives us some more room.
> 
> On the other hand, by avoiding to reissue dangerous calls, 

Which are the dangerous calls?

If they are related to storage domains, we already have a thread per domain,
so maybe they should run on the domain monitoring thread, not on libvirt
threads.

We don't have any issue if a domain monitoring thread gets stuck - this will
simply make the domain unavailable after a couple of minutes.

> I believe we make better use of the host resources in general. Actually,
> keeping blocked threads around is a side effect of not reattempting blocked
> calls. Moreover, keeping the blocked thread around has a significant benefit:
> we can discover at the earliest moment when it is safe again to do the
> blocked call, because the blocked call itself returns and we can track this
> event! (and of course drop the now stale result). Otherwise, if we drop the
> connection, we'll lose this event and we have no option but to try again,
> hoping for the best[2]

This is a good point.

> 
> I know the #B approach is not the cleanest, but I think it has slightly more
> appeal, especially
> on the libvirt depletion front.
> 
> Thoughts and comments very welcome!

We don't know yet why libvirt calls may get stuck, right?

libvirt is probably not accessing storage (which may get you into D state) but
querying qemu, which does access storage. So libvirt should be able to abort
such calls.

Let's make a list of calls that can get stuck, and check with the libvirt
developers what the best way to handle such calls is.

If they tell us that closing a stuck connection will break libvirt, we
obviously cannot do this, and will have to wait until such calls return,
replacing the stuck thread with another thread.

But I think that we are trying to solve the problem too quickly before we
understand it (at least I don't yet understand the libvirt side).

First, can you make a list of the libvirt calls that we make, and how much
time each call takes on average? Let's have a patch that adds this info - how
long we waited for each response - before doing any design.
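
Something as simple as a decorator around our libvirt wrapper functions would
do for a first measurement (a sketch, the names are made up):

    import functools
    import logging
    import time

    def timed(func):
        """Log how long we waited for each libvirt response."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                logging.debug('%s took %.3f seconds',
                              func.__name__, time.time() - start)
        return wrapper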

How many libvirt connections are we using today? Do we use the same connection
from many threads?

Looking at the libvirt API, it looks like libvirt supports accessing a
connection from many threads, and it seems that responses can return in a
different order than the requests:
http://libvirt.org/internals/rpc.html#apiclientdispatchex1

When one of the dangerous calls gets stuck, does it stop all other requests on
the connection, or do we simply have a thread that made a call and will never
return, while other threads happily send requests and receive responses on
this connection?

Nir


