[ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls

Wed Jul 9 14:57:53 UTC 2014

----- Original Message -----
> From: "Francesco Romani" <fromani at redhat.com>
> To: "Nir Soffer" <nsoffer at redhat.com>
> Cc: devel at ovirt.org, "Federico Simoncelli" <fsimonce at redhat.com>, "Michal Skrivanek" <mskrivan at redhat.com>, "Adam
> Litke" <alitke at redhat.com>
> Sent: Wednesday, July 9, 2014 2:22:05 PM
> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
> 
> ----- Original Message -----
> > From: "Nir Soffer" <nsoffer at redhat.com>
> > To: "Francesco Romani" <fromani at redhat.com>
> > Cc: devel at ovirt.org, "Federico Simoncelli" <fsimonce at redhat.com>, "Michal
> > Skrivanek" <mskrivan at redhat.com>, "Adam
> > Litke" <alitke at redhat.com>
> > Sent: Monday, July 7, 2014 4:53:29 PM
> > Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
> > of stuck calls
> > > > If one vm may stop responding, causing all libvirt calls for this vm
> > > > to block, then a thread pool with one connection per worker thread can
> > > > lead to a failure when all connections happen to run a request that
> > > > blocks the thread. If this is the case, then each task related to one
> > > > vm must depend on other tasks and should not be skipped until the
> > > > previous task returned, simulating the current threading model without
> > > > creating hundreds of threads.
> > > 
> > > Agreed, we should introduce this concept and this is lacking in my
> > > threadpool proposal.
> > 
> > So basically the current threading model is the behavior we want?
> > 
> > If some call gets stuck, stop sampling this vm. Continue when the
> > call returns.
> > 
> > Michal? Federico?
> 
> Yep - but with fewer threads, and surely with a constant number of them.
> Your schedule library (review in my queue at very high priority) is indeed
> a nice step in this direction.
> 
> Waiting for Federico's ack.

That looks good. Now I would like to summarize a few things.

We know that when a request gets stuck on a vm, the subsequent ones will also
get stuck (at least until their timeout is up, except for the first one that
could stay there forever).

We want a limited number of threads polling the statistics (trying to match
the number of threads that libvirt has).

Given those two assumptions we want a thread pool of workers that pick up
jobs *per-vm*. The jobs should be smart enough to:

- understand what samples they have to take in that cycle (cpu? network? etc.)
- resubmit themselves in the queue

This ensures that the queue holds at most one job per vm, and that a job
which gets stuck is not re-submitted (so no other worker will get stuck on
the same vm).
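
To make the idea concrete, here is a rough sketch of such a pool using only
the Python standard library. It is not VDSM code: VmSamplingJob, INTERVALS,
worker, run_demo and the sampler callable are all hypothetical names, the
sampling intervals are made up, and a blocked libvirt call is simulated with
a threading.Event. The point is only that a job puts itself back in the
queue after its calls have returned, so a stuck vm occupies one worker and
is never queued a second time.

```python
import queue
import threading


class VmSamplingJob:
    """Hypothetical per-VM job; at most one instance per VM is ever queued."""

    # Illustrative intervals, in cycles (assumed, not VDSM's real values).
    INTERVALS = {"cpu": 1, "network": 2, "disk": 4}

    def __init__(self, vm_id, sampler, max_cycles):
        self.vm_id = vm_id
        self.sampler = sampler        # callable(vm_id, kind); may block
        self.max_cycles = max_cycles  # bound so the sketch terminates
        self.cycle = 0
        self.done = threading.Event()

    def due_samples(self):
        # Decide which samples have to be taken in the current cycle.
        return [kind for kind, every in sorted(self.INTERVALS.items())
                if self.cycle % every == 0]

    def run(self, jobs):
        for kind in self.due_samples():
            self.sampler(self.vm_id, kind)  # a stuck call blocks right here
        self.cycle += 1
        if self.cycle < self.max_cycles:
            jobs.put(self)  # resubmit only after all calls have returned
        else:
            self.done.set()


def worker(jobs):
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut this worker down
            return
        job.run(jobs)


def run_demo(workers=2, cycles=4):
    """Run two healthy VMs and one stuck VM; needs at least two workers."""
    samples = []
    lock = threading.Lock()
    release = threading.Event()

    def sampler(vm_id, kind):
        with lock:
            samples.append((vm_id, kind))
        if vm_id == "vm-stuck":
            release.wait()  # simulates a call that does not return

    stuck = VmSamplingJob("vm-stuck", sampler, cycles)
    healthy = [VmSamplingJob(name, sampler, cycles)
               for name in ("vm-a", "vm-b")]

    jobs = queue.Queue()
    for job in [stuck] + healthy:
        jobs.put(job)
    threads = [threading.Thread(target=worker, args=(jobs,))
               for _ in range(workers)]
    for t in threads:
        t.start()

    for job in healthy:
        job.done.wait()
    # Taken while vm-stuck is still blocked inside its first call.
    result = {"stuck_cycle": stuck.cycle, "queued": jobs.qsize()}

    release.set()  # the "stuck" call finally returns
    stuck.done.wait()
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    result["samples"] = list(samples)
    return result
```

With two workers, one of them blocks on vm-stuck while the other keeps
sampling vm-a and vm-b; the queue is empty while the stuck call is pending,
and vm-stuck resumes its cycles only once that call returns. The sketch
leaves out timeouts and any pacing between cycles.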

Additionally I think someone mentioned re-connection to libvirt in case of
stuck threads. I actually want to discourage (or really minimize) this
behavior because I can't think of a case where it would improve the situation
(it may just end up generating a large number of zombie threads on the libvirt
side).

-- 
Federico