----- Original Message -----
From: "Francesco Romani" <fromani(a)redhat.com>
To: "Nir Soffer" <nsoffer(a)redhat.com>
Cc: devel(a)ovirt.org, "Federico Simoncelli" <fsimonce(a)redhat.com>,
"Michal Skrivanek" <mskrivan(a)redhat.com>, "Adam
Litke" <alitke(a)redhat.com>
Sent: Wednesday, July 9, 2014 2:22:05 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck
calls
----- Original Message -----
> From: "Nir Soffer" <nsoffer(a)redhat.com>
> To: "Francesco Romani" <fromani(a)redhat.com>
> Cc: devel(a)ovirt.org, "Federico Simoncelli" <fsimonce(a)redhat.com>,
> "Michal Skrivanek" <mskrivan(a)redhat.com>,
> "Adam Litke" <alitke(a)redhat.com>
> Sent: Monday, July 7, 2014 4:53:29 PM
> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
> of stuck calls
> > > If one vm may stop responding, causing all libvirt calls for this vm
> > > to block, then a thread pool with one connection per worker thread can
> > > lead to a failure when all connections happen to run a request that
> > > blocks the thread. If this is the case, then each task related to one
> > > vm must depend on other tasks and should not be skipped until the
> > > previous task returned, simulating the current threading model without
> > > creating 100's of threads.
> >
> > Agreed, we should introduce this concept and this is lacking in my
> > threadpool proposal.
>
> So basically the current threading model is the behavior we want?
>
> If some call gets stuck, stop sampling this vm. Continue when the
> call returns.
>
> Michal? Federico?
Yep - but with fewer threads, and surely with a constant number of them.
Your schedule library (review in my queue at very high priority) is indeed
a nice step in this direction.
Waiting for Federico's ack.
That looks good. Now I would like to summarize a few things.
We know that when a request gets stuck on a vm, the subsequent ones will
also get stuck (at least until their timeout is up, except for the first one,
which could stay there forever).
We want a limited number of threads polling the statistics (trying to match
the number of threads that libvirt has).
Given those two assumptions we want a thread pool of workers that are picking
up jobs *per-vm*. The jobs should be smart enough to:
- understand what samples they have to take in that cycle (cpu? network? etc.)
- resubmit themselves in the queue
Now this will ensure that in the queue there's only one job per-vm and if it
gets stuck it is not re-submitted (no other worker will get stuck).
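A minimal sketch of the per-vm job idea above, in Python. All names here
(VmSamplingJob, INTERVALS, run_demo, the "vm-stuck" id) are hypothetical and
are not taken from VDSM or from the schedule library; the blocking libvirt
call is simulated with an event that is never set. As a simplification, a job
re-queues itself immediately, where a real one would wait for its next
sampling interval.

```python
import queue
import threading


class VmSamplingJob:
    """One job per vm; it sits in the queue at most once at any time."""

    # Hypothetical sampling periods, expressed in cycles.
    INTERVALS = {"cpu": 1, "network": 2}

    def __init__(self, vm_id, sample, max_cycles):
        self.vm_id = vm_id
        self.sample = sample          # callable that may block on a stuck vm
        self.max_cycles = max_cycles  # bounds the demo; real jobs run forever
        self.cycle = 0
        self.taken = []
        self.finished = threading.Event()

    def due_samples(self):
        # The job itself decides which samples are due in this cycle.
        return [name for name, every in self.INTERVALS.items()
                if self.cycle % every == 0]

    def run(self, jobs):
        for name in self.due_samples():
            self.sample(self.vm_id, name)
            self.taken.append((self.cycle, name))
        self.cycle += 1
        if self.cycle < self.max_cycles:
            # Resubmit only after the samples returned: a stuck job never
            # reaches this line, so no other worker can pick up this vm.
            jobs.put(self)
        else:
            self.finished.set()


def worker(jobs):
    while True:
        job = jobs.get()
        job.run(jobs)


def run_demo(num_workers=2, max_cycles=4):
    never_set = threading.Event()

    def sample(vm_id, name):
        if vm_id == "vm-stuck":
            never_set.wait()  # stands in for a call that never returns

    jobs = queue.Queue()
    vms = {vm_id: VmSamplingJob(vm_id, sample, max_cycles)
           for vm_id in ("vm-stuck", "vm-a", "vm-b")}
    for job in vms.values():
        jobs.put(job)
    for _ in range(num_workers):
        # Daemon threads, so the stuck worker does not prevent exit.
        threading.Thread(target=worker, args=(jobs,), daemon=True).start()
    for vm_id in ("vm-a", "vm-b"):
        vms[vm_id].finished.wait(timeout=5)
    return {vm_id: list(job.taken) for vm_id, job in vms.items()}
```

With two workers, the stuck vm ties up exactly one of them, and the other
keeps sampling the healthy vms. The pool size stays constant no matter how
many vms are defined.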
Additionally I think someone mentioned re-connection to libvirt in case of
stuck threads. I actually want to discourage (or really minimize) this
behavior because I can't think of a case where it would improve the situation
(it may just end up generating a large number of zombie threads on the libvirt
side).
--
Federico