From: "Francesco Romani" <fromani(a)redhat.com>
To: devel(a)ovirt.org
Cc: "Nir Soffer" <nsoffer(a)redhat.com>, "Michal Skrivanek"
<mskrivan(a)redhat.com>, "Federico Simoncelli"
<fsimonce(a)redhat.com>, "Saggi Mizrahi" <smizrahi(a)redhat.com>,
"Dan Kenigsberg" <danken(a)redhat.com>
Sent: Saturday, July 12, 2014 1:59:22 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck
calls
(I'm continuing from here but this probably deserves a new thread. However.)
----- Original Message -----
> From: "Federico Simoncelli" <fsimonce(a)redhat.com>
> To: devel(a)ovirt.org
> Cc: "Nir Soffer" <nsoffer(a)redhat.com>, "Michal Skrivanek"
> <mskrivan(a)redhat.com>, "Adam Litke" <alitke(a)redhat.com>,
> "Francesco Romani" <fromani(a)redhat.com>
> Sent: Wednesday, July 9, 2014 4:57:53 PM
> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
> of stuck calls
> > > So basically the current threading model is the behavior we want?
> > >
> > > If some call get stuck, stop sampling this vm. Continue when the
> > > call returns.
> > >
> > > Michal? Federico?
> >
> > Yep - but with fewer threads, and surely with a constant number of them.
> > Your schedule library (review in my queue at very high priority) is indeed
> > a nice step in this direction.
> >
> > Waiting for Federico's ack.
>
> That looks good. Now I would like to summarize a few things.
>
> We know that when a request gets stuck on a vm, the subsequent ones will
> also get stuck (at least until their timeout is up, except for the first
> one, which could stay there forever).
>
> We want a limited number of threads polling the statistics (trying to match
> the number of threads that libvirt has).
>
> Given those two assumptions we want a thread pool of workers that are
> picking up jobs *per-vm*. The jobs should be smart enough to:
>
> - understand what samples they have to take in that cycle (cpu? network?
>   etc.)
> - resubmit themselves in the queue
>
> Now this will ensure that in the queue there's only one job per-vm and if
> it gets stuck it is not re-submitted (no other worker will get stuck).
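The per-vm job idea quoted above can be sketched roughly as follows. This is
only an illustration, not the code in the gerrit draft: VmSamplingJob, PERIODS,
worker and run are hypothetical names, the sampling periods are made up, and
the job requeues itself immediately instead of going through a scheduler that
would delay it until the next interval.

```python
import queue
import threading


class VmSamplingJob:
    """Hypothetical per-vm job: picks the samples due this cycle, then
    resubmits itself, so the queue never holds more than one job per vm."""

    # sample name -> take it every N cycles (illustrative values only)
    PERIODS = {"cpu": 1, "network": 2, "disk": 4}

    def __init__(self, vm_id, jobs, max_cycles):
        self.vm_id = vm_id
        self.jobs = jobs
        self.max_cycles = max_cycles
        self.cycle = 0
        self.taken = []  # (cycle, sample name) pairs, for demonstration

    def due_samples(self):
        return sorted(name for name, period in self.PERIODS.items()
                      if self.cycle % period == 0)

    def __call__(self):
        for name in self.due_samples():
            # A real job would call into libvirt here and could block.
            self.taken.append((self.cycle, name))
        self.cycle += 1
        # Resubmit only after this cycle completed: a stuck job is never
        # queued twice, so it can block at most one worker.
        if self.cycle < self.max_cycles:
            self.jobs.put(self)


def worker(jobs):
    while True:
        job = jobs.get()
        try:
            if job is None:  # sentinel: shut down this worker
                return
            job()
        finally:
            jobs.task_done()


def run(vm_ids, workers=2, max_cycles=4):
    jobs = queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs,), daemon=True)
               for _ in range(workers)]
    for t in threads:
        t.start()
    vm_jobs = [VmSamplingJob(vm_id, jobs, max_cycles) for vm_id in vm_ids]
    for job in vm_jobs:
        jobs.put(job)
    jobs.join()  # each job requeues itself before task_done() is called
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    return {job.vm_id: job.taken for job in vm_jobs}
```

Calling run(["vm1", "vm2", "vm3"]) samples three vms with two workers. Since a
job is either in the queue or held by exactly one worker, no vm is ever sampled
by two threads at once.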
In the last few days I was thinking really hard and long about our last
discussions, feedback and proposals, and how to properly fit all the pieces
together. Michal and I also had a chat about these topics on Friday, and
eventually I came up with this new draft
http://gerrit.ovirt.org/#/c/29977
(yes, that's it, just this) which builds on Nir's Schedule and the existing
Threadpool hidden inside vdsm/storage, and which I believe provides a much,
much better ground for further development or discussion.
Driving forces behind this new draft:
- minimize bloat.
- minimize changes.
- separate concerns nicely (Scheduler schedules, threadpool executes,
  Sampling cares about the actual sampling only).
- leverage existing infrastructure as much as possible; avoid introducing
  new fancy stuff unless absolutely needed.
And here it is. Almost all the concepts and requirements we discussed are
there. The thing which is lacking here is strong isolation among
VMs/samplings.
This new concept does nothing to recover stuck worker threads: if the pool
is exhausted, everything eventually stops, after a few sampling intervals.
Stuck jobs are detected and the corresponding VMs are marked unresponsive
(leveraging existing infrastructure).
When (if?) stuck jobs eventually resume working, everything else restarts as
well.
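As a rough illustration of the stuck-job detection described above, the sketch
below records when a per-vm call starts and reports the vms whose call has been
in flight longer than a timeout. It is not the existing VDSM mechanism:
StuckCallTracker and its methods are hypothetical names, and the timeout is an
arbitrary parameter.

```python
import threading
import time


class StuckCallTracker:
    """Hypothetical helper: tracks in-flight per-vm calls and reports
    the vms whose call has exceeded the timeout."""

    def __init__(self, timeout, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # injectable, which makes testing easier
        self._lock = threading.Lock()
        self._started = {}  # vm_id -> start time of the in-flight call

    def begin(self, vm_id):
        """Called by a worker right before the potentially blocking call."""
        with self._lock:
            self._started[vm_id] = self._clock()

    def end(self, vm_id):
        """Called by a worker when the call returns, however late."""
        with self._lock:
            self._started.pop(vm_id, None)

    def unresponsive(self):
        """Return the sorted ids of the vms with a call in flight for too long."""
        now = self._clock()
        with self._lock:
            return sorted(vm_id for vm_id, start in self._started.items()
                          if now - start > self._timeout)
```

A periodic task could call unresponsive() and mark those vms accordingly. When
a stuck call finally returns, end() drops the vm from the list, which matches
the "everything else restarts" behaviour.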
The changes are minimal, and there is still room for refactoring and cleanup,
but I believe the design is nicer and cleaner.
Further steps:
* Replace the existing thread pool with a fancier one which can replace
  stuck threads, or dynamically resize itself, to achieve better isolation
  among VMs or jobs?
* Split the new VmStatsCollector class into smaller components?
* Stale data detection. Planned but not yet there; I just need to figure out
  how to properly fit it into the AdvancedStatsFunction windowing sample.
  Shouldn't be a big deal, however.
I also already have quite a few cleanup patches for the existing threadpool
and for the sampling code in the queue; some are on gerrit, some are not.
I think most of them can wait until we agree on the overall design.
Nir also provided further suggestions (thanks for that!) and possible design
alternatives which I'm now evaluating carefully.
I agree with Federico and you - I think this is the way we should explore.
But I don't understand the way you are implementing this using the scheduler
in
, and it seems that this does not ensure that
every vm has only one sampling task running at the same time.
I started to work on a prototype during the last week and I think that this
is the right way to implement it. Please check this patch:
This uses the current storage thread pool, but I don't think it is good enough.
I think we should continue with
so we can handle
stuck worker threads without decreasing the work force of the thread pool.
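To illustrate what "handling stuck worker threads without decreasing the work
force" could look like, here is a minimal sketch. It is not the code from any
of the patches mentioned in this thread: ReplacingPool and replace_stuck are
hypothetical names, and the policy (spawn a replacement once a task runs longer
than a timeout, let the stuck thread exit if it ever returns) is an assumption.

```python
import queue
import threading
import time


class ReplacingPool:
    """Hypothetical pool that keeps a constant number of usable workers
    by replacing the ones stuck in a long-running call."""

    def __init__(self, size, timeout):
        self._tasks = queue.Queue()
        self._timeout = timeout
        self._lock = threading.Lock()
        self._busy_since = {}  # worker thread -> start time of its task
        self.replaced = 0
        for _ in range(size):
            self._spawn()

    def _spawn(self):
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, func):
        self._tasks.put(func)

    def _run(self):
        me = threading.current_thread()
        while True:
            func = self._tasks.get()
            with self._lock:
                self._busy_since[me] = time.monotonic()
            try:
                func()
            except Exception:
                pass  # a real pool would log the failure
            with self._lock:
                if self._busy_since.pop(me, None) is None:
                    # Our entry is gone: we were declared stuck and a
                    # replacement is already running, so just exit.
                    return

    def replace_stuck(self):
        """Start one new worker for each worker busy longer than the
        timeout; meant to be called periodically by a watchdog."""
        now = time.monotonic()
        with self._lock:
            stuck = [w for w, start in self._busy_since.items()
                     if now - start > self._timeout]
            for w in stuck:
                del self._busy_since[w]
        for _ in stuck:
            self._spawn()
        self.replaced += len(stuck)
        return len(stuck)
```

The stuck thread itself cannot be killed from Python, so it stays blocked until
the underlying call returns. The pool only guarantees that the number of workers
able to pick up new tasks stays constant, at the cost of extra blocked threads
while calls are stuck.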
Nir