[ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls

Mon Jul 7 14:04:02 UTC 2014

----- Original Message -----
> From: "Nir Soffer" <nsoffer at redhat.com>
> To: "Francesco Romani" <fromani at redhat.com>
> Cc: devel at ovirt.org
> Sent: Friday, July 4, 2014 10:30:24 PM
> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of	stuck calls

> > Quick summary of the discussion on gerrit so far:
> > 1. extract the scheduling logic from the thread pool. Either add a separate
> > scheduler class
> >    or let the sampling task reschedule themselves after a succesfull
> >    completion.
> >    In any way the concept of  'periodic task', and the added complexity,
> >    isn't needed.
> 
> This will also allow task to change the sampling policy depending on the
> results.
> If some calls always fails, maybe we can run it less often.

This looks a feature worth having

> >    If the scheduler finds that any task started in the former pass (or even
> >    before!)
> >    has not yet completed, there is no point in keeping this task alive and
> >    it
> >    should be cancelled.
> 
> It should not do this. If we schedule task when it was finished, we cannot
> get
> into this state, that a new task started while the previous task is still
> running.

Good point and elegant solution indeed.

> > 
> > 4. the sampling task (or maybe the scheduler) can be smarter and halting
> > the
> > sample in presence of
> >    not responding calls for a given VM, granted the VM reports its
> >    'health'/responsiveness.
> 
> The scheduler should not do this. The object that ask scheduled the tasks
> should cancel them if the vm is not responding.

Yes, then we need a separate supervisor thread - or otherwise I don't yet see
how we can achieve this

> The current model get this right because all calls related to specific vm
> are run on the same vm thread, so stuck call will prevent all other calls
> from running.
> 
> Seems that this is an important requirement, if we have this issue - that
> all call realted specific vm may start block forever. Did we know that this
> happens?

Yes. In all the reported cases it was directly or indirectly storage being unavailable.
Directly: libvirt does some access (stat()) for example and gets stuck.
NFS soft mount doesn't always solve this.
Indirectly: qemu stuck in D state because it was doing I/O and the storage disapperead
underneath its foots.

> If one vm may stoop responding, causing all libvirt calls for this vm to
> block, then a thread pool with one connection per worker thread can lead
> to a failure when all connection happen to run a request that blocks
> the thread. If this is the case, then each task related to one vm must
> depend on other tasks and should not be skipped until the previous task
> returned, simulating the current threading model without creating 100's
> of threads.

Agreed, we should introduce this concept and this is lacking in my threadpool proposal.

> > In the draft currently published, I reluctantly added the *queue classes
> > and
> > I agree the periodic
> > task implementation is messy, so I'll be very happy to drop them.
> 
> Just to make it more clear - I think we should make simple components that do
> one thing well, not magic components that do everything, like a threadpool
> that
> does scheduling and other fancy stuff.

Sure! believe or not, the proposal I submitted was the simplest I could came of with :)

> > A. Let's cancel the stuck tasks.
> > If we move toward a libvirt connection pool, and we give each worker thread
> > in the sampling pool
> > a separate libvirt connection, hopefully read-only,
> 
> Why read only?

Application of the rule of least privilege.
Sampling doesn't need to change any VM/domain property, just to read stuff.

> > then we should be able to
> > cancel stuck task by
> > killing the worker's libvirt connection. We'll still need a (probably much
> > simpler) watchman/supervisor,
> > but no big deal here.
> > Libvirt allows to close a connection from a different thread.
> > I haven't actually tried to unstuck a blocked thread this way, but I have
> > no
> > reason to believe it
> > will not work.
> 
> Lets try?
> 
> You can block access to storage using iptables, which may cause the block
> related calls to stuck, and try to close a connection after few seconds from
> another thread.

Sure, is near to the top of my TODO list.

> > 
> > B. Let's keep around blocked threads
> > The code as it is just leaves a blocked libvirt call and the worker thread
> > that carried it frozen.
> > The stuck worker thread can be replaced up to a cap of frozen threads.
> > In this worst case scenario, we end up with one (blocked!) thread per VM,
> > as
> > it is today, and with
> > no sampling data.
> 
> In the worst case, each vm will cause all threads to be stuck on call related
> to this vm, since calls can run on any thread in the pool.

True, and this should be avoided any way.

> > On the other hand, by avoiding to reissue dangerous calls,
> 
> Which are the dangerous calls?
> 
> If they are related to storage domains, we already have a thread per each
> domain, so maybe they should run on the domain monitoring thread, no on
> libvirt
> threads.
> 
> We don't have any issue if a domain monitoring thread get stuck - this will
> simply make the domain unavailable after couple of minutes.

This is an interesting idea, see below for a description of the affected calls.

> > Thoughts and comments very welcome!
> 
> We don't know yet why libvirt calls may stuck, right?
> 
> libvirt is probably not accessing storage (which may get you in D state) but
> query qemu which does access storage. So libvirt should be able to abort such
> calls.
> 
> Lets make a list of calls that can stuck, and check with the libvirt
> developers
> what is the best way to handle such calls.

And here we go:

sampler                 period (runs every X seconds)
self._highWrite         2

self._sampleVmJobs      15

self._sampleBalloon     15

self._sampleCpu         15
self._sampleNet         15
self._sampleVcpuPinning 15
self._sampleCpuTune     15

self._updateVolumes     60
self._sampleDisk        60
self._sampleDiskLatency 60

The potentially blocking samplers are
_highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk

* _highWrite uses virDomainGetBlockInfo which, depending on the storage configuration,
could enter the QEMU monitor.
libvirt needs to do some direct access (stat) anyway, the, depending on the device:
- file-based devices: libvirt does the access directly
- block-based devices: libvirt enters the QEMU monitor to ask for the last block extent

* _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.

* _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should be safe.

* _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to enter
  the domain monitor.

* _sampleVmJobs uses virDomainBLockJobInfo, which needs to enter the QEMU monitor.
  However, this needs to run only if there are active block jobs.
  I don't have numbers, but I expect this sampler to be idle most of time.

* _sampleVcpuPinning uses virDomainGetVcpus which uses /proc and syscall (for affinity stuff),
  should be safe

* _sampleCpuTune uses virDomainSchedulerParameters, which uses cgroups, virDomainGetVcpusFlags
  and virDomainGetMetadata, which both access the domain definition (XML inside libvirt),
  should be safe

* _updateVolume uses irs.getVolumeSize which I guess it is safe

* _sampleDisk uses virDomainGetBlockStats, which needs to enter the QEMU monitor.

* _sampleDiskLatency uses virDomainGetBlockStatsFlags, which needs to enter the QEMU monitor.
  This call looks like a generalized version of the above, so I can maybe optimize the two
  somehow by issuing a single call to libvirt and gathering all the stats in one go.

_highWrite is on a class of its own. Could be a candidate to be moved,
at least partially (storage health monitoring) in the storage. The rest of its functionalty
may stay in virt, or may be moved to storage as well. Let's see.

In general, we can distinguish the calls in

* storage/VM health (prevent VM misbehaving on storage exausted): _highWrite
* storage status: _updateVolume, _sampleDisk, _sampleDiskLatency.
We can optimize the later two, looks wasteful to ask for data, discard some,
then ask again for the missing data. Need to check carefully if the API
allows to do both in one go, but looks feasible.
* balloon stats: _sampleBalloon: needs to enter the QEMU monitor
* everything else: does not touch disk nor qemu, so no special treatment here.

> If they tell us that closing a stuck connection will break libvirt we
> obsiously
> cannot do this and will have to wait until such calls return, and replace
> the stuck thread with another thread.

Will verify ASAP, and check with libvirt devs as well.

> First, can you make a list of libvirt calls that we make, and how much time
> each call takes on average? Lets have a patch that add this info - how much
> we waited for the response - before doing any design.

sure, I'll prepare a patch for this

> How many libvirt connections are we using today? Do we use a same connection
> from many threads?

This should be the case. Will have more precise information ASAP.

> Looking in the libvirt api, it looks like libvirt supports accessing a
> connection
> from many threads. It seems that responses can return in different order then
> the requests.
> http://libvirt.org/internals/rpc.html#apiclientdispatchex1
> 
> When one of the dangerous calls get stuck, does it stop all other requests
> on the connection on we simply have a thread that make a call and will never
> return, while other threads are happily sending requests and receiving
> responses on this connection?

Need to check this as well

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani