Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls

7 Jul 2014

      ----- Original Message -----
...
From: "Francesco Romani" <fromani@redhat.com>
To: devel@ovirt.org
Cc: "Nir Soffer" <nsoffer@redhat.com>
Sent: Monday, July 7, 2014 5:04:02 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of	stuck calls
----- Original Message -----
...
From: "Nir Soffer" <nsoffer@redhat.com>
To: "Francesco Romani" <fromani@redhat.com>
Cc: devel@ovirt.org
Sent: Friday, July 4, 2014 10:30:24 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
of	stuck calls
...
...
Quick summary of the discussion on gerrit so far:
1. extract the scheduling logic from the thread pool. Either add a
separate
scheduler class
   or let the sampling task reschedule themselves after a succesfull
   completion.
   In any way the concept of  'periodic task', and the added complexity,
   isn't needed.
This will also allow task to change the sampling policy depending on the
results.
If some calls always fails, maybe we can run it less often.
This looks a feature worth having
But lets take a note and forget about this - we should focus on simple solution
...
...
...
4. the sampling task (or maybe the scheduler) can be smarter and halting
the
sample in presence of
   not responding calls for a given VM, granted the VM reports its
   'health'/responsiveness.
The scheduler should not do this. The object that ask scheduled the tasks
should cancel them if the vm is not responding.
Yes, then we need a separate supervisor thread - or otherwise I don't yet see
how we can achieve this
...
The current model get this right because all calls related to specific vm
are run on the same vm thread, so stuck call will prevent all other calls
from running.
Seems that this is an important requirement, if we have this issue - that
all call realted specific vm may start block forever. Did we know that this
happens?
Yes. In all the reported cases it was directly or indirectly storage being
unavailable.
Directly: libvirt does some access (stat()) for example and gets stuck.
NFS soft mount doesn't always solve this.
Indirectly: qemu stuck in D state because it was doing I/O and the storage
disapperead
underneath its foots.
When this happens - does all calls related to specific vm can block,
or it is always the dangerous calls you listed bellow?
...
...
If one vm may stoop responding, causing all libvirt calls for this vm to
block, then a thread pool with one connection per worker thread can lead
to a failure when all connection happen to run a request that blocks
the thread. If this is the case, then each task related to one vm must
depend on other tasks and should not be skipped until the previous task
returned, simulating the current threading model without creating 100's
of threads.
Agreed, we should introduce this concept and this is lacking in my threadpool
proposal.
So basically the current threading model is the behavior we want?

If some call get stuck, stop sampling this vm. Continue when the 
call returns.

Michal? Federico?
...
...
...
A. Let's cancel the stuck tasks.
If we move toward a libvirt connection pool, and we give each worker
thread
in the sampling pool
a separate libvirt connection, hopefully read-only,
Why read only?
Application of the rule of least privilege.
Sampling doesn't need to change any VM/domain property, just to read stuff.
Good idea.
...
...
...
then we should be able to
cancel stuck task by
killing the worker's libvirt connection. We'll still need a (probably
much
simpler) watchman/supervisor,
but no big deal here.
Libvirt allows to close a connection from a different thread.
I haven't actually tried to unstuck a blocked thread this way, but I have
no
reason to believe it
will not work.
Lets try?
You can block access to storage using iptables, which may cause the block
related calls to stuck, and try to close a connection after few seconds
from
another thread.
Sure, is near to the top of my TODO list.
Lets wait with this. If we can continue to send other request on the same
connection while one call never return, I don't see a need to close it.
...
...
...
B. Let's keep around blocked threads
The code as it is just leaves a blocked libvirt call and the worker
thread
that carried it frozen.
The stuck worker thread can be replaced up to a cap of frozen threads.
In this worst case scenario, we end up with one (blocked!) thread per VM,
as
it is today, and with
no sampling data.
In the worst case, each vm will cause all threads to be stuck on call
related
to this vm, since calls can run on any thread in the pool.
True, and this should be avoided any way.
...
...
On the other hand, by avoiding to reissue dangerous calls,
Which are the dangerous calls?
If they are related to storage domains, we already have a thread per each
domain, so maybe they should run on the domain monitoring thread, no on
libvirt
threads.
We don't have any issue if a domain monitoring thread get stuck - this will
simply make the domain unavailable after couple of minutes.
This is an interesting idea, see below for a description of the affected
calls.
...
...
Thoughts and comments very welcome!
We don't know yet why libvirt calls may stuck, right?
libvirt is probably not accessing storage (which may get you in D state)
but
query qemu which does access storage. So libvirt should be able to abort
such
calls.
Lets make a list of calls that can stuck, and check with the libvirt
developers
what is the best way to handle such calls.
And here we go:
sampler                 period (runs every X seconds)
self._highWrite         2
self._sampleVmJobs      15
self._sampleBalloon     15
self._sampleCpu         15
self._sampleNet         15
self._sampleVcpuPinning 15
self._sampleCpuTune     15
self._updateVolumes     60
self._sampleDisk        60
self._sampleDiskLatency 60
The potentially blocking samplers are
_highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk
* _highWrite uses virDomainGetBlockInfo which, depending on the storage
configuration,
could enter the QEMU monitor.
libvirt needs to do some direct access (stat) anyway, the, depending on the
device:
- file-based devices: libvirt does the access directly
We call virDomainGetBlockInfo only for block storage domain and qcow format:

# vm.py:
2470     def _getExtendCandidates(self):                                                                                                                                                                           
2471         ret = []
2472 
2473         mergeCandidates = self._getLiveMergeExtendCandidates()
2474         for drive in self._devices[DISK_DEVICES]:
2475             if not drive.blockDev or drive.format != 'cow':
2476                 continue
2477 
2478             capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
...
- block-based devices: libvirt enters the QEMU monitor to ask for the last
block extent
This is our only usage.
...
* _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.
Are should be safe really safe? can you check with libvirt developers
about that?
...
* _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should
be safe.
* _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to
enter
  the domain monitor.
And the domain monitor may be stuck when storage is not available?
...
* _sampleVmJobs uses virDomainBLockJobInfo, which needs to enter the QEMU
monitor.
  However, this needs to run only if there are active block jobs.
  I don't have numbers, but I expect this sampler to be idle most of time.
Adam: This is related to live merge right?
...
* _sampleVcpuPinning uses virDomainGetVcpus which uses /proc and syscall (for
affinity stuff),
  should be safe
* _sampleCpuTune uses virDomainSchedulerParameters, which uses cgroups,
virDomainGetVcpusFlags
  and virDomainGetMetadata, which both access the domain definition (XML
  inside libvirt),
  should be safe
* _updateVolume uses irs.getVolumeSize which I guess it is safe
* _sampleDisk uses virDomainGetBlockStats, which needs to enter the QEMU
monitor.
* _sampleDiskLatency uses virDomainGetBlockStatsFlags, which needs to enter
the QEMU monitor.
  This call looks like a generalized version of the above, so I can maybe
  optimize the two
  somehow by issuing a single call to libvirt and gathering all the stats in
  one go.
_highWrite is on a class of its own. Could be a candidate to be moved,
at least partially (storage health monitoring) in the storage. The rest of
its functionalty
may stay in virt, or may be moved to storage as well. Let's see.
In general, we can distinguish the calls in
* storage/VM health (prevent VM misbehaving on storage exausted): _highWrite
* storage status: _updateVolume, _sampleDisk, _sampleDiskLatency.
We can optimize the later two, looks wasteful to ask for data, discard some,
then ask again for the missing data. Need to check carefully if the API
allows to do both in one go, but looks feasible.
* balloon stats: _sampleBalloon: needs to enter the QEMU monitor
* everything else: does not touch disk nor qemu, so no special treatment
here.
If we separate storage calls so vm sampling continue even if storage
cause qemu to block - do we gain anything? I guess this vm is 
not useful if its storage is not available, and will soon will
pause anyway.

Looks like the simplest solution is the best and fancy separation
of sampling calls is needed.

Nir