[ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Michal Skrivanek
mskrivan at redhat.com
Mon Jul 7 15:04:19 UTC 2014
On Jul 7, 2014, at 16:53 , Nir Soffer <nsoffer at redhat.com> wrote:
> ----- Original Message -----
>> From: "Francesco Romani" <fromani at redhat.com>
>> To: devel at ovirt.org
>> Cc: "Nir Soffer" <nsoffer at redhat.com>
>> Sent: Monday, July 7, 2014 5:04:02 PM
>> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
>>
>> ----- Original Message -----
>>> From: "Nir Soffer" <nsoffer at redhat.com>
>>> To: "Francesco Romani" <fromani at redhat.com>
>>> Cc: devel at ovirt.org
>>> Sent: Friday, July 4, 2014 10:30:24 PM
>>> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
>>> of stuck calls
>>
>>>> Quick summary of the discussion on gerrit so far:
>>>> 1. extract the scheduling logic from the thread pool. Either add a
>>>> separate scheduler class, or let the sampling tasks reschedule
>>>> themselves after a successful completion.
>>>> Either way, the concept of a 'periodic task', and the added
>>>> complexity, isn't needed.
>>>
>>> This will also allow tasks to change the sampling policy depending on
>>> their results.
>>> If some call always fails, maybe we can run it less often.
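>>> Something like this, as a rough sketch - Scheduler.schedule(delay, callable)
>>> and the pool interface are made up here:
>>>
>>>     class SamplingTask(object):
>>>         def __init__(self, scheduler, pool, sample_func, period):
>>>             self._scheduler = scheduler
>>>             self._pool = pool            # thread pool executing the calls
>>>             self._sample = sample_func
>>>             self._period = period
>>>             self._failures = 0
>>>
>>>         def __call__(self):
>>>             try:
>>>                 self._sample()
>>>                 self._failures = 0
>>>                 delay = self._period
>>>             except Exception:
>>>                 # this call keeps failing - back off and run it less often
>>>                 self._failures += 1
>>>                 delay = min(self._period * (self._failures + 1),
>>>                             10 * self._period)
>>>             # reschedule only after completion, so the pool itself has
>>>             # no notion of a 'periodic task'
>>>             self._scheduler.schedule(
>>>                 delay, lambda: self._pool.submit(self))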
>>
>> This looks like a feature worth having
>
> But let's take a note and forget about this - we should focus on a simple solution
>
>>>> 4. the sampling task (or maybe the scheduler) can be smarter and halt
>>>> the sampling in the presence of non-responding calls for a given VM,
>>>> provided the VM reports its 'health'/responsiveness.
>>>
>>> The scheduler should not do this. The object that scheduled the tasks
>>> should cancel them if the vm is not responding.
>>
>> Yes, but then we need a separate supervisor thread - otherwise I don't yet see
>> how we can achieve this
>>
>>> The current model gets this right because all calls related to a specific vm
>>> are run on the same vm thread, so a stuck call will prevent all other calls
>>> from running.
>>>
>>> This seems to be an important requirement, if we have this issue - that
>>> all calls related to a specific vm may start to block forever. Do we know
>>> that this happens?
>>
>> Yes. In all the reported cases it was, directly or indirectly, storage being
>> unavailable.
>> Directly: libvirt does some access (stat(), for example) and gets stuck.
>> NFS soft mount doesn't always solve this.
>> Indirectly: qemu gets stuck in D state because it was doing I/O and the
>> storage disappeared underneath it.
>
> When this happens - can all calls related to a specific vm block,
> or is it always the dangerous calls you listed below?
the storage stats are "dangerous", but ultimately anything can happen...
>
>>
>>> If one vm may stop responding, causing all libvirt calls for this vm to
>>> block, then a thread pool with one connection per worker thread can lead
>>> to a failure when all connections happen to run a request that blocks
>>> the thread. If this is the case, then each task related to one vm must
>>> depend on the other tasks and should not be started until the previous task
>>> has returned, simulating the current threading model without creating 100's
>>> of threads.
>>
>> Agreed, we should introduce this concept; it is currently lacking in my
>> threadpool proposal.
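>> Roughly what I have in mind - only a sketch, the names are made up:
>>
>>     import threading
>>
>>     class VmSampler(object):
>>         """Serialize sampling calls per VM: while one call for this VM
>>         is still running (possibly stuck), further calls are skipped."""
>>
>>         def __init__(self, vm):
>>             self._vm = vm
>>             self._lock = threading.Lock()
>>
>>         def __call__(self, sample_func):
>>             if not self._lock.acquire(False):
>>                 # the previous call for this VM has not returned yet:
>>                 # skip, simulating today's one-thread-per-VM behaviour
>>                 return
>>             try:
>>                 sample_func(self._vm)
>>             finally:
>>                 self._lock.release()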
>
> So basically the current threading model is the behavior we want?
>
> If some call gets stuck, stop sampling this vm. Continue when the
> call returns.
for stats - yes
>
> Michal? Federico?
>
>>>> A. Let's cancel the stuck tasks.
>>>> If we move toward a libvirt connection pool, and we give each worker
>>>> thread in the sampling pool a separate libvirt connection, hopefully
>>>> read-only,
>>>
>>> Why read only?
>>
>> Application of the principle of least privilege.
>> Sampling doesn't need to change any VM/domain property, just to read stuff.
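>> For example, with plain libvirt-python (it still has to be verified that
>> every call we need is allowed on a read-only connection):
>>
>>     import libvirt
>>
>>     # one read-only connection per worker thread in the sampling pool
>>     conn = libvirt.openReadOnly('qemu:///system')
>>     dom = conn.lookupByUUIDString(vm_uuid)   # vm_uuid comes from the caller
>>     info = dom.info()                        # virDomainGetInfo, read-only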
>
> Good idea.
>
>>
>>>> then we should be able to cancel a stuck task by killing the worker's
>>>> libvirt connection. We'll still need a (probably much simpler)
>>>> watchman/supervisor, but no big deal here.
>>>> Libvirt allows closing a connection from a different thread.
>>>> I haven't actually tried to unstick a blocked thread this way, but I
>>>> have no reason to believe it will not work.
>>>
>>> Let's try?
>>>
>>> You can block access to storage using iptables, which may cause the
>>> block-related calls to get stuck, and then try to close the connection
>>> after a few seconds from another thread.
>>
>> Sure, it is near the top of my TODO list.
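>> The experiment would look roughly like this ('testvm' and the image path are
>> placeholders; storage access has to be blocked, e.g. with iptables, first):
>>
>>     import threading
>>     import libvirt
>>
>>     conn = libvirt.openReadOnly('qemu:///system')
>>     dom = conn.lookupByName('testvm')
>>
>>     def close_conn():
>>         # close the connection from another thread after a few seconds,
>>         # hopefully aborting the stuck call below
>>         conn.close()
>>
>>     threading.Timer(5.0, close_conn).start()
>>
>>     try:
>>         # expected to block while the storage is unreachable
>>         print dom.blockInfo('/path/to/disk/image', 0)
>>     except libvirt.libvirtError as e:
>>         print 'call aborted:', e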
>
> Let's wait with this. If we can continue to send other requests on the same
> connection while one call never returns, I don't see a need to close it.
>
>>
>>>>
>>>> B. Let's keep blocked threads around
>>>> The code as it is just leaves a blocked libvirt call, and the worker
>>>> thread that carried it, frozen.
>>>> The stuck worker thread can be replaced, up to a cap of frozen threads.
>>>> In this worst-case scenario, we end up with one (blocked!) thread per VM,
>>>> as it is today, and with no sampling data.
>>>
>>> In the worst case, each vm will cause all threads to be stuck on calls
>>> related to this vm, since calls can run on any thread in the pool.
>>
>> True, and this should be avoided anyway.
>>
>>>> On the other hand, by avoiding reissuing dangerous calls,
>>>
>>> Which are the dangerous calls?
>>>
>>> If they are related to storage domains, we already have a thread for each
>>> domain, so maybe they should run on the domain monitoring thread, not on
>>> libvirt threads.
>>>
>>> We don't have any issue if a domain monitoring thread gets stuck - this will
>>> simply make the domain unavailable after a couple of minutes.
>>
>> This is an interesting idea, see below for a description of the affected
>> calls.
>>
>>>> Thoughts and comments very welcome!
>>>
>>> We don't know yet why libvirt calls may get stuck, right?
>>>
>>> libvirt is probably not accessing storage itself (which may get you into D
>>> state) but querying qemu, which does access storage. So libvirt should be
>>> able to abort such calls.
>>>
>>> Let's make a list of calls that can get stuck, and check with the libvirt
>>> developers what is the best way to handle such calls.
>>
>> And here we go:
>>
>> sampler                    period (runs every X seconds)
>> self._highWrite             2
>>
>> self._sampleVmJobs         15
>>
>> self._sampleBalloon        15
>>
>> self._sampleCpu            15
>> self._sampleNet            15
>> self._sampleVcpuPinning    15
>> self._sampleCpuTune        15
>>
>> self._updateVolumes        60
>> self._sampleDisk           60
>> self._sampleDiskLatency    60
>>
>> The potentially blocking samplers are
>> _highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk
>>
>> * _highWrite uses virDomainGetBlockInfo which, depending on the storage
>> configuration, could enter the QEMU monitor.
>> libvirt needs to do some direct access (stat) anyway, then, depending on the
>> device:
>> - file-based devices: libvirt does the access directly
>
> We call virDomainGetBlockInfo only for block storage domains with qcow format:
>
> # vm.py:
> def _getExtendCandidates(self):
>     ret = []
>
>     mergeCandidates = self._getLiveMergeExtendCandidates()
>     for drive in self._devices[DISK_DEVICES]:
>         if not drive.blockDev or drive.format != 'cow':
>             continue
>
>         capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
>
>> - block-based devices: libvirt enters the QEMU monitor to ask for the last
>> block extent
>
> This is our only usage.
>
>>
>> * _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.
>
> Is "should be safe" really safe? Can you check with the libvirt developers
> about that?
>
>>
>> * _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should
>> be safe.
>>
>> * _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to
>> enter
>> the domain monitor.
>
> And the domain monitor may be stuck when storage is not available?
>
>>
>> * _sampleVmJobs uses virDomainGetBlockJobInfo, which needs to enter the QEMU
>> monitor.
>> However, this needs to run only if there are active block jobs.
>> I don't have numbers, but I expect this sampler to be idle most of the time.
>
> Adam: This is related to live merge, right?
>
>>
>> * _sampleVcpuPinning uses virDomainGetVcpus, which uses /proc and syscalls
>> (for affinity stuff); should be safe
>>
>> * _sampleCpuTune uses virDomainGetSchedulerParameters, which uses cgroups,
>> plus virDomainGetVcpusFlags and virDomainGetMetadata, which both access the
>> domain definition (the XML inside libvirt); should be safe
>>
>> * _updateVolumes uses irs.getVolumeSize, which I guess is safe
>>
>> * _sampleDisk uses virDomainGetBlockStats, which needs to enter the QEMU
>> monitor.
>>
>> * _sampleDiskLatency uses virDomainGetBlockStatsFlags, which needs to enter
>> the QEMU monitor.
>> This call looks like a generalized version of the above, so I can maybe
>> optimize the two
>> somehow by issuing a single call to libvirt and gathering all the stats in
>> one go.
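>> Something along these lines - a sketch, the field names come from
>> virDomainBlockStatsFlags and need double-checking:
>>
>>     def sample_disk(dom, path):
>>         # one QEMU monitor round trip instead of two: blockStatsFlags
>>         # returns both the counters used by _sampleDisk and the
>>         # *_total_times values used by _sampleDiskLatency
>>         stats = dom.blockStatsFlags(path, 0)
>>         usage = {
>>             'readOps': stats['rd_operations'],
>>             'writtenOps': stats['wr_operations'],
>>             'readBytes': stats['rd_bytes'],
>>             'writtenBytes': stats['wr_bytes'],
>>         }
>>         latency = {
>>             'readLatency': stats['rd_total_times'],
>>             'writeLatency': stats['wr_total_times'],
>>             'flushLatency': stats['flush_total_times'],
>>         }
>>         return usage, latency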
>>
>> _highWrite is in a class of its own. It could be a candidate to be moved,
>> at least partially (the storage health monitoring part), into the storage
>> code. The rest of its functionality may stay in virt, or may be moved to
>> storage as well. Let's see.
>>
>> In general, we can distinguish the calls into:
>>
>> * storage/VM health (prevent the VM from misbehaving when storage is
>> exhausted): _highWrite
>> * storage status: _updateVolumes, _sampleDisk, _sampleDiskLatency.
>> We can optimize the latter two; it looks wasteful to ask for data, discard
>> some, then ask again for the missing data. Need to check carefully if the
>> API allows doing both in one go, but it looks feasible.
>> * balloon stats: _sampleBalloon: needs to enter the QEMU monitor
>> * everything else: does not touch disk nor qemu, so no special treatment
>> here.
>
> If we separate the storage calls so vm sampling continues even if storage
> causes qemu to block - do we gain anything? I guess this vm is
> not useful if its storage is not available, and will soon
> pause anyway.
>
> Looks like the simplest solution is the best and fancy separation
> of sampling calls is needed.
you meant "not needed", no?
>
> Nir