[ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Michal Skrivanek
mskrivan at redhat.com
Mon Jul 7 15:04:19 UTC 2014
On Jul 7, 2014, at 16:53 , Nir Soffer <nsoffer at redhat.com> wrote:
> ----- Original Message -----
>> From: "Francesco Romani" <fromani at redhat.com>
>> To: devel at ovirt.org
>> Cc: "Nir Soffer" <nsoffer at redhat.com>
>> Sent: Monday, July 7, 2014 5:04:02 PM
>> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
>>
>> ----- Original Message -----
>>> From: "Nir Soffer" <nsoffer at redhat.com>
>>> To: "Francesco Romani" <fromani at redhat.com>
>>> Cc: devel at ovirt.org
>>> Sent: Friday, July 4, 2014 10:30:24 PM
>>> Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling
>>> of stuck calls
>>
>>>> Quick summary of the discussion on gerrit so far:
>>>> 1. extract the scheduling logic from the thread pool. Either add a
>>>> separate scheduler class, or let the sampling tasks reschedule
>>>> themselves after a successful completion.
>>>> Either way, the concept of a 'periodic task', and the added
>>>> complexity, isn't needed.
>>>
>>> This will also allow tasks to change the sampling policy depending on
>>> their results.
>>> If some call always fails, maybe we can run it less often.
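>>> Something like this, as a rough sketch - Scheduler.schedule(delay, callable)
>>> and the pool interface are made up here:
>>>
>>>     class SamplingTask(object):
>>>         def __init__(self, scheduler, pool, sample_func, period):
>>>             self._scheduler = scheduler
>>>             self._pool = pool            # thread pool executing the calls
>>>             self._sample = sample_func
>>>             self._period = period
>>>             self._failures = 0
>>>
>>>         def __call__(self):
>>>             try:
>>>                 self._sample()
>>>                 self._failures = 0
>>>                 delay = self._period
>>>             except Exception:
>>>                 # this call keeps failing - back off and run it less often
>>>                 self._failures += 1
>>>                 delay = min(self._period * (self._failures + 1),
>>>                             10 * self._period)
>>>             # reschedule only after completion, so the pool itself has
>>>             # no notion of a 'periodic task'
>>>             self._scheduler.schedule(
>>>                 delay, lambda: self._pool.submit(self))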
>>
>> This looks like a feature worth having
>
> But let's take a note and forget about this - we should focus on a simple solution
>
>>>> 4. the sampling task (or maybe the scheduler) can be smarter and halt
>>>> the sampling in the presence of non-responding calls for a given VM,
>>>> provided the VM reports its 'health'/responsiveness.
>>>
>>> The scheduler should not do this. The object that scheduled the tasks
>>> should cancel them if the vm is not responding.
>>
>> Yes, but then we need a separate supervisor thread - otherwise I don't yet see
>> how we can achieve this
>>
>>> The current model gets this right because all calls related to a specific vm
>>> are run on the same vm thread, so a stuck call will prevent all other calls
>>> from running.
>>>
>>> This seems to be an important requirement, if we have this issue - that
>>> all calls related to a specific vm may start to block forever. Do we know
>>> that this happens?
>>
>> Yes. In all the reported cases it was, directly or indirectly, storage being
>> unavailable.
>> Directly: libvirt does some access (stat(), for example) and gets stuck.
>> NFS soft mount doesn't always solve this.
>> Indirectly: qemu gets stuck in D state because it was doing I/O and the
>> storage disappeared underneath it.
>
> When this happens - can all calls related to a specific vm block,
> or is it always the dangerous calls you listed below?
the storage stats are "dangerous", but ultimately anything can happen...
>
>>
>>> If one vm may stop responding, causing all libvirt calls for this vm to
>>> block, then a thread pool with one connection per worker thread can lead
>>> to a failure when all connections happen to run a request that blocks
>>> the thread. If this is the case, then each task related to one vm must
>>> depend on the other tasks and should not be started until the previous task
>>> has returned, simulating the current threading model without creating 100's
>>> of threads.
>>
>> Agreed, we should introduce this concept; it is currently lacking in my
>> threadpool proposal.
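>> Roughly what I have in mind - only a sketch, the names are made up:
>>
>>     import threading
>>
>>     class VmSampler(object):
>>         """Serialize sampling calls per VM: while one call for this VM
>>         is still running (possibly stuck), further calls are skipped."""
>>
>>         def __init__(self, vm):
>>             self._vm = vm
>>             self._lock = threading.Lock()
>>
>>         def __call__(self, sample_func):
>>             if not self._lock.acquire(False):
>>                 # the previous call for this VM has not returned yet:
>>                 # skip, simulating today's one-thread-per-VM behaviour
>>                 return
>>             try:
>>                 sample_func(self._vm)
>>             finally:
>>                 self._lock.release()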
>
> So basically the current threading model is the behavior we want?
>
> If some call gets stuck, stop sampling this vm. Continue when the
> call returns.
for stats - yes
>
> Michal? Federico?
>
>>>> A. Let's cancel the stuck tasks.
>>>> If we move toward a libvirt connection pool, and we give each worker
>>>> thread in the sampling pool a separate libvirt connection, hopefully
>>>> read-only,
>>>
>>> Why read only?
>>
>> Application of the principle of least privilege.
>> Sampling doesn't need to change any VM/domain property, just to read stuff.
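>> For example, with plain libvirt-python (it still has to be verified that
>> every call we need is allowed on a read-only connection):
>>
>>     import libvirt
>>
>>     # one read-only connection per worker thread in the sampling pool
>>     conn = libvirt.openReadOnly('qemu:///system')
>>     dom = conn.lookupByUUIDString(vm_uuid)   # vm_uuid comes from the caller
>>     info = dom.info()                        # virDomainGetInfo, read-only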
>
> Good idea.
>
>>
>>>> then we should be able to cancel a stuck task by killing the worker's
>>>> libvirt connection. We'll still need a (probably much simpler)
>>>> watchman/supervisor, but no big deal here.
>>>> Libvirt allows closing a connection from a different thread.
>>>> I haven't actually tried to unstick a blocked thread this way, but I
>>>> have no reason to believe it will not work.
>>>
>>> Let's try?
>>>
>>> You can block access to storage using iptables, which may cause the
>>> block-related calls to get stuck, and then try to close the connection
>>> after a few seconds from another thread.
>>
>> Sure, it is near the top of my TODO list.
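>> The experiment would look roughly like this ('testvm' and the image path are
>> placeholders; storage access has to be blocked, e.g. with iptables, first):
>>
>>     import threading
>>     import libvirt
>>
>>     conn = libvirt.openReadOnly('qemu:///system')
>>     dom = conn.lookupByName('testvm')
>>
>>     def close_conn():
>>         # close the connection from another thread after a few seconds,
>>         # hopefully aborting the stuck call below
>>         conn.close()
>>
>>     threading.Timer(5.0, close_conn).start()
>>
>>     try:
>>         # expected to block while the storage is unreachable
>>         print dom.blockInfo('/path/to/disk/image', 0)
>>     except libvirt.libvirtError as e:
>>         print 'call aborted:', e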
>
> Let's wait with this. If we can continue to send other requests on the same
> connection while one call never returns, I don't see a need to close it.
>
>>
>>>>
>>>> B. Let's keep blocked threads around
>>>> The code as it is just leaves a blocked libvirt call, and the worker
>>>> thread that carried it, frozen.
>>>> The stuck worker thread can be replaced, up to a cap of frozen threads.
>>>> In this worst-case scenario, we end up with one (blocked!) thread per VM,
>>>> as it is today, and with no sampling data.
>>>
>>> In the worst case, each vm will cause all threads to be stuck on calls
>>> related to this vm, since calls can run on any thread in the pool.
>>
>> True, and this should be avoided anyway.
>>
>>>> On the other hand, by avoiding reissuing dangerous calls,
>>>
>>> Which are the dangerous calls?
>>>
>>> If they are related to storage domains, we already have a thread for each
>>> domain, so maybe they should run on the domain monitoring thread, not on
>>> libvirt threads.
>>>
>>> We don't have any issue if a domain monitoring thread gets stuck - this will
>>> simply make the domain unavailable after a couple of minutes.
>>
>> This is an interesting idea, see below for a description of the affected
>> calls.
>>
>>>> Thoughts and comments very welcome!
>>>
>>> We don't know yet why libvirt calls may get stuck, right?
>>>
>>> libvirt is probably not accessing storage itself (which may get you into D
>>> state) but querying qemu, which does access storage. So libvirt should be
>>> able to abort such calls.
>>>
>>> Let's make a list of calls that can get stuck, and check with the libvirt
>>> developers what is the best way to handle such calls.
>>
>> And here we go:
>>
>> sampler                    period (runs every X seconds)
>> self._highWrite             2
>>
>> self._sampleVmJobs         15
>>
>> self._sampleBalloon        15
>>
>> self._sampleCpu            15
>> self._sampleNet            15
>> self._sampleVcpuPinning    15
>> self._sampleCpuTune        15
>>
>> self._updateVolumes        60
>> self._sampleDisk           60
>> self._sampleDiskLatency    60
>>
>> The potentially blocking samplers are
>> _highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk
>>
>> * _highWrite uses virDomainGetBlockInfo which, depending on the storage
>> configuration, could enter the QEMU monitor.
>> libvirt needs to do some direct access (stat) anyway, then, depending on the
>> device:
>> - file-based devices: libvirt does the access directly
>
> We call virDomainGetBlockInfo only for block storage domains with qcow format:
>
> # vm.py:
> def _getExtendCandidates(self):
>     ret = []
>
>     mergeCandidates = self._getLiveMergeExtendCandidates()
>     for drive in self._devices[DISK_DEVICES]:
>         if not drive.blockDev or drive.format != 'cow':
>             continue
>
>         capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
>
>> - block-based devices: libvirt enters the QEMU monitor to ask for the last
>> block extent
>
> This is our only usage.
>
>>
>> * _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.
>
> Is "should be safe" really safe? Can you check with the libvirt developers
> about that?
>
>>
>> * _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should
>> be safe.
>>
>> * _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to
>> enter
>> the domain monitor.
>
> And the domain monitor may be stuck when storage is not available?
>
>>
>> * _sampleVmJobs uses virDomainGetBlockJobInfo, which needs to enter the QEMU
>> monitor.
>> However, this needs to run only if there are active block jobs.
>> I don't have numbers, but I expect this sampler to be idle most of the time.
>
> Adam: This is related to live merge, right?
>
>>
>> * _sampleVcpuPinning uses virDomainGetVcpus, which uses /proc and syscalls
>> (for affinity stuff); should be safe
>>
>> * _sampleCpuTune uses virDomainGetSchedulerParameters, which uses cgroups,
>> plus virDomainGetVcpusFlags and virDomainGetMetadata, which both access the
>> domain definition (the XML inside libvirt); should be safe
>>
>> * _updateVolumes uses irs.getVolumeSize, which I guess is safe
>>
>> * _sampleDisk uses virDomainGetBlockStats, which needs to enter the QEMU
>> monitor.
>>
>> * _sampleDiskLatency uses virDomainGetBlockStatsFlags, which needs to enter
>> the QEMU monitor.
>> This call looks like a generalized version of the above, so I can maybe
>> optimize the two
>> somehow by issuing a single call to libvirt and gathering all the stats in
>> one go.
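>> Something along these lines - a sketch, the field names come from
>> virDomainBlockStatsFlags and need double-checking:
>>
>>     def sample_disk(dom, path):
>>         # one QEMU monitor round trip instead of two: blockStatsFlags
>>         # returns both the counters used by _sampleDisk and the
>>         # *_total_times values used by _sampleDiskLatency
>>         stats = dom.blockStatsFlags(path, 0)
>>         usage = {
>>             'readOps': stats['rd_operations'],
>>             'writtenOps': stats['wr_operations'],
>>             'readBytes': stats['rd_bytes'],
>>             'writtenBytes': stats['wr_bytes'],
>>         }
>>         latency = {
>>             'readLatency': stats['rd_total_times'],
>>             'writeLatency': stats['wr_total_times'],
>>             'flushLatency': stats['flush_total_times'],
>>         }
>>         return usage, latency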
>>
>> _highWrite is in a class of its own. It could be a candidate to be moved,
>> at least partially (the storage health monitoring part), into the storage
>> code. The rest of its functionality may stay in virt, or may be moved to
>> storage as well. Let's see.
>>
>> In general, we can distinguish the calls into:
>>
>> * storage/VM health (prevent the VM from misbehaving when storage is
>> exhausted): _highWrite
>> * storage status: _updateVolumes, _sampleDisk, _sampleDiskLatency.
>> We can optimize the latter two; it looks wasteful to ask for data, discard
>> some, then ask again for the missing data. Need to check carefully if the
>> API allows doing both in one go, but it looks feasible.
>> * balloon stats: _sampleBalloon: needs to enter the QEMU monitor
>> * everything else: does not touch disk nor qemu, so no special treatment
>> here.
>
> If we separate the storage calls so vm sampling continues even if storage
> causes qemu to block - do we gain anything? I guess this vm is
> not useful if its storage is not available, and will soon
> pause anyway.
>
> Looks like the simplest solution is the best and fancy separation
> of sampling calls is needed.
you meant "not needed", no?
>
> Nir