[VDSM][sampling] thread pool status and handling of stuck calls

Hi,

Nir has begun reviewing my draft patches about the thread pool and sampling refactoring (thanks!), and has already suggested quite a few improvements, which I'd like to summarize here.

Quick links to the ongoing discussion:
http://gerrit.ovirt.org/#/c/29191/8/lib/threadpool/worker.py,cm
http://gerrit.ovirt.org/#/c/29190/4/lib/threadpool/README.rst,cm

Quick summary of the discussion on gerrit so far:

1. Extract the scheduling logic from the thread pool. Either add a separate scheduler class or let the sampling tasks reschedule themselves after a successful completion. Either way, the concept of a 'periodic task', and the added complexity, isn't needed.

2. Drop all the *queue classes I've added, thus making the package simpler. They are no longer needed once we remove the concept of a periodic task.

3. Have a per-task timeout, and move the stuck-task detection elsewhere, for example into the worker thread, or maybe better into the aforementioned scheduler. If the scheduler finds that a task started in the previous pass (or even before!) has not yet completed, there is no point in keeping this task alive and it should be cancelled.

4. The sampling task (or maybe the scheduler) can be smarter and halt the sampling in the presence of non-responding calls for a given VM, provided the VM reports its 'health'/responsiveness.

(Hopefully I haven't forgotten anything big.)

In the draft currently published, I reluctantly added the *queue classes, and I agree the periodic task implementation is messy, so I'll be very happy to drop them.

However, a core question still holds: what to do in the presence of a stuck task? I think it is worth discussing this topic on a medium friendlier than gerrit, as it is the single most important decision to make in the sampling refactoring. It all boils down to: should we just keep the stuck threads somewhere and wait? Or should we cancel stuck tasks?

A. Let's cancel the stuck tasks.

If we move toward a libvirt connection pool, and we give each worker thread in the sampling pool a separate libvirt connection, hopefully read-only, then we should be able to cancel a stuck task by killing the worker's libvirt connection. We'll still need a (probably much simpler) watchman/supervisor, but no big deal here. Libvirt allows closing a connection from a different thread. I haven't actually tried to unstick a blocked thread this way, but I have no reason to believe it will not work.

B. Let's keep blocked threads around.

The code as it is just leaves a blocked libvirt call, and the worker thread that carried it, frozen. The stuck worker thread can be replaced, up to a cap of frozen threads. In this worst-case scenario we end up with one (blocked!) thread per VM, as it is today, and with no sampling data.

I believe that #A has some drawbacks which we risk overlooking, while at the same time #B has some merits. Let me explain.

The hardest case is a call blocked in the kernel in D state. Libvirt has no more room than VDSM to unblock it; and libvirt itself *has* a pool of resources (threads in this case) which can be depleted by stuck calls. Actually, retrying a failed task may deplete their pool even faster [1]. I'm not happy to just push this problem down the stack, as it looks to me that we gain very little by doing so. VDSM itself surely stays cleaner, but the VDS/hypervisor host as a whole improves only a bit: libvirt scales better, and that gives us some more room. On the other hand, by avoiding reissuing dangerous calls, I believe we make better use of the host resources in general.

Actually, keeping blocked threads around is a side effect of not reattempting blocked calls. Moreover, keeping the blocked thread around has a significant benefit: we can discover at the earliest possible moment when it is safe again to do the blocked call, because the blocked call itself returns and we can track this event (and of course drop the now-stale result). Otherwise, if we drop the connection, we lose this event and we have no option other than trying again and hoping for the best [2].

I know the #B approach is not the cleanest, but I think it has slightly more appeal, especially on the libvirt depletion front.

Thoughts and comments very welcome!

+++

[1] They have extensions to the management API to dynamically adjust their thread pool and/or to cancel tasks, but it is in the RHEL 7.2 timeframe.
[2] A crazy idea would be to do something like http://en.wikipedia.org/wiki/Exponential_backoff which I'm not sure would be beneficial.

Bests and thanks,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani

----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: devel@ovirt.org Sent: Friday, July 4, 2014 5:48:59 PM Subject: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Hi,
Nir has begun reviewing my draft patches about the thread pool and sampling refactoring (thanks!), and already suggested quite some improvements which I'd like to summarize
Quick links to the ongoing discussion: http://gerrit.ovirt.org/#/c/29191/8/lib/threadpool/worker.py,cm http://gerrit.ovirt.org/#/c/29190/4/lib/threadpool/README.rst,cm
Quick summary of the discussion on gerrit so far: 1. extract the scheduling logic from the thread pool. Either add a separate scheduler class or let the sampling tasks reschedule themselves after a successful completion. Either way, the concept of 'periodic task', and the added complexity, isn't needed.
This will also allow a task to change its sampling policy depending on the results. If some call always fails, maybe we can run it less often.
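Just to illustrate what that could look like (a throwaway sketch, not part of the patches under review; all names here are invented):

    class AdaptiveSampler(object):
        """Run a sampling callable and stretch its interval while it
        keeps failing; reset the interval once it succeeds again."""

        def __init__(self, sample, base_interval, max_interval=300):
            self._sample = sample
            self._base = base_interval
            self._max = max_interval
            self.interval = base_interval    # the scheduler would read this

        def __call__(self):
            try:
                self._sample()
            except Exception:
                # a call that keeps failing gets run less and less often
                self.interval = min(self.interval * 2, self._max)
            else:
                self.interval = self._base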
2. drop all the *queue classes I've added, thus making the package simpler. They are no longer needed since we remove the concept of periodic task.
3. have per-task timeout, move the stuck task detection elsewhere, like in the worker thread, or maybe better in the aforementioned scheduler.
The scheduler should not care about task status or results; it should be responsible for starting a task when it should run.
If the scheduler finds that any task started in the former pass (or even before!) has not yet completed, there is no point in keeping this task alive and it should be cancelled.
It should not do this. If we schedule the next run of a task only when the previous one has finished, we cannot get into this state where a new task starts while the previous one is still running.
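A minimal sketch of this reschedule-on-completion idea (the names are made up for illustration; this is not the code under review):

    import threading

    class Scheduler(object):
        """Toy scheduler: run a callable once, 'delay' seconds from now."""

        def schedule(self, delay, callable_):
            timer = threading.Timer(delay, callable_)
            timer.daemon = True
            timer.start()
            return timer

    def periodic(scheduler, run_in_pool, interval, sample):
        """Schedule the next run only after the current one completed:
        a stuck call simply never requeues itself, so runs of the same
        task cannot pile up."""

        def task():
            try:
                sample()                      # may block inside libvirt
            finally:
                scheduler.schedule(interval, lambda: run_in_pool(task))

        return task

Here run_in_pool would be something like pool.queue, i.e. handing the task to a worker thread.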
4. the sampling task (or maybe the scheduler) can be smarter and halting the sample in presence of not responding calls for a given VM, granted the VM reports its 'health'/responsiveness.
The scheduler should not do this. The object that scheduled the tasks should cancel them if the vm is not responding. The current model gets this right because all calls related to a specific vm run on the same vm thread, so a stuck call will prevent all other calls from running. This seems to be an important requirement, if we really have this issue - that all calls related to a specific vm may start blocking forever. Do we know that this happens?

If one vm may stop responding, causing all libvirt calls for this vm to block, then a thread pool with one connection per worker thread can lead to a failure when all connections happen to run a request that blocks the thread. If this is the case, then each task related to one vm must depend on other tasks and should not be started until the previous task returned, simulating the current threading model without creating 100's of threads.
(Hopefully I haven't forgotten anything big)
In the draft currently published, I reluctantly added the *queue classes and I agree the periodic task implementation is messy, so I'll be very happy to drop them.
Just to make it more clear - I think we should make simple components that do one thing well, not magic components that do everything, like a threadpool that does scheduling and other fancy stuff.
However, a core question still holds: what to do in presence of the stuck task?
I think it is worth to discuss this topic on a medium friendlier than gerrit, as it is the single most important decision to make in the sampling refactoring.
It all boils down to: Should we just keep somewhere stuck threads and wait? Should we cancel stuck tasks?
A. Let's cancel the stuck tasks. If we move toward a libvirt connection pool, and we give each worker thread in the sampling pool a separate libvirt connection, hopefully read-only,
Why read only?
then we should be able to cancel stuck task by killing the worker's libvirt connection. We'll still need a (probably much simpler) watchman/supervisor, but no big deal here. Libvirt allows to close a connection from a different thread. I haven't actually tried to unstuck a blocked thread this way, but I have no reason to believe it will not work.
Let's try? You can block access to storage using iptables, which may cause the block-related calls to get stuck, then try to close the connection from another thread after a few seconds.
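Something along these lines could be used for the experiment (a rough sketch built on the libvirt python bindings; the domain name and disk path are placeholders):

    import threading
    import libvirt

    conn = libvirt.openReadOnly('qemu:///system')
    dom = conn.lookupByName('vm01')          # placeholder domain name

    def watchdog():
        # after a few seconds, kill the connection from another thread
        # and see whether the blocked call below returns with an error
        # instead of hanging forever
        conn.close()

    timer = threading.Timer(5.0, watchdog)
    timer.start()

    try:
        # virDomainGetBlockInfo is one of the calls that can block when
        # storage disappears (block access with iptables beforehand)
        print(dom.blockInfo('/rhev/data-center/.../disk', 0))
    except libvirt.libvirtError as e:
        print("call aborted: %s" % e)
    finally:
        timer.cancel()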
B. Let's keep around blocked threads The code as it is just leaves a blocked libvirt call and the worker thread that carried it frozen. The stuck worker thread can be replaced up to a cap of frozen threads. In this worst case scenario, we end up with one (blocked!) thread per VM, as it is today, and with no sampling data.
In the worst case, one vm will cause all threads to be stuck on calls related to this vm, since calls can run on any thread in the pool.
I believe that #A has some drawbacks which we risk to overlook, and on the same time #B has some merits.
Let me explain: The hardest case is a call blocked in the kernel in D state. Libvirt has no more room than VDSM to unblock it; and libvirt itself *has* a pool of resources (threads in this case) which can be depleted by stuck calls. Actually, retrying to do a failed task may deplete their pool even faster[1].
I'm not happy to just push this problem down the stack, as it looks to me that we gain very little by doing so. VDSM itself surely stays cleaner, but the VDS/hypervisor hosts on the whole improves just a bit: libvirt scales better, and that gives us some more room.
On the other hand, by avoiding to reissue dangerous calls,
Which are the dangerous calls? If they are related to storage domains, we already have a thread per domain, so maybe they should run on the domain monitoring thread, not on libvirt threads. We don't have any issue if a domain monitoring thread gets stuck - this will simply make the domain unavailable after a couple of minutes.
I believe we make better use of the host resources in general. Actually, the point of keeping blocked thread around is a side effect of not reattempting blocked calls. Moreover, to keep the blocked thread around has a significant benefit: we can discover at the earliest moment when it is safe again to do the blocked call, because the blocked call itself returns and we can track this event! (and of course drop the now stale result). Otherwise, if we drop the connection, we'll lose this event and we have no more option that trying again and hoping for the best[2]
This is a good point.
I know the #B approach is not the cleanest, but I think it has slightly more appeal, especially on the libvirt depletion front.
Thoughts and comments very welcome!
We don't know yet why libvirt calls may get stuck, right? libvirt is probably not accessing storage itself (which may get you into D state) but querying qemu, which does access storage. So libvirt should be able to abort such calls.

Let's make a list of the calls that can get stuck, and check with the libvirt developers what is the best way to handle them. If they tell us that closing a stuck connection will break libvirt, we obviously cannot do this and will have to wait until such calls return, replacing the stuck thread with another thread. But I think that we are trying to solve the problem too quickly, before we understand it (at least I don't yet understand the libvirt side).

First, can you make a list of the libvirt calls that we make, and how much time each call takes on average? Let's have a patch that adds this info - how long we waited for the response - before doing any design.

How many libvirt connections are we using today? Do we use the same connection from many threads?

Looking at the libvirt api, it looks like libvirt supports accessing a connection from many threads. It seems that responses can return in a different order than the requests: http://libvirt.org/internals/rpc.html#apiclientdispatchex1

When one of the dangerous calls gets stuck, does it stop all other requests on the connection, or do we simply have a thread that made a call and will never return, while other threads are happily sending requests and receiving responses on this connection?

Nir

----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org Sent: Friday, July 4, 2014 10:30:24 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Quick summary of the discussion on gerrit so far: 1. extract the scheduling logic from the thread pool. Either add a separate scheduler class or let the sampling tasks reschedule themselves after a successful completion. Either way, the concept of 'periodic task', and the added complexity, isn't needed.
This will also allow task to change the sampling policy depending on the results. If some calls always fails, maybe we can run it less often.
This looks like a feature worth having.
If the scheduler finds that any task started in the former pass (or even before!) has not yet completed, there is no point in keeping this task alive and it should be cancelled.
It should not do this. If we schedule task when it was finished, we cannot get into this state, that a new task started while the previous task is still running.
Good point and elegant solution indeed.
4. the sampling task (or maybe the scheduler) can be smarter and halting the sample in presence of not responding calls for a given VM, granted the VM reports its 'health'/responsiveness.
The scheduler should not do this. The object that scheduled the tasks should cancel them if the vm is not responding.
Yes, then we need a separate supervisor thread - or otherwise I don't yet see how we can achieve this
The current model get this right because all calls related to specific vm are run on the same vm thread, so stuck call will prevent all other calls from running.
Seems that this is an important requirement, if we have this issue - that all calls related to a specific vm may start blocking forever. Did we know that this happens?
Yes. In all the reported cases it was directly or indirectly storage being unavailable. Directly: libvirt does some access (stat() for example) and gets stuck. NFS soft mount doesn't always solve this. Indirectly: qemu stuck in D state because it was doing I/O and the storage disappeared from underneath its feet.
If one vm may stop responding, causing all libvirt calls for this vm to block, then a thread pool with one connection per worker thread can lead to a failure when all connections happen to run a request that blocks the thread. If this is the case, then each task related to one vm must depend on other tasks and should not be started until the previous task returned, simulating the current threading model without creating 100's of threads.
Agreed, we should introduce this concept and this is lacking in my threadpool proposal.
In the draft currently published, I reluctantly added the *queue classes and I agree the periodic task implementation is messy, so I'll be very happy to drop them.
Just to make it more clear - I think we should make simple components that do one thing well, not magic components that do everything, like a threadpool that does scheduling and other fancy stuff.
Sure! Believe it or not, the proposal I submitted was the simplest I could come up with :)
A. Let's cancel the stuck tasks. If we move toward a libvirt connection pool, and we give each worker thread in the sampling pool a separate libvirt connection, hopefully read-only,
Why read only?
Application of the rule of least privilege. Sampling doesn't need to change any VM/domain property, just to read stuff.
then we should be able to cancel stuck task by killing the worker's libvirt connection. We'll still need a (probably much simpler) watchman/supervisor, but no big deal here. Libvirt allows to close a connection from a different thread. I haven't actually tried to unstuck a blocked thread this way, but I have no reason to believe it will not work.
Lets try?
You can block access to storage using iptables, which may cause the block related calls to stuck, and try to close a connection after few seconds from another thread.
Sure, it is near the top of my TODO list.
B. Let's keep around blocked threads The code as it is just leaves a blocked libvirt call and the worker thread that carried it frozen. The stuck worker thread can be replaced up to a cap of frozen threads. In this worst case scenario, we end up with one (blocked!) thread per VM, as it is today, and with no sampling data.
In the worst case, each vm will cause all threads to be stuck on call related to this vm, since calls can run on any thread in the pool.
True, and this should be avoided any way.
On the other hand, by avoiding to reissue dangerous calls,
Which are the dangerous calls?
If they are related to storage domains, we already have a thread per each domain, so maybe they should run on the domain monitoring thread, no on libvirt threads.
We don't have any issue if a domain monitoring thread get stuck - this will simply make the domain unavailable after couple of minutes.
This is an interesting idea, see below for a description of the affected calls.
Thoughts and comments very welcome!
We don't know yet why libvirt calls may stuck, right?
libvirt is probably not accessing storage (which may get you in D state) but query qemu which does access storage. So libvirt should be able to abort such calls.
Lets make a list of calls that can stuck, and check with the libvirt developers what is the best way to handle such calls.
And here we go:

    sampler                  period (runs every X seconds)
    self._highWrite           2
    self._sampleVmJobs       15
    self._sampleBalloon      15
    self._sampleCpu          15
    self._sampleNet          15
    self._sampleVcpuPinning  15
    self._sampleCpuTune      15
    self._updateVolumes      60
    self._sampleDisk         60
    self._sampleDiskLatency  60

The potentially blocking samplers are _highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk.

* _highWrite uses virDomainGetBlockInfo which, depending on the storage configuration, could enter the QEMU monitor. libvirt needs to do some direct access (stat) anyway, then, depending on the device:
- file-based devices: libvirt does the access directly
- block-based devices: libvirt enters the QEMU monitor to ask for the last block extent
* _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.
* _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should be safe.
* _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to enter the domain monitor.
* _sampleVmJobs uses virDomainBlockJobInfo, which needs to enter the QEMU monitor. However, this needs to run only if there are active block jobs. I don't have numbers, but I expect this sampler to be idle most of the time.
* _sampleVcpuPinning uses virDomainGetVcpus, which uses /proc and syscalls (for affinity stuff); should be safe.
* _sampleCpuTune uses virDomainSchedulerParameters, which uses cgroups, plus virDomainGetVcpusFlags and virDomainGetMetadata, which both access the domain definition (XML inside libvirt); should be safe.
* _updateVolume uses irs.getVolumeSize, which I guess is safe.
* _sampleDisk uses virDomainGetBlockStats, which needs to enter the QEMU monitor.
* _sampleDiskLatency uses virDomainGetBlockStatsFlags, which needs to enter the QEMU monitor. This call looks like a generalized version of the above, so I can maybe optimize the two somehow by issuing a single call to libvirt and gathering all the stats in one go.

_highWrite is in a class of its own. It could be a candidate to be moved, at least partially (storage health monitoring), into the storage code. The rest of its functionality may stay in virt, or may be moved to storage as well. Let's see.

In general, we can distinguish the calls into:

* storage/VM health (prevent the VM misbehaving when storage is exhausted): _highWrite
* storage status: _updateVolume, _sampleDisk, _sampleDiskLatency. We can optimize the latter two; it looks wasteful to ask for data, discard some, then ask again for the missing data. Need to check carefully whether the API allows doing both in one go, but it looks feasible.
* balloon stats: _sampleBalloon: needs to enter the QEMU monitor
* everything else: does not touch disk nor qemu, so no special treatment here.
If they tell us that closing a stuck connection will break libvirt we obviously cannot do this and will have to wait until such calls return, and replace the stuck thread with another thread.
Will verify ASAP, and check with libvirt devs as well.
First, can you make a list of libvirt calls that we make, and how much time each call takes on average? Lets have a patch that add this info - how much we waited for the response - before doing any design.
sure, I'll prepare a patch for this
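For reference, the kind of instrumentation meant here could be as small as this (just a sketch, not the actual patch; the decorator name is invented):

    import functools
    import logging
    import time

    def timed_call(name):
        """Log how long a sampling method waited for its libvirt reply."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.time()
                try:
                    return func(*args, **kwargs)
                finally:
                    logging.info("sampling call %s took %.3f seconds",
                                 name, time.time() - start)
            return wrapper
        return decorator

    # hypothetical usage inside the sampling code:
    # @timed_call("virDomainGetBlockInfo")
    # def _highWrite(self):
    #     ...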
How many libvirt connections are we using today? Do we use a same connection from many threads?
This should be the case. Will have more precise information ASAP.
Looking in the libvirt api, it looks like libvirt supports accessing a connection from many threads. It seems that responses can return in different order then the requests. http://libvirt.org/internals/rpc.html#apiclientdispatchex1
When one of the dangerous calls get stuck, does it stop all other requests on the connection on we simply have a thread that make a call and will never return, while other threads are happily sending requests and receiving responses on this connection?
Need to check this as well.

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani

----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com> Sent: Monday, July 7, 2014 5:04:02 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org Sent: Friday, July 4, 2014 10:30:24 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Quick summary of the discussion on gerrit so far: 1. extract the scheduling logic from the thread pool. Either add a separate scheduler class or let the sampling tasks reschedule themselves after a successful completion. Either way, the concept of 'periodic task', and the added complexity, isn't needed.
This will also allow task to change the sampling policy depending on the results. If some calls always fails, maybe we can run it less often.
This looks a feature worth having
But let's take a note and forget about this - we should focus on a simple solution.
4. the sampling task (or maybe the scheduler) can be smarter and halting the sample in presence of not responding calls for a given VM, granted the VM reports its 'health'/responsiveness.
The scheduler should not do this. The object that scheduled the tasks should cancel them if the vm is not responding.
Yes, then we need a separate supervisor thread - or otherwise I don't yet see how we can achieve this
The current model get this right because all calls related to specific vm are run on the same vm thread, so stuck call will prevent all other calls from running.
Seems that this is an important requirement, if we have this issue - that all calls related to a specific vm may start blocking forever. Did we know that this happens?
Yes. In all the reported cases it was directly or indirectly storage being unavailable. Directly: libvirt does some access (stat() for example) and gets stuck. NFS soft mount doesn't always solve this. Indirectly: qemu stuck in D state because it was doing I/O and the storage disappeared from underneath its feet.
When this happens - can all calls related to a specific vm block, or is it always the dangerous calls you listed below?
If one vm may stop responding, causing all libvirt calls for this vm to block, then a thread pool with one connection per worker thread can lead to a failure when all connections happen to run a request that blocks the thread. If this is the case, then each task related to one vm must depend on other tasks and should not be started until the previous task returned, simulating the current threading model without creating 100's of threads.
Agreed, we should introduce this concept and this is lacking in my threadpool proposal.
So basically the current threading model is the behavior we want? If some call gets stuck, stop sampling this vm; continue when the call returns. Michal? Federico?
A. Let's cancel the stuck tasks. If we move toward a libvirt connection pool, and we give each worker thread in the sampling pool a separate libvirt connection, hopefully read-only,
Why read only?
Application of the rule of least privilege. Sampling doesn't need to change any VM/domain property, just to read stuff.
Good idea.
then we should be able to cancel stuck task by killing the worker's libvirt connection. We'll still need a (probably much simpler) watchman/supervisor, but no big deal here. Libvirt allows to close a connection from a different thread. I haven't actually tried to unstuck a blocked thread this way, but I have no reason to believe it will not work.
Lets try?
You can block access to storage using iptables, which may cause the block related calls to stuck, and try to close a connection after few seconds from another thread.
Sure, is near to the top of my TODO list.
Let's wait with this. If we can continue to send other requests on the same connection while one call never returns, I don't see a need to close it.
B. Let's keep around blocked threads The code as it is just leaves a blocked libvirt call and the worker thread that carried it frozen. The stuck worker thread can be replaced up to a cap of frozen threads. In this worst case scenario, we end up with one (blocked!) thread per VM, as it is today, and with no sampling data.
In the worst case, each vm will cause all threads to be stuck on call related to this vm, since calls can run on any thread in the pool.
True, and this should be avoided any way.
On the other hand, by avoiding to reissue dangerous calls,
Which are the dangerous calls?
If they are related to storage domains, we already have a thread per each domain, so maybe they should run on the domain monitoring thread, no on libvirt threads.
We don't have any issue if a domain monitoring thread get stuck - this will simply make the domain unavailable after couple of minutes.
This is an interesting idea, see below for a description of the affected calls.
Thoughts and comments very welcome!
We don't know yet why libvirt calls may stuck, right?
libvirt is probably not accessing storage (which may get you in D state) but query qemu which does access storage. So libvirt should be able to abort such calls.
Lets make a list of calls that can stuck, and check with the libvirt developers what is the best way to handle such calls.
And here we go:
    sampler                  period (runs every X seconds)
    self._highWrite           2
    self._sampleVmJobs       15
    self._sampleBalloon      15
    self._sampleCpu          15
    self._sampleNet          15
    self._sampleVcpuPinning  15
    self._sampleCpuTune      15
    self._updateVolumes      60
    self._sampleDisk         60
    self._sampleDiskLatency  60
The potentially blocking samplers are _highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk
* _highWrite uses virDomainGetBlockInfo which, depending on the storage configuration, could enter the QEMU monitor. libvirt needs to do some direct access (stat) anyway, then, depending on the device:
- file-based devices: libvirt does the access directly
We call virDomainGetBlockInfo only for block storage domains and qcow format:

    # vm.py:2470
    def _getExtendCandidates(self):
        ret = []

        mergeCandidates = self._getLiveMergeExtendCandidates()
        for drive in self._devices[DISK_DEVICES]:
            if not drive.blockDev or drive.format != 'cow':
                continue

            capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
- block-based devices: libvirt enters the QEMU monitor to ask for the last block extent
This is our only usage.
* _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.
Are the "should be safe" calls really safe? Can you check with the libvirt developers about that?
* _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should be safe.
* _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to enter the domain monitor.
And the domain monitor may be stuck when storage is not available?
* _sampleVmJobs uses virDomainBlockJobInfo, which needs to enter the QEMU monitor. However, this needs to run only if there are active block jobs. I don't have numbers, but I expect this sampler to be idle most of the time.
Adam: This is related to live merge right?
* _sampleVcpuPinning uses virDomainGetVcpus which uses /proc and syscall (for affinity stuff), should be safe
* _sampleCpuTune uses virDomainSchedulerParameters, which uses cgroups, virDomainGetVcpusFlags and virDomainGetMetadata, which both access the domain definition (XML inside libvirt), should be safe
* _updateVolume uses irs.getVolumeSize which I guess it is safe
* _sampleDisk uses virDomainGetBlockStats, which needs to enter the QEMU monitor.
* _sampleDiskLatency uses virDomainGetBlockStatsFlags, which needs to enter the QEMU monitor. This call looks like a generalized version of the above, so I can maybe optimize the two somehow by issuing a single call to libvirt and gathering all the stats in one go.
_highWrite is in a class of its own. It could be a candidate to be moved, at least partially (storage health monitoring), into the storage code. The rest of its functionality may stay in virt, or may be moved to storage as well. Let's see.
In general, we can distinguish the calls in
* storage/VM health (prevent the VM misbehaving when storage is exhausted): _highWrite
* storage status: _updateVolume, _sampleDisk, _sampleDiskLatency. We can optimize the latter two; it looks wasteful to ask for data, discard some, then ask again for the missing data. Need to check carefully whether the API allows doing both in one go, but it looks feasible.
* balloon stats: _sampleBalloon: needs to enter the QEMU monitor
* everything else: does not touch disk nor qemu, so no special treatment here.
If we separate the storage calls so that vm sampling continues even if storage causes qemu to block - do we gain anything? I guess this vm is not useful if its storage is not available, and it will soon pause anyway.

Looks like the simplest solution is the best and fancy separation of sampling calls is needed.

Nir

On Jul 7, 2014, at 16:53 , Nir Soffer <nsoffer@redhat.com> wrote:
----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com> Sent: Monday, July 7, 2014 5:04:02 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org Sent: Friday, July 4, 2014 10:30:24 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Quick summary of the discussion on gerrit so far: 1. extract the scheduling logic from the thread pool. Either add a separate scheduler class or let the sampling tasks reschedule themselves after a successful completion. Either way, the concept of 'periodic task', and the added complexity, isn't needed.
This will also allow task to change the sampling policy depending on the results. If some calls always fails, maybe we can run it less often.
This looks a feature worth having
But lets take a note and forget about this - we should focus on simple solution
4. the sampling task (or maybe the scheduler) can be smarter and halting the sample in presence of not responding calls for a given VM, granted the VM reports its 'health'/responsiveness.
The scheduler should not do this. The object that scheduled the tasks should cancel them if the vm is not responding.
Yes, then we need a separate supervisor thread - or otherwise I don't yet see how we can achieve this
The current model get this right because all calls related to specific vm are run on the same vm thread, so stuck call will prevent all other calls from running.
Seems that this is an important requirement, if we have this issue - that all calls related to a specific vm may start blocking forever. Did we know that this happens?
Yes. In all the reported cases it was directly or indirectly storage being unavailable. Directly: libvirt does some access (stat() for example) and gets stuck. NFS soft mount doesn't always solve this. Indirectly: qemu stuck in D state because it was doing I/O and the storage disappeared from underneath its feet.
When this happens - does all calls related to specific vm can block, or it is always the dangerous calls you listed bellow?
the storage stats are "dangerous", but ultimately anything can happen...
If one vm may stop responding, causing all libvirt calls for this vm to block, then a thread pool with one connection per worker thread can lead to a failure when all connections happen to run a request that blocks the thread. If this is the case, then each task related to one vm must depend on other tasks and should not be started until the previous task returned, simulating the current threading model without creating 100's of threads.
Agreed, we should introduce this concept and this is lacking in my threadpool proposal.
So basically the current threading model is the behavior we want?
If some call get stuck, stop sampling this vm. Continue when the call returns.
for stats - yes
Michal? Federico?
A. Let's cancel the stuck tasks. If we move toward a libvirt connection pool, and we give each worker thread in the sampling pool a separate libvirt connection, hopefully read-only,
Why read only?
Application of the rule of least privilege. Sampling doesn't need to change any VM/domain property, just to read stuff.
Good idea.
then we should be able to cancel stuck task by killing the worker's libvirt connection. We'll still need a (probably much simpler) watchman/supervisor, but no big deal here. Libvirt allows to close a connection from a different thread. I haven't actually tried to unstuck a blocked thread this way, but I have no reason to believe it will not work.
Lets try?
You can block access to storage using iptables, which may cause the block related calls to stuck, and try to close a connection after few seconds from another thread.
Sure, is near to the top of my TODO list.
Lets wait with this. If we can continue to send other request on the same connection while one call never return, I don't see a need to close it.
B. Let's keep around blocked threads The code as it is just leaves a blocked libvirt call and the worker thread that carried it frozen. The stuck worker thread can be replaced up to a cap of frozen threads. In this worst case scenario, we end up with one (blocked!) thread per VM, as it is today, and with no sampling data.
In the worst case, each vm will cause all threads to be stuck on call related to this vm, since calls can run on any thread in the pool.
True, and this should be avoided any way.
On the other hand, by avoiding to reissue dangerous calls,
Which are the dangerous calls?
If they are related to storage domains, we already have a thread per each domain, so maybe they should run on the domain monitoring thread, no on libvirt threads.
We don't have any issue if a domain monitoring thread get stuck - this will simply make the domain unavailable after couple of minutes.
This is an interesting idea, see below for a description of the affected calls.
Thoughts and comments very welcome!
We don't know yet why libvirt calls may stuck, right?
libvirt is probably not accessing storage (which may get you in D state) but query qemu which does access storage. So libvirt should be able to abort such calls.
Lets make a list of calls that can stuck, and check with the libvirt developers what is the best way to handle such calls.
And here we go:
    sampler                  period (runs every X seconds)
    self._highWrite           2
    self._sampleVmJobs       15
    self._sampleBalloon      15
    self._sampleCpu          15
    self._sampleNet          15
    self._sampleVcpuPinning  15
    self._sampleCpuTune      15
    self._updateVolumes      60
    self._sampleDisk         60
    self._sampleDiskLatency  60
The potentially blocking samplers are _highWrite, _sampleBalloon, _sampleVmJobs (but see below), _sampleDisk
* _highWrite uses virDomainGetBlockInfo which, depending on the storage configuration, could enter the QEMU monitor. libvirt needs to do some direct access (stat) anyway, then, depending on the device:
- file-based devices: libvirt does the access directly
We call virDomainGetBlockInfo only for block storage domain and qcow format:
    # vm.py:2470
    def _getExtendCandidates(self):
        ret = []

        mergeCandidates = self._getLiveMergeExtendCandidates()
        for drive in self._devices[DISK_DEVICES]:
            if not drive.blockDev or drive.format != 'cow':
                continue

            capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
- block-based devices: libvirt enters the QEMU monitor to ask for the last block extent
This is our only usage.
* _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.
Are the "should be safe" calls really safe? Can you check with the libvirt developers about that?
* _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should be safe.
* _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to enter the domain monitor.
And the domain monitor may be stuck when storage is not available?
* _sampleVmJobs uses virDomainBlockJobInfo, which needs to enter the QEMU monitor. However, this needs to run only if there are active block jobs. I don't have numbers, but I expect this sampler to be idle most of the time.
Adam: This is related to live merge right?
* _sampleVcpuPinning uses virDomainGetVcpus which uses /proc and syscall (for affinity stuff), should be safe
* _sampleCpuTune uses virDomainSchedulerParameters, which uses cgroups, virDomainGetVcpusFlags and virDomainGetMetadata, which both access the domain definition (XML inside libvirt), should be safe
* _updateVolume uses irs.getVolumeSize which I guess it is safe
* _sampleDisk uses virDomainGetBlockStats, which needs to enter the QEMU monitor.
* _sampleDiskLatency uses virDomainGetBlockStatsFlags, which needs to enter the QEMU monitor. This call looks like a generalized version of the above, so I can maybe optimize the two somehow by issuing a single call to libvirt and gathering all the stats in one go.
_highWrite is in a class of its own. It could be a candidate to be moved, at least partially (storage health monitoring), into the storage code. The rest of its functionality may stay in virt, or may be moved to storage as well. Let's see.
In general, we can distinguish the calls in
* storage/VM health (prevent the VM misbehaving when storage is exhausted): _highWrite
* storage status: _updateVolume, _sampleDisk, _sampleDiskLatency. We can optimize the latter two; it looks wasteful to ask for data, discard some, then ask again for the missing data. Need to check carefully whether the API allows doing both in one go, but it looks feasible.
* balloon stats: _sampleBalloon: needs to enter the QEMU monitor
* everything else: does not touch disk nor qemu, so no special treatment here.
If we separate the storage calls so that vm sampling continues even if storage causes qemu to block - do we gain anything? I guess this vm is not useful if its storage is not available, and it will soon pause anyway.
Looks like the simplest solution is the best and fancy separation of sampling calls is needed.
you meant "not needed", no?
Nir

----- Original Message -----
From: "Michal Skrivanek" <mskrivan@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Adam Litke" <alitke@redhat.com> Sent: Monday, July 7, 2014 6:04:19 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
On Jul 7, 2014, at 16:53 , Nir Soffer <nsoffer@redhat.com> wrote:
If we separate the storage calls so that vm sampling continues even if storage causes qemu to block - do we gain anything? I guess this vm is not useful if its storage is not available, and it will soon pause anyway.
Looks like the simplest solution is the best and fancy separation of sampling calls is needed.
you meant "not needed", no?
Yes, not needed.

On 07/07/14 10:53 -0400, Nir Soffer wrote:
* _sampleVmJobs uses virDomainBlockJobInfo, which needs to enter the QEMU monitor. However, this needs to run only if there are active block jobs. I don't have numbers, but I expect this sampler to be idle most of the time.
Adam: This is related to live merge right?
Yes, it is currently used only for live merge. It only calls libvirt when it expects a job to be running, so indeed it is pretty much a noop most of the time.

--
Adam Litke

----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com> Sent: Monday, July 7, 2014 4:53:29 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
This will also allow task to change the sampling policy depending on the results. If some calls always fails, maybe we can run it less often.
This looks a feature worth having
But lets take a note and forget about this - we should focus on simple solution
Agreed. Stuff for future work.
Yes. In all the reported cases it was directly or indirectly storage being unavailable. Directly: libvirt does some access (stat() for example) and gets stuck. NFS soft mount doesn't always solve this. Indirectly: qemu stuck in D state because it was doing I/O and the storage disappeared from underneath its feet.
When this happens - can all calls related to a specific vm block, or is it always the dangerous calls you listed below?
As Michal said, the danger comes mostly from the listed calls and, in general, from storage-related calls. To make things even more complex: https://www.redhat.com/archives/libvir-list/2014-July/msg00478.html

Relevant quote for the lazy people (:))
If there is a prior call to libvirt that involves that guest domain which has blocked on storage, then this can prevent subsequent calls from completing since the prior call may hold a lock.
However this is more or less what we are looking for - not spamming libvirt with related calls if one gets stuck - so it seems it is not that bad for us, I believe.
If one vm may stop responding, causing all libvirt calls for this vm to block, then a thread pool with one connection per worker thread can lead to a failure when all connections happen to run a request that blocks the thread. If this is the case, then each task related to one vm must depend on other tasks and should not be started until the previous task returned, simulating the current threading model without creating 100's of threads.
Agreed, we should introduce this concept and this is lacking in my threadpool proposal.
So basically the current threading model is the behavior we want?
If some call get stuck, stop sampling this vm. Continue when the call returns.
Michal? Federico?
Yep - but with fewer threads, and surely with a constant number of them. Your schedule library (review in my queue at very high priority) is indeed a nice step in this direction. Waiting for Federico's ack.
You can block access to storage using iptables, which may cause the block related calls to stuck, and try to close a connection after few seconds from another thread.
Sure, is near to the top of my TODO list.
Lets wait with this. If we can continue to send other request on the same connection while one call never return, I don't see a need to close it.
Work in progress...
- block-based devices: libvirt enters the QEMU monitor to ask for the last block extent
This is our only usage.
Yep, but a very important one indeed
* _sampleCpu uses virDomainGetCPUStats, which uses cgroups. Should be safe.
Are the "should be safe" calls really safe? Can you check with the libvirt developers about that?
Done: https://www.redhat.com/archives/libvir-list/2014-July/msg00477.html Waiting for (more) answers.
* _sampleNet uses virDomainInterfaceStats, which uses /proc/net/dev. Should be safe.
* _sampleBalloon uses virDomainGetInfo, which uses /proc, but also needs to enter the domain monitor.
And the domain monitor may be stuck when storage is not available?
Just to be clear, here I mean the QEMU domain (aka VM) monitor. This is a very good question. I think yes, it can get stuck, but I don't have hard evidence here, yet.
* storage/VM health (prevent the VM misbehaving when storage is exhausted): _highWrite
* storage status: _updateVolume, _sampleDisk, _sampleDiskLatency. We can optimize the latter two; it looks wasteful to ask for data, discard some, then ask again for the missing data. Need to check carefully whether the API allows doing both in one go, but it looks feasible.
* balloon stats: _sampleBalloon: needs to enter the QEMU monitor
* everything else: does not touch disk nor qemu, so no special treatment here.
If we separate the storage calls so that vm sampling continues even if storage causes qemu to block - do we gain anything? I guess this vm is not useful if its storage is not available, and it will soon pause anyway.
Looks like the simplest solution is the best and fancy separation of sampling calls is needed.
You are right about a VM being useless without storage. Probably separation is not the solution here, however we must consider some kind of isolation. If a VM gets stuck on storage (D-state etc. etc.) for whatever reason, it's fine to block all the other samplings of that VM; however, the other VMs should continue to be sampled until they block individually. We shouldn't assume that if the storage is not responsive for one VM, then it will be for all the other VMs on a host.

Please note that I'm *not* implying we are assuming the above; it is just to highlight a requirement for sampling in general, and to preserve the current (good) behaviour specifically.

Bests,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani

----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com> Sent: Wednesday, July 9, 2014 2:22:05 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com> Sent: Monday, July 7, 2014 4:53:29 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
If one vm may stop responding, causing all libvirt calls for this vm to block, then a thread pool with one connection per worker thread can lead to a failure when all connections happen to run a request that blocks the thread. If this is the case, then each task related to one vm must depend on other tasks and should not be started until the previous task returned, simulating the current threading model without creating 100's of threads.
Agreed, we should introduce this concept and this is lacking in my threadpool proposal.
So basically the current threading model is the behavior we want?
If some call get stuck, stop sampling this vm. Continue when the call returns.
Michal? Federico?
Yep - but with fewer threads, and surely with a constant number of them. Your schedule library (review in my queue at very high priority) is indeed a nice step in this direction.
Waiting for Federico's ack.
That looks good. Now I would like to summarize a few things.

We know that when a request gets stuck on a vm, the subsequent ones will also get stuck (at least until their timeout is up, except for the first one, which could stay there forever).

We want a limited number of threads polling the statistics (trying to match the number of threads that libvirt has).

Given those two assumptions we want a thread pool of workers that are picking up jobs *per-vm*. The jobs should be smart enough to:

- understand what samples they have to take in that cycle (cpu? network? etc.)
- resubmit themselves in the queue

Now this will ensure that in the queue there's only one job per-vm, and if it gets stuck it is not re-submitted (no other worker will get stuck).

Additionally, I think someone mentioned re-connection to libvirt in case of stuck threads. I actually want to discourage (or really minimize) this behavior because I can't think of a case where it would improve the situation (it may just end up generating a large number of zombie threads on the libvirt side).

--
Federico
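A rough sketch of such a per-vm job, just to make the idea concrete (class and method names are invented; this is not a proposed patch):

    import threading

    class VmSamplingJob(object):
        """One job per vm: each run decides which samples are due in this
        cycle and resubmits itself only after it has finished, so a vm
        with a stuck call never has more than one job in the queue."""

        # sampler name -> period in seconds (intervals from this thread)
        PERIODS = {'cpu': 15, 'net': 15, 'balloon': 15, 'disk': 60}

        def __init__(self, vm, resubmit, tick=15):
            self._vm = vm              # object exposing sample(name)
            self._resubmit = resubmit  # callable that requeues the job
            self._tick = tick
            self._elapsed = 0

        def __call__(self):
            self._elapsed += self._tick
            try:
                for name, period in self.PERIODS.items():
                    if self._elapsed % period == 0:
                        self._vm.sample(name)   # may block inside libvirt
            finally:
                # requeue only on completion, one tick from now
                t = threading.Timer(self._tick, self._resubmit, args=(self,))
                t.daemon = True
                t.start()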

(I'm continuing from here but this probably deserves a new thread. However.)

----- Original Message -----
From: "Federico Simoncelli" <fsimonce@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com>, "Francesco Romani" <fromani@redhat.com> Sent: Wednesday, July 9, 2014 4:57:53 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
So basically the current threading model is the behavior we want?
If some call get stuck, stop sampling this vm. Continue when the call returns.
Michal? Federico?
Yep - but with fewer threads, and surely with a constant number of them. Your schedule library (review in my queue at very high priority) is indeed a nice step in this direction.
Waiting for Federico's ack.
That looks good. Now I would like to summarize few things.
We know that when a request gets stuck on a vm also the subsequent ones will get stuck (at least until their timeout is up, except for the first one that could stay there forever).
We want a limited number of threads polling the statistics (trying to match the number of threads that libvirt has).
Given those two assumptions we want a thread pool of workers that are picking up jobs *per-vm*. The jobs should be smart enough to:
- understand what samples they have to take in that cycle (cpu? network? etc.)
- resubmit themselves in the queue
Now this will ensure that in the queue there's only one job per-vm and if it gets stuck it is not re-submitted (no other worker will get stuck).
In the last few days I was thinking really hard and long about our last discussions, feedback and proposals, and how to properly fit all the pieces together. Michal and I also had a chat about these topics on Friday, and eventually I came up with this new draft:

http://gerrit.ovirt.org/#/c/29977

(yes, that's it, just this) which builds on Nir's Schedule, the existing Threadpool hidden inside vdsm/storage, and which I believe provides a much, much better ground for further development or discussion.

Driving forces behind this new draft:
- minimize bloat.
- minimize changes.
- separate concerns nicely (the Scheduler schedules, the threadpool executes, Sampling cares about the actual sampling only).
- leverage existing infrastructure as much as possible; avoid introducing new fancy stuff unless absolutely needed.

And here it is. Almost all the concepts and requirements we discussed are there. The thing which is lacking here is strong isolation between VMs/samplings. This new concept does nothing to recover stuck worker threads: if the pool is exhausted, everything eventually stops, after a few sampling intervals. Stuck jobs are detected and the corresponding VMs are marked unresponsive (leveraging existing infrastructure). When (if?) stuck jobs eventually start working again, everything else restarts as well.

The changes are minimal, and there is still room for refactoring and cleanup, but I believe the design is nicer and cleaner.

Further steps:
* replace the existing thread pool with a fancier one which can replace stuck threads, or dynamically resize itself, to achieve better isolation among VMs or jobs?
* split the new VmStatsCollector class into smaller components?
* stale data detection. Planned but not yet there, I just need to work out how to properly fit it into the AdvancedStatsFunction windowing sample. Shouldn't be a big deal, however.

I also already have quite a few cleanup patches for the existing threadpool and for the sampling code in the queue; some are on gerrit, some are not. I think most of them can wait until we agree on the overall design.

Nir also provided further suggestions (thanks for that!) and possible design alternatives which I'm now evaluating carefully.

Thanks and bests,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani
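For illustration only, the "stuck jobs are detected and the corresponding VMs are marked unresponsive" part could look roughly like this (a sketch, not the code in the gerrit draft; the names and the timeout value are invented):

    import time

    class JobTracker(object):
        """Remember when each vm's sampling job started, so a periodic
        check can flag vms whose job has been running for too long."""

        TIMEOUT = 60  # seconds, arbitrary value for the example

        def __init__(self):
            self._started = {}   # vm_id -> start time of the in-flight job

        def job_started(self, vm_id):
            self._started[vm_id] = time.time()

        def job_finished(self, vm_id):
            self._started.pop(vm_id, None)

        def check(self, mark_unresponsive):
            now = time.time()
            for vm_id, started in list(self._started.items()):
                if now - started > self.TIMEOUT:
                    # hook into the existing 'vm not responding' machinery
                    mark_unresponsive(vm_id)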

----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Saggi Mizrahi" <smizrahi@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Saturday, July 12, 2014 1:59:22 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
(I'm continuing from here but this probably deserves a new thread. However.)
----- Original Message -----
From: "Federico Simoncelli" <fsimonce@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com>, "Francesco Romani" <fromani@redhat.com> Sent: Wednesday, July 9, 2014 4:57:53 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
So basically the current threading model is the behavior we want?
If some call get stuck, stop sampling this vm. Continue when the call returns.
Michal? Federico?
Yep - but with fewer threads, and surely with a constant number of them. Your schedule library (review in my queue at very high priority) is indeed a nice step in this direction.
Waiting for Federico's ack.
That looks good. Now I would like to summarize few things.
We know that when a request gets stuck on a vm also the subsequent ones will get stuck (at least until their timeout is up, except for the first one that could stay there forever).
We want a limited number of threads polling the statistics (trying to match the number of threads that libvirt has).
Given those two assumptions we want a thread pool of workers that are picking up jobs *per-vm*. The jobs should be smart enough to:
- understand what samples they have to take in that cycle (cpu? network? etc.)
- resubmit themselves in the queue
Now this will ensure that in the queue there's only one job per-vm and if it gets stuck it is not re-submitted (no other worker will get stuck).
In the last few days I was thinking really hard and long about our last discussions, feedback and proposals, and how to properly fit all the pieces together. Michal and I also had a chat about these topics on Friday, and eventually I came up with this new draft
http://gerrit.ovirt.org/#/c/29977
(yes, that's it, just this) which builds on Nir's Schedule, the existing Threadpool hidden inside vdsm/storage, and which I believe provides a much, much better ground for further development or discussion.

Driving forces behind this new draft:
- minimize bloat.
- minimize changes.
- separate concerns nicely (the Scheduler schedules, the threadpool executes, Sampling cares about the actual sampling only).
- leverage existing infrastructure as much as possible; avoid introducing new fancy stuff unless absolutely needed.
And here it is. Almost all the concepts and requirements we discussed are there. The thing which is lacking here is strong isolation about VMs/samplings.
This new concept does nothing to recover stuck worker threads: if the pool is exhausted, everything eventually stops, after a few sampling intervals. Stuck jobs are detected and the corresponding VMs are marked unresponsive (leveraging existing infrastructure). When (if?) stuck jobs eventually start working again, everything else restarts as well.
The changes are minimal, and there is still room for refactoring and cleanup, but I believe the design is nicer and cleaner.
Further steps:
* replace the existing thread pool with a fancier one which can replace stuck threads, or dynamically resize itself, to achieve better isolation among VMs or jobs?
* Split the new VmStatsCollector class into smaller components?
* Stale data detection. Planned but not yet there, I just need to work out how to properly fit it into the AdvancedStatsFunction windowing sample. Shouldn't be a big deal, however.
I also have already quite few cleanup patches for the existing threadpool and for the sampling code in the queue, some are on gerrit, some are not.
I think most of them can wait once we agree on the overall design.
Nir also provided further suggestions (thanks for that!) and possible design alternatives which I'm now evaluating carefully.
I agree with Federico and you - I think this is the way we should explore. But I don't understand the way you are implementing this using the scheduler in http://gerrit.ovirt.org/29977, and it seems that this does not ensure that every VM has only one sampling task running at the same time.

I started to work on a prototype during the last week and I think that this is the right way to implement it. Please check this patch: http://gerrit.ovirt.org/29980

This uses the current storage thread pool, but I don't think it is good enough. I think we should continue with http://gerrit.ovirt.org/29191 so we can handle stuck worker threads without decreasing the work force of the thread pool.

Nir

Hi, I will find time to go in depth in both patches but I still think that this is solving the problem in the wrong place.

You want to have the libvirt connection management code separate from the sampling. You want to have a libvirt connection pool that you can queue requests to and track their progress. The connection pool would have, internally, a thread pool with the same amount of threads as libvirt has (read libvirt conf?). The amount of connections is unimportant if I understand libvirt's design correctly.

By giving the rest of VDSM an async view of libvirt operations we gain a few things:
A) All calls to libvirt enjoy these features even if they are not sampling.
B) We simplify the sampling code.

So let's assume we have a global libvirtConnectionPool. The API is something like:

    lcp.<nameOfOperation>(<args>)  # returns a promise

for example:

    promise = lcp.listAllDomains()

    if promise.wait(timeout):
        try:
            promise.cancel()
        except InProgress:
            pass  # handle
        except AlreadyFinished:
            pass

Sampling would be done with the scheduler, and a separate thread pool would be used to handle the response or lack thereof.

    def sample():
        p = lcp.doSampling()
        # setCallback() calls the callback immediately
        # if p is already done.
        p.setCallback(onDoneSampling, args=(p,))
        sched.schedule(timeout, onSamplingTimeout, args=(p,))

    def onSamplingTimeout(p):
        status = p.getStatus()
        if status == InProgress:
            # In flight, so it's probably stuck
            # maybe one day we would be able to do something
            # other than log and weep
            log.warning("Sampling of vm X is stuck :(")
            sched.schedule(timeout, onSamplingTimeout, args=(p,))
        elif status == Finished:
            # Race, finished just in time, will be handled
            # in the callback shortly
            pass
        else:
            log.warning("libvirt under heavy load :(")

    def onDoneSampling(p):
        threadpool.queue(doParsing(p.response()))

That way the threadpool, sampling, and libvirt connection management are all separate. Also, we can hide how we optimize libvirt connections/threads and keep it orthogonal to how the sampling works. We could have more operations free the threadpool thread they are using if they are waiting for libvirt ops. If we ever get async APIs we could just have the libvirtConnectionManagement layer handle that for us and keep on using the same interface internally.
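For illustration, one possible shape of such a promise object (a hypothetical sketch only, not an existing VDSM or libvirt API; all names are invented) could be:

    # Hypothetical promise sketch: the worker thread calls _resolve() when the
    # libvirt call finishes; callers can wait with a timeout or register a
    # callback that fires immediately if the call is already done.
    import threading

    InProgress, Finished, Failed = range(3)

    class Promise(object):
        def __init__(self):
            self._done = threading.Event()
            self._lock = threading.Lock()
            self._callbacks = []
            self._result = None
            self._status = InProgress

        def wait(self, timeout=None):
            return self._done.wait(timeout)   # True if finished in time

        def getStatus(self):
            return self._status

        def response(self):
            return self._result

        def setCallback(self, cb, args=()):
            with self._lock:
                if not self._done.is_set():
                    self._callbacks.append((cb, args))
                    return
            cb(*args)   # already finished: call right away

        def _resolve(self, result, status=Finished):
            with self._lock:
                self._result = result
                self._status = status
                self._done.set()
                callbacks, self._callbacks = self._callbacks, []
            for cb, args in callbacks:
                cb(*args)

Cancellation is deliberately left out of the sketch: as discussed elsewhere in this thread, there is no reliable way to abort a call that is already stuck inside libvirt/QEMU.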
From: "Nir Soffer" <nsoffer@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Saggi Mizrahi" <smizrahi@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Saturday, July 12, 2014 8:07:27 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Saggi Mizrahi" <smizrahi@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Saturday, July 12, 2014 1:59:22 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
(I'm continuing from here but this probably deserves a new thread. However.)
----- Original Message -----
From: "Federico Simoncelli" <fsimonce@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com>, "Francesco Romani" <fromani@redhat.com> Sent: Wednesday, July 9, 2014 4:57:53 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
So basically the current threading model is the behavior we want?
If some call get stuck, stop sampling this vm. Continue when the call returns.
Michal? Federico?
Yep - but with less threads, and surely with a constant number of them. Your schedule library (review in my queue at very high priority) is indeed a nice step in this direcation.
Waiting for Federico's ack.
That looks good. Now I would like to summarize few things.
We know that when a request gets stuck on a vm also the subsequent ones will get stuck (at least until their timeout is up, except for the first one that could stay there forever).
We want a limited number of threads polling the statistics (trying to match the number of threads that libvirt has).
Given those two assumptions we want a thread pool of workers that are picking up jobs *per-vm*. The jobs should be smart enough to:
- understand what samples they have to take in that cycle (cpu? network? etc.) - resubmit themselves in the queue
Now this will ensure that in the queue there's only one job per-vm and if it gets stuck it is not re-submitted (no other worker will get stuck).
In the last few days I was thinking really hard and long about our last discussions, feedback and proposals and how to properly fit all the pieces together. Michal and me also had a chat about this topics on Friday, and eventually I come up with this new draft
http://gerrit.ovirt.org/#/c/29977
(yes, that's it, just this) which builds on Nir's Schedule, the existing Threadpool hidden inside vdsm/storage, and which I believe provides a much, much better ground for further development or discussion . Driving forces behind this new draft: - minimize bloat. - minimize changes. - separate nicely concerns (Scheduler schedules, threadpool executes, Sampling cares about the actual sampling only). - leverage as much as possible existing infrastracture; avoid to introduce new fancy stuff unless absolutely needed.
And here it is. Almost all the concepts and requirements we discussed are there. The thing which is lacking here is strong isolation about VMs/samplings.
This new concept does nothing to recover stuck worker threads: if the pool is exausted, everything eventually stops, after a few sampling intervals. Stuck jobs are detected and the corresponding VMs are marked unresponsive (leveraging existing infrastructure). When (if?) stuck jobs eventually restart working, everything else restarts as well.
The changes are minimal, and there is still room for refactoring and cleanup, but I believe the design is nicer and cleaner.
Further steps: * replace existing thread pool with a fancier one which can replace stuck threads, or dinamically resize himself, to achieve better isolation among VMs or jobs? * Split the new VmStatsCollector class in smaller components? * Stale data detection. Planned but not yet there, I just need to get how to properly fit it into the AdvancedStatsFunction windowing sample. Should nt be a big deal, however.
I also have already quite few cleanup patches for the existing threadpool and for the sampling code in the queue, some are on gerrit, some are not.
I think most of them can wait once we agree on the overall design.
Nir also provided further suggestions (thanks for that!) and possible design alternatives which I'm now evaluating carefully.
I agree with Federico and you - I think this is the way we should explore.
But I don't understand the way you are implementing this using the scheduler in http://gerrit.ovirt.org/29977, and it seems that this does not ensure that every vm has only one sampling task running at the same time.
I started to work on a prototype during the last week and I think that this is the right way to implement. Please check this patch: http://gerrit.ovirt.org/29980
This use the current storage thread pool, but I don't think it is good enough. I think we should continue with http://gerrit.ovirt.org/29191 so we can handle stuck worker thread without decreasing the work force of the thread pool.
Nir

----- Original Message -----
From: "Saggi Mizrahi" <smizrahi@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com>, "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 1:03:20 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Hi, I will find time to go in depth in both patches but I still think that this is solving the problem in the wrong place.
You want to have the libvirt connection management code to be separate from the sampling.
It is - we did not suggest changing libvirt connections.
You want to have a libvirt connection pool that you can queue requests to and track their progress. The connection pool would have, internally, a thread pool with the same amount of threads as libvirt has (read libvirt conf?). The amount of connections is unimportant if I understand libvirt's design correctly.
If we don't need multiple connections, then what does the connection pool do? I guess that you don't mean a *connection* pool.
By giving the rest of VDSM an async view of libvirt operations we gain a few things:
A) All calls to libvirt enjoy these features even if they are not sampling. B) We simplify the sampling code.
So let's assume we have a global libvirtConnectionPool.
The API is something like:

    lcp.<nameOfOperation>(<args>)  # returns a promise

for example:

    promise = lcp.listAllDomains()

    if promise.wait(timeout):
        try:
            promise.cancel()
        except InProgress:
            pass  # handle
        except AlreadyFinished:
            pass
So what you suggest is actually adding a timeout for libvirt calls, without having such a timeout on the libvirt side. If we had this on the libvirt side, we would not have to add anything, right?

The current code that blocks on a libvirt call will block on a condition variable, and the real call will block on libvirt in a worker thread. If the worker thread finishes before the timeout, the calling code will get the result, and if it is stuck or finishes after the timeout, the calling code will get a timeout.

This is nice, but we are left with the worker thread stuck on libvirt, or, if it finished after the timeout, we lost the result of the call.

This design implies 2 threads - the calling thread and the worker thread - so it does not scale and cannot be used for sampling, where we would like to have only one thread triggering sampling calls, and the worker threads executing them.

For the API calls, where we have a connection/request thread running libvirt code, this adds a timeout at a small additional cost (a few more worker threads), plus a condition variable and lock for each worker thread.

Michal, Francesco: do we want a timeout on libvirt calls, leaving threads stuck on libvirt, or do we prefer to wait for the libvirt call to return?
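A rough sketch of the mechanism described above (illustration only, with invented names): the caller waits on an event with a timeout while a worker thread performs the real blocking call; on timeout the caller gives up, but the worker thread stays stuck and its eventual result is lost, exactly as noted.

    # Sketch: timeout around a blocking call via a sacrificial worker thread.
    import threading

    class CallTimeout(Exception):
        pass

    def call_with_timeout(func, args, timeout):
        done = threading.Event()
        box = {}

        def runner():
            try:
                box['result'] = func(*args)   # may block forever on libvirt
            except Exception as exc:
                box['error'] = exc
            done.set()

        worker = threading.Thread(target=runner)
        worker.daemon = True                  # do not keep the process alive
        worker.start()

        if not done.wait(timeout):
            # The call is stuck or slow: the caller moves on, the worker
            # thread remains blocked and its result (if any) is discarded.
            raise CallTimeout("call did not return within %s seconds" % timeout)
        if 'error' in box:
            raise box['error']
        return box['result']

This makes the scaling problem visible: every guarded call costs a dedicated worker thread on top of the calling thread.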
Sampling would be done with the scheduler, and a separate thread pool would be used to handle the response or lack thereof.
For sampling, we came to the conclusion that we don't want to abort a call after a timeout; we want to stop sampling for this VM until the call returns. So having a timeout is a non-goal.
    def sample():
        p = lcp.doSampling()
        # setCallback() calls the callback immediately
        # if p is already done.
        p.setCallback(onDoneSampling, args=(p,))
        sched.schedule(timeout, onSamplingTimeout, args=(p,))

    def onSamplingTimeout(p):
        status = p.getStatus()
        if status == InProgress:
            # In flight, so it's probably stuck
            # maybe one day we would be able to do something
            # other than log and weep
            log.warning("Sampling of vm X is stuck :(")
            sched.schedule(timeout, onSamplingTimeout, args=(p,))
        elif status == Finished:
            # Race, finished just in time, will be handled
            # in the callback shortly
            pass
        else:
            log.warning("libvirt under heavy load :(")

    def onDoneSampling(p):
        threadpool.queue(doParsing(p.response()))
That way the threadpool, sampling, and libvirt connection management are all separate. Also, we can hide how we optimize libvirt connections/threads and keep it orthogonal to how the sampling works.
The current patches do not change libvirt connection management; this is an orthogonal issue. They are only about changing the way we do sampling.
We could have more operations free the threadpool thread they are using if they are waiting for libvirt ops.
- You have a worker thread stuck on a libvirt API call. On the libvirt side, a libvirt worker thread is blocked on the qemu monitor, which is blocked in the kernel in uninterruptible wait.
- You run code on another thread detecting that the first thread is stuck. This is what Francesco is suggesting - have a supervisor thread that handles stuck threads.
Now what are you going to do about it?
If we ever get async APIs we could just have the libvirtConnectionManagement layer handle that for us and keep on using the same interface internally.
I think that what you suggest is orthogonal to the sampling problem that Francesco and I are trying to tackle.

From: "Nir Soffer" <nsoffer@redhat.com> To: "Saggi Mizrahi" <smizrahi@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 2:34:50 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Saggi Mizrahi" <smizrahi@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com>, "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 1:03:20 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Hi, I will find time to go in depth in both patches but I still think that this is solving the problem in the wrong place.
You want to have the libvirt connection management code to be separate from the sampling.
It is - we did not suggest changing libvirt connections.
You want to have a libvirt connection pool that you can queue requests to and track their progress. The connection pool would have, internally, a thread pool with the same amount of threads as libvirt has (read libvirt conf?). The amount of connections is unimportant if I understand libvirt's design correctly.
If we don't need multiple connections, then what does the connection pool do? I guess that you don't mean a *connection* pool.
By giving the rest of VDSM an async view of libvirt operations we gain a few things:
A) All calls to libvirt enjoy these features even if they are not sampling. B) We simplify the sampling code.
So let's assume we have a global libvirtConnectionPool.
The API is something like:

    lcp.<nameOfOperation>(<args>)  # returns a promise

for example:

    promise = lcp.listAllDomains()

    if promise.wait(timeout):
        try:
            promise.cancel()
        except InProgress:
            pass  # handle
        except AlreadyFinished:
            pass
So what you suggest is actually adding a timeout for libvirt calls, without having such a timeout on the libvirt side. If we had this on the libvirt side, we would not have to add anything, right?
Yes.
The current code that blocks on a libvirt call will block on a condition variable, and the real call will block on libvirt in a worker thread.
This is what the callback is for. You would not wait on anything, and once the call is done the callback will be called.
If the worker thread finishes before the timeout, the calling code will get the result, and if it is stuck or finishes after the timeout, the calling code will get a timeout.
The timeout is optional; in the example I gave I just queue a sched event to put something in the log.
This is nice, but we are left with the worker thread stuck on libvirt, or, if it finished after the timeout, we lost the result of the call.
This design implies 2 threads - the calling thread and the worker thread - so it does not scale and cannot be used for sampling, where we would like to have only one thread triggering sampling calls, and the worker threads executing them.
For the API calls, where we have a connection/request thread running libvirt code, this adds a timeout at a small additional cost (a few more worker threads), plus a condition variable and lock for each worker thread.
No, the point is having a thread in VDSM per available thread in libvirt. So in cases where all libvirt threads are stuck, all of our threads are stuck, but the rest of the system can handle it. Sampling will just log every few minutes, just for the record, but other operations might return an error because it took too long for the operation to get to libvirt.
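A minimal sketch of that sizing idea (invented names, not a real API): bound the number of in-flight libvirt calls to libvirt's own worker count, and when every slot is taken by a stuck call, fail fast instead of queueing behind libvirt. Sampling can swallow the error and just log; other operations can surface it to the caller.

    # Sketch: cap in-flight libvirt calls at libvirt's worker count and
    # fail fast when all slots are occupied by stuck calls.
    import threading

    class AllWorkersBusy(Exception):
        pass

    class BoundedDispatcher(object):
        def __init__(self, num_workers):
            self._slots = threading.Semaphore(num_workers)

        def submit(self, func, *args):
            if not self._slots.acquire(False):   # no free slot right now
                raise AllWorkersBusy()

            def runner():
                try:
                    func(*args)                  # the blocking libvirt call
                finally:
                    self._slots.release()

            t = threading.Thread(target=runner)
            t.daemon = True
            t.start()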
Michal, Francesco: do we want a timeout on libvirt calls, leaving threads stuck on libvirt, or do we prefer to wait for the libvirt call to return?
Sampling would be done with the scheduler, and a separate thread pool would be used to handle the response or lack thereof.
For sampling, we came to the conclusion that we don't want to abort a call after a timeout; we want to stop sampling for this VM until the call returns. So having a timeout is a non-goal.
    def sample():
        p = lcp.doSampling()
        # setCallback() calls the callback immediately
        # if p is already done.
        p.setCallback(onDoneSampling, args=(p,))
        sched.schedule(timeout, onSamplingTimeout, args=(p,))

    def onSamplingTimeout(p):
        status = p.getStatus()
        if status == InProgress:
            # In flight, so it's probably stuck
            # maybe one day we would be able to do something
            # other than log and weep
            log.warning("Sampling of vm X is stuck :(")
            sched.schedule(timeout, onSamplingTimeout, args=(p,))
        elif status == Finished:
            # Race, finished just in time, will be handled
            # in the callback shortly
            pass
        else:
            log.warning("libvirt under heavy load :(")

    def onDoneSampling(p):
        threadpool.queue(doParsing(p.response()))
That way the threadpool, sampling, and libvirt connection management are all separate. Also, we can hide how we optimize libvirt connections/threads and keep it orthogonal to how the sampling works.
The current patches do not change libvirt connection management; this is an orthogonal issue. They are only about changing the way we do sampling.
As I've been saying, I think the problem is actually in the libvirt connection management and not in the stats operations.
We could have more operations free the threadpool thread they are using if they are waiting for libvirt ops.
- You have a worker thread stuck on a libvirt API call. On the libvirt side, a libvirt worker thread is blocked on the qemu monitor, which is blocked in the kernel in uninterruptible wait.
- You run code on another thread detecting that the first thread is stuck. This is what Francesco is suggesting - have a supervisor thread that handles stuck threads.
Now what are you going to do about it?
If we ever get async APIs we could just have the libvirtConnectionManagement layer handle that for us and keep on using the same interface internally.
I think that what you suggest is orthogonal to the sampling problem that Francesco and I are trying to tackle.

----- Original Message -----
From: "Saggi Mizrahi" <smizrahi@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 5:43:28 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
[...]
The current patches do not change libvirt connection management; this is an orthogonal issue. They are only about changing the way we do sampling. As I've been saying, I think the problem is actually in the libvirt connection management and not in the stats operations.
Well yes, I think that better libvirt connection management is another way to reach the goal, granted it could detect and signal a stuck call to the upper layer. With that in place, the sampling code is simpler, and there is no need for a fancy thread pool - even though we may need something like this in the connection handling code internals.

*Maybe* a supervisor-like approach like my very first proposal could work, but a very good point made by Nir is how to tell when something is 'stuck', since only tasks really know their timeout. For sampling it is easy - it is the sampling interval - but how to convey this timeout in a generic manner to the connection layer?

Bests,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani

----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com> To: devel@ovirt.org Cc: "Nir Soffer" <nsoffer@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com>, "Saggi Mizrahi" <smizrahi@redhat.com> Sent: Tuesday, July 15, 2014 6:50:45 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Saggi Mizrahi" <smizrahi@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 5:43:28 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
[...]
The current patches do not change libvirt connection management; this is an orthogonal issue. They are only about changing the way we do sampling. As I've been saying, I think the problem is actually in the libvirt connection management and not in the stats operations.
Well yes, I think that better libvirt connection management is another way to reach the goal, granted it could detect and signal a stuck call to the upper layer.
With that in place, the sampling code is simpler, and there is no need for a fancy thread pool. Even though we may need something like this in the connection handling code internals.
Can you explain how you would solve the sampling issue with better connection management? Since libvirt does not have an async API (yet), it seems that this would just move the thread pool to the connection layer.
*Maybe* a supervisor-like approach like my very first proposal could work, but a very good point made by Nir is how to tell when something is 'stuck', since only tasks really know their timeout.
For sampling it is easy - it is the sampling interval - but how to convey this timeout in a generic manner to the connection layer?
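One generic way this could be conveyed (a hypothetical sketch only, with invented names): each task carries its own deadline down to the connection layer - the sampling interval for sampling calls, something else for other verbs - and a supervisor simply compares elapsed time against each task's own timeout, without hardcoding any value.

    # Sketch: tasks carry their own timeout so a generic supervisor can
    # report them as stuck without knowing what kind of call they are.
    import threading
    import time

    class TrackedCall(object):
        def __init__(self, name, timeout):
            self.name = name
            self.timeout = timeout            # the caller's own deadline
            self.started = time.time()
            self.done = threading.Event()

        def is_stuck(self):
            return (not self.done.is_set() and
                    time.time() - self.started > self.timeout)

    class Supervisor(object):
        def __init__(self):
            self._lock = threading.Lock()
            self._calls = []

        def track(self, name, timeout):
            call = TrackedCall(name, timeout)
            with self._lock:
                self._calls.append(call)
            return call

        def check(self):
            """Return the calls that exceeded their own timeout."""
            with self._lock:
                self._calls = [c for c in self._calls if not c.done.is_set()]
                return [c for c in self._calls if c.is_stuck()]

The supervisor decides nothing about cancellation here; it only reports to the upper layer, which then decides how to react.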

----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com>, "Saggi Mizrahi" <smizrahi@redhat.com> Sent: Tuesday, July 15, 2014 7:28:51 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
The current patches do not change libvirt connection management; this is an orthogonal issue. They are only about changing the way we do sampling. As I've been saying, I think the problem is actually in the libvirt connection management and not in the stats operations.
Well yes, I think that better libvirt connection management is another way to reach the goal, granted it could detect and signal a stuck call to the upper layer.
With that in place, the sampling code is simpler, and there is no need for a fancy thread pool. Even though we may need something like this in the connection handling code internals.
Can you explain how you would solve the sampling issue with better connection management?
Since libvirt does not have an async API (yet), it seems that this would just move the thread pool to the connection layer.
Exactly like that. I'm not very happy with this approach. More precisely, I can't see a satisfying way to implement this approach, thus I prefer to go ahead with the route we already took.

Bests,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani

----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Saggi Mizrahi" <smizrahi@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 1:34:50 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
This uses the current storage thread pool, but I don't think it is good enough. I think we should continue with http://gerrit.ovirt.org/29191 so we can handle stuck worker threads without decreasing the work force of the thread pool.
Hi, just a quick note until I get up to speed with the last emails.

Since my first threadpool draft was bloated and overengineered, I restarted from scratch: https://github.com/mojaves/pyexecutor/blob/master/executor.py

It is now down to 109 lines, including license header and blank lines. It's on github only so as not to spam gerrit, so anyone who wishes can track the evolution of the code. I'll update my gerrit patchset as soon as I have docs and tests.

Bests,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani

I share most of Nir's answers, let me just add some information here and there.
----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com> To: "Saggi Mizrahi" <smizrahi@redhat.com> Cc: "Francesco Romani" <fromani@redhat.com>, devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 1:34:50 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Saggi Mizrahi" <smizrahi@redhat.com> To: "Nir Soffer" <nsoffer@redhat.com>, "Francesco Romani" <fromani@redhat.com> Cc: devel@ovirt.org, "Michal Skrivanek" <mskrivan@redhat.com>, "Federico Simoncelli" <fsimonce@redhat.com>, "Dan Kenigsberg" <danken@redhat.com> Sent: Sunday, July 13, 2014 1:03:20 PM Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
Hi, I will find time to go in depth in both patches but I still think that this is solving the problem in the wrong place.
You want to have the libvirt connection management code to be separate from the sampling.
It is - we did not suggest changing libvirt connections.
Indeed. My biggest reason for that actually boils down to what Federico suggested in his last email. By acknowledging stuck calls in the upper layer I believe we can manage them better overall. I think it is more akin to a flow error: it cannot be recovered in the lower layers, nor can the connection layer provide reliable error handling, so let's just handle this at the application level.
    if promise.wait(timeout):
        try:
            promise.cancel()
        except InProgress:
            pass  # handle
        except AlreadyFinished:
            pass
So what you suggest is actually adding a timeout for libvirt calls, without having such a timeout on the libvirt side. If we had this on the libvirt side, we would not have to add anything, right?
The current code that blocks on a libvirt call will block on a condition variable, and the real call will block on libvirt in a worker thread.
If the worker thread finishes before the timeout, the calling code will get the result, and if it is stuck or finishes after the timeout, the calling code will get a timeout.
This is nice, but we are left with the worker thread stuck on libvirt, or, if it finished after the timeout, we lost the result of the call.
This design implies 2 threads - the calling thread and the worker thread - so it does not scale and cannot be used for sampling, where we would like to have only one thread triggering sampling calls, and the worker threads executing them.
For the API calls, where we have a connection/request thread running libvirt code, this adds a timeout at a small additional cost (a few more worker threads), plus a condition variable and lock for each worker thread.
Michal, Francesco: do we want a timeout on libvirt calls, leaving threads stuck on libvirt, or do we prefer to wait for the libvirt call to return?
Well, I see little benefit in an additional timeout. And surely (sorry for the repetition, but IMO it is worth stressing) we need to be nice to libvirt, thus waiting does the least harm, because we want to avoid clogging libvirt by issuing more calls that we know in advance will fail (since one is still stuck holding the QEMU monitor). If we patiently wait for the libvirt call to get unstuck, we also know precisely when it is safe to restart sampling/calling libvirt - right after the call returns. Otherwise we would have to implement some polling, or just call and hope (not good).

Having said that, let me mention the timeout machinery we already have in place: vdsm/virt/vm.py :: NotifyingVirDomain. Please note that this code is not sampling- or stuck-handling-specific and was introduced long ago. This code doesn't change anything we said or did, because the timeout will be raised when
1. a call is already stuck inside the qemu monitor
2. another call tries to enter the same monitor
The above behaviour (and maybe a timeout) doesn't change the picture, or the 'be nice with libvirt' requirement (well, it is probably more like 'do not blindly drop the ball to libvirt, because this will just make things worse'). It could just help us to detect earlier that a VM is stuck.

Saggi is right when he says the async-ness is one of the core issues here, but another, maybe even more important one, is the stuck-ness of calls. And unfortunately,
1. we will know a call is stuck only *after* it happened (timeout, supervisor, anything else)
2. there is no reliable way to *cancel* a stuck call/task; even dropping the connection won't help, as Federico reminded us.

In the design we are evolving, we shall:
1. for a VM *already* blocked, and somehow detected as such, suspend sampling until libvirt unblocks. This state has to be reported to Engine, but that's it.
2. for the *other* VMs, replace the stuck worker threads with fresh ones (fancy threadpool approach) - http://gerrit.ovirt.org/#/c/29191 - otherwise everything stops when the pool is exhausted, even if some VMs continue to behave correctly.

Lastly, about the async-ness of the libvirt API, it is maybe worth (re)mentioning https://www.redhat.com/archives/libvir-list/2014-July/msg00139.html - relevant snippet: "[...] The libvirt-gobject binding to libvirt provides async APIs to libvirt APIs. It does this by using threads internally. [...]"
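To make point 2 above more concrete - as a sketch only, not the code in executor.py or in the gerrit patch, and with invented names throughout - a pool can keep its work force constant by spawning a replacement worker whenever a task overruns its deadline, simply abandoning the stuck thread:

    # Sketch: a worker that overruns its per-task deadline is replaced by a
    # fresh one; the stuck thread is left behind and exits if/when it returns.
    import threading
    try:
        import Queue as queue    # Python 2
    except ImportError:
        import queue             # Python 3

    class ReplacingExecutor(object):
        def __init__(self, workers, task_timeout):
            self._tasks = queue.Queue()
            self._timeout = task_timeout
            for _ in range(workers):
                self._spawn()

        def dispatch(self, func):
            self._tasks.put(func)

        def _spawn(self):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()

        def _worker(self):
            while True:
                func = self._tasks.get()
                finished = threading.Event()
                replaced = threading.Event()
                watchdog = threading.Timer(
                    self._timeout, self._replace, args=(finished, replaced))
                watchdog.daemon = True
                watchdog.start()
                try:
                    func()                 # may block for a long time
                except Exception:
                    pass                   # a real pool would log this
                finally:
                    finished.set()
                    watchdog.cancel()
                if replaced.is_set():
                    return                 # a fresh worker took our place

        def _replace(self, finished, replaced):
            if not finished.is_set():
                replaced.set()
                self._spawn()              # keep the work force constant

The sampling jobs dispatched to such a pool would still skip VMs known to be blocked (point 1), so replacement only covers the cases where stuck-ness is discovered the hard way.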
Sampling would be done with the scheduler, and a separate thread pool would be used to handle the response or lack thereof.
For sampling, we came to the conclusion that we don't want to abort a call after a timeout; we want to stop sampling for this VM until the call returns. So having a timeout is a non-goal.
Indeed. More precisely, until we know the QEMU monitor for a VM is safe again, there is no point in trying to use it again.
That way the threadpool, sampling, and libvirt connection management are all separate. Also, we can hide how we optimize libvirt connections/threads and keep it orthogonal to how the sampling works.
The current patches do not change libvirt connection management; this is an orthogonal issue. They are only about changing the way we do sampling.
Agreed
If we ever get async APIs we could just have the libvirtConnectionManagement layer handle that for us and keep on using the same interface internally.
I think that what you suggest is orthogonal to the sampling problem that Francesco and I are trying to tackle.
Agreed again.

Bests,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328 IRC: fromani
participants (6): Adam Litke, Federico Simoncelli, Francesco Romani, Michal Skrivanek, Nir Soffer, Saggi Mizrahi