
----- Original Message -----
From: "Francesco Romani" <fromani@redhat.com>
To: "Nir Soffer" <nsoffer@redhat.com>
Cc: devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com>
Sent: Wednesday, July 9, 2014 2:22:05 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
----- Original Message -----
From: "Nir Soffer" <nsoffer@redhat.com>
To: "Francesco Romani" <fromani@redhat.com>
Cc: devel@ovirt.org, "Federico Simoncelli" <fsimonce@redhat.com>, "Michal Skrivanek" <mskrivan@redhat.com>, "Adam Litke" <alitke@redhat.com>
Sent: Monday, July 7, 2014 4:53:29 PM
Subject: Re: [ovirt-devel] [VDSM][sampling] thread pool status and handling of stuck calls
If one vm may stop responding, causing all libvirt calls for this vm to block, then a thread pool with one connection per worker thread can lead to a failure when all connections happen to run a request that blocks the thread. If this is the case, then each task related to one vm must depend on other tasks and should not be skipped until the previous task returned, simulating the current threading model without creating hundreds of threads.
Agreed, we should introduce this concept and this is lacking in my threadpool proposal.
So basically the current threading model is the behavior we want?
If some call gets stuck, stop sampling this vm. Continue when the call returns.
Michal? Federico?
Yep - but with fewer threads, and surely with a constant number of them. Your schedule library (review in my queue at very high priority) is indeed a nice step in this direction.
Waiting for Federico's ack.
That looks good. Now I would like to summarize a few things.

We know that when a request gets stuck on a vm, the subsequent ones will also get stuck (at least until their timeout is up, except for the first one, which could stay there forever).

We want a limited number of threads polling the statistics (trying to match the number of threads that libvirt has).

Given those two assumptions, we want a thread pool of workers that pick up jobs *per-vm*. The jobs should be smart enough to:

- understand what samples they have to take in that cycle (cpu? network? etc.)
- resubmit themselves in the queue

This will ensure that there is only one job per vm in the queue, and if it gets stuck it is not resubmitted (no other worker will get stuck).

Additionally, I think someone mentioned re-connection to libvirt in case of stuck threads. I actually want to discourage (or really minimize) this behavior because I can't think of a case where it would improve the situation (it may just end up generating a large number of zombie threads on the libvirt side).

-- Federico
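
For illustration, the per-vm job idea summarized above could be sketched roughly as follows. This is a minimal sketch, not VDSM code: VmSamplingJob, worker, start_workers, the samplers mapping and max_cycles are all hypothetical names, the sampling callables stand in for libvirt calls, and the delay between cycles that real sampling would need is left out.

```python
import queue
import threading


class VmSamplingJob:
    """One job per vm (hypothetical name, not a VDSM class)."""

    def __init__(self, vm_id, samplers, max_cycles):
        # samplers maps a sample name to (period_in_cycles, callable).
        self.vm_id = vm_id
        self.samplers = samplers
        self.max_cycles = max_cycles  # bounded only so the sketch terminates
        self.cycle = 0
        self.results = []
        self.finished = threading.Event()

    def due_samples(self):
        # Decide which samples have to be taken in the current cycle.
        return [name for name, (period, _) in sorted(self.samplers.items())
                if self.cycle % period == 0]

    def run(self, jobs):
        for name in self.due_samples():
            func = self.samplers[name][1]
            # This call may block for a long time if the vm is unresponsive.
            self.results.append((self.cycle, name, func(self.vm_id)))
        self.cycle += 1
        if self.cycle < self.max_cycles:
            # Resubmit only after the cycle returned, so a stuck vm
            # never has more than one job occupying a worker.
            jobs.put(self)
        else:
            self.finished.set()


def worker(jobs):
    # Each worker pulls per-vm jobs from the shared queue.
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut this worker down
            return
        job.run(jobs)


def start_workers(jobs, count):
    # Constant number of threads, independent of the number of vms.
    threads = [threading.Thread(target=worker, args=(jobs,), daemon=True)
               for _ in range(count)]
    for t in threads:
        t.start()
    return threads
```

With two workers and three vms, one of which blocks, the blocked job holds exactly one worker and is not resubmitted until its call returns, while the remaining worker keeps cycling through the healthy vms.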