
On Mon, Feb 8, 2021 at 1:22 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Mon, Feb 8, 2021 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
I ran a loop of [1] (from [2]). The loop succeeded for ~ 380 iterations, then failed with 'Too many open files'. First failure was:
2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not connect to storageServer (hsm:2446)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2443, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 449, in connect
    return self._mountCon.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 171, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 210, in mount
    cgroup=cgroup)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
OSError: [Errno 24] Too many open files
Maybe we have an fd leak in supervdsmd? We know there is a small memory leak in multiprocessing, but we are not aware of any fd leak.
But obviously, once it did, it continued failing for this reason on many later operations.
Smells like an fd leak.
Is this considered a bug?
Generally yes, but the question is whether this happens in real-world scenarios.
Do we actively try to prevent such cases?
No, we don't have any code monitoring the number of open fds at runtime, or tests checking this in system tests.

We do have a health monitor in vdsm: https://github.com/oVirt/vdsm/blob/master/lib/vdsm/health.py

It could be useful to have it also log the number of fds (e.g. via "ls -lh /proc/pid/fd"). We don't have any monitor in supervdsm; it could be useful to add one. supervdsm is relatively simple, but the problem is that it runs possibly complex code from vdsm, so "safe" changes in vdsm can cause trouble when that code is run by supervdsm.
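[For illustration only: a minimal sketch of what such an fd monitor could log, assuming a Linux /proc filesystem. count_open_fds is a hypothetical helper, not part of vdsm's health.py.]

```python
import os


def count_open_fds(pid="self"):
    """Return the number of open file descriptors of a process.

    Counts the entries in /proc/<pid>/fd, so this is Linux-only.
    With pid="self" it inspects the current process, which is
    how a health monitor could track its own fd usage over time.
    """
    return len(os.listdir("/proc/{}/fd".format(pid)))


if __name__ == "__main__":
    # A health monitor could log this periodically and warn when the
    # count keeps growing, which would have caught a leak like this
    # long before hitting EMFILE (errno 24).
    print("open fds:", count_open_fds())
```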
So should I open one and attach logs? Or can it be considered a "corner case"?
Yes, please open a bug and include the info you have. Please include the output of "ls -lh /proc/pid/fd" for both vdsm and supervdsm when you reproduce the issue, or during the long test if you cannot reproduce.
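[For illustration only: the same information "ls -lh /proc/pid/fd" shows can be collected programmatically, e.g. from a test loop. list_fd_targets is a hypothetical helper, assuming Linux /proc; it is not part of vdsm.]

```python
import os


def list_fd_targets(pid):
    """Return (fd, target) pairs for a process, roughly what
    `ls -l /proc/<pid>/fd` shows: each fd and the file, socket,
    or pipe it points to."""
    fd_dir = "/proc/{}/fd".format(pid)
    targets = []
    for name in sorted(os.listdir(fd_dir), key=int):
        try:
            targets.append((int(name), os.readlink(os.path.join(fd_dir, name))))
        except OSError:
            # The fd may be closed between listdir() and readlink();
            # skip it rather than fail the whole snapshot.
            pass
    return targets


if __name__ == "__main__":
    # Snapshot the current process; for vdsm/supervdsm one would pass
    # the daemon's pid instead, and diff snapshots between iterations
    # to see which fds accumulate.
    for fd, target in list_fd_targets(os.getpid()):
        print(fd, "->", target)
```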
Using vdsm-4.40.50.3-37.git7883b3b43.el8.x86_64 from ost-images-el8-he-installed-1-202102021144.x86_64 .
I can also give access to the machine(s) if needed, for now.
Sorry, I have now cleaned up this env. I can try to reproduce if there is interest.
It will help if you can reproduce.

Nir