
On Tue, Feb 9, 2021 at 12:30 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Feb 9, 2021 at 10:39 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Feb 9, 2021 at 8:55 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Mon, Feb 8, 2021 at 7:05 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Feb 8, 2021 at 6:47 PM Asaf Rachmani <arachman@redhat.com> wrote:
I haven't checked it yet, but maybe this customer bug is related to this discussion: https://bugzilla.redhat.com/show_bug.cgi?id=1666553
This bug is about the broker opening too many connections to vdsm: https://bugzilla.redhat.com/show_bug.cgi?id=1666553#c1
If the broker keeps thousands of connections open, it is likely that vdsm will run out of fds.
From vdsm.log:

2019-01-15 13:41:11,162+0000 INFO (periodic/2) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=97c359aa-002e-46d8-9fc5-2477db0909b4 (api:52)
2019-01-15 13:41:12,210+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'74b9dc62-22b2-4698-9d84-6a71c4f29763'} at 0x7f71dc31b0d0> timeout=60, duration=60 at 0x7f71dc31b110> task#=33 at 0x7f722003c890>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getCapabilities
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities
  c = caps.get()
File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get
  net_caps = supervdsm.getProxy().network_caps()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in network_caps
File: "/usr/lib64/python2.7/multiprocessing/b", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)
On Mon, Feb 8, 2021 at 5:52 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Feb 8, 2021 at 1:22 PM Yedidyah Bar David <didi@redhat.com> wrote:
>
> On Mon, Feb 8, 2021 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
> >
> > Hi all,
> >
> > I ran a loop of [1] (from [2]). The loop succeeded for ~ 380
> > iterations, then failed with 'Too many open files'. First failure was:
> >
> > 2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not
> > connect to storageServer (hsm:2446)
> > Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2443, in connectStorageServer
> >     conObj.connect()
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 449, in connect
> >     return self._mountCon.connect()
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 171, in connect
> >     self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 210, in mount
> >     cgroup=cgroup)
> >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
> >     return callMethod()
> >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
> >     **kwargs)
> >   File "<string>", line 2, in mount
> >   File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
> >     raise convert_to_error(kind, result)
> > OSError: [Errno 24] Too many open files
Maybe we have an fd leak in supervdsmd?
We know that there is a small memory leak in multiprocessing, but we don't know about any fd leak.
> > But obviously, once it did, it continued failing for this reason on
> > many later operations.
Smells like an fd leak.
> > Is this considered a bug?
Generally yes, but the question is whether this happens in real-world scenarios.
> Do we actively try to prevent such cases?
No, we don't have any code monitoring the number of open fds at runtime, nor system tests checking this.
We do have a health monitor in vdsm: https://github.com/oVirt/vdsm/blob/master/lib/vdsm/health.py
It could be useful to have the monitor also log the number of fds (e.g. via ls -lh /proc/<pid>/fd).
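For example, a check like this could be hooked into a periodic monitor (just a sketch, not the actual health.py code; the logger name is made up):

    # Sketch of an fd-count check that a periodic health monitor could
    # run. Hypothetical, not vdsm's actual health.py code.
    import logging
    import os

    log = logging.getLogger("health")

    def log_fd_count():
        # /proc/self/fd has one entry per fd open in this process
        # (the listing itself briefly adds one fd).
        nfds = len(os.listdir("/proc/self/fd"))
        log.info("process has %d open fds", nfds)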
We don't have any such monitor in supervdsm; it could be useful to add one. supervdsm itself is relatively simple, but the problem is that it runs possibly complex code from vdsm, so "safe" changes in vdsm can cause trouble when that code runs inside supervdsm.
> > So should I open one and attach logs? Or can it be considered a
> > "corner case"?
Yes, please open a bug, and include the info you have.
Done, https://bugzilla.redhat.com/show_bug.cgi?id=1926589 .
Please include the output of "ls -lh /proc/pid/fd" for both vdsm and supervdsm when you reproduce the issue, or during the long test if you cannot reproduce.
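Something like this could grab it for both daemons in one go (rough sketch; the pidfile paths are from memory and may differ per version, and it needs to run as root):

    # Dump the open fds of vdsmd and supervdsmd. The pidfile paths are
    # assumptions - adjust to the actual locations on the host.
    import subprocess

    for name, pidfile in [("vdsmd", "/run/vdsm/vdsmd.pid"),
                          ("supervdsmd", "/run/vdsm/supervdsmd.pid")]:
        with open(pidfile) as f:
            pid = f.read().strip()
        print("=== %s (pid %s) ===" % (name, pid))
        subprocess.run(["ls", "-lh", "/proc/%s/fd" % pid])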
> > Using vdsm-4.40.50.3-37.git7883b3b43.el8.x86_64 from
> > ost-images-el8-he-installed-1-202102021144.x86_64 .
> >
> > I can also give access to the machine(s) if needed, for now.
>
> Sorry, now cleaned this env. Can try to reproduce if there is interest.
It will help if you can reproduce.
Trying now.
It hasn't failed yet (45 iterations finished), but supervdsm already seems to be leaking. When I started, it had 30+ open files (32 on host-0, 35 on host-1); now it has 160+ (169, 166). Almost all of them are pipes. Not sure how to see what's leaking. I'll run strace on it for some time; hopefully this can help.
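In case it helps, resolving the /proc/<pid>/fd symlinks and counting by target kind is a quick way to confirm what is accumulating (a sketch; run as root for other users' processes):

    # Count a process' open fds by the kind of object they point to
    # (pipe, socket, anon_inode, plain file, ...).
    # Usage: python3 fdkinds.py <pid>
    import collections
    import os
    import sys

    pid = sys.argv[1]
    counts = collections.Counter()
    for fd in os.listdir("/proc/%s/fd" % pid):
        try:
            target = os.readlink("/proc/%s/fd/%s" % (pid, fd))
        except OSError:
            continue  # fd was closed while we were iterating
        # Targets look like "pipe:[123]", "socket:[456]", or a path.
        kind = target.split(":")[0] if ":" in target else "file"
        counts[kind] += 1
    for kind, n in counts.most_common():
        print("%-12s %d" % (kind, n))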
Looked some more, and added an attachment and some guesses to the bug.
Thanks for the research!

I posted a quick minimal fix:
https://gerrit.ovirt.org/c/vdsm/+/113395

And prepared a build:
https://jenkins.ovirt.org/job/vdsm_standard-check-patch/26538/

Does it work for you?

Nir