I haven't checked it yet, but maybe this customer bug is related to this discussion:
https://bugzilla.redhat.com/show_bug.cgi?id=1666553

From vdsm.log:
2019-01-15 13:41:11,162+0000 INFO  (periodic/2) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=97c359aa-002e-46d8-9fc5-2477db0909b4 (api:52)
2019-01-15 13:41:12,210+0000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'74b9dc62-22b2-4698-9d84-6a71c4f29763'} at 0x7f71dc31b0d0> timeout=60, duration=60 at 0x7f71dc31b110> task#=33 at 0x7f722003c890>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
  self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
  response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
  res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
  result = fn(*methodArgs)
File: "<string>", line 2, in getCapabilities
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities
  c = caps.get()
File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get
  net_caps = supervdsm.getProxy().network_caps()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
  return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
  **kwargs)
File: "<string>", line 2, in network_caps
File: "/usr/lib64/python2.7/multiprocessing/b", line 759, in _callmethod
  kind, result = conn.recv() (executor:363)
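
The interesting part is the last frames: Host.getCapabilities ends up
in supervdsm.getProxy().network_caps(), which goes over a
multiprocessing manager proxy and blocks in conn.recv() until
supervdsm answers. A minimal sketch of that blocking behaviour (plain
multiprocessing, not vdsm code; _Api, _Manager and the sleep are made
up for illustration):

import time
from multiprocessing.managers import BaseManager

class _Api:
    def network_caps(self):
        # Stand-in for a slow server-side call, longer than the 60s
        # executor timeout seen in the warning above.
        time.sleep(90)
        return {}

class _Manager(BaseManager):
    pass

_Manager.register('api', callable=_Api)

if __name__ == '__main__':
    mgr = _Manager(address=('127.0.0.1', 0), authkey=b'secret')
    mgr.start()          # serves _Api in a child process
    proxy = mgr.api()
    # The proxy call goes through BaseProxy._callmethod(), which sends
    # the request and then blocks in conn.recv() until the server
    # replies -- the same frame the "Worker blocked" warning shows.
    print(proxy.network_caps())
    mgr.shutdown()

So as long as supervdsm does not answer, the jsonrpc worker stays
stuck in that recv() and the executor keeps logging it as blocked.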

On Mon, Feb 8, 2021 at 5:52 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Feb 8, 2021 at 1:22 PM Yedidyah Bar David <didi@redhat.com> wrote:
>
> On Mon, Feb 8, 2021 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
> >
> > Hi all,
> >
> > I ran a loop of [1] (from [2]). The loop succeeded for ~ 380
> > iterations, then failed with 'Too many open files'. First failure was:
> >
> > 2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not
> > connect to storageServer (hsm:2446)
> > Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line
> > 2443, in connectStorageServer
> >     conObj.connect()
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py",
> > line 449, in connect
> >     return self._mountCon.connect()
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py",
> > line 171, in connect
> >     self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line
> > 210, in mount
> >     cgroup=cgroup)
> >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py",
> > line 56, in __call__
> >     return callMethod()
> >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py",
> > line 54, in <lambda>
> >     **kwargs)
> >   File "<string>", line 2, in mount
> >   File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772,
> > in _callmethod
> >     raise convert_to_error(kind, result)
> > OSError: [Errno 24] Too many open files

Maybe we have an fd leak in supervdsmd?

We know there is a small memory leak in multiprocessing, but we
don't know about any fd leak.

> > But obviously, once it did, it continued failing for this reason on
> > many later operations.

Smells like an fd leak.

> > Is this considered a bug?

Generally yes, but the question is whether this happens in
real-world scenarios.

> Do we actively try to prevent such cases?

No, we don't have any code monitoring the number of open fds at
runtime, nor any checks for this in our system tests.

We do have a health monitor in vdsm:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/health.py

It could be useful to have the health monitor also log the number
of fds (e.g. from ls -lh /proc/pid/fd).
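
For example, a minimal sketch of such a check (a plain periodic
thread with made-up names, not the actual health.py API):

import logging
import os
import threading
import time

log = logging.getLogger("health")

def count_fds(pid="self"):
    # Every entry under /proc/<pid>/fd is one open file descriptor.
    return len(os.listdir("/proc/%s/fd" % pid))

def start_fd_monitor(interval=60.0):
    def run():
        while True:
            log.info("open fds: %d", count_fds())
            time.sleep(interval)
    t = threading.Thread(target=run, name="fd-monitor", daemon=True)
    t.start()
    return t

The same kind of thread could also run inside supervdsm.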

We don't have any such monitor in supervdsm; it could be useful to
add one. supervdsm itself is relatively simple, but the problem is
that it runs possibly complex code from vdsm, so "safe" changes in
vdsm can cause trouble when that code runs inside supervdsm.

> > So should I open one and attach logs? Or can it be considered a
> > "corner case"?

Yes, please open a bug, and include the info you have.

Please include the output of "ls -lh /proc/pid/fd" for both vdsm
and supervdsm when you reproduce the issue, or during the
long test if you cannot reproduce it.
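
Something like this sketch can collect that periodically during the
long test (run it as root on the host; the service names, output path
and interval are just an example):

import subprocess
import time

SERVICES = ("vdsmd", "supervdsmd")

def main_pid(service):
    # "systemctl show -p MainPID <service>" prints "MainPID=<pid>".
    out = subprocess.check_output(
        ["systemctl", "show", "-p", "MainPID", service]).decode()
    return out.strip().split("=", 1)[1]

def snapshot(path="/tmp/fd-snapshots.log", interval=300):
    # Append "ls -lh /proc/<pid>/fd" for both daemons every few
    # minutes, so fd growth over the test is visible afterwards.
    while True:
        with open(path, "a") as f:
            f.write("=== %s ===\n" % time.ctime())
            for service in SERVICES:
                pid = main_pid(service)
                f.write("--- %s (pid %s) ---\n" % (service, pid))
                f.write(subprocess.check_output(
                    ["ls", "-lh", "/proc/%s/fd" % pid]).decode())
        time.sleep(interval)

if __name__ == "__main__":
    snapshot()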

> > Using vdsm-4.40.50.3-37.git7883b3b43.el8.x86_64 from
> > ost-images-el8-he-installed-1-202102021144.x86_64 .
> >
> > I can also give access to the machine(s) if needed, for now.
>
> Sorry, now cleaned this env. Can try to reproduce if there is interest.

It will help if you can reproduce it.

Nir
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/W7YKK25CHMNQAB4R2BMFL7PBOIPOMMBY/