I haven't checked it yet, but maybe this customer bug is related to this
discussion:
https://bugzilla.redhat.com/show_bug.cgi?id=1666553
From vdsm.log:
2019-01-15 13:41:11,162+0000 INFO (periodic/2) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=97c359aa-002e-46d8-9fc5-2477db0909b4 (api:52)

2019-01-15 13:41:12,210+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'74b9dc62-22b2-4698-9d84-6a71c4f29763'} at 0x7f71dc31b0d0> timeout=60, duration=60 at 0x7f71dc31b110> task#=33 at 0x7f722003c890>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line
194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in
_execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in
__call__
self._callable()
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523,
in __call__
self._handler(self._ctx, self._req)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566,
in _serveRequest
response = self._handle_request(req, ctx)
File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606,
in _handle_request
res = method(**params)
File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in
_dynamicMethod
result = fn(*methodArgs)
File: "<string>", line 2, in getCapabilities
File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in
method
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in
getCapabilities
c = caps.get()
File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get
net_caps = supervdsm.getProxy().network_caps()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55,
in __call__
return callMethod()
File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53,
in <lambda>
**kwargs)
File: "<string>", line 2, in network_caps
File: "/usr/lib64/python2.7/multiprocessing/b", line 759, in _callmethod
kind, result = conn.recv() (executor:363)
On Mon, Feb 8, 2021 at 5:52 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
On Mon, Feb 8, 2021 at 1:22 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Mon, Feb 8, 2021 at 9:05 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >
> > Hi all,
> >
> > I ran a loop of [1] (from [2]). The loop succeeded for ~ 380
> > iterations, then failed with 'Too many open files'. First failure was:
> >
> > 2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not connect to storageServer (hsm:2446)
> > Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2443, in connectStorageServer
> >     conObj.connect()
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 449, in connect
> >     return self._mountCon.connect()
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 171, in connect
> >     self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
> >   File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 210, in mount
> >     cgroup=cgroup)
> >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
> >     return callMethod()
> >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
> >     **kwargs)
> >   File "<string>", line 2, in mount
> >   File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
> >     raise convert_to_error(kind, result)
> > OSError: [Errno 24] Too many open files
Maybe we have an fd leak in supervdsmd?

We know there is a small memory leak in multiprocessing, but we are not
aware of any fd leak.
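
For what it's worth, a quick way to check is to count the entries under
/proc/<pid>/fd of supervdsmd before and after a batch of iterations. A
minimal sketch (hypothetical helper, not vdsm code):

    import os
    import sys

    def count_fds(pid):
        # Each entry under /proc/<pid>/fd is a symlink for one open fd.
        return len(os.listdir("/proc/{}/fd".format(pid)))

    if __name__ == "__main__":
        # Usage: python count_fds.py <pid-of-supervdsmd>
        # Run before and after a batch of iterations; a count that keeps
        # growing across iterations points at a leak.
        print(count_fds(int(sys.argv[1])))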
> > But obviously, once it did, it continued failing for this reason on
> > many later operations.
Smells like an fd leak.
> > Is this considered a bug?
Generally yes, but the question is whether this happens in
real-world scenarios.
> Do we actively try to prevent such cases?
No, we don't have any code monitoring the number of open fds at runtime,
or tests checking this in the system tests.

We do have a health monitor in vdsm:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/health.py

It could be useful for this monitor to also log the number of fds
(e.g. ls -lh /proc/<pid>/fd).
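
Something along these lines could do it; this is only a minimal sketch of
the idea, not the actual health.py code, and the logger name and interval
are made up:

    import logging
    import os
    import threading
    import time

    log = logging.getLogger("health")  # logger name chosen for the example

    def _fd_monitor(interval):
        while True:
            # /proc/self/fd has one symlink per fd open in this process.
            nfds = len(os.listdir("/proc/self/fd"))
            log.info("process is using %d file descriptors", nfds)
            time.sleep(interval)

    def start_fd_monitor(interval=60):
        # Daemon thread so it never blocks process shutdown.
        t = threading.Thread(target=_fd_monitor, args=(interval,), daemon=True)
        t.start()
        return t

In vdsm it would more likely hook into the existing health monitor loop
instead of starting its own thread.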
We don't have any monitor in supervdsm; it could be useful to add one.
supervdsm itself is relatively simple, but the problem is that it runs
possibly complex code from vdsm, so "safe" changes in vdsm can cause
trouble when that code is run by supervdsm.
> > So should I open one and attach logs? Or can it be considered a "corner
> > case"?
Yes, please open a bug, and include the info you have.
Please include the output of "ls -lh /proc/<pid>/fd" for both vdsm and
supervdsm when you reproduce the issue, or during the long test if you
cannot reproduce.
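
If it helps, here is a small sketch (hypothetical, not part of vdsm) that
summarizes /proc/<pid>/fd by target, which can make the leaking resource
easier to spot than the raw ls output:

    import collections
    import os
    import sys

    def fd_summary(pid):
        # Resolve each fd symlink and count the targets (socket, pipe,
        # anon_inode, or a plain file path).
        counts = collections.Counter()
        fd_dir = "/proc/{}/fd".format(pid)
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # the fd was closed while we were looking
            counts[target.split(":")[0]] += 1
        return counts

    if __name__ == "__main__":
        # Usage: python fd_summary.py <pid>  (run for both vdsm and supervdsm)
        for target, n in fd_summary(int(sys.argv[1])).most_common():
            print("{:6d}  {}".format(n, target))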
> > Using vdsm-4.40.50.3-37.git7883b3b43.el8.x86_64 from
> > ost-images-el8-he-installed-1-202102021144.x86_64 .
> >
> > I can also give access to the machine(s) if needed, for now.
>
> Sorry, now cleaned this env. Can try to reproduce if there is interest.
It will help if you can reproduce.
Nir