On Mon, Feb 8, 2021 at 6:47 PM Asaf Rachmani <arachman(a)redhat.com> wrote:
>
>
> I didn't check it yet, but maybe this customer bug is related to this discussion:
> https://bugzilla.redhat.com/show_bug.cgi?id=1666553
This bug is about the broker opening too many connections
to vdsm:
https://bugzilla.redhat.com/show_bug.cgi?id=1666553#c1

If the broker keeps thousands of connections open, it is likely that
vdsm will run out of fds.
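
For reference, a process starts failing with EMFILE ("Too many open
files") once it reaches its soft limit on open fds. A quick sketch
(standard library only, Linux specific) for checking the limit and the
current usage from inside a Python process:

    import os
    import resource

    # Soft/hard limits on open file descriptors for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Current number of open fds, counted via /proc.
    open_fds = len(os.listdir("/proc/self/fd"))

    print("open fds: %d (soft limit: %d, hard limit: %d)"
          % (open_fds, soft, hard))
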
> From vdsm.log:
> 2019-01-15 13:41:11,162+0000 INFO (periodic/2) [vdsm.api] FINISH multipath_health return={} from=internal, task_id=97c359aa-002e-46d8-9fc5-2477db0909b4 (api:52)
> 2019-01-15 13:41:12,210+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': u'74b9dc62-22b2-4698-9d84-6a71c4f29763'} at 0x7f71dc31b0d0> timeout=60, duration=60 at 0x7f71dc31b110> task#=33 at 0x7f722003c890>, traceback:
> File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
>   self.__bootstrap_inner()
> File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
>   self.run()
> File: "/usr/lib64/python2.7/threading.py", line 765, in run
>   self.__target(*self.__args, **self.__kwargs)
> File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
>   ret = func(*args, **kwargs)
> File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
>   self._execute_task()
> File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
>   task()
> File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
>   self._callable()
> File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__
>   self._handler(self._ctx, self._req)
> File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest
>   response = self._handle_request(req, ctx)
> File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
>   res = method(**params)
> File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
>   result = fn(*methodArgs)
> File: "<string>", line 2, in getCapabilities
> File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
>   ret = func(*args, **kwargs)
> File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in getCapabilities
>   c = caps.get()
> File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, in get
>   net_caps = supervdsm.getProxy().network_caps()
> File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
>   return callMethod()
> File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>
>   **kwargs)
> File: "<string>", line 2, in network_caps
> File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod
>   kind, result = conn.recv() (executor:363)
>
> On Mon, Feb 8, 2021 at 5:52 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
>>
>> On Mon, Feb 8, 2021 at 1:22 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>> >
>> > On Mon, Feb 8, 2021 at 9:05 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>> > >
>> > > Hi all,
>> > >
>> > > I ran a loop of [1] (from [2]). The loop succeeded for ~380
>> > > iterations, then failed with 'Too many open files'. The first failure was:
>> > >
>> > > 2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not connect to storageServer (hsm:2446)
>> > > Traceback (most recent call last):
>> > >   File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2443, in connectStorageServer
>> > >     conObj.connect()
>> > >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 449, in connect
>> > >     return self._mountCon.connect()
>> > >   File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 171, in connect
>> > >     self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
>> > >   File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 210, in mount
>> > >     cgroup=cgroup)
>> > >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
>> > >     return callMethod()
>> > >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
>> > >     **kwargs)
>> > >   File "<string>", line 2, in mount
>> > >   File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
>> > >     raise convert_to_error(kind, result)
>> > > OSError: [Errno 24] Too many open files
>>
>> Maybe we have an fd leak in supervdsmd?
>>
>> We know that there is a small memory leak in multiprocessing, but we
>> don't know about any fd leak.
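>>
>> A rough way to check for one (a minimal sketch; the operation to
>> repeat is whatever your loop does):
>>
>>     import os
>>
>>     def fd_count(pid="self"):
>>         # Count open fds by listing /proc/<pid>/fd (Linux only).
>>         return len(os.listdir("/proc/%s/fd" % pid))
>>
>>     before = fd_count()
>>     # ... repeat the suspected operation a number of times ...
>>     after = fd_count()
>>     if after > before:
>>         print("possible fd leak: %d -> %d fds" % (before, after))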
>>
>> > > But obviously, once it did, it continued failing for this reason on
>> > > many later operations.
>>
>> Smells like an fd leak.
>>
>> > > Is this considered a bug?
>>
>> Generally yes, but the question is whether this happens in
>> real-world scenarios.
>>
>> > Do we actively try to prevent such cases?
>>
>> No, we don't have any code monitoring the number of open fds at
>> runtime, or tests checking this in system tests.
>>
>> We do have a health monitor in vdsm:
>> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/health.py
>>
>> It could be useful to also monitor and log the number of fds there
>> (e.g. ls -lh /proc/<pid>/fd).
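>>
>> Something like this minimal sketch could do it (illustrative only,
>> assuming a simple timer thread, not vdsm's actual health.py API):
>>
>>     import logging
>>     import os
>>     import threading
>>     import time
>>
>>     log = logging.getLogger("health")
>>
>>     def monitor_fds(interval=60):
>>         # Periodically log the number of open fds of this process.
>>         while True:
>>             num_fds = len(os.listdir("/proc/self/fd"))
>>             log.info("Open file descriptors: %d", num_fds)
>>             time.sleep(interval)
>>
>>     threading.Thread(target=monitor_fds, daemon=True).start()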
>>
>> We don't have any monitor in supervdsm; it could be useful to add
>> one. supervdsm is relatively simple, but the problem is that it runs
>> possibly complex code from vdsm, so "safe" changes in vdsm can
>> cause trouble when the code is run by supervdsm.
>>
>> > So should I open one and attach logs?
>> > > Or can it be considered a "corner case"?
>>
>> Yes, please open a bug, and include the info you have.
>>
>> Please include the output of "ls -lh /proc/<pid>/fd" for both vdsm
>> and supervdsm when you reproduce the issue, or during the long test
>> if you cannot reproduce it.
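>>
>> If it helps, something like this could collect the listings (a
>> sketch; the process names are assumptions, adjust for how the
>> daemons appear on your system, and run it as root):
>>
>>     import os
>>     import subprocess
>>
>>     for name in ("vdsmd", "supervdsmd"):
>>         try:
>>             pids = subprocess.check_output(["pgrep", "-x", name]).split()
>>         except subprocess.CalledProcessError:
>>             print("no process named %s" % name)
>>             continue
>>         for pid in pids:
>>             pid = pid.decode()
>>             fd_dir = "/proc/%s/fd" % pid
>>             fds = os.listdir(fd_dir)
>>             print("=== %s (pid %s): %d fds ===" % (name, pid, len(fds)))
>>             for fd in sorted(fds, key=int):
>>                 try:
>>                     # Show what each fd points at, like ls -l would.
>>                     print("  %s -> %s"
>>                           % (fd, os.readlink(os.path.join(fd_dir, fd))))
>>                 except OSError:
>>                     pass  # fd was closed between listdir and readlink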
>>
>> > > Using vdsm-4.40.50.3-37.git7883b3b43.el8.x86_64 from
>> > > ost-images-el8-he-installed-1-202102021144.x86_64 .
>> > >
>> > > I can also give access to the machine(s) if needed, for now.
>> >
>> > Sorry, I have now cleaned up this env. I can try to reproduce it if
>> > there is interest.
>>
>> It will help if you can reproduce it.