
On Sun, Jul 29, 2018 at 8:06 PM Nir Soffer <nsoffer@redhat.com> wrote:
I'm now seeing a new issue with the latest vdsm 4.2.5:
On the engine side, the host is not responding. On the host, vdsm is stuck waiting for lshw child processes that are themselves stuck in uninterruptible sleep (state D in the ps output below):
root 29580 0.0 0.0 776184 27468 ? S<sl 19:44 0:00 /usr/bin/python2 /usr/share/vdsm/supervdsmd --sockfile /var/run/vdsm/svdsm.sock
root 29627 0.0 0.0  20548  1452 ? D<   19:44 0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable
root 29808 0.0 0.0  20548  1456 ? D<   19:44 0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable
root 30013 0.0 0.0  20548  1452 ? D<   19:46 0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable
root 30064 0.0 0.0  20548  1456 ? D<   19:49 0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable
In the vdsm log we see many of these new warnings:
2018-07-29 19:52:39,001+0300 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f8c14152ad0> at 0x7f8c14152b10> timeout=15, duration=345 at 0x7f8c140f9190> task#=0 at 0x7f8c1418d950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 576, in __call__
  sample = HostSample(self._pid)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 240, in __init__
  self.interfaces = _get_interfaces_and_samples()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 200, in _get_interfaces_and_samples
  for link in ipwrapper.getLinks():
File: "/usr/lib/python2.7/site-packages/vdsm/network/ipwrapper.py", line 267, in getLinks
  in six.viewitems(dpdk.get_dpdk_devices()))
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 44, in get_dpdk_devices
  dpdk_devices = _get_dpdk_devices()
File: "/usr/lib/python2.7/site-packages/vdsm/common/cache.py", line 41, in __call__
  value = self.func(*args)
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 111, in _get_dpdk_devices
  devices = _lshw_command()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 123, in _lshw_command
  rc, out, err = cmd.exec_sync(['lshw', '-json'] + filterout_cmd)
File: "/usr/lib/python2.7/site-packages/vdsm/network/cmd.py", line 38, in exec_sync
  retcode, out, err = exec_sync_bytes(cmds)
File: "/usr/lib/python2.7/site-packages/vdsm/common/cmdutils.py", line 156, in exec_cmd
  out, err = p.communicate()
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
  ready = poller.poll(self._remaining_time(endtime)) (executor:363)
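The traceback shows why a single stuck lshw is enough to hang the periodic HostMonitor worker: cmd.exec_sync() ends up in subprocess32's communicate() with no deadline, so the caller waits forever on a child in D state. Below is a minimal sketch of the general fix direction, assuming nothing about vdsm internals beyond what the traceback shows; the helper name and the 5-second limit are my own, not vdsm code. It relies on the timeout support that subprocess32 (already in use, per the traceback) provides:

# Sketch only: run lshw with a deadline so a hung child cannot block the
# caller indefinitely. subprocess32's communicate() accepts a timeout.
import subprocess32

LSHW_CMD = ['lshw', '-json',
            '-disable', 'usb', '-disable', 'pcmcia', '-disable', 'isapnp',
            '-disable', 'ide', '-disable', 'scsi', '-disable', 'dmi',
            '-disable', 'memory', '-disable', 'cpuinfo']


def lshw_json_with_timeout(timeout=5):
    """Return lshw's JSON output, or None on timeout or failure."""
    proc = subprocess32.Popen(LSHW_CMD,
                              stdout=subprocess32.PIPE,
                              stderr=subprocess32.PIPE)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess32.TimeoutExpired:
        # A child stuck in uninterruptible sleep may linger until it wakes
        # up, but at least the calling worker is no longer blocked.
        proc.kill()
        return None
    return out if proc.returncode == 0 else None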
I'm not sure it is reproducible. It happened after I:
- stopped vdsmd, supervdsmd, sanlock, and wdmd
- installed a new sanlock version with an unrelated fix
- started vdsmd
Running the same command from the shell succeeds:
# time lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo > /dev/null
real    0m0.744s
user    0m0.701s
sys     0m0.040s
It also creates fairly large JSON output, but this should not be an issue:

# lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo | wc -c
143468
Do we have a flag to disable dpdk until this issue is fixed?
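As far as I know there is no such flag today. A minimal sketch of what an opt-out could look like; the VDSM_DISABLE_DPDK environment variable and the function shape are assumptions for illustration, not existing vdsm behaviour:

# Hypothetical kill switch, not an existing vdsm option: skip dpdk
# detection entirely so lshw is never spawned from the sampling path.
import os


def get_dpdk_devices(detect=lambda: {}):
    # 'detect' stands in for the real lshw-based detection; ipwrapper
    # iterates the result with six.viewitems(), so an empty dict is a
    # safe "no dpdk devices" answer.
    if os.environ.get('VDSM_DISABLE_DPDK'):
        return {}
    return detect()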
This is a quick hack to work around this issue:

diff --git a/lib/vdsm/network/link/dpdk.py b/lib/vdsm/network/link/dpdk.py
index baa876427..132ea0096 100644
--- a/lib/vdsm/network/link/dpdk.py
+++ b/lib/vdsm/network/link/dpdk.py
@@ -117,6 +116,8 @@ def _get_dpdk_devices():
 
 
 def _lshw_command():
+    return []
+
     filter_out_hw = ['usb', 'pcmcia', 'isapnp', 'ide', 'scsi', 'dmi',
                      'memory', 'cpuinfo']
     filterout_cmd = list(itertools.chain.from_iterable(('-disable', x)

Don't complain about missing dpdk devices :-)
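A gentler variant of the same workaround (again only a sketch, reusing the hypothetical lshw_json_with_timeout() helper from above) would keep dpdk detection but degrade to "no devices" whenever lshw fails or exceeds its deadline, instead of dropping detection unconditionally:

# Sketch: tolerate a stuck or failing lshw by reporting no dpdk devices,
# rather than hanging the periodic worker or disabling detection outright.
import json
import logging


def lshw_devices_or_nothing():
    out = lshw_json_with_timeout(timeout=5)  # hypothetical helper above
    if out is None:
        logging.warning('lshw failed or timed out, ignoring dpdk devices')
        return []
    try:
        return json.loads(out)
    except ValueError:
        logging.warning('cannot parse lshw output, ignoring dpdk devices')
        return []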