
I'm now seeing a new issue with the latest vdsm 4.2.5: on the engine side the host is reported as not responding. On the host, vdsm is stuck waiting for stuck lshw child processes:

root     29580  0.0  0.0 776184 27468 ?  S<sl 19:44  0:00 /usr/bin/python2 /usr/share/vdsm/supervdsmd --sockfile /var/run/vdsm/svdsm.sock
root     29627  0.0  0.0  20548  1452 ?  D<   19:44  0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable
root     29808  0.0  0.0  20548  1456 ?  D<   19:44  0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable
root     30013  0.0  0.0  20548  1452 ?  D<   19:46  0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable
root     30064  0.0  0.0  20548  1456 ?  D<   19:49  0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable

In the log we see many of these new errors:

2018-07-29 19:52:39,001+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f8c14152ad0> at 0x7f8c14152b10> timeout=15, duration=345 at 0x7f8c140f9190> task#=0 at 0x7f8c1418d950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 576, in __call__
  sample = HostSample(self._pid)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 240, in __init__
  self.interfaces = _get_interfaces_and_samples()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 200, in _get_interfaces_and_samples
  for link in ipwrapper.getLinks():
File: "/usr/lib/python2.7/site-packages/vdsm/network/ipwrapper.py", line 267, in getLinks
  in six.viewitems(dpdk.get_dpdk_devices()))
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 44, in get_dpdk_devices
  dpdk_devices = _get_dpdk_devices()
File: "/usr/lib/python2.7/site-packages/vdsm/common/cache.py", line 41, in __call__
  value = self.func(*args)
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 111, in _get_dpdk_devices
  devices = _lshw_command()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 123, in _lshw_command
  rc, out, err = cmd.exec_sync(['lshw', '-json'] + filterout_cmd)
File: "/usr/lib/python2.7/site-packages/vdsm/network/cmd.py", line 38, in exec_sync
  retcode, out, err = exec_sync_bytes(cmds)
File: "/usr/lib/python2.7/site-packages/vdsm/common/cmdutils.py", line 156, in exec_cmd
  out, err = p.communicate()
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
  ready = poller.poll(self._remaining_time(endtime)) (executor:363)
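The worker is blocked in p.communicate(), which exec_cmd calls without any timeout, so a single hung lshw child blocks the periodic worker indefinitely. Below is a minimal, untested sketch of how the lshw call could be bounded. It uses subprocess32, which is already in use per the traceback above, but the function name and the 10 second default are mine, not existing vdsm code:

import subprocess32

LSHW_CMD = ['lshw', '-json',
            '-disable', 'usb', '-disable', 'pcmcia', '-disable', 'isapnp',
            '-disable', 'ide', '-disable', 'scsi', '-disable', 'dmi',
            '-disable', 'memory', '-disable', 'cpuinfo']

def run_lshw_with_timeout(timeout=10):
    # Run lshw, but give up after `timeout` seconds instead of waiting forever.
    p = subprocess32.Popen(LSHW_CMD,
                           stdout=subprocess32.PIPE,
                           stderr=subprocess32.PIPE)
    try:
        out, err = p.communicate(timeout=timeout)
    except subprocess32.TimeoutExpired:
        # A child stuck in uninterruptible sleep (D state, as in the ps
        # output above) may not die even on SIGKILL, so do not wait for it
        # here; just unblock the caller.
        p.kill()
        raise
    return p.returncode, out, err

This would not fix whatever makes lshw hang, but it would keep the HostMonitor worker from being blocked forever.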
"/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate stdout, stderr = self._communicate(input, endtime, timeout) File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate orig_timeout) File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll ready = poller.poll(self._remaining_time(endtime)) (executor:363) I'm not sure it is reproducible, happened after I - stop vdsmd supervdsmd sanlock wdmd - install new sanlock version with unrelated fix - start vdsmd Running same command from the shell is successful: # time lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo > /dev/null real 0m0.744s user 0m0.701s sys 0m0.040s And create fairly large json, but this should not be an issue: # lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo | wc -c 143468 Do we have a flag to disable dpkd until this issue is fixed? Nir