On Sun, Jul 29, 2018 at 8:06 PM Nir Soffer <nsoffer@redhat.com> wrote:
I'm seeing now new issue with latest vdsm 4.2.5:

On engine side, host is not responding.
On the host, vdsm is stuck waiting for stuck lshw child processes

root     29580  0.0  0.0 776184 27468 ?        S<sl 19:44   0:00 /usr/bin/python2 /usr/share/vdsm/supervdsmd --sockfile /var/run/vdsm/svdsm.sock
root     29627  0.0  0.0  20548  1452 ?        D<   19:44   0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable 
root     29808  0.0  0.0  20548  1456 ?        D<   19:44   0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable 
root     30013  0.0  0.0  20548  1452 ?        D<   19:46   0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable 
root     30064  0.0  0.0  20548  1456 ?        D<   19:49   0:00  \_ lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable 

In the log we see many of these new errors:

2018-07-29 19:52:39,001+0300 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=periodic/1 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x7f8c14152ad
0> at 0x7f8c14152b10> timeout=15, duration=345 at 0x7f8c140f9190> task#=0 at 0x7f8c1418d950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 576, in __call__
  sample = HostSample(self._pid)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 240, in __init__
  self.interfaces = _get_interfaces_and_samples()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 200, in _get_interfaces_and_samples
  for link in ipwrapper.getLinks():
File: "/usr/lib/python2.7/site-packages/vdsm/network/ipwrapper.py", line 267, in getLinks
  in six.viewitems(dpdk.get_dpdk_devices()))
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 44, in get_dpdk_devices
  dpdk_devices = _get_dpdk_devices()
File: "/usr/lib/python2.7/site-packages/vdsm/common/cache.py", line 41, in __call__
  value = self.func(*args)
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 111, in _get_dpdk_devices
  devices = _lshw_command()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line 123, in _lshw_command
  rc, out, err = cmd.exec_sync(['lshw', '-json'] + filterout_cmd)
File: "/usr/lib/python2.7/site-packages/vdsm/network/cmd.py", line 38, in exec_sync
  retcode, out, err = exec_sync_bytes(cmds)
File: "/usr/lib/python2.7/site-packages/vdsm/common/cmdutils.py", line 156, in exec_cmd
  out, err = p.communicate()
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in communicate
  stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in _communicate
  orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in _communicate_with_poll
  ready = poller.poll(self._remaining_time(endtime)) (executor:363)

I'm not sure it is reproducible, happened after I
- stop vdsmd supervdsmd sanlock wdmd
- install new sanlock version with unrelated fix
- start vdsmd

Running same command from the shell is successful:

# time lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo > /dev/null
                            
real 0m0.744s
user 0m0.701s
sys 0m0.040s

And create fairly large json, but this should not be an issue:
# lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide -disable scsi -disable dmi -disable memory -disable cpuinfo | wc -c
143468                      

Do we have a flag to disable dpkd until this issue is fixed?

This is a quick hack to work around this issue:

diff --git a/lib/vdsm/network/link/dpdk.py b/lib/vdsm/network/link/dpdk.py
index baa876427..132ea0096 100644
--- a/lib/vdsm/network/link/dpdk.py
+++ b/lib/vdsm/network/link/dpdk.py
@@ -117,6 +116,8 @@ def _get_dpdk_devices():
 
 
 def _lshw_command():
+    return []
+
     filter_out_hw = ['usb', 'pcmcia', 'isapnp', 'ide', 'scsi', 'dmi', 'memory',
                      'cpuinfo']
     filterout_cmd = list(itertools.chain.from_iterable(('-disable', x)

 Don't complain about missing dpdk devices :-)