I'm seeing now new issue with latest vdsm 4.2.5:
On engine side, host is not responding.
On the host, vdsm is stuck waiting for stuck lshw child processes
root 29580 0.0 0.0 776184 27468 ? S<sl 19:44 0:00
/usr/bin/python2 /usr/share/vdsm/supervdsmd --sockfile
/var/run/vdsm/svdsm.sock
root 29627 0.0 0.0 20548 1452 ? D< 19:44 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 29808 0.0 0.0 20548 1456 ? D< 19:44 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 30013 0.0 0.0 20548 1452 ? D< 19:46 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 30064 0.0 0.0 20548 1456 ? D< 19:49 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
In the log we see many of these new errors:
2018-07-29 19:52:39,001+0300 WARN (vdsm.Scheduler) [Executor] Worker
blocked: <Worker name=periodic/1 running <Task <Operation
action=<vdsm.virt.sampling.HostMonitor object at 0x7f8c14152ad
0> at 0x7f8c14152b10> timeout=15, duration=345 at 0x7f8c140f9190> task#=0
at 0x7f8c1418d950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line
194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in
_execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in
__call__
self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220,
in __call__
self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 576,
in __call__
sample = HostSample(self._pid)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 240,
in __init__
self.interfaces = _get_interfaces_and_samples()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 200,
in _get_interfaces_and_samples
for link in ipwrapper.getLinks():
File: "/usr/lib/python2.7/site-packages/vdsm/network/ipwrapper.py", line
267, in getLinks
in six.viewitems(dpdk.get_dpdk_devices()))
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
44, in get_dpdk_devices
dpdk_devices = _get_dpdk_devices()
File: "/usr/lib/python2.7/site-packages/vdsm/common/cache.py", line 41, in
__call__
value = self.func(*args)
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
111, in _get_dpdk_devices
devices = _lshw_command()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
123, in _lshw_command
rc, out, err = cmd.exec_sync(['lshw', '-json'] + filterout_cmd)
File: "/usr/lib/python2.7/site-packages/vdsm/network/cmd.py", line 38, in
exec_sync
retcode, out, err = exec_sync_bytes(cmds)
File: "/usr/lib/python2.7/site-packages/vdsm/common/cmdutils.py", line 156,
in exec_cmd
out, err = p.communicate()
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in
communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in
communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in
_communicate
orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in
_communicate_with_poll
ready = poller.poll(self._remaining_time(endtime)) (executor:363)
I'm not sure it is reproducible, happened after I
- stop vdsmd supervdsmd sanlock wdmd
- install new sanlock version with unrelated fix
- start vdsmd
Running same command from the shell is successful:
# time lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide
-disable scsi -disable dmi -disable memory -disable cpuinfo > /dev/null
real 0m0.744s
user 0m0.701s
sys 0m0.040s
And create fairly large json, but this should not be an issue:
# lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide
-disable scsi -disable dmi -disable memory -disable cpuinfo | wc -c
143468
Do we have a flag to disable dpkd until this issue is fixed?
Nir