Vdsm stuck in get_dpdk_devices
by Nir Soffer
I'm now seeing a new issue with the latest vdsm 4.2.5:
On engine side, host is not responding.
On the host, vdsm is stuck waiting on hung lshw child processes:
root 29580 0.0 0.0 776184 27468 ? S<sl 19:44 0:00
/usr/bin/python2 /usr/share/vdsm/supervdsmd --sockfile
/var/run/vdsm/svdsm.sock
root 29627 0.0 0.0 20548 1452 ? D< 19:44 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 29808 0.0 0.0 20548 1456 ? D< 19:44 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 30013 0.0 0.0 20548 1452 ? D< 19:46 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 30064 0.0 0.0 20548 1456 ? D< 19:49 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
In the log we see many of these new errors:
2018-07-29 19:52:39,001+0300 WARN (vdsm.Scheduler) [Executor] Worker
blocked: <Worker name=periodic/1 running <Task <Operation
action=<vdsm.virt.sampling.HostMonitor object at 0x7f8c14152ad
0> at 0x7f8c14152b10> timeout=15, duration=345 at 0x7f8c140f9190> task#=0
at 0x7f8c1418d950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line
194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in
_execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in
__call__
self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220,
in __call__
self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 576,
in __call__
sample = HostSample(self._pid)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 240,
in __init__
self.interfaces = _get_interfaces_and_samples()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 200,
in _get_interfaces_and_samples
for link in ipwrapper.getLinks():
File: "/usr/lib/python2.7/site-packages/vdsm/network/ipwrapper.py", line
267, in getLinks
in six.viewitems(dpdk.get_dpdk_devices()))
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
44, in get_dpdk_devices
dpdk_devices = _get_dpdk_devices()
File: "/usr/lib/python2.7/site-packages/vdsm/common/cache.py", line 41, in
__call__
value = self.func(*args)
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
111, in _get_dpdk_devices
devices = _lshw_command()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
123, in _lshw_command
rc, out, err = cmd.exec_sync(['lshw', '-json'] + filterout_cmd)
File: "/usr/lib/python2.7/site-packages/vdsm/network/cmd.py", line 38, in
exec_sync
retcode, out, err = exec_sync_bytes(cmds)
File: "/usr/lib/python2.7/site-packages/vdsm/common/cmdutils.py", line 156,
in exec_cmd
out, err = p.communicate()
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in
communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in
communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in
_communicate
orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in
_communicate_with_poll
ready = poller.poll(self._remaining_time(endtime)) (executor:363)
I'm not sure it is reproducible; it happened after I:
- stopped vdsmd, supervdsmd, sanlock and wdmd
- installed a new sanlock version with an unrelated fix
- started vdsmd
Running the same command from the shell succeeds:
# time lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide
-disable scsi -disable dmi -disable memory -disable cpuinfo > /dev/null
real 0m0.744s
user 0m0.701s
sys 0m0.040s
It also creates fairly large JSON output, but this should not be an issue:
# lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide
-disable scsi -disable dmi -disable memory -disable cpuinfo | wc -c
143468
Do we have a flag to disable dpdk until this issue is fixed?
Nir
[ACTION REQUIRED] Dropping oVirt 4.1 in CI
by Ehud Yonasi
Hey everyone,
Since oVirt 4.1 has gone EOL, it is time to drop everything related to it in
CI, which includes:
- All 4.1-related jobs
- All 4.1 change queue jobs
- The 'tested' repository under resources
- The 4.1 nightly snapshot repository jobs
Maintainers, if your projects have any dependencies on the resources
above, please be sure to update them to more up-to-date or stable
resources. In particular, to depend on any 4.1 packages, please use the 4.1
released repo at:
Thanks in advance for your cooperation,
Ehud.
[ATT] OST 'master' is now broken because of a change in CentOS - DEVELOPMENT BLOCKED
by Barak Korren
The openstack-java-glance-* packages in CentOS have been updated in a way
that is incompatible with how the engine has been using Glance.
This in turn causes any OST run to break ATM, which means no patches are
currently making it past OST and CQ and into the 'tested' and nightly
snapshot repositories.
So far we've only seen this affect 'master', but since the change was made
in CentOS, there is no reason to believe it will not break other versions as
well.
A fix to the engine to make it compatible with the new package has been posted
here:
https://gerrit.ovirt.org/c/93352/
Additionally, an issue in OST made this harder to diagnose than it
should have been; it was fixed here:
https://gerrit.ovirt.org/c/93350/
Actions required:
1. Please avoid merging any unrelated patches until the issue is fixed
2. If you've merged any patches since Friday morning, please note that
they were probably removed from the change queue as failed changes, and
will need to be resubmitted by either merging a newer patch to the
relevant project, commenting "ci re-merge please" on the latest merged
patch in Gerrit, or resending the webhook event from GitHub.
3. If you can, please help with reviewing, merging, and back-porting the
patches above to speed up resolution of this issue.
Here is a list of projects for which we've seen patches get dropped over the
weekend:
- ovirt-provider-ovn
- ovirt-engine
- ovirt-ansible-vm-infra
- vdsm
Tracker ticket for this issue:
https://ovirt-jira.atlassian.net/browse/OVIRT-2375
--
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
[ OST Failure Report ] [ oVirt Master (ALL) ] [ 27-07-2018 ] [ 002_bootstrap.list_glance_images ]
by Dafna Ron
Hi,
We seem to have a problem in the OST tests that is causing failures in
multiple projects.
The issue is that setting __name__ on a local function is not working
correctly, so we exit with an error instead of skipping when the Glance
image is not available.
We had the same issue on a different test where Milan fixed it:
https://gerrit.ovirt.org/#/c/93167/
but now it seems that we have the same issue on list_glance_images as well.
ERROR:
Traceback (most recent call last):
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
testMethod()
File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
129, in wrapped_test
test()
File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
59, in wrapper
return func(get_test_prefix(), *args, **kwargs)
File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
78, in wrapper
prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/002_bootstrap.py",
line 633, in list_glance_images
raise SkipTest('%s: GLANCE connectivity test failed' %
list_glance_images_4.__name__ )
NameError: global name 'list_glance_images_4' is not defined
storage domain deactivating
by Hetz Ben Hamo
Hi,
As part of my "torture testing" of oVirt (4.2.4), I'm doing some testing
with 1 node where I shut down the HE and the machine (powering off, not
yanking the power cable) and reboot to see its survival status.
One of the weird things that happens is that when the machine boots and
starts the HE, it mounts the storage domains and everything works. However,
after a few moments, 3 of my 4 storage domains (ISO, export, and another
storage domain, but not the hosted_engine storage domain) are automatically
deactivated, with the following errors:
VDSM command GetFileStatsVDS failed: Storage domain does not exist:
(u'f241db01-2282-4204-8fe0-e27e36b3a909',)
Refresh image list failed for domain(s): ISO (ISO file type). Please check
domain activity.
Storage Domain ISO (Data Center HetzLabs) was deactivated by system because
it's not visible by any of the hosts.
Storage Domain data-NAS3 (Data Center HetzLabs) was deactivated by system
because it's not visible by any of the hosts.
Storage Domain export (Data Center HetzLabs) was deactivated by system
because it's not visible by any of the hosts.
However, when I see those messages and manually re-activate those
storage domains, all of them get the status "UP", there are no errors, and
I can see disks, images, etc.
Should I open a bug in Bugzilla about this issue?
Thanks
OST: Cluster compatibility testing
by Milan Zamazal
Hi, CI now runs OST basic-suite-master periodically with data center and
cluster versions different from the default one on master. That tests
changes in master for breakage when run in older version compatibility
modes. Any resulting failures are reported to infra.
If an OST test is known not to run on older cluster versions
(e.g. < 4.1), you should mark the test with a decorator such as
@versioning.require_version(4, 1)
You can also distinguish cluster-version-dependent code inside tests by
calling
versioning.cluster_version_ok(4, 1)
See examples in basic-suite-master.
You can run basic-suite-master with a different data center and cluster
version manually on your computer by setting the OST_DC_VERSION environment
variable, e.g.:
export OST_DC_VERSION=4.1
./run_suite.sh basic-suite-master
Barak, it's currently possible to request an OST run on Gerrit patches.
I was asked whether it is also possible to request an OST run with a
non-default cluster version(s). Is that possible or not?
Thanks,
Milan
May snapshot status be `OK' before snapshot creation finishes?
by Milan Zamazal
Hi, a failure while previewing a snapshot with memory has been seen in
the OST master suite, and I'm not sure whether the REST API responses about
snapshot status are correct or not.
When I'm creating a snapshot with memory, the <snapshot_status> reported
by the Engine REST API in /api/vms/…/snapshots is initially `locked' and
later switches to `ok'. The problem is that `ok' starts being reported
before snapshot creation completes, resulting in errors if I try to stop
the VM or to preview the snapshot at that moment.
Does <snapshot_status>ok</snapshot_status> guarantee that the snapshot
is completed or not? I can see the following example in
ovirt-engine-sdk (for a snapshot without memory):
# 'Waiting for Snapshot creation to finish'
snapshot_service = snapshots_service.snapshot_service(snapshot.id)
while True:
    time.sleep(5)
    snapshot = snapshot_service.get()
    if snapshot.snapshot_status == types.SnapshotStatus.OK:
        break
So I suppose the snapshot status shouldn't switch to OK before snapshot
creation finishes, but that's not what happens in Engine master. Is it a bug
or a feature?
Thanks,
Milan
Re: Failed to Run VM on host (self hosted engine)
by Deekshith
Dear Team
I have reinstalled the OS and tried again; the issue is the same and I am
unable to launch the VM. I don't have any SAN or other storage; it's all
local. Please help me to resolve the issue.
My server details:
Lenovo x3650 M5
Deekshith
[ OST Failure Report ] [ oVirt Master (ovirt-engine) ] [ 23-07-2018 ] [ 008_basic_ui_sanity.start_grid ]
by Dafna Ron
Hi,
The issue seems to be that host-1 stopped responding, and I can see some
fluentd errors which we should look at.
Jira opened to track this issue:
https://ovirt-jira.atlassian.net/browse/OVIRT-2363
Martin, I also added you to the Jira - can you please have a look?
Error from the host-1 messages log:
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14
-0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine'
host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd:
["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14
-0400 fluent.warn:
{"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached
forwarding server 'lago-basic-suite-master-engine'
host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15
-0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine'
host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd:
["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15
-0400 fluent.warn:
{"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached
forwarding server 'lago-basic-suite-master-engine'
host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command
Invoked with warn=False executable=None _uses_shell=False
_raw_params=systemctl is-active 'collectd' removes=None argv=None
creates=None chdir=None stdin=None
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New session
29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session 29
of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session 29
of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed
session 29.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]: failed to flush the buffer. error_class="RuntimeError"
error="no nodes are available" plugin_id="object:151a620"
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]: retry count exceededs limit.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in
`write_objects'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in
`write_chunk'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [error]: throwing away old logs.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes are
available","plugin_id":"object:151a620","message":"failed to flush the
buffer. error_class=\"RuntimeError\" error=\"no nodes are available\"
plugin_id=\"object:151a620\""}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 fluent.warn: {"message":"retry count exceededs limit."}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 fluent.error: {"message":"throwing away old logs."}
Thanks.
Dafna
On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenkins(a)ovirt.org> wrote:
> Change 92882,9 (ovirt-engine) is probably the reason behind recent system
> test
> failures in the "ovirt-master" change queue and needs to be fixed.
>
> This change had been removed from the testing queue. Artifacts build from
> this
> change will not be released until it is fixed.
>
> For further details about the change see:
> https://gerrit.ovirt.org/#/c/92882/9
>
> For failed test results see:
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
> _______________________________________________
> Infra mailing list -- infra(a)ovirt.org
> To unsubscribe send an email to infra-leave(a)ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-
> guidelines/
> List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/
> message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>