Vdsm stuck in get_dpdk_devices
by Nir Soffer
I'm now seeing a new issue with the latest vdsm 4.2.5:
On the engine side, the host is not responding.
On the host, vdsm is stuck waiting on hung lshw child processes:
root 29580 0.0 0.0 776184 27468 ? S<sl 19:44 0:00
/usr/bin/python2 /usr/share/vdsm/supervdsmd --sockfile
/var/run/vdsm/svdsm.sock
root 29627 0.0 0.0 20548 1452 ? D< 19:44 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 29808 0.0 0.0 20548 1456 ? D< 19:44 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 30013 0.0 0.0 20548 1452 ? D< 19:46 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
root 30064 0.0 0.0 20548 1456 ? D< 19:49 0:00 \_ lshw
-json -disable usb -disable pcmcia -disable isapnp -disable ide -disable
scsi -disable dmi -disable memory -disable
In the log we see many of these new errors:
2018-07-29 19:52:39,001+0300 WARN (vdsm.Scheduler) [Executor] Worker
blocked: <Worker name=periodic/1 running <Task <Operation
action=<vdsm.virt.sampling.HostMonitor object at 0x7f8c14152ad
0> at 0x7f8c14152b10> timeout=15, duration=345 at 0x7f8c140f9190> task#=0
at 0x7f8c1418d950>, traceback:
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line
194, in run
ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in
_execute_task
task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in
__call__
self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 220,
in __call__
self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 576,
in __call__
sample = HostSample(self._pid)
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 240,
in __init__
self.interfaces = _get_interfaces_and_samples()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 200,
in _get_interfaces_and_samples
for link in ipwrapper.getLinks():
File: "/usr/lib/python2.7/site-packages/vdsm/network/ipwrapper.py", line
267, in getLinks
in six.viewitems(dpdk.get_dpdk_devices()))
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
44, in get_dpdk_devices
dpdk_devices = _get_dpdk_devices()
File: "/usr/lib/python2.7/site-packages/vdsm/common/cache.py", line 41, in
__call__
value = self.func(*args)
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
111, in _get_dpdk_devices
devices = _lshw_command()
File: "/usr/lib/python2.7/site-packages/vdsm/network/link/dpdk.py", line
123, in _lshw_command
rc, out, err = cmd.exec_sync(['lshw', '-json'] + filterout_cmd)
File: "/usr/lib/python2.7/site-packages/vdsm/network/cmd.py", line 38, in
exec_sync
retcode, out, err = exec_sync_bytes(cmds)
File: "/usr/lib/python2.7/site-packages/vdsm/common/cmdutils.py", line 156,
in exec_cmd
out, err = p.communicate()
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in
communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 924, in
communicate
stdout, stderr = self._communicate(input, endtime, timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1706, in
_communicate
orig_timeout)
File: "/usr/lib64/python2.7/site-packages/subprocess32.py", line 1779, in
_communicate_with_poll
ready = poller.poll(self._remaining_time(endtime)) (executor:363)
I'm not sure it is reproducible; it happened after I:
- stopped vdsmd, supervdsmd, sanlock and wdmd
- installed a new sanlock version with an unrelated fix
- started vdsmd
Running the same command from the shell succeeds:
# time lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide
-disable scsi -disable dmi -disable memory -disable cpuinfo > /dev/null
real 0m0.744s
user 0m0.701s
sys 0m0.040s
And it creates fairly large JSON output, but this should not be an issue:
# lshw -json -disable usb -disable pcmcia -disable isapnp -disable ide
-disable scsi -disable dmi -disable memory -disable cpuinfo | wc -c
143468
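For what it's worth, here is a minimal sketch (not vdsm code; the command line
mirrors the one shown above) of bounding the wait on the lshw child instead of
blocking the periodic worker indefinitely. It assumes subprocess32 (the module
the traceback shows vdsm using) or Python 3's subprocess, both of which accept
a timeout in communicate():

    import subprocess32 as subprocess  # on Python 3: import subprocess

    LSHW_CMD = ['lshw', '-json',
                '-disable', 'usb', '-disable', 'pcmcia', '-disable', 'isapnp',
                '-disable', 'ide', '-disable', 'scsi', '-disable', 'dmi',
                '-disable', 'memory', '-disable', 'cpuinfo']

    def run_lshw(timeout=30):
        p = subprocess.Popen(LSHW_CMD, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
        try:
            out, err = p.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Stop waiting; a child stuck in uninterruptible sleep (D state,
            # as in the ps output above) may ignore even SIGKILL, so this only
            # bounds the caller's wait, it does not guarantee cleanup.
            p.kill()
            raise
        return p.returncode, out, err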
Do we have a flag to disable dpdk until this issue is fixed?
Nir
6 years, 4 months
[ACTION REQUIRED] Dropping oVirt 4.1 in CI
by Ehud Yonasi
Hey everyone,
Since oVirt 4.1 has gone EOL, it is time to drop all related jobs in CI,
which includes:
- All 4.1-related jobs
- All 4.1 change queue jobs
- The 'tested' repository under resources
- The 4.1 nightly snapshot repository jobs
Maintainers, if your projects have any dependencies on the resources
above, please be sure to update them to more up-to-date or stable
resources. In particular, to depend on any 4.1 packages, please use the 4.1
released repo at:
Thanks in advance for your cooperation,
Ehud.
6 years, 4 months
[ATT] OST 'master' is now broken because of a change in CentOS - DEVELOPMENT BLOCKED
by Barak Korren
The openstack-java-glance-* packages in CentOS have been updated in a way
that is incompatible with how the engine has been using Glance.
This in turn causes any OST run to break ATM, which means no patches are
currently making it past OST and CQ and into the 'tested' and nightly
snapshot repositories.
So far we've only seen this affect 'master', but since the change was made
in CentOS, there is no reason to believe it will not break other versions as
well.
A fix to engine to make it compatible with the new package has been posted
here:
https://gerrit.ovirt.org/c/93352/
Additionally, an issue in OST had made this harder to diagnose than it
should have been; it was fixed here:
https://gerrit.ovirt.org/c/93350/
Actions required:
1. Please avoid merging any unrelated patches until the issue is fixed.
2. If you've merged any patches since Friday morning, please note that they
were probably removed from the change queue as failed changes, and will
need to be resubmitted by either merging a newer patch to the relevant
project, commenting "ci re-merge please" on the latest merged patch in
Gerrit, or resending the webhook event from GitHub.
3. If you can, please help with reviewing, merging and back-porting the
patches above to speed up the resolution of this issue.
Here is a list of projects for which we've seen patches get dropped over the
weekend:
- ovirt-provider-ovn
- ovirt-engine
- ovirt-ansible-vm-infra
- vdsm
Tracker ticket for this issue:
https://ovirt-jira.atlassian.net/browse/OVIRT-2375
--
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
6 years, 4 months
[ OST Failure Report ] [ oVirt Master (ALL) ] [ 27-07-2018 ] [ 002_bootstrap.list_glance_images ]
by Dafna Ron
Hi,
We seem to have a problem in the OST tests that is causing failures in
multiple projects.
The issue is that setting __name__ on a local function is not working
correctly, and we exit with an error instead of skipping when the Glance
image is not available.
We had the same issue on a different test where Milan fixed it:
https://gerrit.ovirt.org/#/c/93167/
but now it seems that we have the same issue with list_glance_images as well.
ERROR:
Traceback (most recent call last):
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
testMethod()
File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
129, in wrapped_test
test()
File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
59, in wrapper
return func(get_test_prefix(), *args, **kwargs)
File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line
78, in wrapper
prefix.virt_env.engine_vm().get_api(api_ver=4), *args, **kwargs
File "/home/jenkins/workspace/ovirt-master_change-queue-tester/ovirt-system-tests/basic-suite-master/test-scenarios/002_bootstrap.py",
line 633, in list_glance_images
raise SkipTest('%s: GLANCE connectivity test failed' %
list_glance_images_4.__name__ )
NameError: global name 'list_glance_images_4' is not defined
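To illustrate the pattern, here is a minimal sketch (hypothetical names, not
the actual OST code): the generated test's __name__ is set on the local
function object, so the skip message has to use that object rather than a bare
global such as list_glance_images_4, which does not exist and raises the
NameError above instead of the intended SkipTest.

    from nose.plugins.skip import SkipTest  # assumption: nose, as in the traceback

    def make_list_glance_images_test(api_version, glance_is_reachable):
        def list_glance_images(prefix):
            if not glance_is_reachable(prefix):
                # OK: refers to the local function object, whatever its
                # __name__ has been set to, instead of a non-existent global.
                raise SkipTest('%s: GLANCE connectivity test failed'
                               % list_glance_images.__name__)
            # the actual test body would go here

        # Renaming the local function only helps if later code uses this object.
        list_glance_images.__name__ = 'list_glance_images_%s' % api_version
        return list_glance_images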
6 years, 4 months
storage domain deactivating
by Hetz Ben Hamo
Hi,
As part of my "torture testing" of oVirt (4.2.4), I'm doing some testing
with 1 node where I shut down the HE and the machine (powering off, not
yanking the power cable) and reboot to see its survival status.
One of the weird things that happens is that when the machine boots and
starts the HE, it mounts the storage domains and everything works. However,
after a few moments, 3 of my 4 storage domains (ISO, export, and another
storage domain, but not the hosted_engine storage domain) are being
automatically deactivated, with the following errors:
VDSM command GetFileStatsVDS failed: Storage domain does not exist:
(u'f241db01-2282-4204-8fe0-e27e36b3a909',)
Refresh image list failed for domain(s): ISO (ISO file type). Please check
domain activity.
Storage Domain ISO (Data Center HetzLabs) was deactivated by system because
it's not visible by any of the hosts.
Storage Domain data-NAS3 (Data Center HetzLabs) was deactivated by system
because it's not visible by any of the hosts.
Storage Domain export (Data Center HetzLabs) was deactivated by system
because it's not visible by any of the hosts.
However, when I see those messages and manually re-activate those
storage domains, all of them get the status "UP", there are no
errors, and I can see disks, images, etc...
Should I open a bug in Bugzilla about this issue?
Thanks
6 years, 4 months
OST: Cluster compatibility testing
by Milan Zamazal
Hi, CI now runs OST basic-suite-master periodically with data center and
cluster versions different from the default one on master. That tests
changes in master for breakages when run in older version compatibility
modes. Any resulting failures are reported to infra.
If an OST test is known not to run on older cluster versions
(e.g. < 4.1), you should mark the test with a decorator such as
@versioning.require_version(4, 1)
You can also distinguish cluster version dependent code inside tests by
calling
versioning.cluster_version_ok(4, 1)
See examples in basic-suite-master.
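As a rough sketch (hypothetical test names; the import path of the versioning
helpers is an assumption, please check basic-suite-master for the real one),
both mechanisms look like this:

    from test_utils import versioning  # assumption: import path used by the suites

    @versioning.require_version(4, 1)
    def add_vm_pool(prefix):
        # Skipped entirely when the DC/cluster version is older than 4.1.
        pass  # test body would go here

    def verify_vm_status(prefix):
        if versioning.cluster_version_ok(4, 1):
            pass  # code path that relies on >= 4.1 cluster features
        else:
            pass  # fallback for older compatibility versions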
You can run basic-suite-master with a different data center and cluster
version manually on your computer by setting the OST_DC_VERSION environment
variable, e.g.:
export OST_DC_VERSION=4.1
./run_suite.sh basic-suite-master
Barak, it's currently possible to request an OST run on Gerrit patches.
I was asked whether it is also possible to request an OST run with
non-default cluster version(s). Is it or not?
Thanks,
Milan
6 years, 5 months
May snapshot status be `OK' before snapshot creation finishes?
by Milan Zamazal
Hi, a failure to preview a snapshot with memory has been experienced
in the OST master suite, and I'm not sure whether the REST API responses
about snapshot status are correct or not.
When I'm creating a snapshot with memory, the <snapshot_status> reported
by the Engine REST API in /api/vms/…/snapshots is initially `locked' and
later switches to `ok'. The problem is that `ok' starts being reported
before snapshot creation completes, resulting in errors if I try to stop
the VM or to preview the snapshot at that moment.
Does <snapshot_status>ok</snapshot_status> guarantee that the snapshot
is completed or not? I can see the following example in
ovirt-engine-sdk (for snapshot without memory):
# 'Waiting for Snapshot creation to finish'
snapshot_service = snapshots_service.snapshot_service(snapshot.id)
while True:
    time.sleep(5)
    snapshot = snapshot_service.get()
    if snapshot.snapshot_status == types.SnapshotStatus.OK:
        break
So I suppose the snapshot status shouldn't switch to OK before snapshot
creation finishes. But that's not the case in Engine master. Is it a bug or
a feature?
Thanks,
Milan
6 years, 5 months
Re: Failed to Run VM on host (self hosted engine)
by Deekshith
Dear Team
I have reinstalled the OS and tried again; the issue is the same, I am unable to launch the VM. I don’t have any SAN or other storage; it’s all local.
Please help me to resolve the issue
My server details:
Lenovo x3650 M5 ,
Deekshith
6 years, 5 months
[ OST Failure Report ] [ oVirt Master (ovirt-engine) ] [ 23-07-2018 ] [ 008_basic_ui_sanity.start_grid ]
by Dafna Ron
Hi,
The issue seems to be that host-1 stopped responding, and I can see some
fluentd errors which we should look at.
Jira opened to track this issue:
https://ovirt-jira.atlassian.net/browse/OVIRT-2363
Martin, I also added you to the Jira - can you please have a look?
error from node-1 messages log:
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14
-0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine'
host="lago-basic-suite-master-engine" port=24224 phi=16.275347714068506
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd:
["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:14 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:14
-0400 fluent.warn:
{"host":"lago-basic-suite-master-engine","port":24224,"phi":16.275347714068506,"message":"detached
forwarding server 'lago-basic-suite-master-engine'
host=\"lago-basic-suite-master-engine\" port=24224 phi=16.275347714068506"}
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15
-0400 [warn]: detached forwarding server 'lago-basic-suite-master-engine'
host="lago-basic-suite-master-engine" port=24224 phi=16.70444149784817
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd:
["lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine",
"lago-basic-suite-master-engine", "lago-basic-suite-master-engine"]
Jul 23 05:09:15 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:15
-0400 fluent.warn:
{"host":"lago-basic-suite-master-engine","port":24224,"phi":16.70444149784817,"message":"detached
forwarding server 'lago-basic-suite-master-engine'
host=\"lago-basic-suite-master-engine\" port=24224 phi=16.70444149784817"}
Jul 23 05:09:23 lago-basic-suite-master-host-1 python: ansible-command
Invoked with warn=False executable=None _uses_shell=False
_raw_params=systemctl is-active 'collectd' removes=None argv=None
creates=None chdir=None stdin=None
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: New session
29 of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Started Session 29
of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd: Starting Session 29
of user root.
Jul 23 05:09:25 lago-basic-suite-master-host-1 systemd-logind: Removed
session 29.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]: failed to flush the buffer. error_class="RuntimeError"
error="no nodes are available" plugin_id="object:151a620"
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]: retry count exceededs limit.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/plugin/out_forward.rb:222:in
`write_objects'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:490:in `write'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:354:in
`write_chunk'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/buffer.rb:333:in `pop'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:342:in `try_flush'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [warn]:
/usr/share/gems/gems/fluentd-0.12.42/lib/fluent/output.rb:149:in `run'
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 [error]: throwing away old logs.
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 fluent.warn: {"error_class":"RuntimeError","error":"no nodes are
available","plugin_id":"object:151a620","message":"failed to flush the
buffer. error_class=\"RuntimeError\" error=\"no nodes are available\"
plugin_id=\"object:151a620\""}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 fluent.warn: {"message":"retry count exceededs limit."}
Jul 23 05:09:27 lago-basic-suite-master-host-1 fluentd: 2018-07-23 05:09:27
-0400 fluent.error: {"message":"throwing away old logs."}
Thanks.
Dafna
On Mon, Jul 23, 2018 at 10:31 AM, oVirt Jenkins <jenkins(a)ovirt.org> wrote:
> Change 92882,9 (ovirt-engine) is probably the reason behind recent system
> test
> failures in the "ovirt-master" change queue and needs to be fixed.
>
> This change had been removed from the testing queue. Artifacts build from
> this
> change will not be released until it is fixed.
>
> For further details about the change see:
> https://gerrit.ovirt.org/#/c/92882/9
>
> For failed test results see:
> http://jenkins.ovirt.org/job/ovirt-master_change-queue-tester/8764/
> _______________________________________________
> Infra mailing list -- infra(a)ovirt.org
> To unsubscribe send an email to infra-leave(a)ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-
> guidelines/
> List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/
> message/6LYYXSGM4LQSRVSYY3IJEIE64LW27TJM/
>
6 years, 5 months