Build failed in Jenkins:
system-sync_mirrors-centos-sclo-rh-release-7.6-el7-x86_64 #267
by jenkins@jenkins.phx.ovirt.org
See <https://jenkins.ovirt.org/job/system-sync_mirrors-centos-sclo-rh-release-...>
Changes:
------------------------------------------
Started by timer
Running as SYSTEM
[EnvInject] - Loading node environment variables.
Building remotely on mirrors.phx.ovirt.org (mirrors) in workspace <https://jenkins.ovirt.org/job/system-sync_mirrors-centos-sclo-rh-release-...>
No credentials specified
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url http://gerrit.ovirt.org/jenkins.git # timeout=10
Cleaning workspace
> git rev-parse --verify HEAD # timeout=10
Resetting working tree
> git reset --hard # timeout=10
> git clean -fdx # timeout=10
Pruning obsolete local branches
Fetching upstream changes from http://gerrit.ovirt.org/jenkins.git
> git --version # timeout=10
> git fetch --tags --progress --prune http://gerrit.ovirt.org/jenkins.git +refs/heads/*:refs/remotes/origin/*
> git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 7ea39d9c23b94da5498c23a574c14b8403ef8ee5 (origin/master)
> git config core.sparsecheckout # timeout=10
> git checkout -f 7ea39d9c23b94da5498c23a574c14b8403ef8ee5
Commit message: "gate: Fix suit cloning in container"
> git rev-list --no-walk 7ea39d9c23b94da5498c23a574c14b8403ef8ee5 # timeout=10
[system-sync_mirrors-centos-sclo-rh-release-7.6-el7-x86_64] $ /bin/bash -xe /tmp/jenkins2256290304568313237.sh
+ jenkins/scripts/mirror_mgr.sh resync_yum_mirror centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf
+ MIRRORS_MP_BASE=/var/www/html/repos
+ MIRRORS_HTTP_BASE=http://mirrors.phx.ovirt.org/repos
+ MIRRORS_CACHE=/home/jenkins/mirrors_cache
+ MAX_LOCK_ATTEMPTS=120
+ LOCK_WAIT_INTERVAL=5
+ LOCK_BASE=/home/jenkins
+ OLD_MD_TO_KEEP=100
+ HTTP_SELINUX_TYPE=httpd_sys_content_t
+ HTTP_FILE_MODE=644
+ main resync_yum_mirror centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf
+ local command=resync_yum_mirror
+ command_args=("${@:2}")
+ local command_args
+ cmd_resync_yum_mirror centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf
+ local repo_name=centos-sclo-rh-release-7.6-el7
+ local repo_archs=x86_64
+ local reposync_conf=jenkins/data/mirrors-reposync.conf
+ local sync_needed
+ mkdir -p /home/jenkins/mirrors_cache
+ verify_repo_fs centos-sclo-rh-release-7.6-el7 yum
+ local repo_name=centos-sclo-rh-release-7.6-el7
+ local repo_type=yum
+ sudo install -o jenkins -d /var/www/html/repos/yum /var/www/html/repos/yum/centos-sclo-rh-release-7.6-el7 /var/www/html/repos/yum/centos-sclo-rh-release-7.6-el7/base
+ check_yum_sync_needed centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf sync_needed
+ local repo_name=centos-sclo-rh-release-7.6-el7
+ local repo_archs=x86_64
+ local reposync_conf=jenkins/data/mirrors-reposync.conf
+ local p_sync_needed=sync_needed
+ local reposync_out
+ echo 'Checking if mirror needs a resync'
Checking if mirror needs a resync
+ rm -rf /home/jenkins/mirrors_cache/centos-sclo-rh-release-7.6-el7
++ IFS=,
++ echo x86_64
+ for arch in '$(IFS=,; echo $repo_archs)'
++ run_reposync centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf --urls --quiet
++ local repo_name=centos-sclo-rh-release-7.6-el7
++ local repo_arch=x86_64
++ local reposync_conf=jenkins/data/mirrors-reposync.conf
++ extra_args=("${@:4}")
++ local extra_args
++ reposync --config=jenkins/data/mirrors-reposync.conf --repoid=centos-sclo-rh-release-7.6-el7 --arch=x86_64 --cachedir=/home/jenkins/mirrors_cache --download_path=/var/www/html/repos/yum/centos-sclo-rh-release-7.6-el7/base --norepopath --newest-only --urls --quiet
Error setting up repositories: failure: repodata/be198d32d786c933beb06e141f7f49d7b9671df70ecb09f05c39370d498a5f57-primary.sqlite.bz2 from centos-sclo-rh-release-7.6-el7: [Errno 256] No more mirrors to try.
http://vault.centos.org/7.6.1810/sclo/x86_64/rh/repodata/be198d32d786c933...: [Errno 16] error setting timestamp on file /home/jenkins/mirrors_cache/centos-sclo-rh-release-7.6-el7/be198d32d786c933beb06e141f7f49d7b9671df70ecb09f05c39370d498a5f57-primary.sqlite.bz2 from http://vault.centos.org/7.6.1810/sclo/x86_64/rh/repodata/be198d32d786c933..., OSError: [Errno 2] No such file or directory: '/home/jenkins/mirrors_cache/centos-sclo-rh-release-7.6-el7/be198d32d786c933beb06e141f7f49d7b9671df70ecb09f05c39370d498a5f57-primary.sqlite.bz2'
+ reposync_out=
Build step 'Execute shell' marked build as failure
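The error chain suggests the metadata file was downloaded successfully but then disappeared before reposync could set its timestamp, possibly because another job cleaned the shared mirrors cache concurrently. A minimal sketch of how that final OSError arises (the path below is illustrative, not the actual cache layout):

    import os

    # Hypothetical reproduction of the failure mode above: the downloader
    # fetches a file and then sets its mtime; if the file was removed in
    # between (e.g. by a concurrent cache cleanup), os.utime() raises
    # OSError(ENOENT), which reposync surfaces as "[Errno 16] error
    # setting timestamp on file ...".
    path = '/tmp/mirrors_cache_example/primary.sqlite.bz2'
    try:
        os.utime(path, None)  # set atime/mtime to "now"
    except OSError as e:
        print('error setting timestamp on %s: %s' % (path, e))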
OST CI: Time issue?
by Yedidyah Bar David
Hi all,
I looked at the failure of [1].
It timed out while waiting for the engine to be up, after restarting the engine VM.
lago.log [2] has:
2020-01-20 04:16:06,026::008_restart_he_vm.py::_start_he_vm::173::root::INFO::Starting VM...
...
2020-01-20 04:16:09,823::008_restart_he_vm.py::_start_he_vm::178::root::INFO::Waiting for VM to be UP...
...
2020-01-20 04:16:25,113::008_restart_he_vm.py::_start_he_vm::186::root::INFO::VM is UP.
2020-01-20 04:16:25,113::008_restart_he_vm.py::_wait_for_engine_health::190::root::INFO::Waiting for engine to start...
...
Then there is a loop that runs 'hosted-engine --vm-status --json' and
parses the status, specifically checking the health. The health status
does not change until the timeout (after 10 minutes), but the 'detail'
field, which comes right after 'vm', does change (a sketch of this
polling loop follows the two excerpts below). The first one is
'Powering up':
2020-01-20 04:16:25,113::ssh.py::ssh::89::lago.ssh::DEBUG::Command 9ef791c0 on lago-he-basic-role-remote-suite-4-3-host-0 output:
{
  "1": {
    "conf_on_shared_storage": true,
    "live-data": true,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5199 (Sun Jan 19 23:16:24 2020)\nhost-id=1\nscore=2400\nvm_conf_refresh_time=5199 (Sun Jan 19 23:16:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local",
    "host-id": 1,
    "engine-status": {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Powering up"},
    "score": 2400,
    "stopped": false,
    "maintenance": false,
    "crc32": "e7f42f2a",
    "local_conf_timestamp": 5199,
    "host-ts": 5199
  },
  "2": {
    "conf_on_shared_storage": true,
    "live-data": false,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5190 (Sun Jan 19 23:16:15 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5190 (Sun Jan 19 23:16:15 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-1",
    "host-id": 2,
    "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"},
    "score": 3400,
    "stopped": false,
    "maintenance": false,
    "crc32": "c3657e8b",
    "local_conf_timestamp": 5190,
    "host-ts": 5190
  },
  "global_maintenance": true
}
...
Then there are a few similar ones; the last one is from '2020-01-20
04:17:22,980', and then the detail changes to 'Up':
2020-01-20 04:17:27,517::ssh.py::ssh::89::lago.ssh::DEBUG::Command c426f53a on lago-he-basic-role-remote-suite-4-3-host-0 output:
{
  "1": {
    "conf_on_shared_storage": true,
    "live-data": true,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5259 (Sun Jan 19 23:17:24 2020)\nhost-id=1\nscore=2928\nvm_conf_refresh_time=5259 (Sun Jan 19 23:17:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local",
    "host-id": 1,
    "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"},
    "score": 2928,
    "stopped": false,
    "maintenance": false,
    "crc32": "08d4bfe1",
    "local_conf_timestamp": 5259,
    "host-ts": 5259
  },
  "2": {
    "conf_on_shared_storage": true,
    "live-data": true,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5250 (Sun Jan 19 23:17:16 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5251 (Sun Jan 19 23:17:16 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-1",
    "host-id": 2,
    "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"},
    "score": 3400,
    "stopped": false,
    "maintenance": false,
    "crc32": "ab277249",
    "local_conf_timestamp": 5251,
    "host-ts": 5250
  },
  "global_maintenance": true
}
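For reference, here is a minimal sketch of the kind of polling loop described above. It is illustrative only: the real code lives in 008_restart_he_vm.py and runs the command over lago's SSH helpers rather than subprocess, and the function name and parameters below are assumptions.

    import json
    import subprocess
    import time

    def wait_for_engine_health(timeout=600, interval=5):
        # Poll 'hosted-engine --vm-status --json' until some host reports
        # the engine VM as up and healthy, or until the timeout expires.
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(
                ['hosted-engine', '--vm-status', '--json'])
            status = json.loads(out)
            for key, host in status.items():
                if key == 'global_maintenance':
                    continue
                engine = host.get('engine-status', {})
                # 'health' stays "bad" until the engine's health page
                # answers ("failed liveliness check"); 'detail' moves
                # through "Powering up" and "Up" well before that.
                if engine.get('vm') == 'up' and engine.get('health') == 'good':
                    return True
            time.sleep(interval)
        return False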
/var/log/messages of the engine vm [3] has:
Jan 19 23:17:57 lago-he-basic-role-remote-suite-4-3-engine kernel: Initializing cgroup subsys cpuset
Meaning, the first line after the reboot has a timestamp that is almost
2 minutes after starting the VM (the lago log is in UTC and the guest
logs are in -05:00, so 'Starting VM' at 04:16:06 corresponds to
23:16:06, and 23:17:57 is 1 minute 51 seconds later). That's a long
time. It is also half a minute after 'detail' changed to 'Up', above
(no idea where we take this 'detail' from, though). Then, the first
line that is not from the kernel:
Jan 19 23:17:58 lago-he-basic-role-remote-suite-4-3-engine systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
then it starts lots of services, and then:
Jan 19 23:20:31 lago-he-basic-role-remote-suite-4-3-engine systemd: Started oVirt Engine websockets proxy.
Jan 19 23:20:33 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 45.33.103.94
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock wrong by 3.919225 seconds, adjustment started
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine systemd: Time has been changed
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock was stepped by 3.919225 seconds
Jan 19 23:20:38 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:20:43 lago-he-basic-role-remote-suite-4-3-engine cloud-init: Cloud-init v. 18.5 running 'modules:config' at Mon, 20 Jan 2020 04:20:39 +0000. Up 182.82 seconds.
and later:
Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Started Update UTMP about System Runlevel Changes.
Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Startup finished in 22.344s (kernel) + 32.237s (initrd) + 3min 815ms (userspace) = 3min 55.397s.
Jan 19 23:21:43 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 206.55.191.142
Jan 19 23:23:52 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:26:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Created slice User Slice of root.
engine.log's [4] first line after the reboot is:
2020-01-19 23:26:18,251-05 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 103) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
which is a mere 40 seconds before the timeout, and server.log's [5]
first line is:
2020-01-19 23:23:13,392-05 INFO [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0027: Starting deployment of "apidoc.war" (runtime-name: "apidoc.war")
which is about 3 minutes before that.
I do not know much about what the engine does during its startup, how
the logs should look, or how long it should take between startup and a
working health page. If this all looks normal, then perhaps we should
simply give it more than 10 minutes. Otherwise, there is either heavy
load on the infra, or perhaps some NTP problem.
Does anyone have a clue?
Anyway, I have now pushed a patch [6] to allow up to 20 minutes, at
least to reduce the noise.
Thanks and best regards,
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[5] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[6] https://gerrit.ovirt.org/106411
--
Didi