Build failed in Jenkins:
system-sync_mirrors-centos-sclo-rh-release-7.6-el7-x86_64 #267
by jenkins@jenkins.phx.ovirt.org
See <https://jenkins.ovirt.org/job/system-sync_mirrors-centos-sclo-rh-release-...>
Changes:
------------------------------------------
Started by timer
Running as SYSTEM
[EnvInject] - Loading node environment variables.
Building remotely on mirrors.phx.ovirt.org (mirrors) in workspace <https://jenkins.ovirt.org/job/system-sync_mirrors-centos-sclo-rh-release-...>
No credentials specified
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url http://gerrit.ovirt.org/jenkins.git # timeout=10
Cleaning workspace
> git rev-parse --verify HEAD # timeout=10
Resetting working tree
> git reset --hard # timeout=10
> git clean -fdx # timeout=10
Pruning obsolete local branches
Fetching upstream changes from http://gerrit.ovirt.org/jenkins.git
> git --version # timeout=10
> git fetch --tags --progress --prune http://gerrit.ovirt.org/jenkins.git +refs/heads/*:refs/remotes/origin/*
> git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 7ea39d9c23b94da5498c23a574c14b8403ef8ee5 (origin/master)
> git config core.sparsecheckout # timeout=10
> git checkout -f 7ea39d9c23b94da5498c23a574c14b8403ef8ee5
Commit message: "gate: Fix suit cloning in container"
> git rev-list --no-walk 7ea39d9c23b94da5498c23a574c14b8403ef8ee5 # timeout=10
[system-sync_mirrors-centos-sclo-rh-release-7.6-el7-x86_64] $ /bin/bash -xe /tmp/jenkins2256290304568313237.sh
+ jenkins/scripts/mirror_mgr.sh resync_yum_mirror centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf
+ MIRRORS_MP_BASE=/var/www/html/repos
+ MIRRORS_HTTP_BASE=http://mirrors.phx.ovirt.org/repos
+ MIRRORS_CACHE=/home/jenkins/mirrors_cache
+ MAX_LOCK_ATTEMPTS=120
+ LOCK_WAIT_INTERVAL=5
+ LOCK_BASE=/home/jenkins
+ OLD_MD_TO_KEEP=100
+ HTTP_SELINUX_TYPE=httpd_sys_content_t
+ HTTP_FILE_MODE=644
+ main resync_yum_mirror centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf
+ local command=resync_yum_mirror
+ command_args=("${@:2}")
+ local command_args
+ cmd_resync_yum_mirror centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf
+ local repo_name=centos-sclo-rh-release-7.6-el7
+ local repo_archs=x86_64
+ local reposync_conf=jenkins/data/mirrors-reposync.conf
+ local sync_needed
+ mkdir -p /home/jenkins/mirrors_cache
+ verify_repo_fs centos-sclo-rh-release-7.6-el7 yum
+ local repo_name=centos-sclo-rh-release-7.6-el7
+ local repo_type=yum
+ sudo install -o jenkins -d /var/www/html/repos/yum /var/www/html/repos/yum/centos-sclo-rh-release-7.6-el7 /var/www/html/repos/yum/centos-sclo-rh-release-7.6-el7/base
+ check_yum_sync_needed centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf sync_needed
+ local repo_name=centos-sclo-rh-release-7.6-el7
+ local repo_archs=x86_64
+ local reposync_conf=jenkins/data/mirrors-reposync.conf
+ local p_sync_needed=sync_needed
+ local reposync_out
+ echo 'Checking if mirror needs a resync'
Checking if mirror needs a resync
+ rm -rf /home/jenkins/mirrors_cache/centos-sclo-rh-release-7.6-el7
++ IFS=,
++ echo x86_64
+ for arch in '$(IFS=,; echo $repo_archs)'
++ run_reposync centos-sclo-rh-release-7.6-el7 x86_64 jenkins/data/mirrors-reposync.conf --urls --quiet
++ local repo_name=centos-sclo-rh-release-7.6-el7
++ local repo_arch=x86_64
++ local reposync_conf=jenkins/data/mirrors-reposync.conf
++ extra_args=("${@:4}")
++ local extra_args
++ reposync --config=jenkins/data/mirrors-reposync.conf --repoid=centos-sclo-rh-release-7.6-el7 --arch=x86_64 --cachedir=/home/jenkins/mirrors_cache --download_path=/var/www/html/repos/yum/centos-sclo-rh-release-7.6-el7/base --norepopath --newest-only --urls --quiet
Error setting up repositories: failure: repodata/be198d32d786c933beb06e141f7f49d7b9671df70ecb09f05c39370d498a5f57-primary.sqlite.bz2 from centos-sclo-rh-release-7.6-el7: [Errno 256] No more mirrors to try.
http://vault.centos.org/7.6.1810/sclo/x86_64/rh/repodata/be198d32d786c933...: [Errno 16] error setting timestamp on file /home/jenkins/mirrors_cache/centos-sclo-rh-release-7.6-el7/be198d32d786c933beb06e141f7f49d7b9671df70ecb09f05c39370d498a5f57-primary.sqlite.bz2 from http://vault.centos.org/7.6.1810/sclo/x86_64/rh/repodata/be198d32d786c933..., OSError: [Errno 2] No such file or directory: '/home/jenkins/mirrors_cache/centos-sclo-rh-release-7.6-el7/be198d32d786c933beb06e141f7f49d7b9671df70ecb09f05c39370d498a5f57-primary.sqlite.bz2'
+ reposync_out=
Build step 'Execute shell' marked build as failure
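The error chain suggests the metadata file was downloaded successfully but then disappeared before reposync could set its timestamp, possibly because another job cleaned the shared mirrors cache concurrently. A minimal sketch of how that final OSError arises (the path below is illustrative, not the actual cache layout):

    import os

    # Hypothetical reproduction of the failure mode above: the downloader
    # fetches a file and then sets its mtime; if the file was removed in
    # between (e.g. by a concurrent cache cleanup), os.utime() raises
    # OSError(ENOENT), which reposync surfaces as "[Errno 16] error
    # setting timestamp on file ...".
    path = '/tmp/mirrors_cache_example/primary.sqlite.bz2'
    try:
        os.utime(path, None)  # set atime/mtime to "now"
    except OSError as e:
        print('error setting timestamp on %s: %s' % (path, e))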
OST CI: Time issue?
by Yedidyah Bar David
Hi all,
I looked at the failure of [1].
It timed out while waiting for the engine to be up, after restarting the engine VM.
lago.log [2] has:
2020-01-20 04:16:06,026::008_restart_he_vm.py::_start_he_vm::173::root::INFO::Starting VM...
...
2020-01-20 04:16:09,823::008_restart_he_vm.py::_start_he_vm::178::root::INFO::Waiting for VM to be UP...
...
2020-01-20 04:16:25,113::008_restart_he_vm.py::_start_he_vm::186::root::INFO::VM is UP.
2020-01-20 04:16:25,113::008_restart_he_vm.py::_wait_for_engine_health::190::root::INFO::Waiting for engine to start...
...
Then there is a loop that runs 'hosted-engine --vm-status --json' and
parses the status, specifically checking the health. The health status
does not change until the timeout (after 10 minutes), but the 'detail'
field, which comes right after 'vm', does change (a sketch of this
polling loop follows the two excerpts below). The first one is
'Powering up':
2020-01-20 04:16:25,113::ssh.py::ssh::89::lago.ssh::DEBUG::Command 9ef791c0 on lago-he-basic-role-remote-suite-4-3-host-0 output:
{
  "1": {
    "conf_on_shared_storage": true,
    "live-data": true,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5199 (Sun Jan 19 23:16:24 2020)\nhost-id=1\nscore=2400\nvm_conf_refresh_time=5199 (Sun Jan 19 23:16:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local",
    "host-id": 1,
    "engine-status": {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Powering up"},
    "score": 2400,
    "stopped": false,
    "maintenance": false,
    "crc32": "e7f42f2a",
    "local_conf_timestamp": 5199,
    "host-ts": 5199
  },
  "2": {
    "conf_on_shared_storage": true,
    "live-data": false,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5190 (Sun Jan 19 23:16:15 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5190 (Sun Jan 19 23:16:15 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-1",
    "host-id": 2,
    "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"},
    "score": 3400,
    "stopped": false,
    "maintenance": false,
    "crc32": "c3657e8b",
    "local_conf_timestamp": 5190,
    "host-ts": 5190
  },
  "global_maintenance": true
}
...
Then there are a few similar ones; the last one is from '2020-01-20
04:17:22,980', and then the detail changes to 'Up':
2020-01-20 04:17:27,517::ssh.py::ssh::89::lago.ssh::DEBUG::Command c426f53a on lago-he-basic-role-remote-suite-4-3-host-0 output:
{
  "1": {
    "conf_on_shared_storage": true,
    "live-data": true,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5259 (Sun Jan 19 23:17:24 2020)\nhost-id=1\nscore=2928\nvm_conf_refresh_time=5259 (Sun Jan 19 23:17:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local",
    "host-id": 1,
    "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"},
    "score": 2928,
    "stopped": false,
    "maintenance": false,
    "crc32": "08d4bfe1",
    "local_conf_timestamp": 5259,
    "host-ts": 5259
  },
  "2": {
    "conf_on_shared_storage": true,
    "live-data": true,
    "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5250 (Sun Jan 19 23:17:16 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5251 (Sun Jan 19 23:17:16 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
    "hostname": "lago-he-basic-role-remote-suite-4-3-host-1",
    "host-id": 2,
    "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"},
    "score": 3400,
    "stopped": false,
    "maintenance": false,
    "crc32": "ab277249",
    "local_conf_timestamp": 5251,
    "host-ts": 5250
  },
  "global_maintenance": true
}
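For reference, here is a minimal sketch of the kind of polling loop described above. It is illustrative only: the real code lives in 008_restart_he_vm.py and runs the command over lago's SSH helpers rather than subprocess, and the function name and parameters below are assumptions.

    import json
    import subprocess
    import time

    def wait_for_engine_health(timeout=600, interval=5):
        # Poll 'hosted-engine --vm-status --json' until some host reports
        # the engine VM as up and healthy, or until the timeout expires.
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(
                ['hosted-engine', '--vm-status', '--json'])
            status = json.loads(out)
            for key, host in status.items():
                if key == 'global_maintenance':
                    continue
                engine = host.get('engine-status', {})
                # 'health' stays "bad" until the engine's health page
                # answers ("failed liveliness check"); 'detail' moves
                # through "Powering up" and "Up" well before that.
                if engine.get('vm') == 'up' and engine.get('health') == 'good':
                    return True
            time.sleep(interval)
        return False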
/var/log/messages of the engine vm [3] has:
Jan 19 23:17:57 lago-he-basic-role-remote-suite-4-3-engine kernel: Initializing cgroup subsys cpuset
Meaning, the first line after the reboot has a timestamp that is almost
2 minutes after starting the VM (the lago log is in UTC and the guest
logs are in -05:00, so 'Starting VM' at 04:16:06 corresponds to
23:16:06, and 23:17:57 is 1 minute 51 seconds later). That's a long
time. It is also half a minute after 'detail' changed to 'Up', above
(no idea where we take this 'detail' from, though). Then, the first
line that is not from the kernel:
Jan 19 23:17:58 lago-he-basic-role-remote-suite-4-3-engine systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
then it starts lots of services, and then:
Jan 19 23:20:31 lago-he-basic-role-remote-suite-4-3-engine systemd: Started oVirt Engine websockets proxy.
Jan 19 23:20:33 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 45.33.103.94
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock wrong by 3.919225 seconds, adjustment started
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine systemd: Time has been changed
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock was stepped by 3.919225 seconds
Jan 19 23:20:38 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:20:43 lago-he-basic-role-remote-suite-4-3-engine cloud-init: Cloud-init v. 18.5 running 'modules:config' at Mon, 20 Jan 2020 04:20:39 +0000. Up 182.82 seconds.
and later:
Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Started Update UTMP about System Runlevel Changes.
Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Startup finished in 22.344s (kernel) + 32.237s (initrd) + 3min 815ms (userspace) = 3min 55.397s.
Jan 19 23:21:43 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 206.55.191.142
Jan 19 23:23:52 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:26:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Created slice User Slice of root.
engine.log's [4] first line after the reboot is:
2020-01-19 23:26:18,251-05 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 103) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.
which is a mere 40 seconds before the timeout, and server.log's [5]
first line is:
2020-01-19 23:23:13,392-05 INFO [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0027: Starting deployment of "apidoc.war" (runtime-name: "apidoc.war")
which is about 3 minutes before that.
I do not know much about what the engine does during its startup, how
the logs should look, or how long it should take between startup and a
working health page. If this all looks normal, then perhaps we should
simply give it more than 10 minutes. Otherwise, there is either heavy
load on the infra, or perhaps some NTP problem.
Does anyone have a clue?
Anyway, I have now pushed a patch [6] to allow up to 20 minutes, at
least to reduce the noise.
Thanks and best regards,
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[5] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
[6] https://gerrit.ovirt.org/106411
--
Didi