
Hi all,

I looked at the failure of [1]. It timed out while waiting for the engine to be up, after restarting the engine VM.

lago.log [2] has:

2020-01-20 04:16:06,026::008_restart_he_vm.py::_start_he_vm::173::root::INFO::Starting VM...
...
2020-01-20 04:16:09,823::008_restart_he_vm.py::_start_he_vm::178::root::INFO::Waiting for VM to be UP...
...
2020-01-20 04:16:25,113::008_restart_he_vm.py::_start_he_vm::186::root::INFO::VM is UP.
2020-01-20 04:16:25,113::008_restart_he_vm.py::_wait_for_engine_health::190::root::INFO::Waiting for engine to start...
...

Then there is a loop that runs 'hosted-engine --vm-status --json' and parses the status, specifically checking the health. The health status does not change until the timeout (after 10 minutes), but after 'vm' there is 'detail', which does change. The first one is 'Powering up':

2020-01-20 04:16:25,113::ssh.py::ssh::89::lago.ssh::DEBUG::Command 9ef791c0 on lago-he-basic-role-remote-suite-4-3-host-0 output:
{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5199 (Sun Jan 19 23:16:24 2020)\nhost-id=1\nscore=2400\nvm_conf_refresh_time=5199 (Sun Jan 19 23:16:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local", "host-id": 1, "engine-status": {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Powering up"}, "score": 2400, "stopped": false, "maintenance": false, "crc32": "e7f42f2a", "local_conf_timestamp": 5199, "host-ts": 5199}, "2": {"conf_on_shared_storage": true, "live-data": false, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5190 (Sun Jan 19 23:16:15 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5190 (Sun Jan 19 23:16:15 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-1", "host-id": 2, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "c3657e8b", "local_conf_timestamp": 5190, "host-ts": 5190}, "global_maintenance": true}
...
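For context, the wait loop amounts to something like the following. This is only a minimal sketch, not the actual 008_restart_he_vm.py code; the function name, the 5-second poll interval, and running the command locally instead of over lago's SSH are all my assumptions:

import json
import subprocess
import time

def wait_for_engine_health(timeout=600):
    # Poll 'hosted-engine --vm-status --json' until some host reports
    # engine health "good", or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.check_output(
            ['hosted-engine', '--vm-status', '--json'])
        status = json.loads(out)
        for key, host in status.items():
            # 'global_maintenance' is a top-level boolean in the JSON
            # above, not a host entry, so skip it.
            if key == 'global_maintenance':
                continue
            if host['engine-status']['health'] == 'good':
                return
        time.sleep(5)
    raise RuntimeError('engine did not become healthy within %ds' % timeout)

In the failed run, 'health' stayed "bad" for the full 10 minutes, and only 'detail' progressed.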
Then there are a few similar ones; the last one is from '2020-01-20 04:17:22,980', and then it changes to 'Up':

2020-01-20 04:17:27,517::ssh.py::ssh::89::lago.ssh::DEBUG::Command c426f53a on lago-he-basic-role-remote-suite-4-3-host-0 output:
{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5259 (Sun Jan 19 23:17:24 2020)\nhost-id=1\nscore=2928\nvm_conf_refresh_time=5259 (Sun Jan 19 23:17:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}, "score": 2928, "stopped": false, "maintenance": false, "crc32": "08d4bfe1", "local_conf_timestamp": 5259, "host-ts": 5259}, "2": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5250 (Sun Jan 19 23:17:16 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5251 (Sun Jan 19 23:17:16 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-1", "host-id": 2, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "ab277249", "local_conf_timestamp": 5251, "host-ts": 5250}, "global_maintenance": true}

/var/log/messages of the engine VM [3] has:

Jan 19 23:17:57 lago-he-basic-role-remote-suite-4-3-engine kernel: Initializing cgroup subsys cpuset

Meaning, the first line after the reboot has a timestamp that's almost 2 minutes after starting the VM. That's a long time. It is also half a minute after 'detail' changed to 'Up' above (no idea where this 'detail' value is taken from, though).

Then, the first line that is not from the kernel:

Jan 19 23:17:58 lago-he-basic-role-remote-suite-4-3-engine systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)

Then it starts lots of services, and then:

Jan 19 23:20:31 lago-he-basic-role-remote-suite-4-3-engine systemd: Started oVirt Engine websockets proxy.
Jan 19 23:20:33 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 45.33.103.94
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock wrong by 3.919225 seconds, adjustment started
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine systemd: Time has been changed
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock was stepped by 3.919225 seconds
Jan 19 23:20:38 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:20:43 lago-he-basic-role-remote-suite-4-3-engine cloud-init: Cloud-init v. 18.5 running 'modules:config' at Mon, 20 Jan 2020 04:20:39 +0000. Up 182.82 seconds.

And later:

Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Started Update UTMP about System Runlevel Changes.
Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Startup finished in 22.344s (kernel) + 32.237s (initrd) + 3min 815ms (userspace) = 3min 55.397s.
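To make the timing argument concrete, here is a quick sketch that puts the milestones quoted above (and the engine.log line quoted below) on one UTC timeline. The timestamps are hand-copied from the logs; shifting the engine VM's logs by +5 hours is my assumption, based on the '-05' offset in engine.log:

from datetime import datetime

# lago.log timestamps are UTC; /var/log/messages and engine.log on the
# engine VM appear to be UTC-5, so those are converted to UTC here.
milestones = [
    ('lago: Starting VM...',          datetime(2020, 1, 20, 4, 16, 6)),
    ('lago: VM is UP',                datetime(2020, 1, 20, 4, 16, 25)),
    ("status: detail became 'Up'",    datetime(2020, 1, 20, 4, 17, 27)),
    ('kernel: first line after boot', datetime(2020, 1, 20, 4, 17, 57)),
    ('systemd: Startup finished',     datetime(2020, 1, 20, 4, 21, 32)),
    ('engine.log: first line',        datetime(2020, 1, 20, 4, 26, 18)),
]

prev = None
for label, ts in milestones:
    gap = '' if prev is None else ' (+%ds)' % (ts - prev).total_seconds()
    print('%s  %-30s%s' % (ts.time(), label, gap))
    prev = ts

The sum is more than 10 minutes from 'Starting VM' to engine.log's first line, which on its own is enough to hit the 10-minute limit.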
Jan 19 23:21:43 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 206.55.191.142
Jan 19 23:23:52 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:26:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Created slice User Slice of root.

engine.log's [4] first line after the reboot is:

2020-01-19 23:26:18,251-05 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 103) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

which is a mere 40 seconds before the timeout, and server.log's [5] first line is:

2020-01-19 23:23:13,392-05 INFO [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0027: Starting deployment of "apidoc.war" (runtime-name: "apidoc.war")

which is 3 minutes before that.

I do not know much about what the engine does during its startup, how the logs should look, or how long it should take from startup to a working health page. If this all looks normal, then perhaps we should simply give it more than 10 minutes. Otherwise, there is either heavy load on the infra or perhaps some NTP problem. Does anyone have a clue?

Anyway, I have now pushed a patch [6] to allow up to 20 minutes, at least to reduce the noise.

Thanks and best regards,

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[5] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[6] https://gerrit.ovirt.org/106411

--
Didi