
Hi all,

I looked at the failure of [1]. It timed out while waiting for the engine to be up, after restarting the engine VM.

lago.log [2] has:

2020-01-20 04:16:06,026::008_restart_he_vm.py::_start_he_vm::173::root::INFO::Starting VM...
...
2020-01-20 04:16:09,823::008_restart_he_vm.py::_start_he_vm::178::root::INFO::Waiting for VM to be UP...
...
2020-01-20 04:16:25,113::008_restart_he_vm.py::_start_he_vm::186::root::INFO::VM is UP.
2020-01-20 04:16:25,113::008_restart_he_vm.py::_wait_for_engine_health::190::root::INFO::Waiting for engine to start...
...

Then there is a loop that runs 'hosted-engine --vm-status --json' and parses the status, specifically checking the health. The health status does not change until the timeout (after 10 minutes), but after 'vm' there is 'detail', which does change. The first one is 'Powering up':

2020-01-20 04:16:25,113::ssh.py::ssh::89::lago.ssh::DEBUG::Command 9ef791c0 on lago-he-basic-role-remote-suite-4-3-host-0 output:
{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5199 (Sun Jan 19 23:16:24 2020)\nhost-id=1\nscore=2400\nvm_conf_refresh_time=5199 (Sun Jan 19 23:16:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local", "host-id": 1, "engine-status": {"reason": "bad vm status", "health": "bad", "vm": "up", "detail": "Powering up"}, "score": 2400, "stopped": false, "maintenance": false, "crc32": "e7f42f2a", "local_conf_timestamp": 5199, "host-ts": 5199}, "2": {"conf_on_shared_storage": true, "live-data": false, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5190 (Sun Jan 19 23:16:15 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5190 (Sun Jan 19 23:16:15 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-1", "host-id": 2, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "c3657e8b", "local_conf_timestamp": 5190, "host-ts": 5190}, "global_maintenance": true}
...
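For context, the wait loop amounts to something like the following. This is only a minimal sketch, not the actual 008_restart_he_vm.py code; the function name, the 5-second poll interval, and running the command locally instead of over lago's SSH are all my assumptions:

import json
import subprocess
import time

def wait_for_engine_health(timeout=600):
    # Poll 'hosted-engine --vm-status --json' until some host reports
    # engine health "good", or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.check_output(
            ['hosted-engine', '--vm-status', '--json'])
        status = json.loads(out)
        for key, host in status.items():
            # 'global_maintenance' is a top-level boolean in the JSON
            # above, not a host entry, so skip it.
            if key == 'global_maintenance':
                continue
            if host['engine-status']['health'] == 'good':
                return
        time.sleep(5)
    raise RuntimeError('engine did not become healthy within %ds' % timeout)

In the failed run, 'health' stayed "bad" for the full 10 minutes, and only 'detail' progressed.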
Then there are a few similar ones; the last one is from '2020-01-20 04:17:22,980', and then it changes to 'Up':

2020-01-20 04:17:27,517::ssh.py::ssh::89::lago.ssh::DEBUG::Command c426f53a on lago-he-basic-role-remote-suite-4-3-host-0 output:
{"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5259 (Sun Jan 19 23:17:24 2020)\nhost-id=1\nscore=2928\nvm_conf_refresh_time=5259 (Sun Jan 19 23:17:24 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}, "score": 2928, "stopped": false, "maintenance": false, "crc32": "08d4bfe1", "local_conf_timestamp": 5259, "host-ts": 5259}, "2": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=5250 (Sun Jan 19 23:17:16 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=5251 (Sun Jan 19 23:17:16 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-1", "host-id": 2, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "ab277249", "local_conf_timestamp": 5251, "host-ts": 5250}, "global_maintenance": true}

/var/log/messages of the engine VM [3] has:

Jan 19 23:17:57 lago-he-basic-role-remote-suite-4-3-engine kernel: Initializing cgroup subsys cpuset

Meaning, the first line after the reboot has a timestamp that's almost 2 minutes after starting the VM. That's a long time. It is also half a minute after 'detail' changed to 'Up' above (no idea where this 'detail' value is taken from, though).

Then, the first line that is not from the kernel:

Jan 19 23:17:58 lago-he-basic-role-remote-suite-4-3-engine systemd[1]: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)

Then it starts lots of services, and then:

Jan 19 23:20:31 lago-he-basic-role-remote-suite-4-3-engine systemd: Started oVirt Engine websockets proxy.
Jan 19 23:20:33 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 45.33.103.94
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock wrong by 3.919225 seconds, adjustment started
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine systemd: Time has been changed
Jan 19 23:20:37 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: System clock was stepped by 3.919225 seconds
Jan 19 23:20:38 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:20:43 lago-he-basic-role-remote-suite-4-3-engine cloud-init: Cloud-init v. 18.5 running 'modules:config' at Mon, 20 Jan 2020 04:20:39 +0000. Up 182.82 seconds.

And later:

Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Started Update UTMP about System Runlevel Changes.
Jan 19 23:21:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Startup finished in 22.344s (kernel) + 32.237s (initrd) + 3min 815ms (userspace) = 3min 55.397s.
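To make the timing argument concrete, here is a quick sketch that puts the milestones quoted above (and the engine.log line quoted below) on one UTC timeline. The timestamps are hand-copied from the logs; shifting the engine VM's logs by +5 hours is my assumption, based on the '-05' offset in engine.log:

from datetime import datetime

# lago.log timestamps are UTC; /var/log/messages and engine.log on the
# engine VM appear to be UTC-5, so those are converted to UTC here.
milestones = [
    ('lago: Starting VM...',          datetime(2020, 1, 20, 4, 16, 6)),
    ('lago: VM is UP',                datetime(2020, 1, 20, 4, 16, 25)),
    ("status: detail became 'Up'",    datetime(2020, 1, 20, 4, 17, 27)),
    ('kernel: first line after boot', datetime(2020, 1, 20, 4, 17, 57)),
    ('systemd: Startup finished',     datetime(2020, 1, 20, 4, 21, 32)),
    ('engine.log: first line',        datetime(2020, 1, 20, 4, 26, 18)),
]

prev = None
for label, ts in milestones:
    gap = '' if prev is None else ' (+%ds)' % (ts - prev).total_seconds()
    print('%s  %-30s%s' % (ts.time(), label, gap))
    prev = ts

The sum is more than 10 minutes from 'Starting VM' to engine.log's first line, which on its own is enough to hit the 10-minute limit.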
Jan 19 23:21:43 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 206.55.191.142
Jan 19 23:23:52 lago-he-basic-role-remote-suite-4-3-engine chronyd[872]: Selected source 129.250.35.250
Jan 19 23:26:32 lago-he-basic-role-remote-suite-4-3-engine systemd: Created slice User Slice of root.

engine.log's [4] first line after the reboot is:

2020-01-19 23:26:18,251-05 INFO [org.ovirt.engine.core.uutils.config.ShellLikeConfd] (ServerService Thread Pool -- 103) [] Loaded file '/usr/share/ovirt-engine/services/ovirt-engine/ovirt-engine.conf'.

which is a mere 40 seconds before the timeout, and server.log's [5] first line is:

2020-01-19 23:23:13,392-05 INFO [org.jboss.as.server.deployment] (MSC service thread 1-2) WFLYSRV0027: Starting deployment of "apidoc.war" (runtime-name: "apidoc.war")

which is 3 minutes before that.

I do not know much about what the engine does during its startup, how the logs should look, or how long it should take from startup to a working health page. If this all looks normal, then perhaps we should simply give it more than 10 minutes. Otherwise, there is either heavy load on the infra or perhaps some NTP problem. Does anyone have a clue?

Anyway, I have now pushed a patch [6] to allow up to 20 minutes, at least to reduce the noise.

Thanks and best regards,

[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[5] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-suite-...
[6] https://gerrit.ovirt.org/106411

--
Didi