
Hi all,

Galit has been working on [1] for some time now. Willing to help with that (and with getting back to a stable state with the hosted-engine suites), I looked at it a few days ago, after a manual deploy on CentOS 8 with a nightly master snapshot worked for me.

I rebased her patches, and CI failed [2]. Out of the 5 suites it tested, 3 passed (!) and 2 failed :-( . I checked the two that failed, and see:

1. One of them failed [3] due to failing to parse the JSON output of 'hosted-engine --vm-status --json'. The output seems to have been cut in the middle, after 1024 bytes:

2019-12-22 11:13:47,846::ssh.py::ssh::58::lago.ssh::DEBUG::Running 200606d0 on lago-he-basic-role-remote-suite-4-3-host-0: hosted-engine --vm-status --json
2019-12-22 11:13:48,441::ssh.py::ssh::81::lago.ssh::DEBUG::Command 200606d0 on lago-he-basic-role-remote-suite-4-3-host-0 returned with 0
2019-12-22 11:13:48,441::ssh.py::ssh::89::lago.ssh::DEBUG::Command 200606d0 on lago-he-basic-role-remote-suite-4-3-host-0 output: {"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4534 (Sun Dec 22 06:13:39 2019)\nhost-id=1\nscore=3000\nvm_conf_refresh_time=4535 (Sun Dec 22 06:13:40 2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Powering down"}, "score": 3000, "stopped": false, "maintenance": false, "crc32": "76bfd2ac", "local_conf_timestamp": 4535, "host-ts": 4534}, "2": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4532 (Sun Dec 22 06:13:38 2019)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4532 (Sun Dec 22 06:13:38 2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-s
2019-12-22 11:13:48,442::testlib.py::assert_equals_within::242::ovirtlago.testlib::ERROR:: * Unhandled exception in <function <lambda> at 0x7fcc58f38668>
Traceback (most recent call last):

If you check previous runs of this command in the same lago log, you see that they are complete (and somewhat longer, but not by much), e.g.:

2019-12-22 11:13:41,804::ssh.py::ssh::58::lago.ssh::DEBUG::Running 1c6c1226 on lago-he-basic-role-remote-suite-4-3-host-0: hosted-engine --vm-status --json
2019-12-22 11:13:43,105::ssh.py::ssh::81::lago.ssh::DEBUG::Command 1c6c1226 on lago-he-basic-role-remote-suite-4-3-host-0 returned with 0
2019-12-22 11:13:43,106::ssh.py::ssh::89::lago.ssh::DEBUG::Command 1c6c1226 on lago-he-basic-role-remote-suite-4-3-host-0 output: {"1": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4534 (Sun Dec 22 06:13:39 2019)\nhost-id=1\nscore=3000\nvm_conf_refresh_time=4535 (Sun Dec 22 06:13:40 2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local", "host-id": 1, "engine-status": {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Powering down"}, "score": 3000, "stopped": false, "maintenance": false, "crc32": "76bfd2ac", "local_conf_timestamp": 4535, "host-ts": 4534}, "2": {"conf_on_shared_storage": true, "live-data": true, "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4532 (Sun Dec 22 06:13:38 2019)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4532 (Sun Dec 22 06:13:38 2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-1", "host-id": 2, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "2fc395a1", "local_conf_timestamp": 4532, "host-ts": 4532}, "global_maintenance": true}

So I suspect an infra issue - at the network level, ssh, etc.
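(Not a fix for whatever is cutting the output, but if we want the test itself to tolerate an occasional truncated read like this, something along these lines could work. This is only a sketch, not the suite's current code: run_command here is a stand-in for however the suite runs a command over SSH and returns its stdout, and the retry/delay numbers are arbitrary.)

import json
import logging
import time


def get_vm_status_json(run_command, retries=5, delay=5):
    """Run 'hosted-engine --vm-status --json' via run_command() and
    parse the result, retrying if the output is not valid JSON
    (e.g. because it was cut in the middle, as above).
    """
    for attempt in range(1, retries + 1):
        out = run_command('hosted-engine --vm-status --json')
        try:
            return json.loads(out)
        except ValueError:
            # Most likely a partial read, like the 1024-byte output
            # quoted above - log how much we got and try again.
            logging.warning(
                'attempt %d: vm-status output (%d bytes) is not valid '
                'JSON, retrying', attempt, len(out))
            time.sleep(delay)
    raise RuntimeError(
        "'hosted-engine --vm-status --json' never returned valid JSON")

That would only paper over the symptom, of course - the truncation itself still needs explaining.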
"metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4532 (Sun Dec 22 06:13:38 2019)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4532 (Sun Dec 22 06:13:38 2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n", "hostname": "lago-he-basic-role-remote-suite-4-3-host-1", "host-id": 2, "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}, "score": 3400, "stopped": false, "maintenance": false, "crc32": "2fc395a1", "local_conf_timestamp": 4532, "host-ts": 4532}, "global_maintenance": true} So I suspect an infra issue - on the network level, ssh, etc. 2. The other seems to have failed [4] due to timeout: 2019-12-22 12:57:08,595::cmd.py::exit_handler::921::cli::DEBUG::signal 15 was caught The reason it took too long seems to be: 2019-12-22 11:08:13,233::log_utils.py::__enter__::600::lago.ssh::DEBUG::start task:57a44b4e-859e-4ff2-b897-7c8b3ade4557:Get ssh client for lago-he-node-ng-suite-4-3-host-0: 2019-12-22 11:08:13,637::log_utils.py::__exit__::611::lago.ssh::DEBUG::end task:57a44b4e-859e-4ff2-b897-7c8b3ade4557:Get ssh client for lago-he-node-ng-suite-4-3-host-0: 2019-12-22 11:08:13,942::ssh.py::ssh::58::lago.ssh::DEBUG::Running 59005d88 on lago-he-node-ng-suite-4-3-host-0: true 2019-12-22 11:08:15,605::ssh.py::ssh::81::lago.ssh::DEBUG::Command 59005d88 on lago-he-node-ng-suite-4-3-host-0 returned with 0 2019-12-22 11:08:15,606::ssh.py::wait_for_ssh::153::lago.ssh::DEBUG::Wait succeeded for ssh to lago-he-node-ng-suite-4-3-host-0 2019-12-22 11:08:15,608::log_utils.py::__enter__::600::lago.ssh::DEBUG::start task:38e81fd0-50f0-497c-8fb0-38eedcfb3f55:Get ssh client for lago-he-node-ng-suite-4-3-host-0: 2019-12-22 11:08:16,129::log_utils.py::__exit__::611::lago.ssh::DEBUG::end task:38e81fd0-50f0-497c-8fb0-38eedcfb3f55:Get ssh client for lago-he-node-ng-suite-4-3-host-0: 2019-12-22 12:20:39,054::log_utils.py::__enter__::600::lago.ssh::DEBUG::start task:4007e002-f8e2-4427-884f-af57f723aca0:Get ssh client for lago-he-node-ng-suite-4-3-storage: 2019-12-22 12:20:39,639::log_utils.py::__exit__::611::lago.ssh::DEBUG::end task:4007e002-f8e2-4427-884f-af57f723aca0:Get ssh client for lago-he-node-ng-suite-4-3-storage: 2019-12-22 12:20:40,408::ssh.py::ssh::58::lago.ssh::DEBUG::Running 77b26f28 on lago-he-node-ng-suite-4-3-storage: true 2019-12-22 12:20:40,475::ssh.py::ssh::81::lago.ssh::DEBUG::Command 77b26f28 on lago-he-node-ng-suite-4-3-storage returned with 0 2019-12-22 12:20:40,476::ssh.py::wait_for_ssh::153::lago.ssh::DEBUG::Wait succeeded for ssh to lago-he-node-ng-suite-4-3-storage Meaning, if I got it right, that it sshed to host-0, succeeded, then tried to ssh to storage, and also succeeded - but connecting took more than an hour. Again, I suspect some infra issue, perhaps same as (1.). Any clue? I am now trying it again, anyway [5]. Thanks and best regards, [1] https://gerrit.ovirt.org/104797 [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/7269/ [3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/7269/a... [4] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/7269/a... [5] http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/7273/ -- Didi