Hi all,
Galit has been working on [1] for some time now. Wanting to help with
that (and to get back to a stable state with the hosted-engine
suites), I looked at it a few days ago, after a manual deploy on
CentOS 8 with a nightly master snapshot worked for me. I rebased her
patches, and CI failed [2]. Out of the 5 suites it tested, 3 passed
(!) and 2 failed :-( . I checked the two that failed, and here is what
I see:
1. One of them failed [3] because it could not parse the JSON output
of 'hosted-engine --vm-status --json'. The output seems to have been
cut in the middle, after 1024 bytes:
2019-12-22 11:13:47,846::ssh.py::ssh::58::lago.ssh::DEBUG::Running
200606d0 on lago-he-basic-role-remote-suite-4-3-host-0: hosted-engine
--vm-status --json
2019-12-22 11:13:48,441::ssh.py::ssh::81::lago.ssh::DEBUG::Command
200606d0 on lago-he-basic-role-remote-suite-4-3-host-0 returned with 0
2019-12-22 11:13:48,441::ssh.py::ssh::89::lago.ssh::DEBUG::Command
200606d0 on lago-he-basic-role-remote-suite-4-3-host-0 output:
{"1": {"conf_on_shared_storage": true, "live-data": true,
"extra":
"metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4534
(Sun Dec 22 06:13:39
2019)\nhost-id=1\nscore=3000\nvm_conf_refresh_time=4535 (Sun Dec 22
06:13:40
2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
"hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local",
"host-id": 1, "engine-status": {"reason": "failed
liveliness check",
"health": "bad", "vm": "up", "detail":
"Powering down"}, "score":
3000, "stopped": false, "maintenance": false, "crc32":
"76bfd2ac",
"local_conf_timestamp": 4535, "host-ts": 4534}, "2":
{"conf_on_shared_storage": true, "live-data": true,
"extra":
"metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4532
(Sun Dec 22 06:13:38
2019)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4532 (Sun Dec 22
06:13:38
2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
"hostname": "lago-he-basic-role-remote-s
2019-12-22
11:13:48,442::testlib.py::assert_equals_within::242::ovirtlago.testlib::ERROR::
* Unhandled exception in <function <lambda> at 0x7fcc58f38668>
Traceback (most recent call last):
If you check previous runs of this command in the same lago log, you
can see that they are complete (and somewhat longer, but not by
much), e.g.:
2019-12-22 11:13:41,804::ssh.py::ssh::58::lago.ssh::DEBUG::Running
1c6c1226 on lago-he-basic-role-remote-suite-4-3-host-0: hosted-engine
--vm-status --json
2019-12-22 11:13:43,105::ssh.py::ssh::81::lago.ssh::DEBUG::Command
1c6c1226 on lago-he-basic-role-remote-suite-4-3-host-0 returned with 0
2019-12-22 11:13:43,106::ssh.py::ssh::89::lago.ssh::DEBUG::Command
1c6c1226 on lago-he-basic-role-remote-suite-4-3-host-0 output:
{"1": {"conf_on_shared_storage": true, "live-data": true,
"extra":
"metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4534
(Sun Dec 22 06:13:39
2019)\nhost-id=1\nscore=3000\nvm_conf_refresh_time=4535 (Sun Dec 22
06:13:40
2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
"hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local",
"host-id": 1, "engine-status": {"reason": "failed
liveliness check",
"health": "bad", "vm": "up", "detail":
"Powering down"}, "score":
3000, "stopped": false, "maintenance": false, "crc32":
"76bfd2ac",
"local_conf_timestamp": 4535, "host-ts": 4534}, "2":
{"conf_on_shared_storage": true, "live-data": true,
"extra":
"metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4532
(Sun Dec 22 06:13:38
2019)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4532 (Sun Dec 22
06:13:38
2019)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
"hostname": "lago-he-basic-role-remote-suite-4-3-host-1",
"host-id":
2, "engine-status": {"reason": "vm not running on this
host",
"health": "bad", "vm": "down", "detail":
"unknown"}, "score": 3400,
"stopped": false, "maintenance": false, "crc32":
"2fc395a1",
"local_conf_timestamp": 4532, "host-ts": 4532},
"global_maintenance":
true}
So I suspect an infra issue - at the network level, ssh, etc.
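
Just to illustrate the symptom (a quick sketch, not the suite's code):
cutting otherwise-valid JSON at an arbitrary byte count makes
json.loads() raise, which matches the parse failure above:

import json

# Build a JSON blob longer than 1024 bytes, vaguely shaped like the
# --vm-status output, then cut it at 1024 bytes as in the log above.
full = json.dumps({"1": {"extra": "x" * 2000}, "global_maintenance": True})
try:
    json.loads(full[:1024])
except ValueError as e:  # json.JSONDecodeError is a ValueError subclass
    print("parse failed: %s" % e)
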
2. The other one seems to have failed [4] due to a timeout:
2019-12-22 12:57:08,595::cmd.py::exit_handler::921::cli::DEBUG::signal
15 was caught
The reason it took too long seems to be:
2019-12-22 11:08:13,233::log_utils.py::__enter__::600::lago.ssh::DEBUG::start
task:57a44b4e-859e-4ff2-b897-7c8b3ade4557:Get ssh client for
lago-he-node-ng-suite-4-3-host-0:
2019-12-22 11:08:13,637::log_utils.py::__exit__::611::lago.ssh::DEBUG::end
task:57a44b4e-859e-4ff2-b897-7c8b3ade4557:Get ssh client for
lago-he-node-ng-suite-4-3-host-0:
2019-12-22 11:08:13,942::ssh.py::ssh::58::lago.ssh::DEBUG::Running
59005d88 on lago-he-node-ng-suite-4-3-host-0: true
2019-12-22 11:08:15,605::ssh.py::ssh::81::lago.ssh::DEBUG::Command
59005d88 on lago-he-node-ng-suite-4-3-host-0 returned with 0
2019-12-22 11:08:15,606::ssh.py::wait_for_ssh::153::lago.ssh::DEBUG::Wait
succeeded for ssh to lago-he-node-ng-suite-4-3-host-0
2019-12-22 11:08:15,608::log_utils.py::__enter__::600::lago.ssh::DEBUG::start
task:38e81fd0-50f0-497c-8fb0-38eedcfb3f55:Get ssh client for
lago-he-node-ng-suite-4-3-host-0:
2019-12-22 11:08:16,129::log_utils.py::__exit__::611::lago.ssh::DEBUG::end
task:38e81fd0-50f0-497c-8fb0-38eedcfb3f55:Get ssh client for
lago-he-node-ng-suite-4-3-host-0:
2019-12-22 12:20:39,054::log_utils.py::__enter__::600::lago.ssh::DEBUG::start
task:4007e002-f8e2-4427-884f-af57f723aca0:Get ssh client for
lago-he-node-ng-suite-4-3-storage:
2019-12-22 12:20:39,639::log_utils.py::__exit__::611::lago.ssh::DEBUG::end
task:4007e002-f8e2-4427-884f-af57f723aca0:Get ssh client for
lago-he-node-ng-suite-4-3-storage:
2019-12-22 12:20:40,408::ssh.py::ssh::58::lago.ssh::DEBUG::Running
77b26f28 on lago-he-node-ng-suite-4-3-storage: true
2019-12-22 12:20:40,475::ssh.py::ssh::81::lago.ssh::DEBUG::Command
77b26f28 on lago-he-node-ng-suite-4-3-storage returned with 0
2019-12-22 12:20:40,476::ssh.py::wait_for_ssh::153::lago.ssh::DEBUG::Wait
succeeded for ssh to lago-he-node-ng-suite-4-3-storage
Meaning, if I got it right, that it sshed to host-0 and succeeded,
then tried to ssh to the storage VM and also succeeded - but there is
a gap of more than an hour in the log before that second connection.
Again, I suspect some infra issue, perhaps the same as in (1.).
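
In case it's useful for checking other runs, here is a small sketch (a
hypothetical helper, not part of OST) that scans a lago log and
reports gaps longer than a few minutes between consecutive log lines:

import sys
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=5)  # arbitrary; adjust as needed

def parse_ts(line):
    # lago log lines start with "YYYY-MM-DD HH:MM:SS,mmm::..."
    try:
        return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None

prev_ts, prev_line = None, None
with open(sys.argv[1]) as log:
    for line in log:
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev_ts is not None and ts - prev_ts > THRESHOLD:
            print("gap of %s between:\n  %s\n  %s" % (
                ts - prev_ts, prev_line.rstrip(), line.rstrip()))
        prev_ts, prev_line = ts, line

(On a log like the excerpt above it would flag the 11:08 -> 12:20
jump.)
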
Any clue?
I am now trying it again, anyway [5].
Thanks and best regards,
[1] https://gerrit.ovirt.org/104797
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/7269/
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/726...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/726...
[5] http://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/7273/
--
Didi