On Wed, Feb 19, 2020 at 4:51 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
Project:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
Build:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-role-remote-sui...
Build Number: 283
Build Status: Failure
Triggered By: Started by timer
-------------------------------------
Changes Since Last Success:
-------------------------------------
Changes for Build #283
[Anton Marchukov] Added "sar" system resources collection on VMs.
[Yedidyah Bar David] Move ovirt-engine-extension-aaa-ldap master to stdci v2
[Gal Ben Haim] Fix the return value of update_upstream_sources
-----------------
Failed Tests:
-----------------
1 test failed.
FAILED: 008_restart_he_vm.restart_he_vm
Error Message:
1 != 0
-------------------- >> begin captured logging << --------------------
lago.ssh: DEBUG: start task:02e64977-e3bd-4e7c-9084-61ddeaebb791:Get ssh client for
lago-he-basic-role-remote-suite-4-3-host-1:
lago.ssh: DEBUG: end task:02e64977-e3bd-4e7c-9084-61ddeaebb791:Get ssh client for
lago-he-basic-role-remote-suite-4-3-host-1:
lago.ssh: DEBUG: Running 56796ff6 on lago-he-basic-role-remote-suite-4-3-host-1:
hosted-engine --vm-status --json
lago.ssh: DEBUG: Command 56796ff6 on lago-he-basic-role-remote-suite-4-3-host-1 returned
with 0
lago.ssh: DEBUG: Command 56796ff6 on lago-he-basic-role-remote-suite-4-3-host-1 output:
{
    "1": {
        "conf_on_shared_storage": true,
        "live-data": true,
        "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4583 (Tue Feb 18 21:48:18 2020)\nhost-id=1\nscore=3400\nvm_conf_refresh_time=4584 (Tue Feb 18 21:48:19 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=EngineUp\nstopped=False\n",
        "hostname": "lago-he-basic-role-remote-suite-4-3-host-0.lago.local",
        "host-id": 1,
        "engine-status": {"health": "good", "vm": "up", "detail": "Up"},
        "score": 3400,
        "stopped": false,
        "maintenance": false,
        "crc32": "eda7a0ea",
        "local_conf_timestamp": 4584,
        "host-ts": 4583
    },
    "2": {
        "conf_on_shared_storage": true,
        "live-data": true,
        "extra": "metadata_parse_version=1\nmetadata_feature_version=1\ntimestamp=4604 (Tue Feb 18 21:48:40 2020)\nhost-id=2\nscore=3400\nvm_conf_refresh_time=4605 (Tue Feb 18 21:48:40 2020)\nconf_on_shared_storage=True\nmaintenance=False\nstate=GlobalMaintenance\nstopped=False\n",
        "hostname": "lago-he-basic-role-remote-suite-4-3-host-1",
        "host-id": 2,
        "engine-status": {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"},
        "score": 3400,
        "stopped": false,
        "maintenance": false,
        "crc32": "23d07ed7",
        "local_conf_timestamp": 4605,
        "host-ts": 4604
    },
    "global_maintenance": true
}
root: INFO: Engine VM is on host lago-he-basic-role-remote-suite-4-3-host-0, restarting
the VM
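For reference, that "Engine VM is on host …" step amounts to parsing the vm-status JSON above for the host whose engine-status reports the VM up. A minimal illustrative sketch (hypothetical helper, not the actual OST test code):

```python
import json

def engine_host(status_json):
    """Return the hostname whose engine-status reports the VM as up, or None."""
    status = json.loads(status_json)
    for key, host in status.items():
        if key == "global_maintenance":
            continue  # top-level flag, not a per-host entry
        if host.get("engine-status", {}).get("vm") == "up":
            return host["hostname"]
    return None

# Trimmed-down sample mirroring the structure logged above
sample = json.dumps({
    "1": {"hostname": "host-0", "engine-status": {"vm": "up", "health": "good"}},
    "2": {"hostname": "host-1", "engine-status": {"vm": "down", "health": "bad"}},
    "global_maintenance": True,
})
print(engine_host(sample))  # host-0
```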
root: INFO: Shutting down HE VM on host: lago-he-basic-role-remote-suite-4-3-host-0
lago.ssh: DEBUG: start task:90671f4b-c54e-4efe-b0c1-cfa47015a9db:Get ssh client for
lago-he-basic-role-remote-suite-4-3-host-0:
lago.ssh: DEBUG: end task:90671f4b-c54e-4efe-b0c1-cfa47015a9db:Get ssh client for
lago-he-basic-role-remote-suite-4-3-host-0:
lago.ssh: DEBUG: Running 57a55eee on lago-he-basic-role-remote-suite-4-3-host-0:
hosted-engine --vm-shutdown
lago.ssh: DEBUG: Command 57a55eee on lago-he-basic-role-remote-suite-4-3-host-0 returned
with 0
root: INFO: Command succeeded
root: INFO: Waiting for VM to be down...
lago.ssh: DEBUG: start task:4b1d8dce-3c5b-4d53-b81f-1de8bb71eed7:Get ssh client for
lago-he-basic-role-remote-suite-4-3-host-0:
lago.ssh: DEBUG: end task:4b1d8dce-3c5b-4d53-b81f-1de8bb71eed7:Get ssh client for
lago-he-basic-role-remote-suite-4-3-host-0:
lago.ssh: DEBUG: Running 5901a70c on lago-he-basic-role-remote-suite-4-3-host-0:
hosted-engine --vm-status --json
lago.ssh: DEBUG: Command 5901a70c on lago-he-basic-role-remote-suite-4-3-host-0 returned
with 1
lago.ssh: DEBUG: Command 5901a70c on lago-he-basic-role-remote-suite-4-3-host-0 output:
The hosted engine configuration has not been retrieved from shared storage. Please
ensure that ovirt-ha-agent is running and the storage server is reachable.
agent.log has:
StatusStorageThread::ERROR::2020-02-18 21:48:30,263::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
StatusStorageThread::ERROR::2020-02-18 21:48:30,300::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker
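Incidentally, the 104 in that SanlockException is just the standard Linux errno for "Connection reset by peer" (ECONNRESET), i.e. the broker's socket to the sanlock daemon was reset while inquiring the whiteboard lease:

```python
import errno
import os

# On Linux, ECONNRESET is errno 104 -- the code in SanlockException(104, ...)
print(errno.ECONNRESET)               # 104
print(os.strerror(errno.ECONNRESET))  # Connection reset by peer
```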
This is another "Connection reset by peer", which looks similar to the one
reported a few days ago (with subject "[oVirt Jenkins]
ovirt-system-tests_he-basic-scsi-suite-4.3 - Build # 350 - Failure!").
Are we OK with this? Stable 4.3 jobs are failing with no clear,
acceptable reason and no further handling.
This looks like a communication issue to me. Is anyone looking into it?
Thanks,
--
Didi