On Mon, Mar 2, 2020 at 12:52 AM Nir Soffer <nsoffer(a)redhat.com> wrote:
On Sun, Mar 1, 2020 at 10:10 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> Hi all,
>
> On Sun, Mar 1, 2020 at 6:06 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
> >
> > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/
> > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/
>
> I think the root cause is:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/a...
>
> StatusStorageThread::ERROR::2020-02-29 23:03:04,671::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
>     if (self._status_broker._inquire_whiteboard_lock() or
>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
>     self.host_id, self._lease_file)
> SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
Can you point us to the source using the sanlock API?
I think it's right where the above error message says it is:
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_...
The message looks like a client error accessing the sanlock server socket
(maybe someone restarted sanlock at that point?)
Maybe, I failed to find evidence :-)
but it may also be some error code reused for a sanlock internal error
accessing the storage.
Usually you can find more info about the error in /var/log/sanlock.log
Couldn't find anything:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/a...
> This caused the broker to restart itself,
Restarting because sanlock failed does sound like useful error handling
for broker clients.
> and while it was doing that,
> OST did 'hosted-engine --vm-status --json', which failed, thus failing
> the build.
If the broker may restart itself on errors, clients need to use a retry
mechanism to deal with the restarts, so the test should probably have
a retry mechanism before it fails.
I am not sure whether it's the test that should have that retry mechanism, or
the command 'hosted-engine --vm-status'. Opinions? If the latter, we probably
need to add user controls for this (time between retries, max number).
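To make the discussion concrete, here is a minimal sketch of what such a
retry could look like on the caller side, with the two controls mentioned
above. It assumes Python 3 and that a non-zero exit code is the only failure
signal; run_with_retries, max_retries and delay are hypothetical names, not
part of hosted-engine or OST.

```python
import subprocess
import time


def run_with_retries(cmd, max_retries=5, delay=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0 or the retries are used up.

    Hypothetical helper, not part of hosted-engine or OST.
    max_retries is the number of extra attempts after the first one;
    delay is the pause in seconds between attempts.
    Returns the result object of the last attempt.
    """
    attempt = 0
    while True:
        # Capture output so the caller can log or parse it afterwards.
        result = run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     universal_newlines=True)
        if result.returncode == 0 or attempt >= max_retries:
            return result
        attempt += 1
        sleep(delay)
```

A test could then call
run_with_retries(['hosted-engine', '--vm-status', '--json'], max_retries=10, delay=3)
and tolerate roughly 30 seconds of broker restart before failing. The same
loop could live inside the command itself if we decide the retry belongs
there.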
> This seems to me like another case of a communication problem in CI.
> Not sure what else could have caused it to fail to inquire the status
> of the lock. This (communication) issue was mentioned several times in
> the past already. Are we doing anything re this?
I still haven't seen any concrete reply to this point, but perhaps the reply
should be: if our CI is not completely perfect and sometimes has communication
issues, that's simply normal life, and our users' networks are like that too.
We should simply expect that, and do what's needed (above)...
Thanks,
>
> Thanks and best regards,
>
> > Build Number: 366
> > Build Status: Failure
> > Triggered By: Started by timer
> >
> > -------------------------------------
> > Changes Since Last Success:
> > -------------------------------------
> > Changes for Build #366
> > [Marcin Sobczyk] el8: Don't try to collect whole '/etc/httpd' dir
> >
> >
> >
> >
> > -----------------
> > Failed Tests:
> > -----------------
> > 1 tests failed.
> > FAILED: 008_restart_he_vm.clear_global_maintenance
> >
> > Error Message:
> > 1 != 0
> > -------------------- >> begin captured logging << --------------------
> > root: INFO: Waiting For System Stability...
> > lago.ssh: DEBUG: start task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:
> > lago.ssh: DEBUG: end task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:
> > lago.ssh: DEBUG: Running 9a90ca60 on lago-he-basic-suite-4-3-host-0: hosted-engine --set-maintenance --mode=none
> > lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 returned with 1
> > lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 errors:
> > Cannot connect to the HA daemon, please check the logs.
> >
> > ovirtlago.testlib: ERROR: * Unhandled exception in <function <lambda> at 0x7f52673872a8>
> > Traceback (most recent call last):
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, in assert_equals_within
> >     res = func()
> >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in <lambda>
> >     lambda: _set_and_test_maintenance_mode(host, False)
> >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 108, in _set_and_test_maintenance_mode
> >     nt.assert_equals(ret.code, 0)
> >   File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
> >     assertion_func(first, second, msg=msg)
> >   File "/usr/lib64/python2.7/unittest/case.py", line 546, in _baseAssertEqual
> >     raise self.failureException(msg)
> > AssertionError: 1 != 0
> > --------------------- >> end captured logging << ---------------------
> >
> > Stack Trace:
> >   File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
> >     testMethod()
> >   File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
> >     self.test(*self.arg)
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
> >     test()
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
> >     return func(get_test_prefix(), *args, **kwargs)
> >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in clear_global_maintenance
> >     lambda: _set_and_test_maintenance_mode(host, False)
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 282, in assert_true_within_short
> >     assert_equals_within_short(func, True, allowed_exceptions)
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 266, in assert_equals_within_short
> >     func, value, SHORT_TIMEOUT, allowed_exceptions=allowed_exceptions
> >   File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, in assert_equals_within
> >     res = func()
> >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in <lambda>
> >     lambda: _set_and_test_maintenance_mode(host, False)
> >   File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 108, in _set_and_test_maintenance_mode
> >     nt.assert_equals(ret.code, 0)
> >   File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
> >     assertion_func(first, second, msg=msg)
> >   File "/usr/lib64/python2.7/unittest/case.py", line 546, in _baseAssertEqual
> >     raise self.failureException(msg)
>
>
>
> --
> Didi
--
Didi