Sanlock lockspace inquire failure (was: [oVirt Jenkins] ovirt-system-tests_he-basic-suite-4.3 - Build # 366 - Failure!)

Hi all, On Sun, Mar 1, 2020 at 6:06 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/
I think the root cause is:

https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/arti...

StatusStorageThread::ERROR::2020-02-29 23:03:04,671::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')

This caused the broker to restart itself, and while it was doing that, OST did 'hosted-engine --vm-status --json', which failed, thus failing the build.

This seems to me like another case of a communication problem in CI. Not sure what else could have caused it to fail to inquire the status of the lock. This (communication) issue was mentioned several times in the past already. Are we doing anything re this?

Thanks and best regards,
Build Number: 366
Build Status: Failure
Triggered By: Started by timer

-------------------------------------
Changes Since Last Success:
-------------------------------------
Changes for Build #366
[Marcin Sobczyk] el8: Don't try to collect whole '/etc/httpd' dir

-----------------
Failed Tests:
-----------------
1 tests failed.
FAILED: 008_restart_he_vm.clear_global_maintenance

Error Message:
1 != 0
-------------------- >> begin captured logging << --------------------
root: INFO: Waiting For System Stability...
lago.ssh: DEBUG: start task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:
lago.ssh: DEBUG: end task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:
lago.ssh: DEBUG: Running 9a90ca60 on lago-he-basic-suite-4-3-host-0: hosted-engine --set-maintenance --mode=none
lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 returned with 1
lago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 errors:
 Cannot connect to the HA daemon, please check the logs.

ovirtlago.testlib: ERROR: * Unhandled exception in <function <lambda> at 0x7f52673872a8>
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, in assert_equals_within
    res = func()
  File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in <lambda>
    lambda: _set_and_test_maintenance_mode(host, False)
  File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 108, in _set_and_test_maintenance_mode
    nt.assert_equals(ret.code, 0)
  File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib64/python2.7/unittest/case.py", line 546, in _baseAssertEqual
    raise self.failureException(msg)
AssertionError: 1 != 0
--------------------- >> end captured logging << ---------------------

Stack Trace:
  File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/usr/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 142, in wrapped_test
    test()
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 60, in wrapper
    return func(get_test_prefix(), *args, **kwargs)
  File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in clear_global_maintenance
    lambda: _set_and_test_maintenance_mode(host, False)
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 282, in assert_true_within_short
    assert_equals_within_short(func, True, allowed_exceptions)
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 266, in assert_equals_within_short
    func, value, SHORT_TIMEOUT, allowed_exceptions=allowed_exceptions
  File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, in assert_equals_within
    res = func()
  File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in <lambda>
    lambda: _set_and_test_maintenance_mode(host, False)
  File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 108, in _set_and_test_maintenance_mode
    nt.assert_equals(ret.code, 0)
  File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib64/python2.7/unittest/case.py", line 546, in _baseAssertEqual
    raise self.failureException(msg)
'1 != 0\n-------------------- >> begin captured logging << --------------------\nroot: INFO: Waiting For System Stability...\nlago.ssh: DEBUG: start task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:\nlago.ssh: DEBUG: end task:29a79ef5-e211-4672-ac5b-12bf0e5f8ee9:Get ssh client for lago-he-basic-suite-4-3-host-0:\nlago.ssh: DEBUG: Running 9a90ca60 on lago-he-basic-suite-4-3-host-0: hosted-engine --set-maintenance --mode=none\nlago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 returned with 1\nlago.ssh: DEBUG: Command 9a90ca60 on lago-he-basic-suite-4-3-host-0 errors:\n Cannot connect to the HA daemon, please check the logs.\n\novirtlago.testlib: ERROR: * Unhandled exception in <function <lambda> at 0x7f52673872a8>\nTraceback (most recent call last):\n File "/usr/lib/python2.7/site-packages/ovirtlago/testlib.py", line 234, in assert_equals_within\n res = func()\n File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 87, in <lambda>\n lambda: _set_and_test_maintenance_mode(host, False)\n File "/home/jenkins/agent/workspace/ovirt-system-tests_he-basic-suite-4.3/ovirt-system-tests/he-basic-suite-4.3/test-scenarios/008_restart_he_vm.py", line 108, in _set_and_test_maintenance_mode\n nt.assert_equals(ret.code, 0)\n File "/usr/lib64/python2.7/unittest/case.py", line 553, in assertEqual\n assertion_func(first, second, msg=msg)\n File "/usr/lib64/python2.7/unittest/case.py", line 546, in _baseAssertEqual\n raise self.failureException(msg)\nAssertionError: 1 != 0\n--------------------- >> end captured logging << ---------------------'
-- Didi

On Sun, Mar 1, 2020 at 10:10 AM Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
On Sun, Mar 1, 2020 at 6:06 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/
I think the root cause is:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/arti...
StatusStorageThread::ERROR::2020-02-29 23:03:04,671::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
Can you point us to the source using the sanlock API? The message looks like a client error accessing the sanlock server socket (maybe someone restarted sanlock at that point?), but it may also be some error code reused for a sanlock internal error accessing the storage. Usually you can find more info about the error in /var/sanlock.log.
This caused the broker to restart itself,
Restarting because sanlock failed does sound like useful error handling for broker clients.
and while it was doing that, OST did 'hosted-engine --vm-status --json', which failed, thus failing the build.
If the broker may restart itself on errors, clients need to use a retry mechanism to deal with the restarts, so the test should probably have a retry mechanism before it fails.
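For illustration, such a retry in the test could look roughly like the sketch below. This is only a sketch: run_on_host() is a hypothetical helper standing in for however OST runs a command on the host (assumed to return (code, out, err)), and the retry count and delay are made up.

    import json
    import time

    def vm_status_with_retry(host, retries=6, delay=10):
        """Run 'hosted-engine --vm-status --json' on the host, retrying
        while the HA broker may be restarting.

        run_on_host() is a hypothetical helper, not an existing OST API.
        """
        last_error = None
        for _ in range(retries):
            code, out, err = run_on_host(host, "hosted-engine --vm-status --json")
            if code == 0:
                return json.loads(out)
            last_error = err
            time.sleep(delay)  # give the broker a chance to come back up
        raise RuntimeError("hosted-engine --vm-status kept failing: %s" % last_error)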
This seems to me like another case of a communication problem in CI. Not sure what else could have caused it to fail to inquire the status of the lock. This (communication) issue was mentioned several times in the past already. Are we doing anything re this?
Thanks and best regards,

On Mon, Mar 2, 2020 at 12:52 AM Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Mar 1, 2020 at 10:10 AM Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
On Sun, Mar 1, 2020 at 6:06 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/
I think the root cause is:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/arti...
StatusStorageThread::ERROR::2020-02-29 23:03:04,671::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
Can you point us to the source using the sanlock API?
I think it's right where the above error message says it is: https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_eng...
The message looks like a client error accessing the sanlock server socket (maybe someone restarted sanlock at that point?)
Maybe, I failed to find evidence :-)
but it may also be some error code reused for a sanlock internal error accessing the storage.
Usually you can find more info about the error in /var/sanlock.log
Couldn't find anything: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/arti...
This caused the broker to restart itself,
Restarting because sanlock failed does sound like useful error handling for broker clients.
and while it was doing that, OST did 'hosted-engine --vm-status --json', which failed, thus failing the build.
If the broker may restart itself on errors, clients need to use a retry mechanism to deal with the restarts, so the test should probably have a retry mechanism before it fails.
I am not sure whether it's the test that should have that retry mechanism, or the command 'hosted-engine --vm-status'. Opinions? If the latter, we probably need to add user controls for this (time between retries, max number of retries).
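Something along these lines, for instance (purely illustrative; the knob names are invented and nothing like this exists in the tool today):

    import time

    def call_with_retries(func, max_retries=3, retry_delay=5.0,
                          retriable=(EnvironmentError,)):
        """Call func(), retrying on 'retriable' exceptions.

        max_retries and retry_delay are the kind of user-visible controls
        that could be exposed as CLI options or config keys.
        """
        for attempt in range(1, max_retries + 1):
            try:
                return func()
            except retriable:
                if attempt == max_retries:
                    raise
                time.sleep(retry_delay)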
This seems to me like another case of a communication problem in CI. Not sure what else could have caused it to fail to inquire the status of the lock. This (communication) issue was mentioned several times in the past already. Are we doing anything re this?
I still didn't see any concrete reply to this point, but perhaps the reply should be: If our CI is not completely perfect, and sometimes has communication issues, that's simply normal life - also the networks of our users are like that. We should simply expect that, and do what's needed (above)... Thanks,
Thanks and best regards,
-- Didi

On Tue, Mar 3, 2020 at 9:07 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Mon, Mar 2, 2020 at 12:52 AM Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Mar 1, 2020 at 10:10 AM Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
On Sun, Mar 1, 2020 at 6:06 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/
I think the root cause is:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/arti...
StatusStorageThread::ERROR::2020-02-29 23:03:04,671::status_broker::90::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 82, in run
    if (self._status_broker._inquire_whiteboard_lock() or
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 195, in _inquire_whiteboard_lock
    self.host_id, self._lease_file)
SanlockException: (104, 'Sanlock lockspace inquire failure', 'Connection reset by peer')
Can you point us to the source using the sanlock API?
I think it's right where the above error message says it is:
https://github.com/oVirt/ovirt-hosted-engine-ha/blob/master/ovirt_hosted_eng...
The message looks like a client error accessing the sanlock server socket (maybe someone restarted sanlock at that point?)
Maybe, I failed to find evidence :-)
but it may also be some error code reused for a sanlock internal error accessing the storage.
Usually you can find more info about the error in /var/sanlock.log
Couldn't find anything:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-4.3/366/arti...
This is the end of the log:

2020-02-29 22:51:46 5313 [6267]: s8 host 2 1 5295 eb348ac8-979d-48c4-b0ce-eb9bac37efe8.lago-he-ba
2020-02-29 23:03:39 6027 [6273]: s9 lockspace hosted-engine:1:/var/run/vdsm/storage/93778fac-6989-4c6d-9bc0-edb0f43b35b1/19705ecb-0854-4f7f-8df6-146586c97b81/c32dbeaa-c935-43c9-a252-a7db6174f1c3:0

The error happened on: 2020-02-29 23:03:04,671

There is nothing in the sanlock log, so this does not look like a network issue; we would see error logs in the sanlock log in such a case.

sanlock inq_lockspace returns 0, -ENOENT, or -EINPROGRESS, so the error did not come from sanlock. So this looks like the client failed when sending the request to sanlock, here:

https://github.com/nirs/sanlock/blob/53f7f0284c084ac2e4542fd1f71d0406075adb5...
https://github.com/nirs/sanlock/blob/53f7f0284c084ac2e4542fd1f71d0406075adb5...

In the vdsm log we see that sanlock was fine 15 seconds before the error:

2020-02-29 23:03:04,787-0500 INFO (jsonrpc/5) [vdsm.api] FINISH repoStats return={u'3f942432-c244-4969-b4bf-2ac02dee2f7f': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00257567', 'lastCheck': '1.7', 'valid': True}, u'6c82d788-cd54-440d-aef0-ba24b0d8d7c4': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00595748', 'lastCheck': '0.5', 'valid': True}, u'93778fac-6989-4c6d-9bc0-edb0f43b35b1': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00307431', 'lastCheck': '0.6', 'valid': True}, u'f970efe3-74a5-47ea-8319-f4c96aee4008': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.00354385', 'lastCheck': '2.3', 'valid': True}, u'8fc30a42-6a05-4a6b-8f43-bb6438868a84': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.00254181', 'lastCheck': '1.7', 'valid': True}} from=::1,48818, task_id=9b2892b5-bbf4-4516-bb7b-353de93bbd8a (api:54)

Then we see the hosted-engine error:

2020-02-29 23:03:09,832-0500 ERROR (jsonrpc/5) [root] failed to retrieve Hosted Engine HA score (api:200)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 182, in _getHaInfo
    stats = instance.get_all_stats(timeout=5)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 97, in get_all_stats
    self._check_liveness_for_stats(stats, broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 83, in _check_liveness_for_stats
    self._check_liveness_metadata(stats[host_id], broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 73, in _check_liveness_metadata
    md["live-data"] = broker.is_host_alive(md["host-id"])
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 132, in is_host_alive
    host_list = self._proxy.is_host_alive()
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1591, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1303, in single_request
    response = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1128, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 453, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 409, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
timeout: timed out

15 seconds later sanlock is still fine (all domains have "acquired": True):

2020-02-29 23:03:19,136-0500 INFO (jsonrpc/1) [vdsm.api] FINISH repoStats return={u'3f942432-c244-4969-b4bf-2ac02dee2f7f': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00173669', 'lastCheck': '6.1', 'valid': True}, u'6c82d788-cd54-440d-aef0-ba24b0d8d7c4': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00188127', 'lastCheck': '4.9', 'valid': True}, u'93778fac-6989-4c6d-9bc0-edb0f43b35b1': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.00217642', 'lastCheck': '2.3', 'valid': True}, u'f970efe3-74a5-47ea-8319-f4c96aee4008': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.00168195', 'lastCheck': '6.6', 'valid': True}, u'8fc30a42-6a05-4a6b-8f43-bb6438868a84': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.00366156', 'lastCheck': '6.1', 'valid': True}} from=::ffff:192.168.200.99,58968, task_id=452aedd9-34e5-486e-9a9d-4cdc16270133 (api:54)

So sanlock was not killed - it looks like an issue with sending the command to sanlock. You can file a sanlock bug, but I don't think there is anything useful to do unless this is reproducible. Maybe David has an idea how to debug this.
This caused the broker to restart itself,
Restarting because sanlock failed does sound like useful error handling for broker clients.
and while it was doing that, OST did 'hosted-engine --vm-status --json', which failed, thus failing the build.
If the broker may restart itself on errors, clients need to use a retry mechanism to deal with the restarts, so the test should probably have a retry mechanism before it fails.
I am not sure whether it's the test that should have that retry mechanism, or the command 'hosted-engine --vm-status'. Opinions? If the latter, we probably need to add user controls for this (time between retries, max number of retries).
Most likely the broker should try harder before it fails, so the rest of the stack will not have to deal with the failure. But this failure looks like something that should never happen: sanlock is not a service that people are expected to play with, and a connection to a Unix socket is not expected to break.
This seems to me like another case of a communication problem in CI. Not sure what else could have caused it to fail to inquire the status of the lock. This (communication) issue was mentioned several times in the past already. Are we doing anything re this?
I still didn't see any concrete reply to this point, but perhaps the reply should be: If our CI is not completely perfect, and sometimes has communication issues, that's simply normal life - also the networks of our users are like that. We should simply expect that, and do what's needed (above)...
This does not look related to communication issues in the CI. If we had such issues you would see tons of WARN messages in vdsm about blocked checkers, and there are no such warnings in the vdsm log. The sanlock log would also complain about lockspace renewal errors.
Thanks,
Thanks and best regards,