
On Mon, Mar 8, 2021 at 2:26 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Sun, Mar 7, 2021 at 8:24 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Sun, Mar 7, 2021 at 4:06 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Sun, Mar 7, 2021 at 5:12 AM <jenkins@jenkins.phx.ovirt.org> wrote:
Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/ Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1943/
Failed during -basic-suite-master.test-scenarios.test_004_basic_sanity.test_ha_recovery due to timeout. Can't tell exactly why. Now rebuilt, to see if it happens again. Basic suite did pass this point with (mostly) the same code, and there shouldn't be anything HE-specific in this test. FYI.
It did fail again:
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1944/
Again, engine.log [1] has no 'VM_DOWN_ERROR'. Any clue?
OK. It seems like an OST issue. With [1], I get:
2021-03-08 13:05:10+0100,088 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: ansible_host=ModuleMapper<config_builder=ConfigBuilder<inventory=/tmp/tmp_fdlcj8y, host_pattern=la go-he-basic-suite-master-host-0.lago.local>> (test_004_basic_sanity:1072) 2021-03-08 13:05:10+0100,650 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: pgrep_res={'cmd': 'dig +short vm0', 'stdout': '192.168.200.147', 'stderr': '', 'rc': 0, 'start': ' 2021-03-08 13:01:59.636833', 'end': '2021-03-08 13:01:59.656727', 'delta': '0:00:00.019894', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'dig +short vm0', '_uses_shell': T rue, 'warn': True, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': [' 192.168.200.147'], 'stderr_lines': [], '_ansible_no_log': False} (test_004_basic_sanity:1074) 2021-03-08 13:05:10+0100,650 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: pid=192.168.200.147 (test_004_basic_sanity:1076) 2021-03-08 13:05:11+0100,194 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: kill_res={'cmd': 'dig +short vm0', 'stdout': '192.168.200.147', 'stderr': '', 'rc': 0, 'start': '2 021-03-08 13:01:59.636833', 'end': '2021-03-08 13:01:59.656727', 'delta': '0:00:00.019894', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'dig +short vm0', '_uses_shell': Tr ue, 'warn': True, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['1 92.168.200.147'], 'stderr_lines': [], '_ansible_no_log': False} (test_004_basic_sanity:1078)
I think that this fails due to [1] and due to the fact that in HE suites, we deploy host-0 as part of HE deploy (not OST), and with the FQDN - which causes the engine to return this host FQDN when asking 'which host runs some VM'. Pushed [2], let's see. [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113391 [2] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113814
[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113804
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1944/...
Some more details:
The function is:
@order_by(_TEST_LIST) def test_ha_recovery(engine_api, get_ansible_host_for_vm): engine = engine_api.system_service() with engine_utils.wait_for_event(engine, [119, 9602, 506]): # VM_DOWN_ERROR event(119) # HA_VM_FAILED event event(9602) # VDS_INITIATED_RUN_VM event(506) ansible_host = get_ansible_host_for_vm(VM2_NAME) pid = ansible_host.shell('pgrep -f qemu.*guest=vm2')['stdout'].strip() ansible_host.shell('kill -KILL {}'.format(pid))
vm_service = test_utils.get_vm_service(engine, VM2_NAME) assertions.assert_true_within_long( lambda: vm_service.get().status == types.VmStatus.UP ) with engine_utils.wait_for_event(engine, 33): # USER_STOP_VM event vm_service.stop()
It failed inside the 'with' statement. Can't find any of the expected events in engine.log. In last successful I see all three of them.
We do not actively log anything about running the pgrep command, but it's logged by ansible on the host, to /var/log/messages. I see it in last successful, but not in above failed run, in any of the two hosts. So most likely it didn't even get to run pgrep.
IMO we probably want some more information logged in such a case, to allow further investigation.
Best regards,
Build Number: 1943 Build Status: Still Failing Triggered By: Started by timer
------------------------------------- Changes Since Last Success: ------------------------------------- Changes for Build #1939 [hbraha] ost: Check if user custom repos are used on VM
Changes for Build #1940 [hbraha] network: add check_repos fixture
Changes for Build #1941 [Andrej Cernek] network, pylint: Simplify pylintrc file
Changes for Build #1942 [Andrej Cernek] network, pylint: Simplify pylintrc file
Changes for Build #1943 [Andrej Cernek] network, pylint: Simplify pylintrc file
----------------- Failed Tests: ----------------- 1 tests failed. FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_ha_recovery
Error Message: AssertionError: False != True after 600 seconds
Stack Trace: engine_api = <ovirtsdk4.Connection object at 0x7f408114c1d0> get_ansible_host_for_vm = <function get_ansible_host_for_vm.<locals>.ansible_host_for at 0x7f4081171950>
@order_by(_TEST_LIST) def test_ha_recovery(engine_api, get_ansible_host_for_vm): engine = engine_api.system_service() with engine_utils.wait_for_event(engine, [119, 9602, 506]): # VM_DOWN_ERROR event(119) # HA_VM_FAILED event event(9602) # VDS_INITIATED_RUN_VM event(506) ansible_host = get_ansible_host_for_vm(VM2_NAME) pid = ansible_host.shell('pgrep -f qemu.*guest=vm2')['stdout'].strip()
ansible_host.shell('kill -KILL {}'.format(pid))
../he-basic-suite-master/test-scenarios/test_004_basic_sanity.py:1070: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/lib64/python3.6/contextlib.py:88: in __exit__ next(self.gen) ../ost_utils/ost_utils/engine_utils.py:42: in wait_for_event lambda: ../ost_utils/ost_utils/assertions.py:96: in assert_true_within_long assert_equals_within_long(func, True, allowed_exceptions) ../ost_utils/ost_utils/assertions.py:81: in assert_equals_within_long func, value, LONG_TIMEOUT, allowed_exceptions=allowed_exceptions _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
func = <function wait_for_event.<locals>.<lambda> at 0x7f40810d5bf8> value = True, timeout = 600, allowed_exceptions = [], initial_wait = 0 error_message = 'False != True after 600 seconds'
def assert_equals_within( func, value, timeout, allowed_exceptions=None, initial_wait=10, error_message=None ): allowed_exceptions = allowed_exceptions or [] res = '<no-result-obtained>' with _EggTimer(timeout) as timer: while not timer.elapsed(): try: res = func() if res == value: return except Exception as exc: if _instance_of_any(exc, allowed_exceptions): time.sleep(3) continue
LOGGER.exception("Unhandled exception in %s", func) raise
if initial_wait == 0: time.sleep(3) else: time.sleep(initial_wait) initial_wait = 0 try: if error_message is None: error_message = '%s != %s after %s seconds' % (res, value, timeout)
raise AssertionError(error_message)
E AssertionError: False != True after 600 seconds
../ost_utils/ost_utils/assertions.py:59: AssertionError
-- Didi
-- Didi
-- Didi
-- Didi