On Mon, Mar 8, 2021 at 2:26 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
On Sun, Mar 7, 2021 at 8:24 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>
> On Sun, Mar 7, 2021 at 4:06 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
> >
> > On Sun, Mar 7, 2021 at 5:12 AM <jenkins(a)jenkins.phx.ovirt.org> wrote:
> > >
> > > Project: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> > > Build: https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1943/
> >
> > Failed during
> > he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_ha_recovery
> > due to a timeout. I can't tell exactly why. I have now rebuilt it, to see if
> > it happens again. The basic suite did pass this point with (mostly) the same
> > code, and there shouldn't be anything HE-specific in this test. FYI.
>
> It did fail again:
>
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1944/
>
> Again, engine.log [1] has no 'VM_DOWN_ERROR'. Any clue?
OK. It seems like an OST issue. With [1], I get:
2021-03-08 13:05:10+0100,088 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: ansible_host=ModuleMapper<config_builder=ConfigBuilder<inventory=/tmp/tmp_fdlcj8y, host_pattern=lago-he-basic-suite-master-host-0.lago.local>> (test_004_basic_sanity:1072)
2021-03-08 13:05:10+0100,650 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: pgrep_res={'cmd': 'dig +short vm0', 'stdout': '192.168.200.147', 'stderr': '', 'rc': 0, 'start': '2021-03-08 13:01:59.636833', 'end': '2021-03-08 13:01:59.656727', 'delta': '0:00:00.019894', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'dig +short vm0', '_uses_shell': True, 'warn': True, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['192.168.200.147'], 'stderr_lines': [], '_ansible_no_log': False} (test_004_basic_sanity:1074)
2021-03-08 13:05:10+0100,650 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: pid=192.168.200.147 (test_004_basic_sanity:1076)
2021-03-08 13:05:11+0100,194 DEBUG (MainThread) [test_004_basic_sanity] test_ha_recovery: kill_res={'cmd': 'dig +short vm0', 'stdout': '192.168.200.147', 'stderr': '', 'rc': 0, 'start': '2021-03-08 13:01:59.636833', 'end': '2021-03-08 13:01:59.656727', 'delta': '0:00:00.019894', 'changed': True, 'invocation': {'module_args': {'_raw_params': 'dig +short vm0', '_uses_shell': True, 'warn': True, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['192.168.200.147'], 'stderr_lines': [], '_ansible_no_log': False} (test_004_basic_sanity:1078)
I think this fails due to [1], and due to the fact that in the HE suites we
deploy host-0 as part of the HE deployment (not via OST), and with its FQDN -
which causes the engine to return that host's FQDN when asked which host runs
a given VM. Pushed [2], let's see.
[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113804
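For illustration only (this is not necessarily what the pushed patch does): the
FQDN mismatch described above could be handled by normalizing the host name the
engine reports before using it as the ansible host pattern. The helper below is
a hypothetical sketch; its name and lookup details are assumptions on top of
the ovirtsdk4 services used elsewhere in OST.

    # Hypothetical sketch: ask the engine which host runs a VM (assumes the
    # VM is up), then strip the domain part, so that
    # 'lago-he-basic-suite-master-host-0.lago.local' (the FQDN used when HE
    # deploy added the host) still matches the short names used in the
    # ansible inventory.
    def ansible_host_pattern_for_vm(engine, vm_name):
        vm = engine.vms_service().list(search='name={}'.format(vm_name))[0]
        host = engine.hosts_service().host_service(vm.host.id).get()
        return host.name.split('.', 1)[0]

With something like that, get_ansible_host_for_vm(VM2_NAME) would resolve to
the same host pattern in both the regular and the HE suites.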
>
> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/19...
>
> >
> > Some more details:
> >
> > The function is:
> >
> > @order_by(_TEST_LIST)
> > def test_ha_recovery(engine_api, get_ansible_host_for_vm):
> >     engine = engine_api.system_service()
> >     with engine_utils.wait_for_event(engine, [119, 9602, 506]):
> >         # VM_DOWN_ERROR event(119)
> >         # HA_VM_FAILED event event(9602)
> >         # VDS_INITIATED_RUN_VM event(506)
> >         ansible_host = get_ansible_host_for_vm(VM2_NAME)
> >         pid = ansible_host.shell('pgrep -f qemu.*guest=vm2')['stdout'].strip()
> >         ansible_host.shell('kill -KILL {}'.format(pid))
> >
> >     vm_service = test_utils.get_vm_service(engine, VM2_NAME)
> >     assertions.assert_true_within_long(
> >         lambda:
> >         vm_service.get().status == types.VmStatus.UP
> >     )
> >     with engine_utils.wait_for_event(engine, 33):  # USER_STOP_VM event
> >         vm_service.stop()
> >
> > It failed inside the 'with' statement. I can't find any of the expected
> > events in engine.log. In the last successful run I see all three of them.
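(For readers unfamiliar with the helper: wait_for_event is a context manager in
ost_utils/engine_utils.py that, on exit, polls the engine's event list until
all expected event codes show up - which is where the 600-second timeout above
comes from. Roughly, and only as a sketch; the real implementation may differ
in details, and the import path and event-ordering assumption are mine:)

    from contextlib import contextmanager

    from ost_utils import assertions  # as referenced in the traceback below

    # Rough sketch of a wait_for_event-style context manager over an
    # ovirtsdk4 system service, whose events carry 'id' and 'code' fields.
    @contextmanager
    def wait_for_event(engine, event_ids):
        if isinstance(event_ids, int):
            event_ids = [event_ids]
        events = engine.events_service()
        # Remember the newest event id before the body runs (assumes the
        # default listing returns the most recent event first).
        last_id = int(events.list(max=1)[0].id)
        yield
        # On exit, wait (up to the long timeout) for every expected code to
        # appear after last_id - this is the point where the test timed out.
        assertions.assert_true_within_long(
            lambda: all(
                any(e.code == code for e in events.list(from_=last_id))
                for code in event_ids
            )
        )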
> >
> > We do not actively log anything about running the pgrep command, but it is
> > logged by ansible on the host, to /var/log/messages. I see it in the last
> > successful run, but not in the failed run above, on either of the two hosts.
> > So most likely it didn't even get to run pgrep.
> >
> > IMO we probably want some more information logged in such a case, to
> > allow further investigation.
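A minimal sketch of the kind of extra logging meant here, mirroring the
pgrep_res/pid/kill_res DEBUG lines quoted earlier in this thread; it reuses
get_ansible_host_for_vm and VM2_NAME from the test above, and the logger name
and exact messages are assumptions:

    import logging

    LOGGER = logging.getLogger(__name__)

    # Inside test_ha_recovery: log the raw ansible results around the kill,
    # so that a run which never reaches pgrep still leaves evidence behind.
    ansible_host = get_ansible_host_for_vm(VM2_NAME)
    LOGGER.debug('test_ha_recovery: ansible_host=%s', ansible_host)
    pgrep_res = ansible_host.shell('pgrep -f qemu.*guest=vm2')
    LOGGER.debug('test_ha_recovery: pgrep_res=%s', pgrep_res)
    pid = pgrep_res['stdout'].strip()
    LOGGER.debug('test_ha_recovery: pid=%s', pid)
    kill_res = ansible_host.shell('kill -KILL {}'.format(pid))
    LOGGER.debug('test_ha_recovery: kill_res=%s', kill_res)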
> >
> > Best regards,
> >
> > > Build Number: 1943
> > > Build Status: Still Failing
> > > Triggered By: Started by timer
> > >
> > > -------------------------------------
> > > Changes Since Last Success:
> > > -------------------------------------
> > > Changes for Build #1939
> > > [hbraha] ost: Check if user custom repos are used on VM
> > >
> > >
> > > Changes for Build #1940
> > > [hbraha] network: add check_repos fixture
> > >
> > >
> > > Changes for Build #1941
> > > [Andrej Cernek] network, pylint: Simplify pylintrc file
> > >
> > >
> > > Changes for Build #1942
> > > [Andrej Cernek] network, pylint: Simplify pylintrc file
> > >
> > >
> > > Changes for Build #1943
> > > [Andrej Cernek] network, pylint: Simplify pylintrc file
> > >
> > >
> > >
> > >
> > > -----------------
> > > Failed Tests:
> > > -----------------
> > > 1 tests failed.
> > > FAILED: he-basic-suite-master.test-scenarios.test_004_basic_sanity.test_ha_recovery
> > >
> > > Error Message:
> > > AssertionError: False != True after 600 seconds
> > >
> > > Stack Trace:
> > > engine_api = <ovirtsdk4.Connection object at 0x7f408114c1d0>
> > > get_ansible_host_for_vm = <function get_ansible_host_for_vm.<locals>.ansible_host_for at 0x7f4081171950>
> > >
> > > @order_by(_TEST_LIST)
> > > def test_ha_recovery(engine_api, get_ansible_host_for_vm):
> > >     engine = engine_api.system_service()
> > >     with engine_utils.wait_for_event(engine, [119, 9602, 506]):
> > >         # VM_DOWN_ERROR event(119)
> > >         # HA_VM_FAILED event event(9602)
> > >         # VDS_INITIATED_RUN_VM event(506)
> > >         ansible_host = get_ansible_host_for_vm(VM2_NAME)
> > >         pid = ansible_host.shell('pgrep -f qemu.*guest=vm2')['stdout'].strip()
> > > >       ansible_host.shell('kill -KILL {}'.format(pid))
> > >
> > > ../he-basic-suite-master/test-scenarios/test_004_basic_sanity.py:1070:
> > > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > > /usr/lib64/python3.6/contextlib.py:88: in __exit__
> > >     next(self.gen)
> > > ../ost_utils/ost_utils/engine_utils.py:42: in wait_for_event
> > >     lambda:
> > > ../ost_utils/ost_utils/assertions.py:96: in assert_true_within_long
> > >     assert_equals_within_long(func, True, allowed_exceptions)
> > > ../ost_utils/ost_utils/assertions.py:81: in assert_equals_within_long
> > >     func, value, LONG_TIMEOUT, allowed_exceptions=allowed_exceptions
> > > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > >
> > > func = <function wait_for_event.<locals>.<lambda> at 0x7f40810d5bf8>
> > > value = True, timeout = 600, allowed_exceptions = [], initial_wait = 0
> > > error_message = 'False != True after 600 seconds'
> > >
> > > def assert_equals_within(
> > >     func, value, timeout, allowed_exceptions=None, initial_wait=10,
> > >     error_message=None
> > > ):
> > >     allowed_exceptions = allowed_exceptions or []
> > >     res = '<no-result-obtained>'
> > >     with _EggTimer(timeout) as timer:
> > >         while not timer.elapsed():
> > >             try:
> > >                 res = func()
> > >                 if res == value:
> > >                     return
> > >             except Exception as exc:
> > >                 if _instance_of_any(exc, allowed_exceptions):
> > >                     time.sleep(3)
> > >                     continue
> > >
> > >                 LOGGER.exception("Unhandled exception in %s", func)
> > >                 raise
> > >
> > >             if initial_wait == 0:
> > >                 time.sleep(3)
> > >             else:
> > >                 time.sleep(initial_wait)
> > >                 initial_wait = 0
> > >     try:
> > >         if error_message is None:
> > >             error_message = '%s != %s after %s seconds' % (res, value, timeout)
> > > >           raise AssertionError(error_message)
> > > E           AssertionError: False != True after 600 seconds
> > >
> > > ../ost_utils/ost_utils/assertions.py:59: AssertionError
> >
> >
> >
> > --
> > Didi
>
>
>
> --
> Didi
--
Didi