
On Tue, Mar 16, 2021 at 9:07 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Mar 16, 2021 at 1:11 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Mar 16, 2021 at 12:50 PM Dana Elfassy <delfassy@redhat.com> wrote:
About [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707, I have found that when executing the HTTP request of the playbook, the response is sometimes missing some events. We handle the events in a while loop, where each iteration continues from the lastEventId. Since not all events were listed, the lastEventId (which is returned as the total number of events) is smaller than the actual eventId that was handled. Thus, in the next iteration the last events are handled again. It seems that this comes from Ansible Runner, and I'm investigating that area now.
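The loop described above amounts to roughly the following (a minimal sketch for illustration only; `fetch_events` and the tuple shape are hypothetical, not the actual ansible-runner or engine API):

```python
def poll_events(fetch_events):
    """Handle events in a loop, resuming from the last seen event id.

    fetch_events(after_id) is assumed to return a list of
    (event_id, payload) tuples with event_id > after_id.
    """
    last_event_id = 0
    handled = []
    while True:
        batch = fetch_events(last_event_id)
        if not batch:
            break
        for event_id, payload in batch:
            handled.append((event_id, payload))
        # If the response silently dropped some events, this id is
        # smaller than the newest id actually produced, so the next
        # iteration fetches (and handles) the tail events a second
        # time -- the duplication described above.
        last_event_id = batch[-1][0]
    return handled
```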
Thanks
I'm not sure if it's the same issue as you get with the logs, though,
as I'm not familiar with this area in the OSTs. Is the creation of the 2 logs done by different threads?
I have no idea either.
Regarding the initial error ('directory is not empty.non-zero return code') I'm not sure, I'll have to look more into it. Martin, do you have any idea?
I think it's quite obvious that the reason for this is that we run two ovirt-log-collector processes in parallel. Why we (sometimes) do this is an open question, but unrelated to ovirt-log-collector (or the specific error message it emits).
I now did this:
1. Wrote a trivial dummy module [1] to test this
2. Ran it on an already finished basic-suite env with lagofy/run_tc [2].
As you can see, if you pipe [2] through grep -Ev 'do_something_2.*something 2|do_something_1.*something 1', in some cases do_something_1 actually ran the ansible code of do_something_2. I currently think it's some kind of race condition in ansible-runner, causing it to misbehave when run in parallel, as we do in test_metrics_and_log_collector. I have not yet checked its code, or added debugging to it, to find the exact location.
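The suspected failure mode can be demonstrated with a toy sketch (this is an illustration of the general hazard of two runs sharing one input path, not ansible-runner's actual mechanism; all names here are made up):

```python
import pathlib
import tempfile
import threading

def demo_shared_dir_race():
    """Two parallel 'runs' share one input file path, the way two
    parallel ansible-runner invocations might share private data.
    Whichever run writes last wins, so at least one run ends up
    reading (i.e. 'executing') the other run's task."""
    shared = pathlib.Path(tempfile.mkdtemp())
    play = shared / "play.yml"       # same path for both runs
    barrier = threading.Barrier(2)   # both writes finish before any read
    results = {}

    def run(name, payload):
        play.write_text(payload)     # each run writes "its" playbook...
        barrier.wait()
        results[name] = play.read_text()  # ...then reads the file back

    threads = [
        threading.Thread(target=run, args=("do_something_1", "something 1")),
        threading.Thread(target=run, args=("do_something_2", "something 2")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

After the run, both entries in `results` hold the same payload, so one of the two names got the other's task, matching the do_something_1/do_something_2 mixup in the grep output above.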
2021-03-16 20:54:18+0200,681 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:54:18 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
2021-03-16 20:55:41+0200,155 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:55:40 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
2021-03-16 20:56:56+0200,196 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:56:55 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
2021-03-16 20:58:19+0200,665 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:58:19 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)
For now, pushed: https://gerrit.ovirt.org/c/ovirt-system-tests/+/113906 ("test_metrics_and_log_collector: Run serially")
Best regards,
[1] https://github.com/didib/ovirt-system-tests/commit/6a5327976eaf5bc27e352723b...
[2] https://gist.github.com/didib/fe1b3bc28b8364cf6976d3d93e6cd9b6
On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi@redhat.com> wrote:
On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi@redhat.com> wrote:
Hi all,
This started a few days ago [1] and randomly happens since then:
E DEBUG: Configuration:
E DEBUG: command: collect
E DEBUG: Traceback (most recent call last):
E DEBUG:   File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
E DEBUG:     '%s directory is not empty.' % (conf["local_tmp_dir"])
E DEBUG: Exception: /dev/shm/log directory is not empty.
E ERROR: /dev/shm/log directory is not empty.
E non-zero return code
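From the message, the guard at __main__.py line 2067 is presumably something along these lines (a reconstruction from the traceback, not the actual ovirt-log-collector source):

```python
import os

def check_local_tmp_dir(conf):
    # Refuse to reuse a non-empty local tmp dir. This is exactly what
    # trips when two collector processes race on the same directory:
    # the second one finds the first one's files already there.
    local_tmp_dir = conf["local_tmp_dir"]
    if os.path.isdir(local_tmp_dir) and os.listdir(local_tmp_dir):
        raise Exception(
            '%s directory is not empty.' % (conf["local_tmp_dir"])
        )
```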
Michal tried to fix this by using a random directory but it still fails [2]:
DEBUG: command: collect
DEBUG: Traceback (most recent call last):
DEBUG:   File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
DEBUG:     '%s directory is not empty.' % (conf["local_tmp_dir"])
DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.
ERROR: /dev/shm/kaN7uY directory is not empty.
non-zero return code
Since I suppose that the randomness of mktemp is good enough, it must be something else. Also, the last successful run before [1] used the same OST git commit (same code), so I do not think it's something in OST's code.
Any idea?
I think I'll push a patch to create and use the directory right before calling ovirt-log-collector, which is probably better in other ways.
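Such a patch would boil down to something like this (a sketch only; the flags are taken from the journal excerpts in this thread, while fresh_tmp_dir and collector_cmd are made-up helper names):

```python
import tempfile

def fresh_tmp_dir(parent="/dev/shm"):
    # mkdtemp creates a brand-new, empty, private directory right
    # before the call, so the collector's "directory is not empty"
    # check cannot fire on leftovers from an earlier run.
    return tempfile.mkdtemp(dir=parent)

def collector_cmd(local_tmp, conf_file="/root/ovirt-log-collector.conf"):
    # Same invocation as in the ansible-command journal lines.
    return [
        "ovirt-log-collector", "--verbose", "--batch", "--no-hypervisors",
        "--local-tmp=%s" % local_tmp,
        "--conf-file=%s" % conf_file,
    ]

# Intended use (not executed here):
#   subprocess.run(collector_cmd(fresh_tmp_dir()), check=True)
```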
My patch [1] still fails, with a somewhat different error message, but this made me check further, and while I still do not understand, I have this to add:
In the failing runs, ovirt-log-collector is called *twice*, in parallel. E.g. in [2] (the check-patch of [1]):
Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
It also generates two logs, which you can check/compare.
It's the same for previous ones, e.g. latest nightly [3][4]:
Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For complete SELinux messages run: sealert -l d03a8655-9430-4fcf-9892-3b4df1939899
Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** Plugin catchall (100. confidence) suggests **************************#012#012If you believe that systemd should be allowed unlink access on the ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012
Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None
Any idea what might have caused this to start happening? Perhaps a bug in ansible, or ansible-runner? It reminds me of [5]. Adding Dana and Martin.
I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker.
Best regards,
[1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/15980/...
[3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
[4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
[5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707
Best regards,
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_nightly/...
-- Didi