basic suite fails on test_metrics_and_log_collector

ovirt-appliance CI

[oVirt Jenkins]...

Yedidyah Bar David

Monday, 15 March 2021 Mon, 15 Mar '21

12:55 a.m.

Hi all, This started a few days ago [1] and randomly happens since then: E DEBUG: Configuration: E DEBUG: command: collect E DEBUG: Traceback (most recent call last): E DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module> E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) E DEBUG: Exception: /dev/shm/log directory is not empty.ERROR: /dev/shm/log directory is not empty.non-zero return code Michal tried to fix this by using a random directory but it still fails [2]: DEBUG: command: collect DEBUG: Traceback (most recent call last): DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module> DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: /dev/shm/kaN7uY directory is not empty.non-zero return code Since I suppose that the randomness of mktemp is good enough, it must be something else. Also, the last successful run before [1] used the same OST git commit (same code), so I do not think it's something in OST's code. Any idea? I think I'll push a patch to create and use the directory right before calling ovirt-log-collector, which is probably better in other ways. Best regards, [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... -- Didi

Show replies by date

Yedidyah Bar David

Monday, 15 March Mon, 15 Mar

2:58 a.m.

On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote:

...

My patch [1] still fails, with a somewhat different error message, but this made me check further, and while I still do not understand, I have this to add: In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. in [2] (the check-patch of [1]): Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None It also generates two logs, which you can check/compare. It's the same for previous ones, e.g. latest nightly [3][4]: Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For complete SELinux messages run: sealert -l d03a8655-9430-4fcf-9892-3b4df1939899 Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** Plugin catchall (100. confidence) suggests **************************#012#012If you believe that systemd should be allowed unlink access on the ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Any idea what might have caused this to start happening? Perhaps a bug in ansible, or ansible-runner? It reminds me of [5]. Adding Dana and Martin. I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. Best regards, [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707

...

Best regards, [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... -- Didi

-- Didi

Dana Elfassy

Tuesday, 16 March Tue, 16 Mar

5:47 a.m.

About [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707, I have found that when executing the http request of the playbook, sometimes in the response there are some events that are missing. The way we handle the events is via while loop, where in each loop we continue from the lastEventId. Since not all events were listed, the lastEventId (which is returned as total number of events) is smaller than the actual eventId that was handled. Thus, in the next iteration the last events are handled again. It seems that this comes from Ansible Runner, and I'm investigating that area now I'm not sure if it's the same issue as you get with the logs, though, as I'm not familiar with this area in the OSTs. Is the creation of the 2 logs done by different threads? Regarding the initial error (' directory is not empty.non-zero return code') I'm not sure, will have to loop more into it. Martin, do you have any idea? On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote:

...

On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote: > > Hi all, > > This started a few days ago [1] and randomly happens since then: > > E DEBUG: Configuration: > E DEBUG: command: collect > E DEBUG: Traceback (most recent call last): > E DEBUG: File > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > line 2067, in <module> > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > E DEBUG: Exception: /dev/shm/log directory is not > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code > > Michal tried to fix this by using a random directory but it still fails [2]: > > DEBUG: command: collect > DEBUG: Traceback (most recent call last): > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > line 2067, in <module> > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: > /dev/shm/kaN7uY directory is not empty.non-zero return code > > Since I suppose that the randomness of mktemp is good enough, it must > be something else. Also, the last successful run before [1] used the > same OST git commit (same code), so I do not think it's something in > OST's code. > > Any idea? > > I think I'll push a patch to create and use the directory right before > calling ovirt-log-collector, which is probably better in other ways. My patch [1] still fails, with a somewhat different error message, but this made me check further, and while I still do not understand, I have this to add: In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. in [2] (the check-patch of [1]): Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None It also generates two logs, which you can check/compare. It's the same for previous ones, e.g. latest nightly [3][4]: Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For complete SELinux messages run: sealert -l d03a8655-9430-4fcf-9892-3b4df1939899 Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** Plugin catchall (100. confidence) suggests **************************#012#012If you believe that systemd should be allowed unlink access on the ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Any idea what might have caused this to start happening? Perhaps a bug in ansible, or ansible-runner? It reminds me of [5]. Adding Dana and Martin. I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. Best regards, [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 > > Best regards, > > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > > -- > Didi -- Didi

Yedidyah Bar David

6:11 a.m.

On Tue, Mar 16, 2021 at 12:50 PM Dana Elfassy <delfassy(a)redhat.com> wrote:

...

Thanks

...

I'm not sure if it's the same issue as you get with the logs, though, as I'm not familiar with this area in the OSTs. Is the creation of the 2 logs done by different threads?

I have no idea either.

...

Regarding the initial error (' directory is not empty.non-zero return code') I'm not sure, will have to loop more into it. Martin, do you have any idea?

I think it's quite obvious that the reason for this is that we run two ovirt-log-collector processes in parallel. Why we (sometimes) do this is an open question, but unrelated to ovirt-log-collector (or the specific error message it emits).

...

On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote: > > On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote: > > > > Hi all, > > > > This started a few days ago [1] and randomly happens since then: > > > > E DEBUG: Configuration: > > E DEBUG: command: collect > > E DEBUG: Traceback (most recent call last): > > E DEBUG: File > > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > > line 2067, in <module> > > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > > E DEBUG: Exception: /dev/shm/log directory is not > > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code > > > > Michal tried to fix this by using a random directory but it still fails [2]: > > > > DEBUG: command: collect > > DEBUG: Traceback (most recent call last): > > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > > line 2067, in <module> > > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: > > /dev/shm/kaN7uY directory is not empty.non-zero return code > > > > Since I suppose that the randomness of mktemp is good enough, it must > > be something else. Also, the last successful run before [1] used the > > same OST git commit (same code), so I do not think it's something in > > OST's code. > > > > Any idea? > > > > I think I'll push a patch to create and use the directory right before > > calling ovirt-log-collector, which is probably better in other ways. > > My patch [1] still fails, with a somewhat different error message, but > this made me check further, and while I still do not understand, I have > this to add: > > In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. > in [2] (the check-patch of [1]): > > Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: > ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p > /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors > --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: > ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p > /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors > --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > > It also generates two logs, which you can check/compare. > > It's the same for previous ones, e.g. latest nightly [3][4]: > > Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: > ansible-command Invoked with _raw_params=ovirt-log-collector --verbose > --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: > SELinux is preventing /usr/lib/systemd/systemd from unlink access on > the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For > complete SELinux messages run: sealert -l > d03a8655-9430-4fcf-9892-3b4df1939899 > Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: > SELinux is preventing /usr/lib/systemd/systemd from unlink access on > the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** > Plugin catchall (100. confidence) suggests > **************************#012#012If you believe that systemd should > be allowed unlink access on the > ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by > default.#012Then you should report this as a bug.#012You can generate > a local policy module to allow this access.#012Do#012allow this access > for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow > -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 > Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: > ansible-command Invoked with _raw_params=ovirt-log-collector --verbose > --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > > Any idea what might have caused this to start happening? Perhaps > a bug in ansible, or ansible-runner? It reminds me of [5]. > Adding Dana and Martin. > > I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. > > Best regards, > > [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 > > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... > > [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 > > > > > Best regards, > > > > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > > > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > > > > > -- > > Didi > > > > -- > Didi >

-- Didi

Yedidyah Bar David

2:07 p.m.

On Tue, Mar 16, 2021 at 1:11 PM Yedidyah Bar David <didi(a)redhat.com> wrote:

...

On Tue, Mar 16, 2021 at 12:50 PM Dana Elfassy <delfassy(a)redhat.com> wrote: > > About [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707, I have found that when executing the http request of the playbook, sometimes in the response there are some events that are missing. The way we handle the events is via while loop, where in each loop we continue from the lastEventId. Since not all events were listed, the lastEventId (which is returned as total number of events) is smaller than the actual eventId that was handled. Thus, in the next iteration the last events are handled again. It seems that this comes from Ansible Runner, and I'm investigating that area now Thanks > I'm not sure if it's the same issue as you get with the logs, though, as I'm not familiar with this area in the OSTs. Is the creation of the 2 logs done by different threads? I have no idea either. > > Regarding the initial error (' directory is not empty.non-zero return code') I'm not sure, will have to loop more into it. Martin, do you have any idea? I think it's quite obvious that the reason for this is that we run two ovirt-log-collector processes in parallel. Why we (sometimes) do this is an open question, but unrelated to ovirt-log-collector (or the specific error message it emits).

I now did this: 1. Wrote a trivial dummy module [1] to test this 2. Ran it on an already finished basic-suite env with lagofy/run_tc [2]. As you can see, if you pipe [2] through grep -Ev 'do_something_2.*something 2|do_something_1.*something 1', in some cases do_something_1 actually ran the ansible code of do_something_2. I currently think it's some kind of race condition in ansible-runner, causing it to misbehave when ran in parallel, as we do in test_metrics_and_log_collector. I still didn't try to check its code and/or add debugging stuff to it to try and find the exact location. 2021-03-16 20:54:18+0200,681 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:54:18 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39) 2021-03-16 20:55:41+0200,155 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:55:40 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39) 2021-03-16 20:56:56+0200,196 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:56:55 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39) 2021-03-16 20:58:19+0200,665 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:58:19 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39) [1] https://github.com/didib/ovirt-system-tests/commit/6a5327976eaf5bc27e3527... [2] https://gist.github.com/didib/fe1b3bc28b8364cf6976d3d93e6cd9b6

...

> > On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote: >> >> On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote: >> > >> > Hi all, >> > >> > This started a few days ago [1] and randomly happens since then: >> > >> > E DEBUG: Configuration: >> > E DEBUG: command: collect >> > E DEBUG: Traceback (most recent call last): >> > E DEBUG: File >> > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", >> > line 2067, in <module> >> > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) >> > E DEBUG: Exception: /dev/shm/log directory is not >> > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code >> > >> > Michal tried to fix this by using a random directory but it still fails [2]: >> > >> > DEBUG: command: collect >> > DEBUG: Traceback (most recent call last): >> > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", >> > line 2067, in <module> >> > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) >> > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: >> > /dev/shm/kaN7uY directory is not empty.non-zero return code >> > >> > Since I suppose that the randomness of mktemp is good enough, it must >> > be something else. Also, the last successful run before [1] used the >> > same OST git commit (same code), so I do not think it's something in >> > OST's code. >> > >> > Any idea? >> > >> > I think I'll push a patch to create and use the directory right before >> > calling ovirt-log-collector, which is probably better in other ways. >> >> My patch [1] still fails, with a somewhat different error message, but >> this made me check further, and while I still do not understand, I have >> this to add: >> >> In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. >> in [2] (the check-patch of [1]): >> >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> >> It also generates two logs, which you can check/compare. >> >> It's the same for previous ones, e.g. latest nightly [3][4]: >> >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on >> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For >> complete SELinux messages run: sealert -l >> d03a8655-9430-4fcf-9892-3b4df1939899 >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on >> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** >> Plugin catchall (100. confidence) suggests >> **************************#012#012If you believe that systemd should >> be allowed unlink access on the >> ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by >> default.#012Then you should report this as a bug.#012You can generate >> a local policy module to allow this access.#012Do#012allow this access >> for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow >> -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> >> Any idea what might have caused this to start happening? Perhaps >> a bug in ansible, or ansible-runner? It reminds me of [5]. >> Adding Dana and Martin. >> >> I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. >> >> Best regards, >> >> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 >> >> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... >> >> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> >> [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> >> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 >> >> > >> > Best regards, >> > >> > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> > >> > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> > >> > >> > -- >> > Didi >> >> >> >> -- >> Didi >> -- Didi

-- Didi

Yedidyah Bar David

Wednesday, 17 March Wed, 17 Mar

2:30 a.m.

On Tue, Mar 16, 2021 at 9:07 PM Yedidyah Bar David <didi(a)redhat.com> wrote:

...

On Tue, Mar 16, 2021 at 1:11 PM Yedidyah Bar David <didi(a)redhat.com> wrote: > > On Tue, Mar 16, 2021 at 12:50 PM Dana Elfassy <delfassy(a)redhat.com> wrote: > > > > About [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707, I have found that when executing the http request of the playbook, sometimes in the response there are some events that are missing. The way we handle the events is via while loop, where in each loop we continue from the lastEventId. Since not all events were listed, the lastEventId (which is returned as total number of events) is smaller than the actual eventId that was handled. Thus, in the next iteration the last events are handled again. It seems that this comes from Ansible Runner, and I'm investigating that area now > > Thanks > > > I'm not sure if it's the same issue as you get with the logs, though, as I'm not familiar with this area in the OSTs. Is the creation of the 2 logs done by different threads? > > I have no idea either. > > > > > Regarding the initial error (' directory is not empty.non-zero return code') I'm not sure, will have to loop more into it. Martin, do you have any idea? > > I think it's quite obvious that the reason for this is that we run two > ovirt-log-collector processes in parallel. Why we (sometimes) do this > is an open question, but unrelated to ovirt-log-collector (or the > specific error message it emits). I now did this: 1. Wrote a trivial dummy module [1] to test this 2. Ran it on an already finished basic-suite env with lagofy/run_tc [2]. As you can see, if you pipe [2] through grep -Ev 'do_something_2.*something 2|do_something_1.*something 1', in some cases do_something_1 actually ran the ansible code of do_something_2. I currently think it's some kind of race condition in ansible-runner, causing it to misbehave when ran in parallel, as we do in test_metrics_and_log_collector. I still didn't try to check its code and/or add debugging stuff to it to try and find the exact location. 2021-03-16 20:54:18+0200,681 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:54:18 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39) 2021-03-16 20:55:41+0200,155 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:55:40 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39) 2021-03-16 20:56:56+0200,196 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:56:55 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39) 2021-03-16 20:58:19+0200,665 INFO [test_003_01_dummy] do_something_1: Tue Mar 16 19:58:19 CET 2021 lago-basic-suite-master-engine something 2 (test_003_01_dummy:39)

For now, pushed: https://gerrit.ovirt.org/c/ovirt-system-tests/+/113906 test_metrics_and_log_collector: Run serially Best regards,

...

[1] https://github.com/didib/ovirt-system-tests/commit/6a5327976eaf5bc27e3527... [2] https://gist.github.com/didib/fe1b3bc28b8364cf6976d3d93e6cd9b6 > > > > > On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote: > >> > >> On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote: > >> > > >> > Hi all, > >> > > >> > This started a few days ago [1] and randomly happens since then: > >> > > >> > E DEBUG: Configuration: > >> > E DEBUG: command: collect > >> > E DEBUG: Traceback (most recent call last): > >> > E DEBUG: File > >> > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > >> > line 2067, in <module> > >> > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > >> > E DEBUG: Exception: /dev/shm/log directory is not > >> > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code > >> > > >> > Michal tried to fix this by using a random directory but it still fails [2]: > >> > > >> > DEBUG: command: collect > >> > DEBUG: Traceback (most recent call last): > >> > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > >> > line 2067, in <module> > >> > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > >> > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: > >> > /dev/shm/kaN7uY directory is not empty.non-zero return code > >> > > >> > Since I suppose that the randomness of mktemp is good enough, it must > >> > be something else. Also, the last successful run before [1] used the > >> > same OST git commit (same code), so I do not think it's something in > >> > OST's code. > >> > > >> > Any idea? > >> > > >> > I think I'll push a patch to create and use the directory right before > >> > calling ovirt-log-collector, which is probably better in other ways. > >> > >> My patch [1] still fails, with a somewhat different error message, but > >> this made me check further, and while I still do not understand, I have > >> this to add: > >> > >> In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. > >> in [2] (the check-patch of [1]): > >> > >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: > >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p > >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors > >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf > >> _uses_shell=True warn=True stdin_add_newline=True > >> strip_empty_ends=True argv=None chdir=None executable=None > >> creates=None removes=None stdin=None > >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: > >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p > >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors > >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf > >> _uses_shell=True warn=True stdin_add_newline=True > >> strip_empty_ends=True argv=None chdir=None executable=None > >> creates=None removes=None stdin=None > >> > >> It also generates two logs, which you can check/compare. > >> > >> It's the same for previous ones, e.g. latest nightly [3][4]: > >> > >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: > >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose > >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf > >> _uses_shell=True warn=True stdin_add_newline=True > >> strip_empty_ends=True argv=None chdir=None executable=None > >> creates=None removes=None stdin=None > >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: > >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on > >> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For > >> complete SELinux messages run: sealert -l > >> d03a8655-9430-4fcf-9892-3b4df1939899 > >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: > >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on > >> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** > >> Plugin catchall (100. confidence) suggests > >> **************************#012#012If you believe that systemd should > >> be allowed unlink access on the > >> ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by > >> default.#012Then you should report this as a bug.#012You can generate > >> a local policy module to allow this access.#012Do#012allow this access > >> for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow > >> -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 > >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: > >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose > >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf > >> _uses_shell=True warn=True stdin_add_newline=True > >> strip_empty_ends=True argv=None chdir=None executable=None > >> creates=None removes=None stdin=None > >> > >> Any idea what might have caused this to start happening? Perhaps > >> a bug in ansible, or ansible-runner? It reminds me of [5]. > >> Adding Dana and Martin. > >> > >> I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. > >> > >> Best regards, > >> > >> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 > >> > >> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... > >> > >> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > >> > >> [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > >> > >> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 > >> > >> > > >> > Best regards, > >> > > >> > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > >> > > >> > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > >> > > >> > > >> > -- > >> > Didi > >> > >> > >> > >> -- > >> Didi > >> > > > -- > Didi -- Didi

-- Didi

Dana Elfassy

7:51 a.m.

Adding +Martin Perina <mperina(a)redhat.com> On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote:

...

Dana Elfassy

7:53 a.m.

Adding +Marcin Sobczyk <msobczyk(a)redhat.com> On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote:

...

Michal Skrivanek

9:22 a.m.

...

On 17. 3. 2021, at 13:53, Dana Elfassy <delfassy(a)redhat.com> wrote: Adding +Marcin Sobczyk <mailto:msobczyk@redhat.com> On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com <mailto:didi@redhat.com>> wrote: On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com <mailto:didi@redhat.com>> wrote: > > Hi all, > > This started a few days ago [1] and randomly happens since then: > > E DEBUG: Configuration: > E DEBUG: command: collect > E DEBUG: Traceback (most recent call last): > E DEBUG: File > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > line 2067, in <module> > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > E DEBUG: Exception: /dev/shm/log directory is not > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code > > Michal tried to fix this by using a random directory but it still fails [2]: > > DEBUG: command: collect > DEBUG: Traceback (most recent call last): > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > line 2067, in <module> > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: > /dev/shm/kaN7uY directory is not empty.non-zero return code > > Since I suppose that the randomness of mktemp is good enough, it must > be something else. Also, the last successful run before [1] used the > same OST git commit (same code), so I do not think it's something in > OST's code. > > Any idea? > > I think I'll push a patch to create and use the directory right before > calling ovirt-log-collector, which is probably better in other ways. My patch [1] still fails, with a somewhat different error message, but this made me check further, and while I still do not understand, I have this to add: In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. in [2] (the check-patch of [1]): Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None It also generates two logs, which you can check/compare. It's the same for previous ones, e.g. latest nightly [3][4]: Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For complete SELinux messages run: sealert -l d03a8655-9430-4fcf-9892-3b4df1939899 Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: SELinux is preventing /usr/lib/systemd/systemd from unlink access on the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** Plugin catchall (100. confidence) suggests **************************#012#012If you believe that systemd should be allowed unlink access on the ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by default.#012Then you should report this as a bug.#012You can generate a local policy module to allow this access.#012Do#012allow this access for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: ansible-command Invoked with _raw_params=ovirt-log-collector --verbose --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf _uses_shell=True warn=True stdin_add_newline=True strip_empty_ends=True argv=None chdir=None executable=None creates=None removes=None stdin=None Any idea what might have caused this to start happening? Perhaps a bug in ansible, or ansible-runner? It reminds me of [5]. Adding Dana and Martin. I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker.

it’s from January and tehre are no comments there. Dana, any update? It does look serious but perhaps not really hit in real world scenarios?

...

Best regards, [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 <https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... <https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... <https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... <https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 <https://bugzilla.redhat.com/show_bug.cgi?id=1917707> > > Best regards, > > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... <https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... <https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > > -- > Didi -- Didi _______________________________________________ Infra mailing list -- infra(a)ovirt.org To unsubscribe send an email to infra-leave(a)ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/PSO5IRX73FS...

Martin Perina

9:48 a.m.

New subject: [ovirt-devel] Re: basic suite fails on test_metrics_and_log_collector

On Wed, Mar 17, 2021 at 3:24 PM Michal Skrivanek < michal.skrivanek(a)redhat.com> wrote:

...

On 17. 3. 2021, at 13:53, Dana Elfassy <delfassy(a)redhat.com> wrote: Adding +Marcin Sobczyk <msobczyk(a)redhat.com> On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote: > On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> > wrote: > > > > Hi all, > > > > This started a few days ago [1] and randomly happens since then: > > > > E DEBUG: Configuration: > > E DEBUG: command: collect > > E DEBUG: Traceback (most recent call last): > > E DEBUG: File > > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > > line 2067, in <module> > > E DEBUG: '%s directory is not empty.' % > (conf["local_tmp_dir"]) > > E DEBUG: Exception: /dev/shm/log directory is not > > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code > > > > Michal tried to fix this by using a random directory but it still fails > [2]: > > > > DEBUG: command: collect > > DEBUG: Traceback (most recent call last): > > DEBUG: File > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", > > line 2067, in <module> > > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) > > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: > > /dev/shm/kaN7uY directory is not empty.non-zero return code > > > > Since I suppose that the randomness of mktemp is good enough, it must > > be something else. Also, the last successful run before [1] used the > > same OST git commit (same code), so I do not think it's something in > > OST's code. > > > > Any idea? > > > > I think I'll push a patch to create and use the directory right before > > calling ovirt-log-collector, which is probably better in other ways. > > My patch [1] still fails, with a somewhat different error message, but > this made me check further, and while I still do not understand, I have > this to add: > > In the failing runs, ovirt-log-collector is called *twice* in parallel. > E.g. > in [2] (the check-patch of [1]): > > Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: > ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p > /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors > --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: > ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p > /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors > --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > > It also generates two logs, which you can check/compare. > > It's the same for previous ones, e.g. latest nightly [3][4]: > > Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: > ansible-command Invoked with _raw_params=ovirt-log-collector --verbose > --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: > SELinux is preventing /usr/lib/systemd/systemd from unlink access on > the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For > complete SELinux messages run: sealert -l > d03a8655-9430-4fcf-9892-3b4df1939899 > Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: > SELinux is preventing /usr/lib/systemd/systemd from unlink access on > the sock_file > ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** > Plugin catchall (100. confidence) suggests > **************************#012#012If you believe that systemd should > be allowed unlink access on the > ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by > default.#012Then you should report this as a bug.#012You can generate > a local policy module to allow this access.#012Do#012allow this access > for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow > -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 > Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: > ansible-command Invoked with _raw_params=ovirt-log-collector --verbose > --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf > _uses_shell=True warn=True stdin_add_newline=True > strip_empty_ends=True argv=None chdir=None executable=None > creates=None removes=None stdin=None > > Any idea what might have caused this to start happening? Perhaps > a bug in ansible, or ansible-runner? It reminds me of [5]. > Adding Dana and Martin. > > I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. > it’s from January and tehre are no comments there. Dana, any update? It does look serious but perhaps not really hit in real world scenarios?

It's reproducible outside OST, Dana is investigating intensively last 2 weeks, but so far we haven't been able to find where events are multiplied. But AFAIK this issue is visble only with ansible-runner-service, respectively only in the part when engine fetches events from ARS and create audit_log records from them. It doesn't mean, that playbook tasks are executed multiple times! OST is using ansible-runner directly, so it doesn't seem to me like the same issue So [5] is not a blocker, it's just really annoying if you need to works with audit_log records

...

> Best regards, > > [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 > > [2] > https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... > > [3] > https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > [4] > https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 > > > > > Best regards, > > > > [1] > https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > > > [2] > https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... > > > > > > -- > > Didi > > > > -- > Didi > > _______________________________________________ Infra mailing list -- infra(a)ovirt.org To unsubscribe send an email to infra-leave(a)ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/PSO5IRX73FS... _______________________________________________ Devel mailing list -- devel(a)ovirt.org To unsubscribe send an email to devel-leave(a)ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/6DZRAMSAJCX...

-- Martin Perina Manager, Software Engineering Red Hat Czech s.r.o.

Yedidyah Bar David

10:38 a.m.

New subject: [ovirt-devel] Re: basic suite fails on test_metrics_and_log_collector

On Wed, Mar 17, 2021 at 4:48 PM Martin Perina <mperina(a)redhat.com> wrote:

...

On Wed, Mar 17, 2021 at 3:24 PM Michal Skrivanek <michal.skrivanek(a)redhat.com> wrote: > > > > On 17. 3. 2021, at 13:53, Dana Elfassy <delfassy(a)redhat.com> wrote: > > Adding +Marcin Sobczyk > > On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote: >> >> On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote: >> > >> > Hi all, >> > >> > This started a few days ago [1] and randomly happens since then: >> > >> > E DEBUG: Configuration: >> > E DEBUG: command: collect >> > E DEBUG: Traceback (most recent call last): >> > E DEBUG: File >> > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", >> > line 2067, in <module> >> > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) >> > E DEBUG: Exception: /dev/shm/log directory is not >> > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code >> > >> > Michal tried to fix this by using a random directory but it still fails [2]: >> > >> > DEBUG: command: collect >> > DEBUG: Traceback (most recent call last): >> > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", >> > line 2067, in <module> >> > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) >> > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: >> > /dev/shm/kaN7uY directory is not empty.non-zero return code >> > >> > Since I suppose that the randomness of mktemp is good enough, it must >> > be something else. Also, the last successful run before [1] used the >> > same OST git commit (same code), so I do not think it's something in >> > OST's code. >> > >> > Any idea? >> > >> > I think I'll push a patch to create and use the directory right before >> > calling ovirt-log-collector, which is probably better in other ways. >> >> My patch [1] still fails, with a somewhat different error message, but >> this made me check further, and while I still do not understand, I have >> this to add: >> >> In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. >> in [2] (the check-patch of [1]): >> >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: >> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p >> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors >> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> >> It also generates two logs, which you can check/compare. >> >> It's the same for previous ones, e.g. latest nightly [3][4]: >> >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on >> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For >> complete SELinux messages run: sealert -l >> d03a8655-9430-4fcf-9892-3b4df1939899 >> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: >> SELinux is preventing /usr/lib/systemd/systemd from unlink access on >> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** >> Plugin catchall (100. confidence) suggests >> **************************#012#012If you believe that systemd should >> be allowed unlink access on the >> ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by >> default.#012Then you should report this as a bug.#012You can generate >> a local policy module to allow this access.#012Do#012allow this access >> for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow >> -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 >> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: >> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose >> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf >> _uses_shell=True warn=True stdin_add_newline=True >> strip_empty_ends=True argv=None chdir=None executable=None >> creates=None removes=None stdin=None >> >> Any idea what might have caused this to start happening? Perhaps >> a bug in ansible, or ansible-runner? It reminds me of [5]. >> Adding Dana and Martin. >> >> I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. > > > it’s from January and tehre are no comments there. Dana, any update? > It does look serious but perhaps not really hit in real world scenarios? It's reproducible outside OST, Dana is investigating intensively last 2 weeks, but so far we haven't been able to find where events are multiplied. But AFAIK this issue is visble only with ansible-runner-service, respectively only in the part when engine fetches events from ARS and create audit_log records from them. It doesn't mean, that playbook tasks are executed multiple times!

Are you sure? Please check also bz 1930565. I closed it as a duplicate, but if you are certain that tasks are never executed multiple times, then we should reopen. I looked at the attached logs, and they were hard to follow (due to the duplication), but at least I didn't notice there any bug around imgbased.

...

OST is using ansible-runner directly, so it doesn't seem to me like the same issue

Indeed. That's also what I currently think myself. It seems like a bug in OST. Still didn't manage to find the root cause - I am pushing my debugging changes to [1], for anyone that wants to follow, and: 1. run_tc basic-suite-master/test-scenarios/test_003_01_dummy.py > ~/run_tc/run-tc-$(date +%Y-%m-%d-%H%M%S).log 2>&1 2. cat $(ls -t1 run-tc* | head -1) | grep -Ev 'do_something_2.*something 2|do_something_1.*something 1' | grep -v 'Start' 3. Pick a line from the output (there are always a few, among 100 runs), check its timestamp, then look at pytest.log and search for (currently): $timestamp|something..|_run_ansible_runner|_run_ansible_runner.*before run|_run_ansible_runner.*after run|Starting functools.partial|.*test_dummy: Starting iteration.*|ModuleArgsMapper Hope to update tomorrow. For now, we can merge the workaround patch to run them serially, to unbreak basic-suite at this point. [1] https://github.com/didib/ovirt-system-tests/commits/ansible-logs

...

So [5] is not a blocker, it's just really annoying if you need to works with audit_log records > >> >> Best regards, >> >> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 >> >> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... >> >> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> >> [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> >> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 >> >> > >> > Best regards, >> > >> > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> > >> > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >> > >> > >> > -- >> > Didi >> >> >> >> -- >> Didi >> > _______________________________________________ > Infra mailing list -- infra(a)ovirt.org > To unsubscribe send an email to infra-leave(a)ovirt.org > Privacy Statement: https://www.ovirt.org/privacy-policy.html > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ > List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/PSO5IRX73FS... > > > _______________________________________________ > Devel mailing list -- devel(a)ovirt.org > To unsubscribe send an email to devel-leave(a)ovirt.org > Privacy Statement: https://www.ovirt.org/privacy-policy.html > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ > List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/6DZRAMSAJCX... -- Martin Perina Manager, Software Engineering Red Hat Czech s.r.o.

-- Didi

Yedidyah Bar David

Thursday, 18 March Thu, 18 Mar

5:33 a.m.

New subject: [ovirt-devel] Re: basic suite fails on test_metrics_and_log_collector

On Wed, Mar 17, 2021 at 5:38 PM Yedidyah Bar David <didi(a)redhat.com> wrote:

...

On Wed, Mar 17, 2021 at 4:48 PM Martin Perina <mperina(a)redhat.com> wrote: > > > > On Wed, Mar 17, 2021 at 3:24 PM Michal Skrivanek <michal.skrivanek(a)redhat.com> wrote: >> >> >> >> On 17. 3. 2021, at 13:53, Dana Elfassy <delfassy(a)redhat.com> wrote: >> >> Adding +Marcin Sobczyk >> >> On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote: >>> >>> On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote: >>> > >>> > Hi all, >>> > >>> > This started a few days ago [1] and randomly happens since then: >>> > >>> > E DEBUG: Configuration: >>> > E DEBUG: command: collect >>> > E DEBUG: Traceback (most recent call last): >>> > E DEBUG: File >>> > "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", >>> > line 2067, in <module> >>> > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) >>> > E DEBUG: Exception: /dev/shm/log directory is not >>> > empty.ERROR: /dev/shm/log directory is not empty.non-zero return code >>> > >>> > Michal tried to fix this by using a random directory but it still fails [2]: >>> > >>> > DEBUG: command: collect >>> > DEBUG: Traceback (most recent call last): >>> > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", >>> > line 2067, in <module> >>> > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"]) >>> > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: >>> > /dev/shm/kaN7uY directory is not empty.non-zero return code >>> > >>> > Since I suppose that the randomness of mktemp is good enough, it must >>> > be something else. Also, the last successful run before [1] used the >>> > same OST git commit (same code), so I do not think it's something in >>> > OST's code. >>> > >>> > Any idea? >>> > >>> > I think I'll push a patch to create and use the directory right before >>> > calling ovirt-log-collector, which is probably better in other ways. >>> >>> My patch [1] still fails, with a somewhat different error message, but >>> this made me check further, and while I still do not understand, I have >>> this to add: >>> >>> In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g. >>> in [2] (the check-patch of [1]): >>> >>> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]: >>> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p >>> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors >>> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf >>> _uses_shell=True warn=True stdin_add_newline=True >>> strip_empty_ends=True argv=None chdir=None executable=None >>> creates=None removes=None stdin=None >>> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]: >>> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p >>> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors >>> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf >>> _uses_shell=True warn=True stdin_add_newline=True >>> strip_empty_ends=True argv=None chdir=None executable=None >>> creates=None removes=None stdin=None >>> >>> It also generates two logs, which you can check/compare. >>> >>> It's the same for previous ones, e.g. latest nightly [3][4]: >>> >>> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]: >>> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose >>> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf >>> _uses_shell=True warn=True stdin_add_newline=True >>> strip_empty_ends=True argv=None chdir=None executable=None >>> creates=None removes=None stdin=None >>> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: >>> SELinux is preventing /usr/lib/systemd/systemd from unlink access on >>> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For >>> complete SELinux messages run: sealert -l >>> d03a8655-9430-4fcf-9892-3b4df1939899 >>> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]: >>> SELinux is preventing /usr/lib/systemd/systemd from unlink access on >>> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012***** >>> Plugin catchall (100. confidence) suggests >>> **************************#012#012If you believe that systemd should >>> be allowed unlink access on the >>> ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by >>> default.#012Then you should report this as a bug.#012You can generate >>> a local policy module to allow this access.#012Do#012allow this access >>> for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow >>> -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012 >>> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]: >>> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose >>> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf >>> _uses_shell=True warn=True stdin_add_newline=True >>> strip_empty_ends=True argv=None chdir=None executable=None >>> creates=None removes=None stdin=None >>> >>> Any idea what might have caused this to start happening? Perhaps >>> a bug in ansible, or ansible-runner? It reminds me of [5]. >>> Adding Dana and Martin. >>> >>> I think [5] is quite a serious bug, btw, should be a 4.4.5 blocker. >> >> >> it’s from January and tehre are no comments there. Dana, any update? >> It does look serious but perhaps not really hit in real world scenarios? > > > It's reproducible outside OST, Dana is investigating intensively last 2 weeks, but so far we haven't been able to find where events are multiplied. > But AFAIK this issue is visble only with ansible-runner-service, respectively only in the part when engine fetches events from ARS and create > audit_log records from them. It doesn't mean, that playbook tasks are executed multiple times! Are you sure? Please check also bz 1930565. I closed it as a duplicate, but if you are certain that tasks are never executed multiple times, then we should reopen. I looked at the attached logs, and they were hard to follow (due to the duplication), but at least I didn't notice there any bug around imgbased. > OST is using ansible-runner directly, so it doesn't seem to me like the same issue Indeed. That's also what I currently think myself. It seems like a bug in OST. Still didn't manage to find the root cause - I am pushing my debugging changes to [1], for anyone that wants to follow, and: 1. run_tc basic-suite-master/test-scenarios/test_003_01_dummy.py > ~/run_tc/run-tc-$(date +%Y-%m-%d-%H%M%S).log 2>&1 2. cat $(ls -t1 run-tc* | head -1) | grep -Ev 'do_something_2.*something 2|do_something_1.*something 1' | grep -v 'Start' 3. Pick a line from the output (there are always a few, among 100 runs), check its timestamp, then look at pytest.log and search for (currently): $timestamp|something..|_run_ansible_runner|_run_ansible_runner.*before run|_run_ansible_runner.*after run|Starting functools.partial|.*test_dummy: Starting iteration.*|ModuleArgsMapper Hope to update tomorrow. For now, we can merge the workaround patch to run them serially, to unbreak basic-suite at this point.

OK. I now pushed two alternative patches that fix the bug. Please pick the one you like and merge. I tested both of them on [1] and both seem ok. shorter fix: https://gerrit.ovirt.org/c/ovirt-system-tests/+/113937 clearer fix, imo: https://gerrit.ovirt.org/c/ovirt-system-tests/+/113938 [1] https://github.com/didib/ovirt-system-tests/commits/add-test_003_01_dummy

...

[1] https://github.com/didib/ovirt-system-tests/commits/ansible-logs > > So [5] is not a blocker, it's just really annoying if you need to works with audit_log records > >> >>> >>> Best regards, >>> >>> [1] https://gerrit.ovirt.org/c/ovirt-system-tests/+/113875 >>> >>> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/159... >>> >>> [3] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >>> >>> [4] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >>> >>> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1917707 >>> >>> > >>> > Best regards, >>> > >>> > [1] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >>> > >>> > [2] https://jenkins.ovirt.org/job/ovirt-system-tests_basic-suite-master_night... >>> > >>> > >>> > -- >>> > Didi >>> >>> >>> >>> -- >>> Didi >>> >> _______________________________________________ >> Infra mailing list -- infra(a)ovirt.org >> To unsubscribe send an email to infra-leave(a)ovirt.org >> Privacy Statement: https://www.ovirt.org/privacy-policy.html >> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ >> List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/PSO5IRX73FS... >> >> >> _______________________________________________ >> Devel mailing list -- devel(a)ovirt.org >> To unsubscribe send an email to devel-leave(a)ovirt.org >> Privacy Statement: https://www.ovirt.org/privacy-policy.html >> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ >> List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/6DZRAMSAJCX... > > > > -- > Martin Perina > Manager, Software Engineering > Red Hat Czech s.r.o. -- Didi

-- Didi

1572

days inactive

1575

days old

infra@ovirt.org

Manage subscription

11 comments

4 participants

tags (0)

participants (4)

Dana Elfassy
Martin Perina
Michal Skrivanek
Yedidyah Bar David

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

basic suite fails on test_metrics_and_log_collector