On Wed, Mar 17, 2021 at 5:38 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
On Wed, Mar 17, 2021 at 4:48 PM Martin Perina <mperina(a)redhat.com> wrote:
>
>
>
> On Wed, Mar 17, 2021 at 3:24 PM Michal Skrivanek <michal.skrivanek(a)redhat.com> wrote:
>>
>>
>>
>> On 17. 3. 2021, at 13:53, Dana Elfassy <delfassy(a)redhat.com> wrote:
>>
>> Adding +Marcin Sobczyk
>>
>> On Mon, Mar 15, 2021 at 9:59 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>
>>> On Mon, Mar 15, 2021 at 7:55 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > This started a few days ago [1] and randomly happens since then:
>>> >
>>> > E DEBUG: Configuration:
>>> > E DEBUG: command: collect
>>> > E DEBUG: Traceback (most recent call last):
>>> > E DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
>>> > E DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"])
>>> > E DEBUG: Exception: /dev/shm/log directory is not empty.ERROR: /dev/shm/log directory is not empty.non-zero return code
>>> >
>>> > Michal tried to fix this by using a random directory but it still fails [2]:
>>> >
>>> > DEBUG: command: collect
>>> > DEBUG: Traceback (most recent call last):
>>> > DEBUG: File "/usr/lib/python3.6/site-packages/ovirt_log_collector/__main__.py", line 2067, in <module>
>>> > DEBUG: '%s directory is not empty.' % (conf["local_tmp_dir"])
>>> > DEBUG: Exception: /dev/shm/kaN7uY directory is not empty.ERROR: /dev/shm/kaN7uY directory is not empty.non-zero return code
>>> >
>>> > Since I suppose that the randomness of mktemp is good enough, it must
>>> > be something else. Also, the last successful run before [1] used the
>>> > same OST git commit (same code), so I do not think it's something in
>>> > OST's code.
>>> >
>>> > Any idea?
>>> >
>>> > I think I'll push a patch to create and use the directory right before
>>> > calling ovirt-log-collector, which is probably better in other ways.
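For reference, here is a minimal sketch of what the failing check at
ovirt_log_collector/__main__.py line 2067 presumably looks like. It is
reconstructed only from the traceback quoted above, not taken from the
actual ovirt-log-collector source, and the function name is made up:

    # Hedged reconstruction from the traceback; everything except
    # conf["local_tmp_dir"] and the exception text is an assumption.
    import os

    def check_local_tmp_dir(conf):
        # Refuse to run if the working directory already has content in it.
        if os.path.isdir(conf["local_tmp_dir"]) and os.listdir(conf["local_tmp_dir"]):
            raise Exception(
                '%s directory is not empty.' % (conf["local_tmp_dir"]))

In other words, the failure means the directory passed via --local-tmp
already has content in it by the time this check runs.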
>>>
>>> My patch [1] still fails, with a somewhat different error message, but
>>> this made me check further, and while I still do not understand, I have
>>> this to add:
>>>
>>> In the failing runs, ovirt-log-collector is called *twice* in parallel. E.g.
>>> in [2] (the check-patch of [1]):
>>>
>>> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59099]:
>>> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p
>>> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors
>>> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf
>>> _uses_shell=True warn=True stdin_add_newline=True
>>> strip_empty_ends=True argv=None chdir=None executable=None
>>> creates=None removes=None stdin=None
>>> Mar 15 07:38:59 lago-basic-suite-master-engine platform-python[59124]:
>>> ansible-command Invoked with _raw_params=lctmp=$(mktemp -d -p
>>> /dev/shm); ovirt-log-collector --verbose --batch --no-hypervisors
>>> --local-tmp="${lctmp}" --conf-file=/root/ovirt-log-collector.conf
>>> _uses_shell=True warn=True stdin_add_newline=True
>>> strip_empty_ends=True argv=None chdir=None executable=None
>>> creates=None removes=None stdin=None
>>>
>>> It also generates two logs, which you can check/compare.
>>>
>>> It's the same for previous ones, e.g. latest nightly [3][4]:
>>>
>>> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59343]:
>>> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose
>>> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf
>>> _uses_shell=True warn=True stdin_add_newline=True
>>> strip_empty_ends=True argv=None chdir=None executable=None
>>> creates=None removes=None stdin=None
>>> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]:
>>> SELinux is preventing /usr/lib/systemd/systemd from unlink access on
>>> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root. For
>>> complete SELinux messages run: sealert -l
>>> d03a8655-9430-4fcf-9892-3b4df1939899
>>> Mar 15 06:23:30 lago-basic-suite-master-engine setroubleshoot[58889]:
>>> SELinux is preventing /usr/lib/systemd/systemd from unlink access on
>>> the sock_file ansible-ssh-lago-basic-suite-master-host-1-22-root.#012#012*****
>>> Plugin catchall (100. confidence) suggests
>>> **************************#012#012If you believe that systemd should
>>> be allowed unlink access on the
>>> ansible-ssh-lago-basic-suite-master-host-1-22-root sock_file by
>>> default.#012Then you should report this as a bug.#012You can generate
>>> a local policy module to allow this access.#012Do#012allow this access
>>> for now by executing:#012# ausearch -c 'systemd' --raw | audit2allow
>>> -M my-systemd#012# semodule -X 300 -i my-systemd.pp#012
>>> Mar 15 06:23:30 lago-basic-suite-master-engine platform-python[59361]:
>>> ansible-command Invoked with _raw_params=ovirt-log-collector --verbose
>>> --batch --no-hypervisors --conf-file=/root/ovirt-log-collector.conf
>>> _uses_shell=True warn=True stdin_add_newline=True
>>> strip_empty_ends=True argv=None chdir=None executable=None
>>> creates=None removes=None stdin=None
>>>
>>> Any idea what might have caused this to start happening? Perhaps
>>> a bug in ansible, or ansible-runner? It reminds me of [5].
>>> Adding Dana and Martin.
>>>
>>> I think [5] is quite a serious bug, btw; it should be a 4.4.5 blocker.
>>
>>
>> it’s from January and there are no comments there. Dana, any update?
>> It does look serious, but perhaps it's not really hit in real-world scenarios?
>
>
> It's reproducible outside OST. Dana has been investigating it intensively for the
> last 2 weeks, but so far we haven't been able to find where events are multiplied.
> But AFAIK this issue is visible only with ansible-runner-service, specifically only
> in the part where the engine fetches events from ARS and creates
> audit_log records from them. It doesn't mean that playbook tasks are executed
> multiple times!
Are you sure? Please check also bz 1930565. I closed it as a
duplicate, but if you are certain that tasks are never executed
multiple times, then we should reopen. I looked at the attached logs,
and they were hard to follow (due to the duplication), but at least I
didn't notice any bug around imgbased there.
> OST is using ansible-runner directly, so it doesn't seem to me like the same issue.
Indeed. That's also what I currently think myself. It seems like a bug
in OST. I still haven't managed to find the root cause. I am pushing my
debugging changes to [1], for anyone who wants to follow, and:
1. run_tc basic-suite-master/test-scenarios/test_003_01_dummy.py >
~/run_tc/run-tc-$(date +%Y-%m-%d-%H%M%S).log 2>&1
2. cat $(ls -t1 run-tc* | head -1) | grep -Ev
'do_something_2.*something 2|do_something_1.*something 1' | grep -v
'Start'
3. Pick a line from the output (there are always a few, among 100
runs), check its timestamp, then look at pytest.log and search for
(currently):
$timestamp|something..|_run_ansible_runner|_run_ansible_runner.*before
run|_run_ansible_runner.*after run|Starting
functools.partial|.*test_dummy: Starting iteration.*|ModuleArgsMapper
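For anyone who wants to reproduce the pattern outside OST, below is a
hypothetical, self-contained analogue of the dummy test. It is not the
actual test_003_01_dummy.py; the function names and log messages are
assumptions modelled on the grep patterns above, and it uses a plain
thread pool instead of OST's own helpers:

    # Hypothetical stand-in for test_003_01_dummy.py, not the real OST test.
    # The idea: run two trivial steps in parallel for 100 iterations, then
    # grep the log for anything other than the expected single line per step.
    import concurrent.futures
    import logging

    logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')

    def do_something_1():
        logging.debug('do_something_1: something 1')

    def do_something_2():
        logging.debug('do_something_2: something 2')

    def test_dummy():
        for i in range(100):
            logging.debug('test_dummy: Starting iteration %d', i)
            with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
                for f in [pool.submit(do_something_1), pool.submit(do_something_2)]:
                    f.result()

If a step ever logs more than once per iteration, that shows the same
kind of duplicated invocation seen with ovirt-log-collector above.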
Hope to update tomorrow. For now, we can merge the workaround patch to
run them serially, to unbreak basic-suite at this point.
OK. I now pushed two alternative patches that fix the bug. Please pick
the one you like and merge. I tested both of them on [1] and both seem ok.
shorter fix:
https://gerrit.ovirt.org/c/ovirt-system-tests/+/113937
clearer fix, imo:
https://gerrit.ovirt.org/c/ovirt-system-tests/+/113938
[1]
https://github.com/didib/ovirt-system-tests/commits/add-test_003_01_dummy
--
Didi