On Mon, Jan 18, 2021 at 10:53 AM Martin Perina
<mperina(a)redhat.com> wrote:
>
>
> On Mon, Jan 18, 2021 at 9:08 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>> On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>> On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>> On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>>> On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>>>> On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>>>>> On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> my guess is it's selinux-related.
>>>>>>>>
>>>>>>>> Unfortunately I can't find any meaningful errors in audit.log in a
>>>>>>>> scenario where host deployment fails.
>>>>>>>> However switching selinux to permissive mode before adding hosts makes
>>>>>>>> the problem go away, so it's probably not an error somewhere in logic.
>>>>>>> It's getting weirder: Under strace, it succeeds:
>>>>>>>
>>>>>>> https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
>>>>>>>
>>>>>>> (Can't see the actual log, as I didn't add '-A', so it was overwritten
>>>>>>> on restart...)
>>>>>> After updating it to use '-A' it indeed shows that it worked:
>>>>>>
>>>>>> 43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK <unfinished ...>
>>>>>> 43664 14:16:55.997695 <... access resumed>) = 0
>>>>>>
>>>>>> Weird.
>>>>>>
>>>>>> Now ran in parallel 'ci test' for this patch and another one from
>>>>>> master, for comparison:
>>>>> Again, the same:
>>>>>
>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
>>>>> With strace, passed,
>>>>>
>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
>>>>> Without strace, failed.
>>>>>
>>>>> Last nightly run that passed [1] used:
>>>>>
>>>>> ost-images-el8-host-installed-1-202101100446.x86_64
>>>>> ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
>>>>>
>>>>> Trying now with these - not sure it's possible to put specific versions
>>>>> inside automation/*packages, let's see:
>>>>>
>>>>> https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977
>>>> Indeed, with a fixed ost-images and removing updates, it passes. network
>>>> suite failed, but he-basic passed:
>>>>
>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/149...
>>>>
>>>> So I am quite certain this is an OS issue. Not sure how we do not see
>>>> this in basic-suite.
>>>> Perhaps it's related to nested-kvm, or to load/slowness caused by
>>>> that? Weird.
>>>>
>>>> When this fails, we do not collect all of the engine's /var/log, only
>>>> messages and ovirt-engine/ .
>>>> So it's not easy to get a list of the packages that were updated.
>>>>
>>>> Pushed now:
>>>>
>>>> https://github.com/oVirt/ovirt-ansible-collection/pull/202
>>>>
>>>> to get all of engine's /var/log, and ran manual HE job with it:
>>>>
>>>> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-te...
>>> This one I accidentally ran with the wrong repo, then ran another one
>>> with the correct repo [1],
>>> But:
>>>
>>> 1. The repo wasn't used. Emailed about this in a separate thread:
>>> "manual job does not use custom repo"
>>>
>>> 2. It passed! Being what seems like a heisenbug, I understand why when
>>> you run it under strace it
>>> works differently. But even if you just intend to collect more logs it
>>> also causes it to behave
>>> differently? :-) This does not mean that "problem solved" - latest
>>> nightly run [2] did fail with
>>> the same error.
>> Status:
>>
>> 1. he-basic-suite is still failing.
>>
>> 2. Patch to collect all of /var/log from the engine merged.
>>
>> Dana, can you please update? Have you made any progress?
>>
>> IMO it's an OS bug. If Marcin says it's an selinux issue, I do not
>> argue :-).
>> So, how do we continue?
>
> Switching to CentOS Stream development/testing is a big effort, I'm not sure
> we can do this and still deliver all the RFEs/bugs planned for 4.4.5 ...
IMO we should now revert appliance and node to CentOS 8.3, and then
continue the discussion.
Having he-basic-suite broken for a week is too much.
+1. The testing infrastructure for Stream is here, but if it doesn't work
yet, then let's stick to the plan and focus on 8.3.