On Mon, Jan 18, 2021 at 11:19 AM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
On 1/18/21 9:58 AM, Yedidyah Bar David wrote:
> On Mon, Jan 18, 2021 at 10:53 AM Martin Perina <mperina(a)redhat.com> wrote:
>>
>>
>> On Mon, Jan 18, 2021 at 9:08 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>> On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>> On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>>> On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>>>> On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>>>>> On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David <didi(a)redhat.com> wrote:
>>>>>>>> On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> my guess is it's selinux-related.
>>>>>>>>>
>>>>>>>>> Unfortunately I can't find any meaningful errors in audit.log in a
>>>>>>>>> scenario where host deployment fails.
>>>>>>>>> However, switching SELinux to permissive mode before adding hosts
>>>>>>>>> makes the problem go away, so it's probably not an error somewhere
>>>>>>>>> in the logic.
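>>>>>>>>>
>>>>>>>>> (For anyone reproducing this - the switch itself is the standard
>>>>>>>>> one, and since nothing shows up in audit.log it's worth also
>>>>>>>>> disabling the dontaudit rules before retrying; roughly this, from
>>>>>>>>> memory, not an exact transcript of the run:
>>>>>>>>>
>>>>>>>>>   setenforce 0                         # permissive until reboot
>>>>>>>>>   semodule -DB                         # stop hiding dontaudit'ed denials
>>>>>>>>>   ausearch -m avc,user_avc -ts recent  # look for AVCs again
>>>>>>>>>   semodule -B                          # restore dontaudit when done
>>>>>>>>>
>>>>>>>>> An empty audit.log with enforcing on is exactly the pattern you get
>>>>>>>>> when a denial is covered by a dontaudit rule.)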
>>>>>>>> It's getting weirder: Under strace, it succeeds:
>>>>>>>>
>>>>>>>>
>>>>>>>> https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
>>>>>>>>
>>>>>>>> (Can't see the actual log, as I didn't add '-A', so it was
>>>>>>>> overwritten on restart...)
>>>>>>> After updating it to use '-A' it indeed shows that it worked:
>>>>>>>
>>>>>>> 43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK <unfinished ...>
>>>>>>> 43664 14:16:55.997695 <... access resumed>) = 0
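>>>>>>>
>>>>>>> (The wrapping itself is nothing special - essentially the usual
>>>>>>> follow-forks invocation with timestamps and append mode, something
>>>>>>> like the line below; the output path is my choice, not the patch's:
>>>>>>>
>>>>>>>   strace -f -tt -A -o /var/log/engine-strace.log <engine command>
>>>>>>>
>>>>>>> '-f' follows forks (hence the pid prefix above), '-tt' gives the
>>>>>>> microsecond timestamps, and '-A' appends to the output file instead
>>>>>>> of truncating it, so the trace survives a service restart.)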
>>>>>>>
>>>>>>> Weird.
>>>>>>>
>>>>>>> Now ran 'ci test' in parallel for this patch and another one from
>>>>>>> master, for comparison:
>>>>>> Again, the same:
>>>>>>
>>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
>>>>>> With strace, passed,
>>>>>>
>>>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
>>>>>> Without strace, failed.
>>>>>>
>>>>>> Last nightly run that passed [1] used:
>>>>>>
>>>>>> ost-images-el8-host-installed-1-202101100446.x86_64
>>>>>> ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
>>>>>>
>>>>>> Trying now with these - not sure it's possible to put specific
>>>>>> versions inside automation/*packages, let's see:
>>>>>>
>>>>>> https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977
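>>>>>>
>>>>>> (By "specific versions" I mean replacing the bare package names in
>>>>>> the automation/*packages files with full NVRs - schematically, and
>>>>>> assuming the CI hands these straight to dnf:
>>>>>>
>>>>>>   ost-images-el8-host-installed-1-202101100446.x86_64
>>>>>>   ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
>>>>>>
>>>>>> instead of plain 'ost-images-el8-host-installed' and
>>>>>> 'ovirt-engine-appliance' - that's exactly what the patch tries.)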
>>>>> Indeed, with a fixed ost-images and removing updates, it passes. The
>>>>> network suite failed, but he-basic passed:
>>>>>
>>>>> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/149...
>>>>>
>>>>> So I am quite certain this is an OS issue. Not sure why we do not see
>>>>> this in basic-suite.
>>>>> Perhaps it's related to nested-kvm, or to load/slowness caused by
>>>>> that? Weird.
>>>>>
>>>>> When this fails, we do not collect all of the engine's /var/log, only
>>>>> messages and ovirt-engine/.
>>>>> So it's not easy to get a list of the packages that were updated.
>>>>>
>>>>> Pushed now:
>>>>>
>>>>> https://github.com/oVirt/ovirt-ansible-collection/pull/202
>>>>>
>>>>> to get all of the engine's /var/log, and ran a manual HE job with it:
>>>>>
>>>>> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-te...
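>>>>>
>>>>> (The gist of the PR, schematically - not its literal diff: instead of
>>>>> fetching only 'messages' and 'ovirt-engine/', grab everything, i.e.
>>>>> the equivalent of running this on the engine and fetching the result;
>>>>> the archive path is my choice, not the PR's:
>>>>>
>>>>>   tar czf /tmp/engine-var-log.tar.gz -C / var/log
>>>>>
>>>>> That way, next time it fails we also get dnf.log and friends, and can
>>>>> reconstruct which packages were updated.)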
>>>> This one I accidentally ran with the wrong repo, then ran another one
>>>> with the correct repo [1]. But:
>>>>
>>>> 1. The repo wasn't used. Emailed about this in a separate thread:
>>>> "manual job does not use custom repo".
>>>>
>>>> 2. It passed! This being what seems like a heisenbug, I understand why
>>>> running it under strace makes it behave differently. But merely
>>>> intending to collect more logs also makes it behave differently? :-)
>>>> This does not mean "problem solved" - the latest nightly run [2] did
>>>> fail with the same error.
>>> Status:
>>>
>>> 1. he-basic-suite is still failing.
>>>
>>> 2. The patch to collect all of /var/log from the engine was merged.
>>>
>>> Dana, can you please update? Have you made any progress?
>>>
>>> IMO it's an OS bug. If Marcin says it's an SELinux issue, I do not
>>> argue :-).
>>> So, how do we continue?
>>
>> Switching to CentOS Stream development/testing is a big effort; I'm not
>> sure we can do this and still deliver all the RFEs/bugs planned for
>> 4.4.5...
+1
> IMO we should now revert appliance and node to CentOS 8.3, and then
> continue the discussion.
> Having he-basic-suite broken for a week is too much.
+1. The testing infrastructure for Stream is here, but if it doesn't work
yet then let's stick to the plan and focus on 8.3.
Just to conclude the original issue: a workaround was found; the root
cause is still under investigation. Commented on the bugs (oVirt and
Stream) with details.
Best regards,
--
Didi