bz 1915329: [Stream] Add host fails with: Destination /etc/pki/ovirt-engine/requests not writable

Hi all,
Now filed $Subject [1].
Any clues are most welcome. Thanks.
Best regards,
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1915329
-- Didi

Hi,
my guess is it's selinux-related.
Unfortunately I can't find any meaningful errors in audit.log in a scenario where host deployment fails. However, switching selinux to permissive mode before adding hosts makes the problem go away, so it's probably not an error somewhere in the logic.
Regards, Marcin
On 1/12/21 1:54 PM, Yedidyah Bar David wrote:
> Hi all,
> Now filed $Subject [1].
> Any clues are most welcome. Thanks.
> Best regards,
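
For anyone reproducing this, a minimal sketch of the two checks discussed here: reading the current SELinux mode and asking the same "is this destination writable?" question that the error in the subject reports. It assumes the standard selinuxfs location /sys/fs/selinux/enforce; the function names are illustrative only and are not taken from oVirt code.

```python
import os
from pathlib import Path


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return 'enforcing' or 'permissive' based on the selinuxfs flag.

    Returns 'disabled' when the file cannot be read, which is what
    happens when selinuxfs is not mounted (assumption: default path).
    """
    try:
        value = Path(enforce_path).read_text().strip()
    except OSError:
        return "disabled"
    return "enforcing" if value == "1" else "permissive"


def is_writable(path):
    """Ask the kernel whether the current process may write to path.

    os.access() wraps access(2), so this is the same W_OK check that
    later shows up in the strace output in this thread.
    """
    return os.access(path, os.W_OK)
```

Running selinux_mode() and is_writable("/etc/pki/ovirt-engine/requests") on the engine, once in enforcing and once in permissive mode, would show whether the writability answer itself changes with the mode.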

On Wed, Jan 13, 2021 at 1:57 PM Marcin Sobczyk <msobczyk@redhat.com> wrote:
> my guess is it's selinux-related.
> Unfortunately I can't find any meaningful errors in audit.log in a scenario where host deployment fails. However switching selinux to permissive mode before adding hosts makes the problem go away, so it's probably not an error somewhere in logic.
It's getting weirder: under strace, it succeeds:
https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
(Can't see the actual log, as I didn't add '-A', so it was overwritten on restart...)
-- Didi

On Wed, Jan 13, 2021 at 2:48 PM Yedidyah Bar David <didi@redhat.com> wrote:
> It's getting weirder: Under strace, it succeeds:
> https://gerrit.ovirt.org/c/ovirt-system-tests/+/112948
> (Can't see the actual log, as I didn't add '-A', so it was overwritten on restart...)
After updating it to use '-A', it indeed shows that it worked:
43664 14:16:55.997639 access("/etc/pki/ovirt-engine/requests", W_OK <unfinished ...>
43664 14:16:55.997695 <... access resumed>) = 0
Weird.
Now ran in parallel 'ci test' for this patch and another one from master, for comparison:
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
-- Didi
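
The strace excerpt above is split into an "<unfinished ...>" line and a "<... access resumed>" line, which is how strace prints a call when output from another traced process interleaves. A rough sketch of stitching such pairs back together and pulling out the return value of access() for one path follows. It assumes a log where each line starts with a PID and a timestamp, as in the excerpt; the function names and regular expressions are illustrative, not part of strace or OST.

```python
import re

# "PID TIME name(args <unfinished ...>"
UNFINISHED = re.compile(r"^(\d+)\s+(\S+)\s+(\w+)\((.*?)\s*<unfinished \.\.\.>$")
# "PID TIME <... name resumed>rest"
RESUMED = re.compile(r"^(\d+)\s+(\S+)\s+<\.\.\. (\w+) resumed>\s*(.*)$")
# "PID TIME name(args) = ret"
RESULT = re.compile(r"^(\d+)\s+(\S+)\s+(\w+)\((.*)\)\s+=\s+(-?\d+)")


def stitch(lines):
    """Join unfinished/resumed pairs into single call lines, per PID."""
    pending = {}
    out = []
    for raw in lines:
        line = raw.strip()
        m = UNFINISHED.match(line)
        if m:
            pid, ts, name, args = m.groups()
            pending[(pid, name)] = (ts, args)
            continue
        m = RESUMED.match(line)
        if m:
            pid, _ts, name, rest = m.groups()
            start = pending.pop((pid, name), None)
            if start is not None:
                ts, args = start
                out.append(f"{pid} {ts} {name}({args}{rest}")
            continue
        if line:
            out.append(line)
    return out


def access_results(lines, path):
    """Return the integer results of every access() call mentioning path."""
    results = []
    for line in stitch(lines):
        m = RESULT.match(line)
        if m and m.group(3) == "access" and path in m.group(4):
            results.append(int(m.group(5)))
    return results
```

Feeding it the two lines above for "/etc/pki/ovirt-engine/requests" yields a single result of 0, matching the "it worked" reading; a failing run would be expected to show -1 there instead.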

On Wed, Jan 13, 2021 at 5:34 PM Yedidyah Bar David <didi@redhat.com> wrote:
> Now ran in parallel 'ci test' for this patch and another one from master, for comparison:
Again, the same:
> https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14916/
With strace, passed,
> https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1883/
Without strace, failed.
Last nightly run that passed [1] used:
ost-images-el8-host-installed-1-202101100446.x86_64
ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
Trying now with these - not sure it's possible to put specific versions inside automation/*packages, let's see:
https://gerrit.ovirt.org/c/ovirt-system-tests/+/112977
[1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1879/
-- Didi

On Thu, Jan 14, 2021 at 8:35 AM Yedidyah Bar David <didi@redhat.com> wrote:
> Last nightly run that passed [1] used:
> ost-images-el8-host-installed-1-202101100446.x86_64
> ovirt-engine-appliance-4.4-20210109182828.1.el8.x86_64
> Trying now with these - not sure it's possible to put specific versions inside automation/*packages, let's see:
Indeed, with a fixed ost-images and removing updates, it passes. The network suite failed, but he-basic passed:
https://jenkins.ovirt.org/job/ovirt-system-tests_standard-check-patch/14920/...
So I am quite certain this is an OS issue. Not sure how we do not see this in basic-suite. Perhaps it's related to nested-kvm, or to load/slowness caused by that? Weird.
When this fails, we do not collect all of the engine's /var/log, only messages and ovirt-engine/. So it's not easy to get a list of the packages that were updated.
Pushed now:
https://github.com/oVirt/ovirt-ansible-collection/pull/202
to get all of the engine's /var/log, and ran a manual HE job with it:
https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
> [1] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1879/
-- Didi
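
Once package lists from a passing and a failing engine are available (for example, saved "rpm -qa" output, which is an assumption about how they would be collected), finding the updated packages is a small diff. A sketch, assuming one "name-version-release.arch" string per package as in the two names quoted above; the helper names are hypothetical.

```python
def split_nvra(pkg):
    """Split 'name-version-release.arch' into (name, 'version-release.arch').

    The name itself may contain dashes, so split from the right.
    Raises ValueError for strings with fewer than two dashes.
    """
    name, version, release_arch = pkg.rsplit("-", 2)
    return name, f"{version}-{release_arch}"


def diff_packages(passing, failing):
    """Compare two package lists.

    Returns (changed, added, removed), where changed maps a package name
    to its (old, new) version strings. Packages installed in several
    versions at once (such as kernels) collapse to one entry here.
    """
    old = dict(split_nvra(p) for p in passing)
    new = dict(split_nvra(p) for p in failing)
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    return changed, added, removed
```

Anything that lands in changed or added between the last passing nightly and the first failing one would be a candidate for the regression.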

On Thu, Jan 14, 2021 at 1:41 PM Yedidyah Bar David <didi@redhat.com> wrote:
> Pushed now:
> https://github.com/oVirt/ovirt-ansible-collection/pull/202
> to get all of engine's /var/log, and ran manual HE job with it:
> https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
This one I accidentally ran with the wrong repo, then ran another one with the correct repo [1]. But:
1. The repo wasn't used. Emailed about this in a separate thread: "manual job does not use custom repo"
2. It passed!
Since this seems to be a heisenbug, I understand why running it under strace makes it behave differently. But even just intending to collect more logs also causes it to behave differently? :-)
This does not mean "problem solved" - the latest nightly run [2] did fail with the same error.
[1] https://jenkins.ovirt.org/view/oVirt%20system%20tests/job/ovirt-system-tests...
[2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1887/
-- Didi

On Sun, Jan 17, 2021 at 3:11 PM Yedidyah Bar David <didi@redhat.com> wrote:
> This does not mean that "problem solved" - latest nightly run [2] did fail with the same error.
Status:
1. he-basic-suite is still failing.
2. Patch to collect all of /var/log from the engine merged.
Dana, can you please update? Did you have any progress?
IMO it's an OS bug. If Marcin says it's an selinux issue, I do not argue :-). So, how do we continue?
> [2] https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/1887/
-- Didi

On Mon, Jan 18, 2021 at 9:08 AM Yedidyah Bar David <didi@redhat.com> wrote:
> Status:
> 1. he-basic-suite is still failing.
> 2. Patch to collect all of /var/log from the engine merged.
> Dana, can you please update? Did you have any progress?
> IMO it's an OS bug. If Marcin says it's an selinux issue, I do not argue :-). So, how do we continue?
Switching to CentOS Stream development/testing is a big effort; I'm not sure we can do this and still deliver all the RFEs/bugs planned for 4.4.5 ...
--
Martin Perina
Manager, Software Engineering
Red Hat Czech s.r.o.

On Mon, Jan 18, 2021 at 10:53 AM Martin Perina <mperina@redhat.com> wrote:
> Switching to CentOS Stream development/testing is a big effort, I'm not sure we can do this and still deliver all the RFEs/bugs planned for 4.4.5 ...
IMO we should now revert appliance and node to CentOS 8.3, and then continue the discussion. Having he-basic-suite broken for a week is too much.
-- Didi

On 1/18/21 9:58 AM, Yedidyah Bar David wrote:
> On Mon, Jan 18, 2021 at 10:53 AM Martin Perina <mperina@redhat.com> wrote:
>> Switching to CentOS Stream development/testing is a big effort, I'm not sure we can do this and still deliver all the RFEs/bugs planned for 4.4.5 ...
+1
> IMO we should now revert appliance and node to CentOS 8.3, and then continue the discussion. Having he-basic-suite broken for a week is too much.
+1
The testing infrastructure for Stream is here, but if it doesn't work yet, then let's stick to the plan and focus on 8.3.

On Mon, Jan 18, 2021 at 11:19 AM Marcin Sobczyk <msobczyk@redhat.com> wrote:
> The testing infrastructure for Stream is here, but if it doesn't work yet than let's stick to the plan and focus on 8.3.
Just to conclude the original issue - a workaround was found; the root cause is still under investigation. Commented on the bugs (oVirt and Stream) with details.
Best regards,
-- Didi
participants (3)
- Marcin Sobczyk
- Martin Perina
- Yedidyah Bar David