[OST] Network suites fail CI builds

I had many failures in recent OST patches, so I posted this change: https://gerrit.ovirt.org/c/112174/
This patch does not change anything, but it modifies the lago vm configuration so it triggers 8 jobs. 2 network test suites failed: https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Can someone look at the network suite failures?
If these suites are not stable, we should not include them in the CI for OST patches, or mark them as expected failures so they do not fail the build.
I triggered another build since I see a lot of random failures in other suites.
Nir

On Thu, Nov 12, 2020 at 4:01 PM Nir Soffer <nsoffer@redhat.com> wrote:
I had many failures in recent OST patches, so I posted this change: https://gerrit.ovirt.org/c/112174/
This patch does not change anything, but it modifies the lago vm configuration so it triggers 8 jobs. 2 network test suites failed: https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Can someone look at the network suite failures?
If these suites are not stable, we should not include them in the CI for OST patches, or mark them as expected failures so they do not fail the build.
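One way to read the second option is as a gating rule applied when the build verdict is computed. The following is only a sketch of that idea in Python, not how the OST Jenkins jobs are actually configured; `build_passed`, `unexpected_failures`, and the pass/fail values are hypothetical.

```python
from typing import Dict, List, Set


def build_passed(results: Dict[str, bool], non_gating: Set[str]) -> bool:
    """Return True if every gating suite passed.

    results maps a suite name to True (passed) or False (failed).
    Suites named in non_gating still run and are reported, but a
    failure in one of them does not fail the build.
    """
    return all(ok for suite, ok in results.items() if suite not in non_gating)


def unexpected_failures(results: Dict[str, bool], non_gating: Set[str]) -> List[str]:
    """List the failed suites that are supposed to gate the build."""
    return sorted(s for s, ok in results.items() if not ok and s not in non_gating)


# Illustrative values only (assumption): not taken from a real build.
results = {
    "basic_suite_master.el7.x86_64": True,
    "network_suite_master.el7.x86_64": False,
    "network_suite_4.3.el7.x86_64": False,
}
non_gating = {
    "network_suite_master.el7.x86_64",
    "network_suite_4.3.el7.x86_64",
}

print(build_passed(results, non_gating))      # network failures are tolerated
print(unexpected_failures(results, set()))    # what would fail with no exemptions
```

The trade-off is that a non-gating suite can break for real without anyone noticing, so its failures would still need to be reported somewhere visible.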
I triggered another build since I see a lot of random failures in other suites.
On the next build - different errors:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_4.3.el7.x86_64 - failed
- basic_suite_master.el7.x86_64 - failed
- network_suite_4.3.el7.x86_64 - failed
- network_suite_master.el7.x86_64 - failed
With the current state OST CI is not useful to anyone. Builds take hours and fail randomly. This wastes our limited resources for other projects and makes contribution to this project very hard.

On Thu, Nov 12, 2020 at 5:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Nov 12, 2020 at 4:01 PM Nir Soffer <nsoffer@redhat.com> wrote:
I had many failures in recent OST patches, so I posted this change: https://gerrit.ovirt.org/c/112174/
This patch does not change anything, but it modifies the lago vm configuration so it triggers 8 jobs. 2 network test suites failed:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Can someone look at the network suite failures?
The network suite has indeed been failing randomly recently. More often than not it was due to timeouts while waiting for connections to the hosts, timeouts while waiting for hosts to reach desired statuses, and in the above I also see what looks like a sock error on port 22. Not only are the failing tests random, but the next nightly usually passes. This leads me to believe that the cause of the failures is outside the scope of the test code.
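For readers unfamiliar with this failure mode, here is a minimal sketch of the kind of polling wait that produces such timeouts. It is not the OST helper; `wait_for`, `WaitTimeout`, and the parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


class WaitTimeout(Exception):
    """Raised when the condition did not become true before the deadline."""


def wait_for(
    condition: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll condition() until it returns True; return the elapsed seconds.

    Raises WaitTimeout if the condition is still false once `timeout`
    seconds have passed. clock and sleep are injectable so the helper
    can be exercised without real waiting.
    """
    start = clock()
    deadline = start + timeout
    while True:
        if condition():
            return clock() - start
        if clock() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        sleep(interval)
```

A check like this fails whenever the environment is slower than the chosen timeout, even when the code under test is correct, which matches failures that move between tests and disappear on the next run.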
If these suites are not stable, we should not include them in the CI for OST patches, or mark them as expected failures so they do not fail the build.
I triggered another build since I see a lot of random failures in other suites.
On the next build - different errors:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_4.3.el7.x86_64 - failed
- basic_suite_master.el7.x86_64 - failed
- network_suite_4.3.el7.x86_64 - failed
- network_suite_master.el7.x86_64 - failed
With the current state OST CI is not useful to anyone. Builds take hours and fail randomly. This wastes our limited resources for other projects and makes contribution to this project very hard.
+1
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/JBA4FBJN2N6MMH...

On Thu, Nov 12, 2020 at 9:24 PM Eitan Raviv <eraviv@redhat.com> wrote:
On Thu, Nov 12, 2020 at 5:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Nov 12, 2020 at 4:01 PM Nir Soffer <nsoffer@redhat.com> wrote:
I had many failures in recent OST patches, so I posted this change: https://gerrit.ovirt.org/c/112174/
This patch does not change anything, but it modifies the lago vm configuration so it triggers 8 jobs. 2 network test suites failed: https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Can someone look at the network suite failures?
The network suite has indeed been failing randomly recently. More often than not it was due to timeouts while waiting for connections to the hosts, timeouts while waiting for hosts to reach desired statuses, and in the above I also see what looks like a sock error on port 22. Not only are the failing tests random, but the next nightly usually passes. This leads me to believe that the cause of the failures is outside the scope of the test code.
I noticed something similar as well - see thread: [oVirt Jenkins] ovirt-system-tests_basic-suite-master_nightly - Build # 561 - Failure!
If these suites are not stable, we should not include them in the CI for OST patches, or mark them as expected failures so they do not fail the build.
I triggered another build since I see a lot of random failures in other suites.
On the next build - different errors:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_4.3.el7.x86_64 - failed
- basic_suite_master.el7.x86_64 - failed
- network_suite_4.3.el7.x86_64 - failed
- network_suite_master.el7.x86_64 - failed
With the current state OST CI is not useful to anyone. Builds take hours and fail randomly. This wastes our limited resources for other projects and makes contribution to this project very hard.
+1
_______________________________________________ Devel mailing list -- devel@ovirt.org To unsubscribe send an email to devel-leave@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/ZVFQOUXWCREYAH...
-- Didi

On Sun, Nov 15, 2020 at 12:28 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Thu, Nov 12, 2020 at 9:24 PM Eitan Raviv <eraviv@redhat.com> wrote:
On Thu, Nov 12, 2020 at 5:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Nov 12, 2020 at 4:01 PM Nir Soffer <nsoffer@redhat.com> wrote:
I had many failures in recent OST patches, so I posted this change: https://gerrit.ovirt.org/c/112174/
This patch does not change anything, but it modifies the lago vm configuration so it triggers 8 jobs. 2 network test suites failed: https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Can someone look at the network suite failures?
The network suite has indeed been failing randomly recently. More often than not it was due to timeouts while waiting for connections to the hosts, timeouts while waiting for hosts to reach desired statuses, and in the above I also see what looks like a sock error on port 22. Not only are the failing tests random, but the next nightly usually passes. This leads me to believe that the cause of the failures is outside the scope of the test code.
I noticed something similar as well - see thread:
[oVirt Jenkins] ovirt-system-tests_basic-suite-master_nightly - Build # 561 - Failure!
This is not only the network suites; a lot of other suites fail randomly.
Regarding the network suites - can it be related to the old kernel when running the tests in mock on an el7 host? Do we need to require an el8 host?
Do we see the same failures when running the network suites locally?
If these suites are not stable, we should not include them in the CI for OST patches, or mark them as expected failures so they do not fail the build.
I triggered another build since I see a lot of random failures in other suites.
On the next build - different errors:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_4.3.el7.x86_64 - failed
- basic_suite_master.el7.x86_64 - failed
- network_suite_4.3.el7.x86_64 - failed
- network_suite_master.el7.x86_64 - failed
Third build failed:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Failing suites:
- basic_suite_master.el7.x86_64
- network_suite_master.el7.x86_64
Looks like all failures happen with el7. Why are we running master (el8 based) on el7 hosts?
The basic master suite never failed when I ran it locally, even in a nested environment. But maybe I did not try enough; I did 10 runs.
With the current state OST CI is not useful to anyone. Builds take hours and fail randomly. This wastes our limited resources for other projects and makes contribution to this project very hard.
+1

On Sun, Nov 15, 2020 at 4:13 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Nov 15, 2020 at 12:28 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Thu, Nov 12, 2020 at 9:24 PM Eitan Raviv <eraviv@redhat.com> wrote:
On Thu, Nov 12, 2020 at 5:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Nov 12, 2020 at 4:01 PM Nir Soffer <nsoffer@redhat.com> wrote:
I had many failures in recent OST patches, so I posted this change: https://gerrit.ovirt.org/c/112174/
This patch does not change anything, but it modifies the lago vm configuration so it triggers 8 jobs. 2 network test suites failed: https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Can someone look at the network suite failures?
The network suite has indeed been failing randomly recently. More often than not it was due to timeouts while waiting for connections to the hosts, timeouts while waiting for hosts to reach desired statuses, and in the above I also see what looks like a sock error on port 22. Not only are the failing tests random, but the next nightly usually passes. This leads me to believe that the cause of the failures is outside the scope of the test code.
I noticed something similar as well - see thread:
[oVirt Jenkins] ovirt-system-tests_basic-suite-master_nightly - Build # 561 - Failure!
This is not only the network suites; a lot of other suites fail randomly.
Regarding the network suites - can it be related to the old kernel when running the tests in mock on an el7 host? Do we need to require an el8 host?
Do we see the same failures when running the network suites locally?
If these suites are not stable, we should not include them in the CI for OST patches, or mark them as expected failures so they do not fail the build.
I triggered another build since I see a lot of random failures in other suites.
On the next build - different errors:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_4.3.el7.x86_64 - failed
- basic_suite_master.el7.x86_64 - failed
- network_suite_4.3.el7.x86_64 - failed
- network_suite_master.el7.x86_64 - failed
Third build failed:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Failing suites:
- basic_suite_master.el7.x86_64
- network_suite_master.el7.x86_64
Fourth build failed:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_master.el7.x86_64
- upgrade-from-release_suite_4.3.el7.x86_64
Looks like all failures happen with el7. Why are we running master (el8 based) on el7 hosts?
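The failing jobs listed above for the second, third, and fourth builds can be tallied to check this observation. The sketch below assumes the job names follow a `<suite>.<distro>.<arch>` pattern, which is inferred from the names in this thread rather than from any documented convention. The first build is left out because its failing suites are not named here, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Failing jobs as reported in this thread.
failed_jobs: Dict[str, List[str]] = {
    "second build": [
        "basic_suite_4.3.el7.x86_64",
        "basic_suite_master.el7.x86_64",
        "network_suite_4.3.el7.x86_64",
        "network_suite_master.el7.x86_64",
    ],
    "third build": [
        "basic_suite_master.el7.x86_64",
        "network_suite_master.el7.x86_64",
    ],
    "fourth build": [
        "basic_suite_master.el7.x86_64",
        "upgrade-from-release_suite_4.3.el7.x86_64",
    ],
}


def split_job(job: str) -> List[str]:
    """Split '<suite>.<distro>.<arch>' from the right, so that the dot
    inside a version such as '4.3' stays part of the suite name."""
    return job.rsplit(".", 2)


def failures_by_distro(builds: Dict[str, List[str]]) -> Counter:
    return Counter(split_job(job)[1] for jobs in builds.values() for job in jobs)


def failures_by_suite(builds: Dict[str, List[str]]) -> Counter:
    return Counter(split_job(job)[0] for jobs in builds.values() for job in jobs)


print(failures_by_distro(failed_jobs))
print(failures_by_suite(failed_jobs).most_common())
```

A tally like this only shows where the failures landed. It says nothing about el8 unless the same suites were also run there.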
The basic master suite never failed when I ran it locally, even in a nested environment. But maybe I did not try enough; I did 10 runs.
With the current state OST CI is not useful to anyone. Builds take hours and fail randomly. This wastes our limited resources for other projects and makes contribution to this project very hard.
+1

On 15 Nov 2020, at 17:40, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Nov 15, 2020 at 4:13 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Nov 15, 2020 at 12:28 PM Yedidyah Bar David <didi@redhat.com> wrote:
On Thu, Nov 12, 2020 at 9:24 PM Eitan Raviv <eraviv@redhat.com> wrote:
On Thu, Nov 12, 2020 at 5:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Thu, Nov 12, 2020 at 4:01 PM Nir Soffer <nsoffer@redhat.com> wrote:
I had many failures in recent OST patches, so I posted this change: https://gerrit.ovirt.org/c/112174/
This patch does not change anything, but it modifies the lago vm configuration so it triggers 8 jobs. 2 network test suites failed: https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Can someone look at the network suite failures?
The network suite has indeed been failing randomly recently. More often than not it was due to timeouts while waiting for connections to the hosts, timeouts while waiting for hosts to reach desired statuses, and in the above I also see what looks like a sock error on port 22. Not only are the failing tests random, but the next nightly usually passes. This leads me to believe that the cause of the failures is outside the scope of the test code.
I noticed something similar as well - see thread:
[oVirt Jenkins] ovirt-system-tests_basic-suite-master_nightly - Build # 561 - Failure!
This is not only the network suites; a lot of other suites fail randomly.
Regarding the network suites - can it be related to the old kernel when running the tests in mock on an el7 host? Do we need to require an el8 host?
Do we see the same failures when running the network suites locally?
If these suites are not stable, we should not include them in the CI for OST patches, or mark them as expected failures so they do not fail the build.
I triggered another build since I see a lot of random failures in other suites.
On the next build - different errors:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_4.3.el7.x86_64 - failed
This one should be removed, we do not maintain 4.3 anymore
- basic_suite_master.el7.x86_64 - failed
This is obsoleted by ost-images and el8 runs, as you do locally. We are waiting to get rid of el7 jenkins slaves and mock env there
- network_suite_4.3.el7.x86_64 - failed
Should be removed
- network_suite_master.el7.x86_64 - failed
Will be obsoleted once network suite completes move to ost-images/el8
Third build failed:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
Failing suites:
- basic_suite_master.el7.x86_64
- network_suite_master.el7.x86_64
Fourth build failed:
https://jenkins.ovirt.org/blue/organizations/jenkins/ovirt-system-tests_stan...
- basic_suite_master.el7.x86_64
- upgrade-from-release_suite_4.3.el7.x86_64
This should be removed
Looks like all failures happen with el7. Why are we running master (el8 based) on el7 hosts?
Progress with our CI env is very slow indeed
The basic master suite never failed when I ran it locally, even in a nested environment. But maybe I did not try enough; I did 10 runs.
Yes, ost-image based runs are way more reliable. Currently this applies only to master basic suite.
With the current state OST CI is not useful to anyone. Builds take hours and fail randomly. This wastes our limited resources for other projects and makes contribution to this project very hard.
+1
_______________________________________________ Devel mailing list -- devel@ovirt.org To unsubscribe send an email to devel-leave@ovirt.org Privacy Statement: https://www.ovirt.org/privacy-policy.html oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/ List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/GTHOJADMXAQDOU...
participants (4)
- Eitan Raviv
- Michal Skrivanek
- Nir Soffer
- Yedidyah Bar David