Removing big (hc-* and metrics) suites from OST's check-patch (How to make OST faster in CI)

If you have been using or monitoring any OST suites recently, you may have noticed we've been suffering from long delays in allocating CI hardware resources for running OST suites. I'd like to briefly discuss the reasons behind this, what we are planning to do to resolve it, and the implications of those actions for big suite owners.

As you might know, we moved a while ago from running each OST suite on its own dedicated server to running them inside containers managed by OpenShift. That allowed us to run multiple OST suites on the same bare-metal host, which in turn increased our overall capacity by 50% while still allowing us to free up hardware for accommodating the kubevirt project on our CI hardware.

Our infrastructure is currently built in a way where we use the exact same POD specification (and therefore resource settings) for all suites. Making it more flexible at this point would require significant code changes we are not likely to make. What this means is that we need to make sure our PODs have enough resources to run the most demanding suites. It also means we waste some resources when running less demanding ones.

Given the set of OST suites we have ATM, we sized our PODs to allocate 32 GiB of RAM. Given the servers we have, this means we can run 15 suites at a time in parallel. This was sufficient for a while, but given increasing demand, and the expectation for it to increase further once we introduce the patch gating features we've been working on, we must find a way to significantly increase our suite running capacity.

We have measured the amount of RAM required by each suite and came to the conclusion that for the vast majority of suites, we could settle for PODs that allocate only 14 GiB of RAM. If we make that change, we would be able to run a total of 40 suites at a time, almost tripling our current capacity.

The downside of making this change is that our STDCI V2 infrastructure will no longer be able to run suites that require more than 14 GiB of RAM. This effectively means it would no longer be possible to run these suites from OST's check-patch job or from the OST manual job.

The list of suites that would be affected follows. The suite owners, as documented in the CI configuration, have been added as "to" recipients of this message:

- hc-basic-suite-4.3
- hc-basic-suite-master
- metrics-suite-4.3

Since we're aware people would still like to be able to work with the bigger suites, we will leverage the nightly suite invocation jobs to enable them to be run in the CI infra. We will support the following use cases:

- *Periodically running the suite on the latest oVirt packages* - this will be done by the nightly job like it is done today
- *Running the suite to test changes to the suite's code* - while currently this is done automatically by check-patch, in the future this would have to be done by manually triggering the nightly job and setting the REFSPEC parameter to point to the examined patch
- *Triggering the suite manually* - this would be done by triggering the suite-specific nightly job (as opposed to the general OST manual job)

The patches listed below implement the changes outlined above:

- 102757 <https://gerrit.ovirt.org/102757>: nightly-system-tests: big suits -> big containers
- 102771 <https://gerrit.ovirt.org/102771>: stdci: Drop `big` suits from check-patch

We know that making the changes we presented will make things a little less convenient for users and maintainers of the big suites, but we believe the benefits of having vastly increased execution capacity for all other suites outweigh those shortcomings.

We would like to hear all relevant comments and questions from the suite owners and other interested parties, especially if you think we should not carry out the changes we propose. Please take the time to respond on this thread, or on the linked patches.

Thanks,

--
Barak Korren
RHV DevOps team, RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
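
The capacity figures in the message above (15 suites at 32 GiB, 40 suites at 14 GiB) can be checked with a small packing calculation. This is only a sketch: the thread does not say how many servers there are or how much RAM each one has, so the fleet below (five hosts with 120 GiB usable each) is a hypothetical one chosen solely because it reproduces both numbers. It mainly illustrates why shrinking the POD size can raise capacity by more than the ratio 32/14 would suggest: smaller PODs leave less unusable remainder on each host.

```python
# Hypothetical fleet: 5 hosts with 120 GiB of RAM usable for PODs each.
# These values are NOT from the thread; they are an assumption picked
# because they reproduce the 15 and 40 figures quoted in the message.
HYPOTHETICAL_HOSTS_GIB = [120] * 5


def pods_per_host(usable_gib: int, pod_gib: int) -> int:
    """Number of whole PODs of size pod_gib that fit on one host."""
    return usable_gib // pod_gib


def total_capacity(hosts_usable_gib: list[int], pod_gib: int) -> int:
    """Total number of PODs (one suite each) the fleet can run at once."""
    return sum(pods_per_host(h, pod_gib) for h in hosts_usable_gib)


before = total_capacity(HYPOTHETICAL_HOSTS_GIB, 32)  # 3 PODs per host
after = total_capacity(HYPOTHETICAL_HOSTS_GIB, 14)   # 8 PODs per host

print(f"32 GiB PODs: {before} suites in parallel")
print(f"14 GiB PODs: {after} suites in parallel")
print(f"ratio: {after / before:.2f}x")
```

Under these assumed host sizes, each host wastes 24 GiB with 32 GiB PODs but only 8 GiB with 14 GiB PODs, which is where the "almost tripling" comes from.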

Adding Evgeny and Shirly who are AFAIK the owners of the metrics suite.

(In reply to the message from Barak Korren <bkorren@redhat.com> of Sun, 1 Sep 2019 at 17:07.)

I haven't seen any comments on this thread, so we are going to move forward with the change.

On Mon, 2 Sep 2019 at 09:03, Barak Korren <bkorren@redhat.com> wrote:
> Adding Evgeny and Shirly who are AFAIK the owners of the metrics suite.

On Thu, Sep 19, 2019 at 3:47 PM Barak Korren <bkorren@redhat.com> wrote:
> I haven't seen any comments on this thread, so we are going to move forward with the change.

I started writing some reply, then realized that the only effect on developers is when pushing patches to OST, not to their own project. Right? CQ will continue as normal, nightly runs, etc.? So I didn't reply...

If so, that's fine for me.

Please document that somewhere. Specifically, how to do the last two points in [1]:
> Since we're aware people would still like to be able to work with the bigger suites, we will leverage the nightly suite invocation jobs to enable them to be run in the CI infra. We will support the following use cases:
>
> - Periodically running the suite on the latest oVirt packages - this will be done by the nightly job like it is done today
> - Running the suite to test changes to the suite's code - while currently this is done automatically by check-patch, in the future this would have to be done by manually triggering the nightly job and setting the REFSPEC parameter to point to the examined patch
> - Triggering the suite manually - this would be done by triggering the suite-specific nightly job (as opposed to the general OST manual job)

[1] ^^
_______________________________________________
Infra mailing list -- infra@ovirt.org
To unsubscribe send an email to infra-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/infra@ovirt.org/message/6UMJLCA45AICC5...
-- Didi
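
For the second use case (manually triggering the nightly job with REFSPEC pointing at a patch), a minimal sketch of how the parameter value could be built is given below. It relies on two things that are standard outside this thread: Gerrit's `refs/changes/<last two digits>/<change>/<patch set>` ref layout, and Jenkins' `buildWithParameters` remote endpoint. Everything else is an assumption for illustration only: the Jenkins base URL, the job name, the change number, and the idea that the nightly job accepts the REFSPEC value in exactly this form are all hypothetical and not confirmed by the thread.

```python
from urllib.parse import urlencode


def gerrit_refspec(change: int, patchset: int) -> str:
    """Build a Gerrit change ref, e.g. refs/changes/56/123456/2.

    The middle component is the last two digits of the change number,
    zero-padded, as in Gerrit's standard ref layout.
    """
    return f"refs/changes/{change % 100:02d}/{change}/{patchset}"


def nightly_trigger_url(jenkins_base: str, job_name: str, refspec: str) -> str:
    """Build a Jenkins buildWithParameters URL that sets REFSPEC.

    jenkins_base and job_name are placeholders; the real values depend
    on the CI instance and on how the suite-specific job is named.
    """
    query = urlencode({"REFSPEC": refspec})
    return f"{jenkins_base.rstrip('/')}/job/{job_name}/buildWithParameters?{query}"


# Hypothetical values, for illustration only.
refspec = gerrit_refspec(change=123456, patchset=2)
url = nightly_trigger_url(
    "https://jenkins.example.org",      # placeholder Jenkins URL
    "hc-basic-suite-master-nightly",    # hypothetical job name
    refspec,
)
print(refspec)
print(url)
```

The same REFSPEC value could equally be pasted into the job's "Build with Parameters" form in the Jenkins web UI; the URL form is shown only because it makes the parameter explicit.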

On Thu, 19 Sep 2019 at 16:21, Yedidyah Bar David <didi@redhat.com> wrote:
> On Thu, Sep 19, 2019 at 3:47 PM Barak Korren <bkorren@redhat.com> wrote:
> > I haven't seen any comments on this thread, so we are going to move forward with the change.
>
> I started writing some reply, then realized that the only effect on developers is when pushing patches to OST, not to their own project. Right? CQ will continue as normal, nightly runs, etc.? So I didn't reply...

Yeah, this only has to do with the big suites that are listed in $subject; none of those are used by the CQ ATM.

> If so, that's fine for me.
>
> Please document that somewhere. Specifically, how to do the last two points in [1]:
participants (2)
- Barak Korren
- Yedidyah Bar David