
Hi everyone,

We've recently seen repeating errors where the OST 'master upgrade from release' suite failed with a repoman exception. Close analysis revealed that repoman was failing because it ran out of space in /dev/shm (OST suites are configured to run from /dev/shm if the slave has more than 16G available in it).

The thing is, there is nothing that seems special about this suite and the packages it downloads, but since we suspected package sizes we opened OST-49 [1].

Trying to get more information, we monitored a slave while it was running the suite. We found out that it wasn't /dev/shm that was filling up with files; instead, repoman's memory usage was exploding (20G+) to the point where there was no more memory available for use by /dev/shm. As a result we reported REP-3 [2].

This is not happening all the time. The same suite sometimes succeeds on the exact same slaves. We haven't yet managed to manually reproduce this.

Since we cannot reproduce this, and we cannot easily stop using repoman in OST at this point, we implemented a work-around for the time being where we directed the master flow to run on a fixed set of nodes that have A LOT of RAM [3].

Needless to say, this is not a long-term solution. We need to somehow manage to reproduce the problem or gain insight into it. Alternatively, we can consider reworking the OST suites to not use repoman for downloading, but still use it for local repo building (where its unique properties are crucial).

[1]: https://ovirt-jira.atlassian.net/browse/OST-49
[2]: https://ovirt-jira.atlassian.net/browse/REP-3
[3]: http://jenkins.ovirt.org/label/integ-tests-big/

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/
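(For readers unfamiliar with the setup: the "run from /dev/shm when the slave has more than 16G available" rule described above boils down to a free-space check on the tmpfs mount. A minimal sketch of that check, not the actual OST CI code; the function names are made up, only the 16G threshold and the mount point come from the message:)

```python
import os

SHM_MOUNT = "/dev/shm"
MIN_SHM_BYTES = 16 * 1024**3  # the 16G threshold mentioned above

def shm_available_bytes(path=SHM_MOUNT):
    """Return the number of bytes currently free on the given mount."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def should_run_in_shm():
    """Mirror the OST rule: use /dev/shm only when >16G is free on it."""
    return shm_available_bytes() > MIN_SHM_BYTES
```

Note that this only checks free space at startup; memory later consumed by other processes (such as repoman's own RSS, as described below) shrinks what tmpfs can actually use.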

> Since we cannot reproduce this, and we cannot easily stop using repoman in OST at this point, we implemented a work-around for the time being where we directed the master flow to run on a fixed set of nodes that have A LOT of RAM [3].
Take into account that this will make the suites run significantly slower (+10 minutes), as IIRC all those servers are multi-NUMA. Also, something must be really exploding, because the basic suite does not take more than 10GB of RAM, and most of the low-memory servers have around 48GB.
> It wasn't /dev/shm that was filling up with files; instead, repoman's memory usage was exploding (20G+) to the point where there was no more memory available for use by /dev/shm.
I have a wild guess that this also happens because repoman does post-filtering: it first downloads all packages, then filters them.

About node and appliance, I think we should avoid downloading them; they are not used anywhere as far as I know. This filter should work (in extra_sources) last I checked, i.e.:

rec:http://plain.resources.ovirt.org/repos/ovirt/tested/4.1/rpm/el7/:name~^(?!ovirt-node-ng-image|ovirt-engine-appliance).*

If it goes in the Groovy it will need some regex escaping love. Though if my previous assumption is correct (post-filtering), it probably wouldn't matter.

This raises the question (again) of how we filter stuff from repoman efficiently, without hiding it in 'extra_sources'.

Nadav.

On Wed, Feb 22, 2017 at 8:07 PM, Barak Korren <bkorren@redhat.com> wrote:
> Hi everyone,
>
> We've recently seen repeating errors where the OST 'master upgrade from release' suite failed with a repoman exception. Close analysis revealed that repoman was failing because it ran out of space in /dev/shm (OST suites are configured to run from /dev/shm if the slave has more than 16G available in it).
>
> The thing is, there is nothing that seems special about this suite and the packages it downloads, but since we suspected package sizes we opened OST-49 [1].
>
> Trying to get more information, we monitored a slave while it was running the suite. We found out that it wasn't /dev/shm that was filling up with files; instead, repoman's memory usage was exploding (20G+) to the point where there was no more memory available for use by /dev/shm. As a result we reported REP-3 [2].
>
> This is not happening all the time. The same suite sometimes succeeds on the exact same slaves. We haven't yet managed to manually reproduce this.
>
> Since we cannot reproduce this, and we cannot easily stop using repoman in OST at this point, we implemented a work-around for the time being where we directed the master flow to run on a fixed set of nodes that have A LOT of RAM [3].
>
> Needless to say, this is not a long-term solution. We need to somehow manage to reproduce the problem or gain insight into it. Alternatively, we can consider reworking the OST suites to not use repoman for downloading, but still use it for local repo building (where its unique properties are crucial).
>
> [1]: https://ovirt-jira.atlassian.net/browse/OST-49
> [2]: https://ovirt-jira.atlassian.net/browse/REP-3
> [3]: http://jenkins.ovirt.org/label/integ-tests-big/
>
> --
> Barak Korren
> bkorren@redhat.com
> RHCE, RHCi, RHV-DevOps Team
> https://ifireball.wordpress.com/
>
> _______________________________________________
> Infra mailing list
> Infra@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
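(Editorial aside: the negative-lookahead name filter Nadav proposes above can be sanity-checked in isolation. The snippet below only illustrates the regex itself, not repoman's actual source-matching code, and the package file names are made up for the example:)

```python
import re

# The name filter from the extra_sources line above: match everything
# except names starting with the two excluded prefixes.
NAME_FILTER = re.compile(r"^(?!ovirt-node-ng-image|ovirt-engine-appliance).*")

# Hypothetical package names, for illustration only.
packages = [
    "ovirt-engine-4.1.0-1.el7.noarch.rpm",
    "ovirt-node-ng-image-4.1.0-1.el7.noarch.rpm",
    "ovirt-engine-appliance-4.1-1.el7.noarch.rpm",
    "vdsm-4.19.4-1.el7.x86_64.rpm",
]

# Keep only names the lookahead does not exclude; note that plain
# "ovirt-engine-*" still passes, since the lookahead is anchored to
# the full "ovirt-engine-appliance" prefix.
kept = [p for p in packages if NAME_FILTER.match(p)]
```

Whether this actually saves memory or disk depends on the post-filtering question raised above: a filter applied after download would not help.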

On 22 February 2017 at 20:36, Nadav Goldin <ngoldin@redhat.com> wrote:
>> Since we cannot reproduce this, and we cannot easily stop using repoman in OST at this point, we implemented a work-around for the time being where we directed the master flow to run on a fixed set of nodes that have A LOT of RAM [3].
>
> Take into account that this will make the suites run significantly slower (+10 minutes), as IIRC all those servers are multi-NUMA. Also, something must be really exploding, because the basic suite does not take more than 10GB of RAM, and most of the low-memory servers have around 48GB.
The alternative is to not run in RAM at all...
>> It wasn't /dev/shm that was filling up with files; instead, repoman's memory usage was exploding (20G+) to the point where there was no more memory available for use by /dev/shm.
>
> I have a wild guess that this also happens because repoman does post-filtering: it first downloads all packages, then filters them.
If that was the case we would see used space in /dev/shm growing. We did not.
> About node and appliance, I think we should avoid downloading them; they are not used anywhere as far as I know. This filter should work (in extra_sources) last I checked, i.e.:
>
> rec:http://plain.resources.ovirt.org/repos/ovirt/tested/4.1/rpm/el7/:name~^(?!ovirt-node-ng-image|ovirt-engine-appliance).*
>
> If it goes in the Groovy it will need some regex escaping love. Though if my previous assumption is correct (post-filtering), it probably wouldn't matter.
It's going to be hard to add this and also allow incorporating the node/HA suites at some point.

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/
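(Editorial aside: the slave-side monitoring described earlier in the thread, watching repoman's memory next to /dev/shm usage, can be approximated with a short loop like the one below. This is a sketch for reproduction attempts, not what the team actually ran; the sampling interval is arbitrary and it assumes a Linux slave with /proc:)

```python
import os
import time

def rss_kib(pid):
    """Read a process's resident set size (KiB) from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def shm_used_bytes(path="/dev/shm"):
    """Bytes currently used on the given mount (tmpfs usage counts as RAM)."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize

def watch(pid, interval=5):
    """Sample the process's RSS alongside /dev/shm usage until it exits."""
    while os.path.exists(f"/proc/{pid}"):
        print(f"rss={rss_kib(pid)} KiB shm_used={shm_used_bytes()} B")
        time.sleep(interval)
```

A trace like this would distinguish the two failure modes discussed above: files filling /dev/shm (shm_used grows) versus repoman's own memory exploding (RSS grows while shm_used stays flat, which is what was observed).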