
Hi everyone,

We've recently seen recurring errors where the OST 'master upgrade from release' suite failed with a repoman exception. Closer analysis revealed that repoman was failing because it ran out of space in /dev/shm (OST suites are configured to run from /dev/shm if the slave has more than 16G available in it).

Nothing seems special about this suite or the packages it downloads, but since we suspected package sizes we opened OST-49 [1].

To get more information, we monitored a slave while it was running the suite. We found that it wasn't /dev/shm that was filling up with files. Instead, repoman's memory usage was exploding (20G+) to the point where there was no more memory available for use by /dev/shm. As a result we reported REP-3 [2].

This does not happen all the time. The same suite sometimes succeeds on the exact same slaves. We haven't yet managed to reproduce it manually.

Since we cannot reproduce this, and we cannot easily stop using repoman in OST at this point, we implemented a workaround for the time being: we directed the master flow to run on a fixed set of nodes that have A LOT of RAM [3].

Needless to say, this is not a long-term solution. We need to somehow reproduce the problem or gain insight into it. Alternatively, we can consider reworking the OST suites to not use repoman for downloading, but still use it for local repo building (where its unique properties are crucial).

[1]: https://ovirt-jira.atlassian.net/browse/OST-49
[2]: https://ovirt-jira.atlassian.net/browse/REP-3
[3]: http://jenkins.ovirt.org/label/integ-tests-big/

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/