Hi everyone,
We've recently seen recurring errors where the OST 'master upgrade
from release' suite failed with a repoman exception.
Close analysis revealed that repoman was failing because it ran out of
space in /dev/shm (OST suites are configured to run from /dev/shm if
the slave has more than 16G available in it).
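For readers unfamiliar with that setup, the /dev/shm rule can be
sketched roughly as below. This is only an illustration and not the
actual OST code: the function name, the fallback directory and the use
of GiB for "16G" are all assumptions.

```python
import shutil

GIB = 1024 ** 3


def pick_workdir(shm_path="/dev/shm", fallback="/var/tmp", min_free_gib=16):
    """Return shm_path if it has more than min_free_gib GiB free.

    Hypothetical helper: the fallback path and the threshold unit are
    assumptions made for illustration, not the real OST configuration.
    """
    try:
        free_bytes = shutil.disk_usage(shm_path).free
    except OSError:
        # The path does not exist or cannot be inspected.
        return fallback
    if free_bytes > min_free_gib * GIB:
        return shm_path
    return fallback


if __name__ == "__main__":
    print("Suite would run from:", pick_workdir())
```

Note that a check like this only looks at the free space tmpfs reports
when the suite starts. It says nothing about how much RAM other
processes will consume later in the run.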
There is nothing that seems special about this suite or the packages
it downloads, but since we suspected package sizes we opened
OST-49 [1].
Trying to get more information, we monitored a slave while it was
running the suite. We found that it wasn't /dev/shm that was filling
up with files; instead, repoman's memory usage was exploding (20G+) to
the point where there was no more memory available for use by
/dev/shm.
As a result we reported REP-3 [2].
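For anyone who wants to help reproduce this, the kind of monitoring
described above can be approximated with a small script like the one
below. It is a rough sketch and not the tooling we actually used: it
assumes a Linux slave with /proc available, and the function names and
the sampling interval are made up for illustration.

```python
import shutil
import time


def parse_vm_rss_kib(status_text):
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     20480 kB"
            return int(line.split()[1])
    return None  # e.g. kernel threads have no VmRSS line


def sample(pid, shm_path="/dev/shm"):
    """Return (rss_kib, shm_used_bytes, shm_free_bytes) for one sample."""
    with open(f"/proc/{pid}/status") as status_file:
        rss_kib = parse_vm_rss_kib(status_file.read())
    shm = shutil.disk_usage(shm_path)
    return rss_kib, shm.used, shm.free


def watch(pid, interval=5.0):
    """Print one sample every `interval` seconds until the process exits."""
    while True:
        try:
            rss_kib, used, free = sample(pid)
        except (FileNotFoundError, ProcessLookupError):
            break  # the process is gone
        print(
            f"rss={(rss_kib or 0) / 1024 ** 2:.2f}GiB "
            f"shm_used={used / 1024 ** 3:.2f}GiB "
            f"shm_free={free / 1024 ** 3:.2f}GiB"
        )
        time.sleep(interval)
```

Calling watch() with the PID of the running repoman process logs both
numbers side by side. Since tmpfs is backed by RAM and swap, the free
space it reports can overstate what is really available once another
process has consumed most of the memory, so it is the RSS column that
should show the growth.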
This does not happen every time. The same suite sometimes succeeds
on the exact same slaves. We haven't yet managed to reproduce it
manually.
Since we cannot reproduce this, and we cannot easily stop using
repoman in OST at this point, we implemented a workaround for the
time being: we directed the master flow to run on a fixed set of
nodes that have A LOT of RAM [3].
Needless to say, this is not a long-term solution. We need to somehow
manage to reproduce the problem or gain insight into it.
Alternatively, we can consider reworking the OST suites to not use
repoman for downloading, but still use it for local repo building
(where its unique properties are crucial).
[1]: https://ovirt-jira.atlassian.net/browse/OST-49
[2]: https://ovirt-jira.atlassian.net/browse/REP-3
[3]: http://jenkins.ovirt.org/label/integ-tests-big/
--
Barak Korren
bkorren(a)redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/