> Since we cannot reproduce this, and we cannot easily stop using
> repoman in OST at this point, we implemented a work-around for the
> time being: we directed the master flow to run on a fixed set of
> nodes that have A LOT of RAM [3].
Take into account that this will make the suites run significantly
slower (+10 minutes), as IIRC all those servers are multi-NUMA. Also,
something must really be exploding, because the basic suite does not
take more than 10GB of RAM, and most of the low-memory servers have
around 48GB.
> [...] it wasn't the /dev/shm that was
> filling up with files; instead, repoman's memory usage was exploding
> (20G+) to the point where there was no more memory available for use
> by /dev/shm.
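For reference, telling those two failure modes apart on a live slave
can be done with something like this (a minimal sketch; the
process-matching heuristic is an assumption, not how we monitored it):

    import os

    def shm_usage_mib():
        # How much of the tmpfs is actually used by files.
        st = os.statvfs('/dev/shm')
        used = (st.f_blocks - st.f_bfree) * st.f_frsize
        total = st.f_blocks * st.f_frsize
        return used // 2**20, total // 2**20

    def rss_mib(needle='repoman'):
        # Sum VmRSS over processes whose cmdline mentions `needle`
        # (a rough heuristic, not exact process matching).
        rss_kb = 0
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/cmdline' % pid) as f:
                    if needle not in f.read():
                        continue
                with open('/proc/%s/status' % pid) as f:
                    for line in f:
                        if line.startswith('VmRSS:'):
                            rss_kb += int(line.split()[1])
            except (IOError, OSError):
                continue  # process went away mid-scan
        return rss_kb // 1024

    print('/dev/shm files: %d/%d MiB' % shm_usage_mib())
    print('repoman RSS: %d MiB' % rss_mib())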
I have a wild guess that this happens because repoman does
post-filtering: it first downloads all packages and only then filters
them.
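If that guess is right, it matters a lot for peak memory. A minimal
sketch of the difference (not repoman's actual code; the names and
URLs are made up):

    from collections import namedtuple

    Package = namedtuple('Package', 'name payload')

    def fetch(url):
        # Stand-in for a real download; imagine `payload` being a big RPM.
        return Package(name=url.rsplit('/', 1)[-1], payload=b'x' * 1024)

    def post_filter(urls, wanted):
        # Everything is fetched and held before the filter runs, so
        # peak usage scales with the whole repo.
        packages = [fetch(u) for u in urls]
        return [p for p in packages if wanted(p.name)]

    def pre_filter(urls, wanted):
        # The same predicate applied to names first: only matching
        # packages are ever fetched.
        return [fetch(u) for u in urls if wanted(u.rsplit('/', 1)[-1])]

    urls = ['http://example.invalid/el7/%s.rpm' % n
            for n in ('vdsm', 'ovirt-node', 'ovirt-engine-appliance')]
    keep = lambda name: 'node' not in name and 'appliance' not in name
    assert post_filter(urls, keep) == pre_filter(urls, keep)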
About node and appliance, I think we should avoid downloading them;
they are not used anywhere as far as I know. This filter should
work (in extra_sources) last I checked, i.e.:
rec:http://plain.resources.ovirt.org/repos/ovirt/tested/4.1/rpm/el7/:name...
If it goes in the Groovy, it will need some regex-escaping love.
Though if my previous assumption (post-filtering) is correct, it
probably wouldn't matter.
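For what it's worth, the escaping love is mostly backslash doubling
once the filter lands inside a Groovy string literal, e.g. (the filter
pattern below is hypothetical):

    # A name filter containing regex metacharacters has to have its
    # backslashes doubled before being embedded in a Groovy/Java string.
    filter_expr = r'name~ovirt-(node|engine-appliance)\b'   # hypothetical
    groovy_literal = '"%s"' % filter_expr.replace('\\', '\\\\')
    print(groovy_literal)  # "name~ovirt-(node|engine-appliance)\\b"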
This raises the question (again) of how we can filter things out of
repoman efficiently, without hiding them in 'extra_sources'.
Nadav.
On Wed, Feb 22, 2017 at 8:07 PM, Barak Korren <bkorren(a)redhat.com> wrote:
> Hi everyone,
>
> We've recently seen repeating errors where the OST 'master upgrade
> from release' suite failed with a repoman exception.
> Close analysis revealed that repoman was failing because it ran out of
> space in /dev/shm (OST suites are configured to run from /dev/shm if
> the slave has more than 16G available in it).
>
> The thing is, there is nothing that seems special about this suite and
> the packages it downloads, but since we suspected package sizes we
> opened OST-49 [1].
>
> Trying to get more information, we monitored a slave while it was
> running the suite. We found out that it wasn't the /dev/shm that was
> filling up with files; instead, repoman's memory usage was exploding
> (20G+) to the point where there was no more memory available for use
> by /dev/shm.
> As a result we reported REP-3 [2].
>
> This is not happening all the time. The same suite sometimes succeeds
> on the exact same slaves. We haven't yet managed to manually reproduce
> this.
>
> Since we cannot reproduce this, and we cannot easily stop using
> repoman in OST at this point, we implemented a work-around for the
> time being: we directed the master flow to run on a fixed set of
> nodes that have A LOT of RAM [3].
>
> Needless to say this is not a long-term solution. We need to somehow
> manage to reproduce or gain insight into the problem. Alternatively we
> can consider reworking the OST suites to not use repoman for
> downloading, but still use it for local repo building (where its
> unique properties are crucial).
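One possible shape for that split is sketched below: fetch the
packages directly and hand repoman a directory it only has to build a
repo from. The URL list and the exact repoman invocation are
assumptions for illustration, not the real OST code.

    import os
    import subprocess
    from urllib.request import urlretrieve

    RPM_URLS = [
        # hypothetical package URL, for illustration only
        'http://plain.resources.ovirt.org/repos/ovirt/tested/4.1/rpm/el7/x86_64/example.rpm',
    ]

    def build_local_repo(staging, repo_dir):
        os.makedirs(staging, exist_ok=True)
        for url in RPM_URLS:
            # Plain download -- repoman's memory usage is not involved here.
            urlretrieve(url, os.path.join(staging, url.rsplit('/', 1)[-1]))
        # Use repoman only for what it is uniquely good at: building the
        # local repo from packages already on disk (invocation assumed).
        subprocess.check_call(['repoman', repo_dir, 'add', staging])

    build_local_repo('/tmp/ost-staging', '/tmp/ost-repo')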
>
> [1]: https://ovirt-jira.atlassian.net/browse/OST-49
> [2]: https://ovirt-jira.atlassian.net/browse/REP-3
> [3]: http://jenkins.ovirt.org/label/integ-tests-big/
>
> --
> Barak Korren
> bkorren(a)redhat.com
> RHCE, RHCi, RHV-DevOps Team
>
> https://ifireball.wordpress.com/