On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
> On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi(a)redhat.com> wrote:
> > The next run of the job (480) did finish successfully. No idea if it
> > was already fixed by a patch, or is simply a random/env issue.
>
> I think this is an env issue; we run on overloaded VMs with a small
> amount of memory. I have seen such random failures before.
Generally speaking, I think we must aim for zero failures due to "env
issues", rather than dismiss them as such.
It would obviously be nice if we had more hardware in CI.
But I wonder whether stressing the system like we do (due to resource
scarcity) is actually a good thing - that it helps us find bugs that real
users might also hit in perfectly legitimate scenarios, i.e. using the
hardware we recommend, but under a load higher than what a single CI run
generates - since, admittedly, we only have minimal _data_ there.
So: if we decide that some code "worked as designed" and failed due to an
"env issue", I still think we should fix this - either in our code, or
in CI.
For the latter, I do not think it makes sense to just say "the machines
are overloaded and do not have enough memory" - we must come up with
concrete requirements, e.g. "We need at least X MiB RAM".
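
(As a purely illustrative sketch of how such a concrete requirement could
be enforced - this is not existing oVirt CI code, and the script, the
4096 MiB threshold and the messages below are made up:)

#!/usr/bin/env python3
# Hypothetical pre-flight check for a CI job: fail fast with a clear
# message if the runner does not meet a documented minimum, instead of
# letting the suite fail later in a way that looks like a random failure.
# MIN_MEM_MIB is only a placeholder, not a measured requirement.

import sys

MIN_MEM_MIB = 4096


def total_memory_mib():
    """Return total system memory in MiB, read from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                # Line format: "MemTotal:       16333644 kB"
                return int(line.split()[1]) // 1024
    raise RuntimeError("MemTotal not found in /proc/meminfo")


if __name__ == "__main__":
    mem = total_memory_mib()
    if mem < MIN_MEM_MIB:
        sys.exit("env check failed: %d MiB RAM, need at least %d MiB"
                 % (mem, MIN_MEM_MIB))
    print("env check passed: %d MiB RAM available" % mem)

Running something like this before the suite would turn "env issue" into
an explicit, documented requirement that fails early and loudly.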
For the current issue, if we are certain that it is due to low memory,
it's quite easy to e.g. revert this patch:
https://gerrit.ovirt.org/110530
Obviously it will mean either longer queues or over-committing (higher
load). Not sure which.
But personally, I wouldn't do that without knowing more (e.g. following
the other thread).
Best regards,
--
Didi