
On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
The next run of the job (480) did finish successfully. No idea whether it was already fixed by a patch or is simply a random/env issue.
I think this is an env issue; we run on overloaded VMs with a small amount of memory. I have seen such random failures before.
Generally speaking, I think we must aim for zero failures due to "env issues" - and not ignore them as such. It would obviously be nice if we had more hardware in CI, no doubt. But I wonder whether stressing the system like we do (due to resource scarcity) is actually a good thing: it may help us find bugs that real users could also run into in perfectly legitimate scenarios - that is, using the hardware we recommend, but under a load higher than what we have per-run in CI. Admittedly, we only have minimal _data_ there.

So: if we decide that some code "worked as designed" and failed due to an "env issue", I still think we should fix this - either in our code, or in CI. For the latter, I do not think it makes sense to just say "the machines are overloaded and do not have enough memory" - we must come up with concrete details, e.g. "We need at least X MiB RAM".

For the current issue, if we are certain that it is due to low memory, it's quite easy to e.g. revert this patch: https://gerrit.ovirt.org/110530 . Obviously that would mean either longer queues or over-committing (higher load) - not sure which. But personally, I wouldn't do that without knowing more (e.g. following the other thread).

Best regards,
--
Didi
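
As an illustration of the "at least X MiB RAM" idea, a CI job could run a pre-flight check and fail fast with an explicit message instead of producing a random mid-run failure. The sketch below is hypothetical - the script and the 2048 MiB threshold are assumptions for illustration, not values taken from OST or from this thread:

    #!/usr/bin/env python3
    """Pre-flight RAM check for a CI job (illustrative sketch only)."""
    import sys

    # Placeholder minimum; the real value would have to be measured, not guessed.
    MIN_RAM_MIB = 2048


    def total_ram_mib():
        """Return total system RAM in MiB, read from /proc/meminfo (Linux)."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    # Line format: "MemTotal:       16303884 kB"
                    return int(line.split()[1]) // 1024
        raise RuntimeError("MemTotal not found in /proc/meminfo")


    def main():
        total = total_ram_mib()
        if total < MIN_RAM_MIB:
            print("FATAL: CI host has %d MiB RAM, need at least %d MiB"
                  % (total, MIN_RAM_MIB), file=sys.stderr)
            sys.exit(1)
        print("RAM check passed: %d MiB available" % total)


    if __name__ == "__main__":
        main()

Failing at the start of the run with a message like the one above makes the requirement explicit and keeps "env issue" failures from looking like random test bugs.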