On 14 Oct 2020, at 08:14, Yedidyah Bar David <didi@redhat.com> wrote:
On Tue, Oct 13, 2020 at 6:46 PM Nir Soffer <nsoffer@redhat.com> wrote:
>
> On Mon, Oct 12, 2020 at 9:05 AM Yedidyah Bar David <didi@redhat.com> wrote:
>> The next run of the job (480) did finish successfully. No idea if it
>> was already fixed by a patch, or is simply a random/env issue.
>
> I think this is an env issue; we run on overloaded VMs with a small amount of memory.
> I have seen such random failures before.
Generally speaking, I think we must aim for zero failures due to "env
issues" - and not ignore them as such.
Exactly. We cannot ignore that any longer.
It would obviously be nice if we had more hardware in CI, no doubt.
there’s never enough
But I wonder if perhaps stressing the system like we do (due to resource
scarcity) is actually a good thing - that it helps us find bugs that real
users might also run into in actual, legitimate scenarios
yes, it absolutely does.
- meaning, using what we recommend in terms of hardware etc., but with a
load that is higher than what we have in CI per run - as, admittedly, we
only have minimal _data_ there.
So: If we decide that some code "worked as designed" and failed due to an
"env issue", I still think we should fix this - either in our code, or
in CI.
yes!
For the latter, I do not think it makes sense to just say "the machines are
overloaded and do not have enough memory" - we must come up with concrete
details - e.g. "We need at least X MiB of RAM".
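To make that concrete, here is a minimal sketch of what such a check could
look like (hypothetical, not part of OST - the MIN_RAM_MIB value below is
just a placeholder until we have a measured requirement):

    # Hypothetical pre-flight check; the threshold is a placeholder, not a
    # real OST requirement.
    import sys

    MIN_RAM_MIB = 4096  # placeholder - replace with a measured requirement

    def available_ram_mib():
        """Return MemAvailable from /proc/meminfo, in MiB."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) // 1024
        raise RuntimeError("MemAvailable not found in /proc/meminfo")

    if __name__ == "__main__":
        avail = available_ram_mib()
        if avail < MIN_RAM_MIB:
            sys.exit("env issue: only %d MiB available, need at least %d MiB"
                     % (avail, MIN_RAM_MIB))
        print("ok: %d MiB available" % avail)

A failure message like that is actionable in a way that "the machines are
overloaded" is not.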
I’ve spent quite some time analyzing the flakes in the basic suite this past half
year… so allow me to say that that’s usually just an excuse for a lousy test (or
functionality :)
For the current issue, if we are certain that this is due to low memory, it's
quite easy to e.g. revert this patch:
https://gerrit.ovirt.org/110530
Obviously it will mean either longer queues or over-committing (higher
load). Not sure which.
it’s difficult to pinpoint the reason, really. If it’s happening rarely (as this one is),
you’d need a statistically relevant comparison. Which takes time…
About this specific sparsify test - it was me who uncommented it a few months ago, after
running around 100 tests over a weekend. It may have failed once (there were/are still
some other flakes)… but considering that the overall success rate was quite low at that
time anyway, it sounded acceptable to me.
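As a rough illustration of why such a comparison takes time (a sketch only,
assuming SciPy is available; the counts are made up, not real OST numbers):

    # Compare failure counts from two batches of runs with Fisher's exact test.
    # All numbers below are illustrative.
    from scipy.stats import fisher_exact

    old_fail, old_runs = 1, 100   # e.g. something like the weekend batch above
    new_fail, new_runs = 3, 50    # hypothetical recent batch

    table = [
        [old_fail, old_runs - old_fail],
        [new_fail, new_runs - new_fail],
    ]
    _, p_value = fisher_exact(table)
    print("p-value: %.3f" % p_value)

Even though the apparent failure rate is several times higher in the second
batch, the p-value comes out around 0.1 - well above the usual 0.05 cutoff -
so with failures this rare you need a lot of runs before an env flake and a
real regression can be told apart.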
If this is now happening more often, then it does sound like a regression somewhere. It
could be all the OST changes or test rearrangements, but it could also be a code regression.
Either way it’s supposed to be predictable. And it is, just not in this environment we use
for this particular job - it’s the old one without ost-images, inside the troublesome
mock, so you don’t know what it picked up or what the underlying system really is
(outside of mock).
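One cheap mitigation (just a sketch, not something OST does today) would be
to log a few host facts at the start of every run; since mock is chroot-based,
/proc and uname should still reflect the real underlying host:

    # Sketch: dump a few host facts at the start of a run, so "env issue"
    # failures can later be correlated with the machine the job landed on.
    import os
    import platform

    def env_summary():
        with open("/proc/meminfo") as f:
            mem_total_kib = int(
                next(line for line in f
                     if line.startswith("MemTotal:")).split()[1])
        return {
            "kernel": platform.release(),
            "cpus": os.cpu_count(),
            "mem_total_mib": mem_total_kib // 1024,
            "load_avg": os.getloadavg(),
        }

    if __name__ == "__main__":
        print(env_summary())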
Thanks,
michal
But personally, I wouldn't do that without knowing more (e.g. following
the other thread).
Best regards,
--
Didi