On Wed, Feb 24, 2021 at 2:36 PM Marcin Sobczyk <msobczyk(a)redhat.com> wrote:
On 2/24/21 11:05 AM, Yedidyah Bar David wrote:
> On Wed, Feb 24, 2021 at 11:43 AM Milan Zamazal <mzamazal(a)redhat.com> wrote:
>> Yedidyah Bar David <didi(a)redhat.com> writes:
>>
>>> Hi all,
>>>
>>> Right now, when we merge a patch to e.g. the engine (and many other
>>> projects), it can take up to several days until it is used by the
>>> hosted-engine ovirt-system-tests suite. Something similar will happen
>>> soon if/when we introduce suites that use ovirt-node.
>>>
>>> If I got it right:
>>> - Merge causes CI to build the engine - immediately, takes ~ 1 hour (say)
>>> - A publisher job [1] publishes it to resources.ovirt.org (daily,
>>> midnight (UTC))
>>> - The next run of an appliance build [2] includes it (daily, afternoon)
>>> - The next run of the publisher [1] publishes the appliance (daily,
>>> midnight)
>>> - The next run of ost-images [3] includes the appliance (daily,
>>> midnight, 2 hours after the publisher) (and publishes it immediately)
>>> - The next run of ost (e.g. [4]) will use it (daily, slightly *before*
>>> ost-images, but I guess we can change that. And this does not affect
>>> manual runs of OST, so can probably be ignored in the calculation, at
>>> least to some extent).
>>>
>>> So if I got it right, a patch merged to the engine on some morning
>>> will be used by the nightly run of OST HE only after almost 3 days,
>>> and will be available for manual runs after 2 days. IMO that's too
>>> much time. I might be somewhat wrong, but not by much, I think.
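>>>
>>> (An illustrative timeline, assuming the daily schedule above: a patch
>>> merged Monday at 09:00 is built by ~10:00, published to
>>> resources.ovirt.org by Monday's midnight publisher run, picked up by
>>> Tuesday afternoon's appliance build, published by Tuesday's midnight
>>> publisher run, and baked into Wednesday's ~02:00 ost-images build - so
>>> it's available for manual OST runs after roughly 2 days, and the first
>>> nightly OST HE run to exercise it is early Thursday's, almost 3 days
>>> after the merge.)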
>>>
>>> One partial solution is to add automation .repos lines to relevant
>>> projects that point at lastSuccessfulBuild (let's call it lastSB) of
>>> the more important projects they consume - e.g. the appliance would use
>>> lastSB of engine+dwh+a few others, node would use lastSB of vdsm, etc.
>>> This will require more maintenance (adding/removing/fixing projects as
>>> needed) and cause some more load on CI (as packages will now be
>>> downloaded from it instead of from resources.ovirt.org).
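>>> For illustration only (the Jenkins job name and artifact path here are
>>> made up, not the real ones), such a .repos line could look roughly like:
>>>
>>>   engine-latest,https://jenkins.ovirt.org/job/ovirt-engine_master_build-artifacts-el8-x86_64/lastSuccessfulBuild/artifact/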
>>>
>>> Another solution is to run relevant jobs (publisher/appliance/node)
>>> far more often - say, once every two hours.
>> One important thing to consider is the ability to run OST on our patches
>> at all. If there is (almost) always a newer build available, then custom
>> repos added to OST runs, whether on Jenkins or locally, will be ignored
>> and we'll be unable to test our patches before they are merged.
> Indeed. That's an important point. IIRC OST has a ticket specifically
> addressing this issue.
Yes, we have:
https://gerrit.ovirt.org/#/c/ovirt-system-tests/+/113223/
and:
https://issues.redhat.com/browse/RHV-41025
which is not implemented yet.
The downside of upgrading to the latest RPMs from the 'tested' repo is, as
Milan mentioned, an increased chance that your own packages will not be
used because they're too old.
The upside is that if someone breaks OST globally with e.g. some engine
patch, and a fix for the problem is merged midday, upgrading to the latest
RPMs will unblock the runs.
If we don't upgrade, we'll have to wait for the nightly job to rebuild
ost-images to include the fix.
Rebuilding ost-images midday is an option, but it takes a lot of time, so
in most cases one can simply wait until the next day...
I want to fix this by implementing an option in OST's manual run (switched
off by default) that will allow you to upgrade to the latest RPMs from
'tested'. That way one has ~24h for their patches to be fresh enough to be
picked up by dnf.
'check-patch' jobs should always use the latest RPMs from 'tested' IMO.
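
For illustration only - this is a rough sketch of what such an "upgrade to
the latest 'tested'" option might do under the hood, not how OST actually
implements or would implement it, and the repo URL is written from memory
and may differ:

  dnf config-manager --add-repo https://resources.ovirt.org/repos/ovirt/tested/master/rpm/el8/
  dnf upgrade -y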
>
>>> This will also add load, and might cause "perceived" instability - as
>>> things will likely fluctuate between green and red more often.
>> This doesn't sound very good; I already perceive things as less than
>> stable now.
> Agreed.
> I quoted "perceived" because I do not think they'll actually be less
> stable.
> Right now, when something critical is broken, we fix it, then manually
> run some of the above jobs as needed, to quickly get back to business.
> When we don't (which is often), some things simply remain broken for two
> days.
>
> Running more often will simply notify us about breakage faster. If we
> then fix it, the fix will automatically propagate faster.
Isn't upgrading the engine RPM on the appliance an option?
You mean, as part of the OST run itself?
Generally speaking, 'hosted-engine --deploy' already does that, but in
practice this does not work in CI. I haven't checked recently why -
probably some configuration (repos) or a missing proxy or something like
that.
It's done in a task called 'Update all packages' (something to search for
in the logs if you feel like it).
It can be controlled from the CLI with he_offline_deployment [1], but I do
not see anywhere that we use this in CI.
[1]
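
Just as an illustrative guess - assuming one drives the
ovirt-ansible-hosted-engine-setup role directly, with a hypothetical
playbook name, which is not necessarily how CI or the hosted-engine CLI
invokes it - the variable would be passed along these lines:

  # skip the online 'Update all packages' step, if I recall its meaning right
  ansible-playbook he_deploy.yml -e he_offline_deployment=true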
>
>>> I think I prefer the latter. What do you think?
>> Wouldn't it be possible to run the whole pipeline nightly (even if it
>> means e.g. running the publisher twice during the night)?
> It would, yes. But this will only fix the specific issue of the
> appliance/node delay.
> Running more often also simply gives feedback faster.
>
> But I agree that perhaps we should wait with this until OST allows
> using a custom repo reliably and easily.
>
> Thanks,