Change-queue job failures this weekend

Eyal Edri eedri at redhat.com
Sun Jan 21 10:39:02 UTC 2018


There is another issue which is currently failing all CQ runs, and it's
related to the new IBRS CPU model.
It looks like all of the Lago slaves were upgraded to a new libvirt and
kernel on Friday, while we still don't have a fix in lago-ost-plugin for
that.

I think there was a misunderstanding about what to upgrade: it may have
been assumed that only the BIOS upgrade breaks OST, and not the kernel and
libvirt one.

In any case, we're currently fixing the issue, either by downgrading the
relevant packages on the Lago slaves or by adding the mapping for the new
CPU types to OST.
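
To illustrate the kind of check that could catch this earlier, here is a
minimal sketch that warns when a slave's host CPU model (as reported by
virsh) has no OST mapping yet - the model list here is illustrative only,
not the real lago-ost-plugin data:

    # Minimal sketch: warn when a Lago slave exposes a host CPU model that
    # OST/lago has no mapping for yet (e.g. the new *-IBRS variants).
    # Assumes virsh is installed and libvirtd is running on the slave.
    import subprocess
    import sys
    import xml.etree.ElementTree as ET

    # Illustrative list only -- the real mapping lives in lago-ost-plugin.
    KNOWN_MODELS = {'Westmere', 'SandyBridge', 'Haswell-noTSX', 'Broadwell-noTSX'}

    def host_cpu_model():
        """Return the host CPU model name from 'virsh capabilities'."""
        xml_out = subprocess.check_output(['virsh', 'capabilities'])
        return ET.fromstring(xml_out).findtext('./host/cpu/model')

    model = host_cpu_model()
    if model not in KNOWN_MODELS:
        sys.exit("WARNING: host CPU model %r has no OST mapping yet" % model)
    print("host CPU model %r is known, OK" % model)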

For the future, I suggest a few updates to how we do maintenance work on
Jenkins slaves (VMs or bare metal):

1. Let's avoid doing upgrades close to a weekend (i.e. not on Thu-Sun), so
the whole team can be around to help if needed or if something unexpected
happens.
2. When we have a system-wide upgrade scheduled, like all BM slaves or all
VMs for a specific OS, let's adopt a gradual rollout with a few days'
window in between.
  E.g., if we need to upgrade all Lago slaves, let's upgrade 1-2 first,
wait to see that nothing breaks, and continue only after we verify OST runs
(either by watching CQ or by running it manually) - see the sketch below.
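
A rough sketch of how point 2 could look with the python-jenkins library
(the URL, credentials and node names are placeholders, not our real setup):

    # Rough sketch of the "canary first" rollout from point 2, using python-jenkins.
    import jenkins

    JENKINS_URL = 'https://jenkins.example.org'          # placeholder URL
    CANARY_SLAVES = ['lago-slave-01', 'lago-slave-02']   # hypothetical node names

    server = jenkins.Jenkins(JENKINS_URL, username='admin', password='api-token')

    # Take only the canary slaves offline for the upgrade; the rest keep serving CQ.
    for name in CANARY_SLAVES:
        server.disable_node(name, msg='canary upgrade: libvirt/kernel')

    # ... upgrade the canaries out of band and run OST against them manually ...

    # Re-enable them only once OST passes; the remaining slaves follow days later.
    for name in CANARY_SLAVES:
        server.enable_node(name)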


Thoughts?


On Sun, Jan 21, 2018 at 10:42 AM, Barak Korren <bkorren at redhat.com> wrote:

> Hi,
>
> We've seen a great deal of noise coming from the change queue this
> weekend. While part of it is due to actual code regressions, some of
> it was actually due to two separate infra issues.
>
> One issue we had was with building FC26 packages - it turns out that a
> yum-incompatible update of the 'cmake' package was introduced to the
> FC26 updates repo. Since, for the time being, we use 'yum' to set up the
> mock environments, the build jobs for FC26 started failing.
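>
> For context, the mock chroot configs are plain Python fragments; a
> minimal, illustrative fragment of the relevant setting (placeholder
> values only, not our actual job config) looks roughly like this:
>
>     # Illustrative mock chroot config fragment (mock configs are Python).
>     # We still bootstrap the chroots with yum, so a yum-incompatible
>     # package update in the updates repo breaks the jobs at setup time.
>     config_opts['root'] = 'fedora-26-x86_64'   # hypothetical chroot name
>     config_opts['package_manager'] = 'yum'     # 'dnf' is the usual alternative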
>
> This issue was actually reported to us [1].
>
> To resolve this, we rolled back the FC26 mirror to a point in time before
> the breaking change was introduced, and then re-triggered the merge events
> for all the patches that failed building, to introduce passing builds to
> the change queue.
>
> The second issue had to do with the introduction of FC27 slaves - it
> seems that the slaves were misconfigured [2] and did not include very
> basic packages like 'git'. This caused the CQ master job to simply
> crash and stop processing the queue.
>
> To resolve this issue we disabled the FC27 slaves, resumed CQ
> operation, and then re-sent all the changes that had failed to be added
> to the queue.
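>
> A cheap guard against this class of problem would be a pre-flight check
> on new slaves before they are put into the pool; a minimal sketch (the
> package list and the use of rpm here are illustrative, not an actual
> slave requirement list):
>
>     # Minimal pre-flight sketch: fail fast if a new slave is missing basic tools.
>     import subprocess
>     import sys
>
>     REQUIRED = ['git', 'python', 'sudo']   # illustrative package list
>
>     missing = [pkg for pkg in REQUIRED
>                if subprocess.call(['rpm', '-q', pkg],
>                                   stdout=subprocess.DEVNULL,
>                                   stderr=subprocess.DEVNULL) != 0]
>     if missing:
>         sys.exit('slave is missing required packages: ' + ', '.join(missing))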
>
> We are in the final phases of integrating a new oVirt release, so
> proper CQ operation is crucial at this time. Additionally, due to a
> substantial number of regressions introduced last week, the CQ
> currently has a huge backlog of ~180 changes to work through. This
> means that every bisection takes about 8 OST runs (log2(180) is roughly
> 7.5, so ~8 halving steps), so we have no CQ minutes to spare.
>
> The FC27 slaves issue cost us 11 hours during which the CQ was not
> running. It also manifested itself in failures of the
> 'standard-enqueue' job. These kinds of failures need to be handled
> promptly or avoided altogether.
>
> Build failures can make the CQ waste time too, as it runs bisections
> to detect and remove changes that fail to build. At this time, a
> single failed build can waste up to 8 hours!
>
> Let's try to be more careful about introducing infrastructure changes
> to the system at sensitive times, and be more vigilant about failure
> reports from Jenkins.
>
> [1]: https://ovirt-jira.atlassian.net/browse/OVIRT-1854
> [2]: https://ovirt-jira.atlassian.net/browse/OVIRT-1855
>
> --
> Barak Korren
> RHV DevOps team , RHCE, RHCi
> Red Hat EMEA
> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
> _______________________________________________
> Infra mailing list
> Infra at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
>



-- 

Eyal Edri


MANAGER

RHV DevOps

EMEA VIRTUALIZATION R&D


Red Hat EMEA <https://www.redhat.com/>
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)