Change-queue job failures this weekend

Barak Korren bkorren at redhat.com
Sun Jan 21 08:42:04 UTC 2018


Hi,

We have seen a great deal of noise coming from the change queue this
weekend. While part of it was due to actual code regressions, some of
it was due to two separate infra issues.

One issue we had was with building FC26 packages - it turns out that a
yum-incompatible update of the 'cmake' package was introduced to the
FC26 updates repo. Since for the time being we use 'yum' to set up the
mock environments, the build jobs for FC26 started failing.

This issue was actually reported to us [1].

To resolve this, we rolled back the FC26 mirror to a time before the
breaking change was introduced, and then re-triggered the merge events
for all the patches that had failed to build, so that passing builds
would enter the change queue.

The second issue had to do with the introduction of FC27 slaves - it
seems that the slaves were misconfigured [2] and did not include very
basic packages like 'git' - this caused the CQ master job to simply
crash and stop queue processing.

To resolve this issue we disabled the FC27 slaves, resumed CQ
operation and then re-sent all changes that failed to be added into
the queue.

We are in the final phases of integrating a new oVirt release, so
proper CQ operation is crucial at this time. Additionally, due to a
substantial number of regressions introduced last week, the CQ
currently has a huge backlog of ~180 changes to work through. This
means that every bisection takes 8 OST runs, so we have no CQ minutes
to spare.
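
As a rough sketch of where the figure of 8 runs comes from: assuming
the CQ isolates a single bad change by repeatedly halving the queue
and testing one half per OST run (the function names below are
illustrative and not part of the actual CQ code, which may work
differently), the worst case for a backlog of 180 changes works out
as follows:

```python
def bisection_runs(queue_length: int) -> int:
    """Worst-case number of test runs needed to isolate one bad change
    by halving, i.e. ceil(log2(queue_length))."""
    if queue_length < 1:
        raise ValueError("queue_length must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without
    # relying on floating-point arithmetic.
    return (queue_length - 1).bit_length()


def simulate_bisection(queue_length: int, bad_index: int) -> int:
    """Count the test runs a simple halving search needs to find the
    change at bad_index (hypothetical model of the bisection)."""
    lo, hi = 0, queue_length  # the bad change lies in [lo, hi)
    runs = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        runs += 1  # one test run covering changes [lo, mid)
        if bad_index < mid:
            hi = mid
        else:
            lo = mid
    return runs


if __name__ == "__main__":
    backlog = 180  # approximate backlog size mentioned above
    print(bisection_runs(backlog))
    print(max(simulate_bisection(backlog, i) for i in range(backlog)))
```

Under this model the cost grows with the logarithm of the backlog, so
the number of runs per bisection only drops once the queue shrinks
below 128 changes.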

The FC27 slaves issue cost us 11 hours in which the CQ was not
running. It also manifested itself in failures of the
'standard-enqueue' job. These kinds of failures need to be handled
promptly or be avoided altogether.

Build failures can make the CQ waste time too, as it runs bisections
to detect and remove changes that fail to build. At this time, a
single failed build can waste up to 8 hours!

Let's try to be more careful about introducing infrastructure changes
to the system at sensitive times, and be more vigilant about failure
reports from Jenkins.

[1]: https://ovirt-jira.atlassian.net/browse/OVIRT-1854
[2]: https://ovirt-jira.atlassian.net/browse/OVIRT-1855

-- 
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted

