eyal edri updated OVIRT-1856:
-----------------------------
Summary: Document procedure for infra upgrades (was: Re: Change-queue job failures
this weekend)
Document procedure for infra upgrades
--------------------------------------
Key: OVIRT-1856
URL:
https://ovirt-jira.atlassian.net/browse/OVIRT-1856
Project: oVirt - virtualization made easy
Issue Type: By-EMAIL
Reporter: eyal edri
Assignee: infra
On Sun, Jan 21, 2018 at 1:01 PM, Barak Korren <bkorren(a)redhat.com> wrote:
>
>
> On 21 January 2018 at 12:50, Eyal Edri <eedri(a)redhat.com> wrote:
>
>>
>>
>> On Sun, Jan 21, 2018 at 12:47 PM, Barak Korren <bkorren(a)redhat.com>
>> wrote:
>>
>>>
>>>
>>> On 21 January 2018 at 12:39, Eyal Edri <eedri(a)redhat.com> wrote:
>>>
>>>> There is another issue, which is currently failing all CQ, and it's
>>>> related to the new IBRS CPU model.
>>>> It looks like all of the lago slaves were upgraded to new Libvirt and
>>>> kernel on Friday, while we still don't have a fix on lago-ost-plugin
>>>> for that.
>>>>
>>>> I think there was a misunderstanding about what to upgrade, and it
>>>> might have been understood that only the BIOS upgrade breaks it and
>>>> not the kernel one.
>>>>
>>>> In any case, we're currently fixing the issue, either by downgrading
>>>> the relevant pkgs on lago slaves or adding the mapping to new CPU types
>>>> from OST.
>>>>
>>>> For the future, I suggest a few updates to maintenance work on Jenkins
>>>> slaves (VMs or BM):
>>>>
>>>> 1. Let's avoid doing an upgrade close to a weekend (i.e. not on
>>>> Thu-Sun), so all the team can be around to help if needed or if
>>>> something unexpected happens.
>>>> 2. When we have a system-wide upgrade scheduled, like all BM slaves or
>>>> VMs for a specific OS, let's adopt a gradual upgrade with a window of
>>>> a few days in between,
>>>> e.g. if we need to upgrade all Lago slaves, let's upgrade 1-2, wait
>>>> to see that nothing breaks, and continue after we verify OST runs
>>>> (either by watching CQ or by running it manually).
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>>
>>> We have a staging system - we should be using it for staging....
>>>
>>
>> Do we have OST tests or a manual job available there?
>>
>
> We can add them easily, or simply run Lago manually when needed.
>
>
>> In any case, this doesn't contradict what I suggested, even if you test
>> on staging, there could be differences from the production system, so we
>> should take care when we upgrade regardless.
>>
>
> Yes, but at least we'd know we green-lighted the new configuration - I'm
> sure in this case we could have found at least some of the issues on
> staging (like the fc27 issues, for example) and could have avoided expensive
> production failures.
>
>> Another point when scheduling an upgrade is to talk to the infra owner or
>> the CI team and understand if we currently have a large queue in CQ or
>> known failures, so it might be best to wait a bit until it's cleared.
>>
>>
>
>
Adding infra-support so we can gather this info and prepare a maintenance
/ upgrade checklist to add to the oVirt infra docs.
Let's continue the discussion and suggestions on that ticket.
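
As a starting point for that checklist, here is a rough sketch of the two scheduling rules suggested in the thread (no upgrades on Thu-Sun, and a small first batch followed by a wait of a few days). Every name in it, as well as the default batch size and wait period, is a hypothetical choice for illustration and not an existing oVirt infra tool. Verifying that OST still passes between batches remains a manual step that the sketch does not cover.

```python
from datetime import date, timedelta

# Hypothetical helper names; this is an illustration, not an existing tool.
# date.weekday() counts Monday as 0, so Thu..Sun are 3..6.
BLOCKED_WEEKDAYS = {3, 4, 5, 6}


def is_allowed_upgrade_day(day: date) -> bool:
    """Return True if the day falls outside the Thu-Sun window."""
    return day.weekday() not in BLOCKED_WEEKDAYS


def next_allowed_day(day: date) -> date:
    """Return the given day, or the first later day outside Thu-Sun."""
    while not is_allowed_upgrade_day(day):
        day += timedelta(days=1)
    return day


def plan_gradual_upgrade(slaves, start, first_batch_size=2, wait_days=2):
    """Schedule a small first batch, then the remaining slaves.

    The second batch is scheduled at least `wait_days` after the first,
    leaving time to confirm that OST still passes (checked by hand).
    Both defaults are assumptions, not values agreed on in the thread.
    """
    first_day = next_allowed_day(start)
    plan = [(first_day, list(slaves[:first_batch_size]))]
    rest = list(slaves[first_batch_size:])
    if rest:
        second_day = next_allowed_day(first_day + timedelta(days=wait_days))
        plan.append((second_day, rest))
    return plan
```

For example, a request to start on Friday, 19 January 2018 would be moved to the following Monday, with the remaining slaves scheduled two days after the first batch.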
> --
> Barak Korren
> RHV DevOps team , RHCE, RHCi
> Red Hat EMEA
>
>
--
Eyal edri
MANAGER
RHV DevOps
EMEA VIRTUALIZATION R&D
Red Hat EMEA <https://www.redhat.com/>
TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)