[JIRA] (OVIRT-1857) Please create ovirt-ansible-dpdk-setup github repository
by Ondra Machacek (oVirt JIRA)
Ondra Machacek created OVIRT-1857:
-------------------------------------
Summary: Please create ovirt-ansible-dpdk-setup github repository
Key: OVIRT-1857
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1857
Project: oVirt - virtualization made easy
Issue Type: Task
Reporter: Ondra Machacek
Assignee: infra
Please create the ovirt-ansible-dpdk-setup github repository,
and please assign igoihman as administrator.
Thanks.
[~igoihman(a)redhat.com]
[~mwperina]
--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100076)
[JIRA] (OVIRT-1856) Document procedure for infra upgrades
by eyal edri (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1856?page=com.atlassian.jir... ]
eyal edri updated OVIRT-1856:
-----------------------------
Priority: High (was: Medium)
> Document procedure for infra upgrades
> --------------------------------------
>
> Key: OVIRT-1856
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
> Project: oVirt - virtualization made easy
> Issue Type: By-EMAIL
> Reporter: eyal edri
> Assignee: infra
> Priority: High
>
> On Sun, Jan 21, 2018 at 1:01 PM, Barak Korren <bkorren(a)redhat.com> wrote:
> >
> >
> > On 21 January 2018 at 12:50, Eyal Edri <eedri(a)redhat.com> wrote:
> >
> >>
> >>
> >> On Sun, Jan 21, 2018 at 12:47 PM, Barak Korren <bkorren(a)redhat.com>
> >> wrote:
> >>
> >>>
> >>>
> >>> On 21 January 2018 at 12:39, Eyal Edri <eedri(a)redhat.com> wrote:
> >>>
> >>>> There is another issue, which is currently failing all CQ, and it's
> >>>> related to the new IBRS CPU model.
> >>>> It looks like all of the Lago slaves were upgraded to a new libvirt and
> >>>> kernel on Friday, while we still don't have a fix in lago-ost-plugin for
> >>>> that.
> >>>>
> >>>> I think there was a misunderstanding about what to upgrade, and it
> >>>> might have been understood that only the BIOS upgrade breaks it and not
> >>>> the kernel one.
> >>>>
> >>>> In any case, we're currently fixing the issue, either by downgrading
> >>>> the relevant pkgs on lago slaves or adding the mapping to new CPU types
> >>>> from OST.
> >>>>
> >>>> For the future, I suggest a few updates to maintenance work on Jenkins
> >>>> slaves (VMs or BM):
> >>>>
> >>>> 1. Let's avoid doing an upgrade close to a weekend (i.e. not on Thu-Sun),
> >>>> so the whole team can be around to help if needed or if something
> >>>> unexpected happens.
> >>>> 2. When we have a system-wide upgrade scheduled, like all BM slaves or
> >>>> VMs for a specific OS, let's adopt a gradual upgrade with a few days'
> >>>> window in between,
> >>>> e.g., if we need to upgrade all Lago slaves, let's upgrade 1-2 and
> >>>> wait to see that nothing breaks, and continue after we verify OST runs
> >>>> (either by watching CQ or by running manually).
> >>>>
> >>>>
> >>>> Thoughts?
> >>>>
> >>>>
> >>> We have a staging system - we should be using it for staging....
> >>>
> >>
> >> Do we have OST tests or a manual job available there?
> >>
> >
> > We can add them easily, or simply run Lago manually when needed.
> >
> >
> >> In any case, this doesn't contradict what I suggested; even if you test
> >> on staging, there could be differences from the production system, so we
> >> should take care when we upgrade regardless.
> >>
> >
> > Yes, but at least we'd know we green-lighted the new configuration - I'm
> > sure in this case we could have found at least some of the issues on
> > staging (like the fc27 issues, for example) and could have avoided expensive
> > production failures.
> >
> > Another point when scheduling an upgrade is to talk to the infra owner or the
> >> CI team and check whether we currently have a large queue in CQ or known
> >> failures, in which case it might be best to wait a bit until it's cleared.
> >>
> >>
> >
> >
> Adding infra-support so we can gather this info and prepare a maintenance
> / upgrade checklist to add to the oVirt infra docs.
> Let's continue the discussion and suggestions on that ticket.
> > --
> > Barak Korren
> > RHV DevOps team , RHCE, RHCi
> > Red Hat EMEA
> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
> >
> --
> Eyal edri
> MANAGER
> RHV DevOps
> EMEA VIRTUALIZATION R&D
> Red Hat EMEA <https://www.redhat.com/>
> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
[JIRA] (OVIRT-1856) Document procedure for infra upgrades
by eyal edri (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1856?page=com.atlassian.jir... ]
eyal edri updated OVIRT-1856:
-----------------------------
Summary: Document procedure for infra upgrades (was: Re: Change-queue job failures this weekend)
> Document procedure for infra upgrades
> --------------------------------------
>
> Key: OVIRT-1856
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
> Project: oVirt - virtualization made easy
> Issue Type: By-EMAIL
> Reporter: eyal edri
> Assignee: infra
>
[JIRA] (OVIRT-1856) Re: Change-queue job failures this weekend
by Daniel Belenky (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1856?page=com.atlassian.jir... ]
Daniel Belenky commented on OVIRT-1856:
---------------------------------------
As of now, [tester 5029|http://jenkins.ovirt.org/view/Change%20queue%20jobs/job/ovirt-master...] passed with 175 patches.
So for now, we know that there are no unknown regressions in the master repo.
> Re: Change-queue job failures this weekend
> ------------------------------------------
>
> Key: OVIRT-1856
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
> Project: oVirt - virtualization made easy
> Issue Type: By-EMAIL
> Reporter: eyal edri
> Assignee: infra
>
[JIRA] (OVIRT-1856) Re: Change-queue job failures this weekend
by eyal edri (oVirt JIRA)
eyal edri created OVIRT-1856:
--------------------------------
Summary: Re: Change-queue job failures this weekend
Key: OVIRT-1856
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
Project: oVirt - virtualization made easy
Issue Type: By-EMAIL
Reporter: eyal edri
Assignee: infra
Change-queue job failures this weekend
by Barak Korren
Hi,
We've seen a great deal of noise coming from the change queue this
weekend. While part of it is due to actual code regressions, some of
it was caused by two separate infra issues.
One issue we had was with building FC26 packages - it turns out that a
yum-incompatible update of the 'cmake' package was introduced to the
FC26 updates repo. Since for the time being we use 'yum' to set up the
mock environments, the build jobs for FC26 started failing.
This issue was actually reported to us [1].
To resolve this, we rolled back the FC26 mirror to a point in time before
the breaking change was introduced, and then re-triggered the merge events
for all the patches that failed building, to introduce passing builds into
the change queue.
The second issue had to do with the introduction of FC27 slaves - it
seems the slaves were misconfigured [2] and did not include very
basic packages like 'git' - this caused the CQ master job to simply
crash and stop queue processing.
To resolve this issue we disabled the FC27 slaves, resumed CQ
operation and then re-sent all changes that failed to be added into
the queue.
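A lightweight pre-flight check on new slaves could catch this class of
misconfiguration before a slave is enabled. Here is a minimal sketch; the
tool list is hypothetical, not the actual slave requirements:

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the subset of `required` commands not found on PATH."""
    return [tool for tool in required if which(tool) is None]


# Hypothetical minimal tool set for a CQ slave -- adjust to what the jobs
# actually invoke.
REQUIRED = ["git", "python3", "tar"]
# On a correctly provisioned slave, missing_tools(REQUIRED) returns [],
# and a non-empty result should block the slave from joining the pool.
```

Running such a check as a node health probe would have flagged the FC27
slaves instead of letting the CQ master job crash on them.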
We are in the final phases of integrating a new oVirt release, so
proper CQ operation is crucial at this time. Additionally, due to a
substantial amount of regressions introduced last week, the CQ
currently has a huge backlog of ~180 changes to work through; this
means that every bisection takes 8 OST runs, so we have no CQ minutes
to spare.
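The 8-run figure is consistent with binary bisection over the backlog: each
OST run halves the set of candidate changes, so isolating one bad change
among N takes about ceil(log2(N)) runs. A quick sanity check:

```python
import math

# ~180 changes queued; binary bisection halves the candidate set on each
# OST run, so isolating one bad change takes ceil(log2(N)) runs.
backlog = 180
runs = math.ceil(math.log2(backlog))
print(runs)  # -> 8, since 2**7 = 128 < 180 <= 256 = 2**8
```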
The FC27 slaves issue cost us 11 hours in which the CQ was not
running. It also manifested itself in failures of the
'standard-enqueue' job. These kinds of failures need to be handled
promptly or be avoided altogether.
Build failures can make the CQ waste time too, as it runs bisections
to detect and remove changes that fail to build. At this time, a
single failed build can waste up to 8 hours!
Let's try to be more careful about introducing infrastructure changes
to the system at sensitive times, and be more vigilant about failure
reports from Jenkins.
[1]: https://ovirt-jira.atlassian.net/browse/OVIRT-1854
[2]: https://ovirt-jira.atlassian.net/browse/OVIRT-1855
--
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted