[JIRA] (OVIRT-1857) Please create ovirt-ansible-dpdk-setup github repository
by Ondra Machacek (oVirt JIRA)
Ondra Machacek created OVIRT-1857:
-------------------------------------
Summary: Please create ovirt-ansible-dpdk-setup github repository
Key: OVIRT-1857
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1857
Project: oVirt - virtualization made easy
Issue Type: Task
Reporter: Ondra Machacek
Assignee: infra
Please create the ovirt-ansible-dpdk-setup github repository,
and please assign igoihman as administrator.
Thanks.
[~igoihman(a)redhat.com]
[~mwperina]
--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100076)
[JIRA] (OVIRT-1856) Document procedure for infra upgrades
by eyal edri (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1856?page=com.atlassian.jir... ]
eyal edri updated OVIRT-1856:
-----------------------------
Priority: High (was: Medium)
> Document procedure for infra upgrades
> --------------------------------------
>
> Key: OVIRT-1856
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
> Project: oVirt - virtualization made easy
> Issue Type: By-EMAIL
> Reporter: eyal edri
> Assignee: infra
> Priority: High
>
> On Sun, Jan 21, 2018 at 1:01 PM, Barak Korren <bkorren(a)redhat.com> wrote:
> >
> >
> > On 21 January 2018 at 12:50, Eyal Edri <eedri(a)redhat.com> wrote:
> >
> >>
> >>
> >> On Sun, Jan 21, 2018 at 12:47 PM, Barak Korren <bkorren(a)redhat.com>
> >> wrote:
> >>
> >>>
> >>>
> >>> On 21 January 2018 at 12:39, Eyal Edri <eedri(a)redhat.com> wrote:
> >>>
> >>>> There is another issue, which is currently failing all CQ, and it's
> >>>> related to the new IBRS CPU model.
> >>>> It looks like all of the Lago slaves were upgraded to a new libvirt and
> >>>> kernel on Friday, while we still don't have a fix in lago-ost-plugin for
> >>>> that.
> >>>>
> >>>> I think there was a misunderstanding about what to upgrade, and it
> >>>> might have been understood that only the BIOS upgrade breaks it and not
> >>>> the kernel one.
> >>>>
> >>>> In any case, we're currently fixing the issue, either by downgrading
> >>>> the relevant pkgs on lago slaves or adding the mapping to new CPU types
> >>>> from OST.
> >>>>
> >>>> For the future, I suggest a few updates to maintenance work on Jenkins
> >>>> slaves (VMs or BM):
> >>>>
> >>>> 1. Let's avoid doing an upgrade close to a weekend (i.e. not on Thu-Sun),
> >>>> so the whole team can be around to help if needed or if something
> >>>> unexpected happens.
> >>>> 2. When we have a system-wide upgrade scheduled, like all BM slaves or
> >>>> VMs for a specific OS, let's adopt a gradual upgrade with a few days'
> >>>> window in between,
> >>>> e.g., if we need to upgrade all Lago slaves, let's upgrade 1-2 and
> >>>> wait to see that nothing breaks, and continue after we verify OST runs
> >>>> (either by watching CQ or by running manually).
> >>>>
> >>>>
> >>>> Thoughts?
> >>>>
> >>>>
> >>> We have a staging system - we should be using it for staging....
> >>>
> >>
> >> Do we have OST tests or a manual job available there?
> >>
> >
> > We can add them easily, or simply run Lago manually when needed.
> >
> >
> >> In any case, this doesn't contradict what I suggested; even if you test
> >> on staging, there could be differences from the production system, so we
> >> should take care when we upgrade regardless.
> >>
> >
> > Yes, but at least we'd know we green-lighted the new configuration - I'm
> > sure in this case we could have found at least some of the issues on
> > staging (like the fc27 issues, for example) and could have avoided expensive
> > production failures.
> >
> > Another point when scheduling an upgrade is to talk to the infra owner or the
> >> CI team and check whether we currently have a large queue in CQ or known
> >> failures, in which case it might be best to wait a bit until it's cleared.
> >>
> >>
> >
> >
> Adding infra-support so we can gather this info and prepare a maintenance
> / upgrade checklist to add to the oVirt infra docs.
> Let's continue the discussion and suggestions on that ticket.
> > --
> > Barak Korren
> > RHV DevOps team , RHCE, RHCi
> > Red Hat EMEA
> > redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
> >
> --
> Eyal edri
> MANAGER
> RHV DevOps
> EMEA VIRTUALIZATION R&D
> Red Hat EMEA <https://www.redhat.com/>
> <https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
[JIRA] (OVIRT-1856) Document procedure for infra upgrades
by eyal edri (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1856?page=com.atlassian.jir... ]
eyal edri updated OVIRT-1856:
-----------------------------
Summary: Document procedure for infra upgrades (was: Re: Change-queue job failures this weekend)
> Document procedure for infra upgrades
> --------------------------------------
>
> Key: OVIRT-1856
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
> Project: oVirt - virtualization made easy
> Issue Type: By-EMAIL
> Reporter: eyal edri
> Assignee: infra
>
[JIRA] (OVIRT-1856) Re: Change-queue job failures this weekend
by Daniel Belenky (oVirt JIRA)
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1856?page=com.atlassian.jir... ]
Daniel Belenky commented on OVIRT-1856:
---------------------------------------
As of now, [tester 5029|http://jenkins.ovirt.org/view/Change%20queue%20jobs/job/ovirt-master...] passed with 175 patches.
So for now, we know that there are no unknown regressions in the master repo.
> Re: Change-queue job failures this weekend
> ------------------------------------------
>
> Key: OVIRT-1856
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
> Project: oVirt - virtualization made easy
> Issue Type: By-EMAIL
> Reporter: eyal edri
> Assignee: infra
>
[JIRA] (OVIRT-1856) Re: Change-queue job failures this weekend
by eyal edri (oVirt JIRA)
eyal edri created OVIRT-1856:
--------------------------------
Summary: Re: Change-queue job failures this weekend
Key: OVIRT-1856
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1856
Project: oVirt - virtualization made easy
Issue Type: By-EMAIL
Reporter: eyal edri
Assignee: infra
Change-queue job failures this weekend
by Barak Korren
Hi,
We've seen a great deal of noise coming from the change queue this
weekend. While part of it is due to actual code regressions, some of
it was caused by two separate infra issues.
One issue we had was with building FC26 packages - it turns out that a
yum-incompatible update of the 'cmake' package was introduced to the
FC26 updates repo. Since for the time being we use 'yum' to set up the
mock environments, the build jobs for FC26 started failing.
This issue was actually reported to us [1].
To resolve this, we rolled back the FC26 mirror to a point in time before
the breaking change was introduced, and then re-triggered the merge events
for all the patches that failed building, to introduce passing builds into
the change queue.
The second issue had to do with the introduction of FC27 slaves - it
seems the slaves were misconfigured [2] and did not include very
basic packages like 'git' - this caused the CQ master job to simply
crash and stop queue processing.
To resolve this issue we disabled the FC27 slaves, resumed CQ
operation and then re-sent all changes that failed to be added into
the queue.
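A lightweight pre-flight check on new slaves could catch this class of
misconfiguration before a slave is enabled. Here is a minimal sketch; the
tool list is hypothetical, not the actual slave requirements:

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the subset of `required` commands not found on PATH."""
    return [tool for tool in required if which(tool) is None]


# Hypothetical minimal tool set for a CQ slave -- adjust to what the jobs
# actually invoke.
REQUIRED = ["git", "python3", "tar"]
# On a correctly provisioned slave, missing_tools(REQUIRED) returns [],
# and a non-empty result should block the slave from joining the pool.
```

Running such a check as a node health probe would have flagged the FC27
slaves instead of letting the CQ master job crash on them.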
We are in the final phases of integrating a new oVirt release, so
proper CQ operation is crucial at this time. Additionally, due to a
substantial amount of regressions introduced last week, the CQ
currently has a huge backlog of ~180 changes to work through; this
means that every bisection takes 8 OST runs, so we have no CQ minutes
to spare.
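The 8-run figure is consistent with binary bisection over the backlog: each
OST run halves the set of candidate changes, so isolating one bad change
among N takes about ceil(log2(N)) runs. A quick sanity check:

```python
import math

# ~180 changes queued; binary bisection halves the candidate set on each
# OST run, so isolating one bad change takes ceil(log2(N)) runs.
backlog = 180
runs = math.ceil(math.log2(backlog))
print(runs)  # -> 8, since 2**7 = 128 < 180 <= 256 = 2**8
```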
The FC27 slaves issue cost us 11 hours in which the CQ was not
running. It also manifested itself in failures of the
'standard-enqueue' job. These kinds of failures need to be handled
promptly or be avoided altogether.
Build failures can make the CQ waste time too, as it runs bisections
to detect and remove changes that fail to build. At this time, a
single failed build can waste up to 8 hours!
Let's try to be more careful about introducing infrastructure changes
to the system at sensitive times, and be more vigilant about failure
reports from Jenkins.
[1]: https://ovirt-jira.atlassian.net/browse/OVIRT-1854
[2]: https://ovirt-jira.atlassian.net/browse/OVIRT-1855
--
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted