[
https://ovirt-jira.atlassian.net/browse/OVIRT-2593?page=com.atlassian.jir...
]
Eyal Edri commented on OVIRT-2593:
----------------------------------
[~ederevea] I believe we've solved some of the issues here, and some are still in progress, e.g.:
We've identified the source of the slowness in the UI: it's a memory leak in the SSE
plugin that Blue Ocean uses.
[~dbelenky(a)redhat.com] please add a link to the ticket that covers it.
We've also applied JVM improvements to the master and limited the session timeout (it
was unlimited until now).
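For reference, a session-timeout limit like the one mentioned is typically applied through Jenkins' startup arguments. A minimal sketch, assuming a packaged Jenkins on Linux; the 60-minute value and the file path are illustrative, not the values actually deployed:

```shell
# /etc/sysconfig/jenkins (or the equivalent systemd override file).
# Winstone's --sessionTimeout takes minutes; without it, sessions
# can effectively live forever and pile up in the JVM heap.
JENKINS_ARGS="--sessionTimeout=60"
```

After changing this, the Jenkins service has to be restarted for the new argument to take effect.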
We're also working on splitting the KubeVirt Jenkins off so it is independent and no longer
shared with oVirt; this is tracked on another ticket, and [~bkorren(a)redhat.com] can add links.
We are also planning to add monitoring, hopefully soon; [~ederevea] please add a link to
the card tracking it.
As for the flexibility of the project, we're doing our best with the very limited
resources and the small number of developers available to contribute.
That said, we have staging systems, and we try to add tests for any new code we
introduce, including testing on staging.
How does stdci prevent regressions and proactively monitor the
cluster?
-----------------------------------------------------------------------
Key: OVIRT-2593
URL:
https://ovirt-jira.atlassian.net/browse/OVIRT-2593
Project: oVirt - virtualization made easy
Issue Type: Improvement
Reporter: Roman Mohr
Assignee: infra
We want to go one step further with KubeVirt and sooner or later only merge
when the tests are green (automatically).
Therefore we want to ensure that this CI system is the right system for us
and can be properly scaled, developed and operated.
Apart from requirements like automatically re-running tests and a merge pool, the
stability and QoS of the CI system are interesting for us.
Some examples:
* Sometimes jobs break with a system error shown in the logs (is that
monitored and worked on?)
* Sometimes things like "out-of-disk-space" show up. Is e.g. disk
utilization proactively handled?
* We had one issue where the Docker installation was broken in a
build slot and all jobs there failed fast. As a consequence, all following builds
were scheduled there too. Is something like that monitored?
* We repeatedly have issues connecting to Jenkins. It is extremely slow
(not just Blue-Ocean-slow, really slow). Are such things monitored, alarms
raised, and countermeasures taken?
* That has not happened for a while, but bare-metal machines without
KVM nesting were repeatedly added to the cluster. Are there measures in
place to prevent such regressions, where the same issue happens multiple
times?
* How is the flexibility of the project ensured? Is it also tested and
maintained in a sane fashion to allow proper evolution over time? Automated
tests? Offline testing of changes? And so on ...
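The proactive disk-utilization check asked about above could be as simple as a periodically run script on each node. A minimal sketch, assuming GNU coreutils; the 90% threshold and the monitored path are assumptions, not what the oVirt CI actually runs:

```shell
#!/bin/sh
# Hypothetical disk check for a Jenkins node: warn before
# "out-of-disk-space" errors start breaking builds.
THRESHOLD=90
# df --output=pcent prints the use% column; strip the header, spaces and '%'.
USAGE=$(df --output=pcent / | tail -n 1 | tr -d ' %')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "ALERT: disk usage on / is ${USAGE}% (>= ${THRESHOLD}%)" >&2
else
    echo "OK: disk usage on / is ${USAGE}%"
fi
```

Hooked into cron or a monitoring agent, a check like this would surface the disk-pressure and broken-slot situations described above before they fail every scheduled build.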
--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100096)