[JIRA] (OVIRT-919) Fwd: CI slaves extremely slow - overloaded slaves?

8 Dec 2016

      [ https://ovirt-jira.atlassian.net/browse/OVIRT-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=24101#comment-24101 ] 

Barak Korren commented on OVIRT-919:
------------------------------------

{quote}
  ... snip ...
17:19:49 1.105877 seconds

This test runs in 0.048 seconds on my laptop:
  ... snip ...
Ran 4 tests in 0.189s
{quote}

[~nsoffer@redhat.com] Let's make sure we're comparing apples to apples first. Did you run this using "{{mock_runner.sh}}" on your laptop? Perhaps try this in a VM on your laptop?

{quote}
It seems that we are overloading the CI slaves. We should not use nested kvm
for the CI, such vms are much slower than regular vms, and we probably run
too many vms per cpu.
{quote}

Not sure where nested kvm comes into play here, "{{check_patch.sh}}" runs on plain VMs. 
(As an OT side-note: Are you sure nested VMs should be slower CPU-wise? AFAIK those are just VMs with nested I/O routing... it may be interesting to check this with Lago...)

WRT "overloaded slaves" - that can hardly happen, no slave ever runs more than one job at a time.
The hypervisors themselves may be loaded, we'll try to gather some more data here:
[~ngoldin@redhat.com] Do we have hypervisor load graphs in Graphite?
[~ederevea] What is our vm/core ratio? Are we running oVirt DWH? Maybe it can help us here...
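
Until we have graphs, one rough way to check from inside a slave whether its hypervisor is oversubscribed is the "steal" counter on the aggregate {{cpu}} line of /proc/stat. A minimal sketch, assuming a Linux guest and Python 3; the helper names ({{parse_cpu_line}}, {{steal_fraction}}, {{sample_steal}}) are made up for illustration and are not part of any of our scripts:

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
# The trailing guest/guest_nice counters are skipped on purpose, since guest
# time is already accounted for in user/nice.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of jiffies."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def steal_fraction(before, after):
    """Return the share of CPU time between two samples that was stolen."""
    delta = {name: after[name] - before[name] for name in before}
    total = sum(delta.values())
    return delta["steal"] / total if total else 0.0


def sample_steal(interval=5.0, path="/proc/stat"):
    """Sample /proc/stat twice, 'interval' seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return steal_fraction(before, after)
```

Calling {{sample_steal()}} on a slave while a job runs would give a number between 0 and 1; a value that stays well above zero would suggest the vCPUs are waiting for physical cores. This is only a hint, not a replacement for real hypervisor load data.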

{quote}
We can disable such tests in the CI, but we do want to know when there is
a regression in this code. Before it was fixed, the same test took 9 seconds
on my laptop. We need fast machines in the CI for this.
{quote}

As far as machines go, our upstream hardware is quite beefy these days, so let's not jump to any conclusions. More data is needed. What is this test actually doing? Is this pure CPU, or is there some I/O involved here as well?
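
One way to get at the CPU-versus-I/O question is to record both wall-clock time and process CPU time around the code under test. A rough sketch, assuming Python 3; {{measure}}, {{busy_loop}} and {{sleepy}} are hypothetical stand-ins, not vdsm code:

```python
import time


def measure(func, *args, **kwargs):
    """Run func once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def busy_loop(n):
    """Stand-in for CPU-bound work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sleepy(seconds):
    """Stand-in for work that mostly waits (for example on I/O)."""
    time.sleep(seconds)


def report(name, func, *args):
    """Print wall and CPU time for one call of func."""
    wall, cpu = measure(func, *args)
    print("%s: wall=%.3fs cpu=%.3fs" % (name, wall, cpu))
```

If the CPU time is close to the wall time, the work is CPU-bound; a large gap means the process spent most of its time waiting, either on I/O or for the scheduler. Running the same measurement on a CI slave and on a laptop would show which of the two numbers actually differs.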

[~rgolan@redhat.com] Are there any oVirt features that can help us here? Higher priority VMs? Realtime? Better scheduling?
...
Fwd: CI slaves extremely slow - overloaded slaves?
--------------------------------------------------
                Key: OVIRT-919
                URL: https://ovirt-jira.atlassian.net/browse/OVIRT-919
            Project: oVirt - virtualization made easy
         Issue Type: By-EMAIL
         Components: Jenkins
           Reporter: Barak Korren
           Assignee: infra
             Labels: hypervisors, performance, slaves, standard-ci
From: Nir Soffer <nsoffer@redhat.com>
Date: 7 December 2016 at 21:33
Subject: CI slaves extremely slow - overloaded slaves?
To: infra <infra@ovirt.org>, Eyal Edri <eedri@redhat.com>, Dan Kenigsberg <danken@redhat.com>
Hi all,
In recent weeks we have been seeing more and more test failures due to timeouts in the CI.
For example:
17:19:49 ======================================================================
17:19:49 FAIL: test_scale (storage_filesd_test.GetAllVolumesTests)
17:19:49 ----------------------------------------------------------------------
17:19:49 Traceback (most recent call last):
17:19:49   File "/home/jenkins/workspace/vdsm_master_check-patch-fc24-x86_64/vdsm/tests/storage_filesd_test.py", line 165, in test_scale
17:19:49     self.assertTrue(elapsed < 1.0, "Elapsed time: %f seconds" % elapsed)
17:19:49 AssertionError: Elapsed time: 1.105877 seconds
17:19:49 -------------------- >> begin captured stdout << ---------------------
17:19:49 1.105877 seconds
This test runs in 0.048 seconds on my laptop:
$ ./run_tests_local.sh storage_filesd_test.py -s
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
storage_filesd_test.GetAllVolumesTests
    test_no_templates                                           OK
    test_no_volumes                                             OK
    test_scale                                                  0.047932 seconds
OK
    test_with_template                                          OK
----------------------------------------------------------------------
Ran 4 tests in 0.189s
It seems that we are overloading the CI slaves. We should not use nested kvm
for the CI, such vms are much slower than regular vms, and we probably run
too many vms per cpu.
We can disable such tests in the CI, but we do want to know when there is
a regression in this code. Before it was fixed, the same test took 9 seconds
on my laptop. We need fast machines in the CI for this.
Nir

Barak Korren (oVirt JIRA)