
Hi all,

Watching vdsm Travis builds in the last weeks, it is clear that vdsm tests on Travis are about 4X faster than the Jenkins builds.

Here is a typical build:

ovirt ci: http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5101/consol...
travis ci: https://travis-ci.org/nirs/vdsm/builds/179056079

The build took 4:34 on Travis and 19:34 on oVirt CI.

This has a huge impact on vdsm maintainers. Having to wait 20 minutes for each patch means that we must ignore the CI and merge, hoping that earlier test runs without rebasing on master were good enough.

The builds are mostly the same, except:

- In Travis we don't check if the build system was changed and packages should be built. This takes 9:18 minutes in oVirt CI.
- In Travis we don't clean or install anything before the test; we use a container with all the available packages, pulled from Docker Hub. This takes about 3:52 minutes in oVirt CI.
- In Travis we don't enable coverage. Running the tests with coverage may slow down the tests. This takes 5:04 minutes in oVirt CI; creating the coverage report takes only 15 seconds, not interesting.
- In Travis we don't clean up anything after the test. This takes 34 seconds in oVirt CI.

The biggest problem is the build system check taking 9:18 minutes. Fixing it will cut the build time in half.

This is how time is spent in oVirt CI:

1. Starting (1:28)
   00:00:00.001 Triggered by Gerrit: https://gerrit.ovirt.org/67338
   00:00:00.039 [EnvInject] - Loading node environment variables.
   00:00:00.056 Building remotely on fc24-vm15.phx.ovirt.org (phx nested local_disk fc24) in workspace /home/jenkins/workspace/vdsm_master_check-patch-fc24-x86_64

2. Installing packages (2:24)
   00:01:28.338 INFO: installing package(s): autoconf automake gdb git libguestfs-tools-c libselinux-python3 libvirt-python3 m2crypto make mom openvswitch policycoreutils-python PyYAML python-blivet python-coverage python2-decorator python-devel python-inotify python-ioprocess python-mock python-netaddr python-pthreading python-setuptools python-six python-requests python3-decorator python3-netaddr python3-nose python3-six python3-yaml rpm-build sanlock-python sudo yum yum-utils

3. Setup in check-patch.sh (00:04)
   00:03:52.838 + export VDSM_AUTOMATION=1

4. Running the actual tests
   00:03:56.670 + ./autogen.sh --system --enable-hooks --enable-vhostmd

5. Installing python-debuginfo (00:19)
   00:04:15.385 Yum-utils package has been deprecated, use dnf instead.

6. Running make check (05:04)
   00:04:30.948 + TIMEOUT=600
   00:04:30.948 + make check NOSE_WITH_COVERAGE=1 NOSE_COVER_PACKAGE=/home/jenkins/workspace/vdsm_master_check-patch-fc24-x86_64/vdsm/vdsm,/home/jenkins/workspace/vdsm_master_check-patch-fc24-x86_64/vdsm/lib
   ...
   00:09:33.981 tests: commands succeeded
   00:09:33.981 congratulations :)

7. Creating coverage report (00:15)
   00:09:34.017 + coverage html -d /home/jenkins/workspace/vdsm_master_check-patch-fc24-x86_64/vdsm/exported-artifacts/htmlcov

8. Finding modified files (09:18)
   00:09:49.213 + git diff-tree --no-commit-id --name-only -r HEAD
   00:09:49.213 + egrep --quiet 'vdsm.spec.in|Makefile.am'

9. Cleaning up (00:27)
   00:19:07.994 Took 915 seconds
   00:19:34.973 Finished: SUCCESS

On 3 December 2016 at 21:36, Nir Soffer <nsoffer@redhat.com> wrote:
Hi all,
Watching vdsm travis builds in the last weeks, it is clear that vdsm tests on travis are about 4X faster than the jenkins builds.
Here is a typical build:
ovirt ci: http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5101/consol...
travis ci: https://travis-ci.org/nirs/vdsm/builds/179056079
The build took 4:34 on travis, and 19:34 on ovirt ci.
Interesting, thanks for looking at this!
This has a huge impact on vdsm maintainers. Having to wait 20 minutes for each patch means that we must ignore the ci and merge, hoping that earlier test runs without rebasing on master were good enough.
The builds are mostly the same, except:
- In travis we don't check if the build system was changed and packages should be built. This takes 9:18 minutes in ovirt ci.
Well, I guess the infra team can't help with that, but still, is there anything we could do at the infrastructure level to speed this up?
- In travis we don't clean or install anything before the test; we use a container with all the available packages, pulled from dockerhub. This takes about 3:52 minutes in ovirt ci.
Well, I guess this is where we (the infra team) should pay attention. We do have a plan to switch from mock to Docker at some point (OVIRT-873 [1]), but it'll take a while until we can make such a large switch. In the meantime there may be some low-hanging fruit we can pick to make things faster. Looking at the same log:

16:03:28 Init took 77 seconds
16:05:50 Install packages took 142 seconds

We may be able to speed those up - looking at the way mock is configured, we may be running with its caches turned off (I'm not yet 100% sure about this - mock_runner.sh is not the simplest script...). I've created OVIRT-902 [2] for us to look at this.
- In travis we don't enable coverage. Running the tests with coverage may slow down the tests. This takes 5:04 minutes in ovirt ci; creating the coverage report takes only 15 seconds, not interesting.
We can easily check this by just sending a patch with coverage turned on and then sending another patch set for the same patch with coverage turned off.
- In travis we don't clean up anything after the test. This takes 34 seconds in ovirt ci.
We can look at speeding this up - or perhaps just change things so that results are reported as soon as check_patch.sh is done as opposed to when the Jenkins job is done. There may be some pitfalls here so I need to think a little more before I recommend going down this path.
The biggest problem is the build system check taking 9:18 minutes. Fixing it will cut the build time in half.
Please try fixing that, or maybe this should just move to build_artifacts.sh?

[1]: https://ovirt-jira.atlassian.net/browse/OVIRT-873
[2]: https://ovirt-jira.atlassian.net/browse/OVIRT-902

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On Sun, Dec 4, 2016 at 9:59 AM, Barak Korren <bkorren@redhat.com> wrote:
On 3 December 2016 at 21:36, Nir Soffer <nsoffer@redhat.com> wrote:
Hi all,
Watching vdsm travis builds in the last weeks, it is clear that vdsm tests on travis are about 4X faster than the jenkins builds.
Here is a typical build:
ovirt ci: http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5101/consol...
travis ci: https://travis-ci.org/nirs/vdsm/builds/179056079
The build took 4:34 on travis, and 19:34 on ovirt ci.
Interesting, thanks for looking at this!
This has a huge impact on vdsm maintainers. Having to wait 20 minutes for each patch means that we must ignore the ci and merge, hoping that earlier test runs without rebasing on master were good enough.
The builds are mostly the same, except:
- In travis we don't check if the build system was changed and packages should be built. This takes 9:18 minutes in ovirt ci.
Well, I guess the infra team can't help with that, but still, is there anything we could do at the infrastructure level to speed this up?
The line taking 9 minutes is:

    if git diff-tree --no-commit-id --name-only -r HEAD | \
        egrep --quiet 'vdsm.spec.in|Makefile.am'; then
        ./automation/build-artifacts.sh
        yum -y install "$EXPORT_DIR/"!(*.src).rpm
    fi

In the build above the condition is false, so we do not run build-artifacts.sh or install the packages. The time is spent in:

    git diff-tree --no-commit-id --name-only -r HEAD | egrep --quiet 'vdsm.spec.in|Makefile.am'

Running locally:

    $ time git diff-tree --no-commit-id --name-only -r HEAD | egrep --quiet 'vdsm.spec.in|Makefile.am'

    real 0m0.009s
    user 0m0.006s
    sys  0m0.009s

To debug this we need to get a shell on a jenkins slave with the exact environment of a running job.
- In travis we don't clean or install anything before the test; we use a container with all the available packages, pulled from dockerhub. This takes about 3:52 minutes in ovirt ci.
Well, I guess this is where we (the infra team) should pay attention. We do have a plan to switch from mock to Docker at some point (OVIRT-873 [1]), but it'll take a while until we can make such a large switch.
In the meantime there may be some low-hanging fruit we can pick to make things faster. Looking at the same log:
16:03:28 Init took 77 seconds
16:05:50 Install packages took 142 seconds
We may be able to speed those up - looking at the way mock is configured, we may be running with its caches turned off (I'm not yet 100% sure about this - mock_runner.sh is not the simplest script...). I've created OVIRT-902 [2] for us to look at this.
- In travis we don't enable coverage. Running the tests with coverage may slow down the tests. This takes 5:04 minutes in ovirt ci; creating the coverage report takes only 15 seconds, not interesting.
We can easily check this by just sending a patch with coverage turned on and then sending another patch set for the same patch with coverage turned off.
- In travis we don't clean up anything after the test. This takes 34 seconds in ovirt ci.
We can look at speeding this up - or perhaps just change things so that results are reported as soon as check_patch.sh is done as opposed to when the Jenkins job is done. There may be some pitfalls here so I need to think a little more before I recommend going down this path.
The biggest problem is the build system check taking 9:18 minutes. Fixing it will cut the build time in half.
Please try fixing that, or maybe this should just move to build_artifacts.sh?
[1]: https://ovirt-jira.atlassian.net/browse/OVIRT-873 [2]: https://ovirt-jira.atlassian.net/browse/OVIRT-902
--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

To debug this we need to get a shell on a jenkins slave with the exact environment of a running job.
Perhaps try to check if this reproduces with mock_runner.sh. You can try running it with something like:

    JENKINS=<where you cloned the jenkins repo>
    cd <vdsm sources>
    $JENKINS/mock_configs/mock_runner.sh --patch-only \
        --mock-confs-dir $JENKINS/mock_configs "fc24.*x86_64"

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On Sun, Dec 4, 2016 at 1:35 PM, Barak Korren <bkorren@redhat.com> wrote:
To debug this we need to get a shell on a jenkins slave with the exact environment of a running job.
Perhaps try to check if this reproduces with mock_runner.sh.
You can try running it with something like:
    JENKINS=<where you cloned the jenkins repo>
    cd <vdsm sources>
    $JENKINS/mock_configs/mock_runner.sh --patch-only \
        --mock-confs-dir $JENKINS/mock_configs "fc24.*x86_64"
I could reproduce this issue locally.

Turns out that it was faulty test timeout code in the vdsm test runner. We did not terminate a sleep child process when the tests finished, and mock was waiting until the sleep child process terminated.

This patch should fix the issue: https://gerrit.ovirt.org/67799

Here are successful builds:
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5765/consol...
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4155/console

Note that these builds also include make rpm and the install check, since the patch changed a makefile. This adds about 2 minutes, but most patches do not trigger this check.

Nir
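The actual fix is in the Gerrit patch above; as a hedged illustration of the bug class described here (not the real vdsm runner code, and the names below are invented), a watchdog that spawns a sleep child must reap it when the tests finish, otherwise a wrapper that waits for every child process - as mock does - blocks until the full timeout elapses:

```python
import subprocess

# Hypothetical sketch, not the actual vdsm test runner. A watchdog
# spawns "sleep TIMEOUT" so the run can be aborted if tests hang; if
# the sleep child is left running after the tests finish, anything
# waiting on all children (as mock does) waits out the whole timeout.

TIMEOUT = 600  # same value the CI log shows for make check


def run_tests_with_watchdog(test_cmd):
    watchdog = subprocess.Popen(["sleep", str(TIMEOUT)])
    try:
        return subprocess.call(test_cmd)
    finally:
        # The fix: always terminate and reap the watchdog child when
        # the tests are done, instead of leaving it to expire.
        watchdog.terminate()
        watchdog.wait()


rc = run_tests_with_watchdog(["true"])
```

With the `finally` block in place the wrapper returns as soon as the tests do; without it, this sketch would keep a `sleep 600` child alive for ten minutes after a four-second test run.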

On Sun, Dec 4, 2016 at 11:56 PM, Nir Soffer <nsoffer@redhat.com> wrote:
On Sun, Dec 4, 2016 at 1:35 PM, Barak Korren <bkorren@redhat.com> wrote:
To debug this we need to get a shell on a jenkins slave with the exact environment of a running job.
Perhaps try to check if this reproduces with mock_runner.sh.
You can try running it with something like:
    JENKINS=<where you cloned the jenkins repo>
    cd <vdsm sources>
    $JENKINS/mock_configs/mock_runner.sh --patch-only \
        --mock-confs-dir $JENKINS/mock_configs "fc24.*x86_64"
I could reproduce this issue locally.
Turns out that it was faulty test timeout code in the vdsm test runner. We did not terminate a sleep child process when the tests finished, and mock was waiting until the sleep child process terminated.
This patch should fix the issue: https://gerrit.ovirt.org/67799
Here are successful builds:
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5765/consol...
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4155/console
Note that these builds also include make rpm and the install check, since the patch changed a makefile. This adds about 2 minutes, but most patches do not trigger this check.
Here are builds that do not change the build system:
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5767/consol...: 10:07
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4157/console: 10:16

So we are about 2X faster now.

Nir

Here are builds that do not change the build system:
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5767/consol...: 10:07
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4157/console: 10:16
So we are about 2X faster now.
Awesome! Also for fc24:

22:11:22 Init took 73 seconds
22:13:45 Install packages took 143 seconds

So 3m 36s; our pending patches can probably bring that down to around 20s. That will get us to around 7m...

Maybe we could shave some more seconds off by optimizing the git clone and making some of the cleanups happen less frequently. (It seems we spend 16s total outside of mock_runner.sh, so perhaps not much to gain there.)

So, any more ideas where we can get an extra 2-3m? Things we didn't try yet:

1. Ensure all downloads happen through the proxy (there is a patch pending, but some tweaking in check_patch.sh may be needed as well)
2. Run mock in tmpfs (it has a plugin for that)
3. Avoid setting some FS attributes on files (mock is configured for that, but we don't install the OS package needed to make that actually work)

Not sure any of the above will provide significant gains, though.

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/
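For item 2, mock's tmpfs plugin is switched on from the mock configuration file, which is itself a Python fragment. A minimal sketch - the size and RAM threshold here are illustrative assumptions, not values tested on the oVirt slaves:

```python
# Sketch of enabling mock's tmpfs plugin in a mock config file. Mock
# itself provides config_opts to the config; it is defined here only so
# the sketch is self-contained. Sizes below are assumptions, not tuned.
config_opts = {'plugin_conf': {}}  # provided by mock in a real config

config_opts['plugin_conf']['tmpfs_enable'] = True
config_opts['plugin_conf']['tmpfs_opts'] = {
    'required_ram_mb': 4096,   # only use tmpfs on slaves with enough RAM
    'max_fs_size': '4g',       # cap the tmpfs mount size
    'mode': '0755',
    'keep_mounted': False,     # unmount (and discard) after the build
}
```

Keeping the chroot in RAM mainly helps the package-install and cleanup phases, at the cost of memory on the slave.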

On Mon, Dec 5, 2016 at 9:59 AM, Barak Korren <bkorren@redhat.com> wrote:
Here are builds that do not change the build system: - http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5767/consol...: 10:07 - http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4157/console: 10:16
So we are about 2X faster now.
Awesome! Also for fc24:
22:11:22 Init took 73 seconds
22:13:45 Install packages took 143 seconds
So 3m 36s, our pending patches can probably bring that down to around 20s.
20 seconds setup sounds great. Can we try the patches with vdsm builds?
That will get us to around 7m... Maybe we could shave some more seconds off by optimizing the git clone and making some of the cleanups happen less frequently. (It seems we spend 16s total outside of mock_runner.sh, so perhaps not much to gain there).
So, any more ideas where we can get an extra 2-3m?
Things we didn't try yet:

1. Ensure all downloads happen through the proxy (there is a patch pending, but some tweaking in check_patch.sh may be needed as well)
2. Run mock in tmpfs (it has a plugin for that)
3. Avoid setting some FS attributes on files (mock is configured for that, but we don't install the OS package needed to make that actually work)
Not sure any of the above will provide significant gains, though.
--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On 5 December 2016 at 10:07, Nir Soffer <nsoffer@redhat.com> wrote:
20 seconds setup sounds great.
Can we try the patches with vdsm builds?
We'll probably merge today; if not, I'll manually cherry-pick this for the vdsm jobs.

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On Mon, Dec 5, 2016 at 10:11 AM, Barak Korren <bkorren@redhat.com> wrote:
On 5 December 2016 at 10:07, Nir Soffer <nsoffer@redhat.com> wrote:
20 seconds setup sounds great.
Can we try the patches with vdsm builds?
We'll probably merge today, if not, I'll manually cherry-pick this for the vdsm jobs.
With all patches merged, builds now take:

http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4373/console
00:07:13.065 Finished: SUCCESS

http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5991/consol...
00:08:19.035 Finished: SUCCESS

Last week it was 19-20 minutes - a great improvement!

What is missing now is a small improvement in the logs - a log line for each part of the build with the time it took:

- setup
- run script
- cleanup

The run script part is something that only project maintainers can optimize; the rest can be optimized only by CI maintainers.

I think we should have a metrics collection system and keep these times so we can detect regressions and improvements easily. But the first step is measuring the time.

Nir
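The kind of per-phase timing log suggested above can be sketched with a small helper - this is a hypothetical illustration, not existing CI code, and the phase names just mirror the list in the message:

```python
import time
from contextlib import contextmanager

# Hypothetical sketch of per-phase timing for a CI wrapper, in the
# spirit of the suggestion above; not actual mock_runner.sh code.


@contextmanager
def phase(name, report):
    start = time.monotonic()
    try:
        yield
    finally:
        report[name] = time.monotonic() - start
        print("%s took %d seconds" % (name, report[name]))


times = {}
with phase("setup", times):
    pass  # install packages, init mock, etc.
with phase("run script", times):
    pass  # run check-patch.sh
with phase("cleanup", times):
    pass  # remove mock root, collect artifacts

# "times" could then be shipped to a metrics system, so regressions
# in any single phase show up without reading the console log.
```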

On 7 December 2016 at 14:15, Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Dec 5, 2016 at 10:11 AM, Barak Korren <bkorren@redhat.com> wrote:
On 5 December 2016 at 10:07, Nir Soffer <nsoffer@redhat.com> wrote:
20 seconds setup sounds great.
Can we try the patches with vdsm builds?
We'll probably merge today, if not, I'll manually cherry-pick this for the vdsm jobs.
With all patches merged, builds take now:
http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4373/console 00:07:13.065 Finished: SUCCESS
http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5991/consol... 00:08:19.035 Finished: SUCCESS
Last week it was 19-20 minutes - a great improvement!
What is missing now is a small improvement in the logs - a log line for each part of the build with the time it took:
- setup
- run script
- cleanup
The run script part is something that only project maintainers can optimize; the rest can be optimized only by CI maintainers.
WRT times - mock_runner.sh does print those out on the output page (surrounded by many asterisks and other symbols...).

WRT separation of logs - we already mostly have that; you can see individual step logs in the job artifacts. For example, for one of the jobs above:

http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5991/artifa...

But yeah, we could probably make the output on the main Jenkins output page better. That would require potentially breaking changes to mock_runner.sh, so I'd rather focus on ultimately replacing it... We already have https://ovirt-jira.atlassian.net/browse/OVIRT-682 for discussing this issue.
I think we should have a metrics collection system and keep these times so we can detect regressions and improvements easily. But the first step is measuring the time.
Jenkins (partially) gives us that, see here:

http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/buildTimeTr...

I completely agree that the UX here is not as good as it can and should be, and we do have plans to make it A LOT better; please bear with us in the meantime...

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On 7 December 2016 at 14:15, Nir Soffer <nsoffer@redhat.com> wrote:
Last week it was 19-20 minutes - a great improvement!
There are a couple of other things we might try soon that will, perhaps, help us shave off another 2-3 minutes...

--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On Mon, Dec 5, 2016 at 9:59 AM, Barak Korren <bkorren@redhat.com> wrote:
Here are builds that do not change the build system:
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5767/console: 10:07
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4157/console: 10:16
So we are about 2X faster now.
Awesome! Also for fc24:
22:11:22 Init took 73 seconds
22:13:45 Install packages took 143 seconds
So 3m 36s, our pending patches can probably bring that down to around 20s. That will get us to around 7m... Maybe we could shave some more seconds off by optimizing the git clone and making some of the cleanups happen less frequently. (It seems we spend 16s total outside of mock_runner.sh, so perhaps not much to gain there).
So, any more ideas where we can get an extra 2-3m?
Can we run the tests in parallel in nose [1]?

Y.

[1] http://nose.readthedocs.io/en/latest/doc_tests/test_multiprocess/multiproces...
Things we didn't try yet:

1. Ensure all downloads happen through the proxy (there is a patch pending, but some tweaking in check_patch.sh may be needed as well)
2. Run mock in tmpfs (it has a plugin for that)
3. Avoid setting some FS attributes on files (mock is configured for that, but we don't install the OS package needed to make that actually work)
Not sure any of the above will provide significant gains, though.
--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

_______________________________________________
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel

On Mon, Dec 5, 2016 at 12:48 PM, Yaniv Kaul <ykaul@redhat.com> wrote:
On Mon, Dec 5, 2016 at 9:59 AM, Barak Korren <bkorren@redhat.com> wrote:
Here are builds that do not change the build system:
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc24-x86_64/5767/consol...: 10:07
- http://jenkins.ovirt.org/job/vdsm_master_check-patch-el7-x86_64/4157/console: 10:16
So we are about 2X faster now.
Awesome! Also for fc24:
22:11:22 Init took 73 seconds
22:13:45 Install packages took 143 seconds
So 3m 36s, our pending patches can probably bring that down to around 20s. That will get us to around 7m... Maybe we could shave some more seconds off by optimizing the git clone and making some of the cleanups happen less frequently. (It seems we spend 16s total outside of mock_runner.sh, so perhaps not much to gain there).
So, any more ideas where we can get an extra 2-3m?
Can we run the tests in parallel in nose[1] ?
We don't want to go there; a lot of the tests are not safe when run concurrently. If we want this, we need to check all the tests and mark those that can run in parallel, and then find out how to do this in the test framework.

Anyway, we don't want to use nose - it is a dead project - and we want to move to pytest. We already use pytest for the import tests, see "tox -e imports".
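The marking approach described above could look like this with pytest markers - a hedged sketch only; the `concurrent` marker name and the example tests are invented, not existing vdsm conventions:

```python
import pytest

# Hypothetical sketch of marking tests as safe for concurrent runs;
# the "concurrent" marker name is an assumption, not a vdsm convention.


@pytest.mark.concurrent
def test_parse_uuid():
    # Pure in-memory logic: safe to run alongside other tests.
    assert "12345678" * 4 != ""


def test_uses_shared_state():
    # Touches shared resources (ports, global files): keep it serial.
    assert True
```

A parallel run could then select only the marked tests, e.g. `pytest -m concurrent -n auto` with the pytest-xdist plugin, while everything unmarked keeps running serially.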
Y.
[1] http://nose.readthedocs.io/en/latest/doc_tests/test_multiprocess/multiproces...
Things we didn't try yet:

1. Ensure all downloads happen through the proxy (there is a patch pending, but some tweaking in check_patch.sh may be needed as well)
2. Run mock in tmpfs (it has a plugin for that)
3. Avoid setting some FS attributes on files (mock is configured for that, but we don't install the OS package needed to make that actually work)
Not sure any of the above will provide significant gains, though.
--
Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/
participants (3)
- Barak Korren
- Nir Soffer
- Yaniv Kaul