This is a multi-part message in MIME format...
------------=_1510042642-11328-477
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
[ https://ovirt-jira.atlassian.net/browse/OVIRT-1744?page=com.atlassian.jir... ]
Barak Korren commented on OVIRT-1744:
-------------------------------------
[~ederevea] We talked about this before, but AFAIK did not open a ticket. We need to start
working on this. IMO, barring prevention of looming outage scenarios, this should be our
top priority.
I think we need to look at block-level de-duplication solutions. I think Red Hat recently
bought a company that provides a block-level de-duplication software solution. This
solution might be available in CentOS by now. Given that we store build results, I expect
the actual data difference between different builds of the same package to be small.
Additionally, the big packages like 'node' and 'appliance' are essentially
collections of other packages, so they share data with them. All this makes me estimate
that block-level de-duplication might provide very good results for us as far as storage
efficiency goes.
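A rough way to sanity-check this estimate before committing to a solution could be to hash
fixed-size blocks of existing artifacts and count how many are unique. The sketch below is
only an illustration: the 4 KiB block size and the function names are assumptions, it reads
whole files into memory, and it only counts byte-identical, aligned blocks, so a real
de-duplication layer may report different numbers.

```python
import hashlib
from pathlib import Path

BLOCK_SIZE = 4096  # assumed block size; real de-duplication layers may differ


def iter_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield fixed-size blocks of data (the last one may be shorter)."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def dedup_ratio(blobs, block_size: int = BLOCK_SIZE) -> float:
    """Return total blocks / unique blocks across all blobs (1.0 means no savings)."""
    total = 0
    unique = set()
    for blob in blobs:
        for block in iter_blocks(blob, block_size):
            total += 1
            unique.add(hashlib.sha256(block).digest())
    return total / len(unique) if unique else 1.0


def dedup_ratio_for_tree(root: str, block_size: int = BLOCK_SIZE) -> float:
    """Convenience wrapper: apply dedup_ratio to every regular file under root."""
    files = (p for p in Path(root).rglob("*") if p.is_file())
    return dedup_ratio((p.read_bytes() for p in files), block_size)
```

Running `dedup_ratio_for_tree()` on a directory holding several builds of the same package
would give a first indication of whether the expected overlap is really there.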
I also suspect that storing data on fewer disk blocks can make processes like `createrepo`,
which is by far the biggest time consumer when publishing to 'tested', run
faster.
Another block-level de-duplication solution is
[casync|http://0pointer.net/blog/casync-a-tool-for-distributing-file-syste...],
which is a de-duplicating rsync-style tool. The main shortcoming here is that I'm not
sure we can make using it transparent.
Another direction to look at is [pulp|http://docs.pulpproject.org/index.html]. I think it
can at least help with improving publish performance and with making data available
efficiently in multiple ways. Maybe we can even somehow combine it with a
de-duplicating backend storage.
Revisit build artifact storage and retention
--------------------------------------------
Key: OVIRT-1744
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1744
Project: oVirt - virtualization made easy
Issue Type: New Feature
Components: Repositories Mgmt
Reporter: Barak Korren
Assignee: infra
Priority: Highest
Labels: artifacts, repositories
We need to revisit how we store and manage build artifacts in our environment.
We need to do this to reach several goals:
# Stop having to frequently deal with running out of space on the Jenkins server
# Stop having to frequently deal with running out of space on the Resources server
# Make Jenkins load faster
# Make publishing of artifacts faster (It can take up to 20m to publish to
'tested' ATM)
# Make it so that finding artifacts is possible without knowing the exact details of the
job that made them. We would like to be able to find artifacts by at least:
#* Knowing the build URL in Jenkins
#* Knowing the STDCI stage/project/branch/distro/arch/git hash combination.
#* Asking for STDCI stage/project/branch/distro/arch/latest artifact
We need to achieve the above without significantly harming the UX we provide. For
example, users should still be able to find artifacts by navigating from links posted to
Gerrit/GitHub to the Jenkins job result pages.
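The three lookup modes listed above could be served by a small index that is updated
whenever a build finishes. The following is a minimal sketch under stated assumptions:
`ArtifactKey`, `ArtifactIndex` and their methods are hypothetical names, not part of any
existing STDCI or Jenkins API, and "latest" is taken to mean the most recently recorded
build for a given combination.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ArtifactKey:
    """Hypothetical key: the STDCI stage/project/branch/distro/arch combination."""
    stage: str
    project: str
    branch: str
    distro: str
    arch: str


class ArtifactIndex:
    """In-memory index illustrating the three lookup modes; not a real service."""

    def __init__(self) -> None:
        self._by_build_url: Dict[str, List[str]] = {}
        self._by_hash: Dict[Tuple[ArtifactKey, str], List[str]] = {}
        self._latest: Dict[ArtifactKey, List[str]] = {}

    def record(self, key: ArtifactKey, git_hash: str, build_url: str,
               artifacts: Iterable[str]) -> None:
        """Register the artifacts of one finished build under all three lookups."""
        paths = list(artifacts)
        self._by_build_url[build_url] = paths
        self._by_hash[(key, git_hash)] = paths
        # Assumption: builds are recorded in chronological order.
        self._latest[key] = paths

    def by_build_url(self, build_url: str) -> List[str]:
        return self._by_build_url.get(build_url, [])

    def by_hash(self, key: ArtifactKey, git_hash: str) -> List[str]:
        return self._by_hash.get((key, git_hash), [])

    def latest(self, key: ArtifactKey) -> List[str]:
        return self._latest.get(key, [])
```

Keeping the Jenkins build URL as one of the keys means links posted to Gerrit/GitHub can
still lead to the artifacts even if the files themselves move to a different storage
backend.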
--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100070)
------------=_1510042642-11328-477
Content-Type: text/html; charset="UTF-8"
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
<html><body>
<pre>[ https://ovirt-jira.atlassian.net/browse/OVIRT-1744?page=com.atlassian.jir... ]</pre>
<h3>Barak Korren commented on OVIRT-1744:</h3>
<p>[~ederevea] We talked about this before, but AFAIK did not open a ticket. We need
to start working on this. IMO, barring prevention of looming outage scenarios, this should
be our top priority.</p>
<p>I think we need to look at block-level de-duplication solutions. I think Red Hat
recently bought a company that provides a block-level de-duplication software solution.
This solution might be available in CentOS by now. Given that we store build results, I
expect the actual data difference between different builds of the same package to be
small. Additionally, the big packages like ‘node’ and
‘appliance’ are essentially collections of other packages, so they
share data with them. All this makes me estimate that block-level de-duplication might
provide very good results for us as far as storage efficiency goes.</p>
<p>I also suspect that storing data on fewer disk blocks can make processes like
`createrepo`, which is by far the biggest time consumer when publishing to
‘tested’, run faster.</p>
<p>Another block-level de-duplication solution is [casync|<a
href="http://0pointer.net/blog/casync-a-tool-for-distributing-file-s...>],
which is a de-duplicating rsync-style tool. The main shortcoming here is that I'm not
sure we can make using it transparent.</p>
<p>Another direction to look at is [pulp|<a
href="http://docs.pulpproject.org/index.html">http://docs.pu...>].
I think it can at least help with improving publish performance and with making data
available efficiently in multiple ways. Maybe we can even somehow combine it
with a de-duplicating backend storage.</p>
<blockquote><h3>Revisit build artifact storage and retention</h3>
<pre> Key: OVIRT-1744
URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1744
Project: oVirt - virtualization made easy
Issue Type: New Feature
Components: Repositories Mgmt
Reporter: Barak Korren
Assignee: infra
Priority: Highest
Labels: artifacts, repositories</pre>
<p>We need to revisit how we store and manage build artifacts in our environment. We
need to do this to reach several goals:</p>
<ol>
<li>Stop having to frequently deal with running out of space on the Jenkins server</li>
<li>Stop having to frequently deal with running out of space on the Resources server</li>
<li>Make Jenkins load faster</li>
<li>Make publishing of artifacts faster (It can take up to 20m to publish to
‘tested’ ATM)</li>
<li>Make it so that finding artifacts is possible without knowing the exact details of the
job that made them. We would like to be able to find artifacts by at least:
<ul>
<li>Knowing the build URL in Jenkins</li>
<li>Knowing the STDCI stage/project/branch/distro/arch/git hash combination.</li>
<li>Asking for STDCI stage/project/branch/distro/arch/latest artifact</li>
</ul></li>
</ol>
<p>We need to achieve the above without significantly harming the UX we provide. For
example, users should still be able to find artifacts by navigating from links posted to
Gerrit/GitHub to the Jenkins job result pages.</p></blockquote>
<p>— This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100070)</p>
</body></html>
------------=_1510042642-11328-477--