[JIRA] (OVIRT-1744) Revisit build artifact storage and retention

7 Nov 2017

      This is a multi-part message in MIME format...

------------=_1510042642-11328-477
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

    [ https://ovirt-jira.atlassian.net/browse/OVIRT-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=35299#comment-35299 ] 

Barak Korren commented on OVIRT-1744:
-------------------------------------

[~ederevea] We talked about this before, but AFAIK did not open a ticket. We need to start working on this. IMO, barring prevention of looming outage scenarios, this should be our top priority.

I think we need to look at block-level de-duplication solutions. I think Red Hat recently bought a company that provides a block-level de-duplication software solution, and it might be available in CentOS by now. Since we store build results, I expect the actual data difference between builds of the same package to be small. Additionally, big packages like 'node' and 'appliance' are essentially collections of other packages, so they share data with them. All this leads me to estimate that block-level de-duplication could give us very good storage efficiency.
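
One way to check this estimate before committing to a solution is to measure how many fixed-size blocks repeat across a set of artifacts. The following is a minimal sketch and not a model of any particular product: the 4096-byte block size, the `dedup_ratio` name and the in-memory byte strings standing in for build artifacts are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real de-duplicating storage may differ


def dedup_ratio(blobs, block_size=BLOCK_SIZE):
    """Return (total_blocks, unique_blocks) for an iterable of bytes objects."""
    seen = set()
    total = 0
    for blob in blobs:
        # Split each blob into fixed-size blocks and fingerprint every block.
        for offset in range(0, len(blob), block_size):
            block = blob[offset:offset + block_size]
            seen.add(hashlib.sha256(block).digest())
            total += 1
    return total, len(seen)


# Two synthetic "builds" that differ only in their last block.
build_a = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
build_b = b"A" * 4096 + b"B" * 4096 + b"D" * 4096
total, unique = dedup_ratio([build_a, build_b])
print(total, unique)  # 6 blocks stored naively, 4 after de-duplication
```

The synthetic data is deliberately favourable. Only a run over real artifacts would show how much our builds actually share at the block level.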

I also suspect that storing data on fewer disk blocks can make processes like `createrepo`, which is by far the biggest time consumer when publishing to 'tested', run faster.

Another block-level de-duplication solution is [casync|http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html], which is a de-duplicating rsync-style tool. The main shortcoming here is that I'm not sure we can make its use transparent.
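
Tools of this kind typically cut data into variable-sized chunks at positions chosen by a rolling hash over the content, so that an insertion near the start of a file does not shift every later chunk. The sketch below illustrates that general idea only. It is not casync's algorithm: the rolling sum, the `chunk` function and the `window` and `divisor` parameters are assumptions chosen to keep the example short.

```python
import random


def chunk(data, window=16, divisor=64):
    """Split bytes into chunks at content-defined boundaries."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Maintain the sum of the last `window` bytes.
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        # Cut a chunk whenever the windowed sum hits a chosen residue.
        if i >= window - 1 and rolling % divisor == divisor - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
shifted = b"\x00" + original  # one byte inserted at the front

chunks_a = chunk(original)
chunks_b = chunk(shifted)
# Every chunk after the first one is found again in the shifted data.
print(set(chunks_a[1:]) <= set(chunks_b))
```

Because a boundary depends only on the bytes inside the window, the chunks after the first boundary are identical in both inputs and would need to be stored only once.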

Another direction to look at is [pulp|http://docs.pulpproject.org/index.html]. I think it can at least help with making publishing faster and with efficiently making data available in multiple ways. Maybe we can even somehow combine it with a de-duplicating backend storage.
...
Revisit build artifact storage and retention
--------------------------------------------
                Key: OVIRT-1744
                URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1744
            Project: oVirt - virtualization made easy
         Issue Type: New Feature
         Components: Repositories Mgmt
           Reporter: Barak Korren
           Assignee: infra
           Priority: Highest
             Labels: artifacts, repositories
We need to revisit how we store and manage build artifacts in our environment.
We need to do this to reach several goals:
# Stop having to frequently deal with running out of space on the Jenkins server
# Stop having to frequently deal with running out of space on the Resources server
# Make Jenkins load faster
# Make publishing of artifacts faster (It can take up to 20m to publish to 'tested' ATM)
# Make it so that finding artifacts is possible without knowing the exact details of the job that made them. We would like to be able to find artifacts by at least:
#* Knowing the build URL in Jenkins
#* Knowing the STDCI stage/project/branch/distro/arch/git hash combination.
#* Asking for STDCI stage/project/branch/distro/arch/latest artifact
We need to achieve the above without significantly harming the UX we provide. For example, users should still be able to find artifacts by navigating from links posted to Gerrit/GitHub to the Jenkins job result pages.
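
The three lookup paths listed above can be sketched as a small index kept alongside the stored artifacts. This is only an illustration of the requirement, not a proposed design: `ArtifactKey`, `ArtifactIndex`, their methods and all the example values (including the Jenkins URLs) are hypothetical.

```python
from collections import namedtuple

# The STDCI coordinates named in the ticket, without the git hash.
ArtifactKey = namedtuple("ArtifactKey", "stage project branch distro arch")


class ArtifactIndex:
    """In-memory index that maps the three lookup keys to artifact paths."""

    def __init__(self):
        self._by_build_url = {}
        self._by_hash = {}
        self._latest = {}

    def register(self, build_url, key, git_hash, artifact_paths):
        paths = list(artifact_paths)
        self._by_build_url[build_url] = paths
        self._by_hash[(key, git_hash)] = paths
        # Assumes builds are registered in chronological order.
        self._latest[key] = paths

    def by_build_url(self, build_url):
        return self._by_build_url.get(build_url, [])

    def by_git_hash(self, key, git_hash):
        return self._by_hash.get((key, git_hash), [])

    def latest(self, key):
        return self._latest.get(key, [])


index = ArtifactIndex()
key = ArtifactKey("build-artifacts", "demo", "master", "el7", "x86_64")
index.register("https://jenkins.example/job/demo/1/", key, "abc123", ["demo-1.rpm"])
index.register("https://jenkins.example/job/demo/2/", key, "def456", ["demo-2.rpm"])

print(index.by_build_url("https://jenkins.example/job/demo/1/"))  # ['demo-1.rpm']
print(index.by_git_hash(key, "abc123"))                           # ['demo-1.rpm']
print(index.latest(key))                                          # ['demo-2.rpm']
```

Keeping the build URL as one of the keys means links posted to Gerrit/GitHub could still resolve to the same artifacts even if the files no longer live on the Jenkins server.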
--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100070)

------------=_1510042642-11328-477--

Barak Korren (oVirt JIRA)
