[JIRA] (OVIRT-1744) Revisit build artifact storage and retention

Tue Nov 7 08:17:22 UTC 2017

    [ https://ovirt-jira.atlassian.net/browse/OVIRT-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=35299#comment-35299 ] 

Barak Korren commented on OVIRT-1744:
-------------------------------------

[~ederevea] We talked about this before, but AFAIK did not open a ticket. We need to start working on this. IMO, barring prevention of looming outage scenarios, this should be our top priority.

I think we need to look at block-level de-duplication solutions. I believe Red Hat recently acquired a company that provides block-level de-duplication software, and that software might be available in CentOS by now. Since we store build results, I expect the actual data difference between builds of the same package to be small. Additionally, big packages like 'node' and 'appliance' are essentially collections of other packages, so they share data with them. All this leads me to estimate that block-level de-duplication could give us very good storage efficiency.
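
One rough way to test this estimate before choosing a product would be to hash fixed-size blocks across a sample of stored builds and count how many are unique. The following is only a sketch: the 4 KiB block size, the function names and the directory being scanned are all assumptions, and a real de-duplication layer may use a different block size and alignment.

```python
import hashlib
import os


def dedup_estimate(root, block_size=4096):
    """Return (total_blocks, unique_blocks) for regular files under root.

    block_size is an assumed value, not one taken from any real product.
    """
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                while True:
                    block = f.read(block_size)
                    if not block:
                        break
                    seen.add(hashlib.sha256(block).digest())
                    total += 1
    return total, len(seen)


def dedup_ratio(total, unique):
    """Blocks stored without de-duplication divided by blocks stored with it."""
    return total / unique if unique else 1.0
```

Running `dedup_estimate` on a copy of a few consecutive builds of the same project should give a first indication of whether the expected sharing is really there.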

I also suspect that storing data on fewer disk blocks can make processes like `createrepo`, which is by far the biggest time consumer when publishing to 'tested', run faster.

Another de-duplication option is [casync|http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html], a de-duplicating rsync-style tool. The main shortcoming is that I'm not sure we can make its use transparent.

Another direction to look at is [pulp|http://docs.pulpproject.org/index.html]. I think it can at least help with improving publish performance and with making data available efficiently in multiple ways. Maybe we can even somehow combine it with a de-duplicating backend storage.

> Revisit build artifact storage and retention
> --------------------------------------------
>
>                 Key: OVIRT-1744
>                 URL: https://ovirt-jira.atlassian.net/browse/OVIRT-1744
>             Project: oVirt - virtualization made easy
>          Issue Type: New Feature
>          Components: Repositories Mgmt
>            Reporter: Barak Korren
>            Assignee: infra
>            Priority: Highest
>              Labels: artifacts, repositories
>
> We need to revisit how we store and manage build artifacts in our environment.
> We need to do this to reach several goals:
> # Stop having to frequently deal with running out of space on the Jenkins server
> # Stop having to frequently deal with running out of space on the Resources server
> # Make Jenkins load faster
> # Make publishing of artifacts faster (It can take up to 20 minutes to publish to 'tested' ATM)
> # Make it so that finding artifacts is possible without knowing the exact details of the job that made them. We would like to be able to find artifacts by at least:
> #* Knowing the build URL in Jenkins
> #* Knowing the STDCI stage/project/branch/distro/arch/git hash combination.
> #* Asking for STDCI stage/project/branch/distro/arch/latest artifact
> We need to achieve the above without significantly harming the UX we provide. For example, users should still be able to find artifacts by navigating from links posted to Gerrit/GitHub to the Jenkins job result pages.
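
The lookup requirements listed in the ticket description could be served by a small index keyed on the STDCI coordinates. The following is only a sketch: `ArtifactKey`, `ArtifactIndex` and the in-memory dictionaries are hypothetical names chosen for illustration and do not reflect any existing oVirt tool or storage layout.

```python
from collections import namedtuple

# Hypothetical key made of the STDCI coordinates named in the ticket.
ArtifactKey = namedtuple("ArtifactKey", "stage project branch distro arch")


class ArtifactIndex:
    """In-memory index that maps the three lookup styles to a location."""

    def __init__(self):
        self._by_build_url = {}
        self._by_hash = {}
        self._latest = {}

    def add(self, key, git_hash, build_url, location):
        """Register one build; assumes builds are added in chronological order."""
        self._by_build_url[build_url] = location
        self._by_hash[(key, git_hash)] = location
        self._latest[key] = location

    def by_build_url(self, build_url):
        """Find artifacts knowing only the Jenkins build URL."""
        return self._by_build_url.get(build_url)

    def by_hash(self, key, git_hash):
        """Find artifacts by stage/project/branch/distro/arch/git hash."""
        return self._by_hash.get((key, git_hash))

    def latest(self, key):
        """Find the latest artifacts for stage/project/branch/distro/arch."""
        return self._latest.get(key)
```

A persistent version would need a real datastore and an ordering rule for "latest", but the three lookup methods correspond one-to-one to the three bullet points in the description.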

--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100070)