[
https://ovirt-jira.atlassian.net/browse/OVIRT-2794?page=com.atlassian.jir...
]
Barak Korren commented on OVIRT-2794:
-------------------------------------
This was a bit puzzling. We've seen issues between {{docker_cleanup.py}} and Docker
appear sporadically in the past, and therefore made the job code generally not
fail when {{docker_cleanup.py}} fails, and instead send an email to the infra list. It
turns out that was only true for the V2 code; for the V1 code (which is still used in the
manual job and the nightly jobs) those failures could still arise.
We did verify that {{docker_cleanup.py}} works on CentOS 7 with the Python 3 Docker API
client before merging the patch, so it's strange we did not see the issue then.
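For illustration, here is roughly what that V2 behaviour amounts to; the script path, sender address, infra-list address and SMTP host below are assumptions for the sketch, not the actual job code:
{code}
# Sketch only: run the cleanup script and, if it fails, mail the output to
# the infra list instead of failing the job. Addresses and paths are assumed.
import subprocess
from email.message import EmailMessage
from smtplib import SMTP

result = subprocess.run(
    ['python3', 'jenkins/scripts/docker_cleanup.py'],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    universal_newlines=True,
)
if result.returncode != 0:
    msg = EmailMessage()
    msg['Subject'] = 'docker_cleanup.py failed (non-fatal)'
    msg['From'] = 'jenkins@example.org'   # assumed sender address
    msg['To'] = 'infra@example.org'       # assumed infra list address
    msg.set_content(result.stdout)
    with SMTP('localhost') as smtp:       # assumed local MTA
        smtp.send_message(msg)
    # Note: the job itself does not fail here; cleanup errors are only reported.
{code}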
[~accountid:557058:5ca52a09-2675-4285-a044-12ad20f6166a] some of your statements above
seem to include some wrong assumptions about how the system is built. We're not
actually exposing the host's Docker daemon to the CI code; instead we run our own
Docker instance inside the container that is used to run the CI code. That way we
can ensure there is no cross-talk when running multiple CI containers on the same
host.
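In other words, a Docker client created inside the CI container resolves the default socket to the nested daemon, not the host's; a minimal sketch (assuming the nested daemon listens on the default socket path):
{code}
# Sketch: inside the CI container, the default Docker socket belongs to the
# nested daemon, so the CI code never talks to the host's Docker instance.
import docker

client = docker.DockerClient(base_url='unix://var/run/docker.sock')
print(client.version()['Version'])   # version of the *nested* daemon
{code}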
[~accountid:557058:cc1e0e66-9881-45e2-b0b7-ccaa3e60f26e] as for using Podman, I think
doing that at this point will be quite a challenge, for a number of reasons:
# We're currently using OpenShift 3.7 to manage our containers, which implies that we
must run Docker on our hosts, since AFAIK OpenShift only started supporting CRI-O in 4.0 or
4.1.
# To allow CI scripts and test suites to use Docker, we run nested Docker instances inside
the CI containers. We know that Docker-in-Docker works well for our use cases; running
Podman in Docker will probably be more challenging.
# Since we're still using {{mock}} to encapsulate the CI script inside the CI
container, we're bind-mounting the Docker socket from the container into mock (see the
sketch after this list). We know there are issues when running Podman in mock, so solving
those will take some work.
# People who write CI scripts and suites tend to expect things to "just work" in
CI like they do on their laptops, and hence tend to use Docker commands. Removing Docker
will force everyone to learn Podman, and we'll need to make changes everywhere.
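For reference, the socket bind-mount mentioned in point 3 uses mock's standard bind_mount plugin; a minimal sketch of the relevant mock configuration (the paths are the Docker defaults, not necessarily our exact setup):
{code}
# Sketch of a mock site config enabling the bind_mount plugin so that code
# running inside the mock chroot can reach the nested Docker daemon through
# its socket. Paths shown are the Docker defaults, assumed for illustration.
config_opts['plugin_conf']['bind_mount_enable'] = True
config_opts['plugin_conf']['bind_mount_opts']['dirs'].append(
    ('/var/run/docker.sock', '/var/run/docker.sock')
)
{code}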
Our current suspicion is that this issue may have to do with the particular version of
Docker that is installed inside the CI container. While our {{global_setup.sh}} script
generally keeps Docker up to date on the CI slaves, we've intentionally skipped that update
code when running in a container. I suspect that the version of Docker in the CI
containers is older than the one running on the CI slaves. That would explain why we did
not see this issue when working on the {{docker_cleanup.py}} patch, since that was tested
on the normal slaves and not in the containers.
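To check that suspicion, something as small as the following, run once inside a CI container and once on a regular slave, should show whether the daemon versions differ (a sketch using the same Python 3 docker client):
{code}
# Sketch: print the daemon and API versions as seen by the Python client.
# Run inside a CI container and on a plain CI slave, then compare the output.
import docker

info = docker.from_env().version()
print('Docker daemon version:', info.get('Version'))
print('Docker API version:   ', info.get('ApiVersion'))
{code}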
Here is what I think we should do now:
# Verify again that {{docker_cleanup.py}} works well on CentOS 7 with the Python 3 Docker
API client.
# If so, inspect the version of Docker we have in the containers, and finally
# Build an updated container image with a newer version of Docker, as needed.
Note that updating the container image will require us to test it thoroughly and ensure
it can properly run both OST and {{kubevirt-ci}}.
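Independent of the version question, the traceback quoted below shows the image reference was already gone by the time {{docker_cleanup.py}} tried to remove it, so another option worth considering is making the removal tolerant of that race. A rough sketch only, not the current {{_safe_rm}} code:
{code}
# Sketch only: tolerate images that disappeared between listing and removal
# (the 404 "reference does not exist" case from the traceback below).
import docker
from docker.errors import NotFound

def safe_remove_image(client, image_id, force=True):
    try:
        client.images.remove(image_id, force=force)
    except NotFound:
        # Image was already removed (e.g. together with a child image);
        # treat it as already cleaned up instead of failing the whole run.
        print('Image {} is already gone, skipping'.format(image_id))
{code}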
OST is broken since this morning - looks like infra issue
---------------------------------------------------------
Key: OVIRT-2794
URL:
https://ovirt-jira.atlassian.net/browse/OVIRT-2794
Project: oVirt - virtualization made easy
Issue Type: By-EMAIL
Reporter: Nir Soffer
Assignee: infra
The last successful build was today at 08:10. Since then all builds fail very early with
the error below, which is not related to oVirt.
{code}
Removing image: sha256:f8e5aa8e979155e074411bfef9adade6cdcdf3a5a2eb1d5ad2dbf0288d585ffa, force=True
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/docker/api/client.py", line 222, in _raise_for_status
    response.raise_for_status()
  File "/usr/lib/python3.6/site-packages/requests/models.py", line 893, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localunixsocket/v1.30/images/sha256:f8e5aa8e979155e074411bfef9adade6cdcdf3a5a2eb1d5ad2dbf0288d585ffa?force=True&noprune=False

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 349, in <module>
    main()
  File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 37, in main
    safe_image_cleanup(client, whitelisted_repos)
  File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 107, in safe_image_cleanup
    _safe_rm(client, parent)
  File "/home/jenkins/workspace/ovirt-system-tests_manual/jenkins/scripts/docker_cleanup.py", line 329, in _safe_rm
    client.images.remove(image_id, force=force)
  File "/usr/lib/python3.6/site-packages/docker/models/images.py", line 288, in remove
    self.client.api.remove_image(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/docker/api/image.py", line 481, in remove_image
    return self._result(res, True)
  File "/usr/lib/python3.6/site-packages/docker/api/client.py", line 228, in _result
    self._raise_for_status(response)
  File "/usr/lib/python3.6/site-packages/docker/api/client.py", line 224, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("reference does not exist")
Aborting.
Build step 'Execute shell' marked build as failure
{code}
Failed builds:
Build #5542 - Sep 5, 2019 3:02 PM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5542/
Build #5541 - Sep 5, 2019 3:02 PM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5541/
Build #5540 - Sep 5, 2019 3:01 PM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5540/
Build #5539 - Sep 5, 2019 2:13 PM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5539/
Build #5538 - Sep 5, 2019 1:58 PM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5538/
Build #5537 - Sep 5, 2019 1:50 PM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5537/
Build #5536 - Sep 5, 2019 10:21 AM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5536/
Last successful build:
Build #5535 - Sep 5, 2019 8:10 AM - https://jenkins.ovirt.org/job/ovirt-system-tests_manual/5535/
--
This message was sent by Atlassian Jira
(v1001.0.0-SNAPSHOT#100109)