Job stuck in cleanup for 13 hours

See http://jenkins.ovirt.org/job/ovirt-release_4.1_build-artifacts-el7-x86_64/20...

Build stuck after 16 minutes (using elapsed time):

00:16:49.203 + sudo systemctl restart docker

Failure detected after 6 hours:

06:00:07.301 Build timed out (after 360 minutes). Marking the build as failed.

But the job is still running:

06:00:08.490 + xargs -r sudo docker rm -f

Why does the build-artifacts job need a 6-hour timeout?

Nir
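An aside on the pattern in play here: the job only has a single 360-minute job-level timeout, so one hung step (`sudo systemctl restart docker`) can eat the whole budget. A common guard is to give each risky step its own deadline with GNU coreutils `timeout`. This is an illustrative sketch, not the actual CI code; the `run_step` helper and the deadlines are made up for the example.

```shell
#!/bin/bash
# Sketch: guard a single risky step with its own deadline so it cannot
# consume the whole job-level timeout budget. 'timeout' is GNU coreutils;
# it sends SIGTERM at the deadline, then SIGKILL after --kill-after.

run_step() {
    local deadline="$1"; shift
    if timeout --kill-after=30s "$deadline" "$@"; then
        echo "step ok: $*"
    else
        rc=$?
        # coreutils 'timeout' exits 124 when the deadline was hit
        if [ "$rc" -eq 124 ]; then
            echo "step timed out after ${deadline}: $*" >&2
        else
            echo "step failed (rc=$rc): $*" >&2
        fi
        return "$rc"
    fi
}

# Example with harmless commands instead of 'sudo systemctl restart docker':
run_step 5m true
run_step 2s sleep 10 || echo "hung step was reaped instead of blocking the job"
```

With a per-step deadline like `run_step 5m sudo systemctl restart docker`, a wedged daemon restart would fail the build after minutes instead of silently burning the 6-hour job timeout.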

The Jenkins-level timeout dates back to the days of the looong looong upgrade test jobs. Then as today, it was designed as a last-resort fail-safe, and it typically leaves a mess around. You are still seeing stuff running after it because it is not supposed to kill the cleanup scripts, or we're guaranteed to get dirty slaves.

We do need to figure out what caused the simple docker restart to not finish for 6 hours. I suspect we're seeing symptoms and confusing log-buffer behaviour, not the real issue.

Barak Korren
bkorren@redhat.com
RHCE, RHCi, RHV-DevOps Team
https://ifireball.wordpress.com/

On 30 May 2017 at 07:52 PM, "Nir Soffer" <nsoffer@redhat.com> wrote:

See http://jenkins.ovirt.org/job/ovirt-release_4.1_build-artifacts-el7-x86_64/205/console

Build stuck after 16 minutes (using elapsed time):

00:16:49.203 + sudo systemctl restart docker

Failure detected after 6 hours:

06:00:07.301 Build timed out (after 360 minutes). Marking the build as failed.

But the job is still running:

06:00:08.490 + xargs -r sudo docker rm -f

Why does the build-artifacts job need a 6-hour timeout?

Nir

_______________________________________________
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra
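The structure Barak describes, where the job-level timeout deliberately spares the cleanup scripts, is usually implemented by hooking cleanup on the shell's exit/signal traps rather than running it as an ordinary trailing step. A minimal sketch of that pattern (names and the echoed messages are illustrative, not the real oVirt CI scripts):

```shell
#!/bin/bash
# Sketch: cleanup must run even when the build body is aborted (e.g. a
# job-level timeout sending SIGTERM), so it is registered as a trap
# instead of being a final step that an abort would skip.

cleanup() {
    echo "cleanup: removing leftover containers"
    # In the real job this is roughly what appears in the console log:
    #   sudo docker ps -aq | xargs -r sudo docker rm -f
}

# Fire cleanup on normal exit and on SIGTERM.
trap cleanup EXIT TERM

echo "build body runs here"
```

This also explains the symptom in the log: the `xargs -r sudo docker rm -f` line printed *after* "Build timed out" is the trap-style cleanup starting, not the build refusing to die.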

Ok. I looked deeper into this. The failure was in the pre-run setup rather than the post-run cleanup, but it also caused the cleanup after the timeout to get stuck. The core issue seems to be Docker failing to set itself up properly on a node without LVM. This may or may not be unique to the FC24 slaves (and it's really strange that we're only seeing this now, because the Docker code has been in place for a while). I've opened a couple of tickets to look deeper into this and put fail-safes in place:

https://ovirt-jira.atlassian.net/browse/OVIRT-1421
https://ovirt-jira.atlassian.net/browse/OVIRT-1420

In the meantime I killed all the stuck processes on the slave to make the job finish, and took the slave offline. The next job run was successful.
--
Barak Korren, RHV DevOps team, RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
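The fail-safe Barak files tickets for (OVIRT-1420/1421 track the real work) would plausibly check that Docker's storage backend actually initialised before the job depends on it, instead of letting a restart hang on a node without LVM. A hedged sketch of such a pre-flight check; `docker info --format '{{.Driver}}'` is a real Docker CLI query, but the list of acceptable drivers and the overall shape of the check are assumptions for illustration:

```shell
#!/bin/bash
# Sketch: fail fast if Docker's storage driver did not come up sanely,
# rather than discovering it via a multi-hour hang. Illustrative only.

storage_driver_ok() {
    # $1: the storage driver name as reported by 'docker info'
    case "$1" in
        overlay2|devicemapper|btrfs|zfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the running daemon; fall back to 'unknown' if Docker is absent/down.
driver="$(docker info --format '{{.Driver}}' 2>/dev/null || echo unknown)"
if storage_driver_ok "$driver"; then
    echo "docker storage driver '$driver' looks sane"
else
    echo "docker storage driver '$driver' not usable; failing fast" >&2
fi
```

Run early in the pre-run setup, a check like this turns the "Docker on a non-LVM node" condition into an immediate, clearly attributed failure.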
participants (2)
- Barak Korren
- Nir Soffer