Job stuck in cleanup for 13 hours

Barak Korren bkorren at redhat.com
Wed May 31 06:35:25 UTC 2017


Ok.

I looked deeper into this. The failure was in the pre-run setup rather
than in the post-run cleanup, but it also caused the cleanup that runs
after the timeout to get stuck.

The core issue seems to be Docker failing to set itself up properly on a
node without LVM. This may or may not be unique to FC24 slaves (and
it's really strange that we're only seeing this now, because the Docker
code has been in place for a while).

I've opened a couple of tickets to look deeper into this and put
fail-safes in place:
https://ovirt-jira.atlassian.net/browse/OVIRT-1421
https://ovirt-jira.atlassian.net/browse/OVIRT-1420

In the meantime I killed all stuck processes on the slave to make the
job finish and took the slave offline.

The next job run was successful.


On 31 May 2017 at 00:28, Barak Korren <bkorren at redhat.com> wrote:
> The Jenkins-level timeout dates back to the days of the looong, looong
> upgrade test jobs.
>
> Then, as today, it was designed as a last-resort fail-safe, and it
> typically leaves a mess behind.
>
> You are still seeing things running after it because it is not supposed to
> kill the cleanup scripts; otherwise we would be guaranteed to get dirty slaves.
>
> We do need to figure out why a simple docker restart did not finish
> for 6 hours. I suspect we're seeing symptoms and confusing log-buffering
> behaviour, not the real issue.
>
>
> Barak Korren
> bkorren at redhat.com
> RHCE, RHCi, RHV-DevOps Team
> https://ifireball.wordpress.com/
>
>
> On 30 May 2017 07:52 PM, "Nir Soffer" <nsoffer at redhat.com> wrote:
>
> See
> http://jenkins.ovirt.org/job/ovirt-release_4.1_build-artifacts-el7-x86_64/205/console
>
> Build stuck after 16 minutes (using elapsed time):
>
> 00:16:49.203 + sudo systemctl restart docker
>
>
> Failure detected after 6 hours:
> 06:00:07.301 Build timed out (after 360 minutes). Marking the build as failed.
>
>
> But the job is still running:
>
> 06:00:08.490 + xargs -r sudo docker rm -f
>
>
> Why does the build-artifacts job need a 6-hour timeout?
>
> Nir
>
> _______________________________________________
> Infra mailing list
> Infra at ovirt.org
> http://lists.ovirt.org/mailman/listinfo/infra
>
>



-- 
Barak Korren
RHV DevOps team, RHCE, RHCi
Red Hat EMEA

