eyal edri [Administrator] reassigned OVIRT-296:
-----------------------------------------------
Assignee: Evgheni Dereveanchin (was: infra)
[jenkins] take offline faulty bad slaves
----------------------------------------
Key: OVIRT-296
URL:
https://ovirt-jira.atlassian.net/browse/OVIRT-296
Project: oVirt - virtualization made easy
Issue Type: Task
Components: Jenkins
Affects Versions: Test
Reporter: eyal edri [Administrator]
Assignee: Evgheni Dereveanchin
Labels: jenkins, monitoring
it seems that quite often we hit an issue with a specific slave on phx, for various
reasons (out of space, git, network, etc.),
which leads to multiple jobs trying to run on it and failing.
we need an automated way of detecting this.
proposal:
add a groovy post-build step to jobs that will take a slave offline if it misbehaves,
using:
manager.build.getBuiltOn().toComputer().setTemporarilyOffline(true)
the trick is to find such a slave and to know whether it failed consistently in the
past X hours, to justify disabling it.
we need some sort of counter or service to track slaves and their error state and,
based on that, take a specific slave offline.
for example:
if x jobs failed on a slave within Y time and each runtime was < Z min, it might indicate
such a problem
(e.g. 10 jobs failed on the same slave within 5 min and each job ran for
less than 1 min).
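a minimal sketch of such a counter, written as standalone Python rather than as the groovy post-build step itself. the names BuildResult and SlaveFailureTracker are hypothetical, the defaults mirror the example numbers above (10 failures, 5 min, 1 min), and resetting the counter on a passing build is an assumption, not something decided in this ticket:

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class BuildResult:
    """One finished build (hypothetical record, not a Jenkins type)."""
    slave: str          # name of the node the build ran on
    finished_at: float  # completion time, seconds since the epoch
    duration: float     # build runtime in seconds
    failed: bool


class SlaveFailureTracker:
    """Flag a slave after max_failures quick failures within window seconds.

    Results are expected to be recorded in chronological order.
    """

    def __init__(self, max_failures=10, window=300.0, max_duration=60.0):
        self.max_failures = max_failures  # x: failures needed to flag a slave
        self.window = window              # Y: look-back window in seconds
        self.max_duration = max_duration  # Z: only faster failures count
        self._failures = defaultdict(deque)  # slave -> recent failure times

    def record(self, result):
        """Record one build; return True if its slave should go offline."""
        recent = self._failures[result.slave]
        if not result.failed:
            # Assumption: a passing build means the slave is healthy again.
            recent.clear()
            return False
        if result.duration >= self.max_duration:
            # Slow failures are more likely genuine build or test failures.
            return False
        recent.append(result.finished_at)
        # Drop failures that fell out of the look-back window.
        while recent and result.finished_at - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.max_failures
```

a post-build step (or an external service fed by it) would call record() once per finished build and, when it returns True, take the slave offline with the call from the proposal.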
the post-build script should email infra(a)ovirt.org that it disabled a slave, so that we can look
into it.
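the notification could be composed roughly as below. this is only a sketch: build_offline_notice and its parameters are hypothetical, the subject and body wording are made up for illustration, and the sender and recipient are passed in rather than hard-coded. actually sending the message (for example through smtplib and a local mail relay) is left out because it depends on the mail setup.

```python
from email.message import EmailMessage


def build_offline_notice(slave, failures, window_minutes, sender, recipient):
    """Compose the mail reporting that a slave was taken offline."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[jenkins] slave {slave} taken offline"
    msg.set_content(
        f"{failures} jobs failed on {slave} within {window_minutes} min.\n"
        "The slave was marked temporarily offline; please investigate.\n"
    )
    return msg
```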