
[ https://ovirt-jira.atlassian.net/browse/OVIRT-296?page=com.atlassian.jira.pl... ]

eyal edri [Administrator] updated OVIRT-296:
--------------------------------------------
    Priority: Medium  (was: Highest)

[jenkins] take offline faulty bad slaves
----------------------------------------

             Key: OVIRT-296
             URL: https://ovirt-jira.atlassian.net/browse/OVIRT-296
         Project: oVirt - virtualization made easy
      Issue Type: Task
      Components: Jenkins
Affects Versions: Test
        Reporter: eyal edri [Administrator]
        Assignee: infra
          Labels: jenkins, monitoring

It seems that quite often we hit an issue with a specific slave on phx, for various reasons (out of space, git, network, etc.), which leads to multiple jobs trying to run on it and failing. We need an automated way of finding this.

Proposal: add a groovy post-build script to jobs that will take a slave offline if it misbehaves, using:

    manager.build.getBuiltOn().toComputer().setTemporarilyOffline(true)

The trick is to find such a slave and to know whether it has failed consistently in the past X hours, to justify disabling it. We need some sort of counter or service that tracks slaves and their error state and, according to it, takes a specific slave offline.

For example: if a slave failed x jobs in Y time and the runtime of each was < Z min, it might indicate such a problem (e.g. 10 jobs failed on the same slave within a timeframe of 5 min and each job's runtime was less than 1 min).

The post-build script should email infra@ovirt.org that it disabled a slave and that we should look into it.
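
The "x failures in Y time with runtime < Z" rule could be prototyped outside Jenkins before wiring it into a post-build script. Below is a minimal Python sketch of only the decision logic. The names BuildResult and SlaveFailureTracker, the field names, and the default thresholds are hypothetical and chosen for illustration. Resetting the counter after a successful build is an added assumption, one reading of "failed consistently". Taking the slave offline and emailing infra@ovirt.org are left out.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class BuildResult:
    """One finished build (hypothetical record, not a Jenkins type)."""
    slave: str          # name of the node the build ran on
    finished_at: float  # completion time, in seconds
    duration: float     # build runtime, in seconds
    failed: bool


class SlaveFailureTracker:
    """Counts quick failures per slave inside a sliding time window."""

    def __init__(self, max_failures=10, window=300.0, max_runtime=60.0):
        self.max_failures = max_failures  # "x" jobs
        self.window = window              # "Y" time, in seconds
        self.max_runtime = max_runtime    # "Z" runtime, in seconds
        self._failures = defaultdict(deque)  # slave -> failure timestamps

    def record(self, result):
        """Record a build; return True if its slave now looks faulty."""
        recent = self._failures[result.slave]
        if not result.failed:
            # Assumption: a success means the slave works, so start over.
            recent.clear()
            return False
        if result.duration >= self.max_runtime:
            # A slow failure is more likely a genuine job failure.
            return False
        recent.append(result.finished_at)
        # Drop failures that fell out of the time window.
        while recent and result.finished_at - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.max_failures


# Usage: three quick failures within five minutes trip a threshold of 3.
tracker = SlaveFailureTracker(max_failures=3, window=300.0, max_runtime=60.0)
for t in (0.0, 20.0, 40.0):
    faulty = tracker.record(BuildResult("example-slave-01", t, 5.0, True))
print(faulty)  # True
```

State is kept in memory here. A real post-build script runs once per build, so the counters would need to live somewhere that survives between builds, which is why the description asks for "some sort of counter or service".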

--
This message was sent by Atlassian JIRA
(v1000.620.0#100023)