[JIRA] (OVIRT-296) [jenkins] take offline faulty bad slaves
eyal edri [Administrator] (oVirt JIRA)
jira at ovirt-jira.atlassian.net
Thu Dec 22 10:18:03 UTC 2016
[ https://ovirt-jira.atlassian.net/browse/OVIRT-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
eyal edri [Administrator] reassigned OVIRT-296:
-----------------------------------------------
Assignee: Evgheni Dereveanchin (was: infra)
> [jenkins] take offline faulty bad slaves
> ----------------------------------------
>
> Key: OVIRT-296
> URL: https://ovirt-jira.atlassian.net/browse/OVIRT-296
> Project: oVirt - virtualization made easy
> Issue Type: Task
> Components: Jenkins
> Affects Versions: Test
> Reporter: eyal edri [Administrator]
> Assignee: Evgheni Dereveanchin
> Labels: jenkins, monitoring,
>
> Quite often we hit an issue with a specific slave on phx, for various reasons (out of space, git, network, etc.),
> which leads to multiple jobs being scheduled on it and failing.
> We need an automated way of detecting this.
> proposal:
> add a Groovy post-build step to jobs that takes a slave offline if it misbehaves, using:
> manager.build.getBuiltOn().toComputer().setTemporarilyOffline(true)
> The trick is to identify such a slave and to know whether it has failed consistently over the past X hours, to justify disabling it.
> We need some sort of counter or service that tracks slaves and their error state and, based on that, takes a specific slave offline.
> for example:
> If a slave failed X jobs within Y minutes and each job ran for less than Z minutes, it might indicate such a problem.
> (e.g. 10 jobs failed on the same slave within 5 minutes, and each job ran for less than 1 minute.)
> The post-build script should email infra at ovirt.org to report that it disabled a slave, so that we can look into it.
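
The counting heuristic described in the issue (X failures within Y time, each with a runtime under Z) could be sketched roughly as follows. This is only an illustration in Python, not the Groovy post-build script itself and not anything that exists in the oVirt infra. The class name, method name, and parameters are hypothetical, and treating a successful build as a reset of the slave's failure history is an assumption the issue does not state.

```python
from collections import defaultdict, deque


class SlaveFailureTracker:
    """Hypothetical sliding-window counter of fast failures per slave."""

    def __init__(self, max_failures=10, window_seconds=300, max_runtime_seconds=60):
        self.max_failures = max_failures                # X: failures needed to flag a slave
        self.window_seconds = window_seconds            # Y: length of the sliding window
        self.max_runtime_seconds = max_runtime_seconds  # Z: only shorter runs are counted
        self._failures = defaultdict(deque)             # slave name -> failure timestamps

    def record_build(self, slave, finished_at, runtime_seconds, failed):
        """Record one finished build; return True if the slave looks faulty."""
        if not failed:
            # Assumption: a successful build means the slave is healthy again.
            self._failures.pop(slave, None)
            return False
        if runtime_seconds >= self.max_runtime_seconds:
            # A long-running failure is more likely a genuine job failure.
            return False
        window = self._failures[slave]
        window.append(finished_at)
        # Drop failures that fell out of the sliding window.
        while window and finished_at - window[0] > self.window_seconds:
            window.popleft()
        return len(window) >= self.max_failures
```

A caller would feed each finished build into `record_build` (timestamps and runtimes in seconds) and, when it returns True, take the slave offline and send the notification mail. Where the counters would live between builds (a file on the master, an external service, etc.) is left open here, as it is in the issue.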