Hi all,

I've spent some time over the last two days going over all our slaves that were offline and bringing them back online.

Some slaves were offline because of obvious technical issues (usually having to do with disk space). I've fixed those issues and also written some patches to resolve them automatically in the future.

Other slaves were put offline manually by people with comments that either point to stalled Jira tickets or simply make some general suggestions to clean up. In many cases those slaves were offline for a few months or in some cases over a year.

Given how often we've seen our system used at full capacity recently, I must urge people to avoid doing this. If you find a troublesome slave, please do one of the following:
  1. Resolve the technical issues and restore the slave to full working order
  2. Prove that the issue in question cannot be easily reproduced and restore the slave to full working order
  3. Reinstall the slave from scratch
  4. (As a last resort) Open an urgent ticket to investigate and resolve the issue and follow up on it.
In any case, please make an effort to avoid having a slave remain offline for more than 2-3 days, and to avoid having more than 1-2 slaves offline at a time.

@Evgheni Dereveanchin - do you think we can set up some monitoring in Nagios so that we get alerts if too many slaves are offline or if slaves stay offline for too long?
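
To make the idea concrete, here is a rough sketch of the decision logic such a check could use. It is only an illustration, not an existing plugin: the function name, the (name, offline_since) input format and the default thresholds are all assumptions, with the thresholds taken from the 1-2 slaves and 2-3 days limits mentioned above. How the list of slaves and their offline times is collected is deliberately left out. The return codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from datetime import datetime, timedelta

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_offline_slaves(slaves, now,
                         warn_count=1, crit_count=2,
                         warn_age=timedelta(days=2),
                         crit_age=timedelta(days=3)):
    """Return (status, message) for a hypothetical offline-slaves check.

    slaves: iterable of (name, offline_since) tuples, where offline_since
            is a datetime, or None if the slave is online (assumed format).
    now:    the current time, passed in so the logic is easy to test.
    """
    # Keep only the offline slaves, together with how long they have been offline.
    offline = [(name, now - since) for name, since in slaves if since is not None]
    oldest = max((age for _, age in offline), default=timedelta(0))

    # Alert on whichever limit is exceeded first: count or duration.
    if len(offline) > crit_count or oldest > crit_age:
        status, label = CRITICAL, "CRITICAL"
    elif len(offline) > warn_count or oldest > warn_age:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"

    names = ", ".join(sorted(name for name, _ in offline)) or "none"
    message = "{} - {} slave(s) offline, longest for {} day(s): {}".format(
        label, len(offline), oldest.days, names)
    return status, message
```

A real plugin would print the message on one line and exit with the returned status, so Nagios can pick up both.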

--
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted