Offline slaves - staying offline for long periods of time

16 Jan 2019

Hi all,

I've spent some time over the last two days going over all our slaves that
were offline and bringing them back online.

Some slaves were offline because of obvious technical issues (usually
having to do with disk space). I've fixed those issues and also written
some patches to resolve them automatically in the future.

Other slaves were taken offline manually, with comments that either point
to stalled Jira tickets or simply make a general suggestion to clean up.
In many cases those slaves had been offline for a few months, and in some
cases for over a year.

Given how often we've seen our system used at full capacity recently, I
must urge people to avoid doing this. If you find a troublesome slave,
please do one of the following:

   1. Resolve the technical issues and restore the slave to full working
   order.
   2. Prove that the issue in question cannot be easily reproduced and
   restore the slave to full working order.
   3. Reinstall the slave from scratch.
   4. (As a last resort) Open an urgent ticket to investigate and resolve
   the issue and *follow up* on it.

In any case, please make an effort to avoid having a slave remain offline
for more than 2-3 days, and to avoid having more than 1-2 slaves offline
at once.

@Evgheni Dereveanchin <ederevea@redhat.com> - do you think we can set up
some monitoring in Nagios so that we get alerts if too many slaves are
offline or if slaves stay offline for too long?
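
To make the request more concrete, here is a rough sketch of the decision
logic such a check could use. It is only an illustration, not an existing
plugin: the function name, the record layout (a dict with "name" and
"offline_since") and the default thresholds are assumptions loosely based
on the limits mentioned above, and fetching the actual slave status is
left out entirely. The only fixed convention it relies on is the standard
Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_offline_slaves(slaves, now, max_offline=2, max_age=timedelta(days=3)):
    """Return (exit_code, message) for a list of slave records.

    Each record is a hypothetical dict of the form
    {"name": str, "offline_since": datetime or None}, where None means
    the slave is currently online. The thresholds are assumptions.
    """
    offline = [s for s in slaves if s["offline_since"] is not None]

    # Slaves that have been offline longer than the allowed age.
    stale = sorted(s["name"] for s in offline if now - s["offline_since"] > max_age)
    if stale:
        return CRITICAL, (
            f"CRITICAL: offline longer than {max_age.days} days: " + ", ".join(stale)
        )

    # Too many slaves offline at the same time.
    if len(offline) > max_offline:
        return WARNING, f"WARNING: {len(offline)} slaves offline (limit {max_offline})"

    return OK, f"OK: {len(offline)} slaves offline"
```

A wrapper script would collect the slave records from wherever the status
is available, call the function with the current time (for example
datetime.now(timezone.utc)), print the message and exit with the returned
code so that Nagios can pick it up.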

-- 
Barak Korren
RHV DevOps team , RHCE, RHCi
Red Hat EMEA
redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted