Experimental Jenkins monitoring

Nadav Goldin ngoldin at redhat.com
Sun Apr 17 20:55:58 UTC 2016


>
> I think that will change a lot on a per-project basis. If we can get that
> info per job, then with grafana we can aggregate and create secondary stats
> (like builds per hour, as you say).
> So I'd say just collect the 'bare' data, like job built event, job ended,
> duration and such.

Agreed, that needs improving. Right now it 'pulls' every X seconds via the
CLI instead of Jenkins sending the events, so it is limited to what the CLI
can provide and is not very efficient. I plan to install [1] and do the
opposite: Jenkins will send a POST request with the data on each build
event, and that data will then be forwarded to graphite.
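
A minimal sketch of the push direction, assuming Jenkins POSTs one JSON
object per build event. The keys 'jobName', 'duration' and 'result' and the
'jenkins.builds' metric prefix are hypothetical names for illustration, not
the plugin's documented payload. The only standard parts are Graphite's
plaintext line format ("path value timestamp") and carbon's default
plaintext port 2003.

```python
import socket
import time


def event_to_graphite_lines(event, prefix="jenkins.builds", now=None):
    """Turn one build event (a dict) into Graphite plaintext lines.

    The keys 'jobName', 'duration' and 'result' are hypothetical; adjust
    them to whatever the notification payload really contains.
    """
    ts = int(now if now is not None else time.time())
    # Dots separate path components in Graphite, so strip them from job names.
    job = event["jobName"].replace(".", "_").replace(" ", "_")
    lines = ["%s.%s.count 1 %d" % (prefix, job, ts)]
    if "duration" in event:
        lines.append("%s.%s.duration %s %d" % (prefix, job, event["duration"], ts))
    if "result" in event:
        lines.append(
            "%s.%s.result.%s 1 %d" % (prefix, job, event["result"].lower(), ts)
        )
    return lines


def send_to_graphite(lines, host, port=2003):
    """Send plaintext lines to a carbon listener over TCP."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(payload)
    finally:
        sock.close()
```

A small HTTP handler would call event_to_graphite_lines() on each request
body and pass the result to send_to_graphite(), so nothing has to poll.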

> Have you checked the current ds fabric checks?
> There are already a bunch of fabric tasks that monitor jenkins. If we
> install nagiosgraph (see ds for details) to send the nagios performance
> data into graphite, we can use them as is to also start alarms and such.
>
Icinga2 has integrated graphite support, so after the upgrade we will
get all of our alarms data sent to graphite 'out-of-the-box'.

>
>     dcaro at akhos$ fab -l | grep nagi
>     do.jenkins.nagios.check_build_load                      Checks if the bui...
>     do.jenkins.nagios.check_executors                       Checks if the exe...
>     do.jenkins.nagios.check_queue                           Check if the buil...
>     do.provision.nagios_check                               Show a summary of...
>
> Though those will not give you the bare data (they were designed with
> nagios in mind, not graphite, so they are just checks; the stats were
> added later).
>
> There's also a bunch of helper functions to create nagios checks too.
>

Cool, I wasn't aware of those fabric checks.
I think for simple metrics (loads and such) we could use that, i.e. query
Jenkins from fabric, but for more complicated queries we'd need to query
graphite itself. With [2] I could create scripts that query graphite and
trigger Icinga alerts. For example: calculate the 'expected' slave load for
the next hour (in graphite), and then:
Icinga queries graphite -> triggers another Icinga alert -> triggers a custom
script (such as a fab task to create slaves)
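
A minimal sketch of the 'Icinga queries graphite' step, written as a
hand-rolled check and not as graphite-beacon itself. It assumes the input is
the JSON that Graphite's render API returns with format=json (a list of
series, each with 'datapoints' as [value, timestamp] pairs) and maps the
latest value to the usual Nagios/Icinga plugin exit codes. The function name
and the thresholds are made up for illustration.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(render_json, warn, crit):
    """Map the latest non-null datapoint of the first series to an exit code.

    render_json is the parsed output of Graphite's render API (format=json).
    Returns a (code, message) tuple.
    """
    if not render_json:
        return UNKNOWN, "no series returned"
    # Graphite pads missing samples with null, so drop those first.
    values = [v for v, _ts in render_json[0]["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "no datapoints"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, "value %.2f >= %.2f" % (latest, crit)
    if latest >= warn:
        return WARNING, "value %.2f >= %.2f" % (latest, warn)
    return OK, "value %.2f" % latest
```

A wrapper script would fetch the series, print the message and exit with the
returned code; the Icinga event handler could then run the fab task.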

For now, I added two more metrics: the top 10 jobs in the past X time, and the
average number of builds running / builds waiting in the queue in the past X
time. Some metrics might 'glitch' from time to time, as there is not a lot of
data yet and it mainly counts integer values while graphite is oriented
towards floats, so the data has to be smoothed (usually with movingAverage()).
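
To illustrate why smoothing helps with sparse integer counts, here is a
small trailing moving average in plain Python. It is only an approximation
of the idea behind Graphite's movingAverage(); Graphite's own handling of
window edges and nulls may differ, and the function name is made up.

```python
def moving_average(points, window):
    """Trailing moving average over a list of samples.

    None entries (missing samples) are skipped; if a window holds no
    usable samples the result for that position is None.
    """
    out = []
    for i in range(len(points)):
        start = max(0, i - window + 1)
        chunk = [p for p in points[start:i + 1] if p is not None]
        out.append(sum(chunk) / float(len(chunk)) if chunk else None)
    return out
```

A queue length that flips between 0 and 3 on every sample becomes a flat
1.5 with a window of 2, which is much easier to read on a graph.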



[1] https://wiki.jenkins-ci.org/display/JENKINS/Statistics+Notification+Plugin
[2] https://github.com/klen/graphite-beacon

On Fri, Apr 15, 2016 at 9:39 AM, David Caro <dcaro at redhat.com> wrote:

> On 04/15 01:24, Nadav Goldin wrote:
> > Hi,
> > I've created an experimental dashboard for Jenkins at our Grafana
> > instance:
> > http://graphite.phx.ovirt.org/dashboard/db/jenkins-monitoring
> > (if you don't have an account, you can enrol with github/google)
>
> Nice! \o/
>
> >
> > currently it collects the following metrics:
> > 1) How many jobs in the Build Queue are waiting per slaves' label:
> >
> > for instance: if there are 4 builds of a job that is restricted to 'el7'
> > and 2 builds of another job which is restricted to 'el7' in the build
> > queue, we will see 6 for 'el7' in the first graph.
> > 'No label' sums jobs which are waiting but are unrestricted.
> >
> > 2) How many slaves are idle per label.
> > note that the slave's labels are contained in the job's labels, but not
> > vice versa, as we allow regex expressions such as (fc21 || fc22). right
> > now it treats them as simple strings.
> >
> > 3) Total number of online/offline/idle slaves
> >
> > besides the normal monitoring, it can help us:
> > 1) minimize the difference between 'idle' slaves per label and jobs
> > waiting in the build queue per label.
> > this might be caused by unnecessary restrictions on the label, or maybe
> > by the 'Throttle Concurrent Builds' plugin.
> > 2) decide how many VMs and which OS to install on the new hosts.
> > 3) in the future, once we have the 'slave pools' implemented, we could
> > implement auto-scaling based on thresholds or some other function.
> >
> >
> > 'experimental' - as it still needs to be tested for stability (it is
> > based on python-jenkins and graphite-send) and also more metrics can be
> > added (maybe avg running time per job? builds per hour?) - will be happy
> > to hear.
>
> I think that will change a lot on a per-project basis. If we can get that
> info per job, then with grafana we can aggregate and create secondary stats
> (like builds per hour, as you say).
> So I'd say just collect the 'bare' data, like job built event, job ended,
> duration and such.
>
> >
> > I plan later to pack it all into independent fabric tasks(i.e. fab
> > do.jenkins.slaves.show)
>
> Have you checked the current ds fabric checks?
> There are already a bunch of fabric tasks that monitor jenkins. If we
> install nagiosgraph (see ds for details) to send the nagios performance
> data into graphite, we can use them as is to also start alarms and such.
>
>     dcaro at akhos$ fab -l | grep nagi
>     do.jenkins.nagios.check_build_load                      Checks if the bui...
>     do.jenkins.nagios.check_executors                       Checks if the exe...
>     do.jenkins.nagios.check_queue                           Check if the buil...
>     do.provision.nagios_check                               Show a summary of...
>
> Though those will not give you the bare data (they were designed with
> nagios in mind, not graphite, so they are just checks; the stats were
> added later).
>
> There's also a bunch of helper functions to create nagios checks too.
>
>
> >
> >
> > Nadav
>
>
>
> --
> David Caro
>
> Red Hat S.L.
> Continuous Integration Engineer - EMEA ENG Virtualization R&D
>
> Tel.: +420 532 294 605
> Email: dcaro at redhat.com
> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
> Web: www.redhat.com
> RHT Global #: 82-62605
>