Re: Experimental Jenkins monitoring

20 Apr 2016


Looks amazing!

Just adding a screenshot so people will see how nice it is :) [1]

[1]
http://graphite.phx.ovirt.org/dashboard/snapshot/qvdhPL34HoUpygGldklmT5uLUnR...


On Mon, Apr 18, 2016 at 10:36 AM, David Caro <dcaro@redhat.com> wrote:
...
On 04/17 23:55, Nadav Goldin wrote:
...
...
I think that will change a lot on a per-project basis. If we can get that
info per job, then with Grafana we can aggregate and create secondary stats
(like builds per hour, as you say).
So I'd say just collect the 'bare' data, like job built event, job ended,
duration and such.
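
As a rough illustration of deriving a secondary stat from 'bare' data, the sketch below buckets raw 'job ended' timestamps into builds per hour. The event list and function name are made up for the example; in practice Grafana/graphite would do this aggregation on the stored series.

```python
from collections import Counter


def builds_per_hour(end_timestamps):
    """Count finished builds per hour from raw 'job ended' epoch seconds.

    Each key is the epoch second at which the hour bucket starts.
    """
    counts = Counter()
    for ts in end_timestamps:
        counts[ts - ts % 3600] += 1
    return dict(counts)


# Hypothetical sample: two builds ending in one hour, one in the next.
sample_events = [1460937600, 1460938200, 1460941200]
per_hour = builds_per_hour(sample_events)
```

Because only the raw events are stored, other aggregates (per job, per day) can be derived later without changing the collector.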
Agreed. That will need improving: right now it 'pulls' every X seconds via
the CLI instead of Jenkins sending the events, so it is limited to what the
CLI can provide and is not that efficient. I plan to install [1] and do the
opposite (Jenkins will send a POST request with the data on each build
event, which would then be sent to graphite).
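
A minimal sketch of the 'push' direction, assuming a build event has already been received and decoded into a dict. The field names (job, result, duration, timestamp) and the metric prefix are assumptions for illustration, not the plugin's actual payload format. The only fixed part is Graphite's plaintext protocol: one "path value timestamp" line per metric, usually sent to port 2003.

```python
import socket


def to_graphite_lines(event, prefix="jenkins.builds"):
    """Turn one build event (hypothetical dict layout) into Graphite plaintext lines."""
    # Dots separate path components in Graphite, so sanitise the job name.
    job = event["job"].replace(".", "_").replace(" ", "_")
    ts = int(event["timestamp"])
    return [
        f"{prefix}.{job}.duration {event['duration']} {ts}",
        f"{prefix}.{job}.{event['result'].lower()} 1 {ts}",
    ]


def send_to_graphite(lines, host, port=2003):
    """Send already formatted lines over Graphite's plaintext protocol."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Hypothetical event, as it might look after decoding the POST body.
example_event = {"job": "some-job_master", "result": "SUCCESS",
                 "duration": 120, "timestamp": 1460937600}
example_lines = to_graphite_lines(example_event)
```

The HTTP receiver itself is left out on purpose, since its shape depends on what the plugin actually sends.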
Amarchuk already had some ideas on integrating collectd with Jenkins. IMO
that will work well for 'master'-related stats and be more difficult for
others like job started, etc., but it's worth looking at.
...
Have you checked the current ds fabric checks?
...
There are already a bunch of fabric tasks that monitor Jenkins. If we
install nagiosgraph (see ds for details) to send the nagios performance
data into graphite, we can use them as is to also start alarms and such.
Icinga2 has integrated graphite support, so after the upgrade we will
get all of our alarms data sent to graphite 'out-of-the-box'.
+1!
...
...
dcaro@akhos$ fab -l | grep nagi
    do.jenkins.nagios.check_build_load                      Checks if the bui...
    do.jenkins.nagios.check_executors                       Checks if the exe...
    do.jenkins.nagios.check_queue                           Check if the buil...
    do.provision.nagios_check                               Show a summary of...
Though those will not give you the bare data (they were designed with
nagios in mind, not graphite, so they are just checks; the stats were added
later).
There's also a bunch of helper functions to create nagios checks too.
Cool, I wasn't aware of those fabric checks.
I think for simple metrics (loads and such) we could use that (i.e. query
Jenkins from fabric), but for more complicated queries we'd need to query
graphite itself. With this [2] I could create scripts that query graphite
and trigger Icinga alerts,
such as: calculate the 'expected' slaves load for the next hour (in
graphite), and then:
Icinga queries graphite -> triggers another Icinga alert -> triggers a
custom script (such as a fab task to create slaves)
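
To make the "Icinga queries graphite" step concrete, here is a hedged sketch of a Nagios-style check built directly on Graphite's render API with format=json. It is not graphite-beacon's API; the base URL, metric name and thresholds are placeholders. The return values follow the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def render_url(base_url, target, window="-1h"):
    """Build a Graphite render API URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def evaluate(series, warn, crit):
    """Map the average of all non-null datapoints to a Nagios status code."""
    values = [v for s in series for v, _ts in s["datapoints"] if v is not None]
    if not values:
        return UNKNOWN
    avg = sum(values) / len(values)
    if avg >= crit:
        return CRITICAL
    if avg >= warn:
        return WARNING
    return OK


def check(base_url, target, warn, crit):
    """Fetch the series from Graphite and evaluate it (needs network access)."""
    with urlopen(render_url(base_url, target), timeout=10) as resp:
        return evaluate(json.load(resp), warn, crit)
```

A wrapper script would call check() and pass the result to sys.exit(), so that Icinga can react to the status.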
I'd be careful with the reactions for now, but yes, that's great.
...
For now, I added two more metrics: top 10 jobs in the past X time, and
avg number of builds running / builds waiting in queue in the past X time.
...
Some metrics might 'glitch' from time to time, as there is not a lot of
data yet and it mainly counts integer values while graphite is oriented
towards floats, so the data has to be smoothed (usually with
movingAverage()).
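
For intuition on the smoothing, the sketch below applies a simple trailing moving average to an integer series. It only illustrates the idea; Graphite's own movingAverage() may treat window edges and missing points differently. In a dashboard the equivalent would be a target such as movingAverage(some.metric, 10), where some.metric is a placeholder name.

```python
def moving_average(values, window):
    """Trailing moving average over `window` points, skipping missing (None) values."""
    smoothed = []
    for i in range(len(values)):
        chunk = [v for v in values[max(0, i - window + 1): i + 1] if v is not None]
        smoothed.append(sum(chunk) / len(chunk) if chunk else None)
    return smoothed


# A 'glitchy' integer counter that jumps between 0 and 3.
raw = [0, 3, 0, 3]
smooth = moving_average(raw, 2)
```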
[1] https://wiki.jenkins-ci.org/display/JENKINS/Statistics+Notification+Plugin
[2] https://github.com/klen/graphite-beacon
On Fri, Apr 15, 2016 at 9:39 AM, David Caro <dcaro@redhat.com> wrote:
...
On 04/15 01:24, Nadav Goldin wrote:
...
Hi,
I've created an experimental dashboard for Jenkins at our Grafana
instance:
http://graphite.phx.ovirt.org/dashboard/db/jenkins-monitoring
(if you don't have an account, you can enrol with github/google)
Nice! \o/
...
currently it collects the following metrics:
1) How many jobs in the Build Queue are waiting per slaves' label.
   For instance: if there are 4 builds of a job that is restricted to 'el7'
   and 2 builds of another job which is restricted to 'el7' in the build
   queue, we will see 6 for 'el7' in the first graph.
   'No label' sums jobs which are waiting but are unrestricted.
2) How many slaves are idle per label.
   Note that the slave's labels are contained in the job's labels, but not
   vice versa, as we allow regex expressions such as (fc21 || fc22). Right
   now it treats them as simple strings.
3) Total number of online/offline/idle slaves.
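
The per-label queue count in item 1 can be sketched as below. The input is a plain list of dicts with hypothetical 'job' and 'label' keys, not the structure python-jenkins returns. As in the current dashboard, labels are compared as simple strings.

```python
from collections import Counter


def waiting_per_label(queue_items):
    """Count queued builds per label; unrestricted builds go under 'No label'."""
    counts = Counter()
    for item in queue_items:
        counts[item.get("label") or "No label"] += 1
    return dict(counts)


# The example from the list above: 4 + 2 builds restricted to 'el7',
# plus one unrestricted build.
queue = (
    [{"job": "job-a", "label": "el7"}] * 4
    + [{"job": "job-b", "label": "el7"}] * 2
    + [{"job": "job-c", "label": None}]
)
per_label = waiting_per_label(queue)
```

An expression such as (fc21 || fc22) would show up here as its own label string, which is the limitation mentioned in item 2.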
Besides the normal monitoring, it can help us:
1) Minimize the difference between 'idle' slaves per label and jobs waiting
   in the build queue per label. This might be caused by unnecessary
   restrictions on the label, or maybe by the 'Throttle Concurrent Builds'
   plugin.
2) Decide how many VMs and which OS to install on the new hosts.
3) In the future, once we have the 'slave pools' implemented, we could
   implement auto-scaling based on thresholds or some other function.
'experimental' - as it still needs to be tested for stability (it is based
on python-jenkins and graphite-send) and also more metrics can be added
(maybe avg running time per job? builds per hour?) - will be happy to hear.
I think that will change a lot on a per-project basis. If we can get that
info per job, then with Grafana we can aggregate and create secondary stats
(like builds per hour, as you say).
So I'd say just collect the 'bare' data, like job built event, job ended,
duration and such.
...
I plan later to pack it all into independent fabric tasks (i.e. fab
do.jenkins.slaves.show)
Have you checked the current ds fabric checks?
There are already a bunch of fabric tasks that monitor Jenkins. If we
install nagiosgraph (see ds for details) to send the nagios performance
data into graphite, we can use them as is to also start alarms and such.
dcaro@akhos$ fab -l | grep nagi
    do.jenkins.nagios.check_build_load                      Checks if the bui...
    do.jenkins.nagios.check_executors                       Checks if the exe...
    do.jenkins.nagios.check_queue                           Check if the buil...
    do.provision.nagios_check                               Show a summary of...
Though those will not give you the bare data (they were designed with
nagios in mind, not graphite, so they are just checks; the stats were added
later).
There's also a bunch of helper functions to create nagios checks too.
...
Nadav
...
_______________________________________________
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra
--
David Caro
Red Hat S.L.
Continuous Integration Engineer - EMEA ENG Virtualization R&D
Tel.: +420 532 294 605
Email: dcaro@redhat.com
IRC: dcaro|dcaroest@{freenode|oftc|redhat}
Web: www.redhat.com
RHT Global #: 82-62605
...
-- 
Eyal Edri
Associate Manager
RHEV DevOps
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)