Experimental Jenkins monitoring

Eyal Edri eedri at redhat.com
Wed Apr 20 11:48:23 UTC 2016


Looks amazing!

Just adding a screenshot so people will see how nice it is :) [1]

[1]
http://graphite.phx.ovirt.org/dashboard/snapshot/qvdhPL34HoUpygGldklmT5uLUnRyj19E


On Mon, Apr 18, 2016 at 10:36 AM, David Caro <dcaro at redhat.com> wrote:

> On 04/17 23:55, Nadav Goldin wrote:
> > >
> > > I think that will change a lot on a per-project basis. If we can get
> > > that info per job, then with grafana we can aggregate and create
> > > secondary stats (like builds per hour, as you say).
> > > So I'd say just collect the 'bare' data, like job built event, job
> > > ended, duration and such.
> >
> > Agreed, that needs improving. Right now it 'pulls' every X seconds via
> > the CLI instead of Jenkins sending the events, so it is limited to what
> > the CLI can provide and is not that efficient. I plan to install [1] and
> > do the opposite: Jenkins will send a POST request with the data on each
> > build event, and the data will then be sent to graphite.
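
A minimal sketch of the "event in, graphite line out" half of that push
model, assuming the data ends up at carbon's plaintext listener (one
"<path> <value> <timestamp>" line per metric, default port 2003). The event
dict and its keys (job, duration, ended_at), the "jenkins.builds" prefix and
the helper names are hypothetical; the actual payload sent by the plugin in
[1] is not shown in this thread.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, int(timestamp))


def build_event_to_lines(event, prefix="jenkins.builds"):
    """Turn a (hypothetical) build-event dict into Graphite lines."""
    # Dots separate path components in Graphite, so keep them out of the
    # job name.
    job = event["job"].replace(".", "_")
    ts = event["ended_at"]
    return [
        format_metric("%s.%s.duration" % (prefix, job), event["duration"], ts),
        format_metric("%s.%s.finished" % (prefix, job), 1, ts),
    ]


def send_lines(lines, host="localhost", port=2003):
    """Send the lines to a carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

Keeping the formatting separate from the socket code means the conversion
can be checked without a running carbon instance.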
>
> Amarchuk already had some ideas on integrating collectd with jenkins. IMO
> that will work well for 'master'-related stats and be more difficult for
> others like job started, etc., but it's worth looking at.
>
> >
> > > Have you checked the current ds fabric checks?
> > > There are already a bunch of fabric tasks that monitor jenkins. If we
> > > install nagiosgraph (see ds for details) to send the nagios performance
> > > data into graphite, we can use them as is to also start alarms and such.
> > >
> > Icinga2 has integrated graphite support, so after the upgrade we will
> > get all of our alarms data sent to graphite 'out-of-the-box'.
>
> +1!
>
> >
> > >
> > >     dcaro at akhos$ fab -l | grep nagi
> > >     do.jenkins.nagios.check_build_load    Checks if the bui...
> > >     do.jenkins.nagios.check_executors     Checks if the exe...
> > >     do.jenkins.nagios.check_queue         Check if the buil...
> > >     do.provision.nagios_check             Show a summary of...
> > >
> > > Though those will not give you the bare data (they were designed with
> > > nagios in mind, not graphite, so they are just checks; the stats were
> > > added later).
> > >
> > > There's also a bunch of helper functions to create nagios checks.
> > >
> >
> > Cool, I wasn't aware of those fabric checks.
> > I think for simple metrics (loads and such) we could use that (i.e. query
> > Jenkins from fabric), but for more complicated queries we'd need to query
> > graphite itself. With this [2] I could create scripts that query graphite
> > and trigger Icinga alerts. For example: calculate the 'expected' slave
> > load for the next hour (in graphite), and then:
> > Icinga queries graphite -> triggers another Icinga alert -> triggers a
> > custom script (such as a fab task to create slaves)
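
The "Icinga queries graphite" step could be sketched roughly as below. This
is not graphite-beacon [2]; it is a hand-rolled illustration that assumes
the data is fetched from Graphite's render API with format=json (a list of
series, each with "datapoints" as [value, timestamp] pairs) and that the
result is reported using the standard Nagios/Icinga plugin exit codes. The
function names and thresholds are hypothetical.

```python
import json

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(render_json):
    """Return the newest non-null value of the first series, or None."""
    series = json.loads(render_json)
    if not series:
        return None
    for value, _timestamp in reversed(series[0]["datapoints"]):
        if value is not None:
            return value
    return None


def check_threshold(value, warn, crit):
    """Map a value to (exit_code, message) in the plugin convention."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no data"
    if value >= crit:
        return CRITICAL, "CRITICAL - value %.2f >= %.2f" % (value, crit)
    if value >= warn:
        return WARNING, "WARNING - value %.2f >= %.2f" % (value, warn)
    return OK, "OK - value %.2f" % value
```

A wrapper script would print the message and exit with the code; whatever
reaction follows (such as a fab task) would hang off the alert, not off this
check.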
>
> I'd be careful with the reactions for now, but yes, that's great.
>
> >
> > For now, I added two more metrics: the top 10 jobs in the past X time,
> > and the average number of builds running / builds waiting in the queue in
> > the past X time. Some metrics might 'glitch' from time to time, as there
> > is not a lot of data yet and it mainly counts integer values while
> > graphite is oriented towards floats, so the data has to be smoothed
> > (usually with movingAverage()).
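
To illustrate why smoothing helps with spiky integer counts, here is a small
trailing moving average in plain Python. It only mimics the idea behind
graphite's movingAverage(); it is not graphite's implementation, and the
function name and the treatment of gaps (None values are skipped) are
assumptions made for the example.

```python
def moving_average(values, window):
    """Trailing moving average over the last `window` points.

    None marks a missing datapoint and is ignored; a window that contains
    only gaps yields None.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = [v for v in values[start:i + 1] if v is not None]
        smoothed.append(sum(chunk) / len(chunk) if chunk else None)
    return smoothed
```

For a queue length that flips between 0 and 4 on every sample, a window of 2
settles at 2.0 after the first point instead of jumping up and down.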
> >
> >
> >
> > [1] https://wiki.jenkins-ci.org/display/JENKINS/Statistics+Notification+Plugin
> > [2] https://github.com/klen/graphite-beacon
> >
> > On Fri, Apr 15, 2016 at 9:39 AM, David Caro <dcaro at redhat.com> wrote:
> >
> > > On 04/15 01:24, Nadav Goldin wrote:
> > > > Hi,
> > > > I've created an experimental dashboard for Jenkins at our Grafana
> > > > instance:
> > > > http://graphite.phx.ovirt.org/dashboard/db/jenkins-monitoring
> > > > (if you don't have an account, you can enrol with github/google)
> > >
> > > Nice! \o/
> > >
> > > >
> > > > currently it collects the following metrics:
> > > > 1) How many jobs in the Build Queue are waiting per slaves' label:
> > > >
> > > > for instance: if there are 4 builds of a job that is restricted to
> > > > 'el7' and 2 builds of another job which is also restricted to 'el7'
> > > > in the build queue, we will see 6 for 'el7' in the first graph.
> > > > 'No label' sums jobs which are waiting but are unrestricted.
> > > >
> > > > 2) How many slaves are idle per label.
> > > > note that the slave's labels are contained in the job's labels, but
> > > > not vice versa, as we allow regex expressions such as (fc21 || fc22 ).
> > > > right now it treats them as simple strings.
> > > >
> > > > 3) Total number of online/offline/idle slaves
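
The per-label counting described in the list above can be sketched as
follows. The input shapes are hypothetical (plain tuples and dicts rather
than whatever python-jenkins actually returns), and label expressions are
deliberately used as opaque strings, matching the "treats them as simple
strings" behaviour described in the message.

```python
from collections import Counter

NO_LABEL = "No label"


def queue_per_label(queued_builds):
    """Count queued builds per label.

    queued_builds: iterable of (job_name, label) pairs; label is None or
    empty for unrestricted jobs. An expression such as '(fc21 || fc22)' is
    counted under that exact string.
    """
    counts = Counter()
    for _job, label in queued_builds:
        counts[label if label else NO_LABEL] += 1
    return counts


def idle_per_label(slaves):
    """Count idle, online slaves per label.

    slaves: iterable of dicts with 'labels' (list of str), 'idle' (bool)
    and 'offline' (bool). A slave with several labels counts once per label.
    """
    counts = Counter()
    for slave in slaves:
        if slave["idle"] and not slave["offline"]:
            for label in slave["labels"]:
                counts[label] += 1
    return counts
```

With 4 queued builds of one 'el7' job and 2 of another, queue_per_label
reports 6 for 'el7', which is the number the first graph would show.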
> > > >
> > > > besides the normal monitoring, it can help us:
> > > > 1) minimize the difference between 'idle' slaves per label and jobs
> > > > waiting in the build queue per label.
> > > > this might be caused by unnecessary restrictions on the label, or
> > > > maybe by the 'Throttle Concurrent Builds' plugin.
> > > > 2) decide how many VMs and which OS to install on the new hosts.
> > > > 3) in the future, once we have the 'slave pools' implemented, we
> > > > could implement auto-scaling based on thresholds or some other
> > > > function.
> > > >
> > > >
> > > > 'experimental' - as it still needs to be tested for stability (it is
> > > > based on python-jenkins and graphite-send), and more metrics can be
> > > > added (maybe avg running time per job? builds per hour?) - I will be
> > > > happy to hear suggestions.
> > >
> > > I think that will change a lot on a per-project basis. If we can get
> > > that info per job, then with grafana we can aggregate and create
> > > secondary stats (like builds per hour, as you say).
> > > So I'd say just collect the 'bare' data, like job built event, job
> > > ended, duration and such.
> > >
> > > >
> > > > Later I plan to pack it all into independent fabric tasks (i.e. fab
> > > > do.jenkins.slaves.show)
> > >
> > > Have you checked the current ds fabric checks?
> > > There are already a bunch of fabric tasks that monitor jenkins. If we
> > > install nagiosgraph (see ds for details) to send the nagios performance
> > > data into graphite, we can use them as is to also start alarms and such.
> > >
> > >     dcaro at akhos$ fab -l | grep nagi
> > >     do.jenkins.nagios.check_build_load    Checks if the bui...
> > >     do.jenkins.nagios.check_executors     Checks if the exe...
> > >     do.jenkins.nagios.check_queue         Check if the buil...
> > >     do.provision.nagios_check             Show a summary of...
> > >
> > > Though those will not give you the bare data (they were designed with
> > > nagios in mind, not graphite, so they are just checks; the stats were
> > > added later).
> > >
> > > There's also a bunch of helper functions to create nagios checks.
> > >
> > >
> > > >
> > > >
> > > > Nadav
> > >
> > > > _______________________________________________
> > > > Infra mailing list
> > > > Infra at ovirt.org
> > > > http://lists.ovirt.org/mailman/listinfo/infra
> > >
> > >
> > > --
> > > David Caro
> > >
> > > Red Hat S.L.
> > > Continuous Integration Engineer - EMEA ENG Virtualization R&D
> > >
> > > Tel.: +420 532 294 605
> > > Email: dcaro at redhat.com
> > > IRC: dcaro|dcaroest@{freenode|oftc|redhat}
> > > Web: www.redhat.com
> > > RHT Global #: 82-62605
> > >


-- 
Eyal Edri
Associate Manager
RHEV DevOps
EMEA ENG Virtualization R&D
Red Hat Israel

phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)