Experimental Jenkins monitoring

Fabian Deutsch fdeutsch at redhat.com
Wed Apr 20 12:14:20 UTC 2016


That's looking pretty nice!

- fabian

On Wed, Apr 20, 2016 at 1:48 PM, Eyal Edri <eedri at redhat.com> wrote:

> Looks amazing!
>
> Just adding a screenshot so people will see how nice it is :) [1]
>
> [1] http://graphite.phx.ovirt.org/dashboard/snapshot/qvdhPL34HoUpygGldklmT5uLUnRyj19E
>
>
> On Mon, Apr 18, 2016 at 10:36 AM, David Caro <dcaro at redhat.com> wrote:
>
>> On 04/17 23:55, Nadav Goldin wrote:
>> > >
>> > > I think that will change a lot on a per-project basis. If we can get
>> > > that info per job, then with grafana we can aggregate and create
>> > > secondary stats (like builds per hour, as you say).
>> > > So I'd say just collect the 'bare' data, like job built event, job
>> > > ended, duration and such.
>> >
>> > Agree. Will need to improve that: right now it 'pulls' every X seconds
>> > via the CLI, instead of Jenkins sending the events, so it is limited to
>> > what the CLI can provide and is not that efficient. I plan to install [1]
>> > and do the opposite (Jenkins will send a POST request with the data on
>> > each build event, and then it would be sent to graphite).
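
A minimal sketch of the last step of the push-style flow described above: turning one build event into metrics for graphite. The event fields (`job`, `duration_ms`, `finished_at`) and the `jenkins.jobs` prefix are hypothetical and are not taken from the Statistics Notification Plugin. The sketch only formats lines in Carbon's plaintext protocol (`<path> <value> <timestamp>`).

```python
import re


def sanitize(name):
    """Replace characters that would break a Graphite metric path."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", name)


def event_to_carbon_lines(event, prefix="jenkins.jobs"):
    """Turn one build event into Carbon plaintext lines.

    The event layout is an assumption made for illustration only.
    """
    base = "{}.{}".format(prefix, sanitize(event["job"]))
    ts = int(event["finished_at"])
    return [
        "{}.duration_ms {} {}".format(base, event["duration_ms"], ts),
        "{}.builds 1 {}".format(base, ts),
    ]


if __name__ == "__main__":
    # Made-up sample event; a real payload may look different.
    sample = {"job": "example-job", "duration_ms": 84000,
              "finished_at": 1460937600}
    for line in event_to_carbon_lines(sample):
        print(line)
```

Each line would then be written, newline-terminated, to Carbon's plaintext listener (port 2003 by default).
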
>>
>> Amarchuk already had some ideas on integrating collectd with jenkins. IMO
>> that will work well for 'master'-related stats and be more difficult for
>> others like job started, etc., but it's worth looking at.
>>
>> >
>> > > Have you checked the current ds fabric checks?
>> > > There are already a bunch of fabric tasks that monitor jenkins. If we
>> > > install nagiosgraph (see ds for details) to send the nagios performance
>> > > data into graphite, we can use them as is to also start alarms and such.
>> > >
>> > Icinga2 has integrated graphite support, so after the upgrade we will
>> > get all of our alarms data sent to graphite 'out-of-the-box'.
>>
>> +1!
>>
>> >
>> > >
>> > >     dcaro at akhos$ fab -l | grep nagi
>> > >     do.jenkins.nagios.check_build_load    Checks if the bui...
>> > >     do.jenkins.nagios.check_executors     Checks if the exe...
>> > >     do.jenkins.nagios.check_queue         Check if the buil...
>> > >     do.provision.nagios_check             Show a summary of...
>> > >
>> > > Though those will not give you the bare data (they were designed with
>> > > nagios in mind, not graphite, so they are just checks; the stats were
>> > > added later).
>> > >
>> > > There's also a bunch of helper functions to create nagios checks too.
>> > >
>> >
>> > Cool, wasn't aware of those fabric checks.
>> > I think for simple metrics (loads and such) we could use that (i.e. query
>> > Jenkins from fabric), but for more complicated queries we'd need to query
>> > graphite itself. With this [2] I could create scripts that query graphite
>> > and trigger Icinga alerts, such as: calculate the 'expected' slaves load
>> > for the next hour (in graphite) and then:
>> > Icinga queries graphite -> triggers another Icinga alert -> triggers
>> > custom script (such as a fab task to create slaves)
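
As a rough illustration of the "query graphite, then raise an alert" step, here is a sketch of a Nagios-style check that maps the latest value of a series to the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not graphite-beacon's API. The input is assumed to be one series in the JSON layout returned by Graphite's render API (`target` plus `datapoints` as `[value, timestamp]` pairs); the metric name and thresholds are made up.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(series):
    """Return the most recent non-null value of one render-API series."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def check(series, warn, crit):
    """Map the latest value to an (exit_code, message) pair."""
    value = latest_value(series)
    if value is None:
        return UNKNOWN, "UNKNOWN - no data for {}".format(series["target"])
    if value >= crit:
        state, code = "CRITICAL", CRITICAL
    elif value >= warn:
        state, code = "WARNING", WARNING
    else:
        state, code = "OK", OK
    return code, "{} - {} = {}".format(state, series["target"], value)


if __name__ == "__main__":
    # Hypothetical series: builds waiting in the queue for label 'el7'.
    sample = {"target": "jenkins.queue.el7",
              "datapoints": [[2, 1460937540], [7, 1460937600], [None, 1460937660]]}
    code, message = check(sample, warn=5, crit=10)
    print(message)
    raise SystemExit(code)
```

Icinga could run a script like this as an ordinary check command and attach any reaction to the resulting state change.
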
>>
>> I'd be careful with the reactions for now, but yes, that's great.
>>
>> >
>> > For now, I added two more metrics: top 10 jobs in the past X time, and
>> > avg number of builds running / builds waiting in queue in the past X
>> > time. Some metrics might 'glitch' from time to time, as there is not a
>> > lot of data yet and it mainly counts integer values while graphite is
>> > oriented towards floats, so the data has to be smoothed (usually with
>> > movingAverage()).
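
To show what that smoothing does to integer counts, here is a plain-Python moving average. It is only a rough analogue of Graphite's `movingAverage()`, not its implementation: the window is counted in points, and the handling of the first few points and of gaps (`None`) is an assumption of this sketch.

```python
def moving_average(values, window):
    """Average each point with up to `window - 1` preceding points.

    None entries (gaps in the series) are skipped rather than counted as 0.
    """
    out = []
    for i in range(len(values)):
        chunk = [v for v in values[max(0, i - window + 1):i + 1]
                 if v is not None]
        out.append(sum(chunk) / len(chunk) if chunk else None)
    return out


if __name__ == "__main__":
    # Made-up queue lengths sampled at a fixed interval.
    queue_lengths = [0, 3, 0, 3, None, 6]
    print(moving_average(queue_lengths, 2))
```

A spiky integer series such as 0, 3, 0, 3 becomes a flatter line, which is easier to read on a dashboard.
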
>> >
>> >
>> >
>> > [1] https://wiki.jenkins-ci.org/display/JENKINS/Statistics+Notification+Plugin
>> > [2] https://github.com/klen/graphite-beacon
>> >
>> > On Fri, Apr 15, 2016 at 9:39 AM, David Caro <dcaro at redhat.com> wrote:
>> >
>> > > On 04/15 01:24, Nadav Goldin wrote:
>> > > > Hi,
>> > > > I've created an experimental dashboard for Jenkins at our Grafana instance:
>> > > > http://graphite.phx.ovirt.org/dashboard/db/jenkins-monitoring
>> > > > (if you don't have an account, you can enrol with github/google)
>> > >
>> > > Nice! \o/
>> > >
>> > > >
>> > > > Currently it collects the following metrics:
>> > > > 1) How many jobs in the Build Queue are waiting per slaves' label.
>> > > >
>> > > > For instance: if there are 4 builds of a job that is restricted to
>> > > > 'el7' and 2 builds of another job which is restricted to 'el7' in the
>> > > > build queue, we will see 6 for 'el7' in the first graph.
>> > > > 'No label' sums jobs which are waiting but are unrestricted.
>> > > >
>> > > > 2) How many slaves are idle per label.
>> > > > Note that the slave's labels are contained in the job's labels, but
>> > > > not vice versa, as we allow regex expressions such as (fc21 || fc22).
>> > > > Right now it treats them as simple strings.
>> > > >
>> > > > 3) Total number of online/offline/idle slaves
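
The per-label counting in metrics 1) and 2) above can be sketched as follows. The dictionaries stand in for whatever the Jenkins API returns; their keys (`label`, `labels`, `idle`, `offline`) are assumptions of this sketch, not python-jenkins field names. Label expressions are treated as plain strings, as in the description.

```python
from collections import Counter

NO_LABEL = "No label"


def waiting_per_label(queue_items):
    """Count queued builds per label expression, treated as a plain string."""
    return Counter(item.get("label") or NO_LABEL for item in queue_items)


def idle_per_label(slaves):
    """Count idle, online slaves once for each label they carry."""
    counts = Counter()
    for slave in slaves:
        if slave["idle"] and not slave["offline"]:
            counts.update(slave["labels"])
    return counts


if __name__ == "__main__":
    # Made-up data mirroring the example: 4 + 2 builds restricted to 'el7'.
    queue = ([{"job": "job-a", "label": "el7"}] * 4
             + [{"job": "job-b", "label": "el7"}] * 2
             + [{"job": "job-c", "label": None}])
    slaves = [{"labels": ["el7"], "idle": True, "offline": False},
              {"labels": ["fc22"], "idle": False, "offline": False}]
    print(dict(waiting_per_label(queue)))
    print(dict(idle_per_label(slaves)))
```

Comparing the two counters label by label gives the gap between waiting jobs and idle slaves.
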
>> > > >
>> > > > Besides the normal monitoring, it can help us:
>> > > > 1) minimize the difference between 'idle' slaves per label and jobs
>> > > > waiting in the build queue per label. This might be caused by
>> > > > unnecessary restrictions on the label, or maybe by the
>> > > > 'Throttle Concurrent Builds' plugin.
>> > > > 2) decide how many VMs and which OS to install on the new hosts.
>> > > > 3) in the future, once we have the 'slave pools' implemented, we could
>> > > > implement auto-scaling based on thresholds or some other function.
>> > > >
>> > > >
>> > > > 'experimental' - as it still needs to be tested for stability (it is
>> > > > based on python-jenkins and graphite-send), and more metrics can also
>> > > > be added (maybe avg running time per job? builds per hour?) - will be
>> > > > happy to hear ideas.
>> > >
>> > > I think that will change a lot on a per-project basis. If we can get
>> > > that info per job, then with grafana we can aggregate and create
>> > > secondary stats (like builds per hour, as you say).
>> > > So I'd say just collect the 'bare' data, like job built event, job
>> > > ended, duration and such.
>> > >
>> > > >
>> > > > I plan later to pack it all into independent fabric tasks (i.e. fab
>> > > > do.jenkins.slaves.show)
>> > >
>> > > Have you checked the current ds fabric checks?
>> > > There are already a bunch of fabric tasks that monitor jenkins. If we
>> > > install nagiosgraph (see ds for details) to send the nagios performance
>> > > data into graphite, we can use them as is to also start alarms and such.
>> > >
>> > >     dcaro at akhos$ fab -l | grep nagi
>> > >     do.jenkins.nagios.check_build_load    Checks if the bui...
>> > >     do.jenkins.nagios.check_executors     Checks if the exe...
>> > >     do.jenkins.nagios.check_queue         Check if the buil...
>> > >     do.provision.nagios_check             Show a summary of...
>> > >
>> > > Though those will not give you the bare data (they were designed with
>> > > nagios in mind, not graphite, so they are just checks; the stats were
>> > > added later).
>> > >
>> > > There's also a bunch of helper functions to create nagios checks too.
>> > >
>> > >
>> > > >
>> > > >
>> > > > Nadav
>> > >
>> > > > _______________________________________________
>> > > > Infra mailing list
>> > > > Infra at ovirt.org
>> > > > http://lists.ovirt.org/mailman/listinfo/infra
>> > >
>> > >
>> > > --
>> > > David Caro
>> > >
>> > > Red Hat S.L.
>> > > Continuous Integration Engineer - EMEA ENG Virtualization R&D
>> > >
>> > > Tel.: +420 532 294 605
>> > > Email: dcaro at redhat.com
>> > > IRC: dcaro|dcaroest@{freenode|oftc|redhat}
>> > > Web: www.redhat.com
>> > > RHT Global #: 82-62605
>> > >
>>
>>
>>
>
>
> --
> Eyal Edri
> Associate Manager
> RHEV DevOps
> EMEA ENG Virtualization R&D
> Red Hat Israel
>
> phone: +972-9-7692018
> irc: eedri (on #tlv #rhev-dev #rhev-integ)
>
>
>


-- 
Fabian Deutsch <fdeutsch at redhat.com>
RHEV Hypervisor
Red Hat