<div dir="ltr">Looks amazing!<div><br></div><div>Just adding a screenshot so people will see how nice it is :) [1]</div><div><br></div><div>[1] <a href="http://graphite.phx.ovirt.org/dashboard/snapshot/qvdhPL34HoUpygGldklmT5uLUnRyj19E">http://graphite.phx.ovirt.org/dashboard/snapshot/qvdhPL34HoUpygGldklmT5uLUnRyj19E</a><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Apr 18, 2016 at 10:36 AM, David Caro <span dir="ltr"><<a href="mailto:dcaro@redhat.com" target="_blank">dcaro@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 04/17 23:55, Nadav Goldin wrote:<br>
> ><br>
> > I think that will change a lot per-project basis, if we can get that info<br>
> > per<br>
> > job, with grafana then we can aggregate and create secondary stats (like<br>
> > builds<br>
> > per hour as you say).<br>
> > So I'd say just to collect the 'bare' data, like job built event, job<br>
> > ended,<br>
> > duration and such.<br>
><br>
> agree. will need to improve that, right now it 'pulls' every X seconds via<br>
> the CLI,<br>
> instead of Jenkins sending the events, so it is limited to what the CLI can<br>
> provide and not that efficient. I plan to install [1] and do the opposite<br>
> (Jenkins will send a POST request with the data on each build<br>
> event and then it would be sent to graphite)<br>
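<br>
A minimal sketch of the push model described above, assuming the build event arrives as a dict and is forwarded over Graphite's plaintext protocol ("path value timestamp" per line). The keys 'job', 'result' and 'duration_ms', the 'jenkins.builds' prefix and both function names are hypothetical; the real payload depends on what the Jenkins plugin posts.<br>

```python
import socket
import time


def build_event_to_graphite_lines(event, prefix="jenkins.builds"):
    """Turn one build event (a dict) into Graphite plaintext-protocol lines.

    The keys 'job', 'result' and 'duration_ms' are hypothetical; the real
    payload depends on what the Jenkins plugin actually posts.
    """
    # Graphite uses dots as path separators, so replace them in the job name.
    job = event["job"].replace(".", "_").replace(" ", "_")
    timestamp = int(event.get("timestamp", time.time()))
    base = "{}.{}".format(prefix, job)
    return [
        "{}.duration_ms {} {}".format(base, event["duration_ms"], timestamp),
        "{}.result.{} 1 {}".format(base, event["result"].lower(), timestamp),
    ]


def send_to_graphite(lines, host, port=2003):
    """Send lines to a Carbon plaintext listener (2003 is the usual default port)."""
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

Keeping the formatting separate from the socket call means the conversion can be checked without a running Graphite instance.<br>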
<br>
</span>Amarchuk already had some ideas on integrating collectd with jenkins; imo that<br>
will work well for 'master' related stats and be more difficult for others like<br>
job started, etc., but it's worth looking at<br>
<span class=""><br>
><br>
> Have you checked the current ds fabric checks?<br>
> > There are already a bunch of fabric tasks that monitor jenkins, if we<br>
> > install<br>
> > the nagiosgraph (see ds for details) to send the nagios performance data<br>
> > into<br>
> > graphite, we can use them as is to also start alarms and such<br>
> ><br>
> Icinga2 has integrated graphite support, so after the upgrade we will<br>
> get all of our alarms data sent to graphite 'out-of-the-box'.<br>
<br>
</span>+1!<br>
<span class=""><br>
><br>
> ><br>
> > dcaro@akhos$ fab -l | grep nagi<br>
> > do.jenkins.nagios.check_build_load Checks if the<br>
> > bui...<br>
> > do.jenkins.nagios.check_executors Checks if the<br>
> > exe...<br>
> > do.jenkins.nagios.check_queue Check if the<br>
> > buil...<br>
> > do.provision.nagios_check Show a summary<br>
> > of...<br>
> ><br>
> > Though those will not give you the bare data (were designed with nagios in<br>
> > mind, not graphite so they are just checks, the stats were added later)<br>
> ><br>
> > There's also a bunch of helper functions to create nagios checks too.<br>
> ><br>
><br>
> cool, wasn't aware of those fabric checks.<br>
> I think for simple metrics(loads and such) we could use that(i.e. query<br>
> Jenkins from fabric)<br>
> but for more complicated queries we'd need to query graphite itself,<br>
> with this[2] I could create scripts that query graphite and trigger Icinga<br>
> alerts.<br>
> such as: calculate the 'expected' slaves load for the next hour(in graphite)<br>
> and then:<br>
> Icinga queries graphite -> triggers another Icinga alert -> triggers custom<br>
> script(such as<br>
> fab task to create slaves)<br>
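<br>
A rough sketch of the "query graphite, then raise an alert" step, assuming the check script receives the JSON body of a Graphite render request (format=json) and reports with the usual Nagios/Icinga plugin exit codes. The function names and thresholds are hypothetical, and this is not how graphite-beacon itself is implemented.<br>

```python
import json

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios/Icinga plugin exit codes


def latest_value(render_json):
    """Return the newest non-null value from a Graphite render response (format=json)."""
    series = json.loads(render_json)
    if not series:
        return None
    # Each datapoint is a [value, timestamp] pair; value is null when no data arrived.
    values = [point[0] for point in series[0]["datapoints"] if point[0] is not None]
    return values[-1] if values else None


def check_threshold(value, warn, crit):
    """Map a metric value to a plugin exit code and a status line."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no data"
    if value >= crit:
        return CRITICAL, "CRITICAL - value {}".format(value)
    if value >= warn:
        return WARNING, "WARNING - value {}".format(value)
    return OK, "OK - value {}".format(value)
```

A wrapper script would print the status line and exit with the returned code, which is what Icinga acts on.<br>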
<br>
</span>I'd be careful with the reactions for now, but yes, that's great.<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> for now, added two more metrics: top 10 jobs in past X time, and<br>
> avg number of builds running / builds waiting in queue in the past X time.<br>
> some metrics might 'glitch' from time to time as there is not a lot of data<br>
> yet<br>
> and it mainly counts integer values while graphite is oriented towards<br>
> floats, so the data has to be smoothed(usually with movingAverage())<br>
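<br>
To illustrate the smoothing mentioned above, here is a small trailing moving average over integer samples. It only shows the idea; it is not Graphite's own movingAverage() implementation, and the function name and window handling are assumptions.<br>

```python
def moving_average(values, window):
    """Average each point with up to `window - 1` preceding points, skipping gaps (None)."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        # Ignore missing samples so a single gap does not blank the whole window.
        chunk = [v for v in values[start:i + 1] if v is not None]
        smoothed.append(sum(chunk) / len(chunk) if chunk else None)
    return smoothed
```

With a window of 2, a series that jumps between 0 and 3 builds settles at 1.5 after the first sample, which is the kind of 'glitch' removal the dashboard relies on.<br>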
><br>
><br>
><br>
> [1]<br>
> <a href="https://wiki.jenkins-ci.org/display/JENKINS/Statistics+Notification+Plugin" rel="noreferrer" target="_blank">https://wiki.jenkins-ci.org/display/JENKINS/Statistics+Notification+Plugin</a><br>
> [2] <a href="https://github.com/klen/graphite-beacon" rel="noreferrer" target="_blank">https://github.com/klen/graphite-beacon</a><br>
><br>
> On Fri, Apr 15, 2016 at 9:39 AM, David Caro <<a href="mailto:dcaro@redhat.com">dcaro@redhat.com</a>> wrote:<br>
><br>
> > On 04/15 01:24, Nadav Goldin wrote:<br>
> > > Hi,<br>
> > > I've created an experimental dashboard for Jenkins at our Grafana<br>
> > instance:<br>
> > > <a href="http://graphite.phx.ovirt.org/dashboard/db/jenkins-monitoring" rel="noreferrer" target="_blank">http://graphite.phx.ovirt.org/dashboard/db/jenkins-monitoring</a><br>
> > > (if you don't have an account, you can enrol with github/google)<br>
> ><br>
> > Nice! \o/<br>
> ><br>
> > ><br>
> > > currently it collects the following metrics:<br>
> > > 1) How many jobs in the Build Queue are waiting per slaves' label:<br>
> > ><br>
> > > for instance: if there are 4 builds of a job that is restricted to 'el7'<br>
> > > and 2 builds of another job<br>
> > > which is restricted to 'el7' in the build queue we will see 6 for 'el7'<br>
> > in<br>
> > > the first graph.<br>
> > > 'No label' sums jobs which are waiting but are unrestricted.<br>
> > ><br>
> > > 2) How many slaves are idle per label.<br>
> > > note that the slave's labels are contained in the job's labels, but not<br>
> > > vice versa, as<br>
> > > we allow regex expressions such as (fc21 || fc22 ). right now it treats<br>
> > > them as simple<br>
> > > strings.<br>
> > ><br>
> > > 3) Total number of online/offline/idle slaves<br>
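<br>
The per-label queue count in 1) can be sketched as below, assuming each queued build is a dict with a hypothetical 'label' key and that label expressions are compared as plain strings, as the message says the dashboard does for now.<br>

```python
from collections import Counter


def queue_per_label(queued_builds):
    """Count waiting builds per label expression, treated as a plain string."""
    counts = Counter()
    for build in queued_builds:
        # 'label' is a hypothetical key; None or '' means the job is unrestricted.
        counts[build.get("label") or "No label"] += 1
    return counts
```

For the example above (4 builds of one 'el7' job plus 2 builds of another), this yields 6 for 'el7'; an expression such as (fc21 || fc22) would be counted under its own string rather than split per slave label.<br>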
> > ><br>
> > > besides the normal monitoring, it can help us:<br>
> > > 1) minimize the difference between 'idle' slaves per label and jobs<br>
> > waiting<br>
> > > in the build queue per label.<br>
> > > this might be caused by unnecessary restrictions on the label, or maybe<br>
> > by<br>
> > > the<br>
> > > 'Throttle Concurrent Builds' plugin.<br>
> > > 2) decide how many VMs and which OS to install on the new hosts.<br>
> > > 3) in the future, once we have the 'slave pools' implemented, we could<br>
> > > implement<br>
> > > auto-scaling based on thresholds or some other function.<br>
> > ><br>
> > ><br>
> > > 'experimental' - as it still needs to be tested for stability(it is based<br>
> > > on python-jenkins<br>
> > > and graphite-send) and also more metrics can be added(maybe avg running<br>
> > time<br>
> > > per job? builds per hour? ) - will be happy to hear.<br>
> ><br>
> > I think that will change a lot per-project basis, if we can get that info<br>
> > per<br>
> > job, with grafana then we can aggregate and create secondary stats (like<br>
> > builds<br>
> > per hour as you say).<br>
> > So I'd say just to collect the 'bare' data, like job built event, job<br>
> > ended,<br>
> > duration and such.<br>
> ><br>
> > ><br>
> > > I plan later to pack it all into independent fabric tasks(i.e. fab<br>
> > > do.jenkins.slaves.show)<br>
> ><br>
> > Have you checked the current ds fabric checks?<br>
> > There are already a bunch of fabric tasks that monitor jenkins, if we<br>
> > install<br>
> > the nagiosgraph (see ds for details) to send the nagios performance data<br>
> > into<br>
> > graphite, we can use them as is to also start alarms and such.<br>
> ><br>
> > dcaro@akhos$ fab -l | grep nagi<br>
> > do.jenkins.nagios.check_build_load Checks if the<br>
> > bui...<br>
> > do.jenkins.nagios.check_executors Checks if the<br>
> > exe...<br>
> > do.jenkins.nagios.check_queue Check if the<br>
> > buil...<br>
> > do.provision.nagios_check Show a summary<br>
> > of...<br>
> ><br>
> > Though those will not give you the bare data (were designed with nagios in<br>
> > mind, not graphite so they are just checks, the stats were added later)<br>
> ><br>
> > There's also a bunch of helper functions to create nagios checks too.<br>
> ><br>
> ><br>
> > ><br>
> > ><br>
> > > Nadav<br>
> ><br>
> > > _______________________________________________<br>
> > > Infra mailing list<br>
> > > <a href="mailto:Infra@ovirt.org">Infra@ovirt.org</a><br>
> > > <a href="http://lists.ovirt.org/mailman/listinfo/infra" rel="noreferrer" target="_blank">http://lists.ovirt.org/mailman/listinfo/infra</a><br>
> ><br>
> ><br>
> > --<br>
> > David Caro<br>
> ><br>
> > Red Hat S.L.<br>
> > Continuous Integration Engineer - EMEA ENG Virtualization R&D<br>
> ><br>
> > Tel.: <a href="tel:%2B420%20532%20294%20605" value="+420532294605">+420 532 294 605</a><br>
> > Email: <a href="mailto:dcaro@redhat.com">dcaro@redhat.com</a><br>
> > IRC: dcaro|dcaroest@{freenode|oftc|redhat}<br>
> > Web: <a href="http://www.redhat.com" rel="noreferrer" target="_blank">www.redhat.com</a><br>
> > RHT Global #: 82-62605<br>
> ><br>
<br>
</div></div><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div>Eyal Edri<br>Associate Manager</div><div>RHEV DevOps<br>EMEA ENG Virtualization R&D<br>Red Hat Israel<br><br>phone: +972-9-7692018<br>irc: eedri (on #tlv #rhev-dev #rhev-integ)</div></div></div></div></div>
</div>