On 04/17 23:55, Nadav Goldin wrote:
>
> I think that will change a lot on a per-project basis; if we can get that
> info per job, with grafana we can then aggregate and create secondary
> stats (like builds per hour as you say).
> So I'd say just collect the 'bare' data, like job built event, job ended,
> duration and such.

Agreed, will need to improve that. Right now it 'pulls' every X seconds via
the CLI instead of Jenkins sending the events, so it is limited to what the
CLI can provide and not that efficient. I plan to install [1] and do the
opposite (Jenkins will send a POST request with the data on each build event,
and it would then be sent to graphite).
Amarchuk already had some ideas on integrating collectd with jenkins; imo that
will work well for 'master' related stats and will be more difficult for
others like job started, etc., but it's worth looking at.
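To make that flow concrete, the receiving end could be as simple as the
sketch below - the plugin's payload fields, the port and the metric names
here are just guesses to illustrate the idea, not what [1] actually sends:

# Sketch of a receiver for the push model described above: Jenkins POSTs a
# JSON build event and we relay a counter to graphite's plaintext listener.
# The payload fields ('name', 'phase'), the port and the metric naming are
# guesses, not the plugin's documented format.
import json
import socket
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

GRAPHITE_HOST = 'graphite.phx.ovirt.org'  # assumed graphite host
GRAPHITE_PORT = 2003                      # graphite plaintext protocol port


def send_metric(path, value):
    # One line per metric: "<path> <value> <timestamp>\n"
    line = '%s %s %d\n' % (path, value, int(time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(line.encode('ascii'))


class BuildEventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        event = json.loads(self.rfile.read(length) or b'{}')
        job = event.get('name', 'unknown').replace('.', '_')
        phase = event.get('phase', 'unknown').lower()  # e.g. started/finished
        # One counter increment per build event, e.g. jenkins.jobs.foo.started
        send_metric('jenkins.jobs.%s.%s' % (job, phase), 1)
        self.send_response(200)
        self.end_headers()


if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8000), BuildEventHandler).serve_forever()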

> Have you checked the current ds fabric checks?
> There are already a bunch of fabric tasks that monitor jenkins; if we
> install nagiosgraph (see ds for details) to send the nagios performance
> data into graphite, we can use them as-is to also start alarms and such.
>
Icinga2 has integrated graphite support, so after the upgrade we will get
all of our alarm data sent to graphite 'out-of-the-box'.
+1!

>
> dcaro@akhos$ fab -l | grep nagi
>     do.jenkins.nagios.check_build_load    Checks if the bui...
>     do.jenkins.nagios.check_executors     Checks if the exe...
>     do.jenkins.nagios.check_queue         Check if the buil...
>     do.provision.nagios_check             Show a summary of...
>
> Though those will not give you the bare data (they were designed with
> nagios in mind, not graphite, so they are just checks; the stats were
> added later).
>
> There's also a bunch of helper functions to create nagios checks too.
>

Cool, I wasn't aware of those fabric checks.
I think for simple metrics (loads and such) we could use that (i.e. query
Jenkins from fabric), but for more complicated queries we'd need to query
graphite itself. With this [2] I could create scripts that query graphite
and trigger Icinga alerts, such as: calculate the 'expected' slave load for
the next hour (in graphite) and then:
Icinga queries graphite -> triggers another Icinga alert -> triggers a custom
script (such as a fab task to create slaves).
I'd be careful with the reactions for now, but yes, that's great.
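A rough sketch of what such a check script could look like (the graphite
target and the thresholds are made up, only to show the shape of it):

#!/usr/bin/env python
# Sketch of a check script: query graphite's render API and exit with
# nagios/icinga style codes so it can be plugged in as a check command.
# The target expression and the thresholds are made up for illustration.
import json
import sys
import urllib.parse
import urllib.request

GRAPHITE = 'http://graphite.phx.ovirt.org'
TARGET = 'movingAverage(jenkins.queue.el7.waiting, 10)'  # hypothetical metric
WARN, CRIT = 5, 10

url = '%s/render?target=%s&from=-1hours&format=json' % (
    GRAPHITE, urllib.parse.quote(TARGET))
series = json.load(urllib.request.urlopen(url))
if not series:
    print('UNKNOWN: no data returned for %s' % TARGET)
    sys.exit(3)

# Latest non-null datapoint of the first returned series.
points = [value for value, _ in series[0]['datapoints'] if value is not None]
latest = points[-1] if points else 0

if latest >= CRIT:
    print('CRITICAL: %s = %s' % (TARGET, latest))
    sys.exit(2)
if latest >= WARN:
    print('WARNING: %s = %s' % (TARGET, latest))
    sys.exit(1)
print('OK: %s = %s' % (TARGET, latest))
sys.exit(0)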

For now I added two more metrics: top 10 jobs in the past X time, and the avg
number of builds running / builds waiting in the queue in the past X time.
Some metrics might 'glitch' from time to time as there is not a lot of data
yet, and it mainly counts integer values while graphite is oriented towards
floats, so the data has to be smoothed (usually with movingAverage()).
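For example, the kind of smoothing/aggregation I mean, as render targets
(the metric paths are placeholders, not the real series names):

# The kind of smoothing/aggregation meant above, expressed as graphite render
# targets (metric paths are placeholders for whatever naming scheme is used).
from urllib.parse import urlencode

targets = [
    # Smooth the raw integer queue counter over a 10-point window.
    'movingAverage(jenkins.queue.el7.waiting, 10)',
    # Builds per hour, derived from the raw build-started events.
    'summarize(jenkins.jobs.*.started, "1hour", "sum")',
]

url = 'http://graphite.phx.ovirt.org/render?' + urlencode(
    [('target', t) for t in targets] + [('from', '-24hours'), ('format', 'png')])
print(url)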

[1] https://wiki.jenkins-ci.org/display/JENKINS/Statistics+Notification+Plugin
[2] https://github.com/klen/graphite-beacon

On Fri, Apr 15, 2016 at 9:39 AM, David Caro <dcaro(a)redhat.com> wrote:
> On 04/15 01:24, Nadav Goldin wrote:
> > Hi,
> > I've created an experimental dashboard for Jenkins at our Grafana
> > instance:
> > http://graphite.phx.ovirt.org/dashboard/db/jenkins-monitoring
> > (if you don't have an account, you can enrol with github/google)
>
> Nice! \o/
>
> >
> > currently it collects the following metrics:
> > 1) How many jobs in the Build Queue are waiting per slaves' label:
> >
> > for instance: if there are 4 builds of a job that is restricted to 'el7'
> > and 2 builds of another job which is restricted to 'el7' in the build
> > queue, we will see 6 for 'el7' in the first graph.
> > 'No label' sums jobs which are waiting but are unrestricted.
> >
> > 2) How many slaves are idle per label.
> > note that the slave's labels are contained in the job's labels, but not
> > vice versa, as we allow regex expressions such as (fc21 || fc22). right
> > now it treats them as simple strings.
> >
> > 3) Total number of online/offline/idle slaves
> >
> > besides the normal monitoring, it can help us:
> > 1) minimize the difference between 'idle' slaves per label and jobs
> > waiting in the build queue per label.
> > this might be caused by unnecessary restrictions on the label, or maybe
> > by the 'Throttle Concurrent Builds' plugin.
> > 2) decide how many VMs and which OS to install on the new hosts.
> > 3) in the future, once we have the 'slave pools' implemented, we could
> > implement auto-scaling based on thresholds or some other function.
> >
> >
> > 'experimental' - as it still needs to be tested for stability (it is based
> > on python-jenkins and graphite-send) and also more metrics can be added
> > (maybe avg running time per job? builds per hour?) - will be happy to hear.
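(Just to make the per-label idea concrete, something along these lines with
python-jenkins and graphitesend - a simplified sketch with placeholder names,
not the actual script:)

# Simplified sketch of the per-label collection idea (not the actual script):
# count idle slaves per label with python-jenkins and push the counts with
# graphitesend. Server addresses, credentials and metric names are placeholders.
import collections

import graphitesend
import jenkins

server = jenkins.Jenkins('http://jenkins.ovirt.org',
                         username='monitor', password='api-token')
g = graphitesend.init(graphite_server='graphite.phx.ovirt.org',
                      prefix='jenkins')

idle_per_label = collections.Counter()
for node in server.get_nodes():
    # Skip the master and anything offline; we only care about idle slaves.
    if node['name'] == 'master' or node['offline']:
        continue
    info = server.get_node_info(node['name'])
    if not info.get('idle'):
        continue
    # A slave usually carries several labels; count it once per label.
    for label in info.get('assignedLabels', []):
        idle_per_label[label['name']] += 1

for label, count in idle_per_label.items():
    g.send('slaves.idle.%s' % label, count)
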
>
> I think that will change a lot on a per-project basis; if we can get that
> info per job, with grafana we can then aggregate and create secondary
> stats (like builds per hour as you say).
> So I'd say just collect the 'bare' data, like job built event, job ended,
> duration and such.
>
> >
> > I plan later to pack it all into independent fabric tasks (i.e. fab
> > do.jenkins.slaves.show)
>
> Have you checked the current ds fabric checks?
> There are already a bunch of fabric tasks that monitor jenkins; if we
> install nagiosgraph (see ds for details) to send the nagios performance
> data into graphite, we can use them as-is to also start alarms and such.
>
> dcaro@akhos$ fab -l | grep nagi
>     do.jenkins.nagios.check_build_load    Checks if the bui...
>     do.jenkins.nagios.check_executors     Checks if the exe...
>     do.jenkins.nagios.check_queue         Check if the buil...
>     do.provision.nagios_check             Show a summary of...
>
> Though those will not give you the bare data (they were designed with
> nagios in mind, not graphite, so they are just checks; the stats were
> added later).
>
> There's also a bunch of helper functions to create nagios checks too.
>
>
> >
> >
> > Nadav
>
> > _______________________________________________
> > Infra mailing list
> > Infra(a)ovirt.org
> > http://lists.ovirt.org/mailman/listinfo/infra
>
>
> --
> David Caro
>
> Red Hat S.L.
> Continuous Integration Engineer - EMEA ENG Virtualization R&D
>
> Tel.: +420 532 294 605
> Email: dcaro(a)redhat.com
> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
> Web: www.redhat.com
> RHT Global #: 82-62605
>
_______________________________________________
Infra mailing list
Infra(a)ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra
--
David Caro
Red Hat S.L.
Continuous Integration Engineer - EMEA ENG Virtualization R&D
Tel.: +420 532 294 605
Email: dcaro(a)redhat.com
IRC: dcaro|dcaroest@{freenode|oftc|redhat}
Web: www.redhat.com
RHT Global #: 82-62605