Proposal: Hystrix for realtime command monitoring

Hi All,

I have been contributing to the engine for three months now. While I dug into the code I started to wonder how to visualize what the engine is actually doing.

To get better insights I added hystrix[1] to the engine. Hystrix is a circuit breaker library which was developed by Netflix and has one pretty interesting feature: real time metrics for commands.

In combination with hystrix-dashboard[2] it allows very interesting insights. You can easily get an overview of commands involved in operations, their performance and complexity. Look at [2] and the attachments in [5] and [6] for screenshots to get an impression.

I want to propose to integrate hystrix permanently because from my perspective the results were really useful and I also had some good experiences with hystrix in past projects.

A first implementation can be found on gerrit[3].

# Where is it immediately useful?

During development and QA.

An example: I tested the hystrix integration on the /api/vms and /api/hosts rest endpoints and immediately saw that the number of command executions grew linearly with the number of vms and hosts. The bug reports [5] and [6] are the result.

# How to monitor the engine?

It is as easy as starting a hystrix-dashboard [2] with

    $ git clone https://github.com/Netflix/Hystrix.git
    $ cd Hystrix/hystrix-dashboard
    $ ../gradlew jettyRun

and pointing the dashboard to

    https://<customer.engine.ip>/ovirt-engine/hystrix.stream

# Other possible benefits?

 * Live metrics at customer site for admins, consultants and support.

 * Historical metrics for analysis in addition to the log files.
   The metrics information is directly usable in graphite [7]. Therefore it would be possible to collect the json stream for a certain time period and analyze it later like in [4]. To do that someone just has to run

       curl --user admin@internal:engine http://localhost:8080/ovirt-engine/api/hystrix.stream > hystrix.stream

   for as long as necessary. The results can be analyzed later.

# Possible architectural benefits?

In addition to the live metrics we might also have use for the real hystrix features:

 * Circuit breaker
 * Bulk execution of commands
 * De-duplication of commands (caching)
 * Synchronous and asynchronous execution support
 * ...

Our commands already have a lot of features, so I don't think that there are any quick wins, but maybe there are interesting opportunities for infra.

# Overhead?

In [5] the Netflix employees describe their results regarding the overhead of wrapping every command into a new instance of a hystrix command.

They ran their tests on a standard 4-core Amazon EC2 server with a load of 60 requests per second.

When using threadpools they measured a mean overhead of less than one millisecond (so negligible). At the 90th percentile they measured an overhead of 3 ms, and at the 99th percentile of about 9 ms. When configuring the hystrix commands to use semaphores instead of threadpools they are even faster.

# How to integrate?

A working implementation can be found on gerrit[3]. These patch sets wrap a hystrix command around every VdcAction, every VdcQuery and every VDSCommand. This just required four small modifications in the code base.

# Security?

In the provided patches the hystrix-metrics-servlet is accessible at /ovirt-engine/api/hystrix.stream. It is protected by basic auth but accessible to everyone who can authenticate. We should probably restrict it to admins.

# Todo?

1) We do report failed actions with return values. Hystrix expects failing commands to throw an exception. So on the dashboard almost every command looks like a success. To overcome this, it would be pretty easy to throw an exception inside the command and catch it immediately after it leaves the hystrix wrapper.

2) Fine tuning: do we want semaphores or a thread pool? And with a thread pool, what size do we want?

3) Three unpackaged dependencies: archaius, hystrix-core, hystrix-contrib

# References

[1] https://github.com/Netflix/Hystrix
[2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
[3] https://gerrit.ovirt.org/#/q/topic:hystrix
[4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
[5] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
[5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
[6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
[7] http://graphite.wikidot.com


On Fri, Oct 2, 2015 at 4:24 PM, Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
On 2 Oct 2015, at 12:47, Roman Mohr wrote:
Hi All,
I am contributing to the engine for three months now. While I dug into the code I started to wonder how to visualize what the engine is actually doing.
This is one of the main problems with a large application; anything that helps to understand what's going on is very welcome
To get better insights I added hystrix[1] to the engine. Hystrix is a circuit breaker library which was developed by Netflix and has one pretty interesting feature: Real time metrics for commands.
In combination with hystrix-dashboard[2] it allows very interesting insights. You can easily get an overview of commands involved in operations, their performance and complexity. Look at [2] and the attachments in [5] and [6] for screenshots to get an impression.
I want to propose to integrate hystrix permanently because from my perspective the results were really useful and I also had some good experiences with hystrix in past projects.
A first implementation can be found on gerrit[3].
# Where is it immediately useful?
During development and QA.
An example: I tested the hystrix integration on the /api/vms and /api/hosts rest endpoints and immediately saw that the number of command executions grew linearly with the number of vms and hosts. The bug reports [5] and [6] are the result.
# How to monitor the engine?
It is as easy as starting a hystrix-dashboard [2] with
$ git clone https://github.com/Netflix/Hystrix.git
$ cd Hystrix/hystrix-dashboard
$ ../gradlew jettyRun
and point the dashboard to
https://<customer.engine.ip>/ovirt-engine/hystrix.stream.
# Other possible benefits?
* Live metrics at customer site for admins, consultants and support.
* Historical metrics for analysis in addition to the log files. The metrics information is directly usable in graphite [7]. Therefore it would be possible to collect the json stream for a certain time period and analyze them later like in [4]. To do that someone just has to run
curl --user admin@internal:engine http://localhost:8080/ovirt-engine/api/hystrix.stream > hystrix.stream
for as long as necessary. The results can be analyzed later.
+1, it's a great idea, and when properly documented so that even a BFU can do it, it would allow us to get a much better idea when something is not working or working too slowly on a system we don't have access to but where it's reproducible elsewhere. Just ask: "hey, run this thingie while you are reproducing the issue and send us the result"
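The stream captured this way is a series of "data: {json}" server-sent events, so it can also be summarized offline without graphite. A rough sketch, assuming the usual field names emitted by hystrix-metrics-event-stream (name, requestCount, errorCount) and using a naive regex instead of a real JSON parser:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class StreamSummary {

        // pulls a single field out of one JSON event; good enough for a quick look
        private static String field(String json, String name) {
            Matcher m = Pattern.compile("\"" + name + "\":\"?([^,\"}]+)").matcher(json);
            return m.find() ? m.group(1) : "?";
        }

        public static void main(String[] args) throws IOException {
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                // skip pings and thread pool events, keep command events only
                if (!line.startsWith("data:") || !line.contains("\"HystrixCommand\"")) {
                    continue;
                }
                System.out.printf("%-50s requests=%s errors=%s%n",
                        field(line, "name"), field(line, "requestCount"), field(line, "errorCount"));
            }
        }
    }

Running it against the hystrix.stream file produced by the curl command above prints one line per command event.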
# Possible architectural benefits?
In addition to the live metrics we might also have use for the real hystrix features:
* Circuit Breaker
* Bulk execution of commands
* De-duplication of commands (Caching)
* Synchronous and asynchronous execution support
* ...
Our commands already have a lot of features, so I don't think that there are any quick wins, but maybe there are interesting opportunities for infra.
eh.. I would worry about that much later. First we should understand what we are actually doing and why (as we all know the engine is likely doing a lot of useless stuff ;-)
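For reference, the de-duplication point above maps to Hystrix request caching: a command that overrides getCacheKey() is executed only once per HystrixRequestContext and the result is reused for identical commands. A minimal sketch, with the query type and id being made-up parameters:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext;

    public class CachedQueryCommand extends HystrixCommand<String> {

        private final String queryType;
        private final String entityId;

        public CachedQueryCommand(String queryType, String entityId) {
            super(HystrixCommandGroupKey.Factory.asKey("Queries"));
            this.queryType = queryType;
            this.entityId = entityId;
        }

        @Override
        protected String run() {
            return "result of " + queryType + " for " + entityId;   // stands in for the real query
        }

        @Override
        protected String getCacheKey() {
            // commands with the same key are de-duplicated within one request context
            return queryType + ":" + entityId;
        }

        public static void main(String[] args) {
            HystrixRequestContext context = HystrixRequestContext.initializeContext();
            try {
                new CachedQueryCommand("GetVmById", "42").execute();   // executed
                new CachedQueryCommand("GetVmById", "42").execute();   // served from the request cache
            } finally {
                context.shutdown();
            }
        }
    }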
# Overhead?
In [5] the Netflix employees describe their results regarding the overhead of wrapping every command into a new instance of a hystrix command.
They ran their tests on a standard 4-core Amazon EC2 server with a load of 60 requests per second.
When using threadpools they measured a mean overhead of less than one millisecond (so negligible). At the 90th percentile they measured an overhead of 3 ms. At the 99th percentile of about 9 ms.
This is likely good enough for backend commands and REST entry points (as you currently did), but may need more careful examination if we would want to add this to e.g. thread pool allocations. Don't get slowed down by that though; even for higher level stuff it is a great source of information.
When configuring the hystrix commands to use semaphores instead of threadpools they are even faster.
# How to integrate?
A working implementation can be found on gerrit[3]. These patch sets wrap a hystrix command around every VdcAction, every VdcQuery and every VDSCommand. This just required four small modifications in the code base.
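For readers who have not looked at the patches yet: the wrapping itself is small. Below is a minimal sketch of such a wrapper; only the Hystrix classes are real, the surrounding names are illustrative and the actual code on gerrit [3] may differ.

    import java.util.concurrent.Callable;

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandKey;

    // Generic wrapper that runs an arbitrary backend call as a Hystrix command so
    // that it shows up in the metrics stream. Hystrix records latency and
    // success/failure around the delegate; the wrapped logic itself is unchanged.
    public class MonitoredCall<T> extends HystrixCommand<T> {

        private final Callable<T> delegate;

        public MonitoredCall(String group, String name, Callable<T> delegate) {
            super(Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey(group))
                    .andCommandKey(HystrixCommandKey.Factory.asKey(name)));
            this.delegate = delegate;
        }

        @Override
        protected T run() throws Exception {
            return delegate.call();
        }
    }

    // Illustrative usage inside a command executor:
    //   result = new MonitoredCall<>("Actions", actionType.name(), () -> runInternal(params)).execute();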
# Security?
In the provided patches the hystrix-metrics-servlet is accessible at /ovirt-engine/api/hystrix.stream. It is protected by basic auth but accessible for everyone who can authenticate. We should probably restrict it to admins.
that would be great if it doesn't require too much work. If it does then we can start with enabling/disabling via JMX using Roy's recent patch [8]
The hystrix stream is now accessible at http://<host>/ovirt-engine/services/hystrix.stream and admin privileges are needed. Further, it can be enabled and disabled via JMX (disabled by default). @Juan, @Roy, thank you for your feedback on the code.
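One way such a switch can be exposed over JMX is via the standard platform MBean server. This is only a sketch with a made-up MBean name; the actual wiring in the patches [3] (or in Roy's patch [8]) may look different.

    import java.lang.management.ManagementFactory;
    import javax.management.ObjectName;

    // A simple on/off flag exposed as an MXBean; the command wrapper would check
    // isEnabled() and skip the Hystrix wrapping when monitoring is switched off.
    public class HystrixMonitoring implements HystrixMonitoringMXBean {

        private volatile boolean enabled = false;   // disabled by default

        @Override
        public boolean isEnabled() { return enabled; }

        @Override
        public void setEnabled(boolean enabled) { this.enabled = enabled; }

        // register once at startup, then toggle from jconsole or any JMX client
        public void register() throws Exception {
            ManagementFactory.getPlatformMBeanServer().registerMBean(
                    this, new ObjectName("org.ovirt.engine:type=HystrixMonitoring"));
        }
    }

    interface HystrixMonitoringMXBean {
        boolean isEnabled();
        void setEnabled(boolean enabled);
    }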
# Todo?
1) We do report failed actions with return values. Hystrix expects failing commands to throw an exception. So on the dashboard almost every command looks like a success. To overcome this, it would be pretty easy to throw an exception inside the command and catch it immediately after it leaves the hystrix wrapper.
at the beginning it's probably enough to see what stuff is getting called, without differentiating between success or failure (we mostly do log failures, so hopefully we know when stuff is broken this way)
Ok, I will leave it disabled for now. But it should really be as easy as throwing an exception if the command fails and immediately catching it afterwards (not the nicest looking code, but it would work). And this can be encapsulated in the command executor, so it would not pollute the existing code.
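A rough sketch of that idea; only the Hystrix classes are real, the executor and the result check are placeholders for the engine types.

    import java.util.concurrent.Callable;
    import java.util.function.Predicate;

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.exception.HystrixRuntimeException;

    public final class FailureAwareExecutor {

        // carries the original (failed) return value through the Hystrix machinery
        static final class FailedResultException extends RuntimeException {
            final Object result;
            FailedResultException(Object result) { this.result = result; }
        }

        public static <T> T execute(String group, Callable<T> action, Predicate<T> succeeded) {
            HystrixCommand<T> command = new HystrixCommand<T>(
                    HystrixCommandGroupKey.Factory.asKey(group)) {
                @Override
                protected T run() throws Exception {
                    T result = action.call();
                    if (!succeeded.test(result)) {
                        // Hystrix only counts thrown exceptions as failures
                        throw new FailedResultException(result);
                    }
                    return result;
                }
            };
            try {
                return command.execute();
            } catch (HystrixRuntimeException e) {
                if (e.getCause() instanceof FailedResultException) {
                    // the failure is already recorded, hand the original value back to the caller
                    @SuppressWarnings("unchecked")
                    T original = (T) ((FailedResultException) e.getCause()).result;
                    return original;
                }
                throw e;
            }
        }
    }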
2) Fine tuning: do we want semaphores or a thread pool? And with a thread pool, what size do we want?
To answer this myself: I use semaphores, to be sure that transactions spanning multiple commands are supported properly.
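For completeness, the isolation strategy is a per-command property: semaphore isolation keeps the command on the calling thread (so thread-bound state such as an open transaction stays visible), while thread isolation would hand it to a pool thread. A sketch of selecting it, with the group name and concurrency limit as placeholders:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;
    import com.netflix.hystrix.HystrixCommandProperties.ExecutionIsolationStrategy;

    public class SemaphoreIsolatedCommand extends HystrixCommand<Void> {

        protected SemaphoreIsolatedCommand() {
            super(Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("Actions"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationStrategy(ExecutionIsolationStrategy.SEMAPHORE)
                            // the semaphore only limits concurrency, it does not add a thread hop
                            .withExecutionIsolationSemaphoreMaxConcurrentRequests(500)));
        }

        @Override
        protected Void run() {
            // engine logic would run here, on the caller's thread
            return null;
        }
    }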
3) Three unpackaged dependencies: archaius, hystrix-core, hystrix-contrib
Since you volunteered yesterday to package them I think this should not stop us! :-)
thanks a lot for the effort, I have been missing a proper analysis for soooo long. Thanks for stepping up!
michal
# References
[1] https://github.com/Netflix/Hystrix
[2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
[3] https://gerrit.ovirt.org/#/q/topic:hystrix
[4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
[5] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
[5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
[6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
[7] http://graphite.wikidot.com
[8] https://gerrit.ovirt.org/#/c/29693/

Thanks Roman. Adding Oved for Infra visibility. We have a lot to gain here.

On Thu, Oct 29, 2015 at 3:28 PM, Roman Mohr <rmohr@redhat.com> wrote:
[...]

Hi,

a status update, a request and a question:

[...]
A first implementation can be found on gerrit[3].
The implementation should be almost ready. The new rpms for hystrix are also moving forward. I would need some people who can give me karma on bodhi. See below.

[...]
# How to monitor the engine?
It is as easy as starting a hystrix-dashboard [2] with
$ git clone https://github.com/Netflix/Hystrix.git
$ cd Hystrix/hystrix-dashboard
$ ../gradlew jettyRun
As part of the hystrix rpms there is now also a 'hystrix-dashboard' rpm. Using it is pretty simple: just install it with 'dnf install hystrix-dashboard' and start jetty with 'systemctl start jetty'. Jetty will then listen on port 8080 by default (provided you have told selinux that jetty is allowed to access the network).
# Security?
In the provided patches the hystrix-metrics-servlet is accessible at /ovirt-engine/api/hystrix.stream. It is protected by basic auth but accessible for everyone who can authenticate. We should probably restrict it to admins.
that would be great if it doesn't require too much work. If it does then we can start with enabling/disabling via JMX using Roy's recent patch [8]
Since I had to implement JMX support anyway to enable and disable hystrix (disabled by default), I am wondering if I can remove the authentication part. There is no sensitive data in the hystrix stream and all other services like the db health check are not protected either. It would make it again a little bit easier to use.

[...]
3) Three unpackaged dependencies: archaius, hystrix-core, hystrix-contrib
All required packages will be available in rawhide within the next few hours. All builds on koji succeeded. Also, all packages for f23 were successfully built.
I would appreciate it if some of you could find the time to give these f23 packages some karma:

  archaius-0.7.3-3.fc23 [9] (includes archaius-core and archaius-zookeeper)
  hystrix-1.4.21-4.fc23 [10] (includes hystrix-core, hystrix-metrics-event-stream and hystrix-dashboard)

On el7 I had to package a little bit more and the final hystrix package itself is still missing, but some karma on the first round of packages would be very helpful:

  archaius-0.4.1-1.el7 [11] (includes archaius-core)
  mockito-1.9.0-19.el7 [12]
  assertj-core-2.2.0-2.el7 [13]
  jctools-1.1-0.3.alpha.el7 [14]
  rxjava-1.0.13-2.el7 [15]
Since you volunteered yesterday to package them I think this should not stop us! :-)
thanks a lot for the effort, I have been missing a proper analysis for soooo long. Thanks for stepping up!
michal
# References
[1] https://github.com/Netflix/Hystrix
[2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
[3] https://gerrit.ovirt.org/#/q/topic:hystrix
[4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
[5] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
[5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
[6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
[7] http://graphite.wikidot.com
[9] https://bodhi.fedoraproject.org/updates/FEDORA-2015-3ae4cc39c5
[10] https://bodhi.fedoraproject.org/updates/FEDORA-2015-35994552ed
[11] https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-dd72806724
[12] https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-7bf9b82936
[13] https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-f02466a5da
[14] https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-27b59f8bf2
[15] https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-37400bf69d
Thanks, Roman

On Fri, Dec 11, 2015 at 10:28 AM, Roman Mohr <rmohr@redhat.com> wrote:
[...]
All additional packages for el7 are now also available on testing:

  jackson-core-2.6.3-1.el7 [16]
  hystrix-1.4.21-5.el7 [17]

As always, I am grateful for every karma!
[16] https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-b711a01041
[17] https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2015-d770404d8b
Thanks, Roman
participants (3)
- Doron Fediuck
- Michal Skrivanek
- Roman Mohr