
On 2 Oct 2015, at 12:47, Roman Mohr wrote:

> Hi All,
>
> I have been contributing to the engine for three months now. While I dug
> into the code I started to wonder how to visualize what the engine is
> actually doing.

This is one of the main problems with large applications; anything that
helps to understand what's going on is very welcome.

> To get better insights I added hystrix[1] to the engine. Hystrix is a
> circuit breaker library which was developed by Netflix and has one pretty
> interesting feature: Real time metrics for commands.
>
> In combination with hystrix-dashboard[2] it allows very interesting
> insights. You can easily get an overview of commands involved in
> operations, their performance and complexity. Look at [2] and the
> attachments in [5] and [6] for screenshots to get an impression.
>
> I want to propose to integrate hystrix permanently because from my
> perspective the results were really useful and I also had some good
> experiences with hystrix in past projects.
>
> A first implementation can be found on gerrit[3].
>
> # Where is it immediately useful?
>
> During development and QA.
>
> An example: I tested the hystrix integration on /api/vms and /api/hosts
> rest endpoints and immediately saw that the number of command executions
> grew linearly with the number of vms and hosts. The bug reports [5] and
> [6] are the result.
>
> # How to monitor the engine?
>
> It is as easy as starting a hystrix-dashboard [2] with
>
>   $ git clone https://github.com/Netflix/Hystrix.git
>   $ cd Hystrix/hystrix-dashboard
>   $ ../gradlew jettyRun
>
> and point the dashboard to
>
>   https://<customer.engine.ip>/ovirt-engine/hystrix.stream.
>
> # Other possible benefits?
>
>  * Live metrics at customer site for admins, consultants and support.
>
>  * Historical metrics for analysis in addition to the log files.
>    The metrics information is directly usable in graphite [7]. Therefore
>    it would be possible to collect the json stream for a certain time
>    period and analyze it later like in [4]. To do that someone just has
>    to run
>
>      curl --user admin@internal:engine http://localhost:8080/ovirt-engine/api/hystrix.stream > hystrix.stream
>
>    for as long as necessary. The results can be analyzed later.
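For the "analyze later" step, a minimal sketch of how a captured hystrix.stream file could be summarized per command. It assumes the capture is in the server-sent-events form emitted by the Hystrix metrics stream (lines starting with "data:" followed by one JSON object) and that the objects carry the fields "type", "name", "requestCount" and "latencyExecute_mean". The function name summarize and the shape of its result are made up for illustration and are not part of the patches.

```python
import json


def summarize(lines):
    """Summarize HystrixCommand snapshots per command name.

    `lines` is any iterable of text lines, e.g. an open capture file.
    Field names are assumptions about the stream format, see the note above.
    """
    summary = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and "ping:" keep-alive lines
        try:
            event = json.loads(line[len("data:"):])
        except json.JSONDecodeError:
            continue  # a capture stopped mid-line leaves a partial record
        if not isinstance(event, dict) or event.get("type") != "HystrixCommand":
            continue  # ignore thread pool snapshots and anything else
        name = event.get("name", "<unnamed>")
        entry = summary.setdefault(
            name,
            {"snapshots": 0, "peak_requests": 0, "peak_mean_latency_ms": 0},
        )
        entry["snapshots"] += 1
        entry["peak_requests"] = max(
            entry["peak_requests"], event.get("requestCount", 0)
        )
        entry["peak_mean_latency_ms"] = max(
            entry["peak_mean_latency_ms"], event.get("latencyExecute_mean", 0)
        )
    return summary
```

Calling summarize on the opened capture file and sorting the result by "peak_requests" would show which commands were executed most often during the capture window, which is the kind of overview that exposed the /api/vms and /api/hosts behaviour.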
+1
It's a great idea, and when it is documented well enough that even a BFU
can do it, it would give us a much better idea of what is not working, or
working too slowly, on a system we don't have access to but where the
issue is reproducible. Just ask: "hey, run this thingie while you are
reproducing the issue and send us the result".

> # Possible architectural benefits?
>
> In addition to the live metrics we might also have use for the real
> hystrix features:
>
>  * Circuit Breaker
>  * Bulk execution of commands
>  * De-duplication of commands (Caching)
>  * Synchronous and asynchronous execution support
>  * ..
>
> Our commands do already have a lot of features, so I don't think that
> there are some quick wins, but maybe there are interesting opportunities
> for infra.

Eh.. I would worry about that much later. First we should understand what
we are actually doing and why (as we all know, the engine is likely doing
a lot of useless stuff ;-)
> # Overhead?
>
> In [5] the Netflix employees describe their results regarding the
> overhead of wrapping every command into a new instance of a hystrix
> command.
>
> They ran their tests on a standard 4-core Amazon EC2 server with a load
> of 60 requests per second.
>
> When using threadpools they measured a mean overhead of less than one
> millisecond (so negligible). At the 90th percentile they measured an
> overhead of 3 ms. At the 99th percentile of about 9 ms.

This is likely good enough for backend commands and REST entry points (as
you currently did), but may need more careful examination if we want to
add this to e.g. thread pool allocations.
Don't get slowed down by that though; even for higher level stuff it is a
great source of information.

> When configuring the hystrix commands to use semaphores instead of
> threadpools they are even faster.
>
> # How to integrate?
>
> A working implementation can be found on gerrit[3]. These patch sets
> wrap a hystrix command around every VdcAction, every VdcQuery and every
> VDSCommand. This just required four small modifications in the code
> base.
>
> # Security?
>
> In the provided patches the hystrix-metrics-servlet is accessible at
> /ovirt-engine/api/hystrix.stream. It is protected by basic auth but
> accessible for everyone who can authenticate. We should probably
> restrict it to admins.
That would be great if it doesn't require too much work. If it does, then
we can start with enabling/disabling via JMX using Roy's recent patch [8].

> # Todo?
>
> 1) We do report failed actions with return values. Hystrix expects
> failing commands to throw an exception. So on the dashboard almost every
> command looks like a success. To overcome this, it would be pretty easy
> to throw an exception inside the command and catch it immediately after
> it leaves the hystrix wrapper.
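The throw-and-catch idea from item 1 can be sketched independently of Hystrix. The example below is a toy stand-in in Python, not the Hystrix API and not the code on gerrit: MetricsWrapper, CommandFailed, run_with_metrics and the "succeeded" flag in the return value are all hypothetical names chosen for illustration. It only shows that a wrapper which counts outcomes by exception can be made to see failures while callers still receive the ordinary return value.

```python
class CommandFailed(Exception):
    """Carries a failed return value out of the metrics wrapper."""

    def __init__(self, result):
        super().__init__("command reported failure")
        self.result = result


class MetricsWrapper:
    """Toy stand-in for a metrics wrapper: an exception counts as a failure."""

    def __init__(self):
        self.successes = 0
        self.failures = 0

    def execute(self, action):
        try:
            result = action()
        except Exception:
            self.failures += 1
            raise
        self.successes += 1
        return result


def run_with_metrics(wrapper, command):
    """Run `command` so that a failed return value is counted as a failure.

    `command` is assumed to return a dict with a boolean "succeeded" key.
    """

    def action():
        result = command()
        if not result["succeeded"]:
            # Throw inside the wrapper so the failure becomes visible to it.
            raise CommandFailed(result)
        return result

    try:
        return wrapper.execute(action)
    except CommandFailed as failure:
        # Catch right after leaving the wrapper; callers see no exception.
        return failure.result
```

With this shape the calling code keeps working on return values, and only the thin layer around the wrapper knows about the exception.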
At the beginning it's probably enough to see what stuff is getting called,
without differentiating between success and failure (we mostly do log
failures, so hopefully we know when stuff is broken this way).

> 2) Finetuning
> Do we want semaphores or a thread pool? If a thread pool, what size do
> we want?
>
> 3) Three unpackaged dependencies: archaius, hystrix-core,
> hystrix-contrib

Since you volunteered yesterday to package them, I think this should not
stop us! :-)

Thanks a lot for the effort, I have been missing a proper analysis for
soooo long. Thanks for stepping up!

michal
> # References
>
> [1] https://github.com/Netflix/Hystrix
> [2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
> [3] https://gerrit.ovirt.org/#/q/topic:hystrix
> [4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
> [5] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
> [6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
> [7] http://graphite.wikidot.com

[8] https://gerrit.ovirt.org/#/c/29693/
_______________________________________________ Devel mailing list Devel@ovirt.org http://lists.ovirt.org/mailman/listinfo/devel