Re: [ovirt-devel] Proposal: Hystrix for realtime command monitoring

2 Oct 2015

On 2 Oct 2015, at 12:47, Roman Mohr wrote:

Hi All,

I have been contributing to the engine for three months now. While I dug
into the code I started to wonder how to visualize what the engine is
actually doing.

This is one of the main problems with large applications; anything that helps to understand what's going on is very welcome

To get better insights I added hystrix[1] to the engine. Hystrix is a
circuit breaker library which was developed by Netflix and has one pretty
interesting feature: real-time metrics for commands.

In combination with hystrix-dashboard[2] it allows very interesting
insights. You can easily get an overview of the commands involved in
operations, their performance and complexity. Look at [2] and the
attachments in the bug reports [5] and [6] for screenshots to get an
impression.

I want to propose to integrate hystrix permanently because from my
perspective the results were really useful and I also had some good
experiences with hystrix in past projects.

A first implementation can be found on gerrit[3].

# Where is it immediately useful?

During development and QA.

An example: I tested the hystrix integration on the /api/vms and
/api/hosts REST endpoints and immediately saw that the number of command
executions grew linearly with the number of VMs and hosts. The bug
reports [5] and [6] are the result.

# How to monitor the engine?

It is as easy as starting a hystrix-dashboard [2] with

  $ git clone https://github.com/Netflix/Hystrix.git
  $ cd Hystrix/hystrix-dashboard
  $ ../gradlew jettyRun

and pointing the dashboard to

   https://<customer.engine.ip>/ovirt-engine/api/hystrix.stream

# Other possible benefits?

 * Live metrics at customer site for admins, consultants and support.

 * Historical metrics for analysis in addition to the log files.
   The metrics information is directly usable in graphite [7]. Therefore
   it would be possible to collect the JSON stream for a certain time
   period and analyze it later like in [4]. To do that someone just has
   to run

      curl --user admin@internal:engine http://localhost:8080/ovirt-engine/api/hystrix.stream > hystrix.stream

   for as long as necessary. The results can be analyzed later.

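A capture made with the curl command above could be summarized with a few lines of Python. This is only a sketch under assumptions: it assumes the servlet emits server-sent events whose `data:` lines each carry one JSON object with `type`, `name` and `requestCount` fields (the layout hystrix-metrics-event-stream is generally known to use), and `summarize_stream` is a made-up helper name. The field names should be checked against a real capture.

```python
import json
from collections import defaultdict


def summarize_stream(lines):
    """Return {command name: highest requestCount seen} for a captured stream."""
    peak = defaultdict(int)
    for line in lines:
        line = line.strip()
        # Payload lines start with "data:"; pings and blank lines are skipped.
        if not line.startswith("data:"):
            continue
        try:
            event = json.loads(line[len("data:"):])
        except ValueError:
            continue  # e.g. a last line cut off when curl was interrupted
        # Assumed field names: "type", "name", "requestCount".
        if not isinstance(event, dict) or event.get("type") != "HystrixCommand":
            continue  # ignore thread-pool events and anything unexpected
        name = event.get("name", "?")
        peak[name] = max(peak[name], int(event.get("requestCount", 0)))
    return dict(peak)
```

Calling `summarize_stream(open("hystrix.stream"))` and sorting the result by count would give a quick list of the busiest commands during the capture. As far as I understand the stream, `requestCount` is a rolling-window counter, so the peak is a rough indicator and not a total.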
+1
it's a great idea, and when properly documented so that even a BFU can do it, it would allow us to get a much better idea when something is not working or working too slowly on a system we don't have access to, but it's reproducible elsewhere. Just ask for "hey, run this thingie while you are reproducing the issue and send us the result"

# Possible architectural benefits?

In addition to the live metrics we might also have use for the real
hystrix features:

 * Circuit Breaker
 * Bulk execution of commands
 * De-duplication of commands (Caching)
 * Synchronous and asynchronous execution support
 * ..

Our commands do already have a lot of features, so I don't think that
there are some quick wins, but maybe there are interesting opportunities
for infra.

eh.. I would worry about that much later. First we should understand what we are actually doing and why (as we all know the engine is likely doing a lot of useless stuff ;-)

# Overhead?

In the Hystrix FAQ [5] the Netflix employees describe their results
regarding the overhead of wrapping every command into a new instance of
a hystrix command.

They ran their tests on a standard 4-core Amazon EC2 server with a load
of 60 requests per second.

When using thread pools they measured a mean overhead of less than one
millisecond (so negligible). At the 90th percentile they measured an
overhead of 3 ms, at the 99th percentile about 9 ms.

This is likely good enough for backend commands and REST entry points (as you currently did), but may need more careful examination if we would want to add this to e.g. thread pool allocations
Don't get slowed down by that though, even for higher level stuff it is a great source of information

When configuring the hystrix commands to use semaphores instead of
thread pools they are even faster.

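Why semaphores come out faster can be seen in a small sketch of the two isolation styles. This is not Hystrix code: the two functions and the `Rejected` exception are hypothetical Python stand-ins that only mirror the general idea (a semaphore caps concurrency on the caller's own thread, a pool hands the work to another thread).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Rejected(Exception):
    """Raised when the concurrency limit is already reached (hypothetical)."""


def run_with_semaphore(sem, fn):
    # Semaphore style: fn runs on the caller's own thread; the semaphore
    # only limits how many callers may be inside fn at the same time.
    if not sem.acquire(blocking=False):
        raise Rejected("too many concurrent executions")
    try:
        return fn()
    finally:
        sem.release()


def run_with_thread_pool(pool, fn, timeout_s):
    # Thread-pool style: fn is handed to a worker thread. That costs a
    # hand-off, but the caller can stop waiting after timeout_s seconds.
    return pool.submit(fn).result(timeout=timeout_s)
```

With `sem = threading.BoundedSemaphore(10)` and `pool = ThreadPoolExecutor(max_workers=10)`, the first variant adds little more than an acquire and a release around the call, while the second pays for the thread hand-off on every execution.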
# How to integrate?

A working implementation can be found on gerrit[3]. These patch sets wrap
a hystrix command around every VdcAction, every VdcQuery and every
VDSCommand. This just required four small modifications in the code base.

# Security?

In the provided patches the hystrix-metrics-servlet is accessible at
/ovirt-engine/api/hystrix.stream. It is protected by basic auth but
accessible for everyone who can authenticate. We should probably restrict
it to admins.

that would be great if it doesn't require too much work. If it does then we can start with enabling/disabling via JMX using Roy's recent patch [8]

# Todo?

1) We do report failed actions with return values. Hystrix expects failing
commands to throw an exception. So on the dashboard almost every command
looks like a success. To overcome this, it would be pretty easy to throw an
exception inside the command and catch it immediately after it leaves the
hystrix wrapper.

at the beginning it's probably enough to see what stuff is getting called, without differentiating between success or failure (we mostly do log failures, so hopefully we know when stuff is broken this way)

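The exception-based workaround from item 1 could look roughly like the following sketch. It is an illustration in Python and not the patch on gerrit: `CountingWrapper` is a hypothetical stand-in for the hystrix wrapper (it counts an execution as failed only when it raises), and `CommandFailed`, `run_command` and the `succeeded` flag are made-up names.

```python
class CommandFailed(Exception):
    """Carries a failed return value out through the metrics wrapper."""

    def __init__(self, result):
        super().__init__("command reported failure")
        self.result = result


class CountingWrapper:
    """Hypothetical stand-in for the hystrix wrapper: only exceptions count as failures."""

    def __init__(self):
        self.successes = 0
        self.failures = 0

    def execute(self, fn):
        try:
            value = fn()
        except Exception:
            self.failures += 1
            raise
        self.successes += 1
        return value


def run_command(wrapper, command):
    """Run command() and return its result dict, whether it succeeded or not."""

    def raise_on_failure():
        result = command()
        if not result["succeeded"]:
            raise CommandFailed(result)  # make the failure visible to the wrapper
        return result

    try:
        return wrapper.execute(raise_on_failure)
    except CommandFailed as failed:
        return failed.result  # hand the original return value back unchanged
```

Callers keep receiving return values as before, while the wrapper now sees failures. In Hystrix itself an exception thrown inside a command normally reaches the caller wrapped in a HystrixRuntimeException, so the catching side would have to unwrap the cause; that detail is left out here.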
2) Fine-tuning
Do we want semaphores or a thread pool? If a thread pool, what size do
we want?

3) Three unpackaged dependencies: archaius, hystrix-core, hystrix-contrib

Since you yesterday volunteered to package them I think this should not stop us! :-)

thanks a lot for the effort, I have been missing a proper analysis for soooo long. Thanks for stepping up!

michal

# References

[1] https://github.com/Netflix/Hystrix
[2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
[3] https://gerrit.ovirt.org/#/q/topic:hystrix
[4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
[5] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
[5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
[6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
[7] http://graphite.wikidot.com
[8] https://gerrit.ovirt.org/#/c/29693/
Michal Skrivanek