[ovirt-devel] Proposal: Hystrix for realtime command monitoring

Doron Fediuck dfediuck at redhat.com
Thu Oct 29 14:28:25 UTC 2015


Thanks Roman.
Adding Oved for Infra visibility. We have a lot to gain here.

On Thu, Oct 29, 2015 at 3:28 PM, Roman Mohr <rmohr at redhat.com> wrote:

>
>
> On Fri, Oct 2, 2015 at 4:24 PM, Michal Skrivanek <
> michal.skrivanek at redhat.com> wrote:
>
>>
>> On 2 Oct 2015, at 12:47, Roman Mohr wrote:
>>
>> Hi All,
>>
>> I have been contributing to the engine for three months now. While digging
>> into the code I started to wonder how to visualize what the engine is
>> actually doing.
>>
>>
>> This is one of the main problems with large applications; anything that
>> helps us understand what's going on is very welcome
>>
>>
>> To get better insights I added Hystrix [1] to the engine. Hystrix is a
>> circuit breaker library developed by Netflix, and it has one pretty
>> interesting feature: real-time metrics for commands.
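>>
>> For readers unfamiliar with Hystrix, here is a minimal sketch of such a
>> wrapped command (the class and group names are made up for illustration,
>> not taken from the patches):
>>
>>   import com.netflix.hystrix.HystrixCommand;
>>   import com.netflix.hystrix.HystrixCommandGroupKey;
>>
>>   // Any piece of work wrapped like this shows up in the
>>   // real-time metrics stream under its command key.
>>   public class ListVmsCommand extends HystrixCommand<String> {
>>
>>       public ListVmsCommand() {
>>           super(HystrixCommandGroupKey.Factory.asKey("VdcQuery"));
>>       }
>>
>>       @Override
>>       protected String run() {
>>           return "vm-list"; // the actual work to be measured goes here
>>       }
>>   }
>>
>>   // usage:
>>   //   new ListVmsCommand().execute(); // synchronous
>>   //   new ListVmsCommand().queue();   // asynchronous Future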
>>
>> In combination with the hystrix-dashboard [2] it allows very interesting
>> insights. You can easily get an overview of the commands involved in
>> operations, their performance and complexity. Look at [2] and the
>> attachments in [5] and [6] for screenshots to get an impression.
>>
>> I want to propose integrating Hystrix permanently, because from my
>> perspective the results were really useful, and I have also had good
>> experiences with Hystrix in past projects.
>>
>> A first implementation can be found on gerrit[3].
>>
>> # Where is it immediately useful?
>>
>> During development and QA.
>>
>> An example: I tested the hystrix integration on the /api/vms and /api/hosts
>> REST endpoints and immediately saw that the number of command executions
>> grew linearly with the number of VMs and hosts. The bug reports [5] and [6]
>> are the result.
>>
>> # How to monitor the engine?
>>
>> It is as easy as starting a hystrix-dashboard [2] with
>>
>>   $ git clone https://github.com/Netflix/Hystrix.git
>>   $ cd Hystrix/hystrix-dashboard
>>   $ ../gradlew jettyRun
>>
>> and point the dashboard to
>>
>>    https://<customer.engine.ip>/ovirt-engine/hystrix.stream
>>
>> # Other possible benefits?
>>
>>  * Live metrics at customer site for admins, consultants and support.
>>
>>  * Historical metrics for analysis in addition to the log files.
>>    The metrics information is directly usable in graphite [7]. It would
>>    therefore be possible to collect the JSON stream for a certain time
>>    period and analyze it later, as in [4]. To do that, someone just has
>>    to run
>>
>>       curl --user admin@internal:engine http://localhost:8080/ovirt-engine/api/hystrix.stream > hystrix.stream
>>
>>    for as long as necessary. The results can be analyzed later.
>>
>>
>> +1
>> it's a great idea, and when properly documented so that even a BFU can do
>> it, it would allow us to get a much better idea of when something is not
>> working or working too slowly on a system we don't have access to, but
>> where it's reproducible elsewhere. Just ask: "hey, run this thingie while
>> you are reproducing the issue and send us the result"
>>
>>
>> # Possible architectural benefits?
>>
>> In addition to the live metrics we might also have use for the real
>> hystrix features:
>>
>>  * Circuit Breaker
>>  * Bulk execution of commands
>>  * De-duplication of commands (caching; see the sketch below)
>>  * Synchronous and asynchronous execution support
>>  * ..
>>
>> Our commands already have a lot of features, so I don't think there are
>> any quick wins, but maybe there are interesting opportunities for infra.
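>>
>> As an illustration of the de-duplication item above, a minimal sketch of
>> Hystrix request caching (the names are hypothetical, not from the patches):
>>
>>   import com.netflix.hystrix.HystrixCommand;
>>   import com.netflix.hystrix.HystrixCommandGroupKey;
>>   import com.netflix.hystrix.strategy.concurrency.HystrixRequestContext;
>>
>>   // Commands with equal cache keys run only once per request context;
>>   // further executions are served from the request cache.
>>   public class GetVmCommand extends HystrixCommand<String> {
>>       private final String vmId;
>>
>>       public GetVmCommand(String vmId) {
>>           super(HystrixCommandGroupKey.Factory.asKey("VdcQuery"));
>>           this.vmId = vmId;
>>       }
>>
>>       @Override
>>       protected String run() {
>>           return "vm-" + vmId; // the real lookup would go here
>>       }
>>
>>       @Override
>>       protected String getCacheKey() {
>>           return vmId;
>>       }
>>   }
>>
>>   // usage (a request context must be active for caching to work):
>>   //   HystrixRequestContext ctx = HystrixRequestContext.initializeContext();
>>   //   try {
>>   //       new GetVmCommand("42").execute(); // executes run()
>>   //       new GetVmCommand("42").execute(); // served from the cache
>>   //   } finally {
>>   //       ctx.shutdown();
>>   //   }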
>>
>>
>> eh.. I would worry about that much later. First we should understand what
>> we are actually doing and why (as we all know, the engine is likely doing
>> a lot of useless stuff ;-)
>>
>>
>> # Overhead?
>>
>> In [9] the Netflix employees describe their results regarding the overhead
>> of wrapping every command into a new instance of a hystrix command.
>>
>> They ran their tests on a standard 4-core Amazon EC2 server with a load of
>> 60 requests per second.
>>
>> When using thread pools they measured a mean overhead of less than one
>> millisecond (so negligible). At the 90th percentile they measured an
>> overhead of 3 ms, and at the 99th percentile one of about 9 ms.
>>
>>
>> This is likely good enough for backend commands and REST entry points (as
>> you currently did), but it may need more careful examination if we wanted
>> to add this to e.g. thread pool allocations.
>> Don't get slowed down by that though; even for higher-level stuff it is a
>> great source of information
>>
>>
>> When the hystrix commands are configured to use semaphores instead of
>> thread pools, they are even faster.
>>
>> # How to integrate?
>>
>> A working implementation can be found on gerrit [3]. These patch sets wrap
>> a hystrix command around every VdcAction, every VdcQuery and every
>> VDSCommand. This required just four small modifications in the code base.
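>>
>> Conceptually (this is a simplified sketch, not the actual patch code), the
>> integration boils down to a single generic wrapper:
>>
>>   import com.netflix.hystrix.HystrixCommand;
>>   import com.netflix.hystrix.HystrixCommandGroupKey;
>>   import com.netflix.hystrix.HystrixCommandKey;
>>   import java.util.function.Supplier;
>>
>>   // Runs any engine call as a named HystrixCommand, so that
>>   // VdcActions, VdcQueries and VDSCommands all get metrics.
>>   public final class HystrixWrapper {
>>       public static <T> T wrap(String name, Supplier<T> call) {
>>           return new HystrixCommand<T>(
>>                   HystrixCommand.Setter
>>                       .withGroupKey(HystrixCommandGroupKey.Factory.asKey("Backend"))
>>                       .andCommandKey(HystrixCommandKey.Factory.asKey(name))) {
>>               @Override
>>               protected T run() {
>>                   return call.get(); // the wrapped action/query/command
>>               }
>>           }.execute();
>>       }
>>   }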
>>
>> # Security?
>>
>> In the provided patches the hystrix-metrics-servlet is accessible at
>> /ovirt-engine/api/hystrix.stream. It is protected by basic auth, but it is
>> accessible to everyone who can authenticate. We should probably restrict
>> it to admins.
>>
>>
>> that would be great if it doesn't require too much work. If it does then
>> we can start with enabling/disabling via JMX using Roy's recent patch [8]
>>
>>
> The hystrix stream is now accessible at
> http://<host>/ovirt-engine/services/hystrix.stream, and admin privileges
> are needed.
> Furthermore, it can be enabled and disabled via JMX (disabled by default).
> @Juan, @Roy, thank you for your feedback on the code.
>
>>
>> # Todo?
>>
>> 1) We report failed actions with return values, whereas Hystrix expects
>> failing commands to throw an exception. So on the dashboard almost every
>> command looks like a success. To overcome this, it would be pretty easy to
>> throw an exception inside the command and catch it immediately after it
>> leaves the hystrix wrapper.
>>
>>
>> at the beginning it's probably enough to see what stuff is getting called,
>> without differentiating between success and failure (we mostly do log
>> failures, so hopefully we know when stuff is broken this way)
>>
>>
> Ok, I will leave it disabled for now. But it should really be as easy as
> throwing an exception if the command fails and immediately catching it
> afterwards (not the nicest-looking code, but it would work). And this can
> be encapsulated in the command executor, so it would not pollute the
> existing code.
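>
> A rough sketch of that encapsulation (the class names are hypothetical):
>
>   import com.netflix.hystrix.HystrixCommand;
>   import com.netflix.hystrix.exception.HystrixRuntimeException;
>
>   // Thrown inside the hystrix wrapper when the engine command reports
>   // failure via its return value, so hystrix counts it as a failure.
>   class CommandFailedException extends RuntimeException {
>       final Object returnValue;
>       CommandFailedException(Object returnValue) {
>           this.returnValue = returnValue;
>       }
>   }
>
>   class CommandExecutor {
>       Object execute(HystrixCommand<Object> command) {
>           try {
>               return command.execute();
>           } catch (HystrixRuntimeException e) {
>               // caught immediately after leaving the wrapper: restore the
>               // original return value for the existing code paths
>               if (e.getCause() instanceof CommandFailedException) {
>                   return ((CommandFailedException) e.getCause()).returnValue;
>               }
>               throw e;
>           }
>       }
>   }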
>
>
>>
>> 2) Finetuning
>> Do we want semaphores or a thread pool? And if a thread pool, what size
>> do we want?
>>
> To answer this myself: I use semaphores, to be sure that transactions
> spanning multiple commands are supported properly.
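>
> A sketch of what that configuration looks like, assuming the Setter API of
> hystrix-core (the group key is made up). Semaphore isolation executes run()
> on the calling thread, so thread-bound state like transactions stays intact:
>
>   import com.netflix.hystrix.HystrixCommand;
>   import com.netflix.hystrix.HystrixCommandGroupKey;
>   import com.netflix.hystrix.HystrixCommandProperties;
>   import com.netflix.hystrix.HystrixCommandProperties.ExecutionIsolationStrategy;
>
>   class IsolationConfig {
>       // pass this Setter to the HystrixCommand constructor
>       static final HystrixCommand.Setter SEMAPHORE_SETTER = HystrixCommand.Setter
>               .withGroupKey(HystrixCommandGroupKey.Factory.asKey("VdcAction"))
>               .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
>                       .withExecutionIsolationStrategy(
>                               ExecutionIsolationStrategy.SEMAPHORE));
>   }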
>
>> 3) Three unpackaged dependencies: archaius, hystrix-core, hystrix-contrib
>>
>>
>> Since you volunteered yesterday to package them, I think this should not
>> stop us! :-)
>>
>> Thanks a lot for the effort, I have missed a proper analysis for soooo
>> long. Thanks for stepping up!
>>
>> michal
>>
>>
>> # References
>>
>> [1] https://github.com/Netflix/Hystrix
>> [2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
>> [3] https://gerrit.ovirt.org/#/q/topic:hystrix
>> [4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
>> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
>> [6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
>> [7] http://graphite.wikidot.com
>> [8] https://gerrit.ovirt.org/#/c/29693/
>> [9] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
>>
>>
>>
>>
>
>