Hi All,
I have been contributing to the engine for three months now. While digging into
the code I started to wonder how to visualize what the engine is actually doing.
To get better insight I added hystrix[1] to the engine. Hystrix is a circuit
breaker library developed by Netflix, and it has one pretty interesting
feature: real-time metrics for commands.
In combination with hystrix-dashboard[2] it gives a quick overview of the
commands involved in an operation, their performance and their complexity.
Look at [2] and the attachments of the bug reports [5] and [6] for screenshots
to get an impression.
I want to propose integrating hystrix permanently, because from my perspective
the results were really useful and I have also had good experiences with
hystrix in past projects.
A first implementation can be found on gerrit[3].
# Where is it immediately useful?
During development and QA.
An example: I tested the hystrix integration on the /api/vms and /api/hosts REST
endpoints and immediately saw that the number of command executions grew
linearly with the number of vms and hosts. The bug reports [5] and [6] are the
result.
# How to monitor the engine?
It is as easy as starting a hystrix-dashboard [2] with
$ git clone https://github.com/Netflix/Hystrix.git
$ cd Hystrix/hystrix-dashboard
$ ../gradlew jettyRun
and pointing the dashboard to
https://<customer.engine.ip>/ovirt-engine/api/hystrix.stream.
# Other possible benefits?
* Live metrics at customer site for admins, consultants and support.
* Historical metrics for analysis in addition to the log files.
The metrics information is directly usable in graphite [7]. Therefore it would be
possible to collect the JSON stream for a certain time period and analyze it
later, as in [4]. To do that, someone just has to run
$ curl --user admin@internal:engine http://localhost:8080/ovirt-engine/api/hystrix.stream > hystrix.stream
for as long as necessary and analyze the resulting file afterwards.
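
As a rough illustration of such an offline analysis, the following sketch reads
a captured hystrix.stream file and prints the highest mean execution latency
seen per command. It assumes the stream consists of server-sent-event lines of
the form `data: {...}` whose JSON carries "type", "name" and
"latencyExecute_mean" fields; the class name and the regex-based parsing are
made up for this example, and a real tool would use a proper JSON parser.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, not part of the patches on gerrit.
class HystrixStreamSummary {
    // Assumed field names of the JSON events in the stream.
    private static final Pattern TYPE =
            Pattern.compile("\"type\"\\s*:\\s*\"HystrixCommand\"");
    private static final Pattern NAME =
            Pattern.compile("\"name\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern MEAN =
            Pattern.compile("\"latencyExecute_mean\"\\s*:\\s*(\\d+)");

    /** Returns, per command name, the highest latencyExecute_mean in the capture. */
    static Map<String, Long> maxMeanLatency(Stream<String> lines) {
        Map<String, Long> result = new TreeMap<>();
        lines.filter(line -> line.startsWith("data:"))   // skip "ping:" and blank lines
             .forEach(line -> {
                 Matcher name = NAME.matcher(line);
                 Matcher mean = MEAN.matcher(line);
                 // Only command events are counted, thread pool events are ignored.
                 if (TYPE.matcher(line).find() && name.find() && mean.find()) {
                     result.merge(name.group(1), Long.parseLong(mean.group(1)), Math::max);
                 }
             });
        return result;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "hystrix.stream";
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            maxMeanLatency(lines).forEach((cmd, ms) -> System.out.println(cmd + " " + ms + " ms"));
        }
    }
}
```

Running it against the file written by the curl command above gives one line
per command, which is enough to spot the slow ones before feeding the data
into something like graphite.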
# Possible architectural benefits?
In addition to the live metrics we might also have use for the real hystrix features:
* Circuit Breaker
* Bulk execution of commands
* De-duplication of commands (Caching)
* Synchronous and asynchronous execution support
* ...
Our commands already have a lot of features, so I don't think there are any
quick wins, but maybe there are interesting opportunities for infra.
# Overhead?
In the Hystrix FAQ [5] the Netflix employees describe their results regarding
the overhead of wrapping every command into a new instance of a hystrix command.
They ran their tests on a standard 4-core Amazon EC2 server with a load of 60
requests per second.
When using thread pools they measured a mean overhead of less than one
millisecond (so negligible), an overhead of 3 ms at the 90th percentile and
about 9 ms at the 99th percentile.
When the hystrix commands are configured to use semaphores instead of thread
pools, the overhead is even lower.
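
To make the difference between the two isolation modes concrete, here is a
minimal sketch using only java.util.concurrent. It is not hystrix code and the
class and method names are invented; it only mirrors the idea that semaphore
isolation runs the work on the calling thread with a cap on concurrent
executions, while thread pool isolation hands the work to another thread and
waits for it with a timeout. The extra thread hand-off is where the additional
overhead comes from.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustration only: a simplified model of the two isolation strategies.
class IsolationSketch {

    /** Semaphore style: run on the caller's thread, reject if no permit is free. */
    static <T> T withSemaphore(Semaphore permits, Supplier<T> work) {
        if (!permits.tryAcquire()) {
            throw new RejectedExecutionException("too many concurrent executions");
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }

    /** Thread pool style: hand off to a pool thread and wait with a timeout. */
    static <T> T withThreadPool(ExecutorService pool, Supplier<T> work, long timeoutMs)
            throws InterruptedException, ExecutionException, TimeoutException {
        Callable<T> task = work::get;
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // try to interrupt work that ran too long
            throw e;
        }
    }
}
```

The semaphore variant cannot enforce a timeout on the work itself, which is
the usual trade-off to keep in mind when choosing between the two.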
# How to integrate?
A working implementation can be found on gerrit[3]. These patch sets wrap a
hystrix command around every VdcAction, every VdcQuery and every VDSCommand.
This just required four small modifications in the code base.
# Security?
In the provided patches the hystrix-metrics-servlet is accessible at
/ovirt-engine/api/hystrix.stream. It is protected by basic auth but accessible
to everyone who can authenticate. We should probably restrict it to admins.
# Todo?
1) We report failed actions through return values, whereas hystrix expects
failing commands to throw an exception. So on the dashboard almost every
command looks like a success. To overcome this, it would be pretty easy to
throw an exception inside the command and catch it immediately after it
leaves the hystrix wrapper.
2) Fine-tuning
Do we want semaphores or a thread pool? If a thread pool, what size do we want?
3) Three unpackaged dependencies: archaius, hystrix-core, hystrix-contrib
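
The idea from item 1 could look roughly like the sketch below. It does not use
hystrix itself: CountingWrapper is a made-up stand-in that, like hystrix,
counts an execution as failed only if it throws, and ReturnValue is a
hypothetical placeholder for the engine's return value types. With real
hystrix the exception would, as far as I know, arrive wrapped in a
HystrixRuntimeException, so the catch block would have to unwrap the cause.

```java
import java.util.function.Supplier;

// Illustration only: all types here are invented for this sketch.
class FailureSignalSketch {

    /** Placeholder for an engine return value carrying a success flag. */
    static final class ReturnValue {
        final boolean succeeded;
        ReturnValue(boolean succeeded) { this.succeeded = succeeded; }
    }

    /** Carries a failed return value out of the wrapper. */
    static final class FailedResult extends RuntimeException {
        final ReturnValue value;
        FailedResult(ReturnValue value) {
            super("command reported failure");
            this.value = value;
        }
    }

    /** Stand-in for the wrapper: an execution counts as failed only if it throws. */
    static final class CountingWrapper {
        int successes;
        int failures;

        <T> T execute(Supplier<T> body) {
            try {
                T result = body.get();
                successes++;
                return result;
            } catch (RuntimeException e) {
                failures++;
                throw e;
            }
        }
    }

    /** Runs the command inside the wrapper but still hands back a return value. */
    static ReturnValue run(CountingWrapper wrapper, Supplier<ReturnValue> command) {
        try {
            return wrapper.execute(() -> {
                ReturnValue value = command.get();
                if (!value.succeeded) {
                    throw new FailedResult(value);   // make the failure visible to the wrapper
                }
                return value;
            });
        } catch (FailedResult e) {
            return e.value;                          // callers keep seeing a return value
        }
    }
}
```

Callers are unaffected because they still receive the return value, while the
wrapper (and therefore the dashboard) sees the failure.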
# References
[1] https://github.com/Netflix/Hystrix
[2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
[3] https://gerrit.ovirt.org/#/q/topic:hystrix
[4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
[5] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
[5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
[6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
[7] http://graphite.wikidot.com