[ovirt-devel] [vdsm] exploring a possible integration between collectd and Vdsm

Yaniv Kaul ykaul at redhat.com
Tue Oct 11 20:31:14 UTC 2016


On Tue, Oct 11, 2016 at 2:05 PM, Francesco Romani <fromani at redhat.com>
wrote:

> Hi all,
>
> Over the last 2.5 days I have been exploring whether and how we can
> integrate collectd and Vdsm.
>
> The final picture could look like:
> 1. collectd does all the monitoring and reporting that Vdsm currently does
> 2. Engine consumes data from collectd
> 3. Vdsm consumes *notifications* from collectd - for a few but important
> tasks like drive high water mark monitoring
>
> Benefits (aka: why to bother?):
> 1. less code in Vdsm / long-awaited modularization of Vdsm
> 2. better integration with the system, reuse of well-known components
> 3. more flexibility in monitoring/reporting: collectd is an existing,
> special-purpose solution
> 4. faster, more scalable operation because all the monitoring can be done
> in C
>
> At first glance, Collectd seems to have all the tools we need.
> 1. A plugin interface
> (https://collectd.org/wiki/index.php/Plugin_architecture and
> https://collectd.org/wiki/index.php/Table_of_Plugins)
> 2. Support for notifications and thresholds
> (https://collectd.org/wiki/index.php/Notifications_and_thresholds)
> 3. A libvirt plugin (https://collectd.org/wiki/index.php/Plugin:virt)
>
> So, the picture is like
>
> 1. we start requiring collectd as a dependency of Vdsm
> 2. we either configure it appropriately (collectd supports config drop-ins:
> /etc/collectd.d) or we document our requirements (or both)
> 3. collectd monitors the hosts and libvirt
> 4. Engine polls collectd
> 5. Vdsm listens for notifications
>
> Should libvirt deliver the event we need (see
> https://bugzilla.redhat.com/show_bug.cgi?id=1181659),
> we can just stop using collectd notifications; everything else keeps
> working as before.
>
> Challenges:
> 1. Collectd does NOT consider the plugin API stable
> (https://collectd.org/wiki/index.php/Plugin_architecture#The_interface.27s_stability),
>    so plugins should be included in the main tree, much like the modules
>    of the Linux kernel.
>    It is worth mentioning that the plugin API itself has a good deal of
>    rough edges.
>    We will need to maintain this plugin ourselves, *and* we need to
>    maintain our thin API layer, to make sure the plugin loads and works
>    with recent versions of collectd.
> 2. The virt plugin is out of date and doesn't report some of the data we
>    need: see https://github.com/collectd/collectd/issues/1945
> 3. The notification messages are tailored for human consumption; they are
>    not easy for machines to parse.
> 4. The threshold support in collectd seems to match values against
>    constants; it doesn't seem possible to match one value against another,
>    as we need to do for high water mark monitoring (capacity vs.
>    allocation).
>
> How I'm addressing, or plan to address, those challenges (aka action
> items):
> 1. I've been experimenting with out-of-tree plugins, and I managed to
>    develop, build, install and run one out-of-tree plugin:
>    https://github.com/mojaves/vmon/tree/master/collectd
>    The development pace of collectd looks sustainable, so this doesn't
>    look like such a big deal.
>    Furthermore, we can engage with upstream to merge our plugins, either
>    as-is or by extending existing ones.
> 2. Write another collectd plugin based on the Vdsm Python code and/or my
>    past accelerator executable project
>    (https://github.com/mojaves/vmon)
> 3. Patch the collectd notification code. It is yet another plugin
>    OR
> 4. Send notifications from the new virt module as per #2, bypassing the
>    threshold system. This move could prevent the new virt module from
>    being merged into the collectd tree.
>
> Current status of the action items:
> 1. done BUT PoC quality
> 2. To be done (more work than #1/possible dupe with github issue)
> 3. needs more investigation, conflicts with #4
> 4. needs more investigation, conflicts with #3
>
> All the code I'm working on can be found at
> https://github.com/mojaves/vmon
>
> Comments are appreciated
>
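
To make challenge 4 in the quoted message concrete: the high water mark
check compares two sampled values (allocation against capacity), not one
value against a constant. Below is a minimal sketch of such a check. The
names `DriveStats`, `needs_extension` and `free_ratio`, and the 50% default,
are hypothetical and chosen for illustration only; this is not the actual
Vdsm logic nor anything collectd provides.

```python
from dataclasses import dataclass


@dataclass
class DriveStats:
    """Hypothetical per-drive sample; field names are illustrative only."""
    name: str
    capacity: int    # size of the underlying volume, in bytes
    allocation: int  # highest offset written so far, in bytes


def needs_extension(stats, free_ratio=0.5):
    """Return True when the free space drops below a fraction of capacity.

    The threshold is derived from another sampled value (capacity), which
    is the comparison a constant-only threshold cannot express.
    """
    free = stats.capacity - stats.allocation
    return free < stats.capacity * free_ratio


if __name__ == "__main__":
    for drive in (DriveStats("vda", 1000, 400), DriveStats("vdb", 1000, 600)):
        print(drive.name, needs_extension(drive))
```

A check of this shape would have to live either in a custom plugin or in the
consumer of the samples, which is the trade-off between action items 3 and 4.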

This generally sounds like a good idea - and I hope it is coordinated with
our efforts for monitoring (see [1], [2]).
Note that ages ago, ovirt-node actually had it already[3].

A few notes:
- I think the most compelling reason to move to collectd is actually to
benefit from the plugins it already has, which will cover a lot of the
missing monitoring requirements and wishes we have (example: local disk
usage on the host), as well as to integrate it into Engine monitoring
(example: postgresql performance monitoring).
- You can't remove monitoring from VDSM outright, as a new VDSM may work
against older Engine setups. You can remove it gradually.
I'd actually begin with cleanup - there are some 'metrics' that are simply
not needed, should not be reported in the first place, and are there for
historic reasons only. Remove them from Engine first (from the DB and all);
later we can either send fake values or remove them from VDSM.
- If you are moving to collectd, as you can see from the metrics effort,
we'd really want to send the data elsewhere - and Engine should consume it
from there.
Metrics stores usually have a very nice REST interface with the ability to
fetch series with averages, by different criteria (such as per-hour or
per-minute stats), etc.
- I agree with Nir about separating our core business from the extra
monitoring we do. Keep in mind that some of the stats are used for SLA
and critical scheduling decisions as well.
- The libvirt collectd plugin is REALLY outdated. I think it may require
significant work to bring it up to speed with our existing capabilities.
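
As an illustration of the "series with averages per hour or per minute" idea
in the notes above, here is a minimal sketch of the kind of aggregation a
metrics store would typically do on the server side. The function name
`bucket_averages` and the (timestamp, value) sample format are assumptions
made for this example; it is not the API of any particular metrics store.

```python
from collections import defaultdict


def bucket_averages(samples, bucket_seconds):
    """Average (timestamp, value) samples per fixed-size time bucket.

    Returns a list of (bucket_start, average) pairs sorted by bucket start.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the bucket it falls into.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(values) / len(values))
            for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    # Hypothetical samples: (seconds since some epoch, metric value)
    samples = [(0, 10.0), (30, 20.0), (60, 30.0), (3600, 50.0)]
    print(bucket_averages(samples, 60))    # per-minute averages
    print(bucket_averages(samples, 3600))  # per-hour averages
```

With the store doing this, Engine would only need to ask for the resolution
it wants instead of keeping and aggregating raw samples itself.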
Y.


[1] https://sradcoblog.wordpress.com/2016/07/19/ovirt-metrics-elk/
[2] https://bronhaim.wordpress.com/2016/06/26/ovirt-metrics/
[3] https://github.com/oVirt/ovirt-node/blob/master/scripts/collectd.conf.in


> --
> Francesco Romani
> RedHat Engineering Virtualization R & D
> Phone: 8261328
> IRC: fromani