----- Original Message -----
From: "Yaniv Kaul" <ykaul(a)redhat.com>
To: "Francesco Romani" <fromani(a)redhat.com>
Cc: "devel" <devel(a)ovirt.org>
Sent: Tuesday, October 11, 2016 10:31:14 PM
Subject: Re: [ovirt-devel] [vdsm] exploring a possible integration between collectd and
Vdsm
On Tue, Oct 11, 2016 at 2:05 PM, Francesco Romani <fromani(a)redhat.com>
wrote:
> Hi all,
>
> In the last 2.5 days I was exploring if and how we can integrate collectd
> and Vdsm.
[...]
This generally sounds like a good idea - and I hope it is coordinated with
our efforts for monitoring (see [1], [2]).
Sure it will. I played with collectd for a couple of days "just" to see
a. how hard it is to write a collectd plugin, and/or if it is feasible to
ship it out-of-tree for the initial few releases, until it stabilizes
and can be submitted upstream
b. if we can get events/notifications from collectd
c. if we can integrate those notifications with Vdsm
And it turns out we *can* do all of the above, with varying degrees of difficulty.
A few notes:
- I think the most compelling reason to move to collectd is actually to
benefit from the plugins it already has, which will cover a lot of the
missing monitoring requirements and wishes we have (example: local disk
usage on the host), as well as to integrate it into Engine monitoring
(example: postgresql performance monitoring).
Agreed
- You can't remove monitoring from VDSM - as a new VDSM may work against
older Engine setups. You can remove it gradually.
Yes, for example we can make Vdsm poll collectd and act as a facade for old Engines,
while new ones should skip this step and ask collectd or the metrics aggregator
service you mention below.
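As a rough sketch of the facade idea: assuming collectd's unixsock plugin is
enabled and Vdsm reads values with its GETVAL command, the reply is a status
line followed by name=value lines. The helper below only parses such a reply;
the function name and the sample data are made up for illustration, and this
is not existing Vdsm code.

```python
def parse_getval_reply(reply):
    """Parse the text reply of a collectd unixsock GETVAL command.

    Assumed reply shape: a status line "<count> <message>" followed by
    <count> lines of "name=value". A negative count signals an error.
    """
    lines = reply.strip().splitlines()
    if not lines:
        raise ValueError("empty reply")
    status, _, message = lines[0].partition(" ")
    count = int(status)
    if count < 0:
        raise RuntimeError("collectd error: %s" % message)
    values = {}
    for line in lines[1:1 + count]:
        name, _, raw = line.partition("=")
        values[name] = float(raw)
    return values


# Hypothetical reply for a load average query.
sample = "3 Values found\nshortterm=1.5\nmidterm=0.75\nlongterm=0.5\n"
print(parse_getval_reply(sample))
```

Vdsm would then map the resulting dict onto the fields old Engines expect,
so the legacy API keeps working while the data really comes from collectd.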
I'd actually begin with cleanup - there are some 'metrics' that are simply
not needed, should not be reported in the first place and are there for
historic reasons only. Remove them - from Engine first, from the DB and
all, then later we can either send fake values or remove them from VDSM.
Yes, this is the first place where we need to coordinate with the metrics effort.
- If you are moving to collectd, as you can see from the metrics effort,
we'd really want to send the data elsewhere - and Engine should consume it
from there.
Metrics stores usually have a very nice REST UI with the ability to fetch
series with averages, by different criteria (such as per-hour or per-minute
stats), etc.
Fully agreed
- I agree with Nir about separating our core business from the extra
monitoring we do. Keep in mind that some of the stats are for SLA
and critical scheduling decisions as well.
Yes, of course adding a dependency for core monitoring is risky.
So far the bottom line is that relying on collectd for this is just one more
option on the table now.
[mostly brainstorming from now on]
However, I'd like to highlight that it is not just risky: it is a different tradeoff.
Doing the core monitoring in Vdsm (so in python, essentially in a single threaded server)
is not a free lunch, because it has quite a high price in terms of performance.
If the main Vdsm process is overloaded, then the polling cycle can get longer, and the
overall response time of processing system events (e.g. disk detected full) can get
longer as well.
In the not-so-distant past we've observed high response times from a heavily loaded Vdsm.
I think the idea of having different instances for different monitoring purposes
(credit to Nir) is the best shot at the moment.
We could maybe have one standard system collectd for regular monitoring,
and perhaps one special purpose, very limited collectd instance for critical information.
On top of that, Vdsm could double-check and keep doing the core monitoring itself,
albeit at a lower rate (e.g. every 10s instead of every 2s; every 60s instead of every
15s).
Leveraging libvirt events is *the* right thing, no doubt about that, but it would be very
nice to have a dependable external service which can generate the events we need based on
libvirt data, and move the notification logic on top of it.
Something like (final picture, excluding intermediate compatibility layers)
[data source]
-----+-------
     |
     `-> [monitoring/metrics collection]
         -------------+---------------
                      |
                      +--> [metrics store] -{data}-> [Engine]
                      |
                      `--> [notification service] -{events}-> [Vdsm]
Not all the "boxes" need to be separate processes, for example collectd has some
support for thresholds and notifications which is ~80% of what Vdsm needs (again not
considering reliability, just feature-wise).
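To make the "notification service" box a bit more concrete, here is a minimal
sketch of threshold-based events in plain python. It does not use collectd at
all: the class name and the limits are hypothetical, and only the severity
names (OKAY, WARNING, FAILURE) are borrowed from collectd's notifications.
The point is that an event is emitted on state changes only, not on every
sample.

```python
OKAY, WARNING, FAILURE = "OKAY", "WARNING", "FAILURE"


class ThresholdWatcher(object):
    """Turn a stream of samples into state-change events."""

    def __init__(self, warning_max, failure_max):
        self.warning_max = warning_max
        self.failure_max = failure_max
        self.state = OKAY

    def _classify(self, value):
        if value > self.failure_max:
            return FAILURE
        if value > self.warning_max:
            return WARNING
        return OKAY

    def feed(self, value):
        """Return the new severity if the state changed, else None."""
        new_state = self._classify(value)
        if new_state == self.state:
            return None
        self.state = new_state
        return new_state


# Hypothetical disk usage samples, in percent.
watcher = ThresholdWatcher(warning_max=80.0, failure_max=95.0)
events = [watcher.feed(v) for v in (50.0, 85.0, 90.0, 97.0, 60.0)]
print(events)
```

Vdsm would only have to react to the non-None entries, which is what keeps
the notification path cheap compared to polling every value itself.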
[end brainstorm]
- The libvirt collectd plugin is REALLY outdated. I think it may require
significant work to bring it up to speed with our existing capabilities.
Yep, I looked briefly at that code. It is REALLY outdated :)
Besides some information that is just not reported (easy to fix),
we will most likely need some logic here to deal with stuck VMs,
much like we do in Vdsm.
In the past I ported some Vdsm monitoring code to C [1]; this could perhaps help now.
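Just to illustrate what "logic to deal with stuck VMs" could mean, here is a
small python sketch. It is neither what Vdsm does nor what the collectd
libvirt plugin does; all the names are hypothetical. The idea is to remember
which VMs still have a stats call in flight, skip them in the next cycle, and
flag them as stuck once the call has been pending longer than a timeout.

```python
class StuckVmTracker(object):
    """Remember which VMs have a stats call in flight and for how long."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._in_flight = {}  # vm_id -> time the pending call started

    def start(self, vm_id, now):
        # Keep the oldest start time if a call is already pending.
        self._in_flight.setdefault(vm_id, now)

    def done(self, vm_id):
        self._in_flight.pop(vm_id, None)

    def is_stuck(self, vm_id, now):
        started = self._in_flight.get(vm_id)
        return started is not None and now - started > self.timeout

    def pollable(self, vm_ids):
        """VMs with no pending call, safe to query in this cycle."""
        return [vm for vm in vm_ids if vm not in self._in_flight]


tracker = StuckVmTracker(timeout=30.0)
tracker.start("vm-a", now=0.0)
tracker.start("vm-b", now=0.0)
tracker.done("vm-a")  # vm-a answered, vm-b never did
print(tracker.pollable(["vm-a", "vm-b"]), tracker.is_stuck("vm-b", now=45.0))
```

The timestamps are passed in explicitly so the logic can be exercised without
sleeping; a real poller would take them from a monotonic clock.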
Bests,
+++
[1]
https://github.com/mojaves/vmon/tree/master/src
--
Francesco Romani
Red Hat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani