[ovirt-devel] Caching of data from the database done properly

Roman Mohr rmohr at redhat.com
Tue Jul 12 06:59:55 UTC 2016


Hi Yevgeny,

On Mon, Jul 11, 2016 at 7:59 PM, Yevgeny Zaspitsky <yzaspits at redhat.com> wrote:
>
>
> On Tue, Jul 5, 2016 at 7:14 AM, Roman Mohr <rmohr at redhat.com> wrote:
>>
>> On Mon, Jul 4, 2016 at 11:58 PM, Roman Mohr <rmohr at redhat.com> wrote:
>> > Hi Everyone,
>> >
>> > I wanted to discuss a practice which seems to be pretty common in the
>> > engine and which I find very limiting and dangerous, and for some
>> > things it can even be a blocker.
>> >
>> > There are several places in the engine where we are using maps as
>> > cache in singletons to avoid reloading data from the database. Two
>> > prominent ones are the QuotaManager[1] and the MacPoolPerCluster[2].
>> >
>> > While it looks tempting to just use a map as cache, add some locks
>> > around it and create an injectable singleton, this has some drawbacks:
>> >
>> > 1) We have an authoritative source for our data and it offers
>> > transactions to take care of inconsistencies or parallel updates.
>> > Doing all that again in a service duplicates this.
>> > 2) Caching on the service layer is definitely not a good idea. It can
>> > introduce unwanted side effects when someone invokes the DAOs
>> > directly.
>> > 3) The point is more about the question whether a cache is really needed:
>> > Do I just want that cache because I find it convenient to do a
>> > #getMacPoolForCluster(Guid clusterId) in a loop instead of just
>> > loading it once before the loop, or do my usage requirements really
>> > force me to use a cache?
>> >
>> > If you really need a cache, consider the following:
>> >
>> > 1) Do the caching on the DAO layer. This guarantees the best
>> > consistency across the data.
>> > 2) Yes, this means either locking in the DAOs or a transactional cache.
>> > But before you complain, think about what is done in [1] and [2]. We
>> > do exactly that there, so the complexity is already introduced anyway.
>> > 3) Since we are working with transactions, a custom cache should NEVER
>> > cache writes (really just talking about our use case here). This makes
>> > checks for existing IDs before adding an entity, or similar checks,
>> > unnecessary; don't duplicate constraint checks like in [2].
>> > 4) There should always be a way to disable the cache (even if it is
>> > just for testing).
>> > 5) If I can't convince you to move the cache to the DAO layer, still
>> > add a way to disable the cache.
>> >
>>
>> I forgot to mention one thing: there are of course cases where
>> something is loaded on startup, mostly things which can have multiple
>> sources. For instance it is pretty common for the application
>> configuration itself, or, like in the scheduler, for the scheduling
>> policies, where some are Java only and some are coming from other
>> sources. Caching those is still fine.
>>
>> But for normal business entities, accessing parts of them through
>> services and other parts directly through DAOs is not the best thing
>> to do (if constructing the whole business entity out of multiple DAOs
>> is complex, Repositories can help, but the cache should still be in
>> the DAO layer).
>
>
> I do not agree that the caching should be on the DAO layer - that might lead
> to getting an entity that is built of parts that are not coherent with each
> other if the different DAO caches are not in sync.

I can't agree here.
That is what transactions are for. A second-level cache normally
follows transactions: you have interceptors to detect rollbacks and
commits.
If you don't have JTA in place there is normally a small window where
you can read stale data in different transactions (which is fine in
most cases), but that has nothing to do with where the cache is.

It is much easier to stay in sync since there is no way to bypass the cache.
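
Just to make it concrete, here is a rough sketch of what I mean.
ClusterDao, Cluster and Guid stand in for the real engine types, and a
plain ConcurrentHashMap stands in for a proper cache provider:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only, not actual engine code: a read-through cache inside
// the DAO. Every reader goes through the DAO, so nothing can bypass
// the cache, and it is consulted before a DB connection is acquired.
public class CachingClusterDao {

    private final ClusterDao delegate;            // the real DAO hitting the DB
    private final ConcurrentMap<Guid, Cluster> cache = new ConcurrentHashMap<>();
    private final boolean cacheEnabled;           // always keep a way to disable it

    public CachingClusterDao(ClusterDao delegate, boolean cacheEnabled) {
        this.delegate = delegate;
        this.cacheEnabled = cacheEnabled;
    }

    public Cluster get(Guid id) {
        if (!cacheEnabled) {
            return delegate.get(id);              // e.g. switched off for tests
        }
        // only a cache miss touches the DB and the connection pool
        return cache.computeIfAbsent(id, delegate::get);
    }

    public void save(Cluster cluster) {
        delegate.save(cluster);                   // never cache writes,
        cache.remove(cluster.getId());            // just invalidate on write
    }
}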

> I'd put the cache on the Repositories (non-existent currently) or a higher
> layer, just above the transaction boundaries, so the cache would contain
> service call results rather than raw data.

What does "above the transaction boundaries" mean?
Yes, the point of a second-level cache is to have a cache across
transaction boundaries, and you get that as well when you place it in
the DAO layer.

You would further make it very hard to track whether you are allowed
to manipulate data through DAOs, Repositories or Services when you
don't place the basic cache inside the DAOs, since you might always
bypass the cache by accident.

For higher-layer caches in singletons it is also almost a prerequisite
to have the basic cache in the DAO layer, because you can then also
listen for cache changes of dependent entities inside the singleton
(all cache implementations I know have listeners) and invalidate or
update derived caches. This, in combination with the possibility to
disable the cache on all layers, makes the cache completely
transparent on every layer, which makes it very easy to write sane
code when using all the different services, DAOs, Repositories, ... .
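
Roughly like this (only a sketch: CacheChangeListener and NicChange
are made up here, Guid stands for the engine's id type, and providers
like infinispan or ehcache ship their own listener interfaces):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch only: a singleton with a derived cache (MACs per cluster)
// that stays in sync by listening to changes of the basic DAO-level
// cache.
public class MacsPerClusterCache implements CacheChangeListener<NicChange> {

    private final ConcurrentMap<Guid, Set<String>> macsPerCluster =
            new ConcurrentHashMap<>();

    @Override
    public void onChange(NicChange change) {
        // a nic was added, updated or removed in the basic cache:
        // drop the derived entry, it gets recomputed on the next lookup
        macsPerCluster.remove(change.getClusterId());
    }

    public Set<String> getMacs(Guid clusterId, Function<Guid, Set<String>> loader) {
        // the loader falls back to the DAOs when the entry is missing
        return macsPerCluster.computeIfAbsent(clusterId, loader);
    }
}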

> Then the cache would prevent from
> the application accessing the DB connection pool for a connection.

The cache inside the DAO is consulted before you acquire a DB connection.
It would not make much sense otherwise.

> Yes,
> different service caches might have the same entities duplicated in memory,
> but I do not care about that until it is proven to be a problem, and if it
> were, I'd go about improving the cache - making it more capable.
>

Sometimes you need caches for derived values, and that is fine too. The
best thing you can do then is to have caches in the DAO layer and use
cache listeners (which all cache solutions have), or just fire events
that in the current transaction some of the basic entities have
changed. Then you listen for those changes in the service too and
update your cache according to these events. Like with all the other
caches, you then commit or roll back your changes by listening to the
transaction (and keep them thread local before that). The benefit is
that the caching complexity is invisible from outside and everything
behaves as expected.
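
With plain JTA that could look roughly like this (only a sketch:
Synchronization and Status are standard JTA, everything else is made
up for illustration):

import java.util.ArrayList;
import java.util.List;
import javax.transaction.Status;
import javax.transaction.Synchronization;

// Sketch only: changes to a derived cache are kept thread local and
// only published to the shared cache when the transaction commits.
public class DerivedCacheUpdater implements Synchronization {

    private final ThreadLocal<List<Runnable>> pending =
            ThreadLocal.withInitial(ArrayList::new);

    // called by the service while the transaction is still running;
    // the updater has to be registered with the current transaction
    // (Transaction#registerSynchronization) once per transaction
    public void onEntityChanged(Runnable cacheUpdate) {
        pending.get().add(cacheUpdate);
    }

    @Override
    public void beforeCompletion() {
        // nothing to do before the commit decision
    }

    @Override
    public void afterCompletion(int status) {
        List<Runnable> updates = pending.get();
        pending.remove();
        if (status == Status.STATUS_COMMITTED) {
            updates.forEach(Runnable::run);   // publish to the shared cache
        }
        // on rollback the pending updates are simply dropped, the
        // shared cache never saw them
    }
}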

Another interesting scenario is when you want to get a derived
resource from a service, which should be unique (let's say a MAC
address). In this case you would use a non-transactional cache with
locks (like the MacPoolPerCluster does) to make the change visible
application wide. Now if the transaction fails, you would release this
resource as part of the rollback process, by listening to the
transaction. The resource would then just be visible as used a little
bit longer. If you want to return such a resource to a pool because
you don't need it anymore, you would keep the release thread local
until the transaction succeeds, just in case something goes wrong with
the transaction. All this can be done from within the service, which
keeps the service easy to use.
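
A rough sketch of that MAC address case (the pool itself is made up,
the transaction hooks are standard JTA):

import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import javax.transaction.RollbackException;
import javax.transaction.Status;
import javax.transaction.Synchronization;
import javax.transaction.SystemException;
import javax.transaction.Transaction;

// Sketch only: a locked, application-wide pool that hands out unique
// MACs and gives them back automatically if the transaction rolls back.
public class SimpleMacPool {

    private final Deque<String> freeMacs;

    public SimpleMacPool(Collection<String> range) {
        this.freeMacs = new ArrayDeque<>(range);
    }

    public synchronized String allocate(Transaction tx)
            throws RollbackException, SystemException {
        final String mac = freeMacs.pop();        // visible as used application wide
        tx.registerSynchronization(new Synchronization() {
            public void beforeCompletion() { }
            public void afterCompletion(int status) {
                if (status != Status.STATUS_COMMITTED) {
                    release(mac);                 // rollback: the MAC just shows up
                }                                 // as used a little bit longer
            }
        });
        return mac;
    }

    // returning a MAC: keep the release pending until the commit, in
    // case something goes wrong with the transaction
    public void releaseOnCommit(final String mac, Transaction tx)
            throws RollbackException, SystemException {
        tx.registerSynchronization(new Synchronization() {
            public void beforeCompletion() { }
            public void afterCompletion(int status) {
                if (status == Status.STATUS_COMMITTED) {
                    release(mac);
                }
            }
        });
    }

    public synchronized void release(String mac) {
        freeMacs.push(mac);
    }
}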

>> I hope you get what I mean.
>>
>> > For as long as there is no general caching solution with something
>> > like ehcache or infinispan, in my eyes such small things matter a lot
>> > to keep a project maintainable.
>> >
>> > Those are some of the best practices I have seen around caching
>> > database data. It would be great if we could agree on something like
>> > that. Maybe there is already an agreement on something and I am just
>> > not aware of it.
>> >
>> > Looking forward to hearing your feedback.
>> >
>> > Roman
>> >
>> > [1]
>> > https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/quota/QuotaManager.java
>> > [2]
>> > https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/network/macpool/MacPoolPerCluster.java
>
>

Best Regards,
Roman


