[ovirt-devel] status update: consuming the BLOCK_THRESHOLD event from libvirt (rhbz#1181665)

Wed Jul 19 09:44:49 UTC 2017

Hi all,

With libvirt 3.2.0 and onwards, it seems we have now the tools to solve
https://bugzilla.redhat.com/show_bug.cgi?id=1181665

and eventually get rid of the disk polling we do. This change is
expected to have huge impact on performance, so I'm working on it.

I had plans for a comprehensive refactoring in this area, but looks like
a solution backportable for 4.1.z is appealing, so I

started with this first, saving the refactoring (which I still very much
want) for later.

So, quick summary: libvirt >= 3.2.0 allows to set a threshold to any
node in the backing chain of each drive of a VM

(https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainSetBlockThreshold),
and fire one event exactly once

when that threshold is crossed. The event needs to be explicitely
rearmed after.

This is exactly what we need to get rid of polling in the steady state,
so far so good.

The problem is: we can't use this for some important flows we have, and
which involve usage of disks not (yet) attached to a given VM.

Possibly affected flows:

- live storage migration:

  we use         flags = (libvirt.VIR_DOMAIN_BLOCK_COPY_SHALLOW |
                 libvirt.VIR_DOMAIN_BLOCK_COPY_REUSE_EXT |
                 VIR_DOMAIN_BLOCK_COPY_TRANSIENT_JOB)

  meaning that Vdsm is in charge of handling the volume

- snapshots:

  we use         snapFlags = (libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT |
                     libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA)

  (same meaning as above)

- live merge: should be OK (according to a glance at the source and a
chat with Adam).

So looks like we will need to bridge this gap.

So we can still use the BLOCK_THRESHOLD event for steady state, and
avoid polling in the vast majority of the cases.

With "steady state" I mean that the VM is running, with no
administration (snapshot, live merge, live storage migration...)
operation in progress.

I think it is fair to assume that VMs are in this state the vast
majority of the time.
For the very important cases on which we cannot depend on events, we can
fall back to polling, but in a smarter way:

instead of polling everything every 2s, let's just poll just the drives
involved in the ongoing operations.

Those should be far less of the total amount of drives, and for a far
shorter time than today, so polling should be practical.

Since the event fires once, we will need to rearm it only if the
operation is ongoing, and only just before to start it (both conditions
easy to check)
We can disable the polling on completion, or on error. This per se is
easy, but we will need a careful review of the flows, and perhaps some
safety nets in place.

Anyway, should we miss to disable the polling, we will "just" have some
overhead.

On recovery, we will need to make sure to rearm all the relevant events,
but  we can just plug in the recovery we must do already, so this should
be easy as well.

So it seems to me this could fly and we can actually have the
performance benefits of events.

However, due to the fact that we need to review some existing and
delicate flows, I think we should still keep the current polling code
around for the next release.

I believe the best route is:

1. offer the new event-based code for 4.2, keep the polling around.
Default to events for performance

2. remove the polling completely in 4.3

I'm currently working on the patches here:
https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:watermark-event-minimal

Even though the basics are in place, I don't think they are ready for
review yet.

Comments welcome, as usual.

-- 
Francesco Romani
Senior SW Eng., Virtualization R&D
Red Hat
IRC: fromani github: @fromanirh