On Tue, Jun 30, 2020 at 10:32 AM Michael Ablassmeier <abi(a)grinser.de> wrote:
hi,
I'm currently looking at the new incremental backup API that has been
part of the 4.4 and RHV 4.4-beta releases. So far I was able to create
full/incremental backups and restore them without any problem.
Now, using the backup_vm.py example from the ovirt-engine-sdk, the
following happens during a full backup:
1) the imageio client API requests a transfer
2) qemu-img is started to create a local qemu image of the same size
3) qemu-nbd is started to serve this image
4) the used extents are read from the provided imageio source, and the
   data is passed to the qemu-nbd process
5) the resulting file is a thin provisioned qcow2 image with the actual
   data of the VM's used space
While this works great, it has one downside: if I back up a virtual
machine with lots of used extents, or multiple virtual machines at the
same time, I may run out of space if my primary backup target is not a
regular disk.
Imagine I want to stream the FULL backup directly to tape, like:

backup_vm.py full [..] <vm_uuid> /dev/nst0

That is currently not possible, because qemu-img cannot open a tape
device directly, given the nature of the qcow2 format.
So what I am basically looking for is a way to download only the extents
from the imageio server that are really in use, without depending on the
qemu-* tools, to be able to pipe the data somewhere else. Standard tools
such as curl will always download the fully provisioned image from the
imageio backend (of course).
I noticed that it is possible to query the extents via:
https://tranfer_node:54322/images/d471c659-889f-4e7f-b55a-a475649c48a6/ex...
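For example, fetching the extent list needs nothing more than plain
http.client (a sketch; the host name, port and ca.pem are assumptions
based on my setup, the ticket UUID is the one from the URL above):

import http.client
import json
import ssl

ctx = ssl.create_default_context(cafile="ca.pem")
con = http.client.HTTPSConnection("transfer_node", 54322, context=ctx)
con.request("GET", "/images/d471c659-889f-4e7f-b55a-a475649c48a6/extents")
res = con.getresponse()
extents = json.loads(res.read())
con.close()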
As I failed to find them: are there any existing functions/API calls
that could be used to download only the used extents to a file or fifo
pipe?
So far, I played around with the _internal.io.copy function, being able
to at least read the data into an in-memory BytesIO stream, but that is
not the solution to my "problem" :)
To use _internal.io.copy to copy the image to tape, we need to solve
several issues:
1. how do you write the extents to tape so that you can extract them later?
2. provide a backend that knows how to stream data to tape in the right format
3. fix client.download() to consider the number of writers allowed by
   the backend, since streaming to tape using multiple writers will not
   be possible
I think we can start with a simple implementation using imageio API, and once
we have a working solution, we can consider making a backend.
A possible solution for 1 is to use the tar format, creating one tar per
backup. The tar structure can be:
- backup info - a json file with information about this backup, like vm
  id, disk id, date, checkpoint, etc. (see the sketch after this list)
- extents - the json returned from imageio as is. Using this json you
  can restore later every extent to the right location in the restored
  image
- extent 1 - first data extent (zero=False)
...
- extent N - last data extent
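A minimal sketch of the backup info member, using the write_to_tar()
helper described later; the member name and the fields are illustrative,
not a fixed schema:

import io
import json

# Placeholders - these values come from the backup you are running.
backup_info = {
    "vm_id": "VM-UUID",
    "disk_id": "DISK-UUID",
    "date": "2020-06-30T10:32:00Z",
    "checkpoint": "CHECKPOINT-UUID",
}
data = json.dumps(backup_info).encode("utf-8")
write_to_tar("backup-info", len(data), io.BytesIO(data))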
To restore this backup, you need to:
1. find the tar in the tape (I have no idea how you would do this)
2. extract backup info from the tar
3. extract extents from the tar
4. start an upload transfer
5. for each data extent: read the data from the tar member, and send it
   to imageio using the right offset and size (see the sketch after
   this list)
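Here is a rough sketch of steps 3-5, assuming the upload transfer is
already started, and assuming imageio accepts PUT requests with a
Content-Range header for writing at an offset; the host name, ca.pem and
the ticket UUID are placeholders:

import http.client
import json
import ssl
import tarfile

ctx = ssl.create_default_context(cafile="ca.pem")
con = http.client.HTTPSConnection("transfer_node", 54322, context=ctx)

with tarfile.open("/dev/nst0", "r|") as tar:
    data_extents = None
    n = 0
    for member in tar:
        if member.name == "extents":
            extents = json.loads(tar.extractfile(member).read())
            data_extents = [e for e in extents if not e["zero"]]
        elif member.name.startswith("extent-"):
            extent = data_extents[n]
            n += 1
            start = extent["start"]
            length = extent["length"]
            # Stream the member directly to imageio at the right offset.
            con.request(
                "PUT",
                "/images/TICKET-UUID",
                body=tar.extractfile(member),
                headers={
                    "Content-Range": "bytes {}-{}/*".format(
                        start, start + length - 1),
                    "Content-Length": str(length),
                })
            res = con.getresponse()
            res.read()

# A final flush may be needed so all data is written to storage before
# finalizing the transfer.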
Other formats are possible, but reusing tar seems like the easiest way,
and will make it easier to write and read backups from tapes.
Creating a tar file and adding items using streaming can be done like this:

with tarfile.open("/dev/xxx", "w|") as tar:
    # Create tarinfo for extent-N; setting other attributes
    # may be needed.
    tarinfo = tarfile.TarInfo("extent-{}".format(extent_number))
    tarinfo.size = extent_size
    # reader must implement read(n), providing tarinfo.size bytes.
    tar.addfile(tarinfo, fileobj=reader)
I never tried to write directly to tape with python tarfile, but it should work.
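For completeness, the write_to_tar() helper used in the example below
could be as simple as this (a sketch; it assumes the tar object from the
enclosing "with" block is in scope):

def write_to_tar(name, size, reader):
    # Create a member header and stream size bytes from reader into
    # the archive; setting more attributes (mtime, mode) may be needed.
    tarinfo = tarfile.TarInfo(name)
    tarinfo.size = size
    tar.addfile(tarinfo, fileobj=reader)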
So the missing part is to create a connection to imageio and read the
data. The easiest way is to use imageio._internal.backends.http, but
note that this is internal now, so you should not use it outside of
imageio. It is fine for writing a proof of concept, and if you can show
a good use case we can work on a public API.
With that backend, you can do this:

import io
import json

from imageio._internal.backends import http

with http.Backend(transfer_url, cafile) as backend:
    extents = list(backend.extents("zero"))

    # Write extents to the tar. Assuming you wrote a helper
    # write_to_tar() doing the TarInfo dance.
    extents_data = json.dumps(
        [extent.to_dict() for extent in extents]).encode("utf-8")
    write_to_tar("extents", len(extents_data), io.BytesIO(extents_data))

    for n, extent in enumerate(e for e in extents if not e.zero):
        # Seek to start of extent. Reading extent.length bytes will
        # return the extent data.
        backend.seek(extent.start)

        # Backends do not implement read(), and it would be inefficient
        # to implement it. This is a quick hack to make it possible to
        # integrate with other code expecting file-like objects.
        # reader is an HTTPResponse instance, implementing read().
        reader = backend._get(extent.length)

        write_to_tar("extent-{}".format(n), extent.length, reader)
For incremental backup, you will need to change:
extents = list(backend.extents("dirty"))
...
for n, extent in enumerate(e for e in extents if e.dirty):
You can write this using http.client.HTTPSConnection without using
the http backend, but it would be a lot of code.
We probably need to expose the backends or a simplified interface in the
client public API to make it easier to write such applications. Maybe
something like:

client.copy(src, dst)

Where src and dst are objects implementing the imageio backend interface.
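Purely as an illustration, such a backend interface could have roughly
this shape (not a committed API, just a guess based on the internal
backends):

class Backend:

    def extents(self, context):
        """Iterate over image extents ("zero" or "dirty" context)."""

    def seek(self, offset):
        """Change the current position in the image."""

    def readinto(self, buf):
        """Read data at the current position into buf."""

    def write(self, buf):
        """Write buf at the current position."""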
But before we do this we need to see some examples of real programs
using imageio, to understand the requirements better.
Nir