
Hello to all,
I am trying to back up a normal VM, but it seems that I don't really understand the concept. At first I found the possibility to back up via the API: https://www.ovirt.org/documentation/administration_guide/#Setting_a_storage_.... Creating a snapshot of the VM, finding the ID of the snapshot and retrieving the configuration of the VM makes sense to me. But at that point I would download the config and the snapshot and put them on my backup storage, not create a new VM, attach the disk, and run a backup with backup software. And for restoring, I would do the same steps in reverse. If I look at other projects, there seems to be a way to download the snapshot and the config file, or am I wrong? Maybe someone can explain to me why I should install additional backup software on an additional machine, or even better, how I can get by without additional backup software.
On the same topic: the documentation also describes setting up a backup storage domain. It is nearly the same: create a snapshot, or clone the machine, and export it to the backup storage.
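(For illustration only, the snapshot-based flow described above looks roughly like this with the oVirt Python SDK - a minimal, untested sketch; the engine URL, credentials and the VM name are placeholders:)

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=myvm')[0]          # placeholder VM name
vm_service = vms_service.vm_service(vm.id)

# 1. Create a snapshot for the backup (no memory state needed).
snaps_service = vm_service.snapshots_service()
snap = snaps_service.add(types.Snapshot(description='backup', persist_memorystate=False))
snap_service = snaps_service.snapshot_service(snap.id)
while snap_service.get().snapshot_status == types.SnapshotStatus.LOCKED:
    time.sleep(5)

# 2. The snapshot ID and its disks are now known; each disk can be downloaded
#    (e.g. via an image transfer) and the VM configuration saved next to it,
#    as the admin guide outlines. Afterwards the snapshot should be removed.
for disk in snap_service.disks_service().list():
    print(snap.id, disk.id, disk.alias)

connection.close()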
"Export the new virtual machine to a backup domain. See Exporting a Virtual Machine to a Data Domain in the Virtual Machine Management Guide." Sadly that only says what to do, not how, and the link points to a 404 page. Maybe someone can explain to me how to use a backup storage domain.
thank you very much shb

Probably the easiest way is to export the VM as an OVA. The OVA format is a single file which includes the entire VM image along with the config, and you can import it back into oVirt easily as well. You can do this from the GUI on a running VM and export to OVA without bringing the VM down; the export process handles the creation and deletion of the snapshot. You can export the OVA to a directory located on one of the hosts, and this directory could be an NFS mount on an external storage server if you want.
The problem with export to OVA is that you can't put it on a schedule and it is mostly a manual process. You can however initiate it with Ansible. A little while ago I wrote an Ansible playbook to back up multiple VMs on a schedule. It was written for oVirt 4.3; I have not had time to test it with 4.4 yet: https://github.com/silverorange/ovirt_ansible_backup
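(For reference, the same OVA export can also be driven directly from the Python SDK instead of Ansible; a minimal sketch that assumes the export_to_path_on_host action available since 4.2 - VM name, host name, path and credentials are placeholders:)

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',   # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
system_service = connection.system_service()
vm = system_service.vms_service().list(search='name=myvm')[0]        # placeholder VM name
host = system_service.hosts_service().list(search='name=myhost')[0]  # placeholder host name

# Ask the engine to export the (running) VM as an OVA into a directory on
# that host; the directory may be an NFS mount on external storage.
system_service.vms_service().vm_service(vm.id).export_to_path_on_host(
    host=types.Host(id=host.id),
    directory='/home/backup/in_progress',   # placeholder path
    filename='%s.ova' % vm.name,
)
connection.close()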

Hello,
https://github.com/silverorange/ovirt_ansible_backup
I am also still using 4.3. In my opinion this is by far the best and easiest solution for disaster recovery. There is no need to install an appliance, and if there is a need to recover, you can import the OVA into any hypervisor - no databases, no dependencies.
Sometimes I have issues with "TASK [Wait for export]": sometimes it takes too long to export the OVA, and I also had the problem that the export had already finished but was not recognized by the script. In oVirt the export was finished and the file had been renamed from *.tmp to *.ova. Maybe you have an idea for me. Thanks, bye

Checking the timestamp of the export file (the difference between now and the timestamp) could also be an option to verify whether the export is still ongoing, instead of using ovirt_event_info.
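(A small sketch of that idea in Python - the paths and the staleness threshold are arbitrary placeholders, and the exact temp-file naming may differ:)

import os
import time

def export_finished(ova_path):
    # The export writes a temporary file and renames it to *.ova when done,
    # so the simplest completion check is whether the final file exists.
    return os.path.exists(ova_path)

def export_looks_stalled(tmp_path, stale_after=600):
    # If the in-progress file has not been modified for `stale_after` seconds,
    # assume the export is no longer making progress.
    return (os.path.exists(tmp_path)
            and time.time() - os.path.getmtime(tmp_path) > stale_after)

# usage, with placeholder paths:
# export_finished('/home/backup/in_progress/VMName.ova')
# export_looks_stalled('/home/backup/in_progress/VMName.ova.tmp')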

You should be able to fix it by increasing the timeout variable in main.yml. I think the default is pretty low, around 600 seconds (10 minutes). I have mine set to a few hours since I'm dealing with large VMs. I'd also increase the poll interval as well, so it's not checking for completion every 10 seconds; I set my poll interval to 5 minutes. I have backed up many large VMs (over 1TB) with this playbook for the past several months and never had a problem with it not completing.
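(The trade-off in plain terms: the wait task just polls for a completion condition until a timeout expires, so large VMs need a long timeout and can use a coarse poll interval. A generic sketch, not the playbook's actual code:)

import time

def wait_for(condition, timeout=7200, poll_interval=300):
    # Poll `condition()` every `poll_interval` seconds until it returns True
    # or `timeout` seconds have passed; 2 hours / 5 minutes roughly matches
    # the values discussed above for large VMs.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    return False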

Also, if you look at the blog post linked on the GitHub page, it has info about increasing the Ansible timeout on the oVirt engine machine. This will be necessary when dealing with large VMs that take over 2 hours to export.

Yes, you are right, I had already found that. But that was not really my problem. It was caused by the HostedEngine: a long time ago I decreased its memory. It seems that was the problem; now it seems to be working pretty well.

OK, I have run the backup three times. I still have two machines where it still fails on TASK [Wait for export]. I think the problem is not the timeout; in the oVirt engine the export has already finished: "Exporting VM VMName as an OVA to /home/backup/in_progress/VMName.ova on Host kvm360". But [Wait for export] still counts down to 1, exits with an error and moves on to the next task. Bye, shb

Interesting, I've not hit that issue myself. I'd think it must somehow be related to getting the event status. Is it happening to the same VMs every time? Is there anything different about the VM names, or anything that would set them apart from the others that work?

I think I found the problem. It is case sensitivity: the export is NOT case sensitive, but the "wait for export" step is. I've changed it and now it seems to be working.
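(The pitfall in a nutshell: if the finished-export event is matched against the VM name with an exact string comparison, 'MyVM' and 'myvm' never match. A tiny illustrative sketch, not the playbook's actual code:)

def export_event_matches(event_description, vm_name):
    # e.g. event_description = "Exporting VM VMName as an OVA to
    # /home/backup/in_progress/VMName.ova on Host kvm360"
    # Compare case-insensitively so the match does not depend on how the
    # VM name was capitalised in the inventory.
    return vm_name.lower() in event_description.lower()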

Thanks for letting me know, I suspected that might be the case. I'll make a note to fix that in the playbook.

I found OVA export/import to be rather tricky. In the final 4.3 release there still remains a bug which can lead to the OVA files containing nothing but zeros for the disks; a single-line fix only made it into 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1813028
Once I fixed that on 4.3, I had the issue that imports from OVAs, but also from an export domain, were failing: the qemu-img process doing the transfer failed with a write error to the local Gluster storage, often after a considerable part of the work had already been done (the particular image was 115GB in allocated size on a 500GB thin disk). Since the Gluster is all fine, has space etc., I was thinking that perhaps a timeout might be the real cause (I am not using server-class hardware in the home lab, so the built-in 'patience' may be too short). This gives me a hint where to make Ansible wait longer, just in case it's a timing issue.

Struggling with bugs and issues on OVA export/import (my clear favorite otherwise, especially when moving VMs between different types of hypervisors), I've tried pretty much everything else, too.
Export domains are deprecated and require quite a bit of manual handling. Unfortunately the buttons for the various operations are all over the place, e.g. the activation and maintenance toggles are on different pages. In the end the mechanisms underneath (qemu-img) seem very much the same and suffer from the same issues (I have larger VMs that keep failing on imports).
So far the only fool-proof method has been to use the imageio daemon to upload and download disk images, either via the Python API or the web GUI. Transfer times are terrible though: 50MB/s is quite low when the network below is 2.5-10Gbit and SSDs all around. Obviously with Python as everybody's favorite GUI these days, you can also copy and transfer the VM's complete definition, but I am one of those old guys who might even prefer a real GUI to mouse clicks in a browser.
The documentation on backup domains is terrible. What's missing behind the 404 link in oVirt becomes a very terse section in the RHV manuals, where you're basically just told that after cloning the VM you should then move its disks to the backup domain... What you are then supposed to do with the cloned VM, whether it's OK to simply throw it away because the definition is silently copied to the OVF_STORE on the backup - none of that is explained or mentioned. There is also no procedure for restoring a machine from a backup domain, when really a cloning process that allows a target domain would be pretty much what I'd vote for. Red Hat really wants you to buy the professional product there, or use the Python GUI.
I've sadly found the OVA files generated by oVirt (QEMU, really) to be incompatible with both VMware Workstation 15.5 and VirtualBox 6.12. No idea whose fault this is, but both sides are obviously not doing plug-fests every other week, and I'm pretty sure this could be fixed manually when needed.

On Sun, Aug 30, 2020 at 7:13 PM <thomas@hoberg.net> wrote:
Struggling with bugs and issues on OVA export/import (my clear favorite otherwise, especially when moving VMs between different types of hypervisors), I've tried pretty much everything else, too.
Export domains are deprecated and require quite a bit of manual handling. Unfortunately the buttons for the various operations are all over the place e.g. the activation and maintenance toggles are in different pages.
Using export domain is not a single click, but it is not that complicated. But this is good feedback anyway.
In the end the mechanisms underneath (qemu-img) seem very much the same and suffer from the same issues (I have larger VMs that keep failing on imports).
I think the issue is gluster, not qemu-img.
So far the only fool-proof method has been to use the imageio daemon to upload and download disk images, either via the Python API or the Web-GUI.
How did you try? Transfer via the UI is completely different from transfer using the Python API.
From the UI, you get the image content on storage, without sparseness support. If you download a 500g raw sparse disk (e.g. Gluster with allocation policy thin) with 50g of data and 450g of unallocated space, you will get 50g of data and 450g of zeroes. This is very slow. If you upload the image to another system you will upload 500g of data, which will again be very slow.
From the Python API, download and upload support sparseness, so you will download and upload only 50g. Both upload and download use 4 connections, so you can maximize the throughput that you can get from the storage. From the Python API you can also convert the image format during download/upload automatically, for example download a raw disk to a qcow2 image.
Gluster is a challenge (as usual), since when using sharding (enabled by default for oVirt), it does not report sparseness. So even from the Python API you will download the entire 500g. We can improve this using zero detection, but this is not implemented yet.
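(For illustration, a minimal sketch of a download through the image transfer API - the disk ID, credentials and output path are placeholders, error handling is omitted, and the sparseness and format conversion described above are what the ovirt-imageio client adds on top of this plain HTTPS download:)

import time
import requests
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',   # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
transfers_service = connection.system_service().image_transfers_service()

# Start a download transfer for one disk (the disk ID is a placeholder).
transfer = transfers_service.add(types.ImageTransfer(
    disk=types.Disk(id='DISK-UUID'),
    direction=types.ImageTransferDirection.DOWNLOAD,
))
transfer_service = transfers_service.image_transfer_service(transfer.id)
while transfer.phase == types.ImageTransferPhase.INITIALIZING:
    time.sleep(1)
    transfer = transfer_service.get()

# Stream the image from the imageio daemon on the host (transfer.proxy_url
# can be used instead when there is no direct route to the host).
with requests.get(transfer.transfer_url, verify='ca.pem', stream=True) as resp, \
        open('disk.img', 'wb') as out:
    for chunk in resp.iter_content(8 * 1024 * 1024):
        out.write(chunk)

transfer_service.finalize()
connection.close()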
Transfer times are terrible though, 50MB/s is quite low when the network below is 2.5-10Gbit and SSDs all around.
In our lab we tested upload of a 100 GiB image and 10 concurrent uploads of 100 GiB images, and we measured throughput of 1 GiB/s: https://bugzilla.redhat.com/show_bug.cgi?id=1591439#c24
I would like to understand the setup better:
- upload or download?
- disk format?
- disk storage?
- how is storage connected to the host?
- how do you access the host (1g network? 10g?)
- image format?
- image storage?
Obviously with Python as everybody's favorite GUI these days, you can also copy and transfer the VMs complete definition, but I am one of those old guys, who might even prefer a real GUI to mouse clicks on a browser.
The documentation on backup domains is terrible. What's missing behind the 404 link in oVirt becomes a very terse section in the RHV manuals, where you're basically just told that after cloning the VM, you should then move its disks to the backup domain...
backup domain is a partly cooked feature and it is not very useful. There is no reason to use it for moving VMs from one environment to another. I already explained how to move VMs using a data domain, check here: https://lists.ovirt.org/archives/list/users@ovirt.org/message/ULLFLFKBAW7T7B... https://lists.ovirt.org/archives/list/users@ovirt.org/message/GFOK55O5N4SRU5... I'm not sure it is documented properly; please file a documentation bug if we need to add something to the documentation.
What you are then supposed to do with the cloned VM, whether it's OK to simply throw it away because the definition is silently copied to the OVF_STORE on the backup... none of that is explained or mentioned.
If you cloned a VM to a data domain and then detach the data domain, there is nothing to clean up in the source system.
There is also no procedure for restoring a machine from a backup domain, when really a cloning process that allows a target domain would be pretty much what I'd vote for.
We have this in 4.4, try to select a VM and click "Export". Nir

I forgot to add that NFS < 4.2 is also a challenge, and will cause very slow downloads, creating fully allocated files for the same reason. If you use an export domain to move VMs, you should use NFS 4.2. Unfortunately oVirt tries hard to prevent you from using NFS 4.2. Not only is it not the default, the setting to select version 4.2 is hidden under: Storage > Domains > domain-name > Manage Domain > Custom Connection Parameters, where you select "V4.2" for NFS Version. All this can be done only when the storage domain is in maintenance. With this, creating preallocated disks is infinitely faster (using fallocate()), copying disks from this domain can be much faster, and downloading a raw sparse disk will be much faster and more correct, preserving sparseness.
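(The same setting can also be changed through the API; a hedged sketch that assumes an SDK recent enough to expose NfsVersion.V4_2 - the domain name and credentials are placeholders, and the domain must already be in maintenance:)

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',   # placeholder
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
system_service = connection.system_service()

# Look up the NFS storage domain (placeholder name) and its storage connection.
sd = system_service.storage_domains_service().list(search='name=backup-nfs')[0]
conn = (system_service.storage_domains_service()
        .storage_domain_service(sd.id)
        .storage_connections_service()
        .list()[0])

# Force NFS version 4.2 on that connection while the domain is in maintenance.
system_service.storage_connections_service() \
    .storage_connection_service(conn.id) \
    .update(types.StorageConnection(nfs_version=types.NfsVersion.V4_2))
connection.close()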
And to add to the list of questions above: if NFS, which version?

Using export domain is not a single click, but it is not that complicated. But this is good feedback anyway.
I think the issue is gluster, not qemu-img.
From what I am gathering from your feedback, that may very much be so, and I think it's a major concern.
I know RHV started out much like vSphere or Oracle Virtualization, without HCI, but with separated storage and dedicated servers for the management. If you have scale, HCI is quite simply inefficient. But if you have scale, you either are already a cloud yourself or going there. So IMHO HCI in small lab, edge, industrial or embedded applications is *the* future for HCI products, and with it for oVirt. In that sense I perfectly subscribe to your perspective that the 'Python GUI' is the major selling point of oVirt towards developers, but where Ceph, NAS and SAN will most likely be managed professionally, the HCI stuff needs to work out of the box - perfectly.
In my case I am lego-ing surplus servers into an HCI to use both as resilient storage and for POC VMs which are fire-and-forget (a host goes down, the VMs get restarted elsewhere, no need to rush in and rewire things if an old host had its final gasp). The target model at the edge I see is more what I have in my home lab, which is basically a bunch of NUCs, Atom J5005 with 32GB and 1TB SATA at the low end, and now, with 14nm Core CPUs being pushed out of inventories for cheap, even a NUC10 i7-10710U with 64GB of RAM and 1TB of NVMe: a fault-tolerant cluster well below 50 watts in normal operation and with no moving parts.
In the corporate lab these are complemented by big ML servers for the main research, where the oVirt HCI simply adds storage and VMs for automation jobs, but I'd love to be able to use those also as oVirt compute nodes, at least partially: the main workloads there run under Docker because of the easy GPU integration. It's not that dissimilar in the home lab, where my workstations (not 24/7 and often running Windows) may sometimes be added as compute nodes, but are not part of the HCI. I'd love to string these all together via a USB3 Gluster and use the on-board 1Gbit for the business end of the VMs, but since nobody offers a simple USB3 peering network, I am using 2.5 or 5Gbit USB Ethernet adapters instead for the 3-node HCI (main) and 1-node HCI (disaster/backup/migration).
How did you try? transfer via the UI is completely different than transfer using the python API.
Both ways, using the Python sample code from the SDK you wrote. I didn't measure the GUI side... it finished overnight, but the Python code echoes a throughput figure at the end, which was 50MB/s in my case, while NFS typically reaches the 2.5Gbit Ethernet limit of 270MB/s. And funny that they should be so different; I keep thinking that the web GUI and the 'Python GUI' are in lock-step, but I guess the 'different' mainly refers to the fact that the GUI needs to go through an image proxy.
From the UI, you get the image content on storage, without sparseness support. If you download 500g raw sparse disk (e.g. gluster with allocation policy thin) with 50g of data and 450g of unallocated space, you will get 50g of data, and 450g of zeroes. This is very slow. If you upload the image to another system you will upload 500g of data, which will again be very slow.
From the python API, download and upload support sparseness, so you will download and upload only 50g. Both upload and download use 4 connections, so you can maximize the throughput that you can get from the storage. From python API, you can convert the image format during download/upload automatically, for example download raw disk to qcow2 image.
This comment helped me realize how different the GUI image transfers are from the OVA, export domain and Python transfers: while the first works from 'everywhere a GUI might run', the latter run on any node with hosted-engine capabilities, which implies VDSM running there and having access to both ends of the storage locally. But the critical insight was that disk images Gluster failed to write/store with all the faster methods were written and worked fine using the GUI or via the imageio proxy. So perhaps one of the best ways to find the underlying Gluster bug is to see what's happening when the same image is transferred in both ways. I can't see how a bug report to the Gluster team might have a chance of succeeding if I attach a 500GB disk image and ask them to find out 'why this image fails with qemu-img writes'...
Gluster is a challenge (as usual), since when using sharding (enabled by default for oVirt), it does not report sparseness. So even from the Python API you will download the entire 500g. We can improve this using zero detection, but this is not implemented yet.
Somehow that message doesn't make it into the headlines on oVirt: HCI is not advertised as a 'niche that might sometimes work'. HCI is built on the premise and promise that the network protocols and software (as well as the physical network) are more reliable than the node hardware; otherwise it just becomes a very expensive source of entropy. And of course sharding is a must in HCI with VMs, even if it breaks one of the major benefits of Gluster: access to the original files in the backing bricks in case it fouls up. In an HPC environment with hundreds of nodes and bricks I guess I wouldn't use it, but in a 3-9 node HCI running mostly VMs, sharding and erasure coding are what I need to work perfectly. I've gathered it's another team and that they now have major staffing and funding issues, but without the ability to manage cloud, on-premise DC and edge HCI deployments under a single management pane and with good interoperability, oVirt/RHV ceases to be a product. IMHO you can't afford that, even if it costs investments.
Since I have VDO underneath, it might not even make such a big difference with regard to storage, and with compression on the communication link, implementing yet another zero-detection layer may not yield tons of benefit. I guess what I'd mostly expect is an option for disk up/downloads that acts locally on the VDSM nodes, like the OVA and domain exports/imports.
The other critical success element for oVirt (apart from offering something more reliable than a single physical host) is the ability to use it in a self-service manner. The 'Python GUI' is quickly becoming the default, especially with the kids in the company who no longer even know how to point and click a mouse and will code everything, but there are still older guys like me who expect to do things manually with a mouse in a GUI. So if these options are there, the GUI should support them.
In our lab we tested upload of 100 GiB image and 10 concurrent uploads of 100 GiB images, and we measured throughput of 1 GiB/s: https://bugzilla.redhat.com/show_bug.cgi?id=1591439#c24
That doesn't sound so great if the network is 100Gbit ;-) So I am assuming you can saturate the network, something I am afraid of doing in an edge HCI with a single network port running Gluster and everything else. With native 10Gbit USB3 links supporting isochronous protocols I'd feel safe, but with TCP/IP on Gbit... In any case I'll do more testing, but currently that doesn't solve my problem, because I still need to move those VMs from the NFS domain to Gluster, and that fails.
I would like to understand the setup better:
Currently the focus is on migrating clusters from 4.3 HCI to 4.4, with a full rebuild of the nodes and the VMs held in safe storage. The official migration procedure doesn't seem mistake-resilient enough on a 3-node HCI Gluster. Moving VMs between Gluster and NFS domains seems to work well enough on export, and imports work too, but once you move those VMs to Gluster on the target, qemu-img convert fails more often than not, evidently because of a Gluster bug that does not trigger on GUI uploads.
- Upload or download? Both.
- Disk format? "Thin provisioned" wherever I had a choice: the VMs in question are pretty much always about functionality, not performance, and about not having to worry about disk sizes. VMs are given a large single disk; VDO, LVM-thin and QCOW2 are expected to hold only what is actually written.
- Disk storage? 3-node or 1-node HCI Gluster; detachable domains are local storage exported via NFS and meant to be temporary, because Gluster storage doesn't move that easily. There is no SAN or enterprise NFS available.
- How is storage connected to the host? PCIe or SATA SSD.
- How do you access the host (1g network? 10g?)? 2.5/5/10 Gbit Ethernet.
- Image format? I tag "thin" wherever I get a choice; qemu-img info will still often report "raw", e.g. on export domain images.
- Image storage? Gluster or NFS.
backup domain is a partly cooked feature and it is not very useful. There is no reason to use it for moving VMs from one environment to another.
The manual is terse. I guess the only functionality at the moment is that VMs in backup domains don't get launched. The attribute also just seems to be a local flag: when a domain is re-attached, the backup flag gets lost. I only noticed after I successfully launched VMs from the 'backup' domain re-attached to the 4.4 target.
I already explained how to move vms using a data domain. Check here: https://lists.ovirt.org/archives/list/users@ovirt.org/message/ULLFLFKBAW7... https://lists.ovirt.org/archives/list/users@ovirt.org/message/GFOK55O5N4S...
Since HCI and Gluster are my default, I didn't pay that much attention initially. I have tested NFS domains more and I find them much easier to use, but without an enterprise NAS, and with HCI on both source and target, that's not a solution until disks can be moved from NFS to Gluster without failing in qemu-img convert.
I'm not sure it is documented properly, please file a documentation bug if we need to add something to the documentation.
If you cloned a vm to data domain and then detach the data domain there is nothing to cleanup in the source system.
At least in the 4.3 GUI, clone doesn't have a target and only asks for a name: there is no cloning from Gluster to NFS or vice versa in the GUI. Instead I have to first clone (Gluster to Gluster) and then move (Gluster to NFS) to make a VM movable. Perhaps that is different in Python/REST? With 4.4 the clone operation is much more elaborate and allows fine-tuning the 'cloned' machine. But again, I don't see that I can change the storage domain there: there is a selection box, but it only allows the same domain as the clone source. Actually that makes a lot of sense, because for VDI scenarios or similar, clone should be a copy-on-write operation, essentially a snapshot given a distinct identity, so detaching tons of straddling VMs could be a challenge. As far as I can tell, in 4.3 clone is simply a full copy (with sparsity preserved) and with 4.4 you get a 'copy with reconfiguration'. The VDI-type storage efficiency needs to come from VDO; it doesn't seem to be managed by oVirt.
We have this in 4.4, try to select a VM and click "Export".
Good, so the next migration will be easier...
Hey, sorry for piling on a bit: I really do appreciate both what you have been creating and your support. It's just that, for a product that is nearly a decade old now, it seems very beta right where and how I need to use it. I am very much looking forward to next week and hearing about the bright future you plan for oVirt/RHV, but in the meantime I'd like to abuse this opportunity to push my agenda a bit:
1. Make HCI a true focus of the product, not a Nutanix also-ran sideline. Perhaps even make it your daily driver in QA.
2. Find ways of fencing that do not require enterprise hardware: NUCs or similar could be a giant opportunity in edge deployments, with various levels of concentration (and higher-grade hardware) along the path towards DCs or clouds. Not having to switch the orchestrator API is a USP.
3. With Thunderbolt being the new USB, and Thunderbolt being PCIe or NVMe over fabric etc.: is there a way to make USB work as an HCI fabric? I use Mellanox host-chaining on our big boxes, and while vendors would rather sell IF switches, labs would rather use software. And USB is even cheaper than Ethernet, because four ports come free with every box, allowing for quite an HCI mesh just by adding cables. Gluster giving up on RDMA support (if I read correctly) is the wrong way to go.
participants (5):
- info@dsdm.ch
- Jayme
- Nir Soffer
- Stefan Wolf
- thomas@hoberg.net