On Sun, Aug 30, 2020 at 7:13 PM <thomas(a)hoberg.net> wrote:
Using an export domain is not a single click, but it is not that complicated. But this is good feedback anyway.
I think the issue is Gluster, not qemu-img.
How did you try? Transfer via the UI is completely different from transfer using the Python API.
From the UI, you get the image content on storage, without sparseness support. If you download a 500g raw sparse disk (e.g. Gluster with thin allocation policy) with 50g of data and 450g of unallocated space, you will get 50g of data and 450g of zeroes. This is very slow. If you upload the image to another system you will upload 500g of data, which will again be very slow.
From the Python API, download and upload support sparseness, so you will download and upload only 50g. Both upload and download use 4 connections, so you can maximize the throughput that you can get from the storage. From the Python API you can also convert the image format automatically during download/upload, for example download a raw disk to a qcow2 image.
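A minimal sketch of such a download with the SDK and the imageio client (the connection details and disk UUID are placeholders; the download_disk.py / upload_disk.py examples shipped with the SDK are the complete versions):

    # Minimal sketch, not the official SDK example: see download_disk.py in
    # the ovirt-engine-sdk examples for the complete version. The engine URL,
    # credentials and disk UUID are placeholders, and the ovirt_imageio client
    # API is assumed to be available on the machine running this.
    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types
    from ovirt_imageio import client

    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",
        username="admin@internal",
        password="secret",
        ca_file="ca.pem",
    )

    # Ask the engine to start an image transfer for the disk we want to read.
    transfers_service = connection.system_service().image_transfers_service()
    transfer = transfers_service.add(
        types.ImageTransfer(
            disk=types.Disk(id="DISK-UUID"),
            direction=types.ImageTransferDirection.DOWNLOAD,
        )
    )
    transfer_service = transfers_service.image_transfer_service(transfer.id)

    # Wait until the transfer is ready and a transfer URL is available.
    while transfer.phase == types.ImageTransferPhase.INITIALIZING:
        time.sleep(1)
        transfer = transfer_service.get()

    # The client preserves sparseness and converts the format on the fly,
    # here writing a local qcow2 image from the disk.
    client.download(
        transfer.transfer_url,
        "disk.qcow2",
        "ca.pem",
        fmt="qcow2",
    )

    transfer_service.finalize()
    connection.close()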
Gluster is a challenge (as usual), since when using sharding (enabled by default for oVirt), it does not report sparseness. So even from the Python API you will download the entire 500g.
We can improve this using zero detection, but this is not implemented yet.
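The idea itself is simple; a rough illustration of client-side zero detection (not the actual imageio code) would be to read fixed-size blocks and skip the ones that are entirely zero:

    # Rough illustration only, not the ovirt-imageio implementation: read
    # fixed-size blocks and yield only the ones that contain non-zero bytes,
    # so all-zero ranges are never transferred.
    BLOCK_SIZE = 1024 * 1024
    ZERO_BLOCK = b"\0" * BLOCK_SIZE

    def iter_data_blocks(path):
        """Yield (offset, data) for blocks that are not entirely zero."""
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                # Comparing against a preallocated zero buffer is cheap.
                if block != ZERO_BLOCK[:len(block)]:
                    yield offset, block
                offset += len(block)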
In our lab we tested upload of a 100 GiB image and 10 concurrent uploads of 100 GiB images, and we measured throughput of 1 GiB/s:
https://bugzilla.redhat.com/show_bug.cgi?id=1591439#c24
I would like to understand the setup better:
- upload or download?
- disk format?
- disk storage?
- how is storage connected to host?
- how do you access the host (1g network? 10g?)
- image format?
- image storage?
The backup domain is a half-baked feature and it is not very useful. There is no reason to use it for moving VMs from one environment to another.
I already explained how to move VMs using a data domain. Check here:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ULLFLFKBAW7...
https://lists.ovirt.org/archives/list/users@ovirt.org/message/GFOK55O5N4S...
I'm not sure it is documented properly; please file a documentation bug if we need to add something to the documentation.
If you clone a VM to a data domain and then detach the data domain, there is nothing to clean up in the source system.
We have this in 4.4; try to select a VM and click "Export".
Nir
> On Sun, Aug 30, 2020 at 7:13 PM <thomas(a)hoberg.net> wrote:
> Using an export domain is not a single click, but it is not that complicated. But this is good feedback anyway.
> I think the issue is Gluster, not qemu-img.
From what I am gathering from your feedback, that may very much be so, and I think it's a major concern.
I know RHV started out much like vSphere or Oracle Virtualization without HCI, but with separated storage and dedicated servers for the management. If you have scale, HCI is quite simply inefficient.
But if you have scale, you are either already a cloud yourself or going there. So IMHO HCI in small lab, edge, industrial or embedded applications is *the* future for HCI products and with it for oVirt. In that sense I fully subscribe to your perspective that the 'Python-GUI' is the major selling point of oVirt towards developers, but while Ceph, NAS and SAN will most likely be managed professionally, the HCI stuff needs to work out of the box--perfectly.
In my case I am lego-ing surplus servers into an HCI setup to use both as resilient storage and for POC VMs which are fire-and-forget (a host goes down, the VMs get restarted elsewhere, no need to rush in and rewire things if an old host had its final gasp).
The target model at the edge I see is closer to what I have in my home lab, which is basically a bunch of NUCs, Atom J5005 with 32GB and 1TB SATA at the low end, and now, with 14nm Core CPUs being pushed out of inventories for cheap, even a NUC10 i7-10710U with 64GB of RAM and 1TB of NVMe: a fault-tolerant cluster well below 50 watts in normal operation and with no moving parts.
In the corporate lab these are complemented by big ML servers for the main research, where the oVirt HCI simply adds storage and VMs for automation jobs, but I'd love to be able to use those also as oVirt compute nodes, at least partially: the main workloads there run under Docker because of the easy GPU integration. It's not that dissimilar in the home lab, where my workstations (not 24/7 and often running Windows) may sometimes be added as compute nodes, but are not part of the HCI itself.
I'd love to string these all together via a USB3 Gluster network and use the on-board 1Gbit for the business end of the VMs, but since nobody offers a simple USB3 peering network, I am using 2.5 or 5Gbit USB Ethernet adapters instead for 3-node HCI (main) and 1-node HCI (disaster/backup/migration).
> How did you try? Transfer via the UI is completely different from transfer using the Python API.
Both ways, using the Python sample code from the SDK that you wrote. I didn't measure the GUI side... it finished overnight, but the Python code echoes a throughput figure at the end, which was 50MB/s in my case, while NFS typically reaches the 2.5Gbit Ethernet limit of 270MB/s.
And funny that they should be so different: I kept thinking that the Web-GUI and the 'Python-GUI' are in lock-step, but I guess the 'different' mainly refers to the fact that the GUI needs to go through an image proxy.
> From the UI, you get the image content on storage, without sparseness support. If you download a 500g raw sparse disk (e.g. Gluster with thin allocation policy) with 50g of data and 450g of unallocated space, you will get 50g of data and 450g of zeroes. This is very slow. If you upload the image to another system you will upload 500g of data, which will again be very slow.
> From the Python API, download and upload support sparseness, so you will download and upload only 50g. Both upload and download use 4 connections, so you can maximize the throughput that you can get from the storage. From the Python API you can also convert the image format automatically during download/upload, for example download a raw disk to a qcow2 image.
This comment helped me realize how different the GUI image transfers are from OVA, export domain and Python transfers: while the first works from 'everywhere a GUI might run', the latter run on a node with hosted-engine capabilities, which implies VDSM running there and local access to both ends of the storage.
But the critical insight was that disk images Gluster failed to write/store with all the faster methods were written and worked fine using the GUI, i.e. via the imageio proxy.
So perhaps one of the best ways to find the underlying Gluster bug is to see what happens when the same image is transferred both ways.
I can't see how a bug report to the Gluster team might have a chance of succeeding when I attach a 500GB disk image and ask them to find out 'why this image fails with qemu-img writes'...
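One way to narrow it down without shipping a 500GB image might be to write the same source image onto the Gluster mount the way oVirt does (qemu-img convert with caching disabled) and compare checksums against the source; a rough sketch, with the paths as placeholders:

    # Rough sketch for reproducing/comparing the qemu-img write path; both
    # paths are placeholders, and the cache flags (-t none -T none) mirror
    # what oVirt/vdsm passes as far as I can tell.
    import hashlib
    import subprocess

    SRC = "/path/to/nfs/export/test-disk.raw"        # placeholder source image
    DST = "/rhev/data-center/mnt/glusterSD/test.raw" # placeholder gluster mount

    def sha256(path, block_size=1024 * 1024):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(block_size), b""):
                h.update(block)
        return h.hexdigest()

    # Write through qemu-img with direct I/O, the way the engine copies disks.
    subprocess.run(
        ["qemu-img", "convert", "-p", "-t", "none", "-T", "none",
         "-f", "raw", "-O", "raw", SRC, DST],
        check=True,
    )

    print("source checksum:", sha256(SRC))
    print("copy checksum:  ", sha256(DST))

If the checksums only diverge when the target is the sharded Gluster volume, that is a much smaller reproducer to hand to the Gluster team.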
> Gluster is a challenge (as usual), since when using sharding (enabled by default for oVirt),
Somehow that message doesn't make it into the headlines on oVirt: HCI is not advertised as a 'niche that might sometimes work'. HCI is built on the premise and promise that the network protocols and software (as well as the physical network) are more reliable than the node hardware; otherwise it just becomes a very expensive source of entropy.
And of course sharding is a must in HCI with VMs, even if it breaks one of the major benefits of Gluster: access to the original files in the backing bricks in case it fouls up. In an HPC environment with hundreds of nodes and bricks I guess I wouldn't use it; in a 3-9 node HCI running mostly VMs, sharding and erasure coding are what I need to work perfectly.
I've gathered that it's another team and that they now have major staffing and funding issues, but without the ability to manage cloud, on-premise DC and edge HCI deployments under a single management pane and with good interoperability, oVirt/RHV ceases to be a product: IMHO you can't afford that, even if it costs investments.
> it does not report sparseness. So even from the Python API you will download the entire 500g.
> We can improve this using zero detection, but this is not implemented yet.
Since I have VDO underneath, it might not even make such a big difference with regard to storage, and with compression on the communications link, implementing yet another zero-detection layer may not yield tons of benefit. I guess what I'd mostly expect is an option for disk up/downloads that acts locally on the VDSM nodes, like the OVA and domain exports/imports.
The other critical success element for oVirt (apart from offering something more reliable than a single physical host) is the ability to use it in a self-service manner. The 'Python-GUI' is quickly becoming the default, especially with the kids in the company, who no longer even know how to point and click a mouse and will code everything, but there are still older guys like me who expect to do things manually with a mouse on a GUI. So if these options are there, the GUI should support them.
> In our lab we tested upload of a 100 GiB image and 10 concurrent uploads of 100 GiB images, and we measured throughput of 1 GiB/s:
> https://bugzilla.redhat.com/show_bug.cgi?id=1591439#c24
That doesn't sound so great if the network is 100Gbit ;-)
So I am assuming you can saturate the network, something I am afraid of doing in an edge HCI with a single network port running Gluster and everything else. With native 10Gbit USB3 links supporting isochronous protocols I'd feel safe, but with TCP/IP on Gbit...
In any case I'll do more testing, but currently that doesn't solve my problem, because I still need to move those VMs from the NFS domain to Gluster, and that fails.
> I would like to understand the setup better:
Currently the focus is on migrating clusters from 4.3 HCI to 4.4, with a full rebuild of the nodes and the VMs in safe storage. The official migration procedure doesn't seem mistake-resilient enough on a 3-node HCI Gluster.
Moving VMs between Gluster and NFS domains seems to work well enough on export, and imports work too, but once you move those VMs to Gluster on the target, qemu-img convert fails more often than not, evidently because of a Gluster bug that does not trigger on GUI uploads.
> - upload or download?
Both.
> - disk format?
"Thin provisioned" wherever I had a choice: the VMs in question are pretty much always about functionality, not performance, and about not having to worry about disk sizes. VMs are given a large single disk; VDO, LVM-thin and QCOW2 are expected to consume only what's actually written.
> - disk storage?
3-node or 1-node HCI Gluster; detachable domains are local storage exported via NFS and meant to be temporary, because Gluster storage doesn't move that easily. There is no SAN or enterprise NFS available.
> - how is storage connected to host?
PCIe or SATA SSD.
> - how do you access the host (1g network? 10g?)
2.5/5/10 Gbit Ethernet.
> - image format?
I tag "thin" wherever I get a choice. qemu-img info will still often report "raw", e.g. on export domain images (a quick way to check is sketched after this list).
> - image storage?
Gluster or NFS.
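For completeness, this is roughly how I check what an image really is on storage; the path is a placeholder and it only needs qemu-img on the host:

    # Quick check of an image on storage: "qemu-img info" reports the format
    # and sizes, "qemu-img map" shows which ranges are actually allocated.
    # The path below is a placeholder.
    import json
    import subprocess

    IMAGE = "/path/to/storage/domain/images/disk"  # placeholder

    info = json.loads(subprocess.check_output(
        ["qemu-img", "info", "--output", "json", IMAGE]))
    print("format:      ", info["format"])
    print("virtual size:", info["virtual-size"])
    print("actual size: ", info.get("actual-size"))

    ranges = json.loads(subprocess.check_output(
        ["qemu-img", "map", "--output", "json", IMAGE]))
    allocated = sum(r["length"] for r in ranges if r["data"])
    print("allocated:   ", allocated)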
> The backup domain is a half-baked feature and it is not very useful. There is no reason to use it for moving VMs from one environment to another.
The manual is terse. I guess the only functionality at the moment is that VMs in backup domains don't get launched.
The attribute also just seems to be a local flag: when a domain is re-attached, the backup flag gets lost. I only noticed after I had successfully launched VMs from the 'backup' domain re-attached to the 4.4 target.
Since HCI and Gluster are my default, I didn't pay that much attention initially. I have tested NFS domains more and I find them much easier to use, but without an enterprise NAS, and with HCI on both source and target, that's not a solution until disks can be moved from NFS to Gluster without failing on qemu-img convert.
> I'm not sure it is documented properly; please file a documentation bug if we need to add something to the documentation.
> If you clone a VM to a data domain and then detach the data domain, there is nothing to clean up in the source system.
At least in the 4.3 GUI, clone doesn't have a target and only asks for a name: there is no cloning from Gluster to NFS or vice versa in the GUI. Instead I have to first clone (gluster2gluster) and then move (gluster2NFS) to make a VM movable. Perhaps that is different in Python/REST?
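For what it's worth, the REST API and Python SDK do seem to expose moving a detached, unlocked disk between storage domains directly; a minimal sketch, assuming the DiskService "move" action behaves as documented and with placeholder connection details:

    # Minimal sketch, assuming the DiskService "move" action; the engine URL,
    # credentials, disk id and domain name are placeholders.
    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url="https://engine.example.com/ovirt-engine/api",
        username="admin@internal",
        password="secret",
        ca_file="ca.pem",
    )

    disk_service = connection.system_service().disks_service().disk_service("DISK-UUID")

    # Ask the engine to move the disk to another storage domain.
    disk_service.move(storage_domain=types.StorageDomain(name="TARGET-DOMAIN"))

    # Wait until the disk is unlocked again before using it.
    while disk_service.get().status == types.DiskStatus.LOCKED:
        time.sleep(5)

    connection.close()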
With 4.4 the clone operation is much more elaborate and allows fine-tuning the 'cloned' machine. But again, I don't see that I can change the storage domain there: there is a selection box, but it only allows the same domain as the clone source.
Actually that makes a lot of sense, because for VDI scenarios or similar, clone should be a copy-on-write operation, essentially a snapshot given a distinct identity. So detaching tons of straddling VMs could be a challenge.
As far as I can tell, on 4.3 clone is simply a full copy (with sparseness preserved) and with 4.4 you get a 'copy with reconfiguration'. The VDI-type storage efficiency needs to come from VDO; it doesn't seem to be managed by oVirt.
> We have this in 4.4; try to select a VM and click "Export".
Good, so the next migration will be easier...
> Nir
Hey, sorry for piling on a bit: I really do appreciate both what you have been creating and your support.
It's just that for a product with this many years behind it, it seems very beta right where and how I need to use it.
I am very much looking forward to next week and to hearing about the bright future you plan for oVirt/RHV, but in the meantime I'd like to abuse this opportunity to push my agenda a bit:
1. Make HCI a true focus of the product, not a Nutanix also-ran sideline. Perhaps even make it your daily driver in QA.
2. Find ways of fencing that do not require enterprise hardware: NUCs or similar could be a giant opportunity in edge deployments, with various levels of concentration (and higher-grade hardware) along the path towards DCs or clouds. Not having to switch the orchestrator API is a USP.
3. With Thunderbolt being the new USB, and Thunderbolt being PCIe or NVMe over fabric etc.: is there a way to make USB work as an HCI fabric? I use Mellanox host-chaining on our big boxes, and while vendors would rather sell IF switches, labs would rather use software. And USB is even cheaper than Ethernet, because four ports come free with every box, allowing for quite an HCI mesh just by adding cables. Gluster giving up on RDMA support (if I read that correctly) is the wrong way to go.