> On Fri, Aug 28, 2020 at 2:31 AM <thomas(a)hoberg.net> wrote:
> You should really try the attach/detach storage domain; this is the
> recommended way to move VMs from one oVirt system to another.
> You could detach the entire domain, with all its VMs, from the old
> system and connect it to the new system, without copying even one bit.
> I guess you cannot do this because you don't use shared storage?
These are all HCI setups with GlusterFS, so storage is shared in a way...
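As an aside, that detach/attach flow can also be scripted with the
Python SDK. A minimal sketch of the detach half, with connection
details and names as placeholders (not a tested recipe; the attach then
happens on the new engine):

    import ovirtsdk4 as sdk

    # Connect to the OLD engine (URL/credentials are placeholders).
    connection = sdk.Connection(
        url='https://old-engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='...',
        ca_file='ca.pem',
    )
    system = connection.system_service()

    # Look up the data center and the domain that should move.
    dc = system.data_centers_service().list(search='name=Default')[0]
    sd = system.storage_domains_service().list(search='name=vmstore')[0]

    # Deactivate (maintenance) and detach; the data stays on storage.
    attached = system.data_centers_service() \
        .data_center_service(dc.id) \
        .storage_domains_service() \
        .storage_domain_service(sd.id)
    attached.deactivate()
    attached.remove()
    connection.close()

On the new engine the domain is then imported and attached to a data
center, after which the VMs on it become available for registration.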
I am also experimenting with a backup (not export) domain on NFS and/or removable media
(just temp local storage, exported via NFS), but the handling is very odd, to say the
least (see my other post for the full story).
Basically the documentation says you move all VM disks to the backup
domain after cloning the VM, and then it says nothing more... (How does
the VM definition get carried over? Can I then destroy the remaining
clone VM? Do I need to re-create a similar VM at the target? etc.)
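From what I can tell, the VM definition travels with a data (or backup)
domain in its OVF_STORE disks, so after attaching the domain on the
target, the VMs can be registered from it. A hedged sketch with the
Python SDK, all names being placeholders:

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://new-engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='...',
        ca_file='ca.pem',
    )
    sds_service = connection.system_service().storage_domains_service()
    sd = sds_service.list(search='name=backup')[0]
    vms_service = sds_service.storage_domain_service(sd.id).vms_service()

    # VMs whose definitions live in the domain's OVF_STORE but are
    # not yet known to this engine:
    for vm in vms_service.list(unregistered=True):
        vms_service.vm_service(vm.id).register(
            cluster=types.Cluster(name='Default'),
            allow_partial_import=True,
        )
    connection.close()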
The in-place upgrade procedure in the docs for the HCI case has far too
many tersely described steps that can go wrong with someone like me
doing it: I even manage to fail green-field setups many times, somehow.
And even if I were to do the upgrade as described, I still need to know
that export/clean-migration/import remains a viable option, should
something go wrong.
...
> Using oVirt 4.3 when 4.4 was released is going to be painful, don't do this.
That's why I am migrating, but for that I need to prove a working plan B.
> > Unfortunately the description doesn't tell if the failure of the
> > silent qemu-img was on the export side, resulting in a corrupted
> > image: I am assuming that qemu-img is used in both export and import.
> > The failure on the import is not silent, it just doesn't seem to
> > make a lot of sense, because qemu-img is reporting a write error at
> > the local single-node HCI Gluster target, which has plenty of space
> > and is essentially a loopback in 1nHCI.
...
> No, the export domain is using qemu-img, which is the best tool for
> copying images. This is how all disks are copied in oVirt, in all
> flows. There are no issues like ignored errors or silent failures in
> the storage code.
My problem is getting errors and not understanding what's causing them.
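One way to narrow that down might be to verify the exported image
directly on the export domain before importing it, along these lines
(paths are placeholders; qemu-img check only works on formats with
metadata, such as qcow2, not on raw):

    # Show format and allocation of the exported disk:
    qemu-img info /mnt/export/.../images/<image-uuid>/<volume-uuid>
    # Consistency check (qcow2 and friends only):
    qemu-img check /mnt/export/.../images/<image-uuid>/<volume-uuid>
    # If the source disk still exists, compare contents:
    qemu-img compare <source-volume> <exported-volume>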
qemu-img on the target HCI single-node Gluster is reporting a write
error at varying block numbers, often after dozens of gigabytes have
already been transferred. There is plenty of space on the Gluster
volume, an SSD with VDO underneath, so the higher risk is actually the
source, which is the NFS mount from the export domain.
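To separate the two suspects, the same copy can be reproduced by hand:
read the image from the NFS mount into a local scratch file, then write
it onto the Gluster mount in a second step (paths and the raw format
are assumptions):

    # Step 1, read test: NFS export -> local scratch file
    qemu-img convert -p -f raw -O raw /mnt/export/.../<volume> \
        /var/tmp/disk.test
    # Step 2, write test: local scratch -> Gluster-backed mount
    qemu-img convert -p -f raw -O raw /var/tmp/disk.test \
        /rhev/data-center/mnt/glusterSD/<host>:_<volume>/disk.test

If step 1 succeeds and step 2 fails at a similar offset, the
Gluster/VDO side is the culprit; if step 1 already fails, it is the
NFS source.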
I've tried uploading the image using imageio and your Python sample
from the SDK, but just as I had done that (at 50 MB/s, about 1/6 of the
performance of the qemu-img transfer), I managed to kill the 4.4
cluster by downgrading the machine type of the hosted-engine, when I
was really trying to make a successfully restored VM work with renamed
Ethernet devices...
The upload via imageio completed fully, but I hadn't yet tested the
disk image with a machine to see if it would boot.
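For reference, that upload follows the SDK's upload example only in
outline: create an image transfer for the disk, PUT the bytes to the
transfer URL, then finalize. A stripped-down sketch (no error handling
or TLS details; IDs and credentials are placeholders):

    import time
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='...',
        ca_file='ca.pem',
    )
    transfers = connection.system_service().image_transfers_service()
    transfer = transfers.add(types.ImageTransfer(
        disk=types.Disk(id='<disk-uuid>'),
        direction=types.ImageTransferDirection.UPLOAD,
    ))
    transfer_service = transfers.image_transfer_service(transfer.id)
    # Wait until the transfer leaves the INITIALIZING phase.
    while transfer.phase == types.ImageTransferPhase.INITIALIZING:
        time.sleep(1)
        transfer = transfer_service.get()
    # ... HTTP PUT the image bytes to transfer.transfer_url here ...
    transfer_service.finalize()
    connection.close()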
...
> There are no timeouts in storage code, e.g. attach/detach domain, or
> export to an export domain.
>
> Nir
Well, that was almost my last hope, because I don't know what could
make the qemu-img import transfer fail on a write when the very same
image works with imageio... Actually, the big difference there is that
the resulting disk, which is logically configured at 500GB, is actually
consuming the full 500GB in the domain, while sparse images that make
it successfully through qemu-img retain their much smaller actual size.
VDO is still underneath, so it may not matter, and I didn't have a
chance to try sparsify before I killed the target cluster.
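My working theory for that allocation: the plain upload PUTs every
byte, zeroes included, so the target ends up fully allocated. Trimming
it back afterwards should be possible with the disk's Sparsify action
in the engine, or offline with libguestfs (assuming the VM is down;
the path is a placeholder):

    virt-sparsify --in-place /path/to/disk.img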
I have also prepared a USB3 disk to act as an export domain, which I'll
physically move, just to ensure the NFS pipe in the qemu-img job isn't
the real culprit.
And I guess I'll try the export again, to see if I overlooked some
error there.
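For the USB disk, the usual stumbling block is ownership: oVirt wants
the exported path owned by vdsm:kvm (36:36). Something like this in
/etc/exports should do (the path is a placeholder):

    /mnt/usb-export  *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)

    # plus, on the filesystem itself:
    chown -R 36:36 /mnt/usb-export
    exportfs -ra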