On Mon, Jul 5, 2021 at 2:13 PM Nir Soffer <nsoffer@redhat.com> wrote:

>
> vdsm     14342  3270  0 11:17 ?        00:00:03 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/blockSD/679c0725-75fb-4af7-bff1-7c447c5d789c/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379 -O raw -o preallocation=falloc /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379

-o preallocation + NFS 4.0 + very slow NFS is your problem.

qemu-img is using posix_fallocate() to preallocate the entire image at
the start of the copy. With NFS 4.2 this uses the Linux-specific
fallocate() syscall, which allocates the space very efficiently in
almost no time. With older NFS versions this falls back to a very slow
loop, writing one byte for every 4k block.
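
If you want to see this yourself, one way (a sketch only: /mnt/nfs40
and /mnt/nfs42 are example mount points for the same export mounted
with vers=4.0 and vers=4.2) is to trace qemu-img while it creates a
small preallocated raw image on each mount:

# /mnt/nfs42 is an example NFS 4.2 mount; expect a single successful
# fallocate() call:
strace -f -e trace=fallocate,pwrite64 \
    qemu-img create -f raw -o preallocation=falloc /mnt/nfs42/test.img 1G

# /mnt/nfs40 is an example NFS 4.0 mount; expect fallocate() to fail
# with EOPNOTSUPP, followed by a long loop of 1-byte pwrite64() calls,
# one per 4k block (the glibc emulation of posix_fallocate()):
strace -f -e trace=fallocate,pwrite64 \
    qemu-img create -f raw -o preallocation=falloc /mnt/nfs40/test.img 1G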

If you see -o preallocation, it means you are using an old vdsm
version; we stopped using -o preallocation in 4.4.2, see
https://bugzilla.redhat.com/1850267.

OK. As I said at the beginning, the environment is the latest 4.3.
We are going to upgrade to 4.4, and we are making some additional
backups to be safe.


> On the hypervisor the ls command just hangs, so from another hypervisor I can see that the disk size seems to stay at 4 GB even though the timestamp updates...
>
> # ll /rhev/data-center/mnt/172.16.1.137\:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/
> total 4260941
> -rw-rw----. 1 nobody nobody 4363202560 Jul  5 11:23 d2a89b5e-7d62-4695-96d8-b762ce52b379
> -rw-r--r--. 1 nobody nobody        261 Jul  5 11:17 d2a89b5e-7d62-4695-96d8-b762ce52b379.meta
>
> On the host console I see a throughput of 4 Mbit/s...
>
> # strace -p 14342

This shows only the main thread; use -f to show all threads.

 # strace -f -p 14342
strace: Process 14342 attached with 2 threads
[pid 14342] ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8 <unfinished ...>
[pid 14343] pwrite64(12, "\0", 1, 16474968063) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474972159) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474976255) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474980351) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474984447) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474988543) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474992639) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474996735) = 1
[pid 14343] pwrite64(12, "\0", 1, 16475000831) = 1
[pid 14343] pwrite64(12, "\0", 1, 16475004927) = 1
. . . and so on . . .

Note that each write is a single byte and the offsets increase by
exactly 4096, which matches the posix_fallocate() emulation described
above.


>
> This is a test oVirt env so I can wait and eventually test something...
> Let me know your suggestions

I would start by changing the NFS storage domain to version 4.2.

I'm going to try. Right now I have it set to the default of auto-negotiation...
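
For reference, something like the following should show which NFS
version was actually negotiated on the host (look for the vers=
option in the output):

# list all NFS mounts with their negotiated options
nfsstat -m

# or check the kernel mount table for this storage domain
grep 172.16.1.137 /proc/mounts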


1. kill the hung qemu-img (it probably cannot be killed, but it is
worth trying)
2. deactivate the storage domain
3. fix the ownership on the storage domain (should be vdsm:kvm, not
nobody:nobody) - see the sketch below
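
For step 3, a rough sketch, assuming the usual vdsm:kvm ids of 36:36
and assuming the export path on the appliance is /nas/EXPORT-DOMAIN
(reconstructed from the mount name above, adjust as needed):

# run on the NFS server / appliance as root; 36:36 is the standard
# vdsm:kvm uid/gid on oVirt hosts
chown -R 36:36 /nas/EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106

If the files still show up as nobody:nobody on the hosts afterwards,
the NFSv4 id mapping or root squash configuration on the appliance may
be the real issue.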

Unfortunately it is an appliance. I have asked the people in charge of
it whether we can change the ownership.
Thanks for explaining the other concepts.

Gianluca