Any way to terminate stuck export task

Hello, in oVirt 4.3.10 an export job to an export domain is taking too long, probably because the NFS server is slow. How can I stop the task cleanly? I see the exported file stays at 4.5 GB. vmstat on the host running the qemu-img process shows no throughput but blocked processes:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd      free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 1  2      0 170208752 474412 16985752    0    0   719    72 2948 5677  0  0 96  4  0
 0  2      0 170207184 474412 16985780    0    0  3580    99 5043 6790  0  0 96  4  0
 0  2      0 170208800 474412 16985804    0    0  1379    41 2332 5527  0  0 96  4  0

and the generated file refreshes its timestamp but not its size:

# ll -a /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/125ad0f8-2672-468f-86a0-115a7be287f0/
total 4675651
drwxr-xr-x.  2 vdsm kvm       1024 Jul  3 14:10 .
drwxr-xr-x. 12 vdsm kvm       1024 Jul  3 14:10 ..
-rw-rw----.  1 vdsm kvm 4787863552 Jul  3 14:33 bb94ae66-e574-432b-bf68-7497bb3ca9e6
-rw-r--r--.  1 vdsm kvm        268 Jul  3 14:10 bb94ae66-e574-432b-bf68-7497bb3ca9e6.meta

# du -sh /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/125ad0f8-2672-468f-86a0-115a7be287f0/
4.5G    /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/125ad0f8-2672-468f-86a0-115a7be287f0/

The VM has two disks, 35 GB and 300 GB, not full but quite occupied.

Can I simply kill the qemu-img processes on the chosen hypervisor (I suppose the SPM one)? Any way to track down why it is so slow?

Thanks,
Gianluca

On Sat, Jul 3, 2021 at 3:46 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
Can I simply kill the qemu-img processes on the chosen hypervisor (I suppose the SPM one)?
Killing the qemu-img process is the only way to stop qemu-img. The system is designed to clean up properly after qemu-img terminates.

If this capability is important to you, you can file an RFE to allow aborting jobs from the engine UI/API. This is already implemented internally, but we did not expose the capability.

It would be useful to understand why qemu-img convert does not make progress. If you can reproduce this by running qemu-img from the shell, it can be useful to run it via strace and ask about this on the qemu-block mailing list.

Example strace usage:

strace -o convert.log -f -tt -T qemu-img convert ...

Also the output of nfsstat during the copy can help.

Nir
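For example, a minimal way to capture nfsstat while the copy runs (a sketch using only standard nfs-utils; the 10-second interval and the log file name are arbitrary, stop it with Ctrl-C when the export ends):

# snapshot the client-side NFS counters every 10 seconds
while true; do date; nfsstat -c; sleep 10; done >> nfsstat-during-copy.log

Comparing successive snapshots of the write/commit counters shows whether the client is making any progress at all.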

Isn't it better to strace it before killing qemu-img?

Best Regards,
Strahil Nikolov

On Sun, Jul 4, 2021 at 11:30 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
Isn't it better to strace it before killing qemu-img?
It may be too late, but it may help to understand why this qemu-img run got stuck.
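For example (a sketch only -- substitute the actual pid of the running qemu-img, one strace per process if the VM has several disks):

# attach to the running copy; -f follows threads, -tt/-T add timestamps and per-syscall timings
strace -f -tt -T -o convert.log -p PID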

On Sun, Jul 4, 2021 at 1:01 PM Nir Soffer <nsoffer@redhat.com> wrote:
Hi, thanks for your answers and suggestions.

That env was a production one, so I was forced to power off the hypervisor and power it on again (it was a maintenance window with all the VMs powered down anyway). I was also unable to put the host into maintenance because it replied that there were some tasks running, even after the kill: the two processes (the VM had two disks to export, so two qemu-img processes) remained defunct, and after several minutes there was no change in the web admin feedback about the operation...

My first suspicion was something related to firewall congestion, because the hypervisor network and the NAS appliance are in different networks and I wasn't sure if a firewall was in place between them. But on a test oVirt environment with the same oVirt version and the same network for the hypervisors, I was able to put a Linux server on the same network as the NAS and configure it as an NFS server. That export ran with a throughput of about 50 MB/s, so no firewall problem: a VM with a 55 GB disk exported in 19 minutes.

So I got the rights to mount the NAS on the test env, mounted it as an export domain, and now I have the same problem in a place where I can debug it. Same VM, with only one disk (55 GB). The process:

vdsm 14342 3270 0 11:17 ? 00:00:03 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/blockSD/679c0725-75fb-4af7-bff1-7c447c5d789c/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379 -O raw -o preallocation=falloc /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379

On the hypervisor the ls commands pretty much hang, so from another hypervisor I see that the disk size seems to stay at 4 GB even though the timestamp updates...

# ll /rhev/data-center/mnt/172.16.1.137\:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/
total 4260941
-rw-rw----. 1 nobody nobody 4363202560 Jul  5 11:23 d2a89b5e-7d62-4695-96d8-b762ce52b379
-rw-r--r--. 1 nobody nobody        261 Jul  5 11:17 d2a89b5e-7d62-4695-96d8-b762ce52b379.meta

On the host console I see a throughput of 4 Mbit/s...

# strace -p 14342
strace: Process 14342 attached
ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8

# ll /proc/14342/fd
hangs...
# nfsstat -v
Client packet stats:
packets    udp        tcp        tcpconn
0          0          0          0

Client rpc stats:
calls      retrans    authrefrsh
31171856   0          31186615

Client nfs v4:
null         read         write        commit       open         open_conf
0         0% 2339179   7% 14872911 47% 7233      0% 74956     0% 2         0%
open_noat    open_dgrd    close        setattr      fsinfo       renew
2312347   7% 0         0% 2387293   7% 24        0% 23        0% 5         0%
setclntid    confirm      lock         lockt        locku        access
3         0% 3         0% 8         0% 8         0% 5         0% 1342746   4%
getattr      lookup       lookup_root  remove       rename       link
3031001   9% 71551     0% 7         0% 74590     0% 6         0% 0         0%
symlink      create       pathconf     statfs       readlink     readdir
0         0% 9         0% 16        0% 4548231  14% 0         0% 98506     0%
server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
39        0% 14        0% 0         0% 0         0% 0         0% 0         0%
secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
0         0% 0         0% 4         0% 2         0% 1         0% 0         0%
reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
0         0% 2         0% 0         0% 0         0% 0         0% 0         0%
(null)
5         0%

# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd     free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 3  1      0 82867112 437548 7066580    0    0    54     1    0    0  0  0 100  0  0
 0  1      0 82867024 437548 7066620    0    0  1708     0 3720 8638  0  0  95  4  0
 4  1      0 82868728 437552 7066616    0    0   875     9 3004 8457  0  0  95  4  0
 0  1      0 82869600 437552 7066636    0    0  1785     6 2982 8359  0  0  95  4  0

I see the blocked process, which is my qemu-img one...

In the messages log of the hypervisor:

Jul  5 11:33:06 node4 kernel: INFO: task qemu-img:14343 blocked for more than 120 seconds.
Jul  5 11:33:06 node4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul  5 11:33:06 node4 kernel: qemu-img        D ffff9d960e7e1080     0 14343   3328 0x00000080
Jul  5 11:33:06 node4 kernel: Call Trace:
Jul  5 11:33:06 node4 kernel: [<ffffffffa72de185>] ? sched_clock_cpu+0x85/0xc0
Jul  5 11:33:06 node4 kernel: [<ffffffffa72da830>] ? try_to_wake_up+0x190/0x390
Jul  5 11:33:06 node4 kernel: [<ffffffffa7988089>] schedule_preempt_disabled+0x29/0x70
Jul  5 11:33:06 node4 kernel: [<ffffffffa7985ff7>] __mutex_lock_slowpath+0xc7/0x1d0
Jul  5 11:33:06 node4 kernel: [<ffffffffa79853cf>] mutex_lock+0x1f/0x2f
Jul  5 11:33:06 node4 kernel: [<ffffffffc0db5489>] nfs_start_io_write+0x19/0x40 [nfs]
Jul  5 11:33:06 node4 kernel: [<ffffffffc0dad0d1>] nfs_file_write+0x81/0x1e0 [nfs]
Jul  5 11:33:06 node4 kernel: [<ffffffffa744d063>] do_sync_write+0x93/0xe0
Jul  5 11:33:06 node4 kernel: [<ffffffffa744db50>] vfs_write+0xc0/0x1f0
Jul  5 11:33:06 node4 kernel: [<ffffffffa744eaf2>] SyS_pwrite64+0x92/0xc0
Jul  5 11:33:06 node4 kernel: [<ffffffffa7993ec9>] ? system_call_after_swapgs+0x96/0x13a
Jul  5 11:33:06 node4 kernel: [<ffffffffa7993f92>] system_call_fastpath+0x25/0x2a
Jul  5 11:33:06 node4 kernel: [<ffffffffa7993ed5>] ? system_call_after_swapgs+0xa2/0x13a

Possibly problems with NFSv4? I see that it mounts as NFS v4.0:

# mount
. . .
172.16.1.137:/nas/EXPORT-DOMAIN on /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,clientaddr=192.168.50.52,local_lock=none,addr=172.16.1.137)

This is a test oVirt env so I can wait and eventually test something... Let me know your suggestions.

Gianluca
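As an aside, a quick way to see which NFS version and mount options were actually negotiated for each oVirt mount (a sketch; it only needs nfs-utils):

# per-mount NFS options, including vers=, rsize/wsize, timeo and retrans
nfsstat -m

The vers= field is the one that matters for the preallocation behaviour discussed below.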

That NFS looks like it is not properly configured -> nobody:nobody is not supposed to be seen.

Change the ownership from the NFS side to 36:36. Also, you can define (all_squash,anonuid=36,anongid=36) as export options.

Best Regards,
Strahil Nikolov
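On a plain Linux NFS server that corresponds to an /etc/exports entry roughly like the one below (a sketch: the export path is the one used in this thread, the 192.168.50.0/24 client network is an assumption taken from the mount output, and whether the appliance exposes equivalent options is a separate question):

# /etc/exports -- squash every client user to vdsm:kvm (uid/gid 36)
/nas/EXPORT-DOMAIN 192.168.50.0/24(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)

Then reload the export table with:

exportfs -ra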

On Mon, Jul 5, 2021 at 11:56 AM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
That NFS looks like it is not properly configured -> nobody:nobody is not supposed to be seen.
Change the ownership from nfs side to 36:36. Also, you can define (all_squash,anonuid=36,anongid=36) as export options.
I have those options in my test with a Linux box exporting via NFS. But from the appliance point of view I have to check if it is possible... it is not under my control and I don't know that appliance's architecture. Anyone?

Gianluca

On Mon, Jul 5, 2021 at 12:50 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
vdsm 14342 3270 0 11:17 ? 00:00:03 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/blockSD/679c0725-75fb-4af7-bff1-7c447c5d789c/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379 -O raw -o preallocation=falloc /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379
-o preallocation + NFS 4.0 + very slow NFS is your problem.

qemu-img is using posix_fallocate() to preallocate the entire image at the start of the copy. With NFS 4.2 this uses the fallocate() Linux-specific syscall, which allocates the space very efficiently in no time. With older NFS versions, this becomes a very slow loop, writing one byte for every 4k block.

If you see -o preallocation, it means you are using an old vdsm version; we stopped using -o preallocation in 4.4.2, see https://bugzilla.redhat.com/1850267.
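A way to see this effect without involving vdsm at all is to create a file with the same preallocation mode directly on the export-domain mount (a sketch -- the target directory under the mount is an assumption, any writable location there will do):

# on NFS 4.2 this returns almost immediately; on NFS 4.0/4.1 posix_fallocate()
# falls back to writing one byte per 4k block, the pwrite64 loop visible in strace
time qemu-img create -f raw -o preallocation=falloc \
  /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/ansible_ova/prealloc-test.img 10G

# remove the test file afterwards
rm -f /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/ansible_ova/prealloc-test.img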
# strace -p 14342
This shows only the main thread; use -f to show all threads.
Jul  5 11:33:06 node4 kernel: INFO: task qemu-img:14343 blocked for more than 120 seconds.
Jul  5 11:33:06 node4 kernel: qemu-img        D ffff9d960e7e1080     0 14343   3328 0x00000080
Jul  5 11:33:06 node4 kernel: Call Trace:
[...]
Jul  5 11:33:06 node4 kernel: [<ffffffffc0db5489>] nfs_start_io_write+0x19/0x40 [nfs]
Jul  5 11:33:06 node4 kernel: [<ffffffffc0dad0d1>] nfs_file_write+0x81/0x1e0 [nfs]
Jul  5 11:33:06 node4 kernel: [<ffffffffa744d063>] do_sync_write+0x93/0xe0
Jul  5 11:33:06 node4 kernel: [<ffffffffa744db50>] vfs_write+0xc0/0x1f0
Jul  5 11:33:06 node4 kernel: [<ffffffffa744eaf2>] SyS_pwrite64+0x92/0xc0
[...]
Looks like qemu-img is stuck writing to the NFS server.
Possibly problems with NFSv4? I see that it mounts as nfsv4:
# mount . . . 172.16.1.137:/nas/EXPORT-DOMAIN on /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,clientaddr=192.168.50.52,local_lock=none,addr=172.16.1.137)
This is a test oVirt env so I can wait and eventually test something... Let me know your suggestions
I would start by changing the NFS storage domain to version 4.2:

1. kill the hung qemu-img (it probably cannot be killed, but worth trying)
2. deactivate the storage domain
3. fix the ownership on the storage domain (should be vdsm:kvm, not nobody:nobody)
4. in ovirt engine: manage storage domain -> advanced options -> nfs version: 4.2
5. activate the storage domain
6. try again to export the disk

Finally, I think we have a management issue here. It does not make sense to use a preallocated disk on an export domain. Using a preallocated disk makes sense on a data domain, when you want to prevent the case of failing with ENOSPC when the VM writes to the disk.

Disks on the export domain are never used by a running VM, so there is no reason to preallocate them. The system should always use sparse disks when copying to an export domain. When importing disks from an export domain, the system should reconstruct the original disk configuration (e.g. raw-preallocated).

Nir
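Step 3 would look roughly like this, run on the NFS server itself (a sketch, assuming the export path used in this thread and that uid/gid 36 map to vdsm:kvm, as they do on the hypervisors):

# give the whole export-domain tree back to vdsm:kvm
chown -R 36:36 /nas/EXPORT-DOMAIN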

On Mon, Jul 5, 2021 at 2:13 PM Nir Soffer <nsoffer@redhat.com> wrote:
If you see -o preallocation, it means you are using an old vdsm version; we stopped using -o preallocation in 4.4.2, see https://bugzilla.redhat.com/1850267.
OK. As I said at the beginning, the environment is the latest 4.3. We are going to upgrade to 4.4 and we are making some complementary backups, for safety.
# strace -p 14342
This shows only the main thread; use -f to show all threads.
# strace -f -p 14342
strace: Process 14342 attached with 2 threads
[pid 14342] ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8 <unfinished ...>
[pid 14343] pwrite64(12, "\0", 1, 16474968063) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474972159) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474976255) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474980351) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474984447) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474988543) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474992639) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474996735) = 1
[pid 14343] pwrite64(12, "\0", 1, 16475000831) = 1
[pid 14343] pwrite64(12, "\0", 1, 16475004927) = 1
. . . and so on . . .
I would start by changing the NFS storage domain to version 4.2.
I'm going to try. Right now I have it set to the default of autonegotiated...
1. kill the hung qemu-img (it probably cannot be killed, but worth trying)
2. deactivate the storage domain
3. fix the ownership on the storage domain (should be vdsm:kvm, not nobody:nobody)
Unfortunately it is an appliance. I have asked the guys in charge of it whether we can set those options. Thanks for the other concepts you explained.

Gianluca

On Mon, Jul 5, 2021 at 3:36 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
# strace -f -p 14342
strace: Process 14342 attached with 2 threads
[pid 14342] ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8 <unfinished ...>
[pid 14343] pwrite64(12, "\0", 1, 16474968063) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474972159) = 1
[pid 14343] pwrite64(12, "\0", 1, 16474976255) = 1
[...]
qemu-img is busy in posix_fallocate(), writing one byte to every 4k block.

If you add -tt -T (as I suggested), we can see how much time each write takes, which may explain why this takes so much time:

strace -f -p 14342 -tt -T

On Mon, Jul 5, 2021 at 5:06 PM Nir Soffer <nsoffer@redhat.com> wrote:
qemu-img is busy in posix_fallocate(), writing one byte to every 4k block.
If you add -tt -T (as I suggested), we can see how much time each write takes, which may explain why this takes so much time:
strace -f -p 14342 -tt -T
It seems I missed part of your suggestion... I didn't get the "-tt -T" (or I didn't see it...).

With it I get this during the export (the host console shows about 4 Mbit/s of network traffic):

# strace -f -p 25243 -tt -T
strace: Process 25243 attached with 2 threads
[pid 25243] 09:17:32.503907 ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8 <unfinished ...>
[pid 25244] 09:17:32.694207 pwrite64(12, "\0", 1, 3773509631) = 1 <0.000059>
[pid 25244] 09:17:32.694412 pwrite64(12, "\0", 1, 3773513727) = 1 <0.000078>
[pid 25244] 09:17:32.694608 pwrite64(12, "\0", 1, 3773517823) = 1 <0.000056>
[pid 25244] 09:17:32.694729 pwrite64(12, "\0", 1, 3773521919) = 1 <0.000024>
[pid 25244] 09:17:32.694796 pwrite64(12, "\0", 1, 3773526015) = 1 <0.000020>
[pid 25244] 09:17:32.694855 pwrite64(12, "\0", 1, 3773530111) = 1 <0.000015>
[pid 25244] 09:17:32.694908 pwrite64(12, "\0", 1, 3773534207) = 1 <0.000014>
[pid 25244] 09:17:32.694950 pwrite64(12, "\0", 1, 3773538303) = 1 <0.000016>
[pid 25244] 09:17:32.694993 pwrite64(12, "\0", 1, 3773542399) = 1 <0.200032>
[pid 25244] 09:17:32.895140 pwrite64(12, "\0", 1, 3773546495) = 1 <0.000034>
[pid 25244] 09:17:32.895227 pwrite64(12, "\0", 1, 3773550591) = 1 <0.000029>
[pid 25244] 09:17:32.895296 pwrite64(12, "\0", 1, 3773554687) = 1 <0.000024>
[pid 25244] 09:17:32.895353 pwrite64(12, "\0", 1, 3773558783) = 1 <0.000016>
[pid 25244] 09:17:32.895400 pwrite64(12, "\0", 1, 3773562879) = 1 <0.000015>
[pid 25244] 09:17:32.895443 pwrite64(12, "\0", 1, 3773566975) = 1 <0.000015>
[pid 25244] 09:17:32.895485 pwrite64(12, "\0", 1, 3773571071) = 1 <0.000015>
[pid 25244] 09:17:32.895527 pwrite64(12, "\0", 1, 3773575167) = 1 <0.000017>
[pid 25244] 09:17:32.895570 pwrite64(12, "\0", 1, 3773579263) = 1 <0.199493>
[pid 25244] 09:17:33.095147 pwrite64(12, "\0", 1, 3773583359) = 1 <0.000031>
[pid 25244] 09:17:33.095262 pwrite64(12, "\0", 1, 3773587455) = 1 <0.000061>
[pid 25244] 09:17:33.095378 pwrite64(12, "\0", 1, 3773591551) = 1 <0.000027>
[pid 25244] 09:17:33.095445 pwrite64(12, "\0", 1, 3773595647) = 1 <0.000021>
[pid 25244] 09:17:33.095498 pwrite64(12, "\0", 1, 3773599743) = 1 <0.000016>
[pid 25244] 09:17:33.095542 pwrite64(12, "\0", 1, 3773603839) = 1 <0.000014>
. . .

BTW: it seems my NAS appliance doesn't support version 4.2 of NFS, because if I force it, I get an error on mount, and engine.log shows this error for both nodes as they try to mount:

2021-07-05 17:01:56,082+02 ERROR [org.ovirt.engine.core.bll.storage.connection.FileStorageHelper] (EE-ManagedThreadFactory-engine-Thread-2554190) [642eb6be] The connection with details '172.16.1.137:/nas/EXPORT-DOMAIN' failed because of error code '477' and error message is: problem while trying to mount target

and in vdsm.log:

MountError: (32, ';mount.nfs: Protocol not supported\n')

With NFSv3 I apparently get the same command:

vdsm 19702 3036 7 17:15 ? 00:00:02 /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/mnt/blockSD/679c0725-75fb-4af7-bff1-7c447c5d789c/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379 -O raw -o preallocation=falloc /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379

The file size seems bigger, but anyway the throughput is very low, as with NFS v4.

Gianluca

On Tue, Jul 6, 2021 at 10:21 AM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
# strace -f -p 25243 -tt -T
strace: Process 25243 attached with 2 threads
[pid 25243] 09:17:32.503907 ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8 <unfinished ...>
[pid 25244] 09:17:32.694207 pwrite64(12, "\0", 1, 3773509631) = 1 <0.000059>
[pid 25244] 09:17:32.694412 pwrite64(12, "\0", 1, 3773513727) = 1 <0.000078>
[...]
[pid 25244] 09:17:32.694993 pwrite64(12, "\0", 1, 3773542399) = 1 <0.200032>
[pid 25244] 09:17:32.895140 pwrite64(12, "\0", 1, 3773546495) = 1 <0.000034>
[...]
Most writes are pretty fast, but from time to time there is a very slow write. From the small sample you posted, we have:

awk '{print $11}' strace.out |
  sed -e "s/<//" -e "s/>//" |
  awk '{sum+=$1; if ($1 < 0.1) {fast+=$1; fast_nr++} else {slow+=$1; slow_nr++}}
       END{printf "average: %.6f slow: %.6f fast: %.6f\n", sum/NR, slow/slow_nr, fast/fast_nr}'
average: 0.016673 slow: 0.199763 fast: 0.000028

Preallocating a 300 GiB disk will take about 15 days :-)
300*1024**3 / 4096 * 0.016673 / 3600 / 24
15.17613511111111

If all writes were fast, it would take less than an hour:

300*1024**3 / 4096 * 0.000028 / 3600
0.6116693333333333
BTW: it seems my NAS appliance doesn't support 4.2 version of NFS, because if I force it, I then get an error in mount and in engine.log this error for both nodes as they try to mount:
2021-07-05 17:01:56,082+02 ERROR [org.ovirt.engine.core.bll.storage.connection.FileStorageHelper] (EE-ManagedThreadFactory-engine-Thread-2554190) [642eb6be] The connection with details '172.16.1.137:/nas/EXPORT-DOMAIN' failed because of error code '477' and error message is: problem while trying to mount target
and in vdsm.log: MountError: (32, ';mount.nfs: Protocol not supported\n')
Too bad.

You can evaluate how oVirt 4.4 will work with this appliance using this dd command:

dd if=/dev/zero bs=8M count=38400 of=/path/to/new/disk oflag=direct conv=fsync

We don't use dd for this, but the operation is the same on NFS < 4.2.

Based on the 50 MiB/s rate you reported earlier, I guess you have a 1 Gbit network to this appliance, so zeroing can do up to 128 MiB/s, which will take about 40 minutes for 300G.

Using NFS 4.2, fallocate will complete in less than a second. Here is an example from my test system, creating a 90g raw preallocated volume:

2021-07-06 15:46:40,382+0300 INFO (tasks/1) [storage.Volume] Request to create RAW volume /rhev/data-center/mnt/storage2:_export_00/a600ba04-34f9-4793-a5dc-6d4150716d14/images/bcf7c623-8fd8-47b3-aaee-a65c0872536d/82def38d-b41b-4126-826e-0513d669f1b5 with capacity = 96636764160 (fileVolume:493)
...
2021-07-06 15:46:40,447+0300 INFO (tasks/1) [storage.Volume] Preallocating volume /rhev/data-center/mnt/storage2:_export_00/a600ba04-34f9-4793-a5dc-6d4150716d14/images/bcf7c623-8fd8-47b3-aaee-a65c0872536d/82def38d-b41b-4126-826e-0513d669f1b5: 0.05 seconds (utils:390)

Nir

On Tue, Jul 6, 2021 at 2:52 PM Nir Soffer <nsoffer@redhat.com> wrote:
I confirm I'm able to saturate the 1 Gb/s link. I tried creating a 10 GB file on the StoreOnce appliance:

# time dd if=/dev/zero bs=8M count=1280 of=/rhev/data-center/mnt/172.16.1.137\:_nas_EXPORT-DOMAIN/ansible_ova/test.img oflag=direct conv=fsync
1280+0 records in
1280+0 records out
10737418240 bytes (11 GB) copied, 98.0172 s, 110 MB/s

real    1m38.035s
user    0m0.003s
sys     0m2.366s

So are you saying that after upgrading to 4.4.6 (or the just released 4.4.7) I should be able to export at this speed? Or do I need NFS v4.2 anyway?

BTW: is there any capping put in place by oVirt on the export phase (the qemu-img command in practice), designed for example not to perturb the activity of the hypervisor? Or do you think that with a 10 Gb/s network backend, powerful disks on oVirt and powerful NFS server processing power I should get much more speed?
Based on the 50 MiB/s rate you reported earlier, I guess you have a 1Gbit network to this appliance, so zeroing can do up to 128 MiB/s, which will take about 40 minutes for 300G.
Using NFS 4.2, fallocate will complete in less than a second.
I can sort of confirm this also for 4.3.10. I have a test CentOS 7.4 VM configured as an NFS server and, if I configure it as an export domain using the default autonegotiate option, it is (strangely enough) mounted as NFS v4.1 and the initial fallocate takes some minutes (55 GB disk). If I reconfigure it forcing NFS v4.2, the initial fallocate is immediate, in the sense that "ls -l" on the export domain shows the size of the virtual disk almost right away.

Thanks,
Gianluca
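A quick way to check whether the exported volume was really preallocated rather than left sparse (a sketch; the UUID placeholders stand for the storage-domain, image and volume IDs shown earlier in the thread):

# compare apparent size with the space actually allocated on the export domain
stat -c 'apparent: %s bytes, allocated: %b blocks of %B bytes' \
  /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/<sd-uuid>/images/<image-uuid>/<volume-uuid>

On a fully preallocated raw volume the allocated space roughly matches the apparent size; on a sparse file it stays much smaller until data is written.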

On Tue, Jul 6, 2021 at 5:55 PM Gianluca Cecchi <gianluca.cecchi@gmail.com> wrote:
So are you saying that after upgrading to 4.4.6 (or just released 4.4.7) I should be able to export with this speed?
The preallocation part will run at the same speed, and then you need to copy the used parts of the disk, time depending on how much data is used.
Or do I need NFS v4.2 anyway?
That is without NFS 4.2. With NFS 4.2 the entire allocation will take less than a second, without consuming any network bandwidth.
BTW: is there any capping put in place by oVirt on the export phase (the qemu-img command in practice), designed for example not to perturb the activity of the hypervisor? Or do you think that with a 10 Gb/s network backend, powerful disks on oVirt and powerful NFS server processing power I should get much more speed?
We don't have any capping in place; usually people complain that copying images is too slow.

In general, when copying to file-based storage we don't use the -W option (unordered writes), so the copy will be slower compared with block-based storage, where qemu-img uses 8 concurrent writes. So in a way we always cap the copies to file-based storage. To get maximum throughput you need to run multiple copies at the same time.

Nir
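For a one-off manual copy outside of vdsm, the effect of unordered writes can be tried directly (a sketch, not what oVirt itself runs -- fill in real source and target paths):

# -W allows out-of-order writes, -m sets the number of coroutines (8 is also the default)
qemu-img convert -p -f raw -O raw -W -m 8 /path/to/source.img /path/to/target.img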

Disks on the export domain are never used by a running VM so there is no reason to preallocate them. The system should always use sparse disks when copying to export domain.
When importing disks from export domain, the system should reconstruct the original disk configuration (e.g. raw-preallocated).
Hey Nir,

I think you are wrong. In order to minimize the downtime, many users would use storage migration while the VM is running, then power off, detach and attach on the new location, power on and live migrate while the VM works. I think preallocation should be based on VM status (running or offline).

Best Regards,
Strahil Nikolov

On Mon, Jul 5, 2021 at 4:06 PM Strahil Nikolov <hunter86_bg@yahoo.com> wrote:
I think you are wrong. In order to minimize the downtime, many users would use storage migration while the VM is running, then power off, detach and attach on the new location, power on and live migrate while the VM works.
Live storage migration (moving a disk while the VM is running) is possible only between data domains; it requires no downtime and no detach/attach.

I'm not sure it is possible to export a VM to an export domain while the VM is running (maybe exporting a snapshot works in 4.4). Anyway, assuming you can export while the VM is running, the target disk will never be used by any VM. When the export is done, you need to import the VM back into the same or another system, copying the disk to a data domain. So we have:

original disk: raw-preallocated on data domain 1
exported disk: raw-sparse or qcow2-sparse on export domain
target disk: raw-preallocated on data domain 2

There is no reason to use a preallocated disk for the temporary disk created in the export domain.

Nir
Participants (3): Gianluca Cecchi, Nir Soffer, Strahil Nikolov