On Mon, Jul 5, 2021 at 12:50 PM Gianluca Cecchi
<gianluca.cecchi(a)gmail.com> wrote:
On Sun, Jul 4, 2021 at 1:01 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
>
> On Sun, Jul 4, 2021 at 11:30 AM Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
> >
> > Isn't it better to strace it before killing qemu-img?
>
> It may be too late, but it may help to understand why this qemu-img
> run got stuck.
>
Hi, thanks for your answers and suggestions.
That env was a production one, so I was forced to power off the hypervisor and power it
on again (it was a maintenance window with all the VMs powered down anyway). I was also
unable to put the host into maintenance because it replied that there were some tasks
running, even after the kill: the two processes (the VM had 2 disks to export, and so
two qemu-img processes) remained defunct, and after several minutes there was no change
in the web admin feedback about the task....
My first suspicion was something related to firewall congestion, because the hypervisor
network and the NAS appliance were on different networks and I wasn't sure whether a
firewall was in place between them....
But on a test oVirt environment with the same oVirt version and the same network for the
hypervisors, I was able to set up a Linux server on the same network as the NAS and
configure it as an NFS server.
That export ran with a throughput of about 50 MB/s, so no firewall problem: a VM with a
55 GB disk exported in 19 minutes.
So I got the rights to mount the NAS on the test env, mounted it as an export domain, and
now I have the same problem, which I can debug.
The same VM, with only one disk (55 GB). The process:
vdsm 14342 3270 0 11:17 ? 00:00:03 /usr/bin/qemu-img convert -p -t none -T
none -f raw
/rhev/data-center/mnt/blockSD/679c0725-75fb-4af7-bff1-7c447c5d789c/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379
-O raw -o preallocation=falloc
/rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379
-o preallocation + NFS 4.0 + very slow NFS is your problem.
qemu-img uses posix_fallocate() to preallocate the entire image at
the start of the copy. With NFS 4.2
this maps to the Linux-specific fallocate() syscall, which allocates the space
very efficiently in no time. With older
NFS versions it becomes a very slow loop, writing one byte for
every 4k block.
If you see -o preallocation, it means you are using an old vdsm
version; we stopped using -o preallocation
in 4.4.2, see
https://bugzilla.redhat.com/1850267.
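To confirm this is what is happening, you could attach strace to the running copy and
watch only the allocation-related syscalls; this is just a suggestion based on the PID
shown above, not something I have run against your setup:
# attach to all threads of the qemu-img process and trace the relevant syscalls
strace -f -e trace=fallocate,pwrite64 -p 14342
# on an NFS 4.2 mount you would expect a single successful fallocate();
# on NFS 4.0/4.1 fallocate() fails with EOPNOTSUPP and glibc falls back to
# a long stream of 1-byte pwrite64() calls, one per 4k block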
On the hypervisor the ls commands pretty much hang, so from another
hypervisor I can see that the disk size seems to stay at about 4 GB even though the timestamp keeps updating...
# ll
/rhev/data-center/mnt/172.16.1.137\:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/
total 4260941
-rw-rw----. 1 nobody nobody 4363202560 Jul 5 11:23 d2a89b5e-7d62-4695-96d8-b762ce52b379
-rw-r--r--. 1 nobody nobody 261 Jul 5 11:17
d2a89b5e-7d62-4695-96d8-b762ce52b379.meta
On the host console I see a throughput of about 4 Mbit/s...
# strace -p 14342
This shows only the main thread; use -f to show all threads.
strace: Process 14342 attached
ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8
# ll /proc/14342/fd
hangs...
# nfsstat -v
Client packet stats:
packets udp tcp tcpconn
0 0 0 0
Client rpc stats:
calls retrans authrefrsh
31171856 0 31186615
Client nfs v4:
null read write commit open open_conf
0 0% 2339179 7% 14872911 47% 7233 0% 74956 0% 2 0%
open_noat open_dgrd close setattr fsinfo renew
2312347 7% 0 0% 2387293 7% 24 0% 23 0% 5 0%
setclntid confirm lock lockt locku access
3 0% 3 0% 8 0% 8 0% 5 0% 1342746 4%
getattr lookup lookup_root remove rename link
3031001 9% 71551 0% 7 0% 74590 0% 6 0% 0 0%
symlink create pathconf statfs readlink readdir
0 0% 9 0% 16 0% 4548231 14% 0 0% 98506 0%
server_caps delegreturn getacl setacl fs_locations rel_lkowner
39 0% 14 0% 0 0% 0 0% 0 0% 0 0%
secinfo exchange_id create_ses destroy_ses sequence get_lease_t
0 0% 0 0% 4 0% 2 0% 1 0% 0 0%
reclaim_comp layoutget getdevinfo layoutcommit layoutreturn getdevlist
0 0% 2 0% 0 0% 0 0% 0 0% 0 0%
(null)
5 0%
# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 0 82867112 437548 7066580 0 0 54 1 0 0 0 0 100 0 0
0 1 0 82867024 437548 7066620 0 0 1708 0 3720 8638 0 0 95 4 0
4 1 0 82868728 437552 7066616 0 0 875 9 3004 8457 0 0 95 4 0
0 1 0 82869600 437552 7066636 0 0 1785 6 2982 8359 0 0 95 4 0
I can see the blocked process, which is my qemu-img one...
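A quick way to list which tasks are in uninterruptible sleep (the "b" column above),
together with the kernel function they are blocked in, is something like:
# list threads in D state with their wait channel
ps -eLo pid,tid,stat,wchan:32,cmd | awk '$3 ~ /^D/'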
In the messages log of the hypervisor:
Jul 5 11:33:06 node4 kernel: INFO: task qemu-img:14343 blocked for more than 120
seconds.
Jul 5 11:33:06 node4 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 5 11:33:06 node4 kernel: qemu-img D ffff9d960e7e1080 0 14343 3328
0x00000080
Jul 5 11:33:06 node4 kernel: Call Trace:
Jul 5 11:33:06 node4 kernel: [<ffffffffa72de185>] ? sched_clock_cpu+0x85/0xc0
Jul 5 11:33:06 node4 kernel: [<ffffffffa72da830>] ? try_to_wake_up+0x190/0x390
Jul 5 11:33:06 node4 kernel: [<ffffffffa7988089>]
schedule_preempt_disabled+0x29/0x70
Jul 5 11:33:06 node4 kernel: [<ffffffffa7985ff7>]
__mutex_lock_slowpath+0xc7/0x1d0
Jul 5 11:33:06 node4 kernel: [<ffffffffa79853cf>] mutex_lock+0x1f/0x2f
Jul 5 11:33:06 node4 kernel: [<ffffffffc0db5489>] nfs_start_io_write+0x19/0x40
[nfs]
Jul 5 11:33:06 node4 kernel: [<ffffffffc0dad0d1>] nfs_file_write+0x81/0x1e0 [nfs]
Jul 5 11:33:06 node4 kernel: [<ffffffffa744d063>] do_sync_write+0x93/0xe0
Jul 5 11:33:06 node4 kernel: [<ffffffffa744db50>] vfs_write+0xc0/0x1f0
Jul 5 11:33:06 node4 kernel: [<ffffffffa744eaf2>] SyS_pwrite64+0x92/0xc0
Jul 5 11:33:06 node4 kernel: [<ffffffffa7993ec9>] ?
system_call_after_swapgs+0x96/0x13a
Jul 5 11:33:06 node4 kernel: [<ffffffffa7993f92>] system_call_fastpath+0x25/0x2a
Jul 5 11:33:06 node4 kernel: [<ffffffffa7993ed5>] ?
system_call_after_swapgs+0xa2/0x13a
Looks like qemu-img is stuck writing to the NFS server.
Possibly problems with NFSv4? I see that it mounts as nfsv4:
# mount
. . .
172.16.1.137:/nas/EXPORT-DOMAIN on /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN
type nfs4
(rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,clientaddr=192.168.50.52,local_lock=none,addr=172.16.1.137)
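One thing I think I can also check is whether the NAS supports NFS 4.2 at all, with a
manual test mount (the /mnt/test mount point here is just an example):
# explicitly request version 4.2; the mount fails if the server does not support it
mkdir -p /mnt/test
mount -t nfs -o vers=4.2 172.16.1.137:/nas/EXPORT-DOMAIN /mnt/test
grep /mnt/test /proc/mounts   # confirm that vers=4.2 was actually negotiated
umount /mnt/test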
This is a test oVirt env so I can wait and eventually test something...
Let me know your suggestions
I would start by changing the NFS storage domain to version 4.2.
1. kill the hung qemu-img (it probably cannot be killed, but worth trying)
2. deactivate the storage domain
3. fix the ownership on the storage domain (should be vdsm:kvm, not
nobody:nobody; see the sketch after this list)
4. in ovirt engine: manage storage domain -> advanced options -> nfs
version: 4.2
5. activate the storage domain
6. try again to export the disk
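For step 3, a minimal sketch, assuming the export domain is still mounted on the host at
the path shown earlier (on oVirt hosts vdsm is uid 36 and kvm is gid 36); if the export
is configured with root squash, the chown may have to be done on the NFS server side
instead:
# make the export domain content owned by vdsm:kvm again
chown -R vdsm:kvm /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106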
Finally, I think we have a management issue here. It does not make sense to use a
preallocated disk on an export domain. Using a preallocated disk makes sense on a data
domain, when you want to prevent the case of failing with ENOSPC when the VM writes to the disk.
Disks on the export domain are never used by a running VM, so there is no reason to
preallocate them. The system should always use sparse disks when copying to an export
domain.
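To illustrate, the copy to an export domain could simply drop the preallocation option;
qemu-img then skips zero areas and the destination raw file ends up sparse on NFS. The
paths below are placeholders and this is only a sketch, not the exact command a newer
vdsm runs:
# same convert, without forcing preallocation of the destination
qemu-img convert -p -t none -T none -f raw /path/to/source/volume -O raw /path/to/destination/volume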
When importing disks from export domain, the system should reconstruct
the original disk
configuration (e.g. raw-preallocated).
Nir