On Mon, Jul 5, 2021 at 12:50 PM Gianluca Cecchi
<gianluca.cecchi(a)gmail.com> wrote:
On Sun, Jul 4, 2021 at 1:01 PM Nir Soffer <nsoffer(a)redhat.com> wrote:
>
> On Sun, Jul 4, 2021 at 11:30 AM Strahil Nikolov <hunter86_bg(a)yahoo.com> wrote:
> >
> > Isn't it better to strace it before killing qemu-img?
>
> It may be too late, but it may help to understand why this qemu-img
> run got stuck.
>
Hi, thanks for your answers and suggestions.
That env was a production one, so I was forced to power off the hypervisor and power it
on again (it was a maintenance window with all the VMs powered down anyway). I was also
unable to put the host into maintenance because it replied that there were some tasks
running, even after the kill: the two processes (the VM had 2 disks to export, and so
two qemu-img processes) remained defunct, and after several minutes there was no change
in the web admin feedback about the task....
My first suspicion was something related to firewall congestion, because the hypervisor
network and the NAS appliance were on different networks and I wasn't sure whether a
firewall was in place between them....
But on a test oVirt environment with the same oVirt version and the same network for the
hypervisors, I was able to set up a Linux server on the same network as the NAS and
configure it as an NFS server.
That export ran with a throughput of about 50 MB/s, so no firewall problem: a VM with a
55 GB disk exported in 19 minutes.
So I got the rights to mount the NAS on the test env, mounted it as an export domain, and
now I have the same problem, which I can debug.
The same VM, with only one disk (55 GB). The process:
vdsm 14342 3270 0 11:17 ? 00:00:03 /usr/bin/qemu-img convert -p -t none -T
none -f raw
/rhev/data-center/mnt/blockSD/679c0725-75fb-4af7-bff1-7c447c5d789c/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379
-O raw -o preallocation=falloc
/rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/d2a89b5e-7d62-4695-96d8-b762ce52b379
-o preallocation + NFS 4.0 + very slow NFS is your problem.
qemu-img uses posix_fallocate() to preallocate the entire image at
the start of the copy. With NFS 4.2
this maps to the Linux-specific fallocate() syscall, which allocates the space
very efficiently in no time. With older
NFS versions it becomes a very slow loop, writing one byte for
every 4k block.
If you see -o preallocation, it means you are using an old vdsm
version; we stopped using -o preallocation
in 4.4.2, see
https://bugzilla.redhat.com/1850267.
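To confirm this is what is happening, you could attach strace to the running copy and
watch only the allocation-related syscalls; this is just a suggestion based on the PID
shown above, not something I have run against your setup:
# attach to all threads of the qemu-img process and trace the relevant syscalls
strace -f -e trace=fallocate,pwrite64 -p 14342
# on an NFS 4.2 mount you would expect a single successful fallocate();
# on NFS 4.0/4.1 fallocate() fails with EOPNOTSUPP and glibc falls back to
# a long stream of 1-byte pwrite64() calls, one per 4k block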
On the hypervisor the ls commands pretty much hang, so from another
hypervisor I can see that the disk size seems to stay at about 4 GB even though the timestamp keeps updating...
# ll
/rhev/data-center/mnt/172.16.1.137\:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106/images/530b3e7f-4ce4-4051-9cac-1112f5f9e8b5/
total 4260941
-rw-rw----. 1 nobody nobody 4363202560 Jul 5 11:23 d2a89b5e-7d62-4695-96d8-b762ce52b379
-rw-r--r--. 1 nobody nobody 261 Jul 5 11:17
d2a89b5e-7d62-4695-96d8-b762ce52b379.meta
On the host console I see a throughput of about 4 Mbit/s...
# strace -p 14342
This shows only the main thread; use -f to show all threads.
strace: Process 14342 attached
ppoll([{fd=9, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8
# ll /proc/14342/fd
hangs...
# nfsstat -v
Client packet stats:
packets udp tcp tcpconn
0 0 0 0
Client rpc stats:
calls retrans authrefrsh
31171856 0 31186615
Client nfs v4:
null read write commit open open_conf
0 0% 2339179 7% 14872911 47% 7233 0% 74956 0% 2 0%
open_noat open_dgrd close setattr fsinfo renew
2312347 7% 0 0% 2387293 7% 24 0% 23 0% 5 0%
setclntid confirm lock lockt locku access
3 0% 3 0% 8 0% 8 0% 5 0% 1342746 4%
getattr lookup lookup_root remove rename link
3031001 9% 71551 0% 7 0% 74590 0% 6 0% 0 0%
symlink create pathconf statfs readlink readdir
0 0% 9 0% 16 0% 4548231 14% 0 0% 98506 0%
server_caps delegreturn getacl setacl fs_locations rel_lkowner
39 0% 14 0% 0 0% 0 0% 0 0% 0 0%
secinfo exchange_id create_ses destroy_ses sequence get_lease_t
0 0% 0 0% 4 0% 2 0% 1 0% 0 0%
reclaim_comp layoutget getdevinfo layoutcommit layoutreturn getdevlist
0 0% 2 0% 0 0% 0 0% 0 0% 0 0%
(null)
5 0%
# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 0 82867112 437548 7066580 0 0 54 1 0 0 0 0 100 0 0
0 1 0 82867024 437548 7066620 0 0 1708 0 3720 8638 0 0 95 4 0
4 1 0 82868728 437552 7066616 0 0 875 9 3004 8457 0 0 95 4 0
0 1 0 82869600 437552 7066636 0 0 1785 6 2982 8359 0 0 95 4 0
I can see the blocked process, which is my qemu-img one...
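A quick way to list which tasks are in uninterruptible sleep (the "b" column above),
together with the kernel function they are blocked in, is something like:
# list threads in D state with their wait channel
ps -eLo pid,tid,stat,wchan:32,cmd | awk '$3 ~ /^D/'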
In the messages log of the hypervisor:
Jul 5 11:33:06 node4 kernel: INFO: task qemu-img:14343 blocked for more than 120
seconds.
Jul 5 11:33:06 node4 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 5 11:33:06 node4 kernel: qemu-img D ffff9d960e7e1080 0 14343 3328
0x00000080
Jul 5 11:33:06 node4 kernel: Call Trace:
Jul 5 11:33:06 node4 kernel: [<ffffffffa72de185>] ? sched_clock_cpu+0x85/0xc0
Jul 5 11:33:06 node4 kernel: [<ffffffffa72da830>] ? try_to_wake_up+0x190/0x390
Jul 5 11:33:06 node4 kernel: [<ffffffffa7988089>]
schedule_preempt_disabled+0x29/0x70
Jul 5 11:33:06 node4 kernel: [<ffffffffa7985ff7>]
__mutex_lock_slowpath+0xc7/0x1d0
Jul 5 11:33:06 node4 kernel: [<ffffffffa79853cf>] mutex_lock+0x1f/0x2f
Jul 5 11:33:06 node4 kernel: [<ffffffffc0db5489>] nfs_start_io_write+0x19/0x40
[nfs]
Jul 5 11:33:06 node4 kernel: [<ffffffffc0dad0d1>] nfs_file_write+0x81/0x1e0 [nfs]
Jul 5 11:33:06 node4 kernel: [<ffffffffa744d063>] do_sync_write+0x93/0xe0
Jul 5 11:33:06 node4 kernel: [<ffffffffa744db50>] vfs_write+0xc0/0x1f0
Jul 5 11:33:06 node4 kernel: [<ffffffffa744eaf2>] SyS_pwrite64+0x92/0xc0
Jul 5 11:33:06 node4 kernel: [<ffffffffa7993ec9>] ?
system_call_after_swapgs+0x96/0x13a
Jul 5 11:33:06 node4 kernel: [<ffffffffa7993f92>] system_call_fastpath+0x25/0x2a
Jul 5 11:33:06 node4 kernel: [<ffffffffa7993ed5>] ?
system_call_after_swapgs+0xa2/0x13a
Looks like qemu-img is stuck writing to the NFS server.
Possibly problems with NFSv4? I see that it mounts as nfsv4:
# mount
. . .
172.16.1.137:/nas/EXPORT-DOMAIN on /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN
type nfs4
(rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,nosharecache,proto=tcp,timeo=600,retrans=6,sec=sys,clientaddr=192.168.50.52,local_lock=none,addr=172.16.1.137)
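One thing I think I can also check is whether the NAS supports NFS 4.2 at all, with a
manual test mount (the /mnt/test mount point here is just an example):
# explicitly request version 4.2; the mount fails if the server does not support it
mkdir -p /mnt/test
mount -t nfs -o vers=4.2 172.16.1.137:/nas/EXPORT-DOMAIN /mnt/test
grep /mnt/test /proc/mounts   # confirm that vers=4.2 was actually negotiated
umount /mnt/test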
This is a test oVirt env so I can wait and eventually test something...
Let me know your suggestions
I would start by changing the NFS storage domain to version 4.2.
1. kill the hung qemu-img (it probably cannot be killed, but worth trying)
2. deactivate the storage domain
3. fix the ownership on the storage domain (should be vdsm:kvm, not
nobody:nobody; see the sketch after this list)
4. in ovirt engine: manage storage domain -> advanced options -> nfs
version: 4.2
5. activate the storage domain
6. try again to export the disk
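For step 3, a minimal sketch, assuming the export domain is still mounted on the host at
the path shown earlier (on oVirt hosts vdsm is uid 36 and kvm is gid 36); if the export
is configured with root squash, the chown may have to be done on the NFS server side
instead:
# make the export domain content owned by vdsm:kvm again
chown -R vdsm:kvm /rhev/data-center/mnt/172.16.1.137:_nas_EXPORT-DOMAIN/20433d5d-9d82-4079-9252-0e746ce54106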
Finally, I think we have a management issue here. It does not make sense to use a
preallocated disk on an export domain. Using a preallocated disk makes sense on a data
domain, when you want to prevent the case of failing with ENOSPC when the VM writes to the disk.
Disks on the export domain are never used by a running VM, so there is no reason to
preallocate them. The system should always use sparse disks when copying to an export
domain.
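To illustrate, the copy to an export domain could simply drop the preallocation option;
qemu-img then skips zero areas and the destination raw file ends up sparse on NFS. The
paths below are placeholders and this is only a sketch, not the exact command a newer
vdsm runs:
# same convert, without forcing preallocation of the destination
qemu-img convert -p -t none -T none -f raw /path/to/source/volume -O raw /path/to/destination/volume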
When importing disks from export domain, the system should reconstruct
the original disk
configuration (e.g. raw-preallocated).
Nir