Sounds exactly like some trouble I was having. I downgraded the kernel
to 4.18.0-448 and everything has been fine since. There have been a couple
of kernel releases since I had problems, but I haven't had a chance to try
them yet. I believe it was in 4.18.0-485 that I first noticed it, but that's
just from memory.
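If it helps, here is a quick way to double-check which kernel a host is
actually booted into and which kernel packages are still installed. This is
only a rough sketch; it assumes the usual EL8 tooling (rpm and grubby) and
root privileges so grubby can read the boot loader configuration:

#!/usr/bin/python3
# Rough sketch: report the running kernel, the installed kernel packages
# and the default boot entry on an EL8 host. rpm and grubby are assumed to
# be present (standard on CentOS Stream 8 / Rocky 8).
import platform
import subprocess

def run(cmd):
    # Python 3.6 friendly (no capture_output/text keyword arguments)
    out = subprocess.run(cmd, stdout=subprocess.PIPE, check=True,
                         universal_newlines=True)
    return out.stdout.strip()

print("running kernel    :", platform.release())
print("installed kernels :")
for pkg in run(["rpm", "-q", "kernel"]).splitlines():
    print("    " + pkg)
print("default boot entry:", run(["grubby", "--default-kernel"]))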
On 5/11/2023 2:26 PM, dominik.drazyk(a)blackrack.pl wrote:
> Hello,
> I have recently migrated our customer's cluster to newer hardware (CentOS 8 Stream,
> 4 hypervisor nodes, 3 hosts with GlusterFS, 5x 6TB SSD as JBOD with replica 3). About
> a month after the switch we started encountering frequent VM locks that require a host
> reboot to unlock the VM. Affected VMs cannot be powered down from the oVirt UI, and
> even when oVirt does manage to power them down, they cannot be booted again because
> the OS disk is reported as in use. Once I reboot the host, the VMs can be turned on
> and everything works fine.
>
> In the vdsm logs I can see the following error:
> 2023-05-11 19:33:12,339+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f553aa3e470>> operation failed (periodic:187)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
>     self._func()
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
>     vm_id, self._qga_call_get_vcpus(vm_obj))
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
>     if 'online' in vcpus:
> TypeError: argument of type 'NoneType' is not iterable
>
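The TypeError above looks like a symptom rather than the cause: the guest-agent
query apparently returned None (the agent inside the stuck guest never answered)
and the poller then evaluated 'online' in vcpus. Something along these lines would
silence the traceback, though not the underlying hang; this is only an illustrative
guard, not the actual vdsm code or its fix:

# Illustrative guard only -- not the actual vdsm code or its fix.
# The crash happens because the guest-agent reply is None when the
# agent inside a frozen guest never answers.
def online_vcpu_count(vcpus):
    """Return the 'online' value from a guest-agent vCPU reply,
    or None if the agent returned nothing usable."""
    if not vcpus or 'online' not in vcpus:
        return None
    return vcpus['online']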
> /var/log/messages reports:
> May 11 19:35:15 kernel: task:CPU 7/KVM state:D stack: 0 pid: 7065 ppid: 1 flags: 0x80000182
> May 11 19:35:15 kernel: Call Trace:
> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
> May 11 19:35:15 kernel: schedule+0x55/0xf0
> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
> May 11 19:35:15 kernel: down_read+0x95/0xa0
> May 11 19:35:15 kernel: get_user_pages_unlocked+0x66/0x2a0
> May 11 19:35:15 kernel: hva_to_pfn+0xf5/0x430 [kvm]
> May 11 19:35:15 kernel: kvm_faultin_pfn+0x95/0x2e0 [kvm]
> May 11 19:35:15 kernel: ? select_task_rq_fair+0x355/0x990
> May 11 19:35:15 kernel: ? sched_clock+0x5/0x10
> May 11 19:35:15 kernel: ? sched_clock_cpu+0xc/0xb0
> May 11 19:35:15 kernel: direct_page_fault+0x3b4/0x860 [kvm]
> May 11 19:35:15 kernel: kvm_mmu_page_fault+0x114/0x680 [kvm]
> May 11 19:35:15 kernel: ? vmx_vmexit+0x9f/0x70d [kvm_intel]
> May 11 19:35:15 kernel: ? vmx_vmexit+0xae/0x70d [kvm_intel]
> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
> May 11 19:35:15 kernel: vmx_handle_exit+0x177/0x770 [kvm_intel]
> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
> May 11 19:35:15 kernel: vcpu_enter_guest+0xafd/0x18e0 [kvm]
> May 11 19:35:15 kernel: ? hrtimer_try_to_cancel+0x7b/0x100
> May 11 19:35:15 kernel: kvm_arch_vcpu_ioctl_run+0x112/0x600 [kvm]
> May 11 19:35:15 kernel: kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
> May 11 19:35:15 kernel: ? pollwake+0x74/0xa0
> May 11 19:35:15 kernel: ? wake_up_q+0x70/0x70
> May 11 19:35:15 kernel: ? __wake_up_common+0x7a/0x190
> May 11 19:35:15 kernel: do_vfs_ioctl+0xa4/0x690
> May 11 19:35:15 kernel: ksys_ioctl+0x64/0xa0
> May 11 19:35:15 kernel: __x64_sys_ioctl+0x16/0x20
> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a1387cb
> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a1387a1.
> May 11 19:35:15 kernel: RSP: 002b:00007fa6f5ffa6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 000055be52e7bcf0 RCX: 00007faf1a1387cb
> May 11 19:35:15 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 000055be5158c6a8 R09: 00000007d9e95a00
> May 11 19:35:15 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
> May 11 19:35:15 kernel: R13: 000055be515bcfc0 R14: 00007fffec958800 R15: 00007faf1d6c6000
> May 11 19:35:15 kernel: INFO: task worker:714626 blocked for more than 120 seconds.
> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
> May 11 19:35:15 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> May 11 19:35:15 kernel: task:worker state:D stack: 0 pid:714626 ppid: 1 flags:0x00000180
> May 11 19:35:15 kernel: Call Trace:
> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
> May 11 19:35:15 kernel: schedule+0x55/0xf0
> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
> May 11 19:35:15 kernel: down_read+0x95/0xa0
> May 11 19:35:15 kernel: do_madvise.part.30+0x2c3/0xa40
> May 11 19:35:15 kernel: ? syscall_trace_enter+0x1ff/0x2d0
> May 11 19:35:15 kernel: ? __x64_sys_madvise+0x26/0x30
> May 11 19:35:15 kernel: __x64_sys_madvise+0x26/0x30
> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a138a4b
> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a138a21.
> May 11 19:35:15 kernel: RSP: 002b:00007faf151ea7f8 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 00007faf149eb000 RCX: 00007faf1a138a4b
> May 11 19:35:15 kernel: RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007faf149eb000
> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 00000007faf080ba R09: 00000000ffffffff
> May 11 19:35:15 kernel: R10: 00007faf151ea760 R11: 0000000000000206 R12: 00007faf15aec48e
> May 11 19:35:15 kernel: R13: 00007faf15aec48f R14: 00007faf151eb700 R15: 00007faf151ea8c0
> May 11 19:35:15 kernel: INFO: task worker:714628 blocked for more than 120 seconds.
> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
> May 11 19:35:15 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
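Both stacks are sitting in rwsem_down_read_slowpath()/down_read() -- the vCPU
thread via get_user_pages_unlocked() in the KVM page-fault path and the worker
via madvise() -- so they appear to be waiting for a read lock on the QEMU
process address space while something else holds it for write. When a host gets
into this state, dumping the kernel stacks of every D-state task can show which
task everybody is queued behind. A rough procfs sketch (run as root;
/proc/<pid>/task/<tid>/stack may be empty on some kernel configs):

# Rough diagnostic sketch: list uninterruptible (D-state) tasks, including
# individual threads, and print their kernel stacks from procfs.
import os

def read(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError:            # task exited, or file not readable
        return ""

for pid in (p for p in os.listdir("/proc") if p.isdigit()):
    try:
        tids = os.listdir("/proc/%s/task" % pid)
    except OSError:
        continue
    for tid in tids:
        status = read("/proc/%s/task/%s/status" % (pid, tid))
        if "\nState:\tD" not in status:
            continue
        name = status.split("\n", 1)[0].split("\t", 1)[-1]
        print("=== tid %s (%s) ===" % (tid, name))
        print(read("/proc/%s/task/%s/stack" % (pid, tid)) or "  <no stack available>")

Running "echo w > /proc/sysrq-trigger" does much the same from the kernel side
and writes the blocked-task list to dmesg.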
> Installed VDSM packages:
> vdsm-api-4.50.3.4-1.el8.noarch
> vdsm-network-4.50.3.4-1.el8.x86_64
> vdsm-yajsonrpc-4.50.3.4-1.el8.noarch
> vdsm-http-4.50.3.4-1.el8.noarch
> vdsm-client-4.50.3.4-1.el8.noarch
> vdsm-4.50.3.4-1.el8.x86_64
> vdsm-gluster-4.50.3.4-1.el8.x86_64
> vdsm-python-4.50.3.4-1.el8.noarch
> vdsm-jsonrpc-4.50.3.4-1.el8.noarch
> vdsm-common-4.50.3.4-1.el8.noarch
>
> Libvirt:
> libvirt-client-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-nodedev-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-logical-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-qemu-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-scsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-core-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-config-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-iscsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-rbd-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-libs-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-config-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-secret-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-disk-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-mpath-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-gluster-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> python3-libvirt-8.0.0-2.module_el8.7.0+1218+f626c2ff.x86_64
> libvirt-daemon-driver-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-lock-sanlock-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-interface-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-iscsi-direct-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-kvm-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>
> While the VMs are locked they do not respond on the network, and I cannot use the VNC
> console (or any other) to check what is happening from the VM's perspective. The host
> cannot list its running processes. There are plenty of resources left and each host
> runs about 30-35 VMs.
> At first I thought it might be related to GlusterFS (I use Gluster on other clusters
> and it usually works fine), so we migrated all VMs back to the old storage (NFS). The
> problem came back today on two hosts. I do not see such issues on another cluster that
> runs Rocky 8.6 with hyperconverged GlusterFS, so as a last resort I'll be migrating
> from CentOS 8 Stream to Rocky 8.
>
> Has anyone observed such issues with oVirt hosts on CentOS 8 Stream? Any form of help
> is welcome, as I'm running out of ideas.
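Since python3-libvirt is installed on the hosts (it is in the package list
above), one quick check while a VM is locked up is whether its guest agent
still answers at all -- that separates "the whole guest is frozen" from "only
the QGA poller is unhappy". A rough sketch, assuming libvirtd itself is still
responsive:

# Rough sketch: from a host, send a guest-ping to the QEMU guest agent in
# every running domain to see which guests still respond at all.
# Uses the python3-libvirt bindings; minimal error handling, illustrative only.
import libvirt
import libvirt_qemu

conn = libvirt.open("qemu:///system")
try:
    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        try:
            # 5 second timeout, no flags
            libvirt_qemu.qemuAgentCommand(dom, '{"execute": "guest-ping"}', 5, 0)
            print("%-40s guest agent OK" % dom.name())
        except libvirt.libvirtError as err:
            print("%-40s no guest agent reply (%s)"
                  % (dom.name(), err.get_error_message()))
finally:
    conn.close()

If even these libvirt calls block, that again points at the host or kernel
rather than the individual guests.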
> _______________________________________________
> Users mailing list -- users(a)ovirt.org
> To unsubscribe send an email to users-leave(a)ovirt.org
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/2C7MAUFIUW6...