Hi,

We are experiencing the same issue. We migrated one host from CentOS
Stream 8 to Rocky 8.8 and see the same problem there with the EL 8.8
kernel. We do not see it on our 8.7 hosts.
Regards,
Rik
On 5/15/23 22:48, Jeff Bailey wrote:
Sounds exactly like some trouble I was having. I downgraded the
kernel to 4.18.0-448 and everything has been fine since. There have
been a couple of kernel releases since I had problems, but I haven't
had a chance to try them yet. I believe it was 4.18.0-485 where I
first noticed it, but that's just from memory.
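For anyone who wants to check whether a host is running one of the suspect
builds before deciding to downgrade, a rough helper along these lines works.
This is my own sketch, not anything from vdsm, and the 448/485 build numbers
are just the ones recalled above:

# Rough sketch: flag hosts whose running EL8 kernel build falls in the
# range where the hangs were reported. Build numbers are assumptions
# taken from the recollection above (448 known good, ~485 first bad).
import platform
import re

LAST_KNOWN_GOOD = 448
FIRST_SUSPECT = 485

def el8_kernel_build(release=None):
    """Return the build number from a release string like '4.18.0-489.el8.x86_64'."""
    release = release or platform.release()
    match = re.match(r'4\.18\.0-(\d+)', release)
    return int(match.group(1)) if match else None

build = el8_kernel_build()
if build is None:
    print('not an EL8 4.18.0 kernel, nothing to compare')
elif build >= FIRST_SUSPECT:
    print(f'running build {build}: in the reported-bad range, '
          f'consider pinning 4.18.0-{LAST_KNOWN_GOOD}')
else:
    print(f'running build {build}: below the reported-bad range')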
On 5/11/2023 2:26 PM, dominik.drazyk(a)blackrack.pl wrote:
> Hello,
> I recently migrated our customer's cluster to newer hardware
> (CentOS 8 Stream, 4 hypervisor nodes, 3 of the hosts providing
> GlusterFS with 5x 6 TB SSDs as JBOD, replica 3). About a month after
> the switch we started seeing frequent VM lock-ups that require a host
> reboot to clear. Affected VMs cannot be powered down from the oVirt
> UI, and even when oVirt does manage to power them down, they cannot
> be booted again because oVirt reports that the OS disk is in use.
> Once I reboot the host, the VMs can be started and everything works
> fine again.
>
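As a side note for anyone debugging this: a quick way to see which VMs the
engine still considers running or non-responsive is the oVirt Python SDK
(ovirtsdk4). A minimal sketch; the engine URL and credentials below are
placeholders, not values from the cluster above:

# Minimal sketch using the oVirt Python SDK (ovirtsdk4) to list VM statuses.
# The engine URL, username and password are placeholders.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # or pass ca_file= pointing at the engine CA certificate
)
try:
    vms_service = connection.system_service().vms_service()
    for vm in vms_service.list():
        # vm.status is a types.VmStatus value, e.g. UP or NOT_RESPONDING
        print(vm.name, vm.status)
finally:
    connection.close()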
> In vdsm logs I can note the following error:
> 2023-05-11 19:33:12,339+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f553aa3e470>> operation failed (periodic:187)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
>     self._func()
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
>     vm_id, self._qga_call_get_vcpus(vm_obj))
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
>     if 'online' in vcpus:
> TypeError: argument of type 'NoneType' is not iterable
>
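The traceback suggests the guest agent returned no vCPU data and vdsm then
indexed into the empty reply. Purely to illustrate the failure mode (this is
not the actual vdsm code, and the names are made up), a guard of this shape
would turn the TypeError into a "no data" case:

# Illustrative sketch only: treat a missing guest-agent reply as unknown
# instead of doing membership tests on None.
def online_vcpu_info(vcpus):
    """vcpus: dict-like guest-agent vCPU info, or None if the agent did not reply."""
    if not vcpus or 'online' not in vcpus:
        return None          # unknown; caller should skip updating stats
    return vcpus['online']

# Example:
print(online_vcpu_info(None))            # -> None, no TypeError
print(online_vcpu_info({'online': 4}))   # -> 4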
> /var/log/messages reports:
> May 11 19:35:15 kernel: task:CPU 7/KVM state:D stack: 0 pid: 7065 ppid: 1 flags: 0x80000182
> May 11 19:35:15 kernel: Call Trace:
> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
> May 11 19:35:15 kernel: schedule+0x55/0xf0
> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
> May 11 19:35:15 kernel: down_read+0x95/0xa0
> May 11 19:35:15 kernel: get_user_pages_unlocked+0x66/0x2a0
> May 11 19:35:15 kernel: hva_to_pfn+0xf5/0x430 [kvm]
> May 11 19:35:15 kernel: kvm_faultin_pfn+0x95/0x2e0 [kvm]
> May 11 19:35:15 kernel: ? select_task_rq_fair+0x355/0x990
> May 11 19:35:15 kernel: ? sched_clock+0x5/0x10
> May 11 19:35:15 kernel: ? sched_clock_cpu+0xc/0xb0
> May 11 19:35:15 kernel: direct_page_fault+0x3b4/0x860 [kvm]
> May 11 19:35:15 kernel: kvm_mmu_page_fault+0x114/0x680 [kvm]
> May 11 19:35:15 kernel: ? vmx_vmexit+0x9f/0x70d [kvm_intel]
> May 11 19:35:15 kernel: ? vmx_vmexit+0xae/0x70d [kvm_intel]
> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
> May 11 19:35:15 kernel: vmx_handle_exit+0x177/0x770 [kvm_intel]
> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
> May 11 19:35:15 kernel: vcpu_enter_guest+0xafd/0x18e0 [kvm]
> May 11 19:35:15 kernel: ? hrtimer_try_to_cancel+0x7b/0x100
> May 11 19:35:15 kernel: kvm_arch_vcpu_ioctl_run+0x112/0x600 [kvm]
> May 11 19:35:15 kernel: kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
> May 11 19:35:15 kernel: ? pollwake+0x74/0xa0
> May 11 19:35:15 kernel: ? wake_up_q+0x70/0x70
> May 11 19:35:15 kernel: ? __wake_up_common+0x7a/0x190
> May 11 19:35:15 kernel: do_vfs_ioctl+0xa4/0x690
> May 11 19:35:15 kernel: ksys_ioctl+0x64/0xa0
> May 11 19:35:15 kernel: __x64_sys_ioctl+0x16/0x20
> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a1387cb
> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a1387a1.
> May 11 19:35:15 kernel: RSP: 002b:00007fa6f5ffa6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 000055be52e7bcf0 RCX: 00007faf1a1387cb
> May 11 19:35:15 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 000055be5158c6a8 R09: 00000007d9e95a00
> May 11 19:35:15 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
> May 11 19:35:15 kernel: R13: 000055be515bcfc0 R14: 00007fffec958800 R15: 00007faf1d6c6000
> May 11 19:35:15 kernel: INFO: task worker:714626 blocked for more than 120 seconds.
> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
> May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> May 11 19:35:15 kernel: task:worker state:D stack: 0 pid:714626 ppid: 1 flags:0x00000180
> May 11 19:35:15 kernel: Call Trace:
> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
> May 11 19:35:15 kernel: schedule+0x55/0xf0
> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
> May 11 19:35:15 kernel: down_read+0x95/0xa0
> May 11 19:35:15 kernel: do_madvise.part.30+0x2c3/0xa40
> May 11 19:35:15 kernel: ? syscall_trace_enter+0x1ff/0x2d0
> May 11 19:35:15 kernel: ? __x64_sys_madvise+0x26/0x30
> May 11 19:35:15 kernel: __x64_sys_madvise+0x26/0x30
> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a138a4b
> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a138a21.
> May 11 19:35:15 kernel: RSP: 002b:00007faf151ea7f8 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 00007faf149eb000 RCX: 00007faf1a138a4b
> May 11 19:35:15 kernel: RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007faf149eb000
> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 00000007faf080ba R09: 00000000ffffffff
> May 11 19:35:15 kernel: R10: 00007faf151ea760 R11: 0000000000000206 R12: 00007faf15aec48e
> May 11 19:35:15 kernel: R13: 00007faf15aec48f R14: 00007faf151eb700 R15: 00007faf151ea8c0
> May 11 19:35:15 kernel: INFO: task worker:714628 blocked for more than 120 seconds.
> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
> May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
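When the hung-task watchdog starts firing like this, it can help to list
everything currently stuck in uninterruptible sleep without waiting for the
120-second timeout. A small sketch of my own (assuming a Linux procfs, run on
the affected host):

#!/usr/bin/env python3
# Rough sketch: list tasks stuck in uninterruptible sleep ("D" state),
# the same condition the hung-task watchdog above is reporting.
import os

def d_state_tasks():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/stat') as f:
                data = f.read()
        except OSError:
            continue  # task exited while we were scanning
        # comm is wrapped in parentheses and may contain spaces, so split
        # on the closing parenthesis instead of on whitespace.
        comm = data[data.index('(') + 1:data.rindex(')')]
        state = data[data.rindex(')') + 2]
        if state == 'D':
            yield int(pid), comm

if __name__ == '__main__':
    for pid, comm in d_state_tasks():
        print(f'{pid:>8}  {comm}')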
> Installed VDSM packages:
> vdsm-api-4.50.3.4-1.el8.noarch
> vdsm-network-4.50.3.4-1.el8.x86_64
> vdsm-yajsonrpc-4.50.3.4-1.el8.noarch
> vdsm-http-4.50.3.4-1.el8.noarch
> vdsm-client-4.50.3.4-1.el8.noarch
> vdsm-4.50.3.4-1.el8.x86_64
> vdsm-gluster-4.50.3.4-1.el8.x86_64
> vdsm-python-4.50.3.4-1.el8.noarch
> vdsm-jsonrpc-4.50.3.4-1.el8.noarch
> vdsm-common-4.50.3.4-1.el8.noarch
>
> Libvirt:
> libvirt-client-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-nodedev-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-logical-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-qemu-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-scsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-core-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-config-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-iscsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-rbd-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-libs-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-config-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-secret-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-disk-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-mpath-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-gluster-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> python3-libvirt-8.0.0-2.module_el8.7.0+1218+f626c2ff.x86_64
> libvirt-daemon-driver-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-lock-sanlock-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-interface-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-iscsi-direct-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-kvm-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>
> While the VMs are locked up they do not respond on the network, and I
> cannot use the VNC console (or any other console) to check what is
> happening from the VM's perspective. On the affected host I cannot
> even list running processes. There are plenty of resources left, and
> each host runs about 30-35 VMs.
> At first I thought it might be related to GlusterFS (I use Gluster on
> other clusters and it usually works fine), so we migrated all VMs
> back to the old NFS storage. The problem came back today on two hosts.
> I do not have such issues on another cluster that runs Rocky 8.6 with
> hyperconverged GlusterFS, so as a last resort I'll be migrating from
> CentOS 8 Stream to Rocky 8.
>
> Has anyone observed such issues with oVirt hosts on CentOS 8 Stream?
> Any help is welcome, as I'm running out of ideas.
_______________________________________________
Users mailing list -- users(a)ovirt.org
To unsubscribe send an email to users-leave(a)ovirt.org
Privacy Statement:
https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/QFFRNZKKXSV...
--
Rik Theys
System Engineer
KU Leuven - Dept. Elektrotechniek (ESAT)
Kasteelpark Arenberg 10 bus 2440 - B-3001 Leuven-Heverlee
+32(0)16/32.11.07
----------------------------------------------------------------
<<Any errors in spelling, tact or fact are transmission errors>>