Same here, very frustrating.
On 21/05/2023 21:11, Rik Theys wrote:
> Hi,
>
> We are experiencing the same issue. We migrated one host from
> CentOS Stream 8 to Rocky 8.8 and see it there as well with the EL
> 8.8 kernel.
>
> We don't see this issue on our 8.7 hosts.
>
> Regards,
>
> Rik
>
> On 5/15/23 22:48, Jeff Bailey wrote:
>> This sounds exactly like the trouble I was having. I downgraded the
>> kernel to 4.18.0-448 and everything has been fine since. There have
>> been a couple of kernel releases since I hit the problem, but I
>> haven't had a chance to try them yet. I believe it was 4.18.0-485
>> where I first noticed it, but that's from memory.
>>
>>
>> On 5/11/2023 2:26 PM, dominik.drazyk(a)blackrack.pl wrote:
>>> Hello,
>>> I recently migrated our customer's cluster to newer hardware
>>> (CentOS 8 Stream, 4 hypervisor nodes, 3 hosts with GlusterFS, 5x 6TB
>>> SSD as JBOD, replica 3). About a month after the switch we started
>>> seeing frequent VM locks that require a host reboot to unlock the
>>> VM. Affected VMs cannot be powered down from the oVirt UI, and even
>>> when oVirt does manage to power them down, they cannot be booted
>>> again because the OS disk is reported as in use. Once I reboot the
>>> host, the VMs can be started and everything works fine.
>>>
>>> In the vdsm log I see the following error:
>>> 2023-05-11 19:33:12,339+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f553aa3e470>> operation failed (periodic:187)
>>> Traceback (most recent call last):
>>>   File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
>>>     self._func()
>>>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
>>>     vm_id, self._qga_call_get_vcpus(vm_obj))
>>>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
>>>     if 'online' in vcpus:
>>> TypeError: argument of type 'NoneType' is not iterable
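That TypeError looks like the poller indexing into a reply that never came back: at qemuguestagent.py line 797, vcpus is None when the guest agent does not answer, and 'online' in None then raises. Purely as illustration (this is not the actual vdsm code; the function name and reply shape are my assumptions), a guard of this kind avoids the crash:

    # Hypothetical sketch: tolerate a missing guest-agent reply instead of
    # assuming it is always a mapping with an 'online' key.
    def parse_online_vcpus(reply):
        """reply is whatever guest-get-vcpus handed back; it can be None
        if the agent timed out or is not running inside the guest."""
        if not reply or 'online' not in reply:
            return None  # no usable data, skip this polling round
        return reply['online']

That would only silence the traceback, though; the real problem appears to be the kernel-level hang shown below.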
>>>
>>> /var/log/messages reports:
>>> May 11 19:35:15 kernel: task:CPU 7/KVM state:D stack: 0 pid: 7065 ppid: 1 flags: 0x80000182
>>> May 11 19:35:15 kernel: Call Trace:
>>> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
>>> May 11 19:35:15 kernel: schedule+0x55/0xf0
>>> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
>>> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
>>> May 11 19:35:15 kernel: down_read+0x95/0xa0
>>> May 11 19:35:15 kernel: get_user_pages_unlocked+0x66/0x2a0
>>> May 11 19:35:15 kernel: hva_to_pfn+0xf5/0x430 [kvm]
>>> May 11 19:35:15 kernel: kvm_faultin_pfn+0x95/0x2e0 [kvm]
>>> May 11 19:35:15 kernel: ? select_task_rq_fair+0x355/0x990
>>> May 11 19:35:15 kernel: ? sched_clock+0x5/0x10
>>> May 11 19:35:15 kernel: ? sched_clock_cpu+0xc/0xb0
>>> May 11 19:35:15 kernel: direct_page_fault+0x3b4/0x860 [kvm]
>>> May 11 19:35:15 kernel: kvm_mmu_page_fault+0x114/0x680 [kvm]
>>> May 11 19:35:15 kernel: ? vmx_vmexit+0x9f/0x70d [kvm_intel]
>>> May 11 19:35:15 kernel: ? vmx_vmexit+0xae/0x70d [kvm_intel]
>>> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
>>> May 11 19:35:15 kernel: vmx_handle_exit+0x177/0x770 [kvm_intel]
>>> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
>>> May 11 19:35:15 kernel: vcpu_enter_guest+0xafd/0x18e0 [kvm]
>>> May 11 19:35:15 kernel: ? hrtimer_try_to_cancel+0x7b/0x100
>>> May 11 19:35:15 kernel: kvm_arch_vcpu_ioctl_run+0x112/0x600 [kvm]
>>> May 11 19:35:15 kernel: kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
>>> May 11 19:35:15 kernel: ? pollwake+0x74/0xa0
>>> May 11 19:35:15 kernel: ? wake_up_q+0x70/0x70
>>> May 11 19:35:15 kernel: ? __wake_up_common+0x7a/0x190
>>> May 11 19:35:15 kernel: do_vfs_ioctl+0xa4/0x690
>>> May 11 19:35:15 kernel: ksys_ioctl+0x64/0xa0
>>> May 11 19:35:15 kernel: __x64_sys_ioctl+0x16/0x20
>>> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
>>> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a1387cb
>>> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a1387a1.
>>> May 11 19:35:15 kernel: RSP: 002b:00007fa6f5ffa6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 000055be52e7bcf0 RCX: 00007faf1a1387cb
>>> May 11 19:35:15 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
>>> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 000055be5158c6a8 R09: 00000007d9e95a00
>>> May 11 19:35:15 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
>>> May 11 19:35:15 kernel: R13: 000055be515bcfc0 R14: 00007fffec958800 R15: 00007faf1d6c6000
>>> May 11 19:35:15 kernel: INFO: task worker:714626 blocked for more than 120 seconds.
>>> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
>>> May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>
>>> May 11 19:35:15 kernel: task:worker state:D stack: 0 pid:714626 ppid: 1 flags:0x00000180
>>> May 11 19:35:15 kernel: Call Trace:
>>> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
>>> May 11 19:35:15 kernel: schedule+0x55/0xf0
>>> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
>>> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
>>> May 11 19:35:15 kernel: down_read+0x95/0xa0
>>> May 11 19:35:15 kernel: do_madvise.part.30+0x2c3/0xa40
>>> May 11 19:35:15 kernel: ? syscall_trace_enter+0x1ff/0x2d0
>>> May 11 19:35:15 kernel: ? __x64_sys_madvise+0x26/0x30
>>> May 11 19:35:15 kernel: __x64_sys_madvise+0x26/0x30
>>> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
>>> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a138a4b
>>> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a138a21.
>>> May 11 19:35:15 kernel: RSP: 002b:00007faf151ea7f8 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
>>> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 00007faf149eb000 RCX: 00007faf1a138a4b
>>> May 11 19:35:15 kernel: RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007faf149eb000
>>> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 00000007faf080ba R09: 00000000ffffffff
>>> May 11 19:35:15 kernel: R10: 00007faf151ea760 R11: 0000000000000206 R12: 00007faf15aec48e
>>> May 11 19:35:15 kernel: R13: 00007faf15aec48f R14: 00007faf151eb700 R15: 00007faf151ea8c0
>>> May 11 19:35:15 kernel: INFO: task worker:714628 blocked for more than 120 seconds.
>>> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
>>> May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
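For what it's worth, both traces show tasks stuck in uninterruptible sleep (state D) in rwsem_down_read_slowpath, i.e. waiting on the address-space semaphore, one from a KVM page fault and one from madvise. That would also explain why the host cannot list processes: as far as I know, ps reads /proc/<pid>/cmdline, which takes that same mmap semaphore. When it happens again, something like this sketch can show which tasks are wedged without touching cmdline (plain Python, nothing oVirt-specific, only standard /proc paths):

    #!/usr/bin/env python3
    # Sketch: list tasks currently in uninterruptible sleep ('D') by reading
    # /proc/<pid>/stat, which does not need the mmap semaphore that the
    # stuck tasks are blocked on.
    import glob

    for path in glob.glob('/proc/[0-9]*/stat'):
        try:
            with open(path) as f:
                data = f.read()
        except OSError:
            continue                     # task exited while we were scanning
        pid, rest = data.split(' ', 1)
        comm, fields = rest.rsplit(')', 1)   # comm may contain spaces
        if fields.split()[0] == 'D':
            print(pid, comm.lstrip('('))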
>>> Installed VDSM packages:
>>> vdsm-api-4.50.3.4-1.el8.noarch
>>> vdsm-network-4.50.3.4-1.el8.x86_64
>>> vdsm-yajsonrpc-4.50.3.4-1.el8.noarch
>>> vdsm-http-4.50.3.4-1.el8.noarch
>>> vdsm-client-4.50.3.4-1.el8.noarch
>>> vdsm-4.50.3.4-1.el8.x86_64
>>> vdsm-gluster-4.50.3.4-1.el8.x86_64
>>> vdsm-python-4.50.3.4-1.el8.noarch
>>> vdsm-jsonrpc-4.50.3.4-1.el8.noarch
>>> vdsm-common-4.50.3.4-1.el8.noarch
>>>
>>> Libvirt:
>>> libvirt-client-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-nodedev-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-logical-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-qemu-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-scsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-core-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-config-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-iscsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-rbd-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-libs-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-config-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-secret-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-disk-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-mpath-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-gluster-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> python3-libvirt-8.0.0-2.module_el8.7.0+1218+f626c2ff.x86_64
>>> libvirt-daemon-driver-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-lock-sanlock-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-interface-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-driver-storage-iscsi-direct-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>> libvirt-daemon-kvm-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>>>
>>> While the VMs are locked, they do not respond on the network and I
>>> cannot use a VNC console (or any other) to check what is happening
>>> from the VM's perspective. The host cannot even list its running
>>> processes. There are plenty of resources left and each host runs
>>> about 30-35 VMs. At first I thought it might be related to GlusterFS
>>> (I use Gluster on other clusters and it usually works fine), so we
>>> migrated all VMs back to the old NFS storage, but the problem came
>>> back today on two hosts. I do not have such issues on another
>>> cluster that runs Rocky 8.6 with hyperconverged GlusterFS, so as a
>>> last resort I'll be migrating from CentOS 8 Stream to Rocky 8.
>>>
>>> Has anyone observed such issues with oVirt hosts on CentOS 8 Stream?
>>> Any form of help is welcome, as I'm running out of ideas.
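If it helps anyone narrowing this down: since python3-libvirt is already on these hosts, you can ask a stuck VM's guest agent for its vCPU list directly, bypassing vdsm, to see whether the agent channel itself is what stopped answering. A rough sketch (the domain name is just a placeholder):

    #!/usr/bin/env python3
    # Sketch: query the QEMU guest agent of one domain directly via libvirt.
    # 'myvm' is a placeholder; replace it with an affected VM's name.
    import json
    import libvirt
    import libvirt_qemu

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('myvm')
    try:
        reply = libvirt_qemu.qemuAgentCommand(
            dom, json.dumps({'execute': 'guest-get-vcpus'}), 10, 0)
        print(json.loads(reply))
    except libvirt.libvirtError as err:
        print('guest agent did not answer:', err)
    finally:
        conn.close()

If that call times out on an affected VM but works on a healthy one, the vdsm traceback above is probably just a downstream symptom of the same hang.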