Hi,

We are experiencing the same issue. We migrated one host from CentOS
Stream 8 to Rocky 8.8 and see the same problem there with the EL 8.8
kernel. We do not see it on our 8.7 hosts.
Regards,
Rik
On 5/15/23 22:48, Jeff Bailey wrote:
Sounds exactly like some trouble I was having. I downgraded the
kernel to 4.18.0-448 and everything has been fine since. There have
been a couple of kernel releases since I had problems, but I haven't
had a chance to try them yet. I believe it was 4.18.0-485 where I
first noticed it, but that's just from memory.
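For anyone who wants to check whether a host is running one of the suspect
builds before deciding to downgrade, a rough helper along these lines works.
This is my own sketch, not anything from vdsm, and the 448/485 build numbers
are just the ones recalled above:

# Rough sketch: flag hosts whose running EL8 kernel build falls in the
# range where the hangs were reported. Build numbers are assumptions
# taken from the recollection above (448 known good, ~485 first bad).
import platform
import re

LAST_KNOWN_GOOD = 448
FIRST_SUSPECT = 485

def el8_kernel_build(release=None):
    """Return the build number from a release string like '4.18.0-489.el8.x86_64'."""
    release = release or platform.release()
    match = re.match(r'4\.18\.0-(\d+)', release)
    return int(match.group(1)) if match else None

build = el8_kernel_build()
if build is None:
    print('not an EL8 4.18.0 kernel, nothing to compare')
elif build >= FIRST_SUSPECT:
    print(f'running build {build}: in the reported-bad range, '
          f'consider pinning 4.18.0-{LAST_KNOWN_GOOD}')
else:
    print(f'running build {build}: below the reported-bad range')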
On 5/11/2023 2:26 PM, dominik.drazyk(a)blackrack.pl wrote:
> Hello,
> I recently migrated our customer's cluster to newer hardware
> (CentOS 8 Stream, 4 hypervisor nodes, 3 of the hosts providing
> GlusterFS with 5x 6 TB SSDs as JBOD, replica 3). About a month after
> the switch we started seeing frequent VM lock-ups that require a host
> reboot to clear. Affected VMs cannot be powered down from the oVirt
> UI, and even when oVirt does manage to power them down, they cannot
> be booted again because oVirt reports that the OS disk is in use.
> Once I reboot the host, the VMs can be started and everything works
> fine again.
>
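As a side note for anyone debugging this: a quick way to see which VMs the
engine still considers running or non-responsive is the oVirt Python SDK
(ovirtsdk4). A minimal sketch; the engine URL and credentials below are
placeholders, not values from the cluster above:

# Minimal sketch using the oVirt Python SDK (ovirtsdk4) to list VM statuses.
# The engine URL, username and password are placeholders.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # or pass ca_file= pointing at the engine CA certificate
)
try:
    vms_service = connection.system_service().vms_service()
    for vm in vms_service.list():
        # vm.status is a types.VmStatus value, e.g. UP or NOT_RESPONDING
        print(vm.name, vm.status)
finally:
    connection.close()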
> In vdsm logs I can note the following error:
> 2023-05-11 19:33:12,339+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7f553aa3e470>> operation failed (periodic:187)
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
>     self._func()
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
>     vm_id, self._qga_call_get_vcpus(vm_obj))
>   File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
>     if 'online' in vcpus:
> TypeError: argument of type 'NoneType' is not iterable
>
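The traceback suggests the guest agent returned no vCPU data and vdsm then
indexed into the empty reply. Purely to illustrate the failure mode (this is
not the actual vdsm code, and the names are made up), a guard of this shape
would turn the TypeError into a "no data" case:

# Illustrative sketch only: treat a missing guest-agent reply as unknown
# instead of doing membership tests on None.
def online_vcpu_info(vcpus):
    """vcpus: dict-like guest-agent vCPU info, or None if the agent did not reply."""
    if not vcpus or 'online' not in vcpus:
        return None          # unknown; caller should skip updating stats
    return vcpus['online']

# Example:
print(online_vcpu_info(None))            # -> None, no TypeError
print(online_vcpu_info({'online': 4}))   # -> 4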
> /var/log/messages reports:
> May 11 19:35:15 kernel: task:CPU 7/KVM state:D stack: 0 pid: 7065 ppid: 1 flags: 0x80000182
> May 11 19:35:15 kernel: Call Trace:
> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
> May 11 19:35:15 kernel: schedule+0x55/0xf0
> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
> May 11 19:35:15 kernel: down_read+0x95/0xa0
> May 11 19:35:15 kernel: get_user_pages_unlocked+0x66/0x2a0
> May 11 19:35:15 kernel: hva_to_pfn+0xf5/0x430 [kvm]
> May 11 19:35:15 kernel: kvm_faultin_pfn+0x95/0x2e0 [kvm]
> May 11 19:35:15 kernel: ? select_task_rq_fair+0x355/0x990
> May 11 19:35:15 kernel: ? sched_clock+0x5/0x10
> May 11 19:35:15 kernel: ? sched_clock_cpu+0xc/0xb0
> May 11 19:35:15 kernel: direct_page_fault+0x3b4/0x860 [kvm]
> May 11 19:35:15 kernel: kvm_mmu_page_fault+0x114/0x680 [kvm]
> May 11 19:35:15 kernel: ? vmx_vmexit+0x9f/0x70d [kvm_intel]
> May 11 19:35:15 kernel: ? vmx_vmexit+0xae/0x70d [kvm_intel]
> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
> May 11 19:35:15 kernel: vmx_handle_exit+0x177/0x770 [kvm_intel]
> May 11 19:35:15 kernel: ? gfn_to_pfn_cache_invalidate_start+0x190/0x190 [kvm]
> May 11 19:35:15 kernel: vcpu_enter_guest+0xafd/0x18e0 [kvm]
> May 11 19:35:15 kernel: ? hrtimer_try_to_cancel+0x7b/0x100
> May 11 19:35:15 kernel: kvm_arch_vcpu_ioctl_run+0x112/0x600 [kvm]
> May 11 19:35:15 kernel: kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
> May 11 19:35:15 kernel: ? pollwake+0x74/0xa0
> May 11 19:35:15 kernel: ? wake_up_q+0x70/0x70
> May 11 19:35:15 kernel: ? __wake_up_common+0x7a/0x190
> May 11 19:35:15 kernel: do_vfs_ioctl+0xa4/0x690
> May 11 19:35:15 kernel: ksys_ioctl+0x64/0xa0
> May 11 19:35:15 kernel: __x64_sys_ioctl+0x16/0x20
> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a1387cb
> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a1387a1.
> May 11 19:35:15 kernel: RSP: 002b:00007fa6f5ffa6e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 000055be52e7bcf0 RCX: 00007faf1a1387cb
> May 11 19:35:15 kernel: RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000027
> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 000055be5158c6a8 R09: 00000007d9e95a00
> May 11 19:35:15 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
> May 11 19:35:15 kernel: R13: 000055be515bcfc0 R14: 00007fffec958800 R15: 00007faf1d6c6000
> May 11 19:35:15 kernel: INFO: task worker:714626 blocked for more than 120 seconds.
> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
> May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> May 11 19:35:15 kernel: task:worker state:D stack: 0 pid:714626 ppid: 1 flags:0x00000180
> May 11 19:35:15 kernel: Call Trace:
> May 11 19:35:15 kernel: __schedule+0x2d1/0x870
> May 11 19:35:15 kernel: schedule+0x55/0xf0
> May 11 19:35:15 kernel: schedule_preempt_disabled+0xa/0x10
> May 11 19:35:15 kernel: rwsem_down_read_slowpath+0x26e/0x3f0
> May 11 19:35:15 kernel: down_read+0x95/0xa0
> May 11 19:35:15 kernel: do_madvise.part.30+0x2c3/0xa40
> May 11 19:35:15 kernel: ? syscall_trace_enter+0x1ff/0x2d0
> May 11 19:35:15 kernel: ? __x64_sys_madvise+0x26/0x30
> May 11 19:35:15 kernel: __x64_sys_madvise+0x26/0x30
> May 11 19:35:15 kernel: do_syscall_64+0x5b/0x1b0
> May 11 19:35:15 kernel: entry_SYSCALL_64_after_hwframe+0x61/0xc6
> May 11 19:35:15 kernel: RIP: 0033:0x7faf1a138a4b
> May 11 19:35:15 kernel: Code: Unable to access opcode bytes at RIP 0x7faf1a138a21.
> May 11 19:35:15 kernel: RSP: 002b:00007faf151ea7f8 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
> May 11 19:35:15 kernel: RAX: ffffffffffffffda RBX: 00007faf149eb000 RCX: 00007faf1a138a4b
> May 11 19:35:15 kernel: RDX: 0000000000000004 RSI: 00000000007fb000 RDI: 00007faf149eb000
> May 11 19:35:15 kernel: RBP: 0000000000000000 R08: 00000007faf080ba R09: 00000000ffffffff
> May 11 19:35:15 kernel: R10: 00007faf151ea760 R11: 0000000000000206 R12: 00007faf15aec48e
> May 11 19:35:15 kernel: R13: 00007faf15aec48f R14: 00007faf151eb700 R15: 00007faf151ea8c0
> May 11 19:35:15 kernel: INFO: task worker:714628 blocked for more than 120 seconds.
> May 11 19:35:15 kernel: Not tainted 4.18.0-489.el8.x86_64 #1
> May 11 19:35:15 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
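When the hung-task watchdog starts firing like this, it can help to list
everything currently stuck in uninterruptible sleep without waiting for the
120-second timeout. A small sketch of my own (assuming a Linux procfs, run on
the affected host):

#!/usr/bin/env python3
# Rough sketch: list tasks stuck in uninterruptible sleep ("D" state),
# the same condition the hung-task watchdog above is reporting.
import os

def d_state_tasks():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/stat') as f:
                data = f.read()
        except OSError:
            continue  # task exited while we were scanning
        # comm is wrapped in parentheses and may contain spaces, so split
        # on the closing parenthesis instead of on whitespace.
        comm = data[data.index('(') + 1:data.rindex(')')]
        state = data[data.rindex(')') + 2]
        if state == 'D':
            yield int(pid), comm

if __name__ == '__main__':
    for pid, comm in d_state_tasks():
        print(f'{pid:>8}  {comm}')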
> Installed VDSM packages:
> vdsm-api-4.50.3.4-1.el8.noarch
> vdsm-network-4.50.3.4-1.el8.x86_64
> vdsm-yajsonrpc-4.50.3.4-1.el8.noarch
> vdsm-http-4.50.3.4-1.el8.noarch
> vdsm-client-4.50.3.4-1.el8.noarch
> vdsm-4.50.3.4-1.el8.x86_64
> vdsm-gluster-4.50.3.4-1.el8.x86_64
> vdsm-python-4.50.3.4-1.el8.noarch
> vdsm-jsonrpc-4.50.3.4-1.el8.noarch
> vdsm-common-4.50.3.4-1.el8.noarch
>
> Libvirt:
> libvirt-client-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-nodedev-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-logical-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-qemu-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-scsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-core-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-config-network-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-iscsi-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-rbd-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-libs-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-config-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-secret-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-disk-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-mpath-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-gluster-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> python3-libvirt-8.0.0-2.module_el8.7.0+1218+f626c2ff.x86_64
> libvirt-daemon-driver-nwfilter-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-lock-sanlock-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-interface-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-driver-storage-iscsi-direct-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
> libvirt-daemon-kvm-8.0.0-14.module_el8.8.0+1257+0c3374ae.x86_64
>
> While the VMs are locked up they do not respond on the network, and I
> cannot use the VNC console (or any other console) to check what is
> happening from the VM's perspective. On the affected host I cannot
> even list running processes. There are plenty of resources left, and
> each host runs about 30-35 VMs.
> At first I thought it might be related to GlusterFS (I use Gluster on
> other clusters and it usually works fine), so we migrated all VMs
> back to the old NFS storage. The problem came back today on two hosts.
> I do not have such issues on another cluster that runs Rocky 8.6 with
> hyperconverged GlusterFS, so as a last resort I'll be migrating from
> CentOS 8 Stream to Rocky 8.
>
> Has anyone observed such issues with oVirt hosts on CentOS 8 Stream?
> Any help is welcome, as I'm running out of ideas.
_______________________________________________
Users mailing list -- users(a)ovirt.org
To unsubscribe send an email to users-leave(a)ovirt.org
Privacy Statement:
https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/users@ovirt.org/message/QFFRNZKKXSV...
--
Rik Theys
System Engineer
KU Leuven - Dept. Elektrotechniek (ESAT)
Kasteelpark Arenberg 10 bus 2440 - B-3001 Leuven-Heverlee
+32(0)16/32.11.07
----------------------------------------------------------------
<<Any errors in spelling, tact or fact are transmission errors>>