Hey,
I'm wondering if anyone else is experiencing freezing VMs, especially Windows servers and
especially ones with a lot of RAM?
I found the following here:
https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/page-11
Post #218 and the posts further down are really interesting, as the conclusion is that
there might be a problem in kernels from 4.18.0-372.26.1 all the way up to Mainstream 6.3
or LTS 6.1 kernels. The problem appears to be that mmu_notifier_seq is referenced as an
integer in the is_page_fault_stale() function, causing KVM to freeze when the counter
reaches the maximum signed 32-bit integer value, 2,147,483,647.
Post #220 seems to state that it was fixed in commit ba6e3fe25543 in the kernel.
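To make the suspected failure mode concrete, here is a minimal Python sketch of how I
understand it from that thread. It is not the kernel code: the helper names are made up
for illustration, and it assumes the sequence counter is a 64-bit unsigned value whose
snapshot gets squeezed through a signed 32-bit int before being compared with the live
counter again.

```python
INT_MAX = 2**31 - 1  # 2,147,483,647


def to_int32(value):
    """Truncate to a signed 32-bit value, as a C `int` parameter typically would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT_MAX else value


def to_uint64(value):
    """Widen back to unsigned 64-bit (sign extension), as in an int vs. unsigned long compare."""
    return value & 0xFFFFFFFFFFFFFFFF


def looks_stale_buggy(current_seq, snapshot_seq):
    """Hypothetical model: the snapshot passes through a 32-bit int first."""
    return to_uint64(to_int32(snapshot_seq)) != current_seq


def looks_stale_fixed(current_seq, snapshot_seq):
    """Hypothetical model: the snapshot keeps its full 64-bit width."""
    return snapshot_seq != current_seq


# Nothing changed between snapshot and check, so neither should report "stale".
for seq in (100, INT_MAX, INT_MAX + 1):
    print(seq, looks_stale_buggy(seq, seq), looks_stale_fixed(seq, seq))
```

In this model the buggy check starts reporting "stale" for every fault once the counter
goes past 2,147,483,647, even though nothing changed. If the real code retries the fault
whenever the check says "stale", that would match a vCPU spinning at 100% CPU forever.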
The problem is that the latest kernel for my nodes running on CentOS 8 is
4.18.0-408.el8.x86_64, which is affected. I could try to downgrade the kernel on those
nodes, but that would lock me to the unsupported CentOS 8 oVirt nodes.
I tried a new oVirt node based on CentOS 9, and it comes with 5.14.0-514.el9.x86_64, which
also appears to be affected. I tried upgrading the kernel on CentOS 9 to the latest 6.1 LTS
kernel and to the latest 6.11 Mainstream kernel, and while the node itself works fine, it
does not work with oVirt. The node cannot be activated once the new kernel is in use.
As I'm a fixer and not a developer, I think the task might be too big for me to fix
oVirt and make it work with 6.1/6.3 kernels. My last resort is going to be an attempt to
backport the fix to the 5.14 kernel supplied with oVirt nodes based on CentOS 9.
I know... I should probably look for a new solution, but oVirt has been running our many
VMs quite well, at an affordable price. Yes, we have more work fixing various issues that
pop up from time to time, but if left alone, it works quite well and is stable.
Has anyone else encountered these issues?
//J