Unrecoverable NMI error on HP Gen8 hosts.

I have oVirt Node v4.4.8.3 running on several HP ProLiant Gen8 servers. I receive the following error under certain circumstances:
"An Unrecoverable System Error (NMI) has occurred (iLO application watchdog timeout NMI, Service Information: 0x0000002B, 0x00000000)"
When a host starts taking a load (but nowhere near a threshold), I encounter the above iLO-logged error and the host locks up. I have had to grossly under-utilize my hosts to avoid this problem. I'm hoping for a better fix or work-around.
I've had the same problem since my oVirt 4.3.x hosts, so it isn't specific to an oVirt version.
The little information I could find on the error wasn't helpful. Red Hat acknowledges the issue, but only for shutdown/reboot operations, not during "normal" operations.
Has anyone else experienced this problem? How did you fix it or work around it? I'd like to better utilize my servers if possible.
In advance, thank you to anyone and everyone who offers help.

On Thu, Dec 30, 2021 at 8:02 PM Diggy Mc <d03@bornfree.org> wrote:
[...]
Are you sure it's related to oVirt at all? To Linux? Did you check the hardware? Contact your hardware support? Perhaps check some on-board diagnostics/logs/whatever?
Good luck and best regards,
--
Didi

Before putting the Gen8 servers into production (as with all servers), I ran the comprehensive HP diagnostic tests on them for 24 hours. I also ran the intensive MemTest86 tests on them for 24 hours. All tests passed, so I feel safe assuming the hardware is okay.
The little information I have found on the matter suggests it is a kernel watchdog issue, but those articles offered no help in resolving the problem.
I am not suggesting it is an oVirt issue. I am simply hoping that someone in the oVirt community has encountered the same problem and can offer a solution or work-around.
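For anyone wanting to follow up on the kernel-watchdog theory, here is a minimal sketch (not something posted in this thread) of how one might list the watchdog devices the kernel has registered on a host. It assumes a Linux host that exposes the watchdog sysfs interface under /sys/class/watchdog; the attribute names listed are the ones commonly documented for that interface and may not all be present, depending on kernel configuration and driver. The function name read_watchdogs is made up for illustration.

```python
from pathlib import Path

# Attributes commonly exposed per watchdog device. Treated as optional:
# which ones exist depends on kernel configuration and the driver (assumption).
ATTRS = ("identity", "state", "timeout", "timeleft", "nowayout")


def read_watchdogs(root="/sys/class/watchdog"):
    """Return {device_name: {attribute: value}} for each watchdog under root."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result  # no watchdog class registered, or not a Linux host
    for dev in sorted(base.iterdir()):
        info = {}
        for attr in ATTRS:
            try:
                info[attr] = (dev / attr).read_text().strip()
            except OSError:
                continue  # attribute missing or not supported by this driver
        result[dev.name] = info
    return result


if __name__ == "__main__":
    for name, info in read_watchdogs().items():
        print(name, info)
```

If a device shows up with an HP/iLO identity and an active state, that would at least be consistent with something on the host having armed the iLO watchdog and then failing to service it in time under load. It would not prove that is the cause.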

NMI errors are usually hardware related or kernel/system related (e.g. memory failure, hardware health-check watchdog, etc.). They are not oVirt related per se.
That said, I'm seeing an HPE report with the same NMI service code:
https://community.hpe.com/t5/ProLiant-Servers-ML-DL-SL/Proliant-dl360p-gen8A...
- Gilboa

I wonder whether fencing is what is generating those NMIs. Have you tested fencing? If not, follow the documentation to test fencing and see whether it is the reason for the NMI.
Also check for any pending firmware updates, such as the newest iLO4.
Best Regards,
Strahil Nikolov

On Sun, Jan 2, 2022 at 17:44, Gilboa Davara <gilboad@gmail.com> wrote:
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/MXADE3ZVXA3VNQ...

These Gen8 servers have iLO-3 on them. They are at the latest version of iLO. I'm not familiar with 'fencing'. Any guidance you can offer on the subject would be greatly appreciated.

Hi Diggy,
I'm not sure if it's an oVirt issue, but it could be a network or firewall issue. Did you test the connection between the oVirt hosts and the iLO interfaces? Simple tests such as ping, to ensure each host can reach the others' iLO interfaces, and ipmitool, to ensure you can connect to the management interfaces?
Marcos
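The ping and ipmitool checks suggested above could be scripted roughly as follows. This is only a sketch under assumptions: it presumes a Linux ping (with -c and -W), that ipmitool is installed on the host, that IPMI over LAN is enabled on the iLO, and that the password is supplied through the IPMI_PASSWORD environment variable (which ipmitool reads when given -E). The function names and the command-line arguments are hypothetical.

```python
import subprocess
import sys


def build_checks(ilo_host, user):
    """Return the ping and ipmitool command lines for one iLO address."""
    ping = ["ping", "-c", "3", "-W", "2", ilo_host]
    # -E tells ipmitool to take the password from the IPMI_PASSWORD env var,
    # so it does not appear on the command line.
    ipmi = ["ipmitool", "-I", "lanplus", "-H", ilo_host, "-U", user,
            "-E", "chassis", "power", "status"]
    return [ping, ipmi]


def run_checks(ilo_host, user):
    """Run each check; map the tool name to True when it exits with status 0."""
    results = {}
    for cmd in build_checks(ilo_host, user):
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            results[cmd[0]] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            results[cmd[0]] = False  # tool not installed, or no answer in time
    return results


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_ilo.py <ilo-address> <ilo-user>
    print(run_checks(sys.argv[1], sys.argv[2]))
```

Running it from each host against every other host's iLO address would show whether the management network is reachable in all directions, which is what power management (fencing) depends on.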
participants (5): Diggy Mc, Gilboa Davara, Marcos Sungaila, Strahil Nikolov, Yedidyah Bar David