I just noticed that one of my oVirt physical hosts has been rebooting
due to an apparent hardware voltage fault. It's a Dell, and I've got
their tools installed and am monitoring status, but the issue clears
itself. It has apparently been doing this for a while, and we didn't
catch it for two reasons: (a) there weren't any VMs on it (there probably
were the first time, but they were restarted elsewhere fast enough that
nobody noticed), and (b) it reboots fast enough that at most it shows up
in our monitoring system for one pass and then clears, so our NOC either
didn't see it or assumed it was okay since it cleared.

oVirt has been logging alerts when it happens, but seeing them requires
someone to log in and check the logs. We have a bunch of different
systems to manage, including multiple oVirt clusters, so nobody does
that on a regular basis. We monitor most things with SNMP and/or CLI
checks (we use PRTG, Nagios, and LibreNMS for various things).

What are people doing to monitor the health of their oVirt systems? Is
it possible to get alerts emailed to admins? Is there any SNMP support
in oVirt to allow external systems to monitor its health? This setup is
on oVirt 4.3.10, if that matters.
--
Chris Adams <cma(a)cmadams.net>