[Users] ovirt nagios plugin email flood on engine upgrade

René Koch (ovido) r.koch at ovido.at
Mon Oct 28 09:06:20 UTC 2013


On Sun, 2013-10-27 at 10:08 +0200, Itamar Heim wrote:
> Hi René,
> 
> a question - when we upgrade the engine, we move the service to down.
> at that point, nagios is sending lots of emails on each failing 
> monitored resource due to inability to connect to the API.

Hi Itamar,

That's the default behavior of Nagios - it sends out an email for every
service which isn't in state "ok".
The check_rhev3 plugin uses the API for all datacenter, cluster, storage,
host, pool (and maybe vm) checks. During an upgrade the API isn't
available, so all service checks fail and Nagios starts spamming.


> 
> Is there a way to report this "once" per the root cause for all monitored 
> resources?


Yes, there are multiple ways to prevent mail floods.

First of all I suggest scheduling downtimes in Nagios for all affected
hosts/services. If a service has a scheduled downtime, no notifications
will be sent out during this time (no matter how often the state changes).
If you use the Icinga CGI GUI or Icinga-Web instead of Nagios, you can
select multiple hosts and choose "Schedule Downtime For Checked Host(s)
and All Services". I'm not sure if this is implemented in the Nagios CGI
GUI as well or if you have to do this step for each host manually.
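
If the GUI only handles one host at a time, downtimes can also be
scheduled through the Nagios external command file. The following is a
minimal sketch, not a tested setup: it builds
SCHEDULE_HOST_SVC_DOWNTIME command lines (downtime for all services of a
host). The function names, the author/comment values and the command
file path are assumptions for illustration; the real path is whatever
command_file is set to in your nagios.cfg.

```python
import time


def host_svc_downtime_command(host, start, duration_s, author, comment, now=None):
    """Build one external-command line that schedules a fixed downtime
    for all services of `host`."""
    now = int(time.time()) if now is None else int(now)
    start = int(start)
    duration_s = int(duration_s)
    end = start + duration_s
    # Fields: host;start;end;fixed;trigger_id;duration;author;comment
    # fixed=1 -> downtime runs exactly from start to end, trigger_id=0 -> not triggered
    return (
        f"[{now}] SCHEDULE_HOST_SVC_DOWNTIME;{host};{start};{end};"
        f"1;0;{duration_s};{author};{comment}"
    )


def write_commands(lines, command_file):
    """Write command lines to the external command file (a named pipe).

    The path is site-specific (assumption); opening the pipe blocks
    until Nagios is reading from it.
    """
    with open(command_file, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")
```

Usage would be one command per affected host, e.g. a list built from
host_svc_downtime_command(h, start, 3600, "admin", "engine upgrade")
for each host h, passed to write_commands() before the upgrade starts.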

An alternative to downtimes is disabling notifications globally. But keep
in mind that you won't receive any notifications during this period - so
e.g. a firewall outage isn't reported via email either.
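
Global notifications can be toggled with the external commands
DISABLE_NOTIFICATIONS and ENABLE_NOTIFICATIONS. A minimal sketch of
building those command lines follows; the function name is hypothetical,
and the resulting line still has to be written to the Nagios command
file of your installation.

```python
import time


def global_notifications_command(enable, now=None):
    """Build the external-command line that switches notifications
    on or off for the whole Nagios instance."""
    now = int(time.time()) if now is None else int(now)
    name = "ENABLE_NOTIFICATIONS" if enable else "DISABLE_NOTIFICATIONS"
    # These two commands take no arguments besides the timestamp.
    return f"[{now}] {name}"
```

The idea would be to send the "disable" line right before the upgrade
and the "enable" line once the engine API answers again, so the window
without any notifications stays as short as possible.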


Besides handling planned downtimes, there's also a feature I recommend
for preventing email floods if the engine goes down unplanned:
service dependencies
(http://nagios.sourceforge.net/docs/3_0/dependencies.html)

All my check_rhev3 service checks are members of service group "rhev",
and I created a service dependency so that no notifications are sent out
for a service of service group rhev that is in a state other than "ok"
while the datacenter check "oVirt DC Status Check" on host
"ovirt-engine.lab.ovido.at" is in state unknown, critical or pending
(notification_failure_criteria u,c,p). The dependent checks are still
executed (execution_failure_criteria n):

define servicedependency {
        host_name               ovirt-engine.lab.ovido.at
        service_description     oVirt DC Status Check
        dependent_servicegroup_name     rhev
        execution_failure_criteria      n
        notification_failure_criteria   u,c,p
}

With service dependencies for all check_rhev3 checks you will only
receive 2 notifications when the engine is down: 1 for the first service
check failing and 1 for the DC Status Check (this check is triggered
automatically when using service dependencies and one of the services
fails).
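
To make the effect of notification_failure_criteria easier to follow,
here is a simplified model of the rule in Python. It is an illustration
only, not how Nagios is implemented: it ignores soft/hard states and
check scheduling, and the function and variable names are made up for
this sketch.

```python
# One-letter options as used in notification_failure_criteria
STATE_CODES = {
    "ok": "o",
    "warning": "w",
    "unknown": "u",
    "critical": "c",
    "pending": "p",
}


def notifications_suppressed(master_state, criteria):
    """Return True if notifications for a dependent service are held
    back, given the state of the master service and a criteria string
    such as "u,c,p"."""
    options = {opt.strip() for opt in criteria.split(",")}
    if "n" in options:
        # "n" (none) means the dependency never suppresses notifications
        return False
    return STATE_CODES[master_state] in options


# With the definition above, the master service is "oVirt DC Status Check"
for state in ("ok", "warning", "unknown", "critical", "pending"):
    print(state, notifications_suppressed(state, "u,c,p"))
```

So with u,c,p a warning on the DC Status Check does not hold back
notifications of the dependent services, while unknown, critical and
pending do.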


Please let me know if you need further information (service dependencies
can be quite confusing)...


Regards,
René


> 
> Thanks,
>     Itamar



