On Sun, 2013-10-27 at 10:08 +0200, Itamar Heim wrote:
Hi René,
a question - when we upgrade the engine, we move the service to down.
at that point, nagios is sending lots of emails on each failing
monitored resource due to inability to connect to the API.
Hi Itamar,
That's the default behavior of Nagios - it sends out an email for all
services which aren't in state "ok".
As the check_rhev3 plugin uses the API for all datacenter, cluster, storage,
host, pool (and maybe vm) checks, and the API isn't available during an
upgrade, all service checks fail and Nagios starts spamming.
Is there a way to report this "once" per the root cause for all monitored
resources?
Yes, there are several ways to prevent mail floods.
First of all, I suggest scheduling downtimes in Nagios for all affected
hosts/services. If a service has a scheduled downtime, no notifications
will be sent out during this time (no matter how often the state changes).
If you use the Icinga CGI GUI or Icinga-Web instead of Nagios, you can select
multiple hosts and choose "Schedule Downtime For Checked Host(s) and All
Services". I'm not sure if this is implemented in the Nagios CGI GUI as well
or if you have to do this step for each host manually.
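
If clicking through the GUI for every host is too tedious, downtimes can
also be scheduled through the Nagios external command file. Below is a
minimal Python sketch of that idea, assuming external commands are enabled
(check_external_commands=1). The command file path, the author name and the
host list are assumptions and need to be adapted to the local setup.

```python
import time

# Assumed path; use the command_file value from your nagios.cfg.
COMMAND_FILE = "/var/spool/nagios/cmd/nagios.cmd"


def downtime_commands(hosts, start, duration, author, comment, now=None):
    """Build external command lines that put each host and all of its
    services into a fixed downtime from start to start + duration."""
    now = int(time.time()) if now is None else now
    end = start + duration
    lines = []
    for host in hosts:
        for command in ("SCHEDULE_HOST_DOWNTIME", "SCHEDULE_HOST_SVC_DOWNTIME"):
            # Fields: host;start;end;fixed;trigger_id;duration;author;comment
            lines.append(
                f"[{now}] {command};{host};{start};{end};1;0;{duration};{author};{comment}"
            )
    return lines


def submit(lines, command_file=COMMAND_FILE):
    """Write the commands to the Nagios command pipe, one per line."""
    with open(command_file, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")


if __name__ == "__main__":
    start = int(time.time())
    # Hypothetical host list: every host that carries check_rhev3 services.
    cmds = downtime_commands(["ovirt-engine.lab.ovido.at"], start, 3600,
                             "admin", "oVirt engine upgrade")
    print("\n".join(cmds))
    # submit(cmds)  # uncomment when running on the Nagios server
```

Running this right before the engine upgrade would cover the host and its
services for one hour; the duration is only an example value.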
An alternative to downtimes is disabling notifications globally. But keep
in mind that you won't receive any notifications during this period - so
e.g. a firewall outage isn't reported via email either.
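
Global notifications can be switched off and on again with the external
commands DISABLE_NOTIFICATIONS and ENABLE_NOTIFICATIONS. The following is
only a sketch of how that could be wrapped around an upgrade; the command
file path is an assumption, and external commands must be enabled.

```python
import time

# Assumed path; use the command_file value from your nagios.cfg.
COMMAND_FILE = "/var/spool/nagios/cmd/nagios.cmd"


def notification_command(enable, now=None):
    """Return the external command line that globally enables or
    disables notifications."""
    now = int(time.time()) if now is None else now
    name = "ENABLE_NOTIFICATIONS" if enable else "DISABLE_NOTIFICATIONS"
    return f"[{now}] {name}"


def send(line, command_file=COMMAND_FILE):
    """Write a single command line to the Nagios command pipe."""
    with open(command_file, "w") as pipe:
        pipe.write(line + "\n")


if __name__ == "__main__":
    print(notification_command(enable=False))  # before the upgrade
    print(notification_command(enable=True))   # after the upgrade
    # send(notification_command(enable=False))  # uncomment on the Nagios server
```

Don't forget the second call after the upgrade, otherwise Nagios stays
silent for real outages as well.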
Besides handling planned downtimes, there's also a feature I
recommend to prevent email floods if the engine goes down unplanned:
service dependencies
(http://nagios.sourceforge.net/docs/3_0/dependencies.html)
All my check_rhev3 service checks are members of the service group "rhev",
and I created a service dependency that suppresses notifications
(notification_failure_criteria u,c,p) for any service of service group
"rhev" that is in a state other than "ok" while the datacenter check
"oVirt DC Status Check" on host "ovirt-engine.lab.ovido.at" is in state
"unknown", "critical" or "pending". The dependent checks themselves keep
running (execution_failure_criteria n):
define servicedependency {
    host_name                       ovirt-engine.lab.ovido.at
    service_description             oVirt DC Status Check
    dependent_servicegroup_name     rhev
    execution_failure_criteria      n
    notification_failure_criteria   u,c,p
}
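
To make the effect of notification_failure_criteria a bit more tangible,
here is a small Python model of the decision. It is a simplified
illustration only, not how Nagios implements it internally; the function
name and the state names are made up for this example.

```python
# Map the one-letter options used in notification_failure_criteria
# to readable state names (illustrative names only).
CRITERIA = {"o": "ok", "w": "warning", "u": "unknown",
            "c": "critical", "p": "pending"}


def notification_suppressed(master_state, failure_criteria):
    """Return True if a dependent service's notification is held back
    because the master service is in one of the listed states."""
    options = [opt.strip() for opt in failure_criteria.split(",")]
    if "n" in options:  # "n" (none): the dependency never suppresses
        return False
    return master_state in {CRITERIA[opt] for opt in options}


if __name__ == "__main__":
    # Master = "oVirt DC Status Check", criteria as in the definition above.
    for state in ("ok", "warning", "unknown", "critical", "pending"):
        print(state, notification_suppressed(state, "u,c,p"))
```

With "u,c,p" only the "ok" and "warning" states of the DC Status Check let
notifications of the dependent services through.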
With service dependencies for all check_rhev3 checks you will only
receive 2 notifications when the engine is down: 1 for the first service
check failing and 1 for the DC Status Check (this check is triggered
automatically when using service dependencies and one of the services
fails).
Please let me know if you need further information (service dependencies
can be quite confusing)...
Regards,
René
Thanks,
Itamar