SimpleEventCorrelator
The simple event correlator, SEC
from http://kodu.neti.ee/%7Eristo/sec/
SEC is an open source and platform independent event correlation tool that was designed to fill the gap between commercial event correlation systems and homegrown solutions that usually comprise a few simple shell scripts. SEC accepts input from regular files, named pipes, and standard input, and can thus be employed as an event correlator for any application that is able to write its output events to a file stream. The SEC configuration is stored in text files as rules, each rule specifying an event matching condition, an action list, and optionally a Boolean expression whose truth value decides whether the rule can be applied at a given moment. Regular expressions, Perl subroutines, etc. are used for defining event matching conditions. SEC can produce output events by executing user-specified shell scripts or programs (e.g., snmptrap or mail), by writing messages to pipes or files, and by various other means.
I wanted to use SEC with nagios to prevent page floods. Page floods happen to me when many servers running the exact same service are unavailable and nagios sends me a page for every service on every server. Simple event correlation was exactly the solution I was looking for. I've seen references to hooking up SEC with nagios, but I've never been able to find any details of anyone's implementation.
The first task was to hook up nagios to SEC. The main option here appeared to be status.dat, which nagios rewrites at regular intervals. I wrote a small Perl script to track changes, which I called nagios_tail.pl (there were some similar programs on the net, but I ran into problems with them). This script picks up all status updates (including soft state changes) and feeds them into SEC. It generates a log whose entries look like this:
SERVICE, hostname, servicename, state, check output
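As a rough illustration of the idea (not the actual nagios_tail.pl, which is a Perl script and is not reproduced here), the following Python sketch compares two snapshots of status.dat and emits one log line per service whose state changed. The block names ("service" or "servicestatus", depending on the nagios version), the use of state names rather than numbers in the log line, and the function names are all assumptions made for this example.

```python
# Sketch only: block names and state-name mapping are assumptions,
# not taken from nagios_tail.pl.
SERVICE_BLOCKS = ("service", "servicestatus")
STATE_NAMES = {"0": "OK", "1": "WARNING", "2": "CRITICAL", "3": "UNKNOWN"}


def parse_services(text):
    """Return {(host, service): (state, output)} for each service block."""
    services = {}
    block = None  # dict of key=value pairs while inside a service block
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            name = line[:-1].strip()
            block = {} if name in SERVICE_BLOCKS else None
        elif line == "}":
            if block is not None:
                key = (block.get("host_name", ""),
                       block.get("service_description", ""))
                state = STATE_NAMES.get(block.get("current_state", ""), "UNKNOWN")
                services[key] = (state, block.get("plugin_output", ""))
            block = None
        elif block is not None and "=" in line:
            field, value = line.split("=", 1)
            block[field.strip()] = value.strip()
    return services


def diff_lines(old, new):
    """Build one log line for every service whose state differs from before."""
    lines = []
    for (host, service), (state, output) in sorted(new.items()):
        previous = old.get((host, service))
        if previous is None or previous[0] != state:
            lines.append(f"SERVICE, {host}, {service}, {state}, {output}")
    return lines
```

A caller would re-read status.dat whenever its modification time changes, call parse_services on the new contents, and print the result of diff_lines against the previous snapshot.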
After some more reading I realized a nagios plugin might be a better solution.
The next step was to get SEC reading the output of nagios_tail.pl. To accomplish this, I set up a task in sec.conf to run on the internal SEC startup event:
# startup task to invoke nagios_tail.pl to monitor nagios service changes
type=Single
ptype=SubStr
pattern=SEC_STARTUP
context=SEC_INTERNAL_EVENT
desc=monitor new nagios events
action=spawn /apps/nagios23/sec/nagios_tail.pl /apps/nagios23/var/status.dat
Note that the "-intevents" option must be passed to sec on startup in order to enable the internal SEC events. This is convenient, since it's easy to disable nagios_tail.pl by omitting this flag when testing with input from stdin.
My nagios config contained lots of servers owned by multiple application groups in multiple datacenters. Ultimately, I wanted to correlate events of each service across all servers in the same datacenter. When a problem was seen with a service on more than one server within a datacenter, SEC would recognize the situation and submit a passive check to nagios. This would send a page to me notifying me of multiple problems in the datacenter, and meanwhile prevent notification by each of the individual servers.
I didn't want to have to write rules for every service; I wanted a single generic set of rules. I decided to use a naming standard for the services, in which each service name begins with the application name and the datacenter.
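To make the convention concrete, here is a minimal Python sketch of how a service name could be split into its parts. It assumes names of the form "APP DC description", where the application and datacenter tokens contain no spaces. The regular expression and the function name are hypothetical and are not taken from the actual sec.conf.

```python
import re

# Assumed convention: "<APP> <DC> <free-form description>".
NAME_RE = re.compile(r"^(?P<app>\S+)\s+(?P<dc>\S+)\s+(?P<description>.+)$")


def split_service_name(service):
    """Return (app, dc, description), or None if the name does not fit."""
    match = NAME_RE.match(service)
    if match is None:
        return None
    return match.group("app"), match.group("dc"), match.group("description")
```

Because one pattern covers every application and datacenter, a single generic rule set can key its state on the full service name instead of needing one rule per service.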
First the nagios config. I created a host and a hostgroup for the simple event correlator.
define host{
    host_name       EventCorrelator
    address         127.0.0.1
    alias           Simple Event Correlator
    check_command   check_dummy
    use             default_host
}
define hostgroup{
    hostgroup_name  SEC
    alias           Simple Event Correlator
    members         EventCorrelator
}
To set up the SEC service in the nagios config, first associate the check with all hosts running the service in the datacenter (DC) for an application group (APP). Then create a passive service for the SEC check.
define service{
    use                     default_app_check
    hostgroup_name          APP-DC
    service_description     APP DC Service Description
    check_command           check_xyz!123
}
define service{
    use                     default_app_check
    host_name               EventCorrelator
    service_description     SEC APP DC Service Description
    max_check_attempts      1
    active_checks_enabled   0
    check_command           check_dummy
}
Finally, notifications of the service on each server should be set up to depend on the SEC check associated with that service within the datacenter via a servicedependency.
# sec checks for multiple service problems
define servicedependency{
    hostgroup_name                  SEC
    service_description             SEC APP DC Service Description
    dependent_hostgroup_name        APP-DC
    dependent_service_description   APP DC Service Description
    notification_failure_criteria   w,c,u
}
I ended up with this sec.conf (since I wrote this I've been reading through the docs and have come up with a couple of pretty good ideas for improvements).
To prevent false positives, when a new problem is seen on a service within a datacenter, a notification delay context is set up. If any problem with the service in the datacenter is still unresolved when the context expires, a passive failure is submitted to nagios for the SEC check. Thanks to the service dependency, the individual servers won't all send pages. So I get one page per datacenter instead of 10+.
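The delay logic described above can be modelled outside SEC. The Python sketch below is not a translation of the actual sec.conf. It is a simplified model under stated assumptions: events carry a host, a service name and a state; time is passed in explicitly as seconds; and a recovery result (return code 0) is emitted once every host is OK again, which is an assumption of this example. The class and method names are hypothetical.

```python
class DelayedCorrelator:
    """Toy model of a per-service notification delay context."""

    def __init__(self, delay_seconds=180):
        self.delay = delay_seconds
        self.failing = {}     # service -> set of hosts currently not OK
        self.deadline = {}    # service -> time at which the delay expires
        self.alerted = set()  # services with a passive failure outstanding

    def handle(self, now, host, service, state):
        """Feed one event; return a list of (service, return_code) results."""
        hosts = self.failing.setdefault(service, set())
        if state == "OK":
            hosts.discard(host)
            if not hosts:
                # Everything recovered: cancel the pending delay, if any.
                self.deadline.pop(service, None)
                if service in self.alerted:
                    self.alerted.discard(service)
                    return [(service, 0)]
            return []
        first_problem = not hosts
        hosts.add(host)
        if first_problem:
            # New problem on this service: start the delay window.
            self.deadline[service] = now + self.delay
        return []

    def tick(self, now):
        """Expire delay windows; report services that are still failing."""
        results = []
        for service, when in list(self.deadline.items()):
            if now >= when:
                del self.deadline[service]
                if self.failing.get(service):
                    self.alerted.add(service)
                    results.append((service, 2))
        return results
```

A problem that clears before the window expires produces no result at all, which is what filters out the false positives.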
This monitors all services matching the naming convention and submits passive checks to nagios when it sees problems. It's only necessary to define checks in nagios that you actually want to use with SEC; the rest will simply be ignored by nagios.
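One common way to submit a passive service result is to write a PROCESS_SERVICE_CHECK_RESULT line to the nagios external command file. How the actual sec.conf submits its results is not shown here, so the Python sketch below is only an illustration of that line format. The function names are hypothetical, and the command file path is left as a parameter because it depends on the installation.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one external command line for a passive service check result."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def submit(command_file, line):
    """Write the line to the nagios command file (normally a named pipe)."""
    with open(command_file, "w") as handle:
        handle.write(line)
```

For the SEC check defined earlier, the host would be EventCorrelator and the service "SEC APP DC Service Description", with return code 2 for a critical result and 0 for a recovery.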
The timing of the notification delay was a bit tricky to get just right. Our services are set up to require 5 failures, at one failure per minute, before sending out pages. Setting a 3 minute notification delay in SEC leaves enough time for nagios_tail.pl to notice the update in status.dat and feed it to SEC, and for SEC to submit the passive alert and have it recognized by nagios.
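The timing budget can be written out as simple arithmetic. This Python sketch assumes the first failed check happens at time zero and each retry follows one interval later, so the fifth failure lands four intervals after the first. The pipeline_latency parameter is a hypothetical stand-in for the combined time nagios_tail.pl, SEC and nagios need to pass the result along; no measured value is implied.

```python
def page_margin_seconds(attempts=5, retry_interval=60, sec_delay=180,
                        pipeline_latency=0):
    """Seconds left between the SEC result arriving and the final failed check."""
    final_failure_at = (attempts - 1) * retry_interval  # first failure at t=0
    sec_result_at = pipeline_latency + sec_delay
    return final_failure_at - sec_result_at
```

Under these assumptions the default values leave a margin of one minute. A negative result would mean the individual servers could page before the SEC check has failed.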
One of the trickier bits I've run into is testing the config. I came up with a simple automated tester that checks for the creation and deletion of contexts based on test cases. I use it to test the above sec.conf. The script is currently just called run_tests.pl.