SimpleEventCorrelator


The simple event correlator, SEC

UPDATE

I am no longer using the SimpleEventCorrelator. I now have a much better solution; take a look at GridPanoptes.

Pointers

Info

SEC is an open source and platform independent event correlation tool that was designed to fill the gap between commercial event correlation systems and homegrown solutions that usually comprise a few simple shell scripts. SEC accepts input from regular files, named pipes, and standard input, and can thus be employed as an event correlator for any application that is able to write its output events to a file stream. The SEC configuration is stored in text files as rules, each rule specifying an event matching condition, an action list, and optionally a Boolean expression whose truth value decides whether the rule can be applied at a given moment. Regular expressions, Perl subroutines, etc. are used for defining event matching conditions. SEC can produce output events by executing user-specified shell scripts or programs (e.g., snmptrap or mail), by writing messages to pipes or files, and by various other means.

Nagios Integration

I wanted to use SEC with nagios to prevent page floods. Page floods happen to me when many servers running the exact same service are unavailable and nagios sends me a page for every service on every server. Simple event correlation was exactly the solution I was looking for. I've seen references to hooking up SEC with nagios, but I've never been able to find any details of anyone's implementation.

The first task was to hook up nagios to SEC. The main option here appeared to be the status.dat file, which nagios rewrites at regular intervals. I created a little Perl script to track changes, which I called nagios_tail.pl (there were some similar programs on the net, but I ran into problems with them). This script picks up all status updates (including soft state changes) and feeds them into SEC. The script generates a log which contains entries that look like this:

SERVICE, hostname, servicename, state, check output
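The change tracking described above can be sketched roughly as follows. This is not the original nagios_tail.pl (which was written in Perl and is not reproduced here); it is a minimal Python sketch under some assumptions: that status.dat consists of `name {` blocks holding `key=value` lines, that service blocks are named `service` or `servicestatus` depending on the nagios version, and that a change is detected by comparing `current_state` between two snapshots. The function names are made up for illustration.

```python
import re

# Numeric service states as nagios writes them in status.dat.
STATE_NAMES = {"0": "OK", "1": "WARNING", "2": "CRITICAL", "3": "UNKNOWN"}


def parse_services(text):
    """Return {(host, service): (state, output)} for one status.dat snapshot."""
    services = {}
    block = None  # dict of the service block being read, or None outside one
    for raw in text.splitlines():
        line = raw.strip()
        if block is None:
            # Assumption: the block header name differs between nagios versions.
            if re.match(r"^(service|servicestatus)\s*\{$", line):
                block = {}
        elif line == "}":
            key = (block.get("host_name", ""), block.get("service_description", ""))
            state = STATE_NAMES.get(block.get("current_state", ""), "UNKNOWN")
            services[key] = (state, block.get("plugin_output", ""))
            block = None
        elif "=" in line:
            name, _, value = line.partition("=")
            block[name] = value
    return services


def changed_entries(previous, current):
    """Format one log line per service whose state differs from the last snapshot."""
    lines = []
    for (host, service), (state, output) in sorted(current.items()):
        old_state = previous.get((host, service), (None, None))[0]
        if old_state != state:
            lines.append("SERVICE, %s, %s, %s, %s" % (host, service, state, output))
    return lines
```

A caller would re-read status.dat whenever it changes, call `parse_services` on the new contents, print whatever `changed_entries` returns, and keep the new snapshot for the next comparison.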

After some more reading I realized a nagios plugin might be a better solution.

The next step was to get SEC reading the output of nagios_tail.pl. In order to accomplish this, I set up a task in sec.conf to run on the internal SEC startup event:

# startup task to invoke nagios_tail.pl to monitor nagios service changes
type=Single
ptype=SubStr
pattern=SEC_STARTUP
context=SEC_INTERNAL_EVENT
desc=monitor new nagios events
action=spawn /apps/nagios23/sec/nagios_tail.pl /apps/nagios23/var/status.dat

Note that the "-intevents" option must be passed to sec on startup in order to enable the internal sec events. This is convenient, since it's easy to disable nagios_tail.pl by omitting this flag when testing with input from stdin.

My nagios config contained lots of servers owned by multiple application groups in multiple datacenters. Ultimately, I wanted to correlate events of each service across all servers in the same datacenter. When a problem was seen with a service on more than one server within a datacenter, SEC would recognize the situation and submit a passive check to nagios. This would send a page to me notifying me of multiple problems in the datacenter, and meanwhile prevent notification by each of the individual servers.

I didn't want to have to write rules for every service; I wanted a single generic set of rules. I ended up deciding to use a naming standard for the services, in which each service name begins with the application name and the datacenter.
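To illustrate how one generic rule set can key off such a naming standard, here is a small Python sketch. It assumes the service description has the form "APP DC Service Description" with space-separated tokens, and that the matching correlator check is named by prefixing "SEC ", as in the nagios config further down. The function names are hypothetical.

```python
def split_service_name(service_description):
    """Split 'APP DC Rest of name' into (app, dc, rest), or None if it doesn't fit."""
    parts = service_description.split(None, 2)
    if len(parts) < 3:
        return None  # does not follow the naming standard; ignore it
    return tuple(parts)


def sec_check_name(service_description):
    """Name of the correlator's passive check for a service, or None."""
    parts = split_service_name(service_description)
    if parts is None:
        return None
    return "SEC %s %s %s" % parts
```

With this convention, every server in the same datacenter running the same application check maps to the same correlator check name, so no per-service rules are needed.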

First, the nagios config. I created a host and a hostgroup for the simple event correlator.

define host{
  host_name             EventCorrelator
  address               127.0.0.1
  alias                 Simple Event Correlator
  check_command         check_dummy
  use                   default_host
}

define hostgroup{
  hostgroup_name        SEC
  alias                 Simple Event Correlator
  members               EventCorrelator
}

To set up the SEC service in the nagios config, first associate the check with all hosts running the service in the datacenter (DC) for an application group (APP). Then create a passive service for the SEC check.

define service{
  use                           default_app_check
  hostgroup_name                APP-DC
  service_description           APP DC Service Description
  check_command                 check_xyz!123
}

define service{
  use                           default_app_check
  host_name                     EventCorrelator
  service_description           SEC APP DC Service Description
  max_check_attempts            1
  active_checks_enabled         0
  check_command                 check_dummy
}

Finally, notifications of the service on each server should be set up to depend on the SEC check associated with that service within the datacenter via a servicedependency.

# sec checks for multiple service problems
define servicedependency{
  hostgroup_name                        SEC
  service_description                   SEC APP DC Service Description
  dependent_hostgroup_name              APP-DC
  dependent_service_description         APP DC Service Description
  notification_failure_criteria         w,c,u
}

I ended up with this sec.conf (since I wrote this I've been reading through the docs and have come up with a couple of pretty good ideas for improvements).

To prevent false positives, when a new problem is seen on a service within a datacenter, a notification delay context is set up. If the problems with that service within the datacenter have not all been resolved by the time the context expires, a passive failure is submitted to nagios for the SEC check. Thanks to the service dependency, the individual servers won't all send pages. So I get one page per datacenter instead of 10+.
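The delay logic can be modeled outside of SEC to make the behavior easier to reason about. The following is a Python sketch of the idea only, not the sec.conf rules themselves: the class name, the method names, and the use of plain numeric timestamps are all assumptions made for illustration.

```python
class DelayCorrelator:
    """Models a notification delay per service group (e.g. 'APP DC HTTP')."""

    def __init__(self, delay_seconds=180):
        self.delay = delay_seconds
        self.failing = {}   # group -> set of hosts with an open problem
        self.deadline = {}  # group -> time at which the delay expires

    def event(self, now, group, host, state):
        """Record one status update for a host in a service group."""
        hosts = self.failing.setdefault(group, set())
        if state == "OK":
            hosts.discard(host)
            if not hosts:
                # Everything recovered: drop the group and any pending delay.
                self.failing.pop(group)
                self.deadline.pop(group, None)
        else:
            if not hosts:
                # First problem in this group: start the notification delay.
                self.deadline[group] = now + self.delay
            hosts.add(host)

    def expire(self, now):
        """Return groups whose delay ran out while problems were still open."""
        due = sorted(g for g, t in self.deadline.items() if now >= t)
        for group in due:
            del self.deadline[group]  # report each expiry only once
        return due
```

Each group returned by `expire` corresponds to one passive failure for the SEC check, which is what replaces the flood of per-server pages.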

This monitors all services matching the naming convention and submits passive checks to nagios when it sees problems. It's only necessary to define checks in nagios that you actually want to use with SEC; the rest will simply be ignored by nagios.
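For reference, a passive service result is usually handed to nagios as a PROCESS_SERVICE_CHECK_RESULT external command. The Python sketch below only formats such a line; it assumes the host and service names from the config above, and it leaves out the write to the nagios external command file, whose location depends on the installation. The function name is made up for illustration.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line.

    return_code follows the plugin convention: 0 OK, 1 WARNING,
    2 CRITICAL, 3 UNKNOWN.
    """
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        timestamp, host, service, return_code, output)
```

For example, `passive_result_line("EventCorrelator", "SEC APP DC Service Description", 2, "multiple servers failing")` produces a critical result for the correlator's check. If no matching service is defined in nagios, the result is simply ignored, as noted above.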

The timing of the notification delay was a bit tricky to get just right. Our services are set up to require 5 failures at one failure per minute before sending out pages. Setting a 3 minute notification delay in SEC left enough time for nagios_tail.pl to notice the update in status.dat and feed it to SEC, and for SEC to submit the passive alert and have it recognized by nagios, before the individual servers reach their paging threshold.

Automated Testing

One of the trickier bits I've run into is testing the config. I came up with a simple automated tester that checks for the creation and deletion of contexts based on test cases. I use it to test the above sec.conf. The script is currently just called run_tests.pl.
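The shape of such a tester can be sketched as a table of test cases, each listing input lines and the contexts expected to exist afterwards. The Python sketch below is not run_tests.pl; it uses a hypothetical stand-in engine instead of driving a real SEC process, and the `DELAY_` context naming is an assumption for illustration only.

```python
class StandInEngine:
    """Hypothetical stand-in for SEC: one context per service with open problems."""

    def __init__(self):
        self._open = {}  # service -> set of hosts currently failing

    def feed(self, line):
        # Input lines use the "SERVICE, host, service, state, output" format.
        _kind, host, service, state, _output = [
            field.strip() for field in line.split(",", 4)]
        hosts = self._open.setdefault(service, set())
        if state == "OK":
            hosts.discard(host)
        else:
            hosts.add(host)
        if not hosts:
            del self._open[service]  # last problem resolved: context goes away

    def contexts(self):
        return {"DELAY_" + service for service in self._open}


def run_tests(cases, make_engine):
    """Run (name, input_lines, expected_contexts) cases; return the failures."""
    failures = []
    for name, lines, expected in cases:
        engine = make_engine()  # fresh engine so cases don't affect each other
        for line in lines:
            engine.feed(line)
        actual = engine.contexts()
        if actual != set(expected):
            failures.append((name, sorted(actual)))
    return failures
```

An empty list from `run_tests` means every case saw exactly the contexts it expected, covering both creation (a problem arrives) and deletion (the last problem clears).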

Updated Sat Jan 20, 2007 3:16 PM