{"id":1980,"date":"2011-03-25T13:09:02","date_gmt":"2011-03-25T12:09:02","guid":{"rendered":"http:\/\/www.devco.net\/?p=1980"},"modified":"2011-03-25T15:15:46","modified_gmt":"2011-03-25T14:15:46","slug":"monitoring_framework_event_correlation","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2011\/03\/25\/monitoring_framework_event_correlation.php","title":{"rendered":"Monitoring Framework: Event Correlation"},"content":{"rendered":"

Since my last post I’ve spoken to a lot of people, all excited to see something fresh in the monitoring space. The main thing I learned is that no one tool will please everyone. This is why monitoring systems are so hated: they try to impose their world view, they’re hard to hack on and it’s hard to get data out of them. This only reinforced my belief that rather than build a new monitoring system I should build a framework that can build monitoring systems. <\/p>\n

DevOps shops who can cut code should be able to build the monitoring they want, not the monitoring their vendor thought they wanted.<\/em><\/p>\n

Thus my focus has not been on how I can declare relationships between services, or how I can declare an escalation matrix. My focus has been on events and how events relate to each other.<\/p>\n

Identifying an Event<\/strong>
\nEvents can come from many places. In the recent video demo I did<\/a> you saw events from Nagios and events from MCollective. I also have event bridges for my Apache Blackbox<\/a> and SNMP Traps, and it would be trivial to support events from GitHub commit hooks, Amazon SNS<\/a> and really any conceivable source.<\/p>\n

Events therefore need to be identified, so that information about the same event can arrive from many sources. Your trap system might raise a trap about a port on a switch while your stats poller emits regular packet counts. You need to know these two are for the same port. <\/p>\n

You identify events by subject<\/em> and by name<\/em>; together they make up the event identity. The subject might be the FQDN of a host and the name might be load<\/em> or cpu usage<\/em>. <\/p>\n

This way if you have many ways to input information related to some event you just need to identify them correctly.<\/p>\n

Finally, as each event is stored it is given a unique ID that you can use to pull out information about just that specific instance of an event.<\/p>\n
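To make the identity idea concrete, here is a minimal Python sketch. It is not the framework's actual code: the Event and EventStore classes, the origin and data fields, the in-memory storage and the example host and port names are all assumptions made for illustration. It only shows how a subject and name pair can tie together events from different sources, while each stored instance still gets its own unique ID.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Event:
    # All field names here are hypothetical, chosen for this sketch only.
    subject: str                  # e.g. the FQDN of a host
    name: str                     # e.g. 'load' or 'cpu usage'
    origin: str                   # which source or bridge emitted the event
    data: dict = field(default_factory=dict)
    event_id: str = ''            # assigned when the event is stored

    @property
    def identity(self):
        # Subject and name together make up the event identity.
        return (self.subject, self.name)


class EventStore:
    # In-memory stand-in for whatever storage a real system would use.
    def __init__(self):
        self._by_id = {}
        self._by_identity = defaultdict(list)

    def store(self, event):
        # Every stored instance gets its own unique ID.
        event.event_id = uuid.uuid4().hex
        self._by_id[event.event_id] = event
        self._by_identity[event.identity].append(event.event_id)
        return event.event_id

    def get(self, event_id):
        # Fetch one specific instance of an event.
        return self._by_id[event_id]

    def history(self, subject, name):
        # All stored instances sharing an identity, in insertion order.
        ids = self._by_identity.get((subject, name), [])
        return [self._by_id[i] for i in ids]


store = EventStore()
trap_id = store.store(
    Event('switch1.example.com', 'port 12', 'snmp trap', {'state': 'down'}))
stats_id = store.store(
    Event('switch1.example.com', 'port 12', 'stats poller', {'packets': 1042}))

# Both sources reported on the same identity, but as separate instances.
for event in store.history('switch1.example.com', 'port 12'):
    print(event.event_id, event.origin, event.data)
```

In this sketch the trap and the packet count land under the same identity because both sources agreed on the subject and name, which is the only contract the sources need to share.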

Types Of Event<\/strong>
\nI have identified a couple of types of event in the first iteration:<\/p>\n

<\/p>\n