Thinking about monitoring frameworks

03/19/2011

I’ve been Tweeting a bit about some prototyping of a monitoring tool I’ve been doing and had a big response from people all agreeing something has to be done.

Monitoring is something I’ve been thinking about for ages but to truly realize my needs I needed mature discovery based network addressing and ways to initiate commands on large amounts of hosts in a parallel manner. I have this now in the MCollective and I feel I can start exploring some ideas of how I might build a monitoring platform.

I won’t go into all my wishes, but I’ll list a few big ones as far as monitoring is concerned:

  • Current tools represent a sliding scale, you cannot look at your monitoring tool and ever know current state. Reported status might be a window of 10 minutes and in some cases much longer.
  • Monitoring tools are where data goes to die. Trying to get data out of Nagios and into tools like Graphite, OpenTSDB or really anywhere else is a royal pain. The problem get much harder if you have many Nagios probes. NDO is an abomination as is storing this kind of data in MySQL. Commercial tools are orders of magnitude worse.
  • Monitoring logic is not reusable. Today with approaches like continuous deployment you need your monitoring logic to be reusable by many different parties. Deployers should be able to run the same logic on demand as your scheduled monitoring does.
  • Configuration is a nightmare of static text, or worse click driven databases. People mitigate this with CM tools but there is still a long turn around time from node creation to monitored. This is not workable in modern cloud based and dynamic systems.
  • Shops with skilled staff are constantly battling decades old tools if they want to extend it to create metrics driven infrastructure. It’s all just too ’90s.
  • It does not scale. My simple prototype can easily do 300+ checks a second, including processing replies, archiving, alert logic and feeding external tools like Graphite. On a GBP20/month virtual machine. This is inconceivable with most of the tools we have to deal with.

I am prototyping some ideas at the moment to build a framework to build monitoring systems with.

There’s a single input queue on a middleware system, I expect an event in this queue – mine is a queue distributed over 3 countries and many instances of ActiveMQ.

The event can come from many places maybe from a commit hook at GitHub, fed in from Nagios performance data or by MCollective or Pingdom, the source of data is not important at all. It’s just a JSON document that has some structure – you can send in any data in addition to a few required fields, it’ll happily store the lot.

From there it gets saved into a capped collection on MongoDB in its entirety and gets given an eventid. It gets broken into its status parts and its metric parts and sent to any number of recipient queues. In the case of Metrics for example I have something that feeds Graphite, you can have many of these all active concurrently. Just write a small consumer for a queue in any language and do with the events whatever you want.

In the case of statusses it builds a MongoDB collection that represents the status of an event in relation to past statusses etc. This will notice any state transition and create alert events, alert events again can go to many destinations – right now I am sending them to Angelia, but there could be many destinations with different filtering and logic for how that happens. If you want to build something to alert based on trends of past metric data, no problem. Just write a small plugin, in any language, and plug it into the message flow.

At any point through this process the eventid is available and should you wish to get hold of the original full event its a simple lookup away – there you can find all the raw event data that you sent – stored for quick retrieval in a schemaless manner.

In effect this is a generic plugable event handling system. I currently feed it from MCollective using a modified NRPE agent and I am pushing my Nagios performance data in real time. I have many Nagios servers distributed globally and they all just drop events into a their nearest queue entry point.

Given that it’s all queued and persisted to disk I can create really vast amount of alerts using MCollective – it’s trivial for me to create 1000 check results a second. The events have the timestamp attached of when the check was done and even if the consumers are not keeping up the time series databases will get the events in the right order and right timestamps. So far on a small VM that runs Puppetmaster, MongoDB, ActiveMQ, Redmine and a lot of other stuff I am very comfortably sending 300 events a second through this process without even tuning or trying to scale it.

When I look at a graph of 50 servers load average I see the graph change at the same second for all nodes – because I have an exact single point in time view of my server estate, and what 50 servers I am monitoring in this manner is done using discovery on MCollective. Discovery is obviously no good for monitoring in general – you dont know the state of stuff you didn’t discover – but MCollective can build a database of truth using registration – correlate discovery against registration and you can easily identify missing things.

A free side effect of using an async queue is that horizontal scaling comes more or less for free, all I need to do is start more processes consuming the same queue – maybe even on a different physical server – and more capacity becomes available.

So this is a prototype, its not open source – I am undecided what I will do with it, but I am likely to post some more about its design and principals here. Right now I am only working on the event handling and routing aspects as the point in time monitoring is already solved for me as is my configuration of Nagios, but those aspects will be mixed into this system in time.

There’s a video of the prototype receiving monitor events over mcollective and feeding Loggly for alerts here.