I’ve been tweeting a bit about a monitoring tool I’ve been prototyping, and had a big response from people all agreeing that something has to be done.
Monitoring is something I’ve been thinking about for ages, but to truly realize my needs I needed mature discovery-based network addressing and a way to initiate commands on large numbers of hosts in parallel. I have this now in MCollective, and I feel I can start exploring some ideas of how I might build a monitoring platform.
I won’t go into all my wishes, but I’ll list a few big ones as far as monitoring is concerned:
- Current tools represent a sliding window of the recent past; you can never look at your monitoring tool and know the current state. Reported status might reflect a window of 10 minutes, and in some cases much longer.
- Monitoring tools are where data goes to die. Trying to get data out of Nagios and into tools like Graphite, OpenTSDB or really anywhere else is a royal pain. The problem gets much harder if you have many Nagios probes. NDO is an abomination, as is storing this kind of data in MySQL. Commercial tools are orders of magnitude worse.
- Monitoring logic is not reusable. Today with approaches like continuous deployment you need your monitoring logic to be reusable by many different parties. Deployers should be able to run the same logic on demand as your scheduled monitoring does.
- Configuration is a nightmare of static text or, worse, click-driven databases. People mitigate this with CM tools, but there is still a long turnaround time from node creation to being monitored. This is not workable in modern, dynamic, cloud-based systems.
- Shops with skilled staff are constantly battling decades-old tools if they want to extend them to create metrics-driven infrastructure. It’s all just too ’90s.
- It does not scale. My simple prototype can easily do 300+ checks a second, including processing replies, archiving, alert logic and feeding external tools like Graphite. On a GBP20/month virtual machine. This is inconceivable with most of the tools we have to deal with.
I am prototyping some ideas at the moment for a framework for building monitoring systems.
There’s a single input queue on a middleware system into which I expect events to arrive – mine is a queue distributed over 3 countries and many instances of ActiveMQ.
The event can come from many places: maybe from a commit hook at GitHub, fed in from Nagios performance data, or from MCollective or Pingdom – the source of the data is not important at all. It’s just a JSON document with some structure: you can send in any data in addition to a few required fields, and it’ll happily store the lot.
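The post doesn’t spell out what the required fields are, so purely as an illustration – the field names below are my guess, not the actual schema – an incoming check result might look something like this:

```python
# A hypothetical check-result event; the real required fields may differ.
event = {
    "origin":    "mcollective",              # who or what submitted the event
    "subject":   "web1.example.com",         # the host or thing it is about
    "check":     "load",                     # the check that produced it
    "status":    1,                          # e.g. 0 = ok, 1 = warning, 2 = critical
    "text":      "WARNING - load average: 8.12, 7.90, 6.51",
    "metrics":   {"load1": 8.12, "load5": 7.90, "load15": 6.51},
    "timestamp": 1291043100,                 # when the check actually ran
    "extra":     {"datacenter": "eu-west"},  # arbitrary additional data is stored as-is
}
```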
From there it gets saved into a capped collection on MongoDB in its entirety and gets given an eventid. It gets broken into its status parts and its metric parts and sent to any number of recipient queues. In the case of metrics, for example, I have something that feeds Graphite, and you can have many of these all active concurrently. Just write a small consumer for a queue in any language and do with the events whatever you want.
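The prototype’s internals aren’t published, but the routing step could look roughly like this. It’s only a sketch in Python, using pymongo and stomp.py for illustration; the queue names, the 512 MB cap and using Mongo’s _id as the eventid are all my assumptions:

```python
import json
import stomp
from pymongo import MongoClient
from pymongo.errors import CollectionInvalid

mongo = MongoClient()
db = mongo.monitoring
try:
    # capped collection: fixed size, oldest events age out automatically
    db.create_collection("events", capped=True, size=512 * 1024 * 1024)
except CollectionInvalid:
    pass  # already created

mq = stomp.Connection([("localhost", 61613)])
mq.connect(wait=True)

def route(raw):
    event = json.loads(raw)

    # 1. archive the full event; reuse MongoDB's _id as the eventid
    eventid = str(db.events.insert_one(dict(event)).inserted_id)
    event["eventid"] = eventid

    # 2. break it into its parts and fan out to per-type queues;
    #    any number of consumers, in any language, can subscribe to these
    if "status" in event:
        mq.send("/queue/events.status", json.dumps(event))
    if event.get("metrics"):
        mq.send("/queue/events.metrics", json.dumps(event))
```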
In the case of statuses it builds a MongoDB collection that represents the status of an event in relation to past statuses etc. This will notice any state transition and create alert events; alert events again can go to many destinations – right now I am sending them to Angelia, but there could be many destinations with different filtering and logic for how that happens. If you want to build something to alert based on trends of past metric data, no problem. Just write a small plugin, in any language, and plug it into the message flow.
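Again just a sketch of the general idea, continuing the assumptions from the routing sketch above (a “status” collection keyed on host and check name; the real schema is unknown to me):

```python
def update_status(event):
    """Track the latest status per host/check and raise an alert event on change."""
    key = {"subject": event["subject"], "check": event["check"]}
    previous = db.status.find_one(key)

    # upsert the current status so the collection always reflects "now"
    db.status.update_one(
        key,
        {"$set": {"status":    event["status"],
                  "text":      event["text"],
                  "eventid":   event["eventid"],
                  "timestamp": event["timestamp"]}},
        upsert=True,
    )

    # a state transition becomes an alert event, which fans out like everything else
    if previous and previous["status"] != event["status"]:
        alert = dict(event, previous_status=previous["status"])
        mq.send("/queue/events.alerts", json.dumps(alert))
```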
At any point through this process the eventid is available, and should you wish to get hold of the original full event it’s a simple lookup away – there you can find all the raw event data that you sent, stored for quick retrieval in a schemaless manner.
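With the earlier assumption that the eventid is simply the MongoDB _id, retrieving the original document is a one-liner:

```python
from bson import ObjectId

def original_event(eventid):
    # the full raw document, exactly as it was submitted
    return db.events.find_one({"_id": ObjectId(eventid)})
```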
In effect this is a generic pluggable event handling system. I currently feed it from MCollective using a modified NRPE agent, and I am pushing my Nagios performance data in in real time. I have many Nagios servers distributed globally and they all just drop events into their nearest queue entry point.
Given that it’s all queued and persisted to disk I can create really vast numbers of events using MCollective – it’s trivial for me to create 1000 check results a second. Each event carries the timestamp of when the check was done, so even if the consumers are not keeping up, the time series databases will get the events in the right order and with the right timestamps. So far, on a small VM that also runs a Puppetmaster, MongoDB, ActiveMQ, Redmine and a lot of other stuff, I am very comfortably sending 300 events a second through this process without even tuning or trying to scale it.
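As an example of one of those metric consumers: Graphite’s plaintext protocol takes “path value timestamp” lines on port 2003, so carrying the original check time through is trivial. A minimal sketch, assuming a stomp.py 8-style listener and made-up hostnames and metric naming (a real feeder would keep the carbon socket open rather than reconnect per message):

```python
import json
import socket
import time
import stomp

CARBON = ("graphite.example.com", 2003)   # Graphite's plaintext listener

class GraphiteFeeder(stomp.ConnectionListener):
    def on_message(self, frame):
        event = json.loads(frame.body)
        host = event["subject"].replace(".", "_")
        lines = ["%s.%s.%s %f %d\n" % (host, event["check"], name, value,
                                       event["timestamp"])  # the original check time
                 for name, value in event["metrics"].items()]
        with socket.create_connection(CARBON) as carbon:
            carbon.sendall("".join(lines).encode())

conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("graphite", GraphiteFeeder())
conn.connect(wait=True)
conn.subscribe("/queue/events.metrics", id="graphite-feeder", ack="auto")

while True:   # keep the process alive; messages arrive on stomp.py's receiver thread
    time.sleep(1)
```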
When I look at a graph of 50 servers’ load averages I see the graph change at the same second for all nodes, because I have an exact single-point-in-time view of my server estate. Which 50 servers I monitor in this manner is determined using discovery on MCollective. Discovery is obviously no good for monitoring in general – you don’t know the state of stuff you didn’t discover – but MCollective can build a database of truth using registration; correlate discovery against registration and you can easily identify missing things.
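The correlation itself is then just a set difference. Assuming the registration agent writes each node into a MongoDB collection with a “hostname” field (the schema here is my assumption), something like:

```python
def missing_nodes(discovered_hostnames):
    """Nodes that registered at some point but did not answer discovery just now."""
    registered = {doc["hostname"] for doc in db.registration.find()}
    return registered - set(discovered_hostnames)
```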
A side effect of using an async queue is that horizontal scaling comes more or less for free: all I need to do is start more processes consuming the same queue – maybe even on a different physical server – and more capacity becomes available.
So this is a prototype; it’s not open source – I am undecided what I will do with it, but I am likely to post some more about its design and principles here. Right now I am only working on the event handling and routing aspects, as the point-in-time monitoring is already solved for me, as is my configuration of Nagios, but those aspects will be mixed into this system in time.
There’s a video of the prototype receiving monitor events over mcollective and feeding Loggly for alerts here.
I think this would be great, and I look forward to seeing it.
Yup – I’m liking it a lot. As you say, you get some stuff for free, the async is essential as you scale.
I suspect if you did get this to a good state, you may be onto something. Especially good for people who are already using mcollective.
Awesome prototype!!
I would love to see that project grow and solve real-time problems in Nagios filtering and reporting tools.
Have you planned some sort of rule engine? A year ago I made a rough draft of this… receiving an snmptrap that would be matched against a rule table (like a firewall), and if a trap with some foo/bar content on it matched a host/hostgroup/service I could change the normal behavior, like changing the message body, etc. Well, it was like an SNMP proxy triggered by a Nagios event handler.
Well, just 2 cents of an idea that I think you could find nice to have on loggly 🙂
have fun and good luck 🙂
hey,
some very cool stuff here! I’ve been thinking about monitoring as well recently, and there is way too much fluff out there that is either super complex or just not good enough / adaptable enough.
One thing I would add to your list of ‘things to wish for’ : predictable and manageable scheduling with a tunable tolerance.
– KB
How exactly does the registration work here for detecting missing bits?
@spacehobo mcollective registration creates a db of machines that have ever existed and what they’re configured to be. If you correlate actual received data against what is in the registration DB you get a diff == the missing bits.
Of course this assumes a machine was at some point able to talk to mcollective – but in my world, if they haven’t ever, it also means they have no software on them and effectively don’t exist.
Hi there,
if you wanna, drop me a mail – I just found this post via infra-talk (or query me on IRC in there, nick darkfaded). Thing is that my boss wrote some extensions to nagios that take away the stupidity-in-design-makes-me-cringe bits (NDO, database backend) and the scalability probs, to an extent where nagios gets useful for real deployments.
If you’re interested I could show you, e.g., the export interface *from* nagios, and even if it’s the wrong thing, maybe it gives you some ideas for your own tool.
Flo
Whatever happened with this project? It was looking really interesting!
@anothersmith I am still really keen on it, but as you can imagine it’s a huge undertaking – more than one person can do in his spare time – so some kind of sponsor has to be found, or it has to be done as part of employment, etc.