{"id":2019,"date":"2011-05-04T21:51:07","date_gmt":"2011-05-04T20:51:07","guid":{"rendered":"http:\/\/www.devco.net\/?p=2019"},"modified":"2011-05-04T22:51:38","modified_gmt":"2011-05-04T21:51:38","slug":"monitor_framework_minimal_configuration","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2011\/05\/04\/monitor_framework_minimal_configuration.php","title":{"rendered":"Monitor Framework: Minimal Configuration"},"content":{"rendered":"

This is a follow-up post to other posts I’ve done regarding a new breed of monitoring that I hope to create.

I’ve had some time to think about the configuration of monitoring. This is a big pain point in all monitoring systems: many require you to configure all your resources, dependencies and so on, often in text files; others have an API you can automate against; and the worst ones have only a GUI.

In the modern world where we have configuration management this ends up being a lot of duplication: your CM system already knows about interdependencies, its facts system could know about contacts for a machine, and I am sure we could derive a lot more information from its catalogs. Today bigger sites tend to automate the building of monitor config files using their CM systems, but this tends to be slow to adapt to network conditions and it’s quite a lot of work.
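To make the status quo concrete, here is a minimal sketch of that kind of config generation – an ERB template rendering Nagios-style service definitions from a hypothetical node list exported by the CM system; all names here are illustrative:

```ruby
require 'erb'

# Hypothetical node data as a CM system might export it
nodes = [
  {"name" => "smtp1.example.com", "classes" => ["exim::smarthost"]},
  {"name" => "smtp2.example.com", "classes" => ["exim::smarthost"]}
]

# Render a Nagios-style service check for every exported smart host
template = ERB.new(<<'EOT')
<% nodes.each do |node| %>
define service {
    host_name             <%= node["name"] %>
    service_description   mailq
    check_command         check_nrpe!check_mailq
}
<% end %>
EOT

puts template.result(binding)
```

Every time a host is added or a class changes the files have to be regenerated and the monitor reloaded, which is where the slowness comes from.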

I spoke a bit about this in the CMDB session at Puppet Camp, so I thought I’d lay my thoughts down somewhere proper as I’ve been talking about this already.

I’ve previously blogged about using MCollective for monitoring based on discovery. In that post I pointed out that not all things are appropriate to monitor using this method, as you don’t know what is down. There is an approach to solving this problem though: MCollective supports building databases of what should be there – it’s called Registration. By correlating the discovered information with the registered information you can infer what is absent, unknown or down.
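A minimal sketch of that correlation, assuming hypothetical registered and discovered host lists in place of a real registration database and discovery run:

```ruby
# Hypothetical inputs: hostnames from the registration database
# and hostnames that answered a live MCollective discovery
registered_nodes = ["smtp1.example.com", "smtp2.example.com", "db1.example.com"]
discovered_nodes = ["smtp1.example.com", "db1.example.com"]

# Anything registered but not discovered is absent, unknown or down
missing = registered_nodes - discovered_nodes
missing.each { |node| puts "#{node} should exist but did not respond" }

# Anything discovered but not registered is unexpected
unexpected = discovered_nodes - registered_nodes
unexpected.each { |node| puts "#{node} responded but is not registered" }
```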

Ideally this is as much configuration as I want to have for monitoring mail queue sizes on all my smart hosts:


```ruby
scheduler.every '1m' do
    nrpe("check_mailq", :cf_class => "exim::smarthost")
end
```

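The `nrpe` helper in that snippet does not exist yet; a rough sketch of what it might do, assuming the MCollective RPC client and the NRPE agent’s runcommand action, with rufus-scheduler providing the scheduler:

```ruby
require 'rubygems'
require 'rufus/scheduler'
require 'mcollective'

include MCollective::RPC

scheduler = Rufus::Scheduler.start_new

# Hypothetical helper: run an NRPE check on every node that has the
# given Puppet class, discovering the nodes fresh at call time
def nrpe(command, filters)
  mc = rpcclient("nrpe")
  mc.class_filter filters[:cf_class] if filters[:cf_class]
  mc.progress = false

  mc.runcommand(:command => command).each do |result|
    puts "#{result[:sender]}: #{result[:statusmessage]}"
  end

  mc.disconnect
end

scheduler.every '1m' do
  nrpe("check_mailq", :cf_class => "exim::smarthost")
end

scheduler.join
```

Because discovery happens on every run, newly built smart hosts get monitored the moment they answer on the middleware, with no config regeneration step.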

This still leaves a huge problem: I can ask for a specific service to be monitored on a subset of machines, but I cannot infer parent-child relationships or know who to notify.

Now, as I am using Puppet to declare these services and using Puppet-based discovery to select which machines to monitor, I would like to declare parent-child relationships in Puppet too – even cross-node ones.

The approach we are currently testing is to load the catalogs for all my machines into Neo4J – a purpose-built graph database. I declare the relationships in the manifests and post-process the graph to create the cross-node links.
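A minimal sketch of that post-processing using the Neography gem against the Neo4j REST API; the node properties and the depends_on relationship name are illustrative assumptions:

```ruby
require 'rubygems'
require 'neography'

neo = Neography::Rest.new

# Illustrative: one resource from each machine's compiled catalog
exim_service  = neo.create_node("resource" => "Service[exim]",  "host" => "smtp1")
mysql_service = neo.create_node("resource" => "Service[mysql]", "host" => "db1")

# Post-processing step: turn a dependency declared in the manifests
# into an edge linking the two machines' otherwise separate graphs
neo.create_relationship("depends_on", exim_service, mysql_service)
```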

This means we have a huge graph of graphs containing all inter-node dependencies. The image shows visually how a small part of this might look: here we have an Exim service that depends on a database on a different machine, because we use a MySQL-based Greylisting service.
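Continuing the sketch above, finding everything that would be affected if that MySQL service died becomes a traversal of incoming depends_on edges:

```ruby
# Walk incoming "depends_on" relationships from the MySQL service node
# to find every resource that directly or indirectly depends on it
affected = neo.traverse(mysql_service, "nodes",
                        {"order"         => "breadth first",
                         "uniqueness"    => "node global",
                         "relationships" => {"type" => "depends_on", "direction" => "in"},
                         "depth"         => 10})

affected.each { |node| puts node["data"]["resource"] }
```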

Using this graph we can answer many questions, among others: