{"id":2884,"date":"2013-01-01T13:12:00","date_gmt":"2013-01-01T12:12:00","guid":{"rendered":"http:\/\/www.devco.net\/?p=2884"},"modified":"2013-01-01T13:18:07","modified_gmt":"2013-01-01T12:18:07","slug":"scaling-nagios-nrpe-checks","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2013\/01\/01\/scaling-nagios-nrpe-checks.php","title":{"rendered":"Scaling Nagios NRPE checks"},"content":{"rendered":"<p>Most Nagios systems do a lot of forking, especially those built around something like NRPE where each check is a connection to be made to a remote system.  On one hand I like NRPE in that it puts the check logic on the nodes using a standard plugin format and provides a fairly re-usable configuration file, but on the other hand the fact that the Nagios machine has to do all this forking has never been good for me.<\/p>\n<p>In the past I&#8217;ve <a href=\"http:\/\/www.devco.net\/archives\/2010\/07\/03\/aggregating_nagios_checks_with_mcollective.php\">shown one way to scale checks by aggregating all results for a specific check into one result<\/a> but this is not always a good fit as pointed out in the post. I&#8217;ve now built a system that uses the same underlying MCollective infrastructure as in the previous post but without the aggregation.<\/p>\n<p>I have a pair of Nagios nodes &#8211; one in the UK and one in France &#8211; and they are on quite low spec VMs doing around 400 checks each.  
The problems I have are:<\/p>\n<ul>\n<li>The machines are constantly loaded from all the forking; one would sit at a load average of 1.5 almost all the time<\/li>\n<li>They use a lot of RAM and usage is quite spiky; especially when something is wrong I&#8217;d have a lot of checks running concurrently, so the machines have to be bigger than I want them<\/li>\n<li>The check frequency is quite low in the usual Nagios manner, sometimes 10 minutes can go by without a check<\/li>\n<li>The check results do not represent a point in time, I have no idea how the check results of node1 relate to those on node2 as they can be taken anywhere in the last 10 minutes<\/li>\n<\/ul>\n<p>These are standard Nagios complaints though and there are many more, but these ones specifically are what I wanted to address right now with the system I am showing here.<\/p>\n<p>Probably not a surprise but the solution is built on MCollective: it uses the existing MCollective NRPE agent and the existing queueing infrastructure to push the forking to each individual node &#8211; they would do this anyway for every NRPE check &#8211; and reads the results off a queue and spools them into the Nagios command file as Passive results.  
Internally it splits the traditional MCollective request-response system into an async processing system using the <a href=\"http:\/\/www.devco.net\/archives\/2012\/08\/19\/mcollective-async-result-handling.php\">technique I blogged about before<\/a>.<\/p>\n<p><center><img decoding=\"async\" src=\"http:\/\/devco.net\/images\/mnrpes-overview.jpg\"><\/center><\/p>\n<p>As you can see the system is made up of a few components:<\/p>\n<ul>\n<li>The Scheduler takes care of publishing requests for checks<\/li>\n<li>MCollective and the middleware provide AAA and transport<\/li>\n<li>The nodes all run the MCollective NRPE agent which puts their replies on the Queue<\/li>\n<li>The Receiver reads the results from the Queue and writes them to the Nagios command file<\/li>\n<\/ul>\n<h3>The Scheduler<\/h3>\n<p>The scheduler daemon is written using the excellent <a href=\"https:\/\/github.com\/jmettraux\/rufus-scheduler\">Rufus Scheduler<\/a> gem &#8211; if you do not know it you totally should check it out, it solves many many problems.  
Rufus allows me to create simple checks on intervals like <em>60s<\/em> and I combine these checks with MCollective filters to create a simple check configuration as below:<\/p>\n<p><code><\/p>\n<pre lang=\"ruby\">\r\nnrpe 'check_bacula_main', '6h', 'bacula::node monitored_by=monitor1'\r\nnrpe 'check_disks', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_greylistd', '60s', 'greylistd monitored_by=monitor1'\r\nnrpe 'check_load', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_mailq', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_mongodb', '60s', 'mongodb monitored_by=monitor1'\r\nnrpe 'check_mysql', '60s', 'mysql::server monitored_by=monitor1'\r\nnrpe 'check_pki', '60m', 'monitored_by=monitor1'\r\nnrpe 'check_swap', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_totalprocs', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_zombieprocs', '60s', 'monitored_by=monitor1'\r\n<\/pre>\n<p><\/code><\/p>\n<p>Taking the first line it says: Run the <em>check_bacula_main<\/em> NRPE check every <em>6 hours<\/em> on machines with the <em>bacula::node<\/em> Puppet Class and with the fact <em>monitored_by=monitor1<\/em>.  
I had the <em>monitored_by<\/em> fact already to assist in building my Nagios configs using a simple search based approach in Puppet.<\/p>\n<p>When the scheduler starts it will log:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\nW, [2012-12-31T22:10:12.186789 #32043]  WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp:\/\/nagios@stomp.example.net:6163\r\nW, [2012-12-31T22:10:12.193405 #32043]  WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp:\/\/nagios@stomp.example.net:6163\r\nI, [2012-12-31T22:10:12.196387 #32043]  INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_bacula_main every 6h matching 'bacula::node monitored_by=monitor1', first in 19709s\r\nI, [2012-12-31T22:10:12.196632 #32043]  INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_disks every 60s matching 'monitored_by=monitor1', first in 57s\r\nI, [2012-12-31T22:10:12.197173 #32043]  INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_load every 60s matching 'monitored_by=monitor1', first in 23s\r\nI, [2012-12-31T22:10:35.326301 #32043]  INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'\r\n<\/pre>\n<p><\/code><\/p>\n<p>You can see it reads the file and schedules the first check at a random point between now and the end of the interval window, which spreads out the checks.<\/p>\n<p><H3>The Receiver<\/H3><br \/>\nThe receiver has almost no config, it just needs to know what queue to read and where your Nagios command file lives, it logs:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\nI, [2013-01-01T11:49:38.295661 #23628]  INFO -- : mnrpes.rb:35:in `daemonize' Starting in the background\r\nW, [2013-01-01T11:49:38.302045 #23631]  WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp:\/\/nagios@stomp.example.net:6163\r\nW, [2013-01-01T11:49:38.310853 #23631]  WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp:\/\/nagios@stomp.example.net:6163\r\nI, [2013-01-01T11:49:38.310980 
#23631]  INFO -- : receiver.rb:16:in `subscribe' Subscribing to \/queue\/mcollective.nagios_passive_results_monitor1\r\nI, [2013-01-01T11:49:41.572362 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040981] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;mongodb;0;OK: connected, databases admin local my_db puppet mcollective\r\nI, [2013-01-01T11:49:42.509061 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z\r\nI, [2013-01-01T11:49:42.510574 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;zombieprocs;0;PROCS OK: 1 process with STATE = Z\r\n<\/pre>\n<p><\/code><\/p>\n<p>As the results get pushed to Nagios I see the following in its logs:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\n[1357042122] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z\r\n[1357042124] PASSIVE SERVICE CHECK: node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z\r\n<\/pre>\n<p><\/code><\/p>\n<p><H3>Did it solve my problems?<\/H3><br \/>\nI listed the set of problems I wanted to solve so it&#8217;s worth evaluating if I did solve them properly.<\/p>\n<h4>Less load and RAM use on the Nagios nodes<\/h4>\n<p>My Nagios nodes have gone from load averages of 1.5 to 0.1 or 0.0; they are doing nothing. They use a lot less RAM and I have removed some of the RAM from one of them and given it to my Jenkins VM instead, it was a huge win.  The scheduler and receiver are quite light on resources as you can see below:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\nUSER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND\r\nnagios    9757  0.4  1.8 130132 36060 ?        
S     2012   3:41 ruby \/usr\/bin\/mnrpes-receiver --pid=\/var\/run\/mnrpes\/mnrpes-receiver.pid --config=\/etc\/mnrpes\/mnrpes-receiver.cfg\r\nnagios    9902  0.3  1.4 120056 27612 ?        Sl    2012   2:22 ruby \/usr\/bin\/mnrpes-scheduler --pid=\/var\/run\/mnrpes\/mnrpes-scheduler.pid --config=\/etc\/mnrpes\/mnrpes-scheduler.cfg\r\n<\/pre>\n<p><\/code><\/p>\n<p>On the RAM side I now never get a pile up of many checks. I do have the stale detection enabled on my Nagios template so if something breaks in the scheduler\/receiver\/broker triplet Nagios will still try to do a traditional check to see what&#8217;s going on but that&#8217;s bearable.<\/p>\n<h4>Check frequency too low<\/h4>\n<p>With this system I could do my checks every 10 seconds without any problems; I settled on 60 seconds as that&#8217;s perfect for me. Rufus scheduler does a great job of managing that and the requests from the scheduler are effectively fire and forget as long as the broker is up.<\/p>\n<h4>Results are spread over 10 minutes<\/h4>\n<p>The problem with the results for <em>load<\/em> on node1 and node2 having no temporal correlation is gone too now: because I use MCollective&#8217;s parallel nature, all the load checks happen at the same time.<\/p>\n<p>Here is the publisher:<br \/>\n<code><\/p>\n<pre lang=\"text\">\r\nI, [2013-01-01T12:00:14.296455 #20661]  INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'\r\n<\/pre>\n<p><\/code><\/p>\n<p>&#8230;and the receiver:<\/p>\n<p><code><\/p>\n<pre lang=\"text\">\r\nI, [2013-01-01T12:00:14.380981 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;load;0;OK - load average: 0.92, 0.54, 0.42|load1=0.920;9.000;10.000;0; load5=0.540;8.000;9.000;0; load15=0.420;7.000;8.000;0; \r\nI, [2013-01-01T12:00:14.383875 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to 
nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;load;0;OK - load average: 0.00, 0.00, 0.00|load1=0.000;1.500;2.000;0; load5=0.000;1.500;2.000;0; load15=0.000;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.387427 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;load;0;OK - load average: 0.02, 0.07, 0.07|load1=0.020;1.500;2.000;0; load5=0.070;1.500;2.000;0; load15=0.070;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.388754 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node4.example.net;load;0;OK - load average: 0.07, 0.02, 0.00|load1=0.070;1.500;2.000;0; load5=0.020;1.500;2.000;0; load15=0.000;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.404650 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node5.example.net;load;0;OK - load average: 0.03, 0.09, 0.04|load1=0.030;1.500;2.000;0; load5=0.090;1.500;2.000;0; load15=0.040;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.405689 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node6.example.net;load;0;OK - load average: 0.06, 0.06, 0.07|load1=0.060;3.000;4.000;0; load5=0.060;3.000;4.000;0; load15=0.070;3.000;4.000;0; \r\nI, [2013-01-01T12:00:14.489590 #23631]  INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node7.example.net;load;0;OK - load average: 0.06, 0.14, 0.14|load1=0.060;1.500;2.000;0; load5=0.140;1.500;2.000;0; load15=0.140;1.500;2.000;0; \r\n<\/pre>\n<p><\/code><\/p>\n<p>All the results are from the same second, win.<\/p>\n<h3>Conclusion<\/h3>\n<p>So my scaling issues on my small site are solved and I think the way this is built will work for many people.  
The <a href=\"https:\/\/github.com\/ripienaar\/mnrpes\">code is on GitHub<\/a> and requires MCollective 2.2.0 or newer.<\/p>\n<p>Having reused the MCollective and Rufus libraries for all the legwork including logging, daemonizing, broker connectivity, addressing and security, I was able to build this in a very short time; the total code base is only 237 lines excluding packaging etc., which is a really low number of lines for what it does.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Most Nagios systems do a lot of forking, especially those built around something like NRPE where each check is a connection to be made to a remote system. On one hand I like NRPE in that it puts the check logic on the nodes using a standard plugin format and provides a fairly re-usable configuration [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","footnotes":""},"categories":[7],"tags":[121,85,78,110,60,13],"_links":{"self":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/2884"}],"collection":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/comments?post=2884"}],"version-history":[{"count":9,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/2884\/revisions"}],"predecessor-version":[{"id":2893,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/2884\/revisions\/2893"}],"wp:attachment":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/media?parent=2884"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/categories?post=2884"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ww
w.devco.net\/wp-json\/wp\/v2\/tags?post=2884"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}