<\/p>\n\r\nnrpe 'check_bacula_main', '6h', 'bacula::node monitored_by=monitor1'\r\nnrpe 'check_disks', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_greylistd', '60s', 'greylistd monitored_by=monitor1'\r\nnrpe 'check_load', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_mailq', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_mongodb', '60s', 'mongodb monitored_by=monitor1'\r\nnrpe 'check_mysql', '60s', 'mysql::server monitored_by=monitor1'\r\nnrpe 'check_pki', '60m', 'monitored_by=monitor1'\r\nnrpe 'check_swap', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_totalprocs', '60s', 'monitored_by=monitor1'\r\nnrpe 'check_zombieprocs', '60s', 'monitored_by=monitor1'\r\n<\/pre>\n<\/code><\/p>\n
Taking the first line, it says: run the check_bacula_main<\/em> NRPE check every 6 hours<\/em> on machines with the bacula::node<\/em> Puppet class and with the fact monitored_by=monitor1<\/em>. I already had the monitored_by<\/em> fact to assist in building my Nagios configs using a simple search-based approach in Puppet.<\/p>\nWhen the scheduler starts it will log:<\/p>\n
<\/p>\n\r\nW, [2012-12-31T22:10:12.186789 #32043] WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp:\/\/nagios@stomp.example.net:6163\r\nW, [2012-12-31T22:10:12.193405 #32043] WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp:\/\/nagios@stomp.example.net:6163\r\nI, [2012-12-31T22:10:12.196387 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_bacula_main every 6h matching 'bacula::node monitored_by=monitor1', first in 19709s\r\nI, [2012-12-31T22:10:12.196632 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_disks every 60s matching 'monitored_by=monitor1', first in 57s\r\nI, [2012-12-31T22:10:12.197173 #32043] INFO -- : scheduler.rb:23:in `nrpe' Adding a job for check_load every 60s matching 'monitored_by=monitor1', first in 23s\r\nI, [2012-12-31T22:10:35.326301 #32043] INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'\r\n<\/pre>\n<\/code><\/p>\n
You can see it reads the file and schedules the first run of each check at a random point between now and the end of that check’s interval window, which spreads out the checks.<\/p>\n
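As a rough sketch of that spreading idea, and not the actual mnrpes code, the first-run delay can be picked at random from inside one interval window. The helper names `interval_seconds` and `first_run_delay` and the supported unit suffixes are assumptions made for illustration; the real scheduler hands this work to Rufus.

```ruby
# Hypothetical helpers for illustration; these names are not from mnrpes.

# Seconds per unit suffix. The set of supported suffixes is an assumption.
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze

# Convert an interval such as '60s', '60m' or '6h' into seconds.
def interval_seconds(interval)
  unit = UNIT_SECONDS.fetch(interval[-1]) do
    raise ArgumentError, "unsupported interval: #{interval.inspect}"
  end
  Integer(interval[0..-2], 10) * unit
end

# Pick a random delay in whole seconds that falls inside one interval window,
# so checks with the same interval do not all fire at the same moment.
def first_run_delay(interval, rng = Random.new)
  rng.rand(interval_seconds(interval))
end

# Entries copied from the schedule file shown earlier.
schedule = [
  ['check_bacula_main', '6h', 'bacula::node monitored_by=monitor1'],
  ['check_disks', '60s', 'monitored_by=monitor1'],
  ['check_load', '60s', 'monitored_by=monitor1']
]

schedule.each do |check, interval, filter|
  delay = first_run_delay(interval)
  puts "Adding a job for #{check} every #{interval} matching '#{filter}', first in #{delay}s"
end
```

The delays differ on every run, but each one stays below its interval, which matches the log above where the 6h check starts within 21600 seconds and the 60s checks start within a minute.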
The Receiver<\/H3>
\nThe receiver has almost no config; it just needs to know which queue to read and where your Nagios command file lives. It logs:<\/p>\n
<\/p>\n\r\nI, [2013-01-01T11:49:38.295661 #23628] INFO -- : mnrpes.rb:35:in `daemonize' Starting in the background\r\nW, [2013-01-01T11:49:38.302045 #23631] WARN -- : activemq.rb:96:in `on_connecting' TCP Connection attempt 0 to stomp:\/\/nagios@stomp.example.net:6163\r\nW, [2013-01-01T11:49:38.310853 #23631] WARN -- : activemq.rb:101:in `on_connected' Conncted to stomp:\/\/nagios@stomp.example.net:6163\r\nI, [2013-01-01T11:49:38.310980 #23631] INFO -- : receiver.rb:16:in `subscribe' Subscribing to \/queue\/mcollective.nagios_passive_results_monitor1\r\nI, [2013-01-01T11:49:41.572362 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040981] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;mongodb;0;OK: connected, databases admin local my_db puppet mcollective\r\nI, [2013-01-01T11:49:42.509061 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z\r\nI, [2013-01-01T11:49:42.510574 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357040982] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;zombieprocs;0;PROCS OK: 1 process with STATE = Z\r\n<\/pre>\n<\/code><\/p>\n
As the results get pushed to Nagios I see the following in its logs:<\/p>\n
<\/p>\n\r\n[1357042122] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z\r\n[1357042124] PASSIVE SERVICE CHECK: node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z\r\n<\/pre>\n<\/code><\/p>\n
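The line the receiver submits follows the standard Nagios external command format for passive service results: a Unix timestamp in brackets, the command name, then host, service, return code and plugin output separated by semicolons. A minimal sketch of building and writing such a line, assuming hypothetical helper names (`passive_result_line`, `submit`) that are not taken from the mnrpes source:

```ruby
# Hypothetical helpers for illustration; these names are not from mnrpes.

# Build one PROCESS_SERVICE_CHECK_RESULT command. Only the first line of the
# plugin output is kept here, which is a simplification made for this sketch.
def passive_result_line(host, service, exit_code, output, time = Time.now)
  first_line = output.to_s.lines.first.to_s.chomp
  format('[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s',
         time.to_i, host, service, exit_code, first_line)
end

# Append one command to the Nagios command file. The path is whatever your
# Nagios install uses for its command file; none is assumed here.
def submit(command_file, line)
  File.open(command_file, 'a') { |f| f.puts(line) }
end

line = passive_result_line('node1.example.net', 'zombieprocs', 0,
                           'PROCS OK: 0 processes with STATE = Z',
                           Time.at(1_357_042_122))
puts line
# prints: [1357042122] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;zombieprocs;0;PROCS OK: 0 processes with STATE = Z
```

Calling `submit` with the command file path then hands the line to Nagios, which logs it as the EXTERNAL COMMAND entry shown above.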
Did it solve my problems?<\/H3>
\nI listed the set of problems I wanted to solve so it’s worth evaluating if I did solve them properly.<\/p>\nLess load and RAM use on the Nagios nodes<\/h4>\n
My Nagios nodes have gone from load averages of 1.5 to 0.1 or 0.0; they are doing nothing. They also use a lot less RAM, so I have removed some of the RAM from one of them and given it to my Jenkins VM instead, which was a huge win. The sender and receiver are quite light on resources, as you can see below:<\/p>\n
<\/p>\n\r\nUSER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\r\nnagios 9757 0.4 1.8 130132 36060 ? S 2012 3:41 ruby \/usr\/bin\/mnrpes-receiver --pid=\/var\/run\/mnrpes\/mnrpes-receiver.pid --config=\/etc\/mnrpes\/mnrpes-receiver.cfg\r\nnagios 9902 0.3 1.4 120056 27612 ? Sl 2012 2:22 ruby \/usr\/bin\/mnrpes-scheduler --pid=\/var\/run\/mnrpes\/mnrpes-scheduler.pid --config=\/etc\/mnrpes\/mnrpes-scheduler.cfg\r\n<\/pre>\n<\/code><\/p>\n
On the RAM side I now never get a pile-up of many checks. I do have stale detection enabled on my Nagios template, so if something breaks in the scheduler\/receiver\/broker triplet Nagios will still try to do a traditional check to see what’s going on, but that’s bearable.<\/p>\n
Check frequency too low<\/h4>\n
With this system I could do my checks every 10 seconds without any problems, but I settled on 60 seconds as that’s perfect for me. Rufus scheduler does a great job of managing that, and the requests from the scheduler are effectively fire and forget as long as the broker is up.<\/p>\n
Results are spread over 10 minutes<\/h4>\n
The problem with the results for load<\/em> on node1 and node2 having no temporal correlation is gone now too. Because I use MCollective’s parallel nature, all the load checks happen at the same time.<\/p>\nHere is the publisher:
\n<\/p>\n\r\nI, [2013-01-01T12:00:14.296455 #20661] INFO -- : scheduler.rb:26:in `nrpe' Publishing request for check_load with filter 'monitored_by=monitor1'\r\n<\/pre>\n<\/code><\/p>\n
…and the receiver:<\/p>\n
<\/p>\n\r\nI, [2013-01-01T12:00:14.380981 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node1.example.net;load;0;OK - load average: 0.92, 0.54, 0.42|load1=0.920;9.000;10.000;0; load5=0.540;8.000;9.000;0; load15=0.420;7.000;8.000;0; \r\nI, [2013-01-01T12:00:14.383875 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node2.example.net;load;0;OK - load average: 0.00, 0.00, 0.00|load1=0.000;1.500;2.000;0; load5=0.000;1.500;2.000;0; load15=0.000;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.387427 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node3.example.net;load;0;OK - load average: 0.02, 0.07, 0.07|load1=0.020;1.500;2.000;0; load5=0.070;1.500;2.000;0; load15=0.070;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.388754 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node4.example.net;load;0;OK - load average: 0.07, 0.02, 0.00|load1=0.070;1.500;2.000;0; load5=0.020;1.500;2.000;0; load15=0.000;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.404650 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node5.example.net;load;0;OK - load average: 0.03, 0.09, 0.04|load1=0.030;1.500;2.000;0; load5=0.090;1.500;2.000;0; load15=0.040;1.500;2.000;0; \r\nI, [2013-01-01T12:00:14.405689 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node6.example.net;load;0;OK - load average: 0.06, 0.06, 0.07|load1=0.060;3.000;4.000;0; load5=0.060;3.000;4.000;0; load15=0.070;3.000;4.000;0; \r\nI, [2013-01-01T12:00:14.489590 #23631] INFO -- : receiver.rb:34:in `receive_and_submit' Submitting passive data to 
nagios: [1357041614] PROCESS_SERVICE_CHECK_RESULT;node7.example.net;load;0;OK - load average: 0.06, 0.14, 0.14|load1=0.060;1.500;2.000;0; load5=0.140;1.500;2.000;0; load15=0.140;1.500;2.000;0; \r\n<\/pre>\n<\/code><\/p>\n
All the results are from the same second, win.<\/p>\n
Conclusion<\/h3>\n
So the scaling issues on my small site are solved and I think the way this is built will work for many people. The code is on GitHub<\/a> and requires MCollective 2.2.0 or newer.<\/p>\nHaving reused the MCollective and Rufus libraries for all the legwork, including logging, daemonizing, broker connectivity, addressing and security, I was able to build this in a very short time. The total code base is only 237 lines excluding packaging etc., which is a really low number of lines for what it does.<\/p>\n","protected":false},"excerpt":{"rendered":"
Most Nagios systems do a lot of forking, especially those built around something like NRPE where each check is a connection to be made to a remote system. On one hand I like NRPE in that it puts the check logic on the nodes using a standard plugin format and provides a fairly re-usable configuration […]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"","_et_pb_old_content":"","footnotes":""},"categories":[7],"tags":[121,85,78,110,60,13],"_links":{"self":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/2884"}],"collection":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/comments?post=2884"}],"version-history":[{"count":9,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/2884\/revisions"}],"predecessor-version":[{"id":2893,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/posts\/2884\/revisions\/2893"}],"wp:attachment":[{"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/media?parent=2884"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/categories?post=2884"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devco.net\/wp-json\/wp\/v2\/tags?post=2884"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}