R.I.Pienaar

CLI Report viewer for Puppet

10/10/2013

When using Puppet you often run it in single-run mode on the CLI and then go afk. When you return you might notice it was slow for some or other reason, but you did not run it with --evaltrace and in debug mode so the information to help you answer this simply isn’t present – or scrolled off, or got rotated away from your logs.

Typically you’d deploy something like Foreman or report handlers on your masters which would receive and display reports. But while you’re on the shell it’s a big context switch to go and find the report there.

Puppet now saves reports in its state dir – including with apply if you ran it with --write-catalog-summary – and in recent versions these reports include the performance data that you’d previously only find from --evaltrace.

So to solve this problem I wrote a little tool to show reports on the CLI. It’s designed to run on the shell of the node in question and as root. If you do this it will automatically pick up the latest report and print it, but it will also go through and check the sizes of files and show you stats. You can run it against saved reports on some other node but you’ll lose some utility. The main focus of the information presented is to let you see logs from the past run, but also information that helps you answer why it was slow to run.

It’s designed to work with very recent versions of Puppet, maybe even only 3.3.0 and newer. I’ve not tested it on older versions but will gladly accept patches.

Here are some snippets of a report of one of my nodes and some comments about the sections. A full sample report can be found here.

First it’s going to show you some metadata about the report, what node, when for etc:

sudo report_print.rb
Report for puppetmaster.example.com in environment production at Thu Oct 10 13:37:04 +0000 2013
 
             Report File: /var/lib/puppet/state/last_run_report.yaml
             Report Kind: apply
          Puppet Version: 3.3.1
           Report Format: 4
   Configuration Version: 1381412220
                    UUID: 99503fe8-38f2-4441-a530-d555ede9067b
               Log Lines: 350 (show with --log)

Some important information here: you can see it figured out where to find the report by parsing the Puppet config – agent section – plus what version of Puppet produced it and what report format it uses. You can also see the report has 350 lines of logs in it but it isn’t showing them by default.
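For reference, with stock settings the agent’s last report lands at a predictable path under the statedir; a minimal sketch of that default (report_print.rb itself discovers the real location by parsing puppet.conf, so the path here is just the common default):

```ruby
# default location of the last run report under the Puppet statedir;
# the real tool derives this from the agent section of puppet.conf
def default_report_file(statedir = "/var/lib/puppet/state")
  File.join(statedir, "last_run_report.yaml")
end

puts default_report_file
```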

Next up it shows you a bunch of metrics from the report:

Report Metrics:
 
   Changes:
                        Total: 320
 
   Events:
                        Total: 320
                      Success: 320
                      Failure: 0
 
   Resources:
                        Total: 436
                  Out of sync: 317
                      Changed: 317
                    Restarted: 7
            Failed to restart: 0
                      Skipped: 0
                       Failed: 0
                    Scheduled: 0
 
   Time:
                        Total: 573.671295
                      Package: 509.544123
                         Exec: 33.242635
      Puppetdb conn validator: 22.767754
             Config retrieval: 4.096973
                         File: 1.343388
                         User: 1.337979
                      Service: 1.180588
                  Ini setting: 0.127856
                       Anchor: 0.013984
            Datacat collector: 0.008954
                         Host: 0.003265
             Datacat fragment: 0.00277
                     Schedule: 0.000504
                        Group: 0.00039
                   Filebucket: 0.000132

These are numerically sorted and the useful stuff is in the last section – what types were to blame for the biggest slowness in your run. Here we can see we spent 509 seconds just doing packages.
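The numeric sort behind that Time section is nothing fancy; a minimal sketch using a few of the numbers from the output above:

```ruby
# time metrics by resource type, a few values taken from the report above
times = {
  "Package" => 509.544123,
  "Exec"    => 33.242635,
  "File"    => 1.343388
}

# sort descending by seconds so the biggest offenders come first
sorted = times.sort_by { |_type, seconds| -seconds }

sorted.each { |type, seconds| puts "%25s: %s" % [type, seconds] }
```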

Having seen how long each type of resource took, it then shows you a little report of how many resources of each type were found:

Resources by resource type:
 
    288 File
     30 Datacat_fragment
     25 Anchor
     24 Ini_setting
     22 User
     18 Package
      9 Exec
      7 Service
      6 Schedule
      3 Datacat_collector
      1 Group
      1 Host
      1 Puppetdb_conn_validator
      1 Filebucket

From here you’ll see detail about resources and files – times, sizes etc. By default it’s going to show you 20 of each but you can increase that using the --count argument.

First we see the evaluation time by resource, this is how long the agent spent to complete a specific resource:

Slowest 20 resources by evaluation time:
 
    356.94 Package[activemq]
     41.71 Package[puppetdb]
     33.31 Package[apache2-prefork-dev]
     33.05 Exec[compile-passenger]
     23.41 Package[passenger]
     22.77 Puppetdb_conn_validator[puppetdb_conn]
     22.12 Package[libcurl4-openssl-dev]
     10.94 Package[httpd]
      4.78 Package[libapr1-dev]
      3.95 Package[puppetmaster]
      3.32 Package[ntp]
      2.75 Package[puppetdb-terminus]
      2.71 Package[mcollective-client]
      1.86 Package[ruby-stomp]
      1.72 Package[mcollective]
      0.58 Service[puppet]
      0.30 Service[puppetdb]
      0.18 User[jack]
      0.16 User[jill]
      0.16 User[ant]

You can see by far the longest here was the activemq package that took 356 seconds and contributed most to the 509 seconds that Package types took in total. A clear indication that maybe this machine is picking the wrong mirrors or that I should create my own nearby mirror.
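A sketch of how such a listing could be produced, assuming the report’s per-resource statuses have already been reduced to a plain hash of name to evaluation-time data – the structure and helper name here are illustrative, not the tool’s actual internals:

```ruby
# hypothetical per-resource status data as reduced from a report
resource_statuses = {
  "Package[activemq]" => {"evaluation_time" => 356.94},
  "Service[puppet]"   => {"evaluation_time" => 0.58},
  "Package[puppetdb]" => {"evaluation_time" => 41.71}
}

# pick the N slowest resources by evaluation time, slowest first
def slowest_resources(statuses, count = 20)
  statuses.sort_by { |_name, s| -s["evaluation_time"] }.first(count)
end

slowest_resources(resource_statuses, 2).each do |name, status|
  puts "%10.2f %s" % [status["evaluation_time"], name]
end
```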

File serving in Puppet is notoriously slow, so when run as root on the node in question it will look for all File resources and print their sizes. Unfortunately it can’t know whether a file’s contents came from source or content as that information isn’t in the report. Still, this might give you some information on where to target optimization. In this case nothing really stands out:

20 largest managed files (only those with full path as resource name that are readable)
 
     6.50 KB /usr/local/share/mcollective/mcollective/util/actionpolicy.rb
     3.90 KB /etc/mcollective/facts.yaml
     3.83 KB /var/lib/puppet/concat/bin/concatfragments.sh
     2.78 KB /etc/sudoers
     1.69 KB /etc/apache2/conf.d/puppetmaster.conf
     1.49 KB /etc/puppet/fileserver.conf
     1.20 KB /etc/puppet/rack/config.ru
    944.00 B /etc/apache2/apache2.conf
    573.00 B /etc/ntp.conf
    412.00 B /usr/local/share/mcollective/mcollective/util/actionpolicy.ddl
    330.00 B /etc/apache2/mods-enabled/passenger.conf
    330.00 B /etc/apache2/mods-available/passenger.conf
    262.00 B /etc/default/puppet
    215.00 B /etc/apache2/mods-enabled/worker.conf
    215.00 B /etc/apache2/mods-available/worker.conf
    195.00 B /etc/apache2/ports.conf
    195.00 B /var/lib/puppet/concat/_etc_apache2_ports.conf/fragments.concat
    195.00 B /var/lib/puppet/concat/_etc_apache2_ports.conf/fragments.concat.out
    164.00 B /var/lib/puppet/concat/_etc_apache2_ports.conf/fragments/10_Apache ports header
    158.00 B /etc/puppet/hiera.yaml
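The size formatting in that listing is straightforward; a minimal sketch (the helper name is mine):

```ruby
# format a byte count the way the listing does – KB at or above 1024 bytes,
# plain bytes below that
def human_size(bytes)
  if bytes >= 1024
    "%.2f KB" % (bytes / 1024.0)
  else
    "%.2f B" % bytes
  end
end

puts human_size(6656)  # 6.50 KB
puts human_size(573)   # 573.00 B
```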

And finally if I ran it with --log I’d get the individual log lines:

350 Log lines:
 
   Thu Oct 10 13:37:06 +0000 2013 /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat]/ensure (notice): created
   Thu Oct 10 13:37:06 +0000 2013 /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat/bin]/ensure (notice): created
   Thu Oct 10 13:37:06 +0000 2013 /Stage[main]/Concat::Setup/File[/var/lib/puppet/concat/bin/concatfragments.sh]/ensure (notice): defined content as '{md5}2fbba597a1513eb61229551d35d42b9f'
   .
   .
   .

The code is on GitHub. I’d like to make it available as a Puppet Forge module but there really is no usable option to achieve this. The Puppet Face framework is the best available option but the UX is so poor that I would not like to expose anyone who uses my code to it.

Introduction to MCollective deck

06/14/2013

I’ve never had a good introduction-to-MCollective slide deck; I usually just give demos and talk through it. I was invited to talk about MCollective in San Francisco so I made a new deck for this talk.

On the night I gave people the choice between the new Introduction talk and the older Managing Puppet using MCollective talk, and sadly the intro talk lost out.

Last night the excellent people at Workday flew me to Dublin to talk to the local DevOps group there and this group was predominantly Chef users who chose the Introduction talk so I finally had a chance to deliver it. This talk was recorded, hopefully it’ll be up soon and I’ll link to it once available.

This slide deck is a work in progress – it’s clear I need to add some more information about the non-CLI oriented uses of MCollective – but it’s good to finally have a deck that’s receiving good feedback.

We uploaded the slides to Slideshare back when I was in San Francisco and those are the ones you see here.


Managing Puppet Using MCollective

02/03/2013

I recently gave a talk titled “Managing Puppet Using MCollective” at the Puppet Camp in Ghent.

The talk introduces a complete rewrite of the MCollective plugin used to manage Puppet. The plugin can be found on our GitHub repo as usual. Significantly, this is one of a new breed of plugin that we ship as native OS packages and practice continuous delivery on.

The packages can be found on apt.puppetlabs.com and yum.puppetlabs.com and are simply called mcollective-puppet-agent and mcollective-puppet-client.

This set of plugins showcases a bunch of recent MCollective features including:

  • Data Plugins
  • Aggregation Functions
  • Custom Validators
  • Configurable enabling and disabling of the Agent
  • Direct Addressing and pluggable discovery to significantly improve the efficiency of the runall method
  • Utility classes shared amongst different types of plugin
  • Extensive testing using rspec and our mcollective specific rspec plugins

It’s a bit of a beast, coming in at a couple thousand lines, but this is mostly because we had to invent a rather sizeable wrapper around Puppet to expose a nice API over both Puppet 2.7 and 3.x for things like running them and obtaining their status.

The slides from the talk can be seen below, hopefully a video will be up soon else I’ll turn it into a screencast.

Graphing on the CLI

01/20/2013

I’ve recently been thinking about ways to do graphs on the CLI. We’ve written a new Puppet Agent for MCollective that can gather all sorts of interesting data from your server estate and I’d really like to be able to show this data on the CLI. This post isn’t really about MCollective though – the ideas apply to any data.

I already have sparklines in MCollective, here’s the distribution of ping times:

This shows you that most of the nodes responded quickly with a bit of a tail at the end being my machines in the US.

Sparklines are quite nice for a quick overview so I looked at adding some more of this to the UI and came up with this:

Which is quite nice – these are the nodes in my infrastructure stuck into buckets and the node counts for each bucket are shown. We can immediately tell something is not quite right – the config retrieval time shows a bunch of slow machines and the slowness does not correspond to resource counts etc. On investigation I found these are my dev machines – KVM nodes hosted on HP Micro Servers – so that’s to be expected.

I am not particularly happy with these graphs though so am still exploring other options, one other option is GNU Plot.

GNU Plot can target its graphs at different terminals, like PNG, and also line printers – since the Unix terminal is essentially a line printer we can use this.

Here are 2 graphs of config retrieval time produced by MCollective using the same data source that produced the spark line above – though obviously from a different time period. Note that the axis titles and graph title are supplied automatically using the MCollective DDL:

$ mco plot resource config_retrieval_time
 
                   Information about Puppet managed resources
  Nodes
    6 ++-*****----+----------+-----------+----------+----------+----------++
      +      *    +          +           +          +          +           +
      |       *                                                            |
    5 ++      *                                                           ++
      |       *                                                            |
      |        *                                                           |
    4 ++       *      *                                                   ++
      |        *      *                                                    |
      |         *    * *                                                   |
    3 ++        *    * *                                                  ++
      |          *  *  *                                                   |
      |           * *   *                                                  |
    2 ++           *    *                         *        *              ++
      |                 *                         **       **              |
      |                  *                       * *      *  *             |
    1 ++                 *               *       *  *     *   **        * ++
      |                  *              * *     *   *     *     **    **   |
      +           +       *  +         * + *    *   +*   *     +     *     +
    0 ++----------+-------*************--+--****----+*****-----+--***-----++
      0           10         20          30         40         50          60
                              Config Retrieval Time

So this is pretty serviceable for showing this data on the console! It wouldn’t scale to many lines but for just visualizing some arbitrary series of numbers it’s quite nice. Here’s the GNU Plot script that made the text graph:

set title "Information about Puppet managed resources"
set terminal dumb 78 24
set key off
set ylabel "Nodes"
set xlabel "Config Retrieval Time"
plot '-' with lines
3 6
6 6
9 3
11 2
14 4
17 0
20 0
22 0
25 0
28 0
30 1
33 0
36 0
38 2
41 0
44 0
46 2
49 1
52 0
54 0
57 1

The magic here comes from the second line that sets the output terminal to dumb and supplies some dimensions. Very handy, worth exploring some more and adding to your CLI toolset. I’ll look at writing a gem or something that supports both these modes.
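Generating such a script from data is trivial; here is a sketch (the helper name is mine) that builds the script text, which you could then pipe to gnuplot:

```ruby
# build a gnuplot script for the dumb terminal from [x, y] pairs;
# piping the result to gnuplot produces a text graph like the one above
def dumb_plot(title, xlabel, ylabel, data)
  header = <<~GNUPLOT
    set title "#{title}"
    set terminal dumb 78 24
    set key off
    set ylabel "#{ylabel}"
    set xlabel "#{xlabel}"
    plot '-' with lines
  GNUPLOT
  # inline data for plot '-' is terminated with a lone "e"
  header + data.map { |x, y| "#{x} #{y}" }.join("\n") + "\ne\n"
end

script = dumb_plot("Information about Puppet managed resources",
                   "Config Retrieval Time", "Nodes", [[3, 6], [6, 6], [9, 3]])
# IO.popen("gnuplot", "w") { |io| io.write(script) }
```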

There are a few other players in this space, I definitely recall coming across a Python tool to do graphs but cannot find it now, shout out in the comments if you know other approaches and I’ll add them to the post!

Updated: some links to related projects: sparkler, Graphite Spark

Solving monitoring state storage problems using Redis

01/06/2013

Redis is an in-memory key-value data store that provides a small number of primitives suitable to the task of building monitoring systems. As a lot of us are hacking in this space I thought I’d write a blog post summarizing where I’ve been using it in a little Sensu-like monitoring system I have been working on, on and off.

There are some monitoring related events coming up, like MonitoringLove in Antwerp and Monitorama in Boston – I will be attending both and I hope a few members of the community will create similar background posts on various interesting areas before these events.

I’ve only recently started looking at Redis but really like it. It’s a very lightweight daemon written in C with fantastic documentation detailing things like each command’s performance characteristics, and most documentation pages are live in that they have a REPL right on the page – like the SET page – note you can type into the code sample and see your changes in real time. It is sponsored by VMware and released under the 3-clause BSD license.

Redis Data Types


Redis provides a few common data structures:

  • Normal key-value storage where every key has just one string value
  • Hashes where every key contains a hash of key-values strings
  • Lists of strings – basically just plain old Arrays sorted in insertion order that allow duplicate values
  • Sets are a bit like Lists but with the addition that a given value can only appear in a set once
  • Sorted Sets are sets where each value also has a weight associated with it; the set is indexed by weight

All the keys support things like expiry based on time and TTL calculation. Additionally it also supports PubSub.

At first it can be hard to imagine how you’d use a data store with only these few data types, capable of only storing strings, for monitoring – but with a bit of creativity it can be really very useful.

The full reference about all the types can be found in the Redis Docs: Data Types

Monitoring Needs


Monitoring systems generally need a number of different types of storage: configuration, event archiving, and status and alert tracking. There are more but these are the big ticket items; of the 3 I am only going to focus on the last one – status and alert tracking.

Status tracking is essentially transient data. If you lose your status view it’s not really a big deal; it will be recreated quite quickly as new check results come in. Worst case you’ll get some alerts again that you recently got. This fits well with Redis, which doesn’t always commit data as soon as it receives it – it flushes from memory to disk roughly every second.

Redis does not provide much by way of SSL or strong authentication so I tend to consider it a single node IPC system rather than, say, a generic PubSub system. I feed data into a node using a system like ActiveMQ and then for comms and state tracking on a single node I’ll use Redis.

I’ll show how it can be used to solve the following monitoring related storage/messaging problems:

  • Check Status – a check like load on every node
  • Staleness Tracking – you need to know when a node is not receiving check results so you can do alive checks
  • Event Notification – your core monitoring system will likely feed into alerters like Opsgenie and metric storage like Graphite
  • Alert Tracking – you need to know when you last sent an alert and when you can alert again based on an interval like every 2 hours

Check Status


The check is generally the main item of monitoring systems. Something configures a check like load and then every node produces check results for this item; the monitoring system has to track the status of the checks on a per-node basis.

In my example a check result looks more or less like this:

{"lastcheck"        => "1357490521", 
 "count"            => "1143", 
 "exitcode"         => "0", 
 "output"           => "OK - load average: 0.23, 0.10, 0.02", 
 "last_state_change"=> "1357412507",
 "perfdata"         => '{"load15":0.02,"load5":0.1,"load1":0.23}',
 "check"            => "load",
 "host"             => "dev2.devco.net"}

This is standard stuff and the most boring part – you might guess this goes into a Hash and you’d be right. Note the count item there – Redis has special handling for counters and I’ll show that in a minute.

By convention Redis keys are namespaced with a : so I’d store the check status for a specific node + check combination in a key like status:example.net:load
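A tiny helper makes that convention explicit (the method name is mine, not part of the system described here):

```ruby
# build a namespaced status key following the : convention, e.g.
# status:example.net:load
def status_key(host, check)
  "status:%s:%s" % [host, check]
end

puts status_key("example.net", "load")  # status:example.net:load
```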

Updating or creating a new hash is really easy – just write to it:

def save_check(check)
  key = "status:%s:%s" % [check.host, check.check]
 
  check.last_state_change = @redis.hget(key, "last_state_change")
  check.previous_exitcode = @redis.hget(key, "exitcode")
 
  @redis.multi do
    @redis.hset(key, "host", check.host)
    @redis.hset(key, "check", check.check)
    @redis.hset(key, "exitcode", check.exitcode)
    @redis.hset(key, "lastcheck", check.last_check)
    @redis.hset(key, "last_state_change", check.last_state_change)
    @redis.hset(key, "output", check.output)
    @redis.hset(key, "perfdata", check.perfdata)
 
    unless check.changed_state?
      @redis.hincrby(key, "count", 1)
    else
      @redis.hset(key, "count", 1)
    end
  end
 
  check.count = @redis.hget(key, "count")
end

Here I assume we have an object that represents a check result called check and we’re more or less just fetching/updating data in it. I first retrieve the previously saved exitcode and last state change time and save those into the object. The object will do some internal state management to determine if the current check result represents a changed state – OK to WARNING etc – based on this information.

The @redis.multi starts a transaction, everything inside the block will be written in an atomic way by the Redis server thus ensuring we do not have any half-baked state while other parts of the system might be reading the status of this check.

As I said, the check object determines whether the current result is a state change when I set the previous exitcode near the top of the method. This means the unless block at the end will either reset the count to 1 if it’s a change or just increment the count if not. We use the internal Redis counter handling – HINCRBY – to avoid having to first fetch the count, update it and save it again; this saves a round trip to the database.

You can now just retrieve the whole hash with the HGETALL command, even on the command line:

% redis-cli hgetall status:dev2.devco.net:load
 1) "check"
 2) "load"
 3) "host"
 4) "dev2.devco.net"
 5) "output"
 6) "OK - load average: 0.00, 0.00, 0.00"
 7) "lastcheck"
 8) "1357494721"
 9) "exitcode"
10) "0"
11) "perfdata"
12) "{\"load15\":0.0,\"load5\":0.0,\"load1\":0.0}"
13) "last_state_change"
14) "1357412507"
15) "count"
16) "1178"

References: Redis Hashes, MULTI, HSET, HINCRBY, HGET, HGETALL

Staleness Tracking


Staleness Tracking here means we want to know when we last saw any data about a node; if the node is not providing information we need to go and see what happened to it. Maybe it’s up but the data sender died, or maybe it’s crashed.

This is where we really start using some of the Redis features to save us time. We need to track when we last saw a specific node and then we have to be able to quickly find all nodes not seen within a certain amount of time, like 120 seconds.

We could retrieve all the check results and inspect their last-updated times to figure it out, but that’s not optimal.

This is what Sorted Sets are for. Remember, Sorted Sets give each member a weight and order the set by that weight; if we use the timestamp at which we last received data for a host as the weight, it means we can very quickly fetch a list of stale hosts.

def update_host_last_seen(host, time)
  @redis.zadd("host:last_seen", time, host)
end

When we call this code like update_host_last_seen("dev2.devco.net", Time.now.utc.to_i) the host will either be added to or updated in the Sorted Set based on the current UTC time. We do this every time we save a new result set with the code in the previous section.

To get a list of hosts that we have not seen in the last 120 seconds is really easy now:

def get_stale_hosts(age)
  @redis.zrangebyscore("host:last_seen", 0, (Time.now.utc.to_i - age))
end

If we call this with an age like 120 we’ll get an array of nodes that have not had any data within the last 120 seconds.

You can do the same check on the CLI, this shows all the machines not seen in the last 60 seconds:

% redis-cli zrangebyscore host:last_seen 0 $(expr $(date +%s) - 60)
 1) "dev1.devco.net"

Reference: Sorted Sets, ZADD, ZRANGEBYSCORE

Event Notification


When a check result enters the system that’s either a state change, a problem, or has metrics associated with it, we’d want to send it on to other pieces of code.

We don’t know or care who those interested parties are, we only care that there might be some interested parties – it might be something writing to Graphite or OpenTSDB, or both at the same time, or something alerting to Opsgenie or PagerDuty. This is a classic use case for PubSub and Redis has a good PubSub subsystem that we’ll use for this.

I am only going to show the metrics publishing – problem and state changes are very similar:

def publish_metrics(check)
  if check.has_perfdata?
    msg = {"metrics" => check.perfdata, "type" => "metrics", "time" => check.last_check, "host" => check.host, "check" => check.check}.to_json
    publish(["metrics", check.host, check.check], msg)
  end
end
 
def publish(type, message)
  target = ["overwatch", Array(type).join(":")].join(":")
  @redis.publish(target, message)
end

This is pretty simple stuff, we’re just publishing some JSON to a named destination like overwatch:metrics:dev1.devco.net:load. We can now write small standalone single function tools that consume this stream of metrics and send it wherever we like – like Graphite or OpenTSDB.
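As an example of such a standalone consumer, here is a sketch that turns one of these JSON messages into Graphite’s plaintext protocol lines – the message layout matches what publish_metrics produces above, while the helper name and path scheme are mine:

```ruby
require "json"

# convert a published metrics message into Graphite plaintext lines:
# "<path> <value> <timestamp>"
def graphite_lines(message)
  event = JSON.parse(message)
  host = event["host"].tr(".", "_")  # Graphite uses . as its path separator
  JSON.parse(event["metrics"]).map do |metric, value|
    "overwatch.%s.%s.%s %s %s" % [host, event["check"], metric, value, event["time"]]
  end
end

msg = {"metrics" => '{"load1":0.23}', "type" => "metrics", "time" => 1357490521,
       "host" => "dev2.devco.net", "check" => "load"}.to_json
puts graphite_lines(msg)
```

Each resulting line could be written straight to Graphite’s plaintext listener on port 2003.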

We publish similar events for any incoming check result that is not OK and also for any state transition like CRITICAL to OK, these would be consumed by alerter handlers that might feed pagers or SMS.

We’re publishing these alerts to destinations that include the host and specific check – this way we can very easily create individual host views of activity by doing pattern-based subscribes.

Reference: PubSub, PUBLISH

Alert Tracking


Alert Tracking means keeping track of which alerts we’ve already sent and when we’ll need to send them again – like only after 2 hours of the same problem, and not on every check result, which might come in every minute.

Leading on from the previous section, we’d just consume the problem and state change PubSub channels and react to messages from those. A possible consumer might look like this:

@redis.psubscribe("overwatch:state_change:*", "overwatch:issues:*") do |on|
  on.pmessage do |channel, message|
    event = JSON.parse(message)
 
    case event["type"]
      when "issue"
        sender.notify_issue(event["issue"]["exitcode"], event["host"], event["check"], event["issue"]["output"])
      when "state_change"
        if event["state_change"]["exitcode"] == 0
          sender.notify_recovery(event["host"], event["check"], event["state_change"]["output"])
        end
    end
  end
end

This subscribes to the 2 channels and passes the incoming events to a notifier. Note we’re using patterns here to catch all alerts and changes for all hosts.

The problem here is that without any special handling this is going to fire off alerts every minute, assuming we check the load every minute. This is where Redis key expiry comes in.

We’ll need to track which messages we have sent when and on any state change clear the tracking thus restarting the counters.

So we’ll just add keys called "alert:dev2.devco.net:load:3" to indicate an UNKNOWN state alert for load on dev2.devco.net:

def record_alert(host, check, status, expire=7200)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.set(key, 1)
  @redis.expire(key, expire)
end

This takes an expire time, which defaults to 2 hours, and tells Redis to just remove the key when its time is up.

With this we need a way to figure out if we can send again:

def alert_ttl(host, check, status)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.ttl(key)
end

This will return the number of seconds until the next alert, and -1 if we are ready to send again.

And finally, on every state change we need to purge all the tracking for a given node + check combo. The reason is that if we alerted on CRITICAL a minute ago, then the service recovered to OK but soon went CRITICAL again, that most recent CRITICAL alert would otherwise be suppressed as part of the previous cycle of alerts.

def clear_alert_ttls(host, check)
  @redis.del(@redis.keys.grep(/^alert:#{host}:#{check}:\d/))
end

So now I can show the two methods that will actually publish the alerts:

The first notifies of issues, but only every @alert_interval seconds; it uses the alert_ttl helper above to determine whether it should send:

def notify_issue(exitcode, host, check, output)
  if (ttl = @storage.alert_ttl(host, check, exitcode)) == -1
    subject = "%s %s#%s" % [status_for_code(exitcode), host, check]
    message = "%s: %s" % [subject, output]
 
    send_alert(message, subject, @recipients)
 
    @storage.record_alert(host, check, exitcode, @alert_interval)
  else
    Log.info("Not alerting %s#%s due to interval restrictions, next alert in %d seconds" % [host, check, ttl])
  end
end

The second publishes recovery notices – we’d always want those, and they do not repeat. Here we clear all the previous alert tracking to avoid incorrect alert suppressions:

def notify_recovery(host, check, output)
  subject = "RECOVERY %s#%s" % [host, check]
  message = "%s: %s" % [subject, output]
 
  send_alert(message, subject, @recipients)
 
  @redis.clear_alert_ttls(host, check)
end

References: SET, EXPIRE, SUBSCRIBE, TTL, DEL

Conclusion


This covered a few Redis basics but it’s a very rich system that can be used in many areas so if you are interested spend some quality time with its docs.

Using its facilities saved me a ton of effort while working on a small monitoring system. It is fast and lightweight and enables cross-language collaboration that I’d have found hard to replicate in a performant manner without it.
