Solving monitoring state storage problems using Redis

01/06/2013

Redis is an in-memory key-value data store that provides a small number of primitives suitable to the task of building monitoring systems. As a lot of us are hacking in this space I thought I’d write a blog post summarizing where I’ve been using it in a little Sensu like monitoring system I have been working on on and off.

There’s some monitoring related events coming up like MonitoringLove in Antwerp and Monitorama in Boston – I will be attending both and I hope a few members in the community will create similar background posts on various interesting areas before these events.

I’ve only recently started looking at Redis but really like it. It’s a very light weight daemon written in C with fantastic documentation detailing things like each commands performance characteristics and most documantation pages are live in that they have a REPL right on the page like the SET page – note you can type into the code sample and see your changes in real time. It is sponsored by VMWare and released under the 3 clause BSD license.

Redis Data Types


Redis provides a few common data structures:

  • Normal key-value storage where every key has just one string value
  • Hashes where every key contains a hash of key-values strings
  • Lists of strings – basically just plain old Arrays sorted in insertion order that allows duplicate values
  • Sets are a bit like Lists but with the addition that a given value can only appear in a list once
  • Sorted Sets are sets that in addition to the value also have a weight associated with it, the set is indexed by weight

All the keys support things like expiry based on time and TTL calculation. Additionally it also supports PubSub.

At first it can be hard to imagine how you’d use a data store with only these few data types and capable of only storing strings for monitoring but with a bit of creativity it can be really very useful.

The full reference about all the types can be found in the Redis Docs: Data Types

Monitoring Needs


Monitoring systems generally need a number of different types of storage. These are configuration, event archiving and status and alert tracking. There are more but these are the big ticket items, of the 3 I am only going to focus on the last one – Status and Alert Tracking here.

Status tracking is essentially transient data. If you loose your status view it’s not really a big deal it will be recreated quite quickly as new check results come in. Worst case you’ll get some alerts again that you recently got. This fits well with Redis that doesn’t always commit data soon as it receives it – it flushes roughly every second from memory to disk.

Redis does not provide much by way of SSL or strong authentication so I tend to consider it a single node IPC system rather than say a generic PubSub system. I feed data into a node using system like ActiveMQ and then for comms and state tracking on a single node I’ll use Redis.

I’ll show how it can be used to solve the following monitoring related storage/messaging problems:

  • Check Status – a check like load on every node
  • Staleness Tracking – you need to know when a node is not receiving check results so you can do alive checks
  • Event Notification – your core monitoring system will likely feed into alerters like Opsgenie and metric storage like Graphite
  • Alert Tracking – you need to know when you last sent an alert and when you can alert again based on an interval like every 2 hours

Check Status


The check is generally the main item of monitoring systems. Something configures a check like load and then every node gets check results for this item, the monitoring system has to track the status of the checks on a per node basis.

In my example a check result looks more or less like this:

{"lastcheck"        => "1357490521", 
 "count"            => "1143", 
 "exitcode"         => "0", 
 "output"           => "OK - load average: 0.23, 0.10, 0.02", 
 "last_state_change"=> "1357412507",
 "perfdata"         => '{"load15":0.02,"load5":0.1,"load1":0.23}',
 "check"            => "load",
 "host"             => "dev2.devco.net"}

This is standard stuff and the most boring part – you might guess this goes into a Hash and you’ll be right. Note the count item there Redis has special handling for counters and I’ll show that in a minute.

By convention Redis keys are name spaced by a : so I’d store the check status for a specific node + check combination in a key like status:example.net:load

Updating or creating a new hash is real easy – just write to it:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def save_check(check)
  key = "status:%s:%s" % [check.host, check.check]
 
  check.last_state_change = @redis.hget(key, "last_state_change")
  check.previous_exitcode = @redis.hget(key, "exitcode")
 
  @redis.multi do
    @redis.hset(key, "host", check.host)
    @redis.hset(key, "check", check.check)
    @redis.hset(key, "exitcode", check.exitcode)
    @redis.hset(key, "lastcheck", check.last_check)
    @redis.hset(key, "last_state_change", check.last_state_change)
    @redis.hset(key, "output", check.output)
    @redis.hset(key, "perfdata", check.perfdata)
 
    unless check.changed_state?
      @redis.hincrby(key, "count", 1)
    else
      @redis.hset(key, "count", 1)
    end
  end
 
  check.count = @redis.hget(key, "count")
end

Here I assume we have a object that represents a check result called check and we’re more or less just fetching/updating data in it. I first retrieve the previously saved state of exitcode and last state change time and save those into the object. The object will do some internal state management to determine if the current check result represents a changed state – OK to WARNING etc – based on this information.

The @redis.multi starts a transaction, everything inside the block will be written in an atomic way by the Redis server thus ensuring we do not have any half-baked state while other parts of the system might be reading the status of this check.

As I said the check determines if the current result is a state change when I set the previous exitcode on line 5 this means lines 16-20 will either set the count to 1 if it’s a change or just increment the count if not. We use the internal Redis counter handling on line 17 to avoid having to first fetch the count and then update it and saving it, this saves a round trip to the database.

You can now just retrieve the whole hash with the HGETALL command, even on the command line:

% redis-cli hgetall status:dev2.devco.net:load
 1) "check"
 2) "load"
 3) "host"
 4) "dev2.devco.net"
 5) "output"
 6) "OK - load average: 0.00, 0.00, 0.00"
 7) "lastcheck"
 8) "1357494721"
 9) "exitcode"
10) "0"
11) "perfdata"
12) "{\"load15\":0.0,\"load5\":0.0,\"load1\":0.0}"
13) "last_state_change"
14) "1357412507"
15) "count"
16) "1178"

References: Redis Hashes, MULTI, HSET, HINCRBY, HGET, HGETALL

Staleness Tracking


Staleness Tracking here means we want to know when last we saw any data about a node, if the node is not providing information we need to go and see what happened to it. Maybe it’s up but the data sender died or maybe it’s crashed.

This is where we really start using some of the Redis features to save us time. We need to track when last we saw a specific node and then we have to be able to quickly find all nodes not seen within certain amount of time like 120 seconds.

We could retrieve all the check results and check their last updated time and so figure it out but that’s not optimal.

This is what Sorted Lists are for. Remember Sorted Lists have a weight and orders the list by the weight, if we use the timestamp that we last received data at for a host as the weight it means we can very quickly fetch a list of stale hosts.

1
2
3
def update_host_last_seen(host, time)
  @redis.zadd("host:last_seen", time, host)
end

When we call this code like update_host_last_seen(“dev2.devco.net”, Time.now.utc.to_i) the host will either be added to or updated in the Sorted List based on the current UTC time. We do this every time we save a new result set with the code in the previous section.

To get a list of hosts that we have not seen in the last 120 seconds is really easy now:

1
2
3
def get_stale_hosts(age)
  @redis.zrangebyscore("host:last_seen", 0, (Time.now.utc.to_i - age))
end

If we call this with an age like 120 we’ll get an array of nodes that have not had any data within the last 120 seconds.

You can do the same check on the CLI, this shows all the machines not seen in the last 60 seconds:

% redis-cli zrangebyscore host:last_seen 0 $(expr $(date +%s) - 60)
 1) "dev1.devco.net"

Reference: Sorted Sets, ZADD, ZRANGEBYSCORE

Event Notification


When a check result enters the system thats either a state change, a problem or have metrics associated it we’d want to send those on to other pieces of code.

We don’t know or care who those interested parties are we only care that there might be some interested parties – it might be something writing to Graphite or OpenTSDB or both at the same time or something alerting to Opsgenie or Pager Duty. This is a classic use case for PubSub and Redis has a good PubSub subsystem that we’ll use for this.

I am only going to show the metrics publishing – problem and state changes are very similar:

1
2
3
4
5
6
7
8
9
10
11
def publish_metrics(check)
  if check.has_perfdata?
    msg = {"metrics" => check.perfdata, "type" => "metrics", "time" => check.last_check, "host" => check.host, "check" => check.check}.to_json
    publish(["metrics", check.host, check.check], msg)
  end
end
 
def publish(type, message)
  target = ["overwatch", Array(type).join(":")].join(":")
  @redis.publish(target, message)
end

This is pretty simple stuff, we’re just publishing some JSON to a named destination like overwatch:metrics:dev1.devco.net:load. We can now write small standalone single function tools that consume this stream of metrics and send it wherever we like – like Graphite or OpenTSDB.

We publish similar events for any incoming check result that is not OK and also for any state transition like CRITICAL to OK, these would be consumed by alerter handlers that might feed pagers or SMS.

We’re publishing these alerts to to destinations that include the host and specific check – this way we can very easily create individual host views of activity by doing pattern based subscribes.

Reference: PubSub, PUBLISH

Alert Tracking


Alert Tracking means keeping track of which alerts we’ve already sent and when we’ll need to send them again like only after 2 hours of the same problem and not on every check result which might come in every minute.

Leading on from the previous section we’d just consume the problem and state change PubSub channels and react on messages from those:

A possible consumer of this might look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
@redis.psubscribe("overwatch:state_change:*", "overwatch:issues:*") do |on|
  on.pmessage do |channel, message|
    event = JSON.parse(message)
 
    case event["type"]
      when "issue"
        sender.notify_issue(event["issue"]["exitcode"], event["host"], event["check"], event["issue"]["output"])
      when "state_change"
        if event["state_change"]["exitcode"] == 0
          sender.notify_recovery(event["host"], event["check"], event["state_change"]["output"])
        end
    end
  end
end

This subscribes to the 2 channels and pass the incoming events to a notifier. Note we’re using the patterns here to catch all alerts and changes for all hosts.

The problem here is that without any special handling this is going to fire off alerts every minute assuming we check the load every minute. This is where Redis expiry of keys come in.

We’ll need to track which messages we have sent when and on any state change clear the tracking thus restarting the counters.

So we’ll just add keys called “alert:dev2.devco.net:load:3″ to indicate an UNKNOWN state alert for load on dev2.devco.net:

1
2
3
4
5
def record_alert(host, check, status, expire=7200)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.set(key, 1)
  @redis.expire(key, expire)
end

This takes an expire time which defaults to 2 hours and tells redis to just remove the key when its time is up.

With this we need a way to figure out if we can send again:

1
2
3
4
def alert_ttl(host, check, status)
  key = "alert:%s:%s:%d" % [host, check, status]
  @redis.ttl(key)
end

This will return the amount of seconds till next alert and -1 if we are ready to send again

And finally on every state change we need to just purge all the tracking for a given node + check combo. The reason for this is if we notified on CRITICAL a minute ago then the service recovers to OK but soon goes to CRITICAL again this most recent CRITICAL alert will be suppressed as part of the previous cycle of alerts.

1
2
3
def clear_alert_ttls(host, check)
  @redis.del(@redis.keys.grep(/^alert:#{host}:#{check}:\d/))
end

So now I can show the two methods that will actually publish the alerts:

The first notifies of issues but only every @interval seconds and it uses the alert_ttl helper above to determine if it should or shouldn’t send:

1
2
3
4
5
6
7
8
9
10
11
12
def notify_issue(exitcode, host, check, output)
  if (ttl = @storage.alert_ttl(host, check, exitcode)) == -1
    subject = "%s %s#%s" % [status_for_code(exitcode), host, check]
    message = "%s: %s" % [subject, output]
 
    send(message, subject, @recipients)
 
    @redis.record_alert(host, check, exitcode, @alert_interval)
  else
    Log.info("Not alerting %s#%s due to interval restrictions, next alert in %d seconds" % [host, check, ttl])
  end
end

The second will publish recovery notices and we’d always want those and they will not repeat, here we clear all the previous alert tracking to avoid incorrect alert surpressions:

1
2
3
4
5
6
7
8
def notify_recovery(host, check, output)
  subject = "RECOVERY %s#%s" % [host, check]
  message = "%s: %s" % [subject, output]
 
  send_alert(message, subject, @recipients)
 
  @redis.clear_alert_ttls(host, check)
end

References: SET, EXPIRE, SUBSCRIBE, TTL, DEL

Conclusion


This covered a few Redis basics but it’s a very rich system that can be used in many areas so if you are interested spend some quality time with its docs.

Using its facilities saved me a ton of effort while working on a small monitoring system. It is fast and light weight and enable cross language collaboration that I’d have found hard to replicate in a performant manner without it.