
Yesterday I gave a quick intro to the basics of Message Orientated Middleware; today we’ll build something kewl and useful.

Graphite is a fantastic statistics-as-a-service package for your network. It can store, graph, slice and dice your time series data in ways that were only imaginable in the dark days of just having RRD files. The typical way to get data into it is to just talk to its socket and send some metrics. This mostly works great but has some issues:

  • You have a huge network and so you might be able to overwhelm its input channel
  • You have strict policies about network connections and are not allowed to have all servers open a connection to it directly
  • Your network is spread over the globe and sometimes the connections are just not reliable, but you do not wish to lose metrics during this time
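
For comparison, the direct approach described above amounts to writing one line per metric to Graphite's plaintext listener. Here is a minimal sketch, assuming the usual `path value timestamp` line format and the port 2003 used later in this post. `graphite_line` and `send_metric` are illustrative helper names of my own, and a StringIO stands in for the socket so the sketch runs without a Graphite server:

```ruby
require 'stringio'

# Build one line of Graphite's plaintext protocol: "<path> <value> <timestamp>".
def graphite_line(path, value, timestamp = Time.now.utc.to_i)
  "%s %s %d" % [path, value, timestamp]
end

# Write a single metric to anything that behaves like an IO.
# Against a real server this would be TCPSocket.open("localhost", 2003).
def send_metric(io, path, value, timestamp = Time.now.utc.to_i)
  io.puts graphite_line(path, value, timestamp)
end

out = StringIO.new                      # stand-in for the TCP socket
send_metric(out, "web1.load1", 0.1, 1323597139)
print out.string                        # web1.load1 0.1 1323597139
```

Every machine doing this directly is exactly what leads to the three issues listed above.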

Graphite solves this already by having an AMQP input channel, but for the sake of seeing how we might solve these problems I’ll show how to build your own Stomp based system to do this.

We will allow all our machines to Produce messages into the Queue and we will have a small pool of Consumers that read the queue and speak to Graphite using the normal TCP protocol. We’d run Graphite and the Consumers on the same machine to give best possible availability to the TCP connections but the Middleware can be anywhere. The TCP connections to Graphite will be persistent and be reused to publish many metrics – a connection pool in other words.
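
The shape of this design can be tried out in a single process before involving a broker at all. This is a rough sketch, assuming Ruby's built-in thread-safe `Queue` as a stand-in for the middleware queue and a StringIO per consumer as a stand-in for its persistent Graphite connection. `run_pipeline` and the metric names are made up for illustration:

```ruby
require 'stringio'

# Simulate many producers feeding a queue drained by a small consumer pool.
# Queue stands in for the broker; each consumer holds ONE sink that it
# reuses for every message, like a persistent TCP connection to Graphite.
def run_pipeline(messages, consumer_count)
  queue = Queue.new
  sinks = Array.new(consumer_count) { StringIO.new }

  consumers = sinks.map do |sink|
    Thread.new do
      # Queue#pop returns nil once the queue is closed and empty.
      while (msg = queue.pop)
        sink.puts msg
      end
    end
  end

  # Each message is published by its own short-lived "producer".
  messages.map { |m| Thread.new { queue << m } }.each(&:join)

  queue.close            # no more metrics: lets the consumers finish
  consumers.each(&:join)
  sinks.map { |s| s.string.lines.map(&:chomp) }
end

metrics = (1..10).map { |i| "host#{i}.load1 0.#{i} 1323597139" }
delivered = run_pipeline(metrics, 2)
puts delivered.map(&:size).inspect   # how the 10 messages were split
```

How the messages are split between the two consumers varies from run to run, but every message is delivered exactly once, which is the property we want from the real queue.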

So first the Producer side of things. This is a simple CLI tool that takes a metric and value on the command line and publishes it.

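require 'rubygems'
require 'socket'
require 'timeout'
require 'stomp'

raise "Please provide a metric and value on the command line" unless ARGV.size == 2
raise "The metric value must be numeric" unless ARGV[1] =~ /^[\d\.]+$/

# Graphite line format: <metric path> <value> <unix timestamp>
msg = "%s.%s %s %d" % [Socket.gethostname, ARGV[0], ARGV[1], Time.now.utc.to_i]

begin
  Timeout::timeout(2) do
    # login, passcode, broker host, port; the hostname is a placeholder
    stomp = Stomp::Client.new("", "", "stomp.example.com", 61613)
    stomp.publish("/queue/graphite", msg)
    stomp.close
  end
rescue Timeout::Error
  STDERR.puts "Failed to send metric within the 2 second timeout"
  exit 1
end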

This is all there really is to sending a message to the middleware. You’d run it like this:

producer.rb load1 `cat /proc/loadavg|cut -f1 -d' '`

This would send a message with a body like <hostname>.load1 0.1 1323597139, that is the metric path, the value and a Unix timestamp.

The consumer part of this conversation is not a whole lot more complex; you can see it below:

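require 'rubygems'
require 'socket'
require 'stomp'

# Lazily open one TCP connection to Graphite and reuse it for every metric
def graphite
  @graphite ||= TCPSocket.open("localhost", 2003)
end

# login, passcode, broker host (placeholder), port, reliable
client = Stomp::Connection.new("", "", "stomp.example.com", 61613, true)
client.subscribe("/queue/graphite")
loop do
  begin
    msg = client.receive
    graphite.puts msg.body
  rescue
    STDERR.puts "Failed to receive from queue: #{$!}"
    sleep 1
  end
end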

This subscribes to the queue and loops forever, reading messages that then get sent to Graphite using a normal TCP socket. It should really be a bit more complex and use the transaction properties I mentioned, since a crash here will lose a single message.
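
One way to narrow that window is client acknowledgement rather than full transactions: the consumer only acks a message after Graphite has accepted it, so the broker can redeliver anything that was in flight during a crash. This is a sketch under assumptions, not the definitive way to do it. It assumes a connection subscribed with `:ack => "client"`, messages that expose `body` and `headers`, and an `ack` that takes the `message-id` header as in STOMP 1.0. `forward_with_ack`, `FakeMessage` and `FakeConnection` are hypothetical names that exist only so the sketch runs without a broker:

```ruby
require 'stringio'

# Forward one message and acknowledge it only after the write succeeded.
# `conn` is expected to behave like a Stomp::Connection that was subscribed
# with :ack => "client" (assumption, see the text above).
def forward_with_ack(conn, sink)
  msg = conn.receive
  sink.puts msg.body                    # if this raises, no ack is sent
  conn.ack(msg.headers["message-id"])
  msg
end

# Hypothetical in-memory connection standing in for the real broker.
FakeMessage = Struct.new(:headers, :body)

class FakeConnection
  attr_reader :acked

  def initialize(bodies)
    @pending = bodies.each_with_index.map do |body, i|
      FakeMessage.new({ "message-id" => "ID:#{i}" }, body)
    end
    @acked = []
  end

  def receive
    @pending.shift
  end

  def ack(message_id)
    @acked << message_id
  end
end

conn = FakeConnection.new(["web1.load1 0.1 1323597139"])
sink = StringIO.new
forward_with_ack(conn, sink)
puts conn.acked.inspect   # ["ID:0"]
```

The trade-off is that a message may now be delivered twice instead of zero times, which is usually the better failure mode for metrics.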

So that is really all there is to it! You’d totally want to make the receiving end a bit more robust, make it a daemon perhaps using the Daemons or Dante Gems and add some logging. You’d agree though this is extremely simple code that anyone could write and maintain.
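
As one example of that robustness: the consumer above keeps using its Graphite socket even after it breaks. Here is a small sketch of a writer that logs the failure and reopens the connection once before giving up. `GraphiteWriter` is a hypothetical class of my own, not part of any gem, and the block you pass in decides how the connection is opened; in the consumer it would be `{ TCPSocket.open("localhost", 2003) }`:

```ruby
require 'logger'
require 'stringio'

# Keeps one connection open and reopens it once if a write fails.
class GraphiteWriter
  def initialize(logger = Logger.new(STDERR), &opener)
    @logger = logger
    @opener = opener     # block that returns a fresh connection
    @socket = nil
  end

  def write(line)
    attempts = 0
    begin
      attempts += 1
      @socket ||= @opener.call
      @socket.puts line
    rescue IOError, SystemCallError => e
      @logger.warn("write failed (#{e.class}), reconnecting")
      @socket = nil              # drop the broken connection
      retry if attempts < 2      # one reconnect attempt
      raise                      # still failing: let the caller decide
    end
  end
end

buffer = StringIO.new            # stand-in for the Graphite socket
writer = GraphiteWriter.new(Logger.new(nil)) { buffer }
writer.write("web1.load1 0.1 1323597139")
print buffer.string              # web1.load1 0.1 1323597139
```

Daemonizing and log rotation are left out here; the Daemons or Dante gems mentioned above would wrap around something like this.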

This code has a lot of non-obvious side effects though, simply because we use the Middleware for communication:

  • It’s completely decoupled: the Producers don’t know anything about the Consumers other than the message format.
  • It’s reliable because the Consumer can die but the Producers would not even be aware of this or need to care about it
  • It’s scalable: by simply starting more Consumers you can consume messages from the queue quicker and in a load balanced way. Contrast this with perhaps writing a single multi-threaded server with all that entails.
  • It’s trivial to understand how it works and the code is completely readable
  • It protects my Graphite from the Thundering Herd Problem by using the middleware as a buffer and only creating a manageable pool of writers to Graphite
  • It’s language agnostic, you can produce messages from Perl, Ruby, Java etc
  • The network layer can be made resilient without any code changes

You wouldn’t think these 44 lines of code could have all these properties, but they do, and this is why I think this style of coding is particularly well suited to Systems Administrators. We are busy people; we do not have time to implement our own connection pooling, buffers, spools and everything else you would need to duplicate these points from scratch. We have 20 minutes and we just want to solve our problem. Languages like Ruby and technologies like Message Orientated Middleware let you do this.

I’d like to expand on one aspect a bit. I mentioned that the network topology can change without the code being aware of it and that we might have restrictive firewalls preventing everyone from communicating with Graphite. Our 44 lines of code solve these problems with the help of the MOM.

By using the facilities the middleware provides to create complex networks we can distribute our connectivity layer globally as below:

Here we have producers all over the world and our central consumer sitting in the EU somewhere. The queuing and storage characteristics of the middleware are present in every region. The producers in each region only need the ability to communicate with their regional Broker.

The middleware layer is reliably connected in a Mesh topology but in the event that transatlantic communications are interrupted the US broker will store the metrics till the connection problem is resolved. At that point it will forward the messages on to the EU broker and finally to the Consumer.
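
To make that store-and-forward behaviour concrete, here is a toy model of a regional broker. It is only an illustration of the behaviour described above and says nothing about how a real broker implements it. `RegionalBroker` and its methods are invented for this sketch, and a plain Array plays the part of the EU broker:

```ruby
# Toy model of a regional broker: accepts messages at all times, forwards
# them upstream while the link is up, and spools them while it is down.
class RegionalBroker
  attr_reader :spool

  def initialize(upstream)
    @upstream = upstream   # anything responding to <<, e.g. an Array
    @spool = []
    @link_up = true
  end

  def publish(msg)
    @spool << msg
    flush if @link_up
  end

  def link_down!
    @link_up = false
  end

  def link_up!
    @link_up = true
    flush
  end

  private

  # Deliver spooled messages upstream in arrival order.
  def flush
    @upstream << @spool.shift until @spool.empty?
  end
end

eu = []                          # stand-in for the EU broker
us = RegionalBroker.new(eu)
us.publish("us1.load1 0.1 1323597139")
us.link_down!                    # transatlantic link interrupted
us.publish("us1.load1 0.2 1323597199")
puts eu.size                     # 1 (the second metric is spooled)
us.link_up!                      # link restored, spool is drained
puts eu.size                     # 2
```

The producers never notice the outage; they keep publishing to their regional broker the whole time.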

We can deploy brokers in a HA configuration regionally to protect against failure there. This is very well suited for multi DC deployments, deployments in the cloud where you have machines in different Regions and Availability Zones etc.

This is also an approach you could use to allow your DMZ machines to publish metrics without needing the ability to connect directly to the Graphite service. The middleware layer is very flexible in how it’s clustered, who makes the connections etc., so it’s ideal for that.

So in the end, with just a bit of work once we’ve invested in the underlying MOM technology and deployed it, we have solved a bunch of very complex problems using very simple techniques.

While this was done with reliability and scalability in mind, for me possibly the bigger win is that we now have a simple network wide service for creating metrics. You can write to the queue from almost any language, so you can easily allow your developers to just emit metrics from their Java code, and you can emit metrics from the system side, perhaps by reusing Munin.

Using code that is not a lot more complex than this I have been able to gather tens of thousands of Munin metrics into Graphite in a very short period of time. I was able to up my collection frequency to once every minute instead of the traditional 5 minutes, and to do that with a load average below 1 versus below 30 for Munin. This probably has more to do with Graphite being superior than anything else, but the other properties outlined above make this very appealing. Nodes push their statistics as soon as they are built and I never need to edit a Munin config file anymore to tell it where my servers are.

This enabling of all parties in the organization to quickly and easily create metrics without having an operations bottleneck is a huge win and at the heart of what it means to be a DevOps Practitioner.

Part 3 has been written; please read that next.