Moving a service from Puppet to Docker


I’ve moved a number of my more complex infrastructure components from being Puppet managed to being Docker managed. There are many reasons for this, the main one being that my Puppet code is ancient and, faced with either rewriting it to be Puppet 4 ready or rethinking things entirely, I’m leaning towards rethinking. I don’t think CM is solving the right problem for me for certain aspects of my infrastructure, and new approaches can bring more value for my use case.

There are a lot of posts around talking about Docker that concentrate on the image building side or just the running of a container – which I find quite uninteresting and in fact pretty terrible. The real benefit for me comes in the workflow, the API, the events out of the daemon and the container stats. People look at the image and container aspects in isolation and go on about how this is not new technology, but that’s missing the point.

Mainly a workflow problem

I’ll look at an example of moving rbldnsd from Puppet to Docker managed and what I gain from that. Along the way I’ll also throw in some examples of a more complex migration I did for my bind servers. In case you don’t know, rbldnsd is a daemon that maintains DNS based RBLs using config files that look something like this:

$DATASET dnset senderhost
.digitalmarketeer.com   : rejected after user complaints.

You can then query it in the usual ways your MTA supports and decide policy based on that.

The life cycle of this service is typical of the ones I am targeting:

  • A custom RPM had to be built and maintained and served from yet another piece of infrastructure.
  • The basic package, config, service triplet. So vanilla it’s almost not worth looking at the code – it looks like all other package, config, service code.
  • Requires ongoing data management – I add/remove hosts from the blacklists constantly. But this overlaps with the config part above.
  • Requires the ability to test DNS queries work in development before even committing the change
  • Requires rapid updating of configuration data

The last 3 points here deserve some explanation. When I am editing these configuration files I want to be able to test them right there in my shell, without even committing them to git. This means starting up a rbldnsd instance and querying it with dig. This is pretty annoying to do with the Puppet workflow, which I won’t go into here as it’s a huge subject on its own. Suffice to say it doesn’t work for me and ends up not being production like at all.

When I am updating these config files on the running service, a daemon loads them into its running memory. I need to be pretty sure the daemon I am testing on is identical to what’s in production now – ideally bit for bit identical. Again this is pretty hard, as many/most dev environments tend to be a few steps ahead of production. I need a way to say: give me the bits running production, throw this config at them, and then do an end to end test with no hassles and in 5 seconds.

I need a way to orchestrate that config data update to happen when I need it to happen – and not when Puppet runs again – and ideally it has to be quick, not at the pace that Puppet manages my 600 resources. Services should let me introspect them to figure out how to update their data and a generic updater should be able to update all my services that match this flow.

I’ve never really solved the last 3 points with my Puppet workflows for anything I work on; it’s a fiendishly complex problem to solve correctly. Everyone does it with Vagrant instances or ever more complex environments. Or they make their change, commit it, make sure there is test coverage and only get feedback later when something like Beaker runs. This is way too slow for me in this scenario – I just want to block 1 annoying host. Vagrant especially does not work for me as I refuse to run things on my desktop or laptop; I develop on VMs that are remote, so Vagrant isn’t an option. Additionally, Vagrant environments become so complex they are basically a whole new environment, yet built in annoyingly different ways, so keeping them matching production can be a challenge – or just prohibitively slow if you’re building them out with Puppet. So you end up, again, not testing in an environment that’s remotely production like.

These are pretty major things that I’ve never been able to solve to my liking with Puppet. I first moved a bunch of my web sites, then bind and now rbldnsd to Docker, and I think I’ve managed to come up with a workflow and toolchain that solves this for me.

Desired outcome

So maybe to demonstrate what I am after I should show what I want the outcome to look like. Here’s a rbldnsd dev session. I want to block *.mailingliststart.com – specifically, I saw sh8.mailingliststart.com in my logs. I want to test that the hosts are going to be blocked correctly before pushing to prod or even committing to git – it’s so embarrassing to make fix commits for obvious dumb things 😛

So I add to the zones/bl file:

.mailingliststart.com : spam from this host
$ vi zones/bl
$ rake test:host
Host name to test: sh8.mailingliststart.com
Testing sh8.mailingliststart.com
Starting the rbldnsd container...
>>> Testing black list
docker exec rbldnsd dig -p 5301 +noall +answer any sh8.mailingliststart.com.senderhost.bl.rbl @localhost
sh8.mailingliststart.com.senderhost.bl.rbl. 2100 IN A
sh8.mailingliststart.com.senderhost.bl.rbl. 2100 IN TXT "Excessive spam from this host"
>>> Testing white list
Removing the rbldnsd container...
$ git commit zones -m 'block mailingliststart.com'
$ git push origin master

Here I added the bits to the config file and want to be sure the hostname I saw in my logs/headers will actually be blocked:

  • It prepares the latest container by default and mounts my working directory into the container with -v ${PWD}:/service.
  • The container starts up just like it would in production, using the same bits that are running production – but reads the new uncommitted config
  • It uses dig to query the running rbldnsd and run any in-built validation steps the service has (this container has none yet)
  • Cleans up everything
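The steps above can be sketched roughly like this – a hypothetical reduction of the rake test:host task, where the rbldnsd image name, the /service mount and the query details are my assumptions based on the transcript:

```ruby
# Hypothetical sketch of the test:host flow; the "rbldnsd" image name
# and the /service mount are assumptions based on the transcript above.
def dig_command(host, zone: "senderhost.bl.rbl", port: 5301)
  "docker exec rbldnsd dig -p #{port} +noall +answer any #{host}.#{zone} @localhost"
end

def test_host(host)
  puts "Starting the rbldnsd container..."
  # mount the uncommitted working directory into the production image
  system("docker run -d --name rbldnsd -v #{Dir.pwd}:/service rbldnsd")
  puts ">>> Testing black list"
  system(dig_command(host))
ensure
  puts "Removing the rbldnsd container..."
  system("docker rm -f rbldnsd")
end
```

The real task also tests the white list and runs any in-built validation steps; this sketch only shows the shape of the flow.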

The whole thing takes about 4 seconds on a virtual machine running on VirtualBox on a circa 2009 Mac. I saw the host was blacklisted and not somehow also whitelisted; looks good, commit and push.

Once pushed a webhook triggers my update orchestration and the running containers get the new config files only. The whole edit, test and deploy process takes less than a minute. The data though is in git which means tonight when my containers get rebuilt from fresh they will get this change baked in and rolled out as new instances.

There’s one more pretty mind blowing early feedback story I wanted to add here. My bind zones used to be made with puppet defines:

bind::zone{"foo.com": owner => "Bob", masterip => "", type => $server_type}

I had no idea what this actually did by reading that line of code. I could guess, sure. But you only know with certainty when you run Puppet in production, since no matter what the hype says you’ll only see the diff against the actual production file when the change hits the production box via Puppet. Not OK. You also learn nothing this way; it’s always bothered me that Puppet ends up being a crutch like a calculator. With all these abstractions, a junior using this define might never even know what it does or learn how bind works. Desirable in some cases, not for me.

In my Docker bind container I have a YAML file:

     - foo.com

It’s the same data I had in my manifest, just structured a bit differently. Same basic problem though: I have no idea what this does by looking at it. In the Docker world, however, you need to bake this YAML into bind config, and this has to be done during development so that a docker build can get to the final result. So I add a new domain, bar.com:

$ vi zones.yaml
$ rake construct:files
Reading specification file buildsettings.yaml
Reading scope file zones.yaml
Rendering conf/named_slave_zones with mode 644 using template templates/slave_zones.erb
Rendering conf/named_master_zones with mode 644 using template templates/master_zones.erb
 conf/named_master_zones | 10 ++++++++++
 conf/named_slave_zones  |  9 +++++++++
 2 files changed, 19 insertions(+)
$ git diff
+// Bob
+zone "bar.com" {
+  type slave;
+  file "/srv/named/zones/slave/bar.com";
+  masters {
+  };
+};

The rake construct:files task just runs a bunch of ERB templates over the zones hash – it’s basically identical to the templates I had in Puppet, with just a few variable name changes and slightly different looping; no more or less complex.
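A minimal sketch of that rendering step, assuming a flat list of zone names and a much-simplified template – the real templates also handle owners, masters and file modes:

```ruby
require "erb"
require "yaml"

# Hypothetical reduction of the construct:files task; the YAML layout
# and template body are assumptions modelled on the git diff shown above.
zones = YAML.load(<<~YAML)
  - foo.com
  - bar.com
YAML

template = ERB.new(<<~TEMPLATE, trim_mode: "-")
  <% zones.each do |zone| -%>
  zone "<%= zone %>" {
    type slave;
    file "/srv/named/zones/slave/<%= zone %>";
  };
  <% end -%>
TEMPLATE

# render the named config from the zone data
output = template.result(binding)
puts output
```

Because the rendered files are committed, git diff then shows the exact bind config change that will reach production.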

This is the actual change that will hit production. No ifs or buts – that’s what will change in prod. When I run rake test here without committing, this actual production change is tested against the actual bits of the named binary that runs production today.

$ time rake test
docker run -ti --rm -v /home/rip/work/docker_bind:/srv/named -e TEST=1 ripienaar/bind
>> Checking named.conf syntax in master mode
>> Checking named.conf syntax in slave mode
>> Checking zones..
rake test  0.18s user 0.33s system 7% cpu 3.858 total

Again my work dir is mounted into the container version currently running in production, and my uncommitted change is tested using the bit for bit identical version of bind currently in prod. This is a massive confidence boost and the feedback cycle is < 5 seconds. I could do this all day long, maybe even using something like guard to run it in a tmux pane every time I save a file – it’s that fast, and the feedback has real meaning as it relates to production.

Implementation Details

I won’t go into all the Dockerfile details, it’s just normal stuff. The image building and running of containers is not exciting. The layout of the services is something like this:
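To give an idea, a service repository might contain something like the following – the exact file names here are my assumptions, pieced together from the paths mentioned elsewhere in this post:

```
rbldnsd/
├── Dockerfile        # builds the image
├── Rakefile          # test:host and related tasks
├── bin/
│   ├── update.sh     # the UPDATE_METHOD target
│   └── validate.sh   # the VALIDATE_METHOD target
└── zones/
    └── bl            # the RBL data itself
```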


What is exciting is that I can introspect a running container. The Dockerfile has lines like this:

ENV UPDATE_METHOD /service/bin/update.sh
ENV VALIDATE_METHOD /service/bin/validate.sh

And an external tool can find out how this container likes to be updated or validated – and later monitored:

$ docker inspect rbldnsd
        "Env": [
            "UPDATE_METHOD=/service/bin/update.sh",
            "VALIDATE_METHOD=/service/bin/validate.sh"
        ],

My update webhook basically just does this:

mco rpc docker runtime_update container=rbldnsd -S container("rbldnsd").present=1 --batch 1

So I use mcollective to target an update operation on all machines that run the rbldnsd container – 1 at a time. The mcollective agent uses docker inspect to introspect the container; once it knows how the container wants to be updated, it calls that command using docker exec.
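A minimal sketch of what the core of such an agent action might look like – the parsing and method names here are my assumptions, not the actual mcollective agent code:

```ruby
require "json"

# Pull a variable like UPDATE_METHOD out of a container's Env list
def find_env(env, key)
  pair = env.find { |e| e.start_with?("#{key}=") }
  pair && pair.split("=", 2).last
end

# Hypothetical runtime_update action: inspect the container, then
# docker exec the advertised update command inside it
def runtime_update(container)
  env = JSON.parse(`docker inspect #{container}`).first["Config"]["Env"]
  cmd = find_env(env, "UPDATE_METHOD")
  system("docker", "exec", container, cmd) if cmd
end
```

Because the container advertises its own update and validate commands, one generic agent can service every container that follows this convention.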

Outcome Summary

For me this turned out to be a huge win. I had to do a lot of work on the image building side of things, the orchestration, deployment etc – things I had to do with Puppet too anyway. But this basically ticks all the boxes I set out at the beginning of this post, and quite a few more:

  • A reasonable facsimile of the package, config, service triplet that yields idempotent builds
  • A comfortable way to develop and test my changes locally with instant feedback – like unit tests for normal code, but as integration tests of infrastructure components using the same bits as in production.
  • Much better visibility over what’s actually going to change, especially in complex cases where config files are built using templates
  • An approach where my services are standalone and all have to think about their run, update and validation cadences, with those being introspectable and callable from the outside.
  • My services are standalone artefacts and versioned as a whole. Not spread around the place on machines, in package repos, in data and in CM code that attempts to tie it all together. It’s one thing, from one git repo, stored in one place with a version.
  • With validation built into the container and the container being a runnable artefact, I get to do this during CI before rolling anything out, just like I do on my CLI – always using the actual bits in use, or proposed to be used, in Production.
  • Overall I have a lot more confidence in my production changes now than I had with the Puppet workflow.
  • Changes can be rolled out to running containers very rapidly – less than 10 seconds and not at the slow Puppet run pace.
  • My dev environment is hugely simplified yet much more flexible, as I can run current, past and future versions of anything – with less complexity.
  • A very nice middle ground between immutable servers and the need for updating content. Containers are still rebuilt and redeployed every night on schedule and are still disposable, but not at the cost of day to day updates.

I’ve built this process into a number of containers now – some, like this one, are services, and some are web ones like my wiki, where I edit markdown files and they get rolled out to the running containers immediately on push.

I still have some way to go with monitoring, and these services are standalone rather than complex multi-component ones, but I don’t foresee huge issues with those.

I couldn’t have achieved all these outcomes without a rapid way to stand up and destroy production-like environments that are isolated from the machine I am developing on – especially if the final service is some loosely coupled combination of parts from many different sources. I’d love to talk to people who think they have something approaching this without using Docker or similar, and to be proven wrong, but for now this is a huge step forward for me.

So Puppet and CM tools are irrelevant now?

Getting back to the Puppet part of this post: I could come up with some way to mix Puppet in here too. There are though other interesting aspects of the Docker life cycle, which I might blog about later, that make combining these two tools a bit of a square peg in a round hole. In particular, I think people today who use Puppet to build or configure containers are a bit misguided and missing out. I hope they keep working on that and get somewhere interesting – because omfg Dockerfiles – but I don’t think the current attempts are interesting.

It kind of gets back to the old observation that Puppet is not a good choice to manage deployments of Applications but is OK for Infrastructure. I am reconsidering what is infrastructure and what are applications.

So I chose to rethink things from the ground up: how would a nameserver service look if I considered it an Application and not Infrastructure, and how should an Application development life cycle around that service look?

This is not a new realisation for me; I’ve often expressed the desire that Puppet Labs should focus a lot more on the workflow and development cycle, provide tools and hooks for that, and think about how to make it better. I don’t think that’s really happened. So the conclusion for me was that for this Application or Service development and deployment life cycle, Puppet was the wrong tool. I also realise I don’t even remotely resemble their paying target audience.

I am also not saying Puppet or other CM tools are irrelevant because of Docker – that’s just madness. I think there’s a place where the 2 worlds meet, and I am starting to notice that a lot of what I thought was Infrastructure is actually Applications, and these have different development and deployment needs which CM, and Puppet especially, does not address.

Soon there will not be a single mention of DNS related infrastructure in my Puppet code. The container and related files are about equal in complexity and lines of code to what was in Puppet, the final outcome is about the same, and it’s as configurable to my environments. The workflow though is massively improved, because now I have the advantages that Application developers have for this piece of Infrastructure. Finally a much larger part of the Infrastructure As Code puzzle is falling together, and it actually feels like I am working on code, with the same feedback cycles and single verifiable artefact outcomes. And that’s pretty huge. Infrastructure is still CM managed – I just hope to have a radically reduced Infrastructure footprint.

The big take away here isn’t that Docker is some magic bullet killing off vast parts of the existing landscape or destroying a subset of tools like CM completely. It brings workflow and UX improvements that are pretty unique and well worth exploring – an area the CM folks have basically just not focused on. The single biggest win is probably the single artefact aspect, as this enables everything I mentioned here.

It also brings a lot of other things on the daemon side – the API, the events, the stats etc – that I didn’t talk about here, and those are very big deals too with regard to the future work they enable. But that’s for future posts.

Technically I have a lot of bad things to say about almost every aspect of Docker, but those are outweighed by this rapid feedback and increased overall confidence in making change at the pace I would like to.

Some travlrmap updates


Been a while since I posted here about my travlrmap web app; I’ve been out of town the whole of February – first to Config Management Camp and then on holiday to Spain and Andorra.

I released version 1.5.0 last night, which brought a fair few tweaks and changes: updating to the latest Bootstrap, improved Ruby 1.9.3 UTF-8 support, a visual spruce up using the Map Icons Collection, and gallery support.

I take a lot of photos and of course these photos often coincide with travels. I wanted to make it easy to put my travels and photos on the same map, so I have started adding a gallery ability to the map app. For now it’s very simplistic: it makes a point with a custom HTML template that just opens a new tab to the Flickr slideshow feature. This is not what I am after exactly; ideally, clicking view gallery would open an overlay above the map and show the gallery, with escape to close taking you right back to the map. There are some Bootstrap plugins for this but they all seem to have some pain points, so that’s not done yet.

Today there’s only Flickr support: a gallery takes a spec like :gallery: flickr,user=ripienaar,set=12345 and from there it renders the Flickr set. Once I get the style of popup gallery figured out I’ll make that pluggable through gems, so other photo gallery tools can be supported with plugins.
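A parser for that spec format could look something like this – a hypothetical sketch, not the actual travlrmap code:

```ruby
# Hypothetical parser for the :gallery: spec format shown above; splits
# "flickr,user=ripienaar,set=12345" into a type and an options hash.
def parse_gallery(spec)
  type, *opts = spec.split(",")
  { type: type, options: opts.to_h { |o| o.split("=", 2) } }
end

# returns the gallery type plus its key=value options as a hash
parse_gallery("flickr,user=ripienaar,set=12345")
```

With the type pulled out like this, dispatching to per-provider gems later becomes a simple lookup on the type field.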

As you can see from above, the trip to Spain was a road trip. I kept GPX tracks of almost the entire trip and will be adding support to show and render those on the map. Again they’ll appear as a point, just like galleries, and clicking on them will show details like a map view of the route and stats. That’s the aim for the 1.6.0 release, hopefully.

Running a secure docker registry behind Apache


I host a local Docker registry and used to just run it on port 5000 over plain HTTP. I wanted to put it behind SSL on port 443, and it was annoying enough that I thought I’d write it up.

I start my registry pretty much as per the docs:

% docker run --restart=always -d -p 5000:5000 -v /srv/docker-registry:/tmp/registry --name registry registry

This starts it, ensures it stays running, makes it listen on port 5000 and uses a directory on my host for file storage, so I can remove and upgrade the registry without issues.

The problem with this is that there’s no SSL, so you need to configure the docker daemon specifically with:

docker -d --insecure-registry registry.devco.net:5000

At first I thought fronting it with Apache would be as easy as:

<VirtualHost *:443>
   ServerName registry.devco.net
   ServerAdmin webmaster@devco.net
   SSLEngine On
   SSLCertificateFile /etc/httpd/conf.d/ssl/registry.devco.net.cert
   SSLCertificateKeyFile /etc/httpd/conf.d/ssl/registry.devco.net.key
   SSLCertificateChainFile /etc/httpd/conf.d/ssl/registry.devco.net.chain
   ErrorLog /srv/www/registry.devco.net/logs/error_log
   CustomLog /srv/www/registry.devco.net/logs/access_log common
   ProxyPass / http://localhost:5000/
   ProxyPassReverse / http://localhost:5000/
</VirtualHost>

This worked on a basic level, but as soon as I tried to push to the registry I got errors: it seems that after the initial handshake the docker daemon gets told by the registry which endpoint to connect to, and that connection then fails.

Some digging into the registry code showed it uses the Host header of the request to construct the X-Docker-Endpoints header in its replies to the initial handshake, and future requests from the docker daemon use the endpoints advertised there for communications.

By default Apache does not keep the Host header in the proxy request; I had to add ProxyPreserveHost On to the vhost, and after that it was all good – no more insecure registries or having to specify ugly ports in my image tags.

So the final vhost looks like:

<VirtualHost *:443>
   ServerName registry.devco.net
   ServerAdmin webmaster@devco.net
   SSLEngine On
   SSLCertificateFile /etc/httpd/conf.d/ssl/registry.devco.net.cert
   SSLCertificateKeyFile /etc/httpd/conf.d/ssl/registry.devco.net.key
   SSLCertificateChainFile /etc/httpd/conf.d/ssl/registry.devco.net.chain
   ErrorLog /srv/www/registry.devco.net/logs/error_log
   CustomLog /srv/www/registry.devco.net/logs/access_log common
   ProxyPreserveHost On
   ProxyPass / http://localhost:5000/
   ProxyPassReverse / http://localhost:5000/
</VirtualHost>

I also made sure the proxy uses localhost for the port 5000 traffic, and now I can start my registry like this, ensuring that port is not even exposed on the internet facing interfaces:

% docker run --restart=always -d -p 127.0.0.1:5000:5000 -v /srv/docker-registry:/tmp/registry --name registry registry

Some travlrmap updates


Last weekend I finally got to a 1.0.0 release of my travel map software; this week, in between other things, I made a few improvements:

  • Support for 3rd party tile sets like OpenStreetMap, MapQuest, Watercolor, Toner, Dark Matter and Positron. These let you customise your look a bit; the Demo Site has them all enabled.
  • Map sets are supported, I use this to track my Travel Wishlist vs Places I’ve been.
  • Rather than listing every individual YAML file in a directory to define a set, you can now just point at a directory and everything in it will get loaded
  • You can designate a single yaml file as writable, the geocoder can then save points to disk directly without you having to do any YAML things.
  • The geocoder renders better on mobile devices and supports geocoding based on your current position, to make it easy to add points on the go.
  • Lots of UX improvements to the geocoder

Seems like a huge amount of work, but it was all quite small additions, mainly done in an hour or so after work.

Finding a new preferred VM host


I’ve been with Linode since almost day one and I’ve never had reason to complain. Over the years they have upgraded my various machines for free, I’ve had machines with near 1000 days uptime with them, their control panel is great and their support is great. They have a mature set of value added services around the core, like load balancers, backups etc. I’ve recommended them to tens of businesses and friends who are all hosted there. In total, over the years, I’ve probably had or been involved in over a thousand Linode VMs.

This is changing though, I recently moved 4 machines to their London datacenter and they have all been locking up randomly. You get helpful notices saying something like “Our administrators have detected an issue affecting the physical hardware your Linode resides on.” and on pushing the matter I got:

I apologize for the amount of hardware issues that you have had to deal with lately. After viewing your account, their have been quite a few hardware issues in the past few months. Unfortunately, we cannot easily predict when hardware issues may occur, but I can assure you that our administrators do everything possible to address and eliminate the issues as they do come up.

If you do continue to have issues on this host, we would be happy to migrate your Linode to a new host in order to see if that alleviates the issues going forward.

Which is something I can understand: yes, hardware fails randomly, unpredictably etc. I’m a systems guy; we’ve all been on the wrong end of this stick. But here’s the thing – in the more than 10 years I’ve been with Linode and had customers with Linode, this is something that happens very infrequently; my recent experience is completely off the scales bad. It’s clear there’s a problem and something has to be done. You expect your ISP to do something and to be transparent about it.

I have other machines at Linode London that were not all moved there on the same day and they are fine. All the machines I recently moved there on the same day have this problem. I can’t comment on how Linode allocates VMs to hosts, but it seems to me there might be a bad batch of hardware or something along those lines. This is all fine, bad things happen – it’s not like Linode manufactures the hardware – I don’t mind, that’s just realistic. What I do mind is the vague non-answers to the problem. I can move all my machines around and play Russian roulette till it works, or Linode can own up to having a problem, properly investigate and do something about it while being transparent with their customers.

Their community support team reached out almost a month ago, after I said something on Twitter, with “I’ve been chatting with our team about the hardware issues you’ve experienced these last few months trying to get more information, and will follow up with you as soon as I have something for you”. I replied saying I am moving machines one by one as they fail, but never heard back again. So I can’t really continue to support them in the face of this.

When my Yum mirror and Git repo failed recently I decided it was time to try Digital Ocean, since that seems to be what all the hipsters are on about. After a few weeks I’ve decided they are not for me.

  • Their service is pretty barebones, which is fine in general – and I was warned about this on Twitter. But they do not even provide local resolvers; the machines are set up to use the Google resolvers out of the box. This is just not OK at all. Support confirmed they don’t and said they will pass on my feedback. Yes, I can run a local cache on the machine – but why should every one of thousands of VMs carry this extra overhead in terms of config, monitoring, management, security etc when the ISP could provide reliable resolvers like every other ISP?
  • Their London IP addresses at some point had incorrect contact details, or were assigned to a different DC or something. The geoip databases have them being in the US, which makes all sorts of things not work well. The IP whois seems fine now, but it will take time for that to be reflected in all the geoip databases – quite annoying.
  • Their support system does not seem to send emails. I assume I just missed a checkbox somewhere in their dashboard, because it seems inconceivable that people sit and poll the web UI while waiting for feedback from support.

On the email thing – my anti spam could have killed them as well, I guess. I did not investigate this too deeply because, after the resolver situation became clear, it seemed like wasted effort; the resolver issue was the nail in the coffin regardless.

Technically the machine was fine – it was fast, connectivity was good, IPv6 reliable etc. But for the reasons above I am trying someone else. BigV.io contacted me to try them, so I am giving that a go and will see how it looks.
