Replicating NATS Streams between clusters

01/09/2018

I’ve mentioned NATS before – the fast and light weight message broker from nats.io – but I haven’t yet covered the sister product NATS Streaming before so first some intro.

NATS Streaming is in the same space as Kafka, it’s a stream processing system and like NATS it’s super light weight delivered as a single binary and you do not need anything like Zookeeper. It uses normal NATS for communication and ontop of that builds streaming semantics. Like NATS – and because it uses NATS – it is not well suited to running over long cluster links so you end up with LAN local clusters only.

This presents a challenge since very often you wish to move data out of your LAN. I wrote a Replicator tool for NATS Streaming which I’ll introduce here.

Streaming?


First I guess it’s worth covering what Streaming is, I should preface also that I am quite new in using Stream Processing tools so I am not about to give you some kind of official answer but just what it means to me.

In a traditional queue like ActiveMQ or RabbitMQ, which I covered in my Common Messaging Patterns posts, you do have message storage, persistence etc but those who consume a specific queue are effectively a single group of consumers and messages either go to all or load shared all at the same pace. You can’t really go back and forth over the message store independently as a client. A message gets ack’d once and once it’s been ack’d it’s done being processed.

In a Stream your clients each have their own view over the Stream, they all have their unique progress and point in the Stream they are consuming and they can move backward and forward – and indeed join a cluster of readers if they so wish and then have load balancing with the other group members. A single message can be ack’d many times but once ack’d a specific consumer will not get it again.

This is to me the main difference between a Stream processing system and just a middleware. It’s a huge deal. Without it you will find it hard to build very different business tools centred around the same stream of data since in effect every message can be processed and ack’d many many times vs just once.

Additionally Streams tend to have well defined ordering behaviours and message delivery guarantees and they support clustering etc. much like normal middleware has. There’s a lot of similarity between streams and middleware so it’s a bit hard sometimes to see why you won’t just use your existing queueing infrastructure.

Replicating a NATS Stream


I am busy building a system that will move Choria registration data from regional data centres to a global store. The new Go based Choria daemon has a concept of a Protocol Adapter which can receive messages on the traditional NATS side of Choria and transform them into Stream messages and publish them.

This gets me my data from the high frequency, high concurrency updates from the Choria daemons into a Stream – but the Stream is local to the DC. Indeed in the DC I do want to process these messages to build a metadata store there but I also want to processes these messages for replication upward to my central location(s).

Hence the importance of the properties of Streams that I highlighted above – multiple consumers with multiple views of the Stream.

There are basically 2 options available:

  1. Pick a message from a topic, replicate it, pick the next one, one after the other in a single worker
  2. Have a pool of workers form a queue group and let them share the replication load

At the basic level the first option will retain ordering of the messages – order in the source queue will be the order in the target queue. NATS Streaming will try to redeliver a message that timed out delivery and it won’t move on till that message is handled, thus ordering is safe.

The 2nd option since you have multiple workers you have no way to retain ordering of the messages since workers will go at different rates and retries can happen in any order – it will be much faster though.

I can envision a 3rd option where I have multiple workers replicating data into a temporary store where on the other side I inject them into the queue in order but this seems super prone to failure, so I only support these 2 methods for now.

Limiting the rate of replication


There is one last concern in this scenario, I might have 10s of data centres all with 10s of thousands of nodes. At the DC level I can handle the rate of messages but at the central location where I might have 10s of DCs x 10s of thousands of machines if I had to replicate ALL the data at near real time speed I would overwhelm the central repository pretty quickly.

Now in the case of machine metadata you probably want the first piece of metadata immediately but from then on it’ll be a lot of duplicated data with only small deltas over time. You could be clever and only publish deltas but you have the problem then that should a delta publish go missing you end up with a inconsistent state – this is something that will happen in distributed systems.

So instead I let the replicator inspect your JSON, if your JSON has something like fqdn in it, it can look at that and track it and only publish data for any single matching sender every 1 hour – or whatever you configure.

This has the effect that this kind of highly duplicated data is handled continuously in the edge but that it only gets a snapshot replication upwards once a hour for any given node. This solves the problem neatly for me without there being any risks to deltas being lost, it’s also a lot simpler to implement.

Choria Stream Replicator


So finally I present the Choria Stream Replicator. It does all that was described above with a YAML configuration file, something like this:

debug: false                     # default
verbose: false                   # default
logfile: "/path/to/logfile"      # STDOUT default
state_dir: "/path/to/statedir"   # optional
topics:
    cmdb:
        topic: acme.cmdb
        source_url: nats://source1:4222,nats://source2:4222
        source_cluster_id: dc1
        target_url: nats://target1:4222,nats://target2:4222
        target_cluster_id: dc2
        workers: 10              # optional
        queued: true             # optional
        queue_group: cmdb        # optional
        inspect: host            # optional
        age: 1h                  # optional
        monitor: 10000           # optional
        name: cmdb_replicator    # optional

Please review the README document for full configuration details.

I’ve been running this in a test DC with 1k nodes for a week or so and I am really happy with the results, but be aware this is new software so due care should be given. It’s available as RPMs, has a Puppet module, and I’ll upload some binaries on the next release.