R.I.Pienaar

The Choria Emulator

10/09/2017

In my previous posts I discussed what goes into load testing a Choria network, what connections are made, subscriptions are made etc.

From this it’s obvious the things we should be able to emulate are:

  • Connections to NATS
  • Subscriptions – which implies number of agents and sub collectives
  • Message payload sizes

To make it realistically affordable to emulate many more machines that I have I made an emulator that can start numbers of Choria daemons on a single node.

I’ve been slowly rewriting MCollective daemon side in Go which means I already had all the networking and connectors available there, so a daemon was written:

usage: choria-emulator --instances=INSTANCES [<flags>]
 
Emulator for Choria Networks
 
Flags:
      --help                 Show context-sensitive help (also try --help-long and --help-man).
      --version              Show application version.
      --name=""              Instance name prefix
  -i, --instances=INSTANCES  Number of instances to start
  -a, --agents=1             Number of emulated agents to start
      --collectives=1        Number of emulated subcollectives to create
  -c, --config=CONFIG        Choria configuration file
      --tls                  Enable TLS on the NATS connections
      --verify               Enable TLS certificate verifications on the NATS connections
      --server=SERVER ...    NATS Server pool, specify multiple times (eg one:4222)
  -p, --http-port=8080       Port to listen for /debug/vars

You can see here it takes a number of instances, agents and collectives. The instances will all respond with ${name}-${instance} on any mco ping or RPC commands. It can be discovered using the normal mc discovery – though only supports agent and identity filters.

Every instance will be a Choria daemon with the exact same network connection and NATS subscriptions as real ones. Thus 50 000 emulated Choria will put the exact same load of work on your NATS brokers as would normal ones, performance wise even with high concurrency the emulator performs quite well – it’s many orders of magnitude faster than the ruby Choria client anyway so it’s real enough.

The agents they start are all copies of this one:

emulated0
=========
 
Choria Agent emulated by choria-emulator
 
      Author: R.I.Pienaar <rip@devco.net>
     Version: 0.0.1
     License: Apache-2.0
     Timeout: 120
   Home Page: http://choria.io
 
   Requires MCollective 2.9.0 or newer
 
ACTIONS:
========
   generate
 
   generate action:
   ----------------
       Generates random data of a given size
 
       INPUT:
           size:
              Description: Amount of text to generate
                   Prompt: Size
                     Type: integer
                 Optional: true
            Default Value: 20
 
 
       OUTPUT:
           message:
              Description: Generated Message
               Display As: Message

You can this has a basic data generator action – you give it a desired size and it makes you a message that size. It will run as many of these as you wish all called like emulated0 etc.

It has an mcollective agent that go with it, the idea is you create a pool of machines all with your normal mcollective on it and this agent. Using that agent then you build up a different new mcollective network comprising the emulators, federation and NATS.

Here’s some example of commands – you’ll see these later again when we talk about scenarios:

We download the dependencies onto all our nodes:

$ mco playbook run setup-prereqs.yaml --emulator_url=https://example.net/rip/choria-emulator-0.0.1 --gnatsd_url=https://example.net/rip/gnatsd --choria_url=https://example.net/rip/choria

We start NATS on our first node:

$ mco playbook run start-nats.yaml --monitor 8300 --port 4300 -I test1.example.net

We start the emulator with 1500 instances per node all pointing to our above NATS:

$ mco playbook run start-emulator.yaml --agents 10 --collectives 10 --instances 750 --monitor 8080 --servers 192.168.1.1:4300

You’ll then setup a client config for the built network and can interact with it using normal mco stuff and the test suite I’ll show later. Simularly there are playbooks to stop all the various parts etc. The playbooks just interact with the mcollective agent so you could use mco rpc directly too.

I found I can easily run 700 to 1000 instances on basic VMs – needs like 1.5GB RAM – so it’s fairly light. Using 400 nodes I managed to build a 300 000 node Choria network and could easily interact with it etc.

Finally I made a ec2 environment where you can stand up a Puppet Master, Choria, the emulator and everything you need and do load tests on your own dime. I was able to do many runs with 50 000 emulated nodes on EC2 and the whole lot cost me less than $20.

The code for this emulator is very much a work in progress as is the Go code for the Choria protocol and networking but the emulator is here if you want to take a peek.

What to consider when speccing a Choria network

09/22/2017

In my previous post I talked about the need to load test Choria given that I now aim for much larger workloads. This post goes into a few of the things you need to consider when sizing the optimal network size.

Given that we now have the flexibility to build 50 000 node networks quite easily with Choria the question is should we, and if yes then what is the right size. As we can now federate multiple Collectives together into one where each member Collective is a standalone network we have the opportunity to optimise for the operability of the network rather than be forced to just build it as big as we can.

What do I mean when I say the operability of the network? Quite a lot of things:

  • What is your target response time on a unbatched mco rpc rpcutil ping command
  • What is your target discovery time? You should use a discovery data source but broadcast is useful, so how long do you want?
  • If you are using a discovery source, how long do you want to wait for publishes to happen?
  • How many agents will you run? Each agent makes multiple subscriptions on the middleware and consume resources there
  • How many sub collectives do you want? Each sub collective multiply the amount of subscriptions
  • How many federated networks will you run?
  • When you restart the entire NATS, how long do you want to wait for the whole network to reconnect?
  • How many NATS do you need? 1 can run 50 000 nodes, but you might want a cluster for HA. Clustering introduces overhead in the middleware
  • If you are federating a global distributed network, what impact does the latency cross the federation have and what is acceptable

So you can see that to a large extend the answer here is related to your needs and not only to the needs of benchmarking Choria. I am working on a set of tools to allow anyone to run tests locally or on a EC2 network. The main work hose is a Choria emulator that runs a 1 000 or more Choria instances on a single node so you can use a 50 node EC2 network to simulate a 50 000 node one.

Middleware Scaling Concerns


Generally for middleware brokers there are a few things that impact their scalability:

  • Number of TCP Connections – generally a thread/process is made for each
  • TLS or Plain text – huge overhead in TLS typically and it can put a lot of strain on single systems
  • Number of message targets – queues, topics, etc. Different types of target have different overheads. Often a thread/process for each.
  • Number of subscribers to each target
  • Cluster overhead
  • Persistence overheads like storage and ACKs etc

You can see it’s quite a large number of variables that goes into this, anywhere that requires a thread or process to manage 1 of it means you should get worried or at least be in a position to measure it.

NATS uses 1 go routine for each connection and no additional ones per subscription etc, its quite light weight but there are no hard and fast rules. Best to observe how it grows by needs, something I’ll include in my test suite.

How Choria uses NATS


It helps then to understand how Choria will use NATS and what connections and targets it makes.

A single Choria node will:

  • Maintain a single TCP+TLS connection to NATS
  • Subscribe to 1 queue unique to the node for every Subcollective it belongs to
  • For every agent – puppet, package, service, etc – subscribe to a broadcast topic for that agent. Once in every Subcollective. Choria comes default with 7 agents.

So if you have a node with 10 agents in 5 Subcollectives:

  • 50 broadcast subjects for agents
  • 5 queue subjects
  • 1 TCP+TLS connection

So 100 nodes will have 5 500 subscriptions, 550 NATS subjects and 100 TCP+TLS connections.

Ruby based Federation brokers will maintain 1 subscription to a queue subject on the Federation and same on the Collective. The upcoming Go based Federation Brokers will maintain 10 (configurable) connections to NATS on each side, each with these subscriptions.

Conclusion


This will give us a good input into designing a suite of tools to measure various things during the run time of a big test, check back later for details about such a tool.

You can read about the emulator I wrote in the next post.

Load testing Choria

09/19/2017

Overview


Many of you probably know I am working on a project called Choria that modernize MCollective which will eventually supersede MCollective (more on this later).

Given that Choria is heading down a path of being a rewrite in Go I am also taking the opportunity to look into much larger scale problems to meet some client needs.

In this and the following posts I’ll write about work I am doing to load test and validate Choria to 100s of thousands of nodes and what tooling I created to do that.

Middleware


Choria builds around the NATS middleware which is a Go based middleware server that forgoes a lot of the persistence and other expensive features – instead it focusses on being a fire and forget middleware network. It has an additional project should you need those features so you can mix and match quite easily.

Turns out that’s exactly what typical MCollective needs as it never really used the persistence features and those just made the associated middleware quite heavy.

To give you an idea, in the old days the community would suggest every ~ 1000 nodes managed by MCollective required a single ActiveMQ instance. Want 5 500 MCollective nodes? That’ll be 6 machines – physical recommended – and 24 to 30 GB RAM in a cluster just to run the middleware. We’ve had reports of much larger RabbitMQ networks on 4 or 5 servers – 50 000 managed nodes or more, but those would be big machines and they had quite a lot of performance issues.

There was a time where 5 500 nodes was A LOT but now it’s becoming a bit every day, so I need to focus upward.

With NATS+Choria I am happily running 5 500 nodes on a single 2 CPU VM with 4GB RAM. In fact on a slightly bigger VM I am happily running 50 000 nodes on a single VM and NATS uses around 1GB to 1.5GB of RAM at peak.

Doing 100s of RPC requests in a row against 50 000 nodes the response time is pretty solid around 16 seconds for a RPC call to every node, it’s stable, never drops a message and the performance stays level in the absence of Java GC issues. This is fast but also quite slow – the Ruby client manages about 300 replies every 0.10 seconds due to the amount of protocol decoding etc that is needed.

This brings with it a whole new level of problem. Just how far can we take the client code and how do you determine when it’s too big and how do I know the client, broker and federation I am working on significantly improve things.

I’ve also significantly reworked the network protocol to support Federation but the shipped code optimize for code and config simplicity over lets say support for 20 000 Federation Collectives. When we are talking about truly gigantic Choria networks I need to be able to test scenarios involving 10s of thousands of Federated Network all with 10s of thousands of nodes in them. So I need tooling that lets me do this.

Getting to running 50 000 nodes


Not everyone just happen to have a 50 000 node network lying about they can play with so I had to improvise a bit.

As part of the rewrite I am doing I am building a Go framework with the Choria protocol, config parsing and network handling all built in Go. Unlike the Ruby code I can instantiate multiple of these in memory and run them in Go routines.

This means I could write a emulator that can start a number of faked Choria daemons all in one process. They each have their own middleware connection, run a varying amount of agents with a varying amount of sub collectives and generally behave like a normal MCollective machine. On my MacBook I can run 1 500 Choria instances quite easily.

So with fewer than 60 machines I can emulate 50 000 MCollective nodes on a 3 node NATS cluster and have plenty of spare capacity. This is well within budget to run on AWS and not uncommon these days to have that many dev machines around.

In the following posts I’ll cover bits about the emulator, what I look for when determining optimal network sizes and how to use the emulator to test and validate performance of different network topologies.

Follow-up Posts

Choria Network Protocols – Transport

03/21/2017

The old MCollective protocols are now ancient and was quite Ruby slanted – full of Symbols and used YAML and quite language specific – in Choria I’d like to support other Programming Languages, REST gateways and so forth, so a rethink was needed.

I’ll look at the basic transport protocol used by the Choria NATS connector, usually it’s quite unusual to speak of Network Protocols when dealing with messages on a broker but really for MCollective it is exactly that – a Network Protocol.

The messages need enough information for strong AAA, they need to have an agreed on security structure and within them live things like RPC requests. So a formal specification is needed which is exactly what a Protocol is.

While creating Choria the entire protocol stack has been redesigned on every level except the core MCollective messages – Choria maintains a small compatibility layer to make things work. To really achieve my goal I’d need to downgrade MCollective to pure JSON data at which point multi language interop should be possible and easy.

Networks are Onions


Network protocols tend to come in layers, one protocol within another within another. The nearer you go to the transport the more generic it gets. This is true for HTTP within TCP within IP within Ethernet and likewise it’s true for MCollective.

Just like for TCP/IP and HTTP+FTP one MCollective network can carry many protocols like the RPC one, a typical MCollective install uses 2 protocols at this inner most layer. You can even make your own, the entire RPC system is a plugin!

( middleware protocol
  ( transport packet that travels over the middleware
      ( security plugin internal representation
        ( mcollective core representation that becomes M::Message
          ( MCollective Core Message )
          ( RPC Request, RPC Reply )
          ( Other Protocols, .... )
        )
      )
    )
  )
)

Here you can see when you do mco rpc puppet status you’ll be creating a RPC Request wrapped in a MCollective Message, wrapped in a structure the Security Plugin dictates, wrapped in a structure the Connector Plugin dictates and from there to your middleware like NATS.

Today I’ll look at the Transport Packet since that is where Network Federation lives which I spoke about yesterday.

Transport Layer


The Transport Layer packets are unauthenticated and unsigned, for MCollective security happens in the packet carried within the transport so this is fine. It’s not inconceivable that a Federation might only want to route signed messages and it’s quite easy to add later if needed.

Of course the NATS daemons will only accept TLS connections from certificates signed by the CA so these network packets are encrypted and access to the transport medium is restricted, but the JSON data you’ll see below is sent as is.

In all the messages shown below you’ll see a seen-by header, this is a feature of the NATS Connector Plugin that records the connected NATS broker, we’ll soon expose this information to MCollective API clients so we can make a traceroute tool for Federations. This header is optional and off by default though.

I’ll show messages in Ruby format here but it’s all JSON on the wire.

Message Targets


First it’s worth knowing where things are sent on the NATS clusters. The targets used by the NATS connector is pretty simple stuff, there will no doubt be scope for improvement once I look to support NATS Streaming but for now this is adequate.

  • Broadcast Request for agent puppet in the mycorp sub collective – mycorp.broadcast.agent.puppet
  • Directed Request to a node for any agent in the mycorp sub collective – mycorp.node.node1.example.net
  • Reply to a node identity dev1.example.net with pid 9999 and a message sequence of 10mycorp.reply.node1.example.net.9999.10

As the Federation Brokers are independent of Sub Collectives they are not prefixed with any collective specific token:

  • Requests from a Federation Client to a Federation Broker Cluster called productionchoria.federation.production.federation queue group production_federation
  • Replies from the Collective to a Federation Broker Cluster called productionchoria.federation.production.collective queue group production_collective
  • production cluster Federation Broker Instances publishes statistics – choria.federation.production.stats

These names are designed so that in smaller setups or in development you could use a single NATS cluster with Federation Brokers between standalone collectives. Not really a recommended thing but it helps in development.

Unfederated Messages


Your basic Unfederated Message is pretty simple:

{
  "data" => "... any text ...",
  "headers" => {
    "mc_sender" => "dev1.example.net",
    "seen-by" => ["dev1.example.net", "nats1.example.net"],
    "reply-to" => "mcollective.reply.dev1.example.net.999999.0",
  }
}
  • it’s is a discovery request within the sub collective mcollective and would be published to mcollective.broadcast.agent.discovery.
  • it is sent from a machine identifying as dev1.example.net
  • we know it’s traveled via a NATS broker called nats1.example.net.
  • responses to this message needs to travel via NATS using the target mcollective.reply.dev1.example.net.999999.0.

The data is completely unstructured as far as this message is concerned it just needs to be some text, so base64 encoded is common. All the transport care for is getting this data to its destination with metadata attached, it does not care what’s in the data.

The reply to this message is almost identical:

{
  "data" => "... any text ...",
  "headers" => {
    "mc_sender" => "dev2.example.net",
    "seen-by" => ["dev1.example.net", "nats1.example.net", "dev2.example.net", "nats2.example.net"],
  }
}

This reply will travel via mcollective.reply.dev1.example.net.999999.0, we know that the node dev2.example.net is connected to nats2.example.net.

We can create a full traceroute like output with this which would show dev1.example.net -> nats1.example.net -> nats2.example.net -> dev2.example.net

Federated Messages


Federation is possible because MCollective will just store whatever Headers are in the message and put them back on the way out in any new replies. Given this we can embed all the federation metadata and this metadata travels along with each individual message – so the Federation Brokers can be entirely stateless, all the needed state lives with the messages.

With Federation Brokers being clusters this means your message request might flow over a cluster member a but the reply can come via b – and if it’s a stream of replies they will be load balanced by the members. The Federation Broker Instances do not need something like Consul or shared store since all the data needed is in the messages.

Lets look at the same Request as earlier if the client was configured to belong to a Federation with a network called production as one of its members. It’s identical to before except the federation structure was added:

{
  "data" => "... any text ...",
  "headers" => {
    "mc_sender" => "dev1.example.net",
    "seen-by" => ["dev1.example.net", "nats1.fed.example.net"],
    "reply-to" => "mcollective.reply.dev1.example.net.999999.0",
    "federation" => {
       "req" => "68b329da9893e34099c7d8ad5cb9c940",
       "target" => ["mcollective.broadcast.agent.discovery"]
    }
  }
}
  • it’s is a discovery request within the sub collective mcollective and would be published via a Federation Broker Cluster called production via NATS choria.federation.production.federation.
  • it is sent from a machine identifying as dev1.example.net
  • it’s traveled via a NATS broker called nats1.fed.example.net.
  • responses to this message needs to travel via NATS using the target mcollective.reply.dev1.example.net.999999.0.
  • it’s federated and the client wants the Federation Broker to deliver it to it’s connected Member Collective on mcollective.broadcast.agent.discovery

The Federation Broker receives this and creates a new message that it publishes on it’s Member Collective:

{
  "data" => "... any text ...",
  "headers" => {
    "mc_sender" => "dev1.example.net",
    "seen-by" => [
      "dev1.example.net",
      "nats1.fed.example.net", 
      "nats2.fed.example.net", 
      "fedbroker_production_a",
      "nats1.prod.example.net"
    ],
    "reply-to" => "choria.federation.production.collective",
    "federation" => {
       "req" => "68b329da9893e34099c7d8ad5cb9c940",
       "reply-to" => "mcollective.reply.dev1.example.net.999999.0"
    }
  }
}

This is the same message as above, the Federation Broker recorded itself and it’s connected NATS server and produced a message, but in this message it intercepts the replies and tell the nodes to send them to choria.federation.production.collective and it records the original reply destination in the federation header.

A node that replies produce a reply, again this is very similar to the earlier reply except the federation header is coming back exactly as it was sent:

{
  "data" => "... any text ...",
  "headers" => {
    "mc_sender" => "dev2.example.net",
    "seen-by" => [
      "dev1.example.net",
      "nats1.fed.example.net", 
      "nats2.fed.example.net", 
      "fedbroker_production_a", 
      "nats1.prod.example.net",
      "dev2.example.net",
      "nats2.prod.example.net"
    ],
    "federation" => {
       "req" => "68b329da9893e34099c7d8ad5cb9c940",
       "reply-to" => "mcollective.reply.dev1.example.net.999999.0"
    }
  }
}

We know this node was connected to nats1.prod.example.net and you can see the Federation Broker would know how to publish this to the client – the reply-to is exactly what the Client initially requested, so it creates:

{
  "data" => "... any text ...",
  "headers" => {
    "mc_sender" => "dev2.example.net",
    "seen-by" => [
      "dev1.example.net",
      "nats1.fed.example.net", 
      "nats2.fed.example.net", 
      "fedbroker_production_a", 
      "nats1.prod.example.net",
      "dev2.example.net",
      "nats2.prod.example.net",
      "nats3.prod.example.net",
      "fedbroker_production_b",
      "nats3.fed.example.net"
    ],
  }
}

Which gets published to mcollective.reply.dev1.example.net.999999.0.

Route Records


You noticed above there’s a seen-by header, this is something entirely new and never before done in MCollective – and entirely optional and off by default. I anticipate you’d want to run with this off most of the time once your setup is done, it’s a debugging aid.

As NATS is a full mesh your message probably only goes one hop within the Mesh. So if you record the connected server you publish into and the connected server your message entered it’s destination from you have a full route recorded.

The Federation Broker logs and MCollective Client and Server logs all include the message ID so you can do a full trace in message packets and logs.

There’s a PR against MCollective to expose this header to the client code so I will add something like mco federation trace some.node.example.net which would send a round trip to that node and tell you exactly how the packet travelled. This should help a lot in debugging your setups as they will now become quite complex.

The structure here is kind of meh and I will probably improve on it once the PR in MCollective lands and I can see what is the minimum needed to do a full trace.

By default I’ll probably record the identities of the MCollective bits when Federated and not at all when not Federated. But if you enable the setting to record the full route it will produce a record of MCollective bits and the NATS nodes involved.

In the end though from the Federation example we can infer a network like this:

Federation NATS Cluster

  • Federation Broker production_a -> nats2.fed.example.net
  • Federation Broker production_b -> nats3.fed.example.net
  • Client dev1.example.net -> nats1.fed.example.net

Production NATS Cluster:

  • Federation Broker production_a -> nats1.prod.example.net
  • Federation Broker production_b -> nats3.prod.example.net
  • Server dev2.example.net -> nats2.prod.example.net

We don’t know the details of all the individual NATS nodes that makes up the entire NATS mesh but this is good enough.

Of course this sample is the pathological case where nothing is connected to the same NATS instances anywhere. In my tests with a setup like this the overhead added across 10 000 round trips against 3 nodes – so 30 000 replies through 2 x Federation Brokers – was only 2 seconds, I couldn’t reliably measure a per message overhead as it was just too small.

The NATS gem do expose the details of the full mesh though since NATS will announce it’s cluster members to clients, I might do something with that not sure. Either way, auto generated network maps should be totally possible.

Conclusion


So this is how Network Federation works in Choria. It’s particularly nice that I was able to do this without needing any state on the cluster thanks to past self making good design decisions in MCollective.

Once the seen-by thing is figured out I’ll publish JSON Schemas for these messages and declare protocol versions.

I can probably make future posts about the other message formats but they’re a bit nasty as MCollective itself is not yet JSON safe, the plan is it would become JSON safe one day and the whole thing will become a lot more elegant. If someone pings me for this I’ll post it otherwise I’ll probably stop here.

Older Posts