{"id":3676,"date":"2017-09-19T10:55:29","date_gmt":"2017-09-19T09:55:29","guid":{"rendered":"https:\/\/www.devco.net\/?p=3676"},"modified":"2017-10-09T08:38:43","modified_gmt":"2017-10-09T07:38:43","slug":"load-testing-choria","status":"publish","type":"post","link":"https:\/\/www.devco.net\/archives\/2017\/09\/19\/load-testing-choria.php","title":{"rendered":"Load testing Choria"},"content":{"rendered":"

<H2>Overview<\/H2>
\nMany of you probably know I am working on a project called Choria<\/a>, a modernization of MCollective that will eventually supersede it (more on this later).<\/p>\n

Given that Choria is heading down the path of being a rewrite in Go, I am also taking the opportunity to look into much larger scale problems to meet some client needs.<\/p>\n

In this and the following posts I’ll write about the work I am doing to load test and validate Choria at 100s of thousands of nodes, and the tooling I created to do that.<\/p>\n

<H2>Middleware<\/H2>
\nChoria builds around the NATS<\/a> middleware, a Go based middleware server that forgoes persistence and other expensive features – instead it focuses on being a fire-and-forget middleware network. There is an additional project should you need those features, so you can mix and match quite easily. <\/p>\n

It turns out that’s exactly what a typical MCollective deployment needs, as it never really used the persistence features – those just made the associated middleware quite heavy.<\/p>\n

To give you an idea, in the old days the community would suggest a single ActiveMQ instance for every ~1 000 nodes managed by MCollective. Want 5 500 MCollective nodes? That would be 6 machines – physical hardware recommended – and 24 to 30 GB of RAM in a cluster just to run the middleware. We’ve had reports of much larger RabbitMQ networks on 4 or 5 servers – 50 000 managed nodes or more – but those were big machines and they had quite a lot of performance issues.<\/p>\n

There was a time when 5 500 nodes was A LOT, but now it’s becoming a bit everyday, so I need to focus upward.<\/p>\n

With NATS+Choria I am happily running 5 500 nodes on a single 2 CPU VM with 4 GB RAM. In fact, on a slightly bigger single VM I am happily running 50 000 nodes, with NATS using around 1 GB to 1.5 GB of RAM at peak.<\/p>\n

Doing 100s of RPC requests in a row against 50 000 nodes, the response time is pretty solid at around 16 seconds for an RPC call to every node; it’s stable, never drops a message, and – in the absence of Java GC issues – the performance stays level. This is fast but also quite slow: the Ruby client manages only about 300 replies every 0.10 seconds due to the amount of protocol decoding etc. that is needed.<\/p>\n

This brings with it a whole new level of problem: just how far can we take the client code, how do you determine when a network is too big, and how do I know that the client, broker and federation work I am doing significantly improves things?<\/p>\n

I’ve also significantly reworked the network protocol to support Federation<\/a>, but the shipped code optimizes for code and config simplicity over, let’s say, support for 20 000 Federation Collectives. When we are talking about truly gigantic Choria networks I need to be able to test scenarios involving 10s of thousands of Federated Networks, each with 10s of thousands of nodes in them. So I need tooling that lets me do this.<\/p>\n

<H2>Getting to running 50 000 nodes<\/H2>
\nNot everyone just happens to have a 50 000 node network lying about that they can play with, so I had to improvise a bit.<\/p>\n

As part of the rewrite I am building a Go framework with the Choria protocol, config parsing and network handling all implemented in Go. Unlike the Ruby code, I can instantiate many of these in memory and run them in goroutines.<\/p>\n

This means I could write an emulator<\/a> that can start a number of faked Choria daemons all in one process. They each have their own middleware connection, run a varying number of agents with a varying number of sub-collectives, and generally behave like a normal MCollective machine. On my MacBook I can run 1 500 Choria instances quite easily.<\/p>\n

So with fewer than 60 machines I can emulate 50 000 MCollective nodes on a 3 node NATS cluster and have plenty of spare capacity. This is well within budget to run on AWS, and it’s not uncommon these days to have that many dev machines around.<\/p>\n

In the following posts I’ll cover bits about the emulator, what I look for when determining optimal network sizes and how to use the emulator to test and validate performance of different network topologies.<\/p>\n

<H2>Follow-up Posts<\/H2><\/p>\n