Lab Infra Rebuild Part 3


This is part of an ongoing series about rebuilding my lab infrastructure; see the initial post here.

Today I’ll talk a bit about Configuration Management, having previously mentioned that I am ditching Kubernetes.

Server Management

The general state of server management is pretty sad: you have Ansible or Puppet, and then a long tail of tools that either just don’t work or are under such terrible corporate control that you can’t touch them.

I am, as most people are aware, a very long-term Puppet user - almost since day 1 - and have contributed significant features like Hiera and the design of Data in Modules. I’ve not had much need or interest in being involved in that community for ages, but I want to like Puppet and I want to keep using it where appropriate.

I think Puppet is more or less finished - there are bug fixes and such of course - but in general core Puppet is stable, mature, does what one wants and has the extension points one needs. There’s not really any reason not to use it if it fits your needs and tastes, and should things go pear-shaped it’s an easy fork. One can essentially stay on a current version for years at this point and it’s fine. There used to be some issues around packaging, but even this is Apache-2 now. If you have the problem it solves it does so very well, but the industry has moved on, so there’s not much scope for extending it in my opinion.

All the action is in the content - modules on the Forge. Vox Pupuli are doing an amazing job; I honestly do not know how they do so much or maintain so many modules. It’s really impressive.

There’s a general state of rot though: modules I used to use are almost all abandoned, with a few having moved to Vox. I wonder if stats about this are available, but I get the impression that content-wise things are taking a huge dive, with Vox holding everything afloat and Puppet hoping to make money from selling commercial modules - I doubt that will sustain a business their size, but it’s a good idea.

Read on for more about Puppet today in my infra.

Given the general state of things in the server management world I decided to use Puppet again for this round of infrastructure rebuild. I can’t see this lasting, alas. Most people are aware that Puppet has been bought by Perforce and has had a huge shift in people and such. It’s inevitable that revenue generation is the main push at the moment.

Unfortunately the way this plays out is pretty unpleasant. Here’s an example: I have an ancient, and crappy, monitoring script that runs on the node and checks last_run_summary.yaml to infer the current status of runs - are they failing and so on. It’s worked for a very long time, and last_run_summary.yaml is a contract that is supposed to be maintained. Something in recent Puppet broke this script (last_run_summary.yaml now behaves differently), so I thought I’d ask on their Slack if there is something newer and maintained before fixing mine.
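
For reference, here is a minimal sketch of what such a node-side check might look like. The summary file path and the staleness threshold are assumptions, and a Nagios-style exit code convention is used purely for illustration; the resources.failed, events.failure and time.last_run fields are the long-standing layout of that file.

    // Minimal sketch of a node-side last_run_summary.yaml health check.
    // The path and staleness threshold are assumptions; the resources.failed,
    // events.failure and time.last_run fields are the long-standing layout.
    package main

    import (
        "fmt"
        "os"
        "time"

        "gopkg.in/yaml.v3"
    )

    type summary struct {
        Resources struct {
            Failed int `yaml:"failed"`
        } `yaml:"resources"`
        Events struct {
            Failure int `yaml:"failure"`
        } `yaml:"events"`
        Time struct {
            LastRun int64 `yaml:"last_run"`
        } `yaml:"time"`
    }

    func main() {
        // default AIO statedir location, adjust for your agent configuration
        data, err := os.ReadFile("/opt/puppetlabs/puppet/cache/state/last_run_summary.yaml")
        if err != nil {
            fmt.Println("CRITICAL: cannot read run summary:", err)
            os.Exit(2)
        }

        var s summary
        if err := yaml.Unmarshal(data, &s); err != nil {
            fmt.Println("CRITICAL: cannot parse run summary:", err)
            os.Exit(2)
        }

        age := time.Since(time.Unix(s.Time.LastRun, 0))

        switch {
        case s.Resources.Failed > 0 || s.Events.Failure > 0:
            fmt.Printf("CRITICAL: %d failed resources in the last run\n", s.Resources.Failed)
            os.Exit(2)
        case age > 2*time.Hour: // an assumed staleness threshold
            fmt.Printf("WARNING: last run was %v ago\n", age.Round(time.Minute))
            os.Exit(1)
        default:
            fmt.Printf("OK: last run %v ago with no failures\n", age.Round(time.Minute))
        }
    }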

Immediately you get a message from Puppet people saying to use Puppet Enterprise for this. In a community Slack where people are just asking questions about a 200 line script. The suggestion is to move to a price-not-disclosed product versus a 200 line script, without even so much as a question about needs or environment, or whether the suggestion would even be relevant. It’s just corporate enshittification. Eventually I got some good answers from the community, despite Puppet’s best efforts.

This outcome is of course entirely predictable; I can only hope Vox doesn’t get burned in the inevitable slide into the sewer.

Managing Puppet

I have 2 old EL based machines that could not make the trip to Puppet 8 due to some legacy there; the rest got moved to EL9 with a Puppet Server. I am, though, a big fan of running puppet apply based builds and will likely move to that instead of the server. Apply based workflows present a few problems, primarily how you get the code onto the nodes and what the workflow around that looks like.

I’d like a git based flow where I commit a change, CI packages it and puts it on a repository, and the fleet updates to it, ideally as soon as possible. Further, I want visibility into the runs and node-side monitoring that fits my event based world view, and I want central control to trigger runs and do scheduled maintenance.
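
To make the shape of that node-side step concrete, here is a rough sketch in Go of a fetch-verify-apply cycle. The bundle URL, file paths and checksum handling are purely illustrative assumptions, not a description of any real implementation.

    // Rough sketch of a node-side "fetch, verify, apply" step for a git+CI flow.
    // The URL, file locations and checksum scheme are illustrative assumptions.
    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "io"
        "log"
        "net/http"
        "os"
        "os/exec"
    )

    // fetch downloads the CI-built code bundle to a local file
    func fetch(url, dest string) error {
        resp, err := http.Get(url)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("unexpected status %s", resp.Status)
        }

        out, err := os.Create(dest)
        if err != nil {
            return err
        }
        defer out.Close()

        _, err = io.Copy(out, resp.Body)
        return err
    }

    // checksum returns the hex SHA-256 of a file, used for tamper detection
    func checksum(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()

        h := sha256.New()
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return hex.EncodeToString(h.Sum(nil)), nil
    }

    func main() {
        bundle := "https://repo.example.net/puppet/site.tar.gz" // hypothetical CI artifact
        expected := os.Getenv("BUNDLE_SHA256")                  // e.g. distributed via a signed manifest

        if err := fetch(bundle, "/var/cache/site.tar.gz"); err != nil {
            log.Fatalf("fetch failed: %v", err)
        }

        sum, err := checksum("/var/cache/site.tar.gz")
        if err != nil || sum != expected {
            log.Fatalf("tamper check failed: got %q want %q (%v)", sum, expected, err)
        }

        // unpack and apply; a real flow would also manage environments, locking and reporting
        if err := exec.Command("tar", "-xzf", "/var/cache/site.tar.gz", "-C", "/etc/puppetlabs/code").Run(); err != nil {
            log.Fatalf("unpack failed: %v", err)
        }
        if err := exec.Command("puppet", "apply", "/etc/puppetlabs/code/environments/production/manifests").Run(); err != nil {
            log.Fatalf("puppet apply failed: %v", err)
        }
    }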

So I built a system to orchestrate Puppet, called Puppet Control, that I’ll release soon.

  • Provides a nice CLI for triggering runs, querying runs and more
  • Git+CI based flow handles fetching, validating and deploying code to nodes for apply, including tamper detection
  • Has concurrency control built in for fine-grained resource management of the file servers used to deliver code bundles
  • Includes a run scheduler with concurrency controls, for example: only 1 database server out of all the database servers can run Puppet at a time, while webservers can run 10 concurrently
  • Can do on-demand runs as soon as possible, subject to concurrency control, to ensure the shared infra performs at its peak
  • Has various ways to find nodes in certain states, like pctl nodes failing to find nodes with failing resources
  • Can show real-time events of Puppet runs
  • Can have maintenance windows declared that will stop all scheduled Puppet runs
  • Exposes run statistics to Prometheus for runtimes, changes, health and more

There’s an optional video below with more details, showing the code release flow and more.

I’ll release this eventually; it’s dependent on some work happening in Choria at the moment.

The concurrency control is a big deal. Scheduling Puppet runs is quite a difficult task. Usually the solution Puppet users reach for is just to spread the runs out in cron with some random time distribution.

That leaves the problem of fast Puppet runs during maintenance windows though. In the past we had a command, mco puppet runall 100, which would discover all the nodes and then, in a loop, ask for their state and schedule more runs as others finished - the goal being to keep as close to 100 running at a time. The choice of 100 nodes relates to the capacity of the Puppet Server infrastructure.

This worked fine, but it was very resource intensive on the Choria/MCollective network, as thousands of RPC requests had to be made to know the current state, and it was not suited to use in an ongoing fashion. With Puppet Control every run happens at the desired concurrency, but without a central orchestrator trying to make all the choices. It’s significantly cheaper on the network.

More significantly, by allowing the concurrency group name to be configured on a per-node basis, one can have different policies per type of machine. This is a big deal: let’s say we are using puppet apply but we do not wish our 5 database machines to all do concurrent runs and potentially restart at the same time. By creating a concurrency governor for just those machines, set to 1, we prevent that.

With this in place, triggering all the nodes to run at their groups’ configured concurrency is a simple pctl nodes trigger, which takes 2 seconds to complete. From there the nodes will run without overwhelming the servers.

Another interesting thing here is that this model maps well onto Ansible local mode as well. So in theory (unexplored theory) this same central controller and scheduler could be built for Ansible.

This is built on Choria Concurrency Governors, which are an amazing distributed systems building block.
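
For intuition, a governor is essentially a named slot pool with a fixed capacity that a node must acquire a slot from before running. The real thing is distributed fleet-wide via the Choria Broker; the toy model below only illustrates the per-group capacity semantics, with made-up group names and capacities.

    // Toy local model of per-group concurrency governors. Real Choria Governors
    // are distributed across the fleet; this only shows the capacity semantics.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // governor is a named slot pool with a fixed capacity
    type governor chan struct{}

    func newGovernor(capacity int) governor { return make(governor, capacity) }

    // run executes work once a slot is free and releases the slot afterwards
    func (g governor) run(work func()) {
        g <- struct{}{}        // block until a slot is available
        defer func() { <-g }() // release the slot when done
        work()
    }

    func main() {
        // made-up policy: databases one at a time, webservers ten at a time
        governors := map[string]governor{
            "database": newGovernor(1),
            "web":      newGovernor(10),
        }

        nodes := map[string]string{
            "db1": "database", "db2": "database", "db3": "database",
            "web1": "web", "web2": "web", "web3": "web",
        }

        var wg sync.WaitGroup
        for node, group := range nodes {
            wg.Add(1)
            go func(node, group string) {
                defer wg.Done()
                governors[group].run(func() {
                    fmt.Printf("%s running puppet apply (group %s)\n", node, group)
                    time.Sleep(time.Second) // stand-in for an actual run
                })
            }(node, group)
        }
        wg.Wait()
    }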

Choria

No surprise that I am using Choria for a large part of this, with a bit of a twist though. Choria, as released on choria.io, is actually a distribution of a much larger system that is tailored for Puppet users. That official Choria release requires Puppet Agent and will not support unofficial builds or unsupported deviations from that.

With the writing on the wall for Puppet, though, this leaves me with a problem: I have no easy-to-use public distribution of Choria for non-Puppet users. Puppet provides the following to Choria:

  • Deployment of the packages and files to nodes
  • Management of policies and plugins on nodes
  • Certificate Authority with certs on every node
  • Optional discovery source of truth in PuppetDB
  • Libraries for managing packages and services

These are actually quite significant hurdles to cross to create a fully Puppet-free Choria distribution.

That said, Choria has for a long time had another life as a large scale orchestrator that is not Puppet related. This implies:

  • Can self-provision at a rate of 1000+ nodes a minute
  • Can deploy its own plugins, delivering multiple plugins to millions of nodes in minutes
  • Can manage its own security, either integrated with a CA or using a new JWT+ed25519 based approach (a generic sketch follows below)
  • Can integrate with non-Puppet data sources using external extension points
  • Can upgrade itself in place, without any help from Puppet, in an Over-The-Air style self-upgrade system
  • Has centralised RBAC integrated with tools like Open Policy Agent
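
The JWT+ed25519 item is worth a tiny illustration. This is not Choria’s actual token format or issuance workflow - just a generic sketch, using the golang-jwt library, of the building blocks that approach implies: an ed25519 issuer keypair and signed tokens carrying claims that any holder of the public key can verify offline.

    // Generic illustration of issuing and verifying an ed25519-signed JWT.
    // The claims and identities are made up; Choria's real tokens differ.
    package main

    import (
        "crypto/ed25519"
        "crypto/rand"
        "fmt"
        "log"
        "time"

        "github.com/golang-jwt/jwt/v5"
    )

    func main() {
        // in a real deployment the signing key lives with the issuer, not the node
        pub, priv, err := ed25519.GenerateKey(rand.Reader)
        if err != nil {
            log.Fatal(err)
        }

        // sign a token asserting an identity and some permissions (made-up claims)
        claims := jwt.MapClaims{
            "identity": "node1.example.net",
            "perms":    []string{"puppet.runonce"},
            "exp":      time.Now().Add(24 * time.Hour).Unix(),
        }
        signed, err := jwt.NewWithClaims(jwt.SigningMethodEdDSA, claims).SignedString(priv)
        if err != nil {
            log.Fatal(err)
        }

        // anyone holding the issuer public key can verify the token offline
        token, err := jwt.Parse(signed, func(t *jwt.Token) (any, error) { return pub, nil })
        if err != nil || !token.Valid {
            log.Fatalf("token did not verify: %v", err)
        }
        fmt.Println("verified claims:", token.Claims)
    }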

Over the last 2 years I was on a related contract, and all these components have been firmed up and made much more capable, reliable and horizontally scalable, and have been used in anger in the real world in some quite serious mission critical builds.

Thus, my Choria infrastructure is actually a hybrid between Puppet and non-Puppet. Puppet places the RPMs and the plugins that require Puppet (package, service, legacy plugins), but Choria self-provisions everything else and owns the life of the agent and more. I am running the new protocol with Open Policy Agent based RBAC. I’ve made some changes to the various Puppet modules to enable this and will start looking for some early adopters.

This means my many Raspberry Pis - from Xmas lights to sensors and HVAC control - are all now managed by Choria as well, since the provisioner caters for them even though they are without Puppet.

I wrote about the new protocol in an ADR, and you can see there is scope for integration with TPMs and more. This is the future world view of Choria and it is already in use on some hundreds of thousands of machines.

Conclusion

So that’s a bit about managing the machines without Kubernetes or an ISP managing them.

As I’ve been out of active Puppet use for a few years, it’s been interesting to come back with some semi-fresh eyes and rethink some of the old things I believed were true when I used it constantly.

Choria will play a critical role in the path forward as I move much of the workload into containers managed as per the previous blog post, leaving the problem Puppet solves as quite a thin layer. I have some thoughts on doing something to at least replace the most basic package-config-service trio of CM with something in Choria long term.