The problem of getting EC2 instances to do what you want is quite significant; I find the whole thing a bit flaky, with too many moving parts:
- Which AMI to start, and when
- Once started, how do you configure it from base to functional? Especially in a way that doesn’t become vendor lock-in.
- How do you manage the massive sprawl of instances, inventory them and track your assets?
- Monitoring and general lifecycle management
- When and how do you shut them down, and what cleanup is needed? Being billed by the hour means this has to be a consideration.
These are significant problems and just the tip of the iceberg. All of the traditional aspects of infrastructure management – like Asset Management, Monitoring and Procurement – are totally useless in the face of the cloud.
A lot of work is being done in this space by tools like Pool Party, Fog and Opscode, and by many other players like the countless companies launching control panels and clouds overlaying other clouds. As a keen believer in Open Source, though, I find many of these options unappealing.
I want to focus here on the second problem above and show how I pulled together a number of my Open Source projects to automate it. I built a generic provisioner that is hopefully expandable and usable in your own environments. The provisioner deals with all the interactions between Puppet on the nodes, the Puppet Master, the Puppet CA and the administrators.
<rant> Sadly, activity in the Puppet space is a bit lacking when it comes to making it really easy to get going on a cloud. There are suggestions on the level of monitoring syslog files from a cronjob and signing certificates based on that. Really. It’s a pretty sad state of affairs when that’s the state of the art.
Compare the ease of using Chef’s Knife with a lot of the suggestions currently out there for using Puppet in EC2 like these: 1, 2, 3 and 4.
I’m not trying to have a general Puppet-bashing session here, but I think it’s quite defining of the two user bases that cloud readiness is such an afterthought so far in Puppet and its community. </rant>
My basic need is that instances all start in the same state: I want one base AMI that I massage into the desired final state. Most of this work has to be done by Puppet so it’s repeatable, and driving the process will be done by MCollective.
I bootstrap the EC2 instances using my EC2 Bootstrap Helper, which I use to install MCollective with just a provision agent, configure it and hook it into my collective.
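For illustration, the result on the node is a minimal server.cfg along these lines – every hostname and credential below is a placeholder, and your connector and security plugin choices may well differ:

```
# /etc/mcollective/server.cfg -- illustrative sketch, all values are placeholders
topicprefix = /topic/
libdir = /usr/libexec/mcollective
daemonize = 1
logfile = /var/log/mcollective.log
loglevel = info

# identity defaults to the short hostname; on EC2 you may want the fqdn
identity = ip-10-1-2-3.ec2.internal

securityprovider = psk
plugin.psk = changeMe

connector = stomp
plugin.stomp.host = stomp.example.net
plugin.stomp.port = 6163
plugin.stomp.user = mcollective
plugin.stomp.password = changeMe
```

Note that at this point the node only knows how to reach the middleware; it knows nothing about any Puppet Master.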
From there, the following steps need to be done (a rough sketch of the flow in code follows the list):
- Pick a nearby Puppet Master, perhaps using EC2 Region or country as guides
- Set up the host – perhaps using /etc/hosts – to talk to the right master
- Revoke and clean any old certs for this hostname on all masters
- Instruct the node to create a new CSR and send it to its master
- Sign the certificate
- Run Puppet in my initial bootstrap environment; this sets up some hard-to-do things, like the facts my full build needs
- Do the final Puppet run in my normal production environment
- Notify me of the new node using XMPP, Twitter, Google Calendar, Email, Boxcar and whatever else I want
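Strung together, that flow looks roughly like the sketch below, written against the MCollective SimpleRPC client API. The agent names, action names and the fact used to pick a master are illustrative placeholders; the published DDLs of your agents are authoritative:

```ruby
#!/usr/bin/env ruby
# Compressed sketch of the provisioning flow using the MCollective
# SimpleRPC client API. Agent names, action names and the fact used
# to pick a master are illustrative placeholders.
require 'mcollective'

include MCollective::RPC

node = "ip-10-1-2-3.ec2.internal" # placeholder for the new node

# Discover the Puppet Masters near the node via facts instead of a
# static list, so new masters and regions are picked up automatically
ca = rpcclient("puppetca")
ca.fact_filter "ec2_placement_availability_zone", "us-east-1a"

# Revoke and clean any old certificates for this hostname on all masters
ca.clean(:certname => node)

# Have the node point at its chosen master and submit a fresh CSR
provision = rpcclient("provision")
provision.identity_filter node
provision.request_certificate

# Sign the waiting CSR on the masters
ca.sign(:certname => node)

# Bootstrap environment first, then the normal production run
provision.bootstrap_puppet
printrpc provision.run_puppet

# Finally notify interested parties; the action and arguments here are
# made up, check the naggernotify agent for the real ones
notify = rpcclient("naggernotify")
notify.sendmsg(:message => "Provisioned #{node}", :recipient => "xmpp://you@example.net")
```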
This is a lot of work to do on every node. More importantly, it’s a task that involves many other nodes – Puppet Masters, notifiers and so forth. It has to adapt dynamically to your environment and not need reconfiguring when you get new Puppet Masters. It has to deal with new data centers, regions and countries without needing any configuration or even a restart. And it has to happen automatically, without any user interaction, so that your auto-scaling infrastructure can take care of booting new instances even while you sleep.
The provisioning system I wrote does just this. It follows the above logic for any new node and is configurable as to which facts to use to pick a master and how to notify you of new systems. Thanks to discovery of resources, it adapts automatically to your ever-changing environment. The actions to perform on the node are easily pluggable: just create an agent that complies with the published DDL, like the sample agent.
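As an illustration of that contract, a trimmed-down hypothetical agent might look like this – the action names and reply fields are mine, and the sample agent and its DDL define the real interface:

```ruby
module MCollective
  module Agent
    # Hypothetical trimmed-down provision agent; the sample agent
    # and its published DDL define the real actions and outputs
    class Provision < RPC::Agent
      action "request_certificate" do
        # Generate a CSR and send it to the configured master without
        # waiting around for the certificate to be signed
        reply[:output] = %x[puppetd --test --noop --waitforcert 0 --color=none]
        reply[:exitcode] = $?.exitstatus
      end

      action "bootstrap_puppet" do
        # The first real run happens in the special bootstrap environment
        reply[:output] = %x[puppetd --test --environment bootstrap --color=none --summarize]
        reply[:exitcode] = $?.exitstatus
      end
    end
  end
end
```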
You can see it in action in the video below. I am using Amazon’s console to start the instance, though you’d absolutely want to automate that for your needs. You can also watch it directly on blip.tv here. For best effect – and to be able to read the text – please view it fullscreen.
In case the text is unreadable in the video, a log file similar to the one in the video can be seen here, and an example config here.
Past this point my Puppet runs are managed by my MCollective Puppet Scheduler.
While this is all done using EC2, nothing prevents you from applying the same techniques to your own data center or non-cloud environment.
Hopefully this shows that, with MCollective, you can wrap all the logic needed for very complex interactions with systems – systems perhaps not known for their good reusable APIs – in simple-to-understand wrappers, exposing them to the network at large with APIs that help you reach your goals.
The various bits of open source I used here are:
- MCollective
- EC2 Bootstrap Helper on CentOS 5.5
- The MCollective Server Provisioner
- The sample provisioner agent
- My Nagger notification framework with its XMPP plugin
- The Naggernotify MCollective Agent
- The Puppet CA MCollective Agent
- Puppet and Facter
It is fantastic to finally see all of these pieces pulled together cleanly. Puppet needs to push more integration like this into the Puppet Dashboard project.
R.I: You said, “Cloud readiness is such an afterthought so far in Puppet and its community.” Do you still feel that way?
@peter: There have been improvements recently – see the cloudpack for example – but yes, at present I stand by that assertion.
I’d add that the problem has been noticed at all levels over at Puppet Labs and things will improve in time.
Trying to set this up on EC2 but I get stuck on the puppet cert sign step…
The puppet certname is the fqdn (ip-X-X-X-X.ec2.internal) but the provisioning agent looks for a certname based on the hostname (ip-X-X-X-X), so it will not be able to sign.
This obviously worked for you, so I’m trying to figure out what I’ve missed?
Any hint would be very appreciated!
@john that’s weird, but I guess the easiest fix is to do something in the post-install that sets the mcollective identity to the fqdn of the machine rather than the hostname.
This is probably due to behavior differences between distros: on Red Hat hostname is the fqdn, on others hostname != fqdn, and mcollective defaults to hostname.
Got it sorted by adding a
open('/etc/mcollective/server.cfg', 'a') { |f| f.puts "identity = " + %x[facter fqdn] }
line in the bootstrap script.
Thanks for the hint, nice work on all this!
Hi
Thanks, that’s an interesting framework! 🙂 I love the idea of connecting to the most appropriate puppet master.
I’m unsure about why puppet autosign is a bad idea in EC2, where permission to connect to the puppet master can be strictly controlled by security groups. Isn’t your provisioner effectively performing ‘auto-signing’ anyway, as it automatically sends out a request to sign the certificate to all puppet masters? I’ve discussed using security groups and bootstrapping puppet on EC2 another way here:
http://www.practicalclouds.com/content/guide/puppet-deployment
Keep up the great work!
regards
Dave
Dave
Autosign will sign anything that can connect. This process will only approve systems that have been correctly configured into the mcollective security system – i.e. ones with SSL keys and such.
Additionally, prior to being provisioned the node does not even need to know the IP/name/etc of the master. You could also, in theory, dynamically open firewall rules, but only for machines known to mcollective.
Effectively you can create an old-skool provisioning-VLAN-like system on the public cloud.
So no, it’s not the same.
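To make the contrast concrete: typical EC2 autosigning is little more than a glob in the master’s autosign.conf, with no cleanup step and no knowledge of the node beyond its ability to connect:

```
# /etc/puppet/autosign.conf -- signs any host that can connect and
# presents a certname matching this glob; there is no clean step
*.ec2.internal
```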
Hi
My point is that with the use of security groups, which you control, you can limit connections to your puppet master to only those nodes that you have authorised by making them a member of your group (I’m assuming that you have exclusive rights to assign nodes to your “puppet agent” group). In this way, the puppet master can happily sign any request that comes in, because it knows it will only be from an authorised agent.
In your mcollective-based boot, what authorises the client, i.e. triggers the certificate to be signed? Is it by virtue of being your AMI?
I agree that in your method it is good that the booting instance does not need to know which puppet master to connect to, although it does need details to connect to mcollective. I wonder if it would be feasible to also tell the booting instance which puppet master to connect to via the user-data?
regards
Dave
The same bootstrap method applies to physical kit in a DC, where the networks are a whole different story.
Either way, whatever works for you.
Autosign is not auto-clean-then-auto-sign, so if the same host gets reprovisioned or your cloud re-issues you the same hostname, autosign will simply bail out since there is a previous matching certificate. My method avoids that too.
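Sketched in shell-out form with the puppetca CLI of the day (the certname is a placeholder), the extra step amounts to this:

```ruby
# Sketch of the clean-then-sign sequence the provisioner automates on
# each master; the certname is a placeholder
certname = "ip-10-1-2-3.ec2.internal"

# Remove any stale certificate left over from a previous life of this
# hostname -- autosign has no equivalent of this step and will bail out
%x[puppetca --clean #{certname}]

# ... the node then submits a fresh CSR ...

# Sign the newly submitted request
%x[puppetca --sign #{certname}]
```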
By relying on autosign you put all your eggs in one basket – the basket being human error in managing security groups, for example; any single, simple misconfiguration that allows an unwanted machine to make the TCP connection might leak sensitive information.
A managed third party – even if that’s software like the provisioner – adds an extra level of security, auditing and authority where you can put in your own checks and balances against incoming nodes, for example *requiring* them to be on your collective and requiring them to have certain metadata supplied.
Without it, you might have a user provisioning machines through the EC2 console, putting them in the wrong group and bootstrapping them through autosigning. If that node is lacking metadata or isn’t the right spec for the role, you might find it listed in exported resources databases, where the fallout from that single mistake can be pretty big.
This is a great tool – thanks for the work on it and for sharing it. I’ve been trying to get it implemented. Everything works great except the actual bootstrap and full run. The provisioner agent is returning from reply[:output] = %x[#{@puppetd} --test --environment bootstrap --color=none --summarize] before the process has completed (about 5 seconds into the puppet run), causing the provisioner service to throw up. The DDL has a timeout of 360, but it’s returning after 5 seconds. Is there another timeout I need to set somewhere?
@justin that sounds like the ttl at the top of the agent needs to be tweaked
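For anyone else hitting this: in MCollective of that era the agent carries its own metadata block, separate from the DDL file, and the timeout there is what the server enforces on a running action. A hypothetical example – all values below are placeholders:

```ruby
module MCollective
  module Agent
    class Provision < RPC::Agent
      # The timeout in the DDL only tells clients how long to wait;
      # this metadata timeout governs how long the server lets an
      # action run before killing it
      metadata :name        => "provision",
               :description => "Server provisioning agent",
               :author      => "you",
               :license     => "Apache 2.0",
               :version     => "1.0",
               :url         => "http://example.com/",
               :timeout     => 360
    end
  end
end
```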
It would appear the video/account has been removed from blip; is there any chance of providing an alternative link to the video?
This article is clearly a few years old now; just wondering what improvements you might have made on this front, as the approach described is very appealing to me.
That’s a shame, I don’t have it anymore. I’ve not worked on it since, sorry.
Seems a copy of the video made its way onto YouTube – http://www.youtube.com/watch?v=-iEgz9MD3qA.
Totally understand that you’ve not worked on this since, but in your opinion, do you think the approach presented here is still valid… or have things progressed in the Puppet/MCollective world that supersede it?
Oh yeah, it’s a totally viable approach. Modern MCollective has new features that will really improve the code flow too; things like the lock handling can be improved with the new discovery system and so forth.
So with a bit of TLC this could now be improved a lot, and sure, I still think this is the right way to do it.