Monday, March 31, 2014

AWS Autoscaling based on SWF Queue Depth

Amazon has great tools for autoscaling. Combined with our recently completed AMI Bakery, this should be awesome, right?
  • Generate an AMI
  • Create an auto scaling group that uses that AMI
  • Profit!

But... and there's always a but... 


Amazon's autoscaling can't scale based on the metric that we care about: Simple Workflow queue depth. 

I've got a great idea! It's round, and you can put it under heavy things to help them move!

Start from the basics:

  • We have a create_instance script that lets you create an instance/ami of a given type. 
  • We have a scale script that compares what you asked for to what is running and adds/removes resources as needed. 
What we lack is the ability to trigger that scaling based on something we care about. 

Enter scalecontroller.py. 


This script takes a bunch of command line arguments to let you tune it to the queue you want to watch, number of jobs that trigger increase/decrease, and how big/small the array should get. 

When a scaling event is initiated, it runs scale.py, which then handles the logic of adding/removing. 
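
To make the trigger logic concrete, here is a minimal sketch of the kind of decision scalecontroller.py makes. It is not the real script: the function name, parameter names and the one-instance step are all assumptions for illustration. The queue depth itself would come from SWF (for example its CountPendingActivityTasks action), which is left out here so the logic stays testable.

```python
def desired_size(current, queue_depth, scale_up_at, scale_down_at,
                 min_size, max_size, step=1):
    """Return the instance count to ask for, given the observed queue depth.

    All names and the fixed step size are hypothetical, not taken from the
    real scalecontroller.py.
    """
    if queue_depth >= scale_up_at:
        # Too many jobs waiting: grow the array.
        target = current + step
    elif queue_depth <= scale_down_at:
        # Queue is nearly empty: shrink the array.
        target = current - step
    else:
        # Inside the dead band: leave things alone.
        target = current
    # Never go outside the configured array size.
    return max(min_size, min(max_size, target))
```

The result would be handed to scale.py as the new "what you asked for" number, and scale.py would do the actual adding and removing.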

How does scale know which AMI to use?


Chef Data Bags!

Each application has 2 data bag items associated with it: artifact and launch-artifact. Each item contains the branch/build numbers for that product.

To use our "portal" example, you would have something like:

knife data bag show portal
id: portal
branch: master
build: 13


knife data bag show launch-portal
id: portal
branch: master
build: 12

The portal data bag shows the most recently deployed version. This is what you would use to pre-bundle an AMI, prior to the release of the code to the world.

The launch-portal data bag is the one scale looks at to figure out what AMI to load.

We have a standard naming convention for the AMIs. Using Chef attributes and the branch/build data in the data bags, scale can calculate the AMI to use, test for its existence/availability, and act accordingly. 
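
As a rough sketch of that lookup: the real naming convention isn't shown here, so the "app-platform-branch-build" format below is purely hypothetical, as are the function names. The idea is just that the name is computed from the launch data bag item and then checked against the AMIs that actually exist.

```python
def ami_name(app, branch, build, platform="linux"):
    """Build an AMI name from data bag values (hypothetical naming format)."""
    return "{0}-{1}-{2}-{3}".format(app, platform, branch, build)


def pick_ami(launch_item, available_names, platform="linux"):
    """Return the AMI name to launch, or None if it has not been baked yet.

    launch_item is the launch-<app> data bag item as a dict;
    available_names is the set of AMI names that currently exist.
    """
    name = ami_name(launch_item["id"], launch_item["branch"],
                    launch_item["build"], platform)
    return name if name in available_names else None
```

With the launch-portal item above (master, build 12), this would look for "portal-linux-master-12" and return None if no such image is available, so the caller can act accordingly.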




AWS, Chef and scaling a mixed Windows/Linux environment (Part 1)

Note: you'll notice that I rarely talk about what company exactly... it's not them, it's me. I prefer to keep that sort of thing somewhat private.

I've been at $COMPANY for a year and a half now. In that time we've gone through a surprising number of architectural changes. Iterative development indeed.

I'm really happy with where we are today, and would like to take some time to write it all down.

In the Beginning

Whenever I hear someone talk, I always like to hear about the road they took to get to this awesome place. I want to know what they tried and hated; I want to hear about all the failures. I think there's a lot to learn there. So, I'm going to start from my beginning and work my way up to what we are doing now.

When I first came here, the infrastructure was all AWS, managed via RightScale. We weren't really using RS correctly though. We weren't letting instances inherit configurations from deployments, we weren't making smart use of templates, we were achieving a LOT of our functionality as shell code within templates, rather than using their engines.

Deploying software was a nightmare.

We had 2 different core applications (there was a lot more going on than that, but we'll stick to just the main in-house developed stuff).

One was Python, the other a Grails app in the form of a war file.

Python app, running on Windows and Linux.

This was deployed from SVN. The RightScale config included the desired SVN revision. Since we weren't using deployment inheritance, you had to go to each instance and update them one at a time.

The process for deploying code differed wildly from Windows to Linux, even though they were both from the same Python repo.

On the Windows side you would RDC to an existing box, svn up to the revision, manually edit any config files, bundle an AMI, launch new instances and terminate the old ones.

On the Linux side you would update the RightScale template variables and run the deployment script to execute the changes. The scripts were basically shell scripts running a lot of "sed s/this/that/g"

Grails app

The war file was downloaded from S3. Each environment (dev1, dev2, qa, prod, etc...) had a bucket, and the deployment script downloaded the artifact from that. The war files were copied to the buckets by hand, along with the config templates, which were processed with sed just like the Python code.


Next: First pass at automation

AWS, Chef and scaling a mixed Windows/Linux environment (Part 2)

The first pass



Our first pass was to get everything moved off of RightScale and into a system we managed. We had the technical skills to do this ourselves, so no need to give RS so much money!

After a month+ of evaluation and testing, we settled on Chef 10. We found the cookbook system to fit better into our overall development workflow, and the open source community was strong. Puppet was a serious contender too, and to this day I think either would have worked as well as the other.

But, there can be only one, and the one was Chef.

Starting at the foundation



We began with a simple goal: We need to be able to launch and minimally configure a base system of either Windows or Linux using the same set of tools.

By minimally configure, I mean:

  • System is online and accessible from our office or over our VPN
  • Users have their accounts with appropriate access
  • Tools that are universally used on all systems are present (sysstat on Linux, Powershell on Windows, etc...)

Sounds easy, right?



Turns out there are a LOT of awesome open source tools for managing your AWS infrastructure, and not many of them support Windows AND Linux equally. Now start looking for things that understand Amazon Simple Workflow or some of the other offerings outside of EC2 and S3.  Even Chef required a different plugin and bootstrapping syntax to do Windows.

This led to creating a basic AWS management platform in-house. We used Python/Boto, and did our own tooling around security groups and S3 buckets.

At the end of this phase we had a config file for each "fleet" (our term for stack or environment). This config file contained the core information needed to manage it: AWS credentials, root SSH keys, things like that. From there you could initialize an entire fleet in a brand new AWS account. All buckets, groups, instances, roles, etc... would be generated for you. Cool!

But that just gets us instances. How did we make them do the work?



We set up a couple of initial cookbooks. Over time this has grown, but in the beginning we had company_system and company_mainproduct.

The system cookbook contained all the OS level stuff: How do I install Java? Python? What users should have access here?

The mainproduct cookbook contained everything about our app: How do I find the Grails war file? How should tomcat be configured for this app? That level of things.

Deploying code was now as simple as a knife command



knife ssh "role:portal" "sudo chef-client -o recipe[product::portal_deploy]"

Woo!

Oh... wait... Windows.... Strange hostnames (long story... has to do with the lack of unique hostnames on the windows boxes messing with Chef's node discovery)

knife winrm -m hostname -x user -P password "chef-client -o recipe[product::worker_deploy]"

Hmm... remembering that will be tough.

Add "deploy.py" to the mix, which wraps up the logic for each app into something simple and easy.

./deploy --fleet test --artifact portal --branch master --build 12

And it sorts the rest.
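
For the curious, the heart of a wrapper like that can be sketched as below. This is not the real deploy.py: the function name and arguments are made up, and it assumes the recipe is always named "<artifact>_deploy" as in the two knife commands above. It only builds the command, so you would still hand the result to something like subprocess.

```python
def deploy_command(artifact, platform, host=None, user=None, password=None):
    """Return the knife command (as an argv list) that deploys an artifact.

    Hypothetical helper: assumes one recipe per artifact named
    product::<artifact>_deploy, matching the examples in this post.
    """
    recipe = "recipe[product::{0}_deploy]".format(artifact)
    if platform == "windows":
        # Windows nodes are addressed by hostname over WinRM.
        if not (host and user and password):
            raise ValueError("windows deploys need host, user and password")
        return ["knife", "winrm", "-m", host, "-x", user, "-P", password,
                "chef-client -o " + recipe]
    # Linux nodes are found by role search and driven over SSH.
    return ["knife", "ssh", "role:{0}".format(artifact),
            "sudo chef-client -o " + recipe]
```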

Cool! Thus closes our first pass at Chef and Automation.

Previous: Starting point
Next: Improvements

AWS, Chef and scaling a mixed Windows/Linux environment (Part 3)

Next phase: Improvements


After running for a few weeks, we started to really feel some pain points.


Chef roles take effect immediately on save. 


This makes phased dev/prod/qa rollouts tough. Say we have a new recipe we want to add to the "portal" runlist. It would be good to test that in dev first. 

We started putting feature flags into recipes, so that they would only run if an environment attribute was set. But really, what we wanted was versioning of roles.

A speaker at ChefConf had the answer for us: Roles cookbook. Rather than having your runlists in the role, have a Roles cookbook. Each role has a recipe that includes anything that role should run. 

That lets you pin your environments to a particular version of the Roles cookbook, effectively giving you versioned roles.

Windows is slow. Always. 

Bootstrapping a new Windows instance from scratch takes about 35 minutes... so long that you really just can't adapt to dynamic processing needs.

We knew that ultimately we needed to start building fully configured AMIs to scale off of, but we didn't have the devops resources to do that. So, for now, we settled on launching a ton of instances after a release, then stopping most of them.

When we needed more power we could start the instances, which would have them online in 5 - 10 minutes. 

Each release we'd start up all the stopped instances, deploy code, then stop a bunch again. 

AWS, Chef and scaling a mixed Windows/Linux environment (Part 4)

The AMI Bakery


Here's the part that I really wanted to write about.

We decided it was time to invest in the ability to generate fully configured AMIs for our applications.

The Roles Cookbook


Back to the roles cookbook from a prior post. Each role had a "role_launch" recipe that did all the things necessary to take a base CentOS/Windows AMI and turn it into that thing. We now added a role_ami recipe that was very similar, but with some important differences:

  • nothing would start or be run during creation: db migrations, service starts, etc... would be set to run at boot time, but not now. 
  • the ami_prep recipe would install a startup script that would allow the instance to register itself with the chef-server when it came online. 

AMI Prep


Windows. Linux. userdata. So many different ways to do things, and none of them quite as controlled or specific as we wanted. 

Like so many other wheels, I reinvented it. 

  • Each platform has an init script. For linux it's a standard SysV init style, and for Windows it's a powershell script that runs at boot via task scheduler. 
  • The script only runs on the initial boot. It looks for a file firstboot.txt in the root volume and only runs if it is present. On completion of the run the file is removed. 
  • Both use an sns handler recipe to generate an alert if the run fails. This way we only hear about it when something bad happens.  
  • In both cases the instances use IAM roles and AWS command line tools to fetch a default runlist from an S3 bucket and execute chef-client. 
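
The "only on first boot" guard is simple enough to sketch. The real scripts are a SysV init script and a Powershell script; this Python version just shows the logic. The function and callback names are hypothetical, and leaving the marker file in place when the run fails is my assumption here, not something the real scripts are guaranteed to do.

```python
import os


def run_first_boot(marker_path, action, on_failure):
    """Run action once, guarded by a marker file such as firstboot.txt.

    Returns True if the action ran and succeeded, False otherwise.
    """
    if not os.path.exists(marker_path):
        # Not the initial boot: do nothing.
        return False
    try:
        action()  # e.g. fetch the default runlist and run chef-client
    except Exception as exc:
        # Only make noise when something bad happens (e.g. an SNS alert).
        on_failure(exc)
        return False
    # Run completed: remove the marker so later boots skip this step.
    os.remove(marker_path)
    return True
```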

Adding the --bake flag

Our main tool for instance creation is the appropriately named "create_instance.py" script.

Previously this script would:
  • create/update security groups
  • create/update IAM roles
  • create instance
  • bootstrap chef / run the role_launch recipe
With the new --bake flag it could now:
  • create/update security groups
  • create/update IAM roles
  • create instance
  • bootstrap chef / run the role_ami recipe
  • stop the instance
  • take an ami of the instance
  • wait for the ami to be available
  • terminate the instance
  • clean up the chef node objects
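
The last few steps of that list (stop, image, wait, terminate) look roughly like the sketch below. Big caveat: our tooling was written against Boto, while this sketch assumes a boto3-style EC2 client passed in as `ec2`. The function name is hypothetical, and the Chef node cleanup is left out.

```python
def bake_ami(ec2, instance_id, ami_name):
    """Stop an instance, image it, wait for the AMI, then terminate it.

    ec2 is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the ID of the new AMI.
    """
    # Stop first so the image is taken from a quiet filesystem.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Take the AMI and block until it is usable.
    image_id = ec2.create_image(InstanceId=instance_id, Name=ami_name)["ImageId"]
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])

    # The builder instance has done its job.
    ec2.terminate_instances(InstanceIds=[instance_id])
    return image_id
```

Passing the client in, rather than creating it inside the function, makes it easy to exercise the ordering with a fake client before pointing it at a real account.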

Post deploy task

Immediately post-deploy, a job in Jenkins is triggered to bundle new AMIs for the current release. That way if we need to scale rapidly, we have an image. More on that when I post about scaling.

Previous: Improvements

Thursday, March 20, 2014

Automated Chef Cookbook Testing. How? Why? Help! Part 1 of many!


As part of the DevOps Promised Land, we decided that it was time to move Chef out of Operations and into every developer's hands. 


When I started at my current job, we did pretty much everything manually. I spent most of my first year here building up a reasonable infrastructure management system based on Chef, and getting all 200+ hand crafted EC2 instances rebuilt in an automated fashion.

Now that we've gotten pretty happy with where we are at, it's time to put the Dev in DevOps. No more opening tickets to add cookbook attributes! Self-service equals faster response time and less work for me. Let's do this!!

I use the term "fleet" throughout this post... you can replace that with stack, environment or whatever other term you use for a collection of servers that make up an instance of your application.

Before we get into details, let's talk about the high level goals:

  • All developers can modify cookbooks.
  • All cookbooks must have appropriate tests built in.
  • All cookbooks must have an automated "build" process that runs the tests at every commit. 
  • No cookbook changes are pushed to the Chef server until they pass all of their tests.
  • Cookbook changes are applied to all fleets (including prod!) as soon as they pass their tests.

Here's where we were at the starting point of this project. 


Our testing system was there, but it was manually run.

We were using the minitest-handler cookbook, and had reasonable test coverage. I won't lie to you and say that it was 100%, but it was close.

Every cookbook had a Vagrantfile, and used Berkshelf to maintain dependencies (via the vagrant-berkshelf plugin)

You made a change, and it was on you to vagrant up and make sure the tests passed. Once it was all good, you would commit the code.

Jenkins would get notified of the commit. It would check out the code, and do some REALLY basic tests:

  • knife cookbook test
  • foodcritic
  • knife testcoverage
That last one is a crappy little knife plugin I wrote that would look at each recipe and ensure that there was an _test.rb recipe associated with it. That doesn't mean there were any useful tests, just that you had at least taken the time to create the test file. 
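
The actual plugin is a knife plugin written in Ruby, but the check itself is tiny. Here is the same idea as a Python sketch. The function name is made up, and where the test files live is passed in as a parameter because that layout is an assumption on my part.

```python
import os


def untested_recipes(recipes_dir, tests_dir):
    """Return the names of recipes with no matching <recipe>_test.rb file.

    Only checks that the test file exists, not that it tests anything useful.
    """
    missing = []
    for filename in sorted(os.listdir(recipes_dir)):
        if not filename.endswith(".rb"):
            continue
        recipe = filename[:-len(".rb")]
        test_file = os.path.join(tests_dir, recipe + "_test.rb")
        if not os.path.isfile(test_file):
            missing.append(recipe)
    return missing
```

A non-empty result would fail the Jenkins job.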

If everything passed, it would do a knife cookbook upload.

When I was the only developer writing code, this was a workable system.  I knew what I changed, when I changed it, and what the impact would be. I could troubleshoot problems very quickly and push fixes within minutes of bug detection.

As the team grows, this falls apart:
  • The prod and qa environments are pinned to versions, but dev is running head. Broken code breaks all 5 dev fleets!
  • Cookbook versions aren't necessarily frozen. Who knows what "1.5.4" actually means? It's totally possible for prod to be "pinned" to a cookbook rev, then for someone to change that version!
  • What if I just upload directly from my workstation without running tests?

We started off by defining some rules that would help us get there:

  • Every cookbook change bumps the version number.
  • Every cookbook upload is frozen.
  • Every fleet has cookbook versions pinned (except for the chef test fleet).
  • Every cookbook should have a build script that encapsulates the logic for building and testing it. That way our Jenkins jobs are simply check out code / run build.sh

The knife-spork plugin by Etsy is a great way to manage all of the versioning parts of that list. I'll get into the details of how we use all of that later on.

Building a cookbook

Let's walk through a simple cookbook. We have one that we use internally for basic OS stuff: selinux, iptables, package management, etc...

It's a good one for the example because the tests are for the most part super easy:

package "wget" do
  action :install
end
Gets a test of:
it "Installs wget." do
  package("wget").must_be_installed
end

test-kitchen

At first we were using vagrant-berkshelf to tie things together. The 1.5.1 release of Vagrant broke the plugin, and a bit of reading made me feel that test-kitchen was a better choice anyway. 

So first up was adding test-kitchen to the cookbook (and removing Vagrant)

rm Vagrantfile 
kitchen init
And there you go! Now we need to customize the .kitchen.yml

  • We use centos-6.4, so I took the ubuntu platforms entry out. 
  • Add the runlist
  • Attributes.... uh-oh!!!
.kitchen.yml allows you to specify a list of Chef attributes to pass. But I don't want to list them in each cookbook! I have an attributes.json file on the Jenkins box that defines all of the credentials that would normally be in an encrypted data bag. Vagrant lets me simply

dna = JSON.parse(File.read("/var/chef/attributes.json"))
chef.json.merge!(dna)
I looked at the test-kitchen docs, but didn't see anything specifically about including a file. I did catch though that the .kitchen.yml file is ERB parsed before being used. Perfect!

I converted attributes.json to yml using one of many tools out there, then added
    attributes:
<%= File.read("/var/chef/attributes.yaml") %>
To my yaml file and it promptly blew up! Since the attributes file is just getting inserted exactly at that point, the indentation has to match up. Shifting the data in the yaml file over by a few levels of indentation (spaces, since YAML does not allow tabs for indentation) did the trick.
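
If you'd rather not re-indent by hand, a couple of lines of Python will do it. This is just a sketch: the function name is made up, and how many spaces you need depends on where the `attributes:` key sits in your own .kitchen.yml.

```python
import textwrap


def indent_fragment(yaml_text, spaces):
    """Indent every non-blank line of a YAML fragment by a number of spaces."""
    return textwrap.indent(yaml_text, " " * spaces)
```

For example, `indent_fragment(open("/var/chef/attributes.yaml").read(), 6)` would give you text that nests under an `attributes:` key indented by four spaces.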

Kitchen test now works! It boots the VM and processes the runlist, including minitest-handler. We will eventually start using test-kitchen's built in tests but for now we're OK with what we've got.

build script

As I mentioned earlier, we want to have a standardized build script that would be used by all cookbooks. The details would change from cookbook to cookbook, but the overall flow would be the same:

  • Run whatever tests that cookbook gets
  • If the tests pass, upload the cookbook and exit zero
  • If the tests fail, exit non-zero.
Our standard set of tests includes:
  • knife cookbook test (basic syntax)
  • knife spork check --fail (make sure the version number has been bumped)
  • kitchen test
  • kitchen destroy
  • foodcritic
  • knife spork upload
To make sure that whoever is running this has a fair chance of success, we include a Gemfile in the cookbook with all of those tools in it. The build script starts off with a bundle install, so as dependencies change, the systems should stay up to date.

The shell script looks at the exit code of each command as it runs, and dies if any of them are non zero.
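
Our build script is plain shell, but the "stop at the first failure" flow is easy to show as a Python sketch. The function names are made up, and the exact arguments to each tool (cookbook name, the "." for foodcritic) are my assumptions, so adjust them to your own setup.

```python
import subprocess


def build_steps(cookbook):
    """Return the build commands in the order they should run."""
    return [
        ["bundle", "install"],
        ["knife", "cookbook", "test", cookbook],
        ["knife", "spork", "check", cookbook, "--fail"],
        ["kitchen", "test"],
        ["kitchen", "destroy"],
        ["foodcritic", "."],
        ["knife", "spork", "upload", cookbook],
    ]


def run_steps(steps, runner=subprocess.call):
    """Run each step in order; return the first non-zero exit code, else 0."""
    for step in steps:
        code = runner(step)
        if code != 0:
            # Die here so a failed test never reaches the upload step.
            return code
    return 0
```

Because the upload is the last step, a cookbook only reaches the Chef server if everything before it exited zero.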

git-hooks

Every cookbook has a git-hooks directory with recommended scripts to either use or integrate into your existing tools.

The main one that all cookbooks have is a simple pre-commit hook to run knife spork check --fail. This makes sure that you don't try to upload a frozen cookbook version. 

promoting the cookbooks to the fleets

Since we aren't yet running full integration tests, we don't have a really good way to prove that the newly published cookbook plays well with its other friends in the Chef playground. 

For the moment, we are automatically promoting the fleets once a cookbook passes its tests and gets uploaded.

We have a Jenkins job that looks up the latest frozen version of each cookbook and promotes the fleets to that version. Since you don't get a frozen upload without passing your tests, that should be safe enough.

That also allows us to have in flight cookbooks on the chef server without fear -- the fleets will only ever auto promote to the latest frozen revision. 
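
The selection rule in that promote job boils down to "highest version that is frozen". Here is a small sketch of it. The function name and the (version, frozen) input shape are hypothetical, and it assumes plain numeric x.y.z version strings.

```python
def latest_frozen(versions):
    """Return the highest frozen version string, or None if none are frozen.

    versions is a list of (version_string, is_frozen) pairs for one cookbook.
    """
    frozen = [version for version, is_frozen in versions if is_frozen]
    if not frozen:
        return None
    # Compare numerically so that 1.10.0 sorts above 1.9.0.
    return max(frozen, key=lambda v: tuple(int(part) for part in v.split(".")))
```

An unfrozen 1.11.0 that someone is still working on is ignored, so the fleets would stay on the newest frozen release.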

integration testing

Great! Our cookbook passed its test and has been uploaded. All of the fleets are pinned to the older version though, so no-one is seeing this change.

Now what?

Enter the test fleet. 

We have 1 development environment that does not get anything pinned. It is used only by ops, just for testing new Chef code.

Go forth and deploy / run rubot / do whatever you need to do to feel happy on the test fleet. 



Right now we are doing this by hand, but that will be next on the automation to-do list.

Getting Chef + RVM playing nicely together

I know that all of maybe 3 people read this blog. I typically forget about it myself. But, I wanted to jot this down someplace, and here seemed as good as any.

We've recently started to move into the world of Rails. We use Chef to manage our infrastructure. There are some quirks to getting them all lined up right, and finding a straight up tutorial is surprisingly tricky. So, here goes:

  • Install Chef via Omnibus as recommended by them.
  • Get the RVM cookbook, maintained by fnichol.
  • Pick a non-root user to run your rails app(s) as. 
  • Install RVM/Ruby as that user, via the user recipe in the cookbook.
Any place in your other cookbooks that you are going to call ruby scripts (bundle install, rake db:migrate, etc...):
  • include rvm::default
  • wrap the commands with the rvm_shell LWRP from the rvm cookbook. This ensures that the user's rvm environment is inherited properly. 
Here's an example of the recipe I'm using to install RVM:

# One entry per user that should get their own RVM install
node.normal['rvm']['user_installs'] = [
    {
        'user'  => node['fleet']['systemuser'],
        'default_ruby' => node['rvm']['default_ruby'],
        'global_gems' => [
            { 'name'    => 'bundler' },
            { 'name'    => 'rake' }
        ]
    }
]

include_recipe "rvm::user"

I'm using normal for the attributes to ensure that the defaults in the rvm cookbook are overridden. In my case node['fleet']['systemuser'] and node['rvm']['default_ruby'] are defined elsewhere in the run list.


Then some code (bundle install)

include_recipe 'rvm::default'

rvm_shell "Running Bundle install" do
  user node['fleet']['systemuser']
  group node['fleet']['systemgroup']
  ruby_string node['rvm']['default_ruby']
  cwd node['app']['dir']['deploy']
  environment 'RAILS_ENV' => node['app']['rails_env']
  code %{bundle install}
end

 Again, some attributes are defined elsewhere, but you should be able to get the idea.

This may not be perfect, but it works for me!

Other gotchas:

If you try to install RVM system-wide, chef-client gets sad. You end up changing the environment such that the omnibus ruby is no longer what env ruby finds, so you end up needing to install chef gems into your RVM ruby.

Ugly!

Chef-client always runs as root, and your app should never run as root. So, there shouldn't really be any conflicts here.