Friday, July 5, 2013

A bit about metrics gathering and Sensu

Sensu has a well-documented method for running checks that push metrics to Graphite.

In some cases I opted to do things a bit differently.

One of my checks hits a message queue and gets me the number of items in the queue and the age of the oldest one.

I want to both alert on that (if something has been in the queue for more than X minutes) and graph it (how did we do this week on processing everything in under Y minutes), without making the same query twice.

In those cases my checks bypass the Sensu graphite handler and just send the metrics directly to Graphite.
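A minimal sketch of that pattern in Python, not the actual check: query once, format the numbers in Graphite's plaintext protocol ("path value timestamp", sent to the carbon listener, port 2003 by default), and derive a Nagios-style exit code from the same numbers. The metric prefix, the thresholds and the host name are hypothetical.

```python
import socket

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios-style exit codes


def graphite_lines(prefix, metrics, timestamp):
    """Format metrics as Graphite plaintext lines: 'path value timestamp'."""
    return "".join(
        f"{prefix}.{name} {value} {int(timestamp)}\n"
        for name, value in sorted(metrics.items())
    )


def send_to_graphite(payload, host, port=2003, timeout=5):
    """Send already-formatted lines to carbon's plaintext listener."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload.encode("ascii"))


def check_status(oldest_age_seconds, warn_seconds, crit_seconds):
    """Turn the age of the oldest queue item into an exit code."""
    if oldest_age_seconds >= crit_seconds:
        return CRITICAL
    if oldest_age_seconds >= warn_seconds:
        return WARNING
    return OK
```

A real check would query the queue once, call send_to_graphite(payload, "graphite.example.com") with the result, and then exit with check_status(...) so Sensu can alert on it.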

I like that the system is flexible enough that this was easy to do.

Oh Windows...

One thing that makes me sad regularly is that we have Windows instances. Lots of them. And they are important.

Sensu does support Windows, but the documentation is sparse. You pretty much have to either use their Chef cookbook or read it to get the steps required to install the client.

Similarly, go visit the Sensu Community Plugins project and see how many of them cat /proc/somethingorother to get data. No Windows for you!

I ended up writing a couple of Ruby scripts that would get lists of processes, system load and the like on both Linux and Windows platforms. I need to clean them up a bit then submit them back to the project.
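To give a flavour of what "works on both platforms" means, here is a rough Python sketch (the scripts described above are Ruby; this is not them). It assumes `tasklist` is available on Windows and a POSIX-style `ps` everywhere else, and all function names are made up for illustration.

```python
import csv
import platform
import subprocess


def process_list_command(system):
    """Pick the command that lists processes on this platform."""
    if system == "Windows":
        # CSV output, no header row
        return ["tasklist", "/FO", "CSV", "/NH"]
    # Command name only, no header row
    return ["ps", "-eo", "comm="]


def parse_process_names(output, system):
    """Extract process names from the command's output."""
    if system == "Windows":
        # The first CSV column is the image name
        return [row[0] for row in csv.reader(output.splitlines()) if row]
    return [line.strip() for line in output.splitlines() if line.strip()]


def running_processes():
    """Return the names of running processes on the current platform."""
    system = platform.system()
    result = subprocess.run(
        process_list_command(system), capture_output=True, text=True, check=True
    )
    return parse_process_names(result.stdout, system)
```

Keeping the parsing separate from the command call means each platform's output format can be tested without having that platform to hand.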

It wasn't hard to do, but it is a thing to be aware of if you go down this road.

Collectd, Graphite, Tasseo, Etsy Dashboard, Vacuumetrix

My current visualization suite is made up of 5 tools:

Graphite

The open source graphing darling serves us well. Feed it metrics, make API calls to turn them into graphs. Do one thing and do it well!
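As an illustration of those API calls, this Python sketch builds a URL for Graphite's render endpoint. The host and metric name in the usage note are hypothetical; `target`, `from`, `until` and `format` are standard render parameters.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, from_="-7d", until="now", fmt="json"):
    """Build a Graphite /render URL for one or more metric targets."""
    params = [("target", target) for target in targets]
    params += [("from", from_), ("until", until), ("format", fmt)]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"
```

For example, render_url("http://graphite.example.com", ["queues.orders.oldest_age"]) asks for the last week of that metric as JSON; switching fmt to "png" returns an image instead.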

Collectd

Collectd runs on every instance, gathering system stats and reporting them to Graphite. We went with Collectd because right out of the box it did about 90% of what we wanted, and the remaining 10% was easily addressable via some simple Python-based plugins.

This is how disk capacity, free memory, load and other basic health checks get sent to Graphite.
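For a sense of how small such a plugin can be, here is a hedged sketch of a collectd Python plugin. The metric (number of entries in a spool directory) and the path are invented for illustration, since the post doesn't say what the real plugins measured. The `collectd` module only exists inside collectd's Python plugin, so the import is guarded to let the file run standalone.

```python
import os

try:
    import collectd  # only importable when loaded by collectd's python plugin
except ImportError:
    collectd = None

SPOOL_DIR = "/var/spool/myapp"  # hypothetical path


def count_entries(path):
    """Number of entries in a directory (the value we report)."""
    return len(os.listdir(path))


def read_callback():
    """Called by collectd on every read interval."""
    values = collectd.Values(plugin="spool_depth", type="gauge")
    values.dispatch(values=[count_entries(SPOOL_DIR)])


if collectd is not None:
    collectd.register_read(read_callback)
```

collectd would load this through a `python` plugin block in its config. The module path and name in that block depend on where the file is installed.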

Vacuumetrix

Vacuumetrix pulls stats out of AWS and feeds them into Graphite. Good for looking at Elastic Load Balancers, RDS instances and other things where we can't install our own tools.


Etsy Dashboard

We used the Etsy Dashboard project to make some generic "Here's how this instance looks" pages. Go to the dashboard for prod-loghost and you'll see disk I/O, free mem, load, network in/out and all the other things that people like to look at when assessing a particular system.

Tasseo

Tasseo is the current NOC monitor / high level overview of the system. On one screen you can see a reasonable view of everything, which is extremely convenient.



Enter Sensu and friends

We knew what we didn't like about Nagios, but what did we actually want?

After some discussion it came down to this:


  • We want to know when a critical system/service is unavailable.
  • We want graphs to easily analyze trends.
  • We don't want to have to restart anything when instances are added or removed.

Sensu is a popular modern monitoring router (their term!). It doesn't try to be a Swiss Army knife... it takes input from scripts and performs actions based on that input.

For a longtime Nagios user, Sensu is a bit disconcerting. There's no enormous "all clear" dashboard with hundreds of green dots. You know when something is wrong, but have no real way to assess that everything is right.

We chewed on that for a while and eventually decided we were OK with it. 

One thing I really liked about Sensu was the way clients were added and removed. I won't re-write their documentation, but it comes down to this:
  • When sensu-client starts, the node self-registers.
  • A simple API call takes a node out of the clients list.

Want to stop a node? Just make sure that your stop script makes the right API call. When it starts back up it will put itself back in.
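As a sketch of what a stop script's API call could look like in Python: to the best of my knowledge the Sensu API of this era removes a client with DELETE /clients/<name> and listens on port 4567 by default, but treat both as assumptions to check against the Sensu documentation for your version. The host name in the usage note is hypothetical.

```python
import urllib.request
from urllib.parse import quote


def build_delete_client_request(api_url, client_name):
    """Build (but don't send) the request that removes a client."""
    url = f"{api_url.rstrip('/')}/clients/{quote(client_name, safe='')}"
    return urllib.request.Request(url, method="DELETE")


def deregister(api_url, client_name, timeout=5):
    """Send the DELETE and return the HTTP status code."""
    request = build_delete_client_request(api_url, client_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A stop script would call something like deregister("http://sensu.example.com:4567", "prod-loghost") just before the instance shuts down.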

The other thing I liked is that it uses Nagios-style checks (the status comes from the plugin's exit code), so we were able to drop it in place pretty easily and just call the plugins that were already installed.
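The exit code convention is simple: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The Python sketch below shows roughly what any runner of such plugins has to do. It illustrates the convention and is not Sensu's own code.

```python
import subprocess
import sys

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command, timeout=30):
    """Run a plugin; return (status name, first line of its output)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Plugin missing or hung: report it rather than crash
        return "UNKNOWN", str(exc)
    status = STATUS_NAMES.get(result.returncode, "UNKNOWN")
    lines = result.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    return status, first_line


if __name__ == "__main__":
    # Usage: python run_check.py /path/to/plugin [plugin args...]
    print(run_check(sys.argv[1:]))
```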

So that gets monitoring out of the way for now... on to graphing!

Nagios + Chef = tolerable

We use Chef for system configuration management, which makes Nagios sort of manageable...

Our in-house nagios cookbook has a recipe called monitor-clients. This recipe does a bunch of things:
  • Deletes all of the existing config files in the /etc/nagios/dynamic directory. Otherwise how do we know when a node has gone away? It's not like Chef tells you that it used to be there but is now gone. This is the first bad part... Chef run goes badly, Nagios is hosed!
  • Looks up the nodes on the Chef server and generates configuration files for each of them based on role and attributes defined in each server's environment (more frequent checks for prod, no evening alerting for dev, stuff like that).
  • At the end of the Chef run, the Nagios server restarts, loading all of these new configs.
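The wipe-and-regenerate cycle above boils down to something like the following Python sketch. It is an illustration of the flow, not the Chef recipe itself. The node fields, check intervals and file naming are hypothetical, and a real Nagios host definition needs more directives (or a template) than shown here.

```python
import os


def render_host(node):
    """Render a minimal Nagios host stanza for one node."""
    interval = 1 if node["environment"] == "prod" else 5
    return (
        "define host {\n"
        f"    host_name       {node['name']}\n"
        f"    address         {node['address']}\n"
        f"    check_interval  {interval}\n"
        "}\n"
    )


def regenerate_configs(dynamic_dir, nodes, render=render_host):
    """Delete every generated config, then write one per known node."""
    # Step 1: wipe. If anything fails after this point, Nagios is
    # left with no host configs at all.
    for name in os.listdir(dynamic_dir):
        if name.endswith(".cfg"):
            os.remove(os.path.join(dynamic_dir, name))
    # Step 2: regenerate from the current node list.
    written = []
    for node in nodes:
        path = os.path.join(dynamic_dir, f"{node['name']}.cfg")
        with open(path, "w") as handle:
            handle.write(render(node))
        written.append(path)
    # Step 3 (not shown): restart Nagios so it loads the new files.
    return written
```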

So why does this suck?

Any time a node is created or destroyed, this recipe has to be re-run to pick up the changes. To ensure that, we've added hooks into the tools we use to provision instances so that this is run automatically. Just hope no one uses the AWS console to create/destroy something...

One of the cool things with AWS is that you can stop instances and not get charged for them until you turn them back on (other than the EBS storage fees). You can save thousands a month on your AWS bill by doing this. Unfortunately that means the nodes still exist in Chef, still get detected by this recipe, and end up getting flagged as down by Nagios. 

While you are waiting for this script to run, your new nodes aren't monitored and your terminated ones are trying hard to make it to critical and page you one last time before they are removed. The bigger your environment, the slower this gets. We have anywhere from 100 to 200 nodes at any given time. On the high end of that, the config generation phase of the Chef run can take 3-4 minutes. Imagine 5 times the hosts. 10 times. Somewhere this would fall over.