I'm in Las Vegas for the Amazon Web Services re:Invent conference. This is the stuff of corporate dreams: a private suite in the Venetian, late starts and early endings each day, and a sponsored pub crawl featuring free drinks at all of the resort bars.
There's just one problem: I don't drink. Or gamble.
So what to do?
I've been getting up early to talk to my daughter before school. I'm going to bed around 9 and getting up at around 4.
I can only imagine how maddening this is to my friends.
Tonight is the big party hosted by Intel. I'll go for a while, but I imagine that it will largely be about the copious free booze, and that I'll get bored.
Rock of Ages is in my hotel. I've seen it in Boston and on Broadway. Maybe I'll go again ;)
Home in 2 days.
Thursday, November 14, 2013
Friday, July 5, 2013
A bit about metrics gathering and Sensu
Sensu has a well documented method for running checks that push metrics to Graphite.
In some cases I opted to do things a bit differently.
One of my checks hits a message queue and gets me the number of items in the queue and the age of the oldest one.
I want to both alert on that (if something has been in the queue for more than X minutes) and graph it (how did we do this week on processing everything in under Y minutes), without making the same query twice.
In those cases my checks bypass the Sensu Graphite handler and just send the metrics directly to Graphite.
I like that the system is flexible enough that this was easy to do.
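Here's a rough Python sketch of that pattern. It is not the actual check: the queue query is a hypothetical stub, and the Graphite host, metric prefix and 15 minute threshold are all assumptions. The parts that are real are Graphite's plaintext format (path, value, timestamp, one metric per line, carbon's default port 2003) and the Nagios-style exit codes that Sensu reads.

```python
import socket
import sys
import time

GRAPHITE_HOST = "graphite.example.com"  # assumption: your carbon host
GRAPHITE_PORT = 2003                    # carbon's default plaintext port
MAX_AGE_SECONDS = 15 * 60               # assumption: the "X minutes" threshold


def get_queue_stats():
    """Hypothetical stand-in for the real message queue query.

    Returns (items_in_queue, age_of_oldest_item_in_seconds).
    """
    return 12, 340


def graphite_lines(prefix, depth, oldest_age, now):
    """Format both values as Graphite plaintext: 'path value timestamp'."""
    return (
        f"{prefix}.depth {depth} {now}\n"
        f"{prefix}.oldest_age_seconds {oldest_age} {now}\n"
    )


def send_to_graphite(payload, host=GRAPHITE_HOST, port=GRAPHITE_PORT):
    """Send the payload straight to carbon over TCP, skipping the Sensu handler."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("ascii"))


def check_status(oldest_age, max_age=MAX_AGE_SECONDS):
    """Return a Nagios-style (exit_code, message) pair: 0 is OK, 2 is CRITICAL."""
    if oldest_age > max_age:
        return 2, f"CRITICAL: oldest item is {oldest_age}s old"
    return 0, f"OK: oldest item is {oldest_age}s old"


def main():
    depth, oldest_age = get_queue_stats()  # the only query we make
    now = int(time.time())
    try:
        # "queues.orders" is a made-up metric prefix.
        send_to_graphite(graphite_lines("queues.orders", depth, oldest_age, now))
    except OSError as err:
        # A Graphite outage should not hide a real queue problem.
        print(f"warning: could not reach Graphite: {err}", file=sys.stderr)
    code, message = check_status(oldest_age)
    print(message)
    return code
```

A real check script would finish with sys.exit(main()) so that Sensu sees the exit code.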
Oh Windows...
One thing that makes me sad regularly is that we have Windows instances. Lots of them. And they are important.
Sensu does support Windows, but the documentation is sparse. You pretty much have to either use their Chef cookbook or read it to get the steps required to install the client.
Similarly, go visit the Sensu Community Plugins project and see how many of them cat /proc/somethingorother to get data. No Windows for you!
I ended up writing a couple of Ruby scripts that would get lists of processes, system load and the like on both Linux and Windows platforms. I need to clean them up a bit then submit them back to the project.
It wasn't hard to do, but it is a thing to be aware of if you go down this road.
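To give a feel for what "works on both" means, here is a minimal Python sketch of a cross-platform process listing. The scripts mentioned above were Ruby and are not reproduced here; this only illustrates the idea. It assumes tasklist is available on Windows and a ps that accepts -e -o comm= everywhere else.

```python
import csv
import platform
import subprocess


def process_list_command(system=None):
    """Pick a process-listing command for the current (or given) platform."""
    system = system or platform.system()
    if system == "Windows":
        # tasklist ships with Windows; ask for CSV output with no header row.
        return ["tasklist", "/FO", "CSV", "/NH"]
    # On Linux and friends, print only the command name, with no header.
    return ["ps", "-e", "-o", "comm="]


def parse_process_names(output, system=None):
    """Extract process names from the output of the command above."""
    system = system or platform.system()
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if system == "Windows":
        # The first CSV field is the image name, e.g. "svchost.exe".
        return [row[0] for row in csv.reader(lines) if row]
    return lines


def running_processes():
    """Return the names of running processes on Linux or Windows."""
    result = subprocess.run(
        process_list_command(), capture_output=True, text=True, check=True
    )
    return parse_process_names(result.stdout)


def check_process(name, processes):
    """Nagios-style result: 0 if the process is present, 2 if it is missing."""
    if name in processes:
        return 0, f"OK: {name} is running"
    return 2, f"CRITICAL: {name} is not running"
```

Keeping the parsing separate from the command makes it easy to test the Windows branch without having a Windows box handy.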
Collectd, Graphite, Tasseo, Etsy Dashboard, Vacuumetrix
My current visualization suite is made up of 5 tools:
Graphite
The open source graphing darling serves us well. Feed it metrics, make API calls to turn them into graphs. Do one thing and do it well!
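As a small illustration of the "make API calls" part: Graphite's render endpoint takes a target and a time range as query parameters and hands back an image or JSON. The sketch below just builds such a URL. The hostname and metric path are made up.

```python
from urllib.parse import urlencode


def render_url(base_url, target, since="-24h", fmt="png", width=800, height=300):
    """Build a Graphite /render URL for one target over a relative time range."""
    params = [
        ("target", target),
        ("from", since),
        ("format", fmt),
        ("width", width),
        ("height", height),
    ]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


# Hypothetical host and metric name, for illustration only.
load_graph = render_url(
    "http://graphite.example.com",
    "collectd.prod-loghost.load.load.shortterm",
    since="-1h",
)
```

Swap fmt="png" for fmt="json" to get the raw datapoints instead of a picture.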
Collectd
Collectd runs on every instance, gathering system stats and reporting them to Graphite. We went with Collectd because right out of the box it did about 90% of what we wanted, and the remaining 10% was easily addressable via some simple Python-based plugins.
This is how disk capacity, free memory, load and other basic health checks get sent to Graphite.
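For a sense of how small those Python plugins can be, here is a minimal sketch of one. It is not one of the plugins described above: the spool directory and the plugin name are invented for the example. The collectd module only exists inside collectd's Python plugin, so the import is guarded to let the counting logic run on its own.

```python
import os

try:
    import collectd  # only importable when loaded by collectd's python plugin
except ImportError:
    collectd = None

SPOOL_DIR = "/var/spool/myapp"  # hypothetical directory to watch


def count_spool_files(path=SPOOL_DIR):
    """Count entries in the spool directory, or 0 if it does not exist."""
    try:
        return len(os.listdir(path))
    except FileNotFoundError:
        return 0


def read_callback():
    """Called by collectd on each interval; dispatches one gauge value."""
    value = collectd.Values(plugin="spool", type="gauge", type_instance="files")
    value.dispatch(values=[count_spool_files()])


if collectd is not None:
    # Ask collectd to call read_callback on every read interval.
    collectd.register_read(read_callback)
```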
Vacuumetrix
Vacuumetrix pulls stats out of AWS and feeds them into Graphite. Good for looking at Elastic Load Balancers, RDS instances and other things where we can't install our own tools.
Etsy Dashboard
We used the Etsy Dashboard project to make some generic "Here's how this instance looks" pages. Go to the dashboard for prod-loghost and you'll see disk I/O, free mem, load, network in/out and all the other things that people like to look at when assessing a particular system.
Tasseo
Tasseo is the current NOC monitor / high level overview of the system. On one screen you can see a reasonable view of everything, which is extremely convenient.
Enter Sensu and friends
We knew what we didn't like about Nagios, but what did we actually want?
After some discussion it came down to this:
- We want to know when a critical system/service is unavailable.
- We want graphs to easily analyze trends.
- We don't want to have to restart anything when instances are added or removed.
Sensu is a popular modern monitoring router (their term!). It doesn't try to be a Swiss Army knife... it takes input from scripts, and performs actions based on that input.
For a long-time Nagios user, Sensu is a bit disconcerting. There's no enormous "all clear" dashboard with hundreds of green dots. You know when something is wrong, but have no real way to assess that everything is right.
We chewed on that for a while and eventually decided we were OK with it.
One thing I really liked about Sensu was the way clients were added and removed. I won't re-write their documentation, but it comes down to this:
- When sensu-client starts, the node self-registers.
- A simple API call takes a node out of the clients list.
Want to stop a node? Just make sure that your stop script makes the right API call. When it starts back up it will put itself back in.
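A stop script's API call might look something like the sketch below. It assumes the Sensu API of this era, where a client is removed with a DELETE on /clients/<name> and the API listens on port 4567 by default. The hostname is a placeholder, and any authentication your API needs is left out.

```python
import urllib.request
from urllib.parse import quote

SENSU_API = "http://sensu.example.com:4567"  # assumption: your Sensu API endpoint


def deregister_request(client_name, api_url=SENSU_API):
    """Build the DELETE /clients/<name> request that removes a node."""
    url = f"{api_url.rstrip('/')}/clients/{quote(client_name, safe='')}"
    return urllib.request.Request(url, method="DELETE")


def deregister(client_name, api_url=SENSU_API):
    """Send the request and return the HTTP status code from the API."""
    request = deregister_request(client_name, api_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Call deregister() with the node's client name from whatever runs at shutdown. When sensu-client starts again, the node registers itself and nothing else needs to happen.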
The other thing I liked is that it uses Nagios style checks (based on exit code of the plugin), so we were able to drop it in place pretty easily and just call the plugins that were already installed.
So that gets monitoring out of the way for now... on to graphing!
Nagios + Chef = tolerable
We use Chef for system configuration management, which makes Nagios sort of manageable...
Our in-house Nagios cookbook has a recipe called monitor-clients. This recipe does a bunch of things:
- Deletes all of the existing config files in the /etc/nagios/dynamic directory. Otherwise how do we know when a node has gone away? It's not like Chef tells you that it used to be there but is now gone. This is the first bad part... Chef run goes badly, Nagios is hosed!
- Looks up the nodes on the Chef server and generates configuration files for each of them based on role and attributes defined in each server's environment (more frequent checks for prod, no evening alerting for dev, stuff like that).
- At the end of the Chef run, the Nagios server restarts, loading all of these new configs.
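The shape of that recipe can be sketched in a few lines of Python. This is an illustration only, not the real cookbook (which is Chef, so Ruby). The node dictionaries stand in for a Chef search result, and the generic-host template and the 24x7 and workhours time periods are assumed to exist in the Nagios config.

```python
from pathlib import Path


def host_definition(node):
    """Render one Nagios host definition from a node's attributes."""
    is_prod = node["environment"] == "prod"
    lines = [
        "define host {",
        "    use                 generic-host",
        f"    host_name           {node['name']}",
        f"    address             {node['ip']}",
        f"    hostgroups          {node['role']}",
        # Prod gets checked more often than everything else.
        f"    check_interval      {1 if is_prod else 5}",
        # Only prod is allowed to page in the evening.
        f"    notification_period {'24x7' if is_prod else 'workhours'}",
        "}",
    ]
    return "\n".join(lines) + "\n"


def regenerate(nodes, directory):
    """Wipe the dynamic config directory and rewrite one file per node."""
    directory = Path(directory)
    for old in directory.glob("*.cfg"):
        old.unlink()  # if the run dies after this, Nagios has no hosts
    for node in nodes:
        (directory / f"{node['name']}.cfg").write_text(host_definition(node))
```

The gap between the delete loop and the write loop is exactly the window where a bad run leaves Nagios with nothing to load.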
So why does this suck?
Any time a node is created or destroyed, this recipe has to be re-run to pick up the changes. To ensure that, we've added hooks into the tools we use to provision instances so that this is run automatically. Just hope no-one uses the AWS console to create/destroy something...
One of the cool things with AWS is that you can stop instances and not get charged for them until you turn them back on (other than the EBS storage fees). You can save thousands a month on your AWS bill by doing this. Unfortunately that means the nodes still exist in Chef, still get detected by this recipe, and end up getting flagged as down by Nagios.
While you are waiting for this script to run, your new nodes aren't monitored and your terminated ones are trying hard to make it to critical and page you one last time before they are removed. The bigger your environment, the slower this gets. We have anywhere from 100 to 200 nodes at any given time. On the high end of that, the config generation phase of the Chef run can take 3-4 minutes. Imagine 5 times the hosts. 10 times. Somewhere this would fall over.
Friday, June 14, 2013
I'm sorry Nagios, but it's time to say goodbye.
Often when one starts the awkward breakup proceedings, the conversation goes something like this:
It's not you, it's me. I've changed, and we're just not compatible anymore.
Truer words were never spoken.
I've been using Nagios for a long time... all the way back to when it was called Netsaint. In those days, setting up a new server was a Big Deal. You would have to do all sorts of things, not the least of which being the process of physically unboxing/setting up the machine.
In those days, Nagios worked well. You probably had a bunch of hostgroups defined and some nice templates... Create a host definition for web17c, add it to the webservers hostgroup, restart Nagios and you were good to go! At athenahealth I was monitoring 200-odd boxes this way and had very little to complain about.
It's a different world today. At my current job we are all cloud based (AWS, more on that in another post!). We scale our resources up and down throughout the day, on the order of +/- 50 nodes over the course of an 8-hour workday.
This puts us in a position where we need to monitor hosts we don't know about, running services we may not be aware of.
I've tried a lot of things to get this working well. My next couple of posts are going to be about the ways we tried to make it work, and what we ended up doing instead.
Here we are... Yet Another New Blog.
Like most tech-folks, I have an account on just about any social networking site out there. Twitter for quick brain dumps, Facebook to see what my friends are having for dinner. Tumblr for pictures of cats, etc...
For many many years I've used LiveJournal for more substantial posts. Lately though LJ has felt like a somewhat dead community. It doesn't integrate well with other systems, and not many people I know are there anymore.
Enter Blogger.
What will I be writing here? Most posts will fall into one of these buckets:
- Technology of the Linux/DevOps/Sysadmin sort.
- Sci-Fi props, costumes and collectibles.
- The 501st.
- My kid.
If any of those things appeal to you, this might not be a terrible experience. If I lost you there just move along (see, Star Wars reference!)