Friday, July 5, 2013

Nagios + Chef = tolerable

We use Chef for system configuration management, which makes Nagios sort of manageable...

Our in-house Nagios cookbook has a recipe called monitor-clients. This recipe does a bunch of things (a rough sketch follows the list):
  • Deletes all of the existing config files in the /etc/nagios/dynamic directory. Otherwise, how would we know when a node has gone away? It's not like Chef tells you that a node used to be there but is now gone. This is the first bad part... if the Chef run goes badly, Nagios is hosed!
  • Looks up the nodes on the Chef server and generates a configuration file for each of them, based on its role and on attributes defined in its environment (more frequent checks for prod, no evening alerting for dev, stuff like that)
  • At the end of the Chef run, restarts the Nagios service, which loads all of these new configs.
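
For anyone who hasn't seen one of these, here's a minimal sketch of the shape of the recipe. The paths, template name, and search query are illustrative stand-ins, not our actual cookbook:

    # Sketch of a monitor-clients style recipe (names and paths are illustrative).

    # 1. Delete every previously generated config so departed nodes disappear.
    #    (This is the scary part: fail after this step and Nagios has nothing.)
    Dir.glob('/etc/nagios/dynamic/*.cfg').each do |stale|
      file stale do
        action :delete
      end
    end

    # 2. Search the Chef server and write a host config per node. The template
    #    can vary check intervals and notification windows by environment.
    search(:node, '*:*').each do |monitored|
      template "/etc/nagios/dynamic/#{monitored.name}.cfg" do
        source 'host.cfg.erb'
        variables(
          host_name:   monitored.name,
          environment: monitored.chef_environment
        )
        notifies :restart, 'service[nagios]', :delayed
      end
    end

    # 3. Restart Nagios once, at the end of the run, to load the new configs.
    service 'nagios' do
      action :nothing
    end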

So why does this suck?

Any time a node is created or destroyed, this recipe has to be re-run to pick up the change. To make that happen, we've added hooks to the tools we use to provision instances so the recipe runs automatically. Just hope no one uses the AWS console to create or destroy something...
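
The hook itself doesn't need to be clever. Something in this spirit, tacked onto the end of a provisioning script, is enough. The role name is a stand-in, and it assumes knife is configured on the box running it:

    # Hypothetical post-provision hook: after creating or destroying an
    # instance, force a Chef run on the monitoring server so monitor-clients
    # regenerates its configs. Assumes the Nagios server carries the role
    # "nagios-server" and that knife can reach it over ssh.
    unless system('knife', 'ssh', 'role:nagios-server', 'sudo chef-client')
      abort 'failed to trigger a Chef run on the Nagios server'
    end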

One of the cool things with AWS is that you can stop instances and not get charged for them until you turn them back on (other than the EBS storage fees). You can save thousands of dollars a month on your AWS bill this way. Unfortunately, stopped nodes still exist in Chef, still get picked up by this recipe, and end up flagged as down by Nagios.
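
One possible mitigation (a sketch of an idea, not something the recipe does today) would be to skip nodes that haven't checked in recently, on the theory that a stopped instance stops running chef-client. Every Chef run stamps the node with ohai_time, the epoch time of its last check-in; the two-hour cutoff here is an arbitrary guess:

    # Skip nodes whose last check-in (ohai_time, refreshed on every Chef run)
    # is older than a cutoff, on the assumption they are stopped, not down.
    stale_after = 2 * 60 * 60  # the 2-hour value is an assumption, not tuned

    live_nodes = search(:node, '*:*').reject do |n|
      n['ohai_time'].nil? || (Time.now.to_f - n['ohai_time']) > stale_after
    end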

While you're waiting for this recipe to run, your new nodes aren't monitored, and your terminated ones are trying hard to reach critical and page you one last time before they're removed. The bigger your environment, the slower this gets. We have anywhere from 100 to 200 nodes at any given time; at the high end, the config-generation phase of the Chef run can take 3-4 minutes. Imagine 5 times the hosts. 10 times. At some point this would fall over.



