Dogfood, Target Markets, and Subtle Problems
Last night I started working on a script to help customers integrate the Where's it Up API with their existing monitoring infrastructure. I started with Nagios, since that's what we're using for WonderNetwork (I also learned we're monitoring ~1200 services). Shortly after I started coding, I talked with Will, since he's our systems administrator and the target market for the Nagios integration script. Over the course of a few discussions during the evening, the scope of the script changed drastically (and, surprisingly, it actually shrunk). The way I wanted to write the script didn't match at all with how my target users would want to use it.
My Plan: Configure Nagios to hit the script every minute. The script will fail its first-ever check (since calling the API and waiting for the result takes time), but it will cache that result. Subsequent calls will see the cached information. Once data is a minute from expiring, the script will silently make the call again, so that new data will be presented on the next call. When things do actually fail, it will make further calls to help diagnose the issue, and those details will be available to the administrator via a log or email.
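The caching scheme above can be sketched roughly as follows. Everything here is illustrative: the cache path, TTL, and `run_api_check` stand-in are assumptions, not the real script, and the refresh happens synchronously rather than in the background for simplicity.

```python
import json
import os
import time

CACHE_FILE = "/tmp/wheresitup_check.cache"  # hypothetical cache location
CACHE_TTL = 300          # cache results for five minutes
REFRESH_MARGIN = 60      # refresh once within a minute of expiry

def run_api_check():
    """Stand-in for the real (slow) Where's it Up API round trip."""
    return {"status": "OK", "checked_at": time.time()}

def cached_check():
    """Serve cached data; refresh shortly before it expires.

    In the original plan the very first call would fail (no data yet)
    while the fetch ran in the background; this sketch just fetches
    synchronously on a cold cache to stay runnable.
    """
    now = time.time()
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as fh:
            cached = json.load(fh)
        age = now - cached["checked_at"]
        if age < CACHE_TTL - REFRESH_MARGIN:
            return cached                  # fresh enough: serve as-is
        if age < CACHE_TTL:
            fresh = run_api_check()        # near expiry: refresh for next call
            with open(CACHE_FILE, "w") as fh:
                json.dump(fresh, fh)
            return cached                  # still serve the cached copy now
    fresh = run_api_check()                # cold or expired cache
    with open(CACHE_FILE, "w") as fh:
        json.dump(fresh, fh)
    return fresh
```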
Will's Complaint: I don't like it when monitoring tools lie to me. If Nagios claims it ran a check at 12:43am that passed, that service had better be functioning properly. Nagios is willing to wait 5 minutes (by default) for a result to return, so just make the script wait for results before returning.
So, that's what I did. I also had rather grand plans involving configuring the script for every check, then having Nagios pass in the check title to invoke that check. Will (thankfully) managed to convince me that this was folly, and the script now accepts relevant parameters via the command line.
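The final shape of the script, then, is a plain synchronous Nagios check: take parameters on the command line, wait for the API result, and exit with a standard Nagios status code. This is a sketch under assumptions; the flag names and `check_url` stand-in are mine, not the real script's.

```python
import argparse

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_url(url, timeout):
    """Stand-in for the real Where's it Up API round trip: a real
    implementation would submit the check, then block polling for
    results until they arrive or the timeout is reached."""
    return {"summary": "200 OK", "failed": False}

def main(argv):
    parser = argparse.ArgumentParser(
        description="Where's it Up Nagios check (sketch)")
    parser.add_argument("--url", required=True)
    parser.add_argument("--timeout", type=int, default=240,
                        help="seconds to wait; keep below Nagios's own timeout")
    args = parser.parse_args(argv)

    result = check_url(args.url, args.timeout)
    if result["failed"]:
        # One-line status output is the Nagios plugin convention
        print("CRITICAL - %s: %s" % (args.url, result["summary"]))
        return CRITICAL
    print("OK - %s: %s" % (args.url, result["summary"]))
    return OK
```

Nagios would then invoke something like `main(["--url", "https://wonderproxy.com"])` via a command definition, mapping the exit code straight to the check's state.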
Lesson One: Target your efforts to your customer's wants.
While testing the script I also noticed that our WonderProxy site was occasionally failing my tests. Further investigation revealed that the service was timing out after waiting 5 seconds with no response. We initially blamed Apache as being slow to acquire the entropy required to generate the SSL connection; this was not at all the case. As it turns out, some of the data that's cached for the page was regenerating too often, and taking longer than 5 seconds to do so. This problem has likely existed for a long time, undetected. Users simply received incredibly poor performance when they were unlucky enough to be the first user after the cache expired.
Lesson Two: Subtle problems can exist for ages, undetected unless you look for them.
Finally, while building the script I noticed a few issues with how our API was handling invalid requests. This happened while I was unknowingly passing invalid data to the system, which obligingly blundered on without actually doing anything, making it incredibly difficult to look up the results of that request. I didn't find these issues while writing my initial test code & curl examples, since those were all valid. Finding these silent failures let me revisit portions of the API code so they error out, and do so verbosely.
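The fix amounts to validating input up front and reporting every problem loudly, instead of accepting the request and quietly doing nothing. A hedged sketch, with field names that are illustrative rather than the real Where's it Up API:

```python
def handle_request(params):
    """Validate a check request before doing any work.

    Collect every validation error and return them all at once,
    verbosely, rather than silently accepting a request that can
    never produce results.
    """
    errors = []
    if not params.get("uri", "").startswith(("http://", "https://")):
        errors.append("uri must be an http or https URL")
    if not params.get("tests"):
        errors.append("at least one test must be requested")
    if errors:
        return {"status": 400, "errors": errors}  # fail loudly, list everything
    return {"status": 200}  # a real handler would enqueue the work here
```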
Lesson Three: Eating your own dogfood is an incredibly fast way to find problems.