
Introducing Kale « Code as Craft
Etsy have implemented a tool to automatically correlate service metrics and detect deviation from historic norms:
at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage. Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.

We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.
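To make the two-step flow in the excerpt concrete, here is a minimal sketch of "detect an anomalous metric, then look for metrics that behave similarly". It is not Skyline's or Oculus's actual code: the three-sigma test and the Pearson correlation are stand-ins chosen for brevity, and the function names, metric names, and thresholds are all hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(series, sigmas=3.0):
    """Flag the latest point if it sits more than `sigmas` standard
    deviations from the mean of the preceding history (assumed test)."""
    history, latest = series[:-1], series[-1]
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return latest != mu
    return abs(latest - mu) > sigmas * sd

def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def similar_metrics(name, metrics, threshold=0.9):
    """Return names of other metrics whose shape correlates with `name`."""
    target = metrics[name]
    return sorted(
        other for other, series in metrics.items()
        if other != name and pearson(target, series) >= threshold
    )

# Hypothetical metrics: two spike together, one is unrelated.
metrics = {
    "checkout.errors": [1, 1, 2, 1, 1, 2, 1, 20],
    "db.latency":      [10, 10, 12, 10, 10, 12, 10, 100],
    "cache.hits":      [5, 6, 5, 6, 5, 6, 5, 6],
}

if is_anomalous(metrics["checkout.errors"]):
    related = similar_metrics("checkout.errors", metrics)
```

Here the spike in the hypothetical `checkout.errors` series trips the detector, and the similarity search surfaces `db.latency` but not `cache.hits`, which is the kind of connection the excerpt says people rarely make by eye.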


It'll be interesting to see if they can get this working well. In my experience, it's tricky to keep false positives low without massive volume to "smooth out" spikes caused by normal activity. Amazon had one particularly successful version driving severity-1 order drop alarms, but it used massive event volumes and still had periodic false positives. Skyline looks like it will alarm on a single anomalous data point, and in the comments Abe notes "our algorithms err on the side of noise and so alerting would be very noisy."
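
One common way to trade alarm latency for fewer false positives is to require several consecutive deviant points before alarming, rather than firing on a single one. The sketch below illustrates that idea only. It is not how Skyline or the Amazon system works, and the function name, window size, and thresholds are assumptions.

```python
from statistics import mean, stdev

def should_alarm(series, window=30, sigmas=3.0, consecutive=3):
    """Alarm only if each of the last `consecutive` points deviates from
    the baseline (the `window` points before them) by more than
    `sigmas` standard deviations. All defaults are illustrative."""
    if len(series) < window + consecutive:
        return False  # not enough history to judge
    baseline = series[-(window + consecutive):-consecutive]
    recent = series[-consecutive:]
    mu, sd = mean(baseline), stdev(baseline)
    if sd == 0:
        return all(x != mu for x in recent)
    return all(abs(x - mu) > sigmas * sd for x in recent)

# Hypothetical order-rate series: steady traffic, then a drop.
normal = [100, 102, 98, 101, 99] * 6
one_bad_point = normal + [100, 100, 10]
sustained_drop = normal + [10, 12, 9]

single_point_alarm = should_alarm(one_bad_point, consecutive=1)
debounced_alarm = should_alarm(one_bad_point, consecutive=3)
sustained_alarm = should_alarm(sustained_drop, consecutive=3)
```

With `consecutive=1` the lone bad point alarms; with `consecutive=3` it is ignored and only the sustained drop alarms. The cost is that a real outage is reported a few intervals later, and low-volume metrics can still be noisy enough to defeat the baseline.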
etsy  monitoring  service-metrics  alarming  deviation  correlation  data  search  graphs  oculus  skyline  kale  false-positives 
june 2013 by jm
