jm + service-metrics   17

OpenCensus: A Stats Collection and Distributed Tracing Framework
Google open-sourcing their internal Census library for service metrics and distributed tracing
google  monitoring  service-metrics  metrics  census  opencensus  open-source  tracing  zipkin  prometheus 
4 days ago by jm
Backstage Blog - Prometheus: Monitoring at SoundCloud - SoundCloud Developers
whoa, this is pretty excellent. The major improvement over a Graphite-based system would be the multi-dimensional tagging of metrics, which we currently have to fake by expanding the Graphite metric's name to encompass all those dimensions and then searching on it at query time, inefficiently. (A quick sketch of the difference is below.)
monitoring  soundcloud  prometheus  metrics  service-metrics  graphite  alerting 
february 2015 by jm
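
A quick sketch of the labels-vs-name-expansion difference described above, using the Prometheus Python client (prometheus_client); the metric and label names here are made up for illustration:

    from prometheus_client import Counter, start_http_server

    # Graphite-style: dimensions baked into the metric name, e.g.
    #   "api.eu-west-1.frontend-03.status.200.requests"
    # Prometheus-style: one metric name, dimensions as queryable labels.
    REQUESTS = Counter(
        "http_requests_total",
        "HTTP requests handled",
        ["service", "region", "host", "status"],
    )

    start_http_server(8000)  # expose /metrics for the Prometheus server to scrape
    REQUESTS.labels(service="api", region="eu-west-1",
                    host="frontend-03", status="200").inc()
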
Introducing Atlas: Netflix's Primary Telemetry Platform
This sounds really excellent -- the dimensionality problem it deals with is a familiar one, particularly with red/black deployments, autoscaling, and so on creating trees of metrics as new transient servers appear and disappear. Looking forward to Netflix open-sourcing enough of it to make it usable for outsiders.
netflix  metrics  service-metrics  atlas  telemetry  ops 
december 2014 by jm
Twitter tech talk video: "Profiling Java In Production"
In this talk Kaushik Srenevasan describes a new, low overhead, full-stack tool (based on the Linux perf profiler and infrastructure built into the Hotspot JVM) we've built at Twitter to solve the problem of dynamically profiling and tracing the behavior of applications (including managed runtimes) in production.


Looks very interesting. Haven't watched it yet, though.
twitter  tech-talks  video  presentations  java  jvm  profiling  testing  monitoring  service-metrics  performance  production  hotspot  perf 
december 2013 by jm
Cyanite
a metric storage daemon, exposing both a carbon listener and a simple web service. Its aim is to become a simple, scalable and drop-in replacement for graphite's backend.


Pretty alpha for now, but definitely worth keeping an eye on as a potential replacement for our burgeoning Carbon fleet... (the carbon plaintext protocol it listens for is sketched below)
graphite  carbon  cassandra  storage  metrics  ops  graphs  service-metrics 
december 2013 by jm
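
For reference, the carbon plaintext protocol a listener like Cyanite's accepts is just "path value timestamp" lines over TCP. A toy client; the host, port and metric path are placeholders:

    import socket
    import time

    def send_metric(path, value, host="localhost", port=2003):
        """Send one datapoint as a 'path value timestamp' line to a carbon listener."""
        line = "%s %s %d\n" % (path, value, int(time.time()))
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    send_metric("servers.frontend-03.requests_per_sec", 412)
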
LatencyUtils by giltene
The LatencyUtils package includes useful utilities for tracking latencies. Especially in common in-process recording scenarios, which can exhibit significant coordinated omission sensitivity without proper handling.
gil-tene  metrics  java  measurement  coordinated-omission  latency  speed  service-metrics  open-source 
november 2013 by jm
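
A conceptual sketch of the coordinated-omission correction the library above automates -- this is not the LatencyUtils API, just the back-filling idea, assuming a client that intends to issue a request every expected_interval milliseconds:

    def record_corrected(samples, latency, expected_interval):
        """Record a latency, back-filling the samples a stalled client never
        issued, so long pauses aren't under-represented in percentiles."""
        samples.append(latency)
        # Requests that should have been sent during the stall were silently
        # omitted; credit them with the latency they would have experienced.
        missed = latency - expected_interval
        while missed >= expected_interval:
            samples.append(missed)
            missed -= expected_interval

    samples = []
    record_corrected(samples, latency=950, expected_interval=100)  # milliseconds
    print(samples)  # [950, 850, 750, 650, 550, 450, 350, 250, 150]
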
Statsite
A C reimplementation of Etsy's statsd, with some interesting memory optimizations.
Statsite is designed to be both highly performant, and very flexible. To achieve this, it implements the stats collection and aggregation in pure C, using libev to be extremely fast. This allows it to handle hundreds of connections, and millions of metrics. After each flush interval expires, statsite performs a fork/exec to start a new stream handler invoking a specified application. Statsite then streams the aggregated metrics over stdin to the application, which is free to handle the metrics as it sees fit. This allows statsite to aggregate metrics and then ship metrics to any number of sinks (Graphite, SQL databases, etc). There is an included Python script that ships metrics to graphite.
statsd  graphite  statsite  performance  statistics  service-metrics  metrics  ops 
november 2013 by jm
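
An illustrative sink in the same spirit as the bundled script -- not that script itself -- assuming statsite's "key|value|timestamp" ASCII stream format (check the statsite docs for your version) and a carbon listener on the default port:

    import socket
    import sys

    def relay_to_graphite(lines, host="localhost", port=2003, prefix="statsite."):
        """Read aggregated metrics streamed by statsite and forward them to carbon."""
        sock = socket.create_connection((host, port))
        try:
            for line in lines:
                key, value, timestamp = line.strip().split("|")
                sock.sendall(("%s%s %s %s\n" % (prefix, key, value, timestamp))
                             .encode("ascii"))
        finally:
            sock.close()

    if __name__ == "__main__":
        # statsite fork/execs this handler each flush and streams metrics on stdin
        relay_to_graphite(sys.stdin)
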
Observability at Twitter
A bit of detail on Twitter's TSD (time-series database) metric store.
There are separate online clusters for different data sets: application and operating system metrics, performance critical write-time aggregates, long term archives, and temporal indexes. A typical production instance of the time series database is based on four distinct Cassandra clusters, each responsible for a different dimension (real-time, historical, aggregate, index) due to different performance constraints. These clusters are amongst the largest Cassandra clusters deployed in production today and account for over 500 million individual metric writes per minute. Archival data is stored at a lower resolution for trending and long term analysis, whereas higher resolution data is periodically expired. Aggregation is generally performed at write-time to avoid extra storage operations for metrics that are expected to be immediately consumed. Indexing occurs along several dimensions–service, source, and metric names–to give users some flexibility in finding relevant data.
twitter  monitoring  metrics  service-metrics  tsd  time-series  storage  architecture  cassandra 
september 2013 by jm
Boundary's Early Warnings alarm
Anomaly detection on network throughput metrics, alarming if throughputs on selected flows deviate by 1, 2, or 3 standard deviations from a historical baseline (a minimal version of this kind of check is sketched below).
network-monitoring  throughput  boundary  service-metrics  alarming  ops  statistics 
june 2013 by jm
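
A minimal sketch of that kind of check; the baseline window and numbers are arbitrary illustrations:

    from statistics import mean, stdev

    def deviation_level(baseline, current):
        """Return 0 if within 1 sigma of the historical baseline, else 1, 2 or 3."""
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            return 0
        return min(3, int(abs(current - mu) / sigma))

    history = [980, 1010, 995, 1024, 1002, 988, 1015]   # e.g. KB/sec on one flow
    print(deviation_level(history, 1350))               # -> 3: alarm loudly
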
Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
LinkedIn have implemented an automated root-cause detection system:

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.


This is a topic close to my heart after working on something similar for 3 years in Amazon!

Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system, which would leave me worried that in the case of a large-scale outage event, the system itself will disappear when its help is most vital. (This was a major design influence on our team's work.)

Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
linkedin  soa  root-cause  alarming  correlation  service-metrics  machine-learning  graphs  monitoring 
june 2013 by jm
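
A rough sketch of the intuition in that abstract -- rank the services a misbehaving frontend depends on by how strongly their metrics move with its anomaly -- not MonitorRank's actual random-walk model; the call graph and series are made up:

    from statistics import correlation  # Python 3.10+

    call_graph = {                      # caller -> callees
        "frontend": ["auth", "feed"],
        "auth":     ["db"],
        "feed":     ["db", "cache"],
        "db":       [],
        "cache":    [],
    }
    latency = {                         # per-service latency series, same window
        "frontend": [10, 11, 10, 55, 60],
        "auth":     [3, 3, 3, 4, 3],
        "feed":     [5, 5, 6, 30, 33],
        "db":       [2, 2, 2, 20, 22],
        "cache":    [1, 1, 2, 1, 1],
    }

    def candidates(root):
        """Every service reachable from the anomalous root via the call graph."""
        seen, stack = set(), [root]
        while stack:
            for callee in call_graph[stack.pop()]:
                if callee not in seen:
                    seen.add(callee)
                    stack.append(callee)
        return seen

    def rank_root_causes(root="frontend"):
        scores = {svc: correlation(latency[root], latency[svc])
                  for svc in candidates(root)}
        return sorted(scores, key=scores.get, reverse=True)

    print(rank_root_causes())   # -> ['db', 'feed', 'auth', 'cache']: look at db first
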
Introducing Kale « Code as Craft
Etsy have implemented a tool to perform auto-correlation of service metrics, and detection of deviations from historic norms:
at Etsy, we really love to make graphs. We graph everything! Anywhere we can slap a StatsD call, we do. As a result, we’ve found ourselves with over a quarter million distinct metrics. That’s far too many graphs for a team of 150 engineers to watch all day long! And even if you group metrics into dashboards, that’s still an awful lot of dashboards if you want complete coverage. Of course, if a graph isn’t being watched, it might misbehave and no one would know about it. And even if someone caught it, lots of other graphs might be misbehaving in similar ways, and chances are low that folks would make the connection.

We’d like to introduce you to the Kale stack, which is our attempt to fix both of these problems. It consists of two parts: Skyline and Oculus. We first use Skyline to detect anomalous metrics. Then, we search for that metric in Oculus, to see if any other metrics look similar. At that point, we can make an informed diagnosis and hopefully fix the problem.


It'll be interesting to see if they can get this working well. I've found it can be tricky to get working with low false positives, without massive volume to "smooth out" spikes caused by normal activity. Amazon had one particularly successful version driving severity-1 order drop alarms, but it used massive event volumes and still had periodic false positives. Skyline looks like it will alarm on a single anomalous data point, and in the comments Abe notes "our algorithms err on the side of noise and so alerting would be very noisy."
etsy  monitoring  service-metrics  alarming  deviation  correlation  data  search  graphs  oculus  skyline  kale  false-positives 
june 2013 by jm
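
A toy take on the problem Oculus tackles -- "which other metrics look like this one?" -- by reducing each series to a coarse shape signature and matching on it. Oculus itself uses much more sophisticated shape matching and search infrastructure, so treat this purely as an illustration with made-up series:

    def shape_signature(series, flat=0.05):
        """Encode each step as up (+), down (-) or flat (=), relative to the series' range."""
        scale = max(series) - min(series) or 1.0
        steps = []
        for prev, cur in zip(series, series[1:]):
            delta = (cur - prev) / scale
            steps.append("+" if delta > flat else "-" if delta < -flat else "=")
        return "".join(steps)

    metrics = {
        "api.requests":   [100, 102, 101, 240, 250],
        "db.connections": [40, 41, 40, 95, 99],
        "cache.hits":     [900, 905, 899, 903, 901],
    }
    anomalous = shape_signature(metrics["api.requests"])
    similar = [name for name, series in metrics.items()
               if shape_signature(series) == anomalous and name != "api.requests"]
    print(similar)   # -> ['db.connections']
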
Unhelpful Graphite Tips
10 particularly good -- actually helpful -- tips on using the Graphite metric graphing system
graphite  ops  metrics  service-metrics  graphing  ui  dataviz 
february 2013 by jm
check_graphite
'a Nagios plugin to poll Graphite'. Necessary, since service metrics are the true source of service health information. (A minimal version of such a check is sketched below.)
nagios  graphite  service-metrics  ops 
january 2013 by jm
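
A minimal sketch of such a check against Graphite's render API -- not the check_graphite plugin itself. The host, target and thresholds are placeholders, and the exit codes follow the Nagios convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN):

    import json
    import sys
    import urllib.request

    GRAPHITE = "http://graphite.example.com"
    TARGET = "stats.api.5xx_rate"          # hypothetical metric name
    WARN, CRIT = 1.0, 5.0

    def latest_value(target):
        """Fetch the most recent non-null datapoint for the target series."""
        url = "%s/render?target=%s&from=-5min&format=json" % (GRAPHITE, target)
        series = json.load(urllib.request.urlopen(url))
        if not series:
            return None
        points = [v for v, _ts in series[0]["datapoints"] if v is not None]
        return points[-1] if points else None

    value = latest_value(TARGET)
    if value is None:
        print("UNKNOWN: no recent datapoints for %s" % TARGET)
        sys.exit(3)
    if value >= CRIT:
        print("CRITICAL: %s = %s" % (TARGET, value))
        sys.exit(2)
    if value >= WARN:
        print("WARNING: %s = %s" % (TARGET, value))
        sys.exit(1)
    print("OK: %s = %s" % (TARGET, value))
    sys.exit(0)
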
Metricfire - Powerful Application Metrics Made Easy
Irish "metrics as a service" company, Python-native; they've just gone GA and announced their pricing plans.
python  metrics  service-metrics 
april 2012 by jm
sbtourist/nimrod - GitHub
'Nimrod is a metrics server, inspired by the excellent Coda Hale's Metrics library, but purely based on log processing: hence, it doesn't affect the way you write your applications, nor it has any side effect on them.'
nimrod  service-metrics  logging 
february 2012 by jm
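
A sketch of the general log-processing approach -- deriving metrics by scanning application logs instead of instrumenting code -- not Nimrod's own log format; the regex and log lines here are hypothetical:

    import re
    from collections import defaultdict
    from statistics import mean

    TIMING = re.compile(r"handled (?P<endpoint>\S+) in (?P<ms>\d+)ms")

    def aggregate(log_lines):
        """Count and average per-endpoint timings found in a batch of log lines."""
        timings = defaultdict(list)
        for line in log_lines:
            match = TIMING.search(line)
            if match:
                timings[match.group("endpoint")].append(int(match.group("ms")))
        return {ep: {"count": len(ms), "mean_ms": mean(ms)}
                for ep, ms in timings.items()}

    print(aggregate([
        "2012-02-01T10:00:01 INFO handled /search in 120ms",
        "2012-02-01T10:00:02 INFO handled /search in 80ms",
        "2012-02-01T10:00:02 INFO handled /login in 35ms",
    ]))
    # -> per-endpoint request counts and mean latencies in ms
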
Autometrics: Self-service metrics collection
how LinkedIn built a service-metrics collection and graphing infrastructure using Kafka and Zookeeper, writing to RRD files, handling 8.8k metrics per datacenter per second
kafka  zookeeper  linkedin  sysadmin  service-metrics 
february 2012 by jm
