ASAP: Automatic Smoothing for Attention Prioritization in Streaming Time Series Visualization
Peter Bailis strikes again.

'Time series visualization of streaming telemetry (i.e., charting of
key metrics such as server load over time) is increasingly prevalent
in recent application deployments. Existing systems simply plot the
raw data streams as they arrive, potentially obscuring large-scale
deviations due to local variance and noise. We propose an alternative:
to better prioritize attention in time series exploration and
monitoring visualizations, smooth the time series as much as possible
to remove noise while still retaining large-scale structure. We
develop a new technique for automatically smoothing streaming
time series that adaptively optimizes this trade-off between noise
reduction (i.e., variance) and outlier retention (i.e., kurtosis). We
introduce metrics to quantitatively assess the quality of the choice
of smoothing parameter and provide an efficient streaming analytics
operator, ASAP, that optimizes these metrics by combining techniques
from stream processing, user interface design, and signal
processing via a novel autocorrelation-based pruning strategy and
pixel-aware preaggregation. We demonstrate that ASAP is able to
improve users’ accuracy in identifying significant deviations in time
series by up to 38.4% while reducing response times by up to 44.3%.
Moreover, ASAP delivers these results several orders of magnitude
faster than alternative optimization strategies.'
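The core trade-off is easy to sketch, minus the clever parts (the autocorrelation-based pruning and pixel-aware preaggregation are what make the paper). A brute-force Python illustration using numpy/scipy -- my sketch, not the paper's operator:

    import numpy as np
    from scipy.stats import kurtosis

    def asap_window(series, max_window=100):
        # Find the moving-average window that smooths the most (lowest
        # "roughness", the stddev of first differences) while keeping
        # kurtosis at or above the original, so outliers stay visible.
        orig_kurt = kurtosis(series)
        best_w, best_rough = 1, np.std(np.diff(series))
        for w in range(2, max_window + 1):
            sm = np.convolve(series, np.ones(w) / w, mode='valid')
            rough = np.std(np.diff(sm))
            if kurtosis(sm) >= orig_kurt and rough < best_rough:
                best_w, best_rough = w, rough
        return best_w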
dataviz  graphs  metrics  peter-bailis  asap  smoothing  aggregation  time-series  tsd 
8 days ago by jm
How to Quantify Scalability
good page on the Universal Scalability Law and how to apply it
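The law itself, for reference: capacity relative to concurrency N is

    C(N) = \frac{N}{1 + \sigma(N-1) + \kappa N(N-1)}

where \sigma is the contention (serialization) penalty and \kappa the coherency (crosstalk) penalty. Fit both coefficients to measured throughput -- hence the excel tag -- and the predicted peak falls at N = \sqrt{(1-\sigma)/\kappa}.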
usl  performance  scalability  concurrency  capacity  measurement  excel  equations  metrics 
september 2016 by jm
Sex toy tells manufacturer when you’re using it
the "We-Vibe 4 Plus" phones home with telemetry data including temperature, and when the user "changes the vibration level". wtf
wtf  privacy  sex-toys  telemetry  metrics  vibrators  we-vibe 
august 2016 by jm
Raintank investing in Graphite
paying Jason Dixon to work on it, improving the backend, possibly replacing the creaky Whisper format. great news!
graphite  metrics  monitoring  ops  open-source  grafana  raintank 
july 2016 by jm
USE Method: Linux Performance Checklist
Really late in bookmarking this, but has some up-to-date sample commandlines for sar, mpstat and iostat on linux
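For quick reference, a few one-liners of the sort the checklist walks through (real flags, but see the page for the full utilization/saturation/errors breakdown per resource):

    mpstat -P ALL 1    # per-CPU utilization -- spot single hot CPUs
    iostat -xz 1       # per-device IO utilization and queue saturation
    sar -n DEV 1       # per-interface network throughput vs. line rate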
linux  sar  iostat  mpstat  cli  ops  sysadmin  performance  tuning  use  metrics 
june 2016 by jm
Key Metrics for Amazon Aurora | AWS Partner Network (APN) Blog
Very DataDog-oriented, but some decent tips on monitorable metrics here
datadog  metrics  aurora  aws  rds  monitoring  ops 
may 2016 by jm
CoreOS and Prometheus: Building monitoring for the next generation of cluster infrastructure
Ooh, this is a great plan. :applause:
Enabling GIFEE — Google Infrastructure for Everyone Else — is a primary mission at CoreOS, and open source is key to that goal. [....]

Prometheus was initially created to handle monitoring and alerting in modern microservice architectures. It steadily grew to fit the wider idea of cloud native infrastructure. Though it was not intentional in the original design, Prometheus and Kubernetes conveniently share the key concept of identifying entities by labels, making the semantics of monitoring Kubernetes clusters simple. As we discussed previously on this blog, Prometheus metrics formed the basis of our analysis of Kubernetes scheduler performance, and led directly to improvements in that code. Metrics are essential not just to keep systems running, but also to analyze and improve application behavior.

All things considered, Prometheus was an obvious choice for the next open source project CoreOS wanted to support and improve with internal developers committed to the code base.
monitoring  coreos  prometheus  metrics  clustering  ops  gifee  google  kubernetes 
may 2016 by jm
Observability at Twitter: technical overview, part II
Interesting to me mainly for this tidbit, which matches my own prejudices:
“Pull” vs “push” in metrics collection: At the time of our previous blog post, all our metrics were collected by “pulling” from our collection agents. We discovered two main issues:

* There is no easy way to differentiate service failures from collection agent failures. A service response timeout and a missed collection request both manifest as empty time series.
* There is a lack of service-quality insulation in our collection pipeline. It is very difficult to set an optimal collection timeout for various services, and a long collection time from a single service can delay other services sharing the same collection agent.

In light of these issues, we switched our collection model from “pull” to “push” and increased our service isolation. Our collection agent on each host only collects metrics from services running on that specific host. Additionally, each collection agent sends separate collection status tracking metrics in addition to the metrics emitted by the services.

We have seen a significant improvement in collection reliability with these changes. However, as we moved to a self-service push model, it became harder to project request growth. To solve this problem, we plan to implement service quotas to address unpredictable/unbounded growth.
pull  push  metrics  tcp  stacks  monitoring  agents  twitter  fault-tolerance 
march 2016 by jm
Life360 testimonial for Prometheus
Now this is a BIG thumbs up:
'Prometheus has been known to us for a while, and we have been tracking it and reading about the active development, and at a point (a few months back) we decided to start evaluating it for production use. The PoC results were incredible. The monitoring coverage of MySQL was amazing, and we also loved the JMX monitoring for Cassandra, which had been sorely lacking in the past.'
metrics  monitoring  time-series  prometheus  testimonials  life360  cassandra  jmx  mysql 
march 2016 by jm
The Nyquist theorem and limitations of sampling profilers today, with glimpses of tracing tools from the future
Awesome post from Dan Luu with data from Google:
The cause [of some mystery widespread 250ms hangs] was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren’t too many requests in a quarter second interval and the kernel stops throttling the threads. After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn’t something you can see in a profile.
debugging  performance  visualization  instrumentation  metrics  dan-luu  latency  google  dick-sites  linux  scheduler  throttling  kernel  hangs 
february 2016 by jm
Metrics integration for OkHttp. looks quite nice
okhttp  java  clients  http  metrics  dropwizard 
december 2015 by jm
Spotify wrote their own metrics store on ElasticSearch and Cassandra. Sounds very similar to Prometheus
cassandra  elasticsearch  spotify  monitoring  metrics  heroic 
december 2015 by jm
Why Percentiles Don’t Work the Way you Think
Baron Schwartz on metrics, percentiles, and aggregation. +1, although as a HN commenter noted, quantile digests are probably the better fix
performance  percentiles  quantiles  statistics  metrics  monitoring  baron-schwartz  vividcortex 
december 2015 by jm
Is Dublin Busy?
a bunch of metrics for Dublin xmas-shopping capacity
xmas  dublin  metrics  design  stats 
november 2015 by jm
CiteSeerX — The Confounding Effect of Class Size on the Validity of Object-oriented Metrics
A lovely cite from @conor. Turns out the sheer size of an OO class is itself a solid fault-proneness metric
metrics  coding  static-analysis  error-detection  faults  via:conor  oo 
november 2015 by jm
Nobody Loves Graphite Anymore - VividCortex
Graphite has a place in our current monitoring stack, and together with StatsD will always have a special place in the hearts of DevOps practitioners everywhere, but it’s not representative of state-of-the-art in the last few years. Graphite is where the puck was in 2010. If you’re skating there, you’re missing the benefits of modern monitoring infrastructure.

The future I foresee is one where time series capabilities (the raw power needed, which I described in my time series requirements blog post, for example) are within everyone’s reach. That will be considered table stakes, whereas now it’s pretty revolutionary.

Like I've been saying -- we need Time Series As A Service! This should be undifferentiated heavy lifting.
graphite  tsd  time-series  vividcortex  statsd  ops  monitoring  metrics 
november 2015 by jm
Existential Consistency: Measuring and Understanding Consistency at Facebook
The metric is termed φ(P)-consistency, and is actually very simple. A read for the same data is sent to all replicas in P, and φ(P)-consistency is defined as the frequency with which that read returns the same result from all replicas. φ(G)-consistency applies this metric globally, and φ(R)-consistency applies it within a region (cluster). Facebook have been tracking this metric in production since 2012.
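It's simple enough to state in a few lines of Python -- my sketch of the definition, with a hypothetical read-per-replica interface:

    import random

    def phi_consistency(replica_reads, keys, trials=10000):
        # Fraction of reads for which every replica in P returns the
        # same value; replica_reads is a list of read(key) functions,
        # one per replica (hypothetical interface).
        agree = 0
        for _ in range(trials):
            k = random.choice(keys)
            if len({read(k) for read in replica_reads}) == 1:
                agree += 1
        return agree / float(trials)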
facebook  eventual-consistency  consistency  metrics  papers  cap  distributed-computing 
october 2015 by jm
SolarCapture Packet Capture Software
Interesting product line -- I didn't know this existed, but it makes good sense as a "network flight recorder". Big in finance.
SolarCapture is a powerful packet capture product family that can transform every server into a precision network monitoring device, increasing network visibility, network instrumentation, and performance analysis. SolarCapture products optimize network monitoring and security, while eliminating the need for specialized appliances, expensive adapters relying on exotic protocols, proprietary hardware, and dedicated networking equipment.

See also Corvil (based in Dublin!): 'I'm using a Corvil at the moment and it's awesome- nanosecond precision latency measurements on the wire.'

(via mechanical sympathy list)
corvil  timing  metrics  measurement  latency  network  solarcapture  packet-capture  financial  performance  security  network-monitoring 
may 2015 by jm
Gruffalo
an asynchronous Netty-based graphite proxy. It protects Graphite from the herds of clients by minimizing context switches and interrupts and by batching and aggregating metrics. Gruffalo also allows you to replicate metrics between Graphite installations for DR scenarios, for example.

Gruffalo can easily handle a massive amount of traffic, and thus increase your metrics delivery system availability. At Outbrain, we currently handle over 1700 concurrent connections, and over 2M metrics per minute per instance.
graphite  backpressure  metrics  outbrain  netty  proxies  gruffalo  ops 
april 2015 by jm
Introducing Vector: Netflix's On-Host Performance Monitoring Tool
It gives pinpoint real-time performance metric visibility to engineers working on specific hosts -- basically sending back system-level performance data to their browser, where a client-side renderer turns it into a usable dashboard. Essentially the idea is to replace having to ssh onto instances and run "top", sysstat, iostat, and so on.
vector  netflix  performance  monitoring  sysstat  top  iostat  netstat  metrics  ops  dashboards  real-time  linux 
april 2015 by jm
Germanwings flight 4U9525: what’s it like to listen to a black box recording?
After every air disaster, finding the black box recorder becomes the first priority – but for the crash investigators who have to listen to the tapes of people’s final moments, the experience can be incredibly harrowing.
flight  disasters  metrics  recording  germanwings  air-travel  black-box-recorder  flight-data-recorder  death 
april 2015 by jm
Time Series Metrics with Cassandra
slides from Chris Maxwell of Ubiquiti Networks describing what he had to do to get cyanite on Cassandra handling 30k metrics per second; an experimental "Date-tiered compaction" mode from Spotify was essential from the sounds of it. Very complex :(
cassandra  spotify  date-tiered-compaction  metrics  graphite  cyanite  chris-maxwell  time-series-data 
april 2015 by jm
Heka
an open source stream processing software system developed by Mozilla. Heka is a "Swiss Army Knife" type tool for data processing, useful for a wide variety of different tasks, such as:

Loading and parsing log files from a file system.
Accepting statsd type metrics data for aggregation and forwarding to upstream time series data stores such as graphite or InfluxDB.
Launching external processes to gather operational data from the local system.
Performing real time analysis, graphing, and anomaly detection on any data flowing through the Heka pipeline.
Shipping data from one location to another via the use of an external transport (such as AMQP) or directly (via TCP).
Delivering processed data to one or more persistent data stores.

Via feylya on twitter. Looks potentially nifty
heka  mozilla  monitoring  metrics  via:feylya  ops  statsd  graphite  stream-processing 
march 2015 by jm
A Dropwizard Metrics extension to instrument JDBC resources and measure SQL execution times.
metrics  sql  jdbc  instrumentation  dropwizard 
march 2015 by jm
VividCortex uses K-Means Clustering to discover related metrics
After selecting an interesting spike in a metric, the algorithm can automate picking out a selection of other metrics which spiked at the same time. I can see that being pretty damn useful
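VividCortex don't publish the implementation, but the shape of the idea is roughly this (a scikit-learn sketch; k and the z-normalisation are my guesses):

    import numpy as np
    from sklearn.cluster import KMeans

    def related_metrics(series_by_name, target, k=8):
        # Cluster z-normalised metric time series; metrics landing in
        # the same cluster as the target are the ones that moved with it.
        names = list(series_by_name)
        X = np.array([series_by_name[n] for n in names], dtype=float)
        X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-9)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        t = labels[names.index(target)]
        return [n for n, lab in zip(names, labels) if lab == t and n != target]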
metrics  k-means-clustering  clustering  algorithms  discovery  similarity  vividcortex  analysis  data 
march 2015 by jm
"Open source APM for Java" -- profiling in production, with a demo benchmark showing about a 2% performance impact. Wonder about effects on memory/GC, though
apm  java  metrics  measurement  new-relic  profiling  glowroot 
march 2015 by jm
One year of InfluxDB and the road to 1.0
half of the [Monitorama] attendees were employees and entrepreneurs at monitoring, metrics, DevOps, and server analytics companies. Most of them had a story about how their metrics API was their key intellectual property that took them years to develop. The other half of the attendees were developers at larger organizations that were rolling their own DevOps stack from a collection of open source tools. Almost all of them were creating a “time series database” with a bunch of web services code on top of some other database or just using Graphite. When everyone is repeating the same work, it’s not key intellectual property or a differentiator, it’s a barrier to entry. Not only that, it’s something that is hindering innovation in this space since everyone has to spend their first year or two getting to the point where they can start building something real. It’s like building a web company in 1998. You have to spend millions of dollars and a year building infrastructure, racking servers, and getting everything ready before you could run the application. Monitoring and analytics applications should not be like this.
graphite  monitoring  metrics  tsd  time-series  analytics  influxdb  open-source 
february 2015 by jm
Performance Co-Pilot
System performance metrics framework, plugged by Netflix, open-source for ages
open-source  pcp  performance  system  metrics  ops  red-hat  netflix 
february 2015 by jm
pcp2graphite
A gateway script, now included in PCP
pcp2graphite  pcp  graphite  ops  metrics  system 
february 2015 by jm
Backstage Blog - Prometheus: Monitoring at SoundCloud - SoundCloud Developers
whoa, this is pretty excellent. The major improvement over a graphite-based system would be the multi-dimensional tagging of metrics, which we currently have to do by simply expanding the graphite metric's name to encompass all those dimensions and use searching at query time, inefficiently.
monitoring  soundcloud  prometheus  metrics  service-metrics  graphite  alerting 
february 2015 by jm
carbon-c-relay
A much better carbon-relay, written in C rather than Python. Linking as we've been using it in production for quite a while with no problems.
The main reason to build a replacement is performance and configurability. Carbon is single threaded, and sending metrics to multiple consistent-hash clusters requires chaining of relays. This project provides a multithreaded relay which can address multiple targets and clusters for each and every metric based on pattern matches.
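For the unfamiliar, the consistent-hash routing described above looks roughly like this -- a toy illustration, not carbon-c-relay's actual ring:

    import hashlib
    from bisect import bisect

    class HashRing(object):
        # Each metric name maps to the first server point clockwise of
        # its hash, so resizing the cluster only remaps a small fraction
        # of metrics (and hence of whisper files).
        def __init__(self, servers, points=100):
            self.ring = sorted(
                (int(hashlib.md5(('%s:%d' % (s, i)).encode()).hexdigest(), 16), s)
                for s in servers for i in range(points))
            self.hashes = [h for h, _ in self.ring]

        def server_for(self, metric):
            h = int(hashlib.md5(metric.encode()).hexdigest(), 16)
            return self.ring[bisect(self.hashes, h) % len(self.ring)][1]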
graphite  carbon  c  python  ops  metrics 
january 2015 by jm
Introducing practical and robust anomaly detection in a time series
Twitter open-sources an anomaly-spotting R package:
Early detection of anomalies plays a key role in ensuring high-fidelity data is available to our own product teams and those of our data partners. This package helps us monitor spikes in user engagement on the platform surrounding holidays, major sporting events or during breaking news. Beyond surges in social engagement, exogenic factors – such as bots or spammers – may cause an anomaly in number of favorites or followers. The package can be used to find such bots or spam, as well as detect anomalies in system metrics after a new software release. We’re open-sourcing AnomalyDetection because we’d like the public community to evolve the package and learn from it as we have.
statistics  twitter  r  anomaly-detection  outliers  metrics  time-series  spikes  holt-winters 
january 2015 by jm
Introducing Atlas: Netflix's Primary Telemetry Platform
This sounds really excellent -- the dimensionality problem it deals with is a familiar one, particularly with red/black deployments, autoscaling, and so on creating trees of metrics when new transient servers appear and disappear. Looking forward to Netflix open sourcing enough to make it usable for outsiders
netflix  metrics  service-metrics  atlas  telemetry  ops 
december 2014 by jm
PDX DevOps Graphite replacement
Replacing graphite with InfluxDB, Riemann and Grafana. Not quite there yet, looks like
influxdb  graphite  ops  metrics  riemann  grafana  slides 
december 2014 by jm
Most page loads will experience the 99th percentile response latency
MOST of the page view attempts will experience the 99%'lie server response time in modern web applications. You didn't read that wrong.
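The arithmetic behind the headline, assuming a page load fans out into N independent requests:

    # probability that at least one of N requests exceeds the p99,
    # assuming request latencies are independent
    for n in (1, 10, 42, 100):
        print(n, 1 - 0.99 ** n)   # 1%, ~9.6%, ~34.4%, ~63.4%

At around 70 requests per page -- unremarkable for a modern web app -- a majority of page views see at least one above-p99 response.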
latency  metrics  percentiles  p99  web  http  soa 
october 2014 by jm
Carbon vs Megacarbon and Roadmap ? · Issue #235 · graphite-project/carbon
Carbon is a great idea, but fundamentally, twisted doesn't do what carbon-relay or carbon-aggregator were built to do when hit with sustained and heavy throughput. Much to my chagrin, concurrency isn't one of python's core competencies.

+1, sadly. We are patching around the edges with half-released third-party C rewrites in our graphite setup, as we exceed the scale Carbon can support.
carbon  graphite  metrics  ops  python  twisted  scalability 
october 2014 by jm
Felix says:

'Like I said, I'd like to move it to a more general / non-personal repo in the future, but haven't had the time yet. Anyway, you can still browse the code there for now. It is not a big code base so not that hard to wrap one's mind around it.

It is Apache licensed and both Kafka and Voldemort are using it so I would say it is pretty self-contained (although Kafka has not moved to Tehuti proper, it is essentially the same code they're using, minus a few small fixes that we added).

Tehuti is a bit lower level than CodaHale (i.e.: you need to choose exactly which stats you want to measure and the boundaries of your histograms), but this is the type of stuff you would build a wrapper for and then re-use within your code base. For example: the Voldemort RequestCounter class.'
asl2  apache  open-source  tehuti  metrics  percentiles  quantiles  statistics  measurement  latency  kafka  voldemort  linkedin 
october 2014 by jm
Tehuti
An embryonic metrics library for Java/Scala from Felix GV at LinkedIn, extracted from Kafka's metric implementation and in the new Voldemort release. It fixes the major known problems with the Meter/Timer implementations in Coda-Hale/Dropwizard/Yammer Metrics.

'Regarding Tehuti: it has been extracted from Kafka's metric implementation. The code was originally written by Jay Kreps, and then maintained and improved by some Kafka and Voldemort devs, so it definitely is not the work of just one person. It is in my repo at the moment but I'd like to put it in a more generally available (git and maven) repo in the future. I just haven't had the time yet...

As for comparing with CodaHale/Yammer, there were a few concerns with it, but the main one was that we didn't like the exponentially decaying histogram implementation. While that implementation is very appealing in terms of (low) memory usage, it has several misleading characteristics (a lack of incoming data points makes old measurements linger longer than they should, and there's also a fairly high possibility of losing interesting outlier data points). This makes the exp decaying implementation robust in high-throughput, fairly constant workloads, but unreliable in sparse or spiky workloads. The Tehuti implementation provides semantics that we find easier to reason with and with a small code footprint (which we consider a plus in terms of maintainability). Of course, it is still a fairly young project, so it could be improved further.'

More background at the kafka-dev thread:
kafka  metrics  dropwizard  java  scala  jvm  timers  ewma  statistics  measurement  latency  sampling  tehuti  voldemort  linkedin  jay-kreps 
october 2014 by jm
CausalImpact: A new open-source package for estimating causal effects in time series
How can we measure the number of additional clicks or sales that an AdWords campaign generated? How can we estimate the impact of a new feature on app downloads? How do we compare the effectiveness of publicity across countries?

In principle, all of these questions can be answered through causal inference.

In practice, estimating a causal effect accurately is hard, especially when a randomised experiment is not available. One approach we've been developing at Google is based on Bayesian structural time-series models. We use these models to construct a synthetic control — what would have happened to our outcome metric in the absence of the intervention. This approach makes it possible to estimate the causal effect that can be attributed to the intervention, as well as its evolution over time.

We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.

Today, we're excited to announce the release of CausalImpact, an open-source R package that makes causal analyses simple and fast. With its release, all of our advertisers and users will be able to use the same powerful methods for estimating causal effects that we've been using ourselves.

Our main motivation behind creating the package has been to find a better way of measuring the impact of ad campaigns on outcomes. However, the CausalImpact package could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, or the political and social sciences.
causal-inference  r  google  time-series  models  bayes  adwords  advertising  statistics  estimation  metrics 
september 2014 by jm
Box Tech Blog » A Tale of Postmortems
How Box introduced COE-style dev/ops outage postmortems, and got them working. This PIE metric sounds really useful to head off the dreaded "it'll all have to come out missus" action item:
The picture was getting clearer, and we decided to look into individual postmortems and action items and see what was missing. As it was, action items were wasting away with no owners. Digging deeper, we noticed that many action items entailed massive refactorings or vague requirements like “make system X better” (i.e. tasks that realistically were unlikely to be addressed). At a higher level, postmortem discussions often devolved into theoretical debates without a clear outcome. We needed a way to lower and focus the postmortem bar and a better way to categorize our action items and our technical debt.

Out of this need, PIE (“Probability of recurrence * Impact of recurrence * Ease of addressing”) was born. By ranking each factor from 1 (“low”) to 5 (“high”), PIE provided us with two critical improvements:

1. A way to police our postmortems discussions. I.e. a low probability, low impact, hard to implement solution was unlikely to get prioritized and was better suited to a discussion outside the context of the postmortem. Using this ranking helped deflect almost all theoretical discussions.
2. A straightforward way to prioritize our action items.

What’s better is that once we embraced PIE, we also applied it to existing tech debt work. This was critical because we could now prioritize postmortem action items alongside existing work. Postmortem action items became part of normal operations just like any other high-priority work.
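The score itself is deliberately trivial to compute -- which is rather the point. A sketch:

    def pie(probability, impact, ease):
        # Box's postmortem triage score: each factor ranked from
        # 1 (low) to 5 (high); higher products get prioritized first.
        assert all(1 <= f <= 5 for f in (probability, impact, ease))
        return probability * impact * ease

    pie(4, 4, 5)   # 80: likely, painful, and easy to fix -- do it now
    pie(1, 2, 1)   # 2: the theoretical debate you can safely skip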
postmortems  action-items  outages  ops  devops  pie  metrics  ranking  refactoring  prioritisation  tech-debt 
august 2014 by jm
Metrics-Driven Development
we believe MDD is equal parts engineering technique and cultural process. It separates the notion of monitoring from its traditional position of exclusivity as an operations thing and places it more appropriately next to its peers as an engineering process. Provided access to real-time production metrics relevant to them individually, both software engineers and operations engineers can validate hypotheses, assess problems, implement solutions, and improve future designs.

Broken down into the following principles: 'Instrumentation-as-Code', 'Single Source of Truth', 'Developers Curate Visualizations and Alerts', 'Alert on What You See', 'Show me the Graph', 'Don’t Measure Everything (YAGNI)'.

We do all of these at Swrve, naturally (a technique I happily stole from Amazon).
metrics  coding  graphite  mdd  instrumentation  yagni  alerting  monitoring  graphs 
july 2014 by jm
Boundary's new server monitoring free offering
'High resolution, 1 second intervals for all metrics; Fluid analytics, drag any graph to any point in time; Smart alarms to cut down on false positives; Embedded graphs and customizable dashboards; Up to 10 servers for free'

Pre-registration is open now. Could be interesting, although the limit of 10 machines is pretty small for any production usage
boundary  monitoring  network  ops  metrics  alarms  tcp  ip  netstat 
july 2014 by jm
Two traps in iostat: %util and svctm
Marc Brooker:
As a measure of general IO busyness %util is fairly handy, but as an indication of how much the system is doing compared to what it can do, it's terrible. Iostat's svctm has even fewer redeeming strengths. It's just extremely misleading for most modern storage systems and workloads. Both of these fields are likely to mislead more than inform on modern SSD-based storage systems, and their use should be treated with extreme care.
ioutil  iostat  svctm  ops  ssd  disks  hardware  metrics  stats  linux 
july 2014 by jm
Urban Airship with a new open-source Graphite front-end UI; similar enough to Grafana at a glance, no releases yet, ASL2-licensed
graphite  metrics  ui  front-ends  open-source  ops 
july 2014 by jm
Twitter's TSAR
TSAR = "Time Series AggregatoR". Twitter's new event processor-style architecture for internal metrics. It's notable that now Twitter and Google are both apparently moving towards this idea of a model of code which is designed to run equally in realtime streaming and batch modes (Summingbird, Millwheel, Flume).
analytics  architecture  twitter  tsar  aggregation  event-processing  metrics  streaming  hadoop  batch 
june 2014 by jm
Monitoring Reactive Applications with Kamon
"quality monitoring tools for apps built in Akka, Spray and Play!". Uses Gil Tene's HDRHistogram and dropwizard Metrics under the hood.
metrics  dropwizard  hdrhistogram  gil-tene  kamon  akka  spray  play  reactive  statistics  java  scala  percentiles  latency 
may 2014 by jm
10 Things We Forgot to Monitor
a list of not-so-common outage causes which are easy to overlook; swap rate, NTP drift, SSL expiration, fork rate, etc.
nagios  metrics  ops  monitoring  systems  ntp  bitly 
january 2014 by jm
Extending graphite’s mileage
Ad company InMobi are using graphite heavily (albeit not as heavily as $work are), ran into the usual scaling issues, and chose to fix it in code by switching from a filesystem full of whisper files to a LevelDB per carbon-cache:
The carbon server is now able to run without breaking a sweat even when 500K metrics per minute is being pumped into it. This has been in production since late August 2013 in every datacenter that we operate from.

Very nice. I hope this gets merged/supported.
graphite  scalability  metrics  leveldb  storage  inmobi  whisper  carbon  open-source 
january 2014 by jm
examining the Hardware Performance Counters
using the overseer library and libpfm, it's possible for a JVM app to record metrics about L2/DRAM cache hit rates and latency
metrics  hpc  libpfm  java  jvm  via:normanmaurer  l2  dram  llc  cpu 
december 2013 by jm
a metric storage daemon, exposing both a carbon listener and a simple web service. Its aim is to become a simple, scalable and drop-in replacement for graphite's backend.

Pretty alpha for now, but definitely worth keeping an eye on to potentially replace our burgeoning Carbon fleet...
graphite  carbon  cassandra  storage  metrics  ops  graphs  service-metrics 
december 2013 by jm
LatencyUtils by giltene
The LatencyUtils package includes useful utilities for tracking latencies. Especially in common in-process recording scenarios, which can exhibit significant coordinated omission sensitivity without proper handling.
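The standard correction for coordinated omission, sketched in Python (this is the idea behind HdrHistogram's recordValueWithExpectedInterval; LatencyUtils adds automatic pause detection on top):

    def record_corrected(samples, latency, expected_interval):
        # A stall swallows the samples that should have happened during
        # it: a 10s pause on a 100ms recording cadence should show up
        # as ~100 progressively smaller latencies, not a single sample.
        samples.append(latency)
        missed = latency - expected_interval
        while missed >= expected_interval:
            samples.append(missed)
            missed -= expected_interval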
gil-tene  metrics  java  measurement  coordinated-omission  latency  speed  service-metrics  open-source 
november 2013 by jm
Scryer: Netflix’s Predictive Auto Scaling Engine
Scryer is a new system that allows us to provision the right number of AWS instances needed to handle the traffic of our customers. But Scryer is different from Amazon Auto Scaling (AAS), which reacts to real-time metrics and adjusts instance counts accordingly. Rather, Scryer predicts what the needs will be prior to the time of need and provisions the instances based on those predictions.
scaling  infrastructure  aws  ec2  netflix  scryer  auto-scaling  aas  metrics  prediction  spikes 
november 2013 by jm
Statsite
A C reimplementation of Etsy's statsd, with some interesting memory optimizations.
Statsite is designed to be both highly performant, and very flexible. To achieve this, it implements the stats collection and aggregation in pure C, using libev to be extremely fast. This allows it to handle hundreds of connections, and millions of metrics. After each flush interval expires, statsite performs a fork/exec to start a new stream handler invoking a specified application. Statsite then streams the aggregated metrics over stdin to the application, which is free to handle the metrics as it sees fit. This allows statsite to aggregate metrics and then ship metrics to any number of sinks (Graphite, SQL databases, etc). There is an included Python script that ships metrics to graphite.
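A sink is just a program reading aggregated metrics on stdin, so a minimal one looks something like this (assuming the pipe-delimited key|value|timestamp line format; check the statsite docs for the exact framing):

    #!/usr/bin/env python
    # Minimal statsite sink sketch: statsite fork/execs this after each
    # flush interval and streams aggregated metrics via stdin.
    import sys

    for line in sys.stdin:
        parts = line.strip().split('|')
        if len(parts) != 3:
            continue   # skip anything malformed
        key, value, ts = parts
        # re-emit in graphite plaintext form; a real sink would write
        # to a carbon socket or a database instead
        sys.stdout.write('%s %s %s\n' % (key, value, ts))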
statsd  graphite  statsite  performance  statistics  service-metrics  metrics  ops 
november 2013 by jm
"What Should I Monitor?"
slides (lots of slides) from Baron Schwartz' talk at Velocity in NYC.
slides  monitoring  metrics  ops  devops  baron-schwartz  pdf  capacity 
october 2013 by jm
Low Overhead Method Profiling with Java Mission Control now enabled in the most recent HotSpot JVM release
Built into the HotSpot JVM [in JDK version 7u40] is something called the Java Flight Recorder. It records a lot of information about/from the JVM runtime, and can be thought of as similar to the Data Flight Recorders you find in modern airplanes. You normally use the Flight Recorder to find out what was happening in your JVM when something went wrong, but it is also a pretty awesome tool for production-time profiling. Since Mission Control (using the default templates) normally doesn't cause more than a percent of overhead, you can use it on your production server.

I'm intrigued by the idea of always-on profiling in production. This could be cool.
performance  java  measurement  profiling  jvm  jdk  hotspot  mission-control  instrumentation  telemetry  metrics 
september 2013 by jm
Observability at Twitter
Bit of detail into Twitter's TSD metric store.
There are separate online clusters for different data sets: application and operating system metrics, performance critical write-time aggregates, long term archives, and temporal indexes. A typical production instance of the time series database is based on four distinct Cassandra clusters, each responsible for a different dimension (real-time, historical, aggregate, index) due to different performance constraints. These clusters are amongst the largest Cassandra clusters deployed in production today and account for over 500 million individual metric writes per minute. Archival data is stored at a lower resolution for trending and long term analysis, whereas higher resolution data is periodically expired. Aggregation is generally performed at write-time to avoid extra storage operations for metrics that are expected to be immediately consumed. Indexing occurs along several dimensions–service, source, and metric names–to give users some flexibility in finding relevant data.
twitter  monitoring  metrics  service-metrics  tsd  time-series  storage  architecture  cassandra 
september 2013 by jm
DevOps Eye for the Coding Guy: Metrics
a pretty good description of the process of adding service metrics to a Django webapp using graphite and statsd. Bookmarking mainly for the great real-time graphing hack at the end...
statsd  django  monitoring  metrics  python  graphite 
september 2013 by jm
metric collectors for various stuff not (or poorly) handled by other monitoring daemons

Core of the project is a simple daemon (harvestd), which collects metric values and sends them to graphite carbon daemon (and/or other configured destinations) once per interval. Includes separate data collection components ("collectors") for processing of:

/proc/slabinfo for useful-to-watch values, not everything (configurable).
/proc/vmstat and /proc/meminfo in a consistent way.
/proc/stat for irq, softirq, forks.
/proc/buddyinfo and /proc/pagetypeinfo (memory fragmentation).
/proc/interrupts and /proc/softirqs.
Cron log to produce start/finish events and duration for each job as separate metrics, adapting jobs to metric names with regexes.
Per-system-service accounting using systemd and its cgroups.
sysstat data from sadc logs (use something like sadc -F -L -S DISK -S XDISK -S POWER 60 to have more stuff logged there) via the sadf binary and its JSON export (sadf -j, supported since sysstat-10.0.something, iirc).
iptables rule "hits" packet and byte counters, taken from ip{,6}tables-save, mapped via separate "table chain_name rule_no metric_name" file, which should be generated along with firewall rules (I use this script to do that).

Pretty exhaustive list of system metrics -- could have some interesting ideas for Linux OS-level metrics to monitor in future.
graphite  monitoring  metrics  unix  linux  ops  vm  iptables  sysadmin 
june 2013 by jm
Care and Feeding of Large Scale Graphite Installations [slides]
good docs for large-scale graphite use: 'Tip and tricks of using and scaling graphite. First presented at DevOpsDays Austin Texas 2013-05-01'
graphite  devops  ops  metrics  dashboards  sysadmin 
june 2013 by jm
Monitoring the Status of Your EBS Volumes
Page in the AWS docs which describes their derived metrics and how they are computed -- these are visible in the AWS Management Console, and alarmable, but not viewable in the Cloudwatch UI. grr. (page-joshea!)
ebs  aws  monitoring  metrics  ops  documentation  cloudwatch 
may 2013 by jm
Making sense out of BDB-JE fast stats
good info on the system metrics recorded by BDB-JE's EnvironmentStats code, particularly where cache and cleaner activity are concerned. Particularly useful for Voldemort
voldemort  caching  bdb  bdb-je  storage  tuning  ops  metrics  reference 
may 2013 by jm
Fred's ImageMagick Scripts: SIMILAR
compute an image-similarity metric, to discover mostly-identical-but-slightly-tweaked images:
SIMILAR computes the normalized cross correlation similarity metric between two equal dimensioned images. The normalized cross correlation metric measures how similar two images are, not how different they are. The range of ncc metric values is between 0 (dissimilar) and 1 (similar). If mode=g, then the two images will be converted to grayscale. If mode=rgb, then the two images first will be converted to colorspace=rgb. Next, the ncc similarity metric will be computed for each channel. Finally, they will be combined into an rms value.

(via Dan O'Neill)
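The metric being computed is the textbook per-channel normalized cross correlation:

    \mathrm{ncc}(A,B) = \frac{1}{N} \sum_{x} \frac{(A_x - \mu_A)(B_x - \mu_B)}{\sigma_A \, \sigma_B}

which in general ranges over [-1, 1]; the script's stated 0-to-1 range suggests it clips or rescales the result (I haven't checked which).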
image  photos  pictures  similar  imagemagick  via:dano  metrics  similarity 
april 2013 by jm
Boundary Product Update: Trends Dashboard Now Available
Boundary implement week-on-week trend display. Pity they use silly "giant number" dashboard boxes showing comparisons of the current datapoint with the previous week's datapoint; there's no indication of smoothing being applied, and "giant number" dashboards are basically useless anyway compared to a time-series graph, for unsmoothed time-series data. Also, no prediction bands. :(
boundary  time-series  tsd  prediction  metrics  smoothing  dataviz  dashboards 
april 2013 by jm
Measure Anything, Measure Everything « Code as Craft
the classic Etsy pro-metrics "measure everything" post. Some good basic rules and mindset
etsy  monitoring  metrics  stats  ops  devops 
april 2013 by jm
The first pillar of agile sysadmin: We alert on what we draw
'One of [the] purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it. ' .... 'There are a couple of weaknesses in [Nagios' design]. Assuming we’ve agreed that if we care about a metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check. Furthermore, this data is not on the machine we’re monitoring, so our checks don’t in any way add further stress to that machine.' I would add that if we are alerting on a different set of data from what we collect for graphing, then using the graphs to investigate an alarm may run into problems if they don't sync up.
devops  monitoring  deployment  production  sysadmin  ops  alerting  metrics 
march 2013 by jm
SpaceX software dev practices
Metrics rule the roost -- I guess there's been a long history of telemetry in space applications.

To make software more visible, you need to know what it is doing, he said, which means creating "metrics on everything you can think of".... Those metrics should cover areas like performance, network utilization, CPU load, and so on.

The metrics gathered, whether from testing or real-world use, should be stored as it is "incredibly valuable" to be able to go back through them, he said. For his systems, telemetry data is stored with the program metrics, as is the version of all of the code running so that everything can be reproduced if needed.

SpaceX has programs to parse the metrics data and raise an alarm when "something goes bad". It is important to automate that, Rose said, because forcing a human to do it "would suck". The same programs run on the data whether it is generated from a developer's test, from a run on the spacecraft, or from a mission. Any failures should be seen as an opportunity to add new metrics. It takes a while to "get into the rhythm" of doing so, but it is "very useful". He likes to "geek out on error reporting", using tools like libSegFault and ftrace.

Automation is important, and continuous integration is "very valuable", Rose said. He suggested building for every platform all of the time, even for "things you don't use any more". SpaceX does that and has found interesting problems when building unused code. Unit tests are run from the continuous integration system any time the code changes. "Everyone here has 100% unit test coverage", he joked, but running whatever tests are available, and creating new ones is useful. When he worked on video games, they had a test to just "warp" the character to random locations in a level and had it look in the four directions, which regularly found problems.

"Automate process processes", he said. Things like coding standards, static analysis, spaces vs. tabs, or detecting the use of Emacs should be done automatically. SpaceX has a complicated process where changes cannot be made without tickets, code review, signoffs, and so forth, but all of that is checked automatically. If static analysis is part of the workflow, make it such that the code will not build unless it passes that analysis step.

When the build fails, it should "fail loudly" with a "monitor that starts flashing red" and email to everyone on the team. When that happens, you should "respond immediately" to fix the problem. In his team, they have a full-size Justin Bieber cutout that gets placed facing the team member who broke the build. They found that "100% of software engineers don't like Justin Bieber", and will work quickly to fix the build problem.
spacex  dev  coding  metrics  deployment  production  space  justin-bieber 
march 2013 by jm
Metric Collection and Storage with Cassandra | DataStax
DataStax' documentation on how they store TSD data in Cass. Pretty generic
datastax  nosql  metrics  analytics  cassandra  tsd  time-series  storage 
march 2013 by jm
Unhelpful Graphite Tips
10 particularly good -- actually helpful -- tips on using the Graphite metric graphing system
graphite  ops  metrics  service-metrics  graphing  ui  dataviz 
february 2013 by jm
'The Unified Logging Infrastructure for Data Analytics at Twitter' [PDF]
A picture of how Twitter standardized their internal service event logging formats to allow batch analysis and analytics. They surface service metrics to dashboards from Pig jobs on a daily basis, which frankly doesn't sound too great...
twitter  analytics  event-logging  events  logging  metrics 
january 2013 by jm
Notes on Distributed Systems for Young Bloods
'Below is a list of some lessons I’ve learned as a distributed systems engineer that are worth being told to a new engineer. Some are subtle, and some are surprising, but none are controversial. This list is for the new distributed systems engineer to guide their thinking about the field they are taking on. It’s not comprehensive, but it’s a good beginning.' This is a pretty nice list, a little over-stated, but that's the format. I particularly like the following: 'Exploit data-locality'; 'Learn to estimate your capacity'; 'Metrics are the only way to get your job done'; 'Use percentiles, not averages'; 'Extract services'.
systems  distributed  distcomp  cap  metrics  coding  guidelines  architecture  backpressure  design  twitter 
january 2013 by jm
paperplanes. The Virtues of Monitoring, Redux
A rather vague and touchy-feely "state of the union" post on monitoring. Good set of links at the end, though; I like the look of Sensu and Tasseo, but am still unconvinced about the value of Boundary's offering
monitoring  metrics  ops 
january 2013 by jm
Scaling Crashlytics: Building Analytics on Redis 2.6
How one analytics/metrics co is using Redis on the backend
analytics  redis  presentation  metrics 
january 2013 by jm
Nintendo's work on Miiverse Penis Drawing Detection

'The unique feature of the Miiverse is being able to send drawings, not just text. But since the advent of the internet, there have always been those who have used it for unsavory purposes.'
'Motoyama: we never had such a problem with our Hatena services. But, when we brought Hatena Flipnote to the West, we were caught off-guard by the amount of penises drawn by people.
Kurisu: So the team and I had to come up with a way to create a system that auto-detects those types of pictures. [...]
'Motoyama: After a week, we made very good progress on the system. Then we tested the system with Nintendo of America and told them to start drawing. It went horribly.
Kurisu: What we learned is that people enjoy drawing penises. Multiple ones. (laughs) The system was not prepared to handle that.'

See also the "time-to-penis" metric in MMO games:
nintendo  image-detection  ttp  metrics  games  gaming  mmo  miiverse  drawing 
november 2012 by jm