jm + percentiles   20

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to product a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.

Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
data-structures  algorithms  java  t-digest  statistics  quantiles  percentiles  aggregation  digests  estimation  ranking 
december 2016 by jm
You CAN Average Percentiles
John Rauser on this oft-cited dictum of percentile usage in monitoring, and when it's wrong and it's actually possible to reason with averaged percentiles, and when it breaks down.
statistics  percentiles  quantiles  john-rauser  histograms  averaging  mean  p99 
july 2016 by jm
Why Percentiles Don’t Work the Way you Think
Baron Schwartz on metrics, percentiles, and aggregation. +1, although as a HN commenter noted, quantile digests are probably the better fix
performance  percentiles  quantiles  statistics  metrics  monitoring  baron-schwartz  vividcortex 
december 2015 by jm
Amazon EC2 2015 Benchmark: Testing Speeds Between AWS EC2 and S3 Regions
Here we are again, a year later, and still no bloody percentiles! Just amateurish averaging. This is not how you measure anything, ffs. Still, better than nothing I suppose
fail  latency  measurement  aws  ec2  percentiles  s3 
august 2015 by jm
testing latency measurements using CTRL-Z
An excellent tip from Gil "HDRHistogram" Tene:
Good example of why I always "calibrate" latency tools with ^Z tests. If ^Z results don't make sense, don't use [the] tool. ^Z test math examples: If you ^Z for half the time, Max is obvious. [90th percentile] should be 80% of the ^Z stall time.
control-z  suspend  unix  testing  latencies  latency  measurement  percentiles  tips 
november 2014 by jm
Most page loads will experience the 99th percentile response latency
MOST of the page view attempts will experience the 99%'lie server response time in modern web applications. You didn't read that wrong.
latency  metrics  percentiles  p99  web  http  soa 
october 2014 by jm
Felix says:

'Like I said, I'd like to move it to a more general / non-personal repo in the future, but haven't had the time yet. Anyway, you can still browse the code there for now. It is not a big code base so not that hard to wrap one's mind around it.

It is Apache licensed and both Kafka and Voldemort are using it so I would say it is pretty self-contained (although Kafka has not moved to Tehuti proper, it is essentially the same code they're using, minus a few small fixes missing that we added).

Tehuti is a bit lower level than CodaHale (i.e.: you need to choose exactly which stats you want to measure and the boundaries of your histograms), but this is the type of stuff you would build a wrapper for and then re-use within your code base. For example: the Voldemort RequestCounter class.'
asl2  apache  open-source  tehuti  metrics  percentiles  quantiles  statistics  measurement  latency  kafka  voldemort  linkedin 
october 2014 by jm
"Quantiles on Streams" [paper, 2009]
'Chiranjeeb Buragohain and Subhash Suri: "Quantiles on Streams" in Encyclopedia of Database Systems, Springer, pp 2235–2240, 2009. ISBN: 978-0-387-35544-3', cited by Martin Kleppman in as a good, short literature survey re estimating percentiles with a small memory footprint.
latency  percentiles  coding  quantiles  streams  papers  algorithms 
october 2014 by jm
A Go implementation of Greenwald-Khanna streaming quantiles: - 'a new online algorithm for computing approximate quantile summaries of very large data sequences with a worst-case space requirement of O(1/e log eN))'
quantiles  go  algorithms  greenwald-khanna  percentiles  streaming  cep  space-efficient 
july 2014 by jm
Monitoring Reactive Applications with Kamon
"quality monitoring tools for apps built in Akka, Spray and Play!". Uses Gil Tene's HDRHistogram and dropwizard Metrics under the hood.
metrics  dropwizard  hdrhistogram  gil-tene  kamon  akka  spray  play  reactive  statistics  java  scala  percentiles  latency 
may 2014 by jm
"Effective Computation of Biased Quantiles over Data Streams" [paper]

Skew is prevalent in many data sources such as IP traffic streams.To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two prob-lems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the “high-biased” quantiles and the “targeted” quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures.Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.

Implemented as a timer-histogram storage system in .
statistics  quantiles  percentiles  stream-processing  skew  papers  histograms  latency  algorithms 
november 2013 by jm
Coordinated Omission
Gil Tene raises an extremely good point about load testing, high-percentile response-time measurement, and behaviour when testing a system under load:

I've been harping for a while now about a common measurement technique problem I call "Coordinated Omission" for a while, which can often render percentile data useless. [...] I believe that this problem occurs extremely frequently in test results, but it's usually hard to deduce it's existence purely from the final data reported. But every once in a while, I see test results where the data provided is enough to demonstrate the huge percentile-misreporting effect of Coordinated Omission based purely on the summary report.

I ran into just such a case in Attila's cool posting about log4j2's truly amazing performance, so I decided to avoid polluting his thread with an elongated discussion of how to compute 99.9%'ile data, and started this topic here. That thread should really be about how cool log4j2 is, and I'm certain that it really is cool, even after you correct the measurements. [...] Basically, I think that the 99.99% observation computation is wrong, and demonstrably (using the data in the graph data posted) exhibits the classic "coordinated omission" measurement problem I've been preaching about. This test is not alone in exhibiting this, and there is nothing to be ashamed of when you find yourself making this mistake. I only figured it out after doing it myself many many times, and then I noticed that everyone else seems to also be doing it but most of them haven't yet figured it out. In fact, I run into this issue so often in percentile reporting and load testing that I'm starting to wonder if coordinated omission is there in 99.9% of latency tests ;-)
measurement  testing  latency  load-testing  gil-tene  coordinated-omission  validity  log4j  percentiles 
august 2013 by jm
good blog post on histogram-estimation stream processing algorithms
After reviewing several dozen papers, a score or so in depth, I identified two data structures that appear to enable us to answer these recency and frequency queries: exponential histograms (from "Maintaining Stream Statistics Over Sliding Windows" by Datar et al.) and waves (from "Distributed Streams Algorithms for Sliding Windows" by Gibbons and Tirthapura). Both of these data structures are used to solve the so-called counting problem, the problem of determining, with a bound on the relative error, the number of 1s in the last N units of time. In other words, the data structures are able to answer the question: how many 1s appeared in the last n units of time within a factor of Error (e.g., 50%). The algorithms are neat, so I'll present them briefly.
streams  streaming  stream-processing  histograms  percentiles  estimation  waves  statistics  algorithms 
february 2013 by jm
Distributed Streams Algorithms for Sliding Windows [PDF]
'Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate
functions over a “sliding window” of the N most recent data items in one or more streams. [...] Our results are obtained using a novel family of synopsis data structures called waves.'
waves  papers  streaming  algorithms  percentiles  histogram  distcomp  distributed  aggregation  statistics  estimation  streams 
february 2013 by jm
Heroku finds out that distributed queueing is hard
Stage 3 of the Rap Genius/Heroku blog drama. Summary (as far as I can tell): Heroku gave up on a fully-synchronised load-balancing setup ("intelligent routing"), since it didn't scale, in favour of randomised queue selection; they didn't sufficiently inform their customers, and metrics and docs were not updated to make this change public; the pessimal case became pretty damn pessimal; a customer eventually noticed and complained publicly, creating a public shit-storm.

Comments: 1. this is why you monitor real HTTP request latency (scroll down for crazy graphs!). 2. include 90/99 percentiles to catch the "tail" of poorly-performing requests. 3. Load balancers are hard. has more info on the intricacies of distributed load balancing -- worth a read.
heroku  rap-genius  via:hn  networking  distcomp  distributed  load-balancing  ip  queueing  percentiles  monitoring 
february 2013 by jm
'Medians and Beyond: New Aggregation Techniques for Sensor Networks' [paper, PDF]
'We introduce Quantile Digest or q-digest, a novel data structure which provides provable guarantees on approximation error and maximum resource consumption. In more concrete terms, if the values returned by the sensors are integers in the range [1;n], then using q-digest we can answer quantile queries using message size m within an error of O(log(n)/m). We also outline how we can use q-digest to answer other queries such as range queries, most frequent items and histograms. Another notable property of q-digest is that in addition to the theoretical worst case bound error, the structure carries with itself an estimate of error for this particular query.'
q-digest  algorithms  streams  approximation  histograms  median  percentiles  quantiles 
february 2013 by jm
High Scalability - Analyzing billions of credit card transactions and serving low-latency insights in the cloud
Hadoop, a batch-generated read-only Voldemort cluster, and an intriguing optimal-storage histogram bucketing algorithm:
The optimal histogram is computed using a random-restart hill climbing approximated algorithm.
The algorithm has been shown very fast and accurate: we achieved 99% accuracy compared to an exact dynamic algorithm, with a speed increase of one factor. [...] The amount of information to serve in Voldemort for one year of BBVA's credit card transactions on Spain is 270 GB. The whole processing flow would run in 11 hours on a cluster of 24 "m1.large" instances. The whole infrastructure, including the EC2 instances needed to serve the resulting data would cost approximately $3500/month.
scalability  scaling  voldemort  hadoop  batch  algorithms  histograms  statistics  bucketing  percentiles 
february 2013 by jm
Beatiful dataviz of weather data from,, World Weather Online and Weather Central. The main graph includes: mean and percentiles of historical temperature data for time of year, the temperature and precipitation forecast over the chosen period, wind direction and speed, with hourly data. Very nicely done! (via Una Mullally)
via:unamullally  dataviz  temperature  forecasts  weather  graphing  percentiles  wind  rain 
june 2012 by jm

Copy this bookmark: