[1902.04023] Computing Extremely Accurate Quantiles Using t-Digests

4 weeks ago by jm

'We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile q to be computed with an accuracy relative to max(q,1−q) rather than absolute accuracy as with most other methods. This new algorithm is robust with respect to skewed distributions or ordered datasets and allows separately computed summaries to be combined with no loss in accuracy. An open-source Java implementation of this algorithm is available from the author. Independent implementations in Go and Python are also available.'

(via Tony Finch)

java
go
python
via:fanf
open-source
quantiles
percentiles
approximation
statistics
sketching
algorithms
(via Tony Finch)

4 weeks ago by jm

tdunning/t-digest

Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.
data-structures
algorithms
java
t-digest
statistics
quantiles
percentiles
aggregation
digests
estimation
ranking

december 2016 by jm

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to product a data structure that is related to the Q-digest. This t-digest data structure can be used to estimate quantiles or compute other rank statistics. The advantage of the t-digest over the Q-digest is that the t-digest can handle floating point values while the Q-digest is limited to integers. With small changes, the t-digest can handle any values from any ordered set that has something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by Q-digests in spite of the fact that t-digests are more compact when stored on disk.

Super-nice feature is that it's mergeable, so amenable to parallel usage across multiple hosts if required. Java implementation, ASL licensing.

december 2016 by jm

You CAN Average Percentiles

july 2016 by jm

John Rauser on this oft-cited dictum of percentile usage in monitoring, and when it's wrong and it's actually possible to reason with averaged percentiles, and when it breaks down.

statistics
percentiles
quantiles
john-rauser
histograms
averaging
mean
p99
july 2016 by jm

Why Percentiles Don’t Work the Way you Think

december 2015 by jm

Baron Schwartz on metrics, percentiles, and aggregation. +1, although as a HN commenter noted, quantile digests are probably the better fix

performance
percentiles
quantiles
statistics
metrics
monitoring
baron-schwartz
vividcortex
december 2015 by jm

HdrHistogram: A better latency capture method

february 2015 by jm

An excellent intro to HdrHistogram usage

hdrhistogram
hdr
histograms
statistics
latency
measurement
metrics
percentiles
quantiles
gil-tene
nitsan-wakart
february 2015 by jm

FelixGV/tehuti

october 2014 by jm

Felix says:

'Like I said, I'd like to move it to a more general / non-personal repo in the future, but haven't had the time yet. Anyway, you can still browse the code there for now. It is not a big code base so not that hard to wrap one's mind around it.

It is Apache licensed and both Kafka and Voldemort are using it so I would say it is pretty self-contained (although Kafka has not moved to Tehuti proper, it is essentially the same code they're using, minus a few small fixes missing that we added).

Tehuti is a bit lower level than CodaHale (i.e.: you need to choose exactly which stats you want to measure and the boundaries of your histograms), but this is the type of stuff you would build a wrapper for and then re-use within your code base. For example: the Voldemort RequestCounter class.'

asl2
apache
open-source
tehuti
metrics
percentiles
quantiles
statistics
measurement
latency
kafka
voldemort
linkedin
'Like I said, I'd like to move it to a more general / non-personal repo in the future, but haven't had the time yet. Anyway, you can still browse the code there for now. It is not a big code base so not that hard to wrap one's mind around it.

It is Apache licensed and both Kafka and Voldemort are using it so I would say it is pretty self-contained (although Kafka has not moved to Tehuti proper, it is essentially the same code they're using, minus a few small fixes missing that we added).

Tehuti is a bit lower level than CodaHale (i.e.: you need to choose exactly which stats you want to measure and the boundaries of your histograms), but this is the type of stuff you would build a wrapper for and then re-use within your code base. For example: the Voldemort RequestCounter class.'

october 2014 by jm

"Quantiles on Streams" [paper, 2009]

october 2014 by jm

'Chiranjeeb Buragohain and Subhash Suri: "Quantiles on Streams" in Encyclopedia of Database Systems, Springer, pp 2235–2240, 2009. ISBN: 978-0-387-35544-3', cited by Martin Kleppman in http://mail-archives.apache.org/mod_mbox/kafka-dev/201402.mbox/%3C131A7649-ED57-45CB-B4D6-F34063267664@linkedin.com%3E as a good, short literature survey re estimating percentiles with a small memory footprint.

latency
percentiles
coding
quantiles
streams
papers
algorithms
october 2014 by jm

dgryski/go-gk

july 2014 by jm

A Go implementation of Greenwald-Khanna streaming quantiles: http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf - 'a new online algorithm for computing approximate quantile summaries of very large data sequences with a worst-case space requirement of O(1/e log eN))'

quantiles
go
algorithms
greenwald-khanna
percentiles
streaming
cep
space-efficient
july 2014 by jm

"Effective Computation of Biased Quantiles over Data Streams" [paper]

statistics
quantiles
percentiles
stream-processing
skew
papers
histograms
latency
algorithms

november 2013 by jm

Skew is prevalent in many data sources such as IP traffic streams.To continually summarize the distribution of such data, a high-biased set of quantiles (e.g., 50th, 90th and 99th percentiles) with finer error guarantees at higher ranks (e.g., errors of 5, 1 and 0.1 percent, respectively) is more useful than uniformly distributed quantiles (e.g., 25th, 50th and 75th percentiles) with uniform error guarantees. In this paper, we address the following two prob-lems. First, can we compute quantiles with finer error guarantees for the higher ranks of the data distribution effectively, using less space and computation time than computing all quantiles uniformly at the finest error? Second, if specific quantiles and their error bounds are requested a priori, can the necessary space usage and computation time be reduced? We answer both questions in the affirmative by formalizing them as the “high-biased” quantiles and the “targeted” quantiles problems, respectively, and presenting algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems. We implemented our algorithms in the Gigascope data stream management system, and evaluated alternate approaches for maintaining the relevant summary structures.Our experimental results on real and synthetic IP data streams complement our theoretical analyses, and highlight the importance of lightweight, non-blocking implementations when maintaining summary structures over high-speed data streams.

Implemented as a timer-histogram storage system in http://armon.github.io/statsite/ .

november 2013 by jm

'Medians and Beyond: New Aggregation Techniques for Sensor Networks' [paper, PDF]

february 2013 by jm

'We introduce Quantile Digest or q-digest, a novel data structure which provides provable guarantees on approximation error and maximum resource consumption. In more concrete terms, if the values returned by the sensors are integers in the range [1;n], then using q-digest we can answer quantile queries using message size m within an error of O(log(n)/m). We also outline how we can use q-digest to answer other queries such as range queries, most frequent items and histograms. Another notable property of q-digest is that in addition to the theoretical worst case bound error, the structure carries with itself an estimate of error for this particular query.'

q-digest
algorithms
streams
approximation
histograms
median
percentiles
quantiles
february 2013 by jm

**related tags**

Copy this bookmark: