**jm + approximation**
8

[1902.04023] Computing Extremely Accurate Quantiles Using t-Digests

4 weeks ago by jm

'We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile q to be computed with an accuracy relative to max(q,1−q) rather than absolute accuracy as with most other methods. This new algorithm is robust with respect to skewed distributions or ordered datasets and allows separately computed summaries to be combined with no loss in accuracy. An open-source Java implementation of this algorithm is available from the author. Independent implementations in Go and Python are also available.'

(via Tony Finch)

java
go
python
via:fanf
open-source
quantiles
percentiles
approximation
statistics
sketching
algorithms
(via Tony Finch)

4 weeks ago by jm

Fast Forward Labs: Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters

november 2016 by jm

Nice comparison of a counting Bloom filter and a Cuckoo Filter, implemented in Python:

algorithms
probabilistic
approximation
bloom-filters
cuckoo-filters
sets
estimation
python
This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).

november 2016 by jm

"last seen" sketch

july 2015 by jm

a new sketch algorithm from Baron Schwartz and Preetam Jinka of VividCortex; similar to Count-Min but with last-seen timestamp instead of frequency.

sketch
algorithms
estimation
approximation
sampling
streams
big-data
july 2015 by jm

"Cuckoo Filter: Practically Better Than Bloom"

march 2015 by jm

'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership

tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than

Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have

lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'

algorithms
paper
bloom-filters
cuckoo-filters
cuckoo-hashing
data-structures
false-positives
big-data
probabilistic
hashing
set-membership
approximation
tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than

Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have

lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'

march 2015 by jm

'Medians and Beyond: New Aggregation Techniques for Sensor Networks' [paper, PDF]

february 2013 by jm

'We introduce Quantile Digest or q-digest, a novel data structure which provides provable guarantees on approximation error and maximum resource consumption. In more concrete terms, if the values returned by the sensors are integers in the range [1;n], then using q-digest we can answer quantile queries using message size m within an error of O(log(n)/m). We also outline how we can use q-digest to answer other queries such as range queries, most frequent items and histograms. Another notable property of q-digest is that in addition to the theoretical worst case bound error, the structure carries with itself an estimate of error for this particular query.'

q-digest
algorithms
streams
approximation
histograms
median
percentiles
quantiles
february 2013 by jm

clearspring / stream-lib

february 2013 by jm

ASL-licensed open source library of stream-processing/approximation algorithms: count-min sketch, space-saving top-k, cardinality estimation, LogLog, HyperLogLog, MurmurHash, lookup3 hash, Bloom filters, q-digest, stochastic top-k

algorithms
coding
streams
cep
stream-processing
approximation
probabilistic
space-saving
top-k
cardinality
estimation
bloom-filters
q-digest
loglog
hyperloglog
murmurhash
lookup3
february 2013 by jm

'Eﬃcient Computation of Frequent and Top-k Elements in Data Streams' [paper, PDF]

february 2013 by jm

The Space-Saving algorithm to compute top-k in a stream. I've been asking a variation of this problem as an interview question for a while now, pretty cool to find such a neat solution. Pity neither myself nor anyone I've interviewed has come up with it ;)

space-saving
approximation
streams
stream-processing
cep
papers
pdf
algorithms
february 2013 by jm

Real-time Analytics in Scala [slides, PDF]

february 2013 by jm

some good approximation/streaming algorithms and tips on Scala implementation

streams
algorithms
approximation
coding
scala
slides
february 2013 by jm

**related tags**

Copy this bookmark: