jm + approximation   8

[1902.04023] Computing Extremely Accurate Quantiles Using t-Digests
'We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile q to be computed with an accuracy relative to max(q,1−q) rather than absolute accuracy as with most other methods. This new algorithm is robust with respect to skewed distributions or ordered datasets and allows separately computed summaries to be combined with no loss in accuracy. An open-source Java implementation of this algorithm is available from the author. Independent implementations in Go and Python are also available.'

(via Tony Finch)
java  go  python  via:fanf  open-source  quantiles  percentiles  approximation  statistics  sketching  algorithms 
4 weeks ago by jm
Fast Forward Labs: Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters
Nice comparison of a counting Bloom filter and a Cuckoo Filter, implemented in Python:
This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).
algorithms  probabilistic  approximation  bloom-filters  cuckoo-filters  sets  estimation  python 
november 2016 by jm
"last seen" sketch
a new sketch algorithm from Baron Schwartz and Preetam Jinka of VividCortex; similar to Count-Min but with last-seen timestamp instead of frequency.
sketch  algorithms  estimation  approximation  sampling  streams  big-data 
july 2015 by jm
"Cuckoo Filter: Practically Better Than Bloom"
'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership
tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than
Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have
lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
algorithms  paper  bloom-filters  cuckoo-filters  cuckoo-hashing  data-structures  false-positives  big-data  probabilistic  hashing  set-membership  approximation 
march 2015 by jm
'Medians and Beyond: New Aggregation Techniques for Sensor Networks' [paper, PDF]
'We introduce Quantile Digest or q-digest, a novel data structure which provides provable guarantees on approximation error and maximum resource consumption. In more concrete terms, if the values returned by the sensors are integers in the range [1;n], then using q-digest we can answer quantile queries using message size m within an error of O(log(n)/m). We also outline how we can use q-digest to answer other queries such as range queries, most frequent items and histograms. Another notable property of q-digest is that in addition to the theoretical worst case bound error, the structure carries with itself an estimate of error for this particular query.'
q-digest  algorithms  streams  approximation  histograms  median  percentiles  quantiles 
february 2013 by jm
clearspring / stream-lib
ASL-licensed open source library of stream-processing/approximation algorithms: count-min sketch, space-saving top-k, cardinality estimation, LogLog, HyperLogLog, MurmurHash, lookup3 hash, Bloom filters, q-digest, stochastic top-k
algorithms  coding  streams  cep  stream-processing  approximation  probabilistic  space-saving  top-k  cardinality  estimation  bloom-filters  q-digest  loglog  hyperloglog  murmurhash  lookup3 
february 2013 by jm
'Efficient Computation of Frequent and Top-k Elements in Data Streams' [paper, PDF]
The Space-Saving algorithm to compute top-k in a stream. I've been asking a variation of this problem as an interview question for a while now, pretty cool to find such a neat solution. Pity neither myself nor anyone I've interviewed has come up with it ;)
space-saving  approximation  streams  stream-processing  cep  papers  pdf  algorithms 
february 2013 by jm
Real-time Analytics in Scala [slides, PDF]
some good approximation/streaming algorithms and tips on Scala implementation
streams  algorithms  approximation  coding  scala  slides 
february 2013 by jm

Copy this bookmark: