[1902.04023] Computing Extremely Accurate Quantiles Using t-Digests

4 weeks ago by jm

'We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile q to be computed with an accuracy relative to max(q,1−q) rather than absolute accuracy as with most other methods. This new algorithm is robust with respect to skewed distributions or ordered datasets and allows separately computed summaries to be combined with no loss in accuracy. An open-source Java implementation of this algorithm is available from the author. Independent implementations in Go and Python are also available.'

(via Tony Finch)

java
go
python
via:fanf
open-source
quantiles
percentiles
approximation
statistics
sketching
algorithms
(via Tony Finch)

4 weeks ago by jm

Caffeine cache adopts Window TinyLfu eviction policy

november 2015 by jm

'Caffeine is a Java 8 rewrite of Guava's cache. In this version we focused on improving the hit rate by evaluating alternatives to the classic least-recenty-used (LRU) eviction policy. In collaboration with researchers at Israel's Technion, we developed a new algorithm that matches or exceeds the hit rate of the best alternatives (ARC, LIRS). A paper of our work is being prepared for publication.'

Specifically:

tinylfu
caches
caching
cache-eviction
java8
guava
caffeine
lru
count-min
sketching
algorithms
Specifically:

W-TinyLfu uses a small admission LRU that evicts to a large Segmented LRU if accepted by the TinyLfu admission policy. TinyLfu relies on a frequency sketch to probabilistically estimate the historic usage of an entry. The window allows the policy to have a high hit rate when entries exhibit a high temporal / low frequency access pattern which would otherwise be rejected. The configuration enables the cache to estimate the frequency and recency of an entry with low overhead. This implementation uses a 4-bit CountMinSketch, growing at 8 bytes per cache entry to be accurate. Unlike ARC and LIRS, this policy does not retain non-resident keys.

november 2015 by jm

Philippe Flajolet’s contribution to streaming algorithms [preso]

november 2013 by jm

Nice deck covering HyperLogLog and its origins, plus a slide at the end covering the Flajolet/Wegman Adaptive Sampling algorithm ("how do you count the number of elements which appear only once in stream using constant size memory?")

algorithms
sketching
hyperloglog
flajolet
wegman
adaptive-sampling
sampling
presentations
slides
november 2013 by jm

Recordinality

august 2013 by jm

a new, and interesting, sketching algorithm, with a Java implementation:

sketching
coding
algorithms
recordinality
cardinality
estimation
hll
hashing
murmurhash
java
Recordinality is unique in that it provides cardinality estimation like HLL, but also offers "distinct value sampling." This means that Recordinality can allow us to fetch a random sample of distinct elements in a stream, invariant to cardinality. Put more succinctly, given a stream of elements containing 1,000,000 occurrences of 'A' and one occurrence each of 'B' - 'Z', the probability of any letter appearing in our sample is equal. Moreover, we can also efficiently store the number of times elements in our distinct sample have been observed. This can help us to understand the distribution of occurrences of elements in our stream. With it, we can answer questions like "do the elements we've sampled present in a power law-like pattern, or is the distribution of occurrences relatively even across the set?"

august 2013 by jm

Sketch of the Day: K-Minimum Values

june 2013 by jm

Another sketching algorithm -- this one supports set union and intersection operations more easily than HyperLogLog when there are more than 2 sets

algorithms
coding
space-saving
cardinality
streams
stream-processing
estimation
sets
sketching
june 2013 by jm

js-hll

june 2013 by jm

Good UI for exploration of HyperLogLog set intersections and unions.

javascript
ui
hll
hyperloglog
algorithms
sketching
js
sets
intersection
union
apache
open-source
One of the first things that we wanted to do with HyperLogLog when we first started playing with it was to support and expose it natively in the browser. The thought of allowing users to directly interact with these structures -- perform arbitrary unions and intersections on effectively unbounded sets all on the client -- was exhilarating to us. [...] we are pleased to announce the open-source release of AK’s HyperLogLog implementation for JavaScript, js-hll. We are releasing this code under the Apache License, Version 2.0.

We knew that we couldn’t just release a bunch of JavaScript code without allowing you to see it in action — that would be a crime. We passed a few ideas around and the one that kept bubbling to the top was a way to kill two birds with one stone. We wanted something that would showcase what you can do with HLL in the browser and give us a tool for explaining HLLs. It is typical for us to explain how HLL intersections work using a Venn diagram. You draw some overlapping circles with a border that represents the error and you talk about how if that border is close to or larger than the intersection then you can’t say much about the size of that intersection. This works just ok on a whiteboard but what you really want is to just build a visualization that allows you to select from some sets and see the overlap. Maybe even play with the precision a little bit to see how that changes the result. Well, we did just that!

june 2013 by jm

hlld

(via:cscotta)
hyper-log-log
hlld
hll
data-structures
memcached
daemons
sketching
estimation
big-data
cardinality
algorithms
via:cscotta

june 2013 by jm

a high-performance C server which is used to expose HyperLogLog sets and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached.

HyperLogLog's are a relatively new sketching data structure. They are used to estimate cardinality, i.e. the unique number of items in a set. They are based on the observation that any bit in a "good" hash function is indepedenent of any other bit and that the probability of getting a string of N bits all set to the same value is 1/(2^N). There is a lot more in the math, but that is the basic intuition. What is even more incredible is that the storage required to do the counting is log(log(N)). So with a 6 bit register, we can count well into the trillions. For more information, its best to read the papers referenced at the end. TL;DR: HyperLogLogs enable you to have a set with about 1.6% variance, using 3280 bytes, and estimate sizes in the trillions.

(via:cscotta)

june 2013 by jm

**related tags**

Copy this bookmark: