jm + hll   8

HyperLogLog - Intersection Arithmetic
'In general HLL intersection in StreamLib works.  |A INTERSECT B|
= |A| + |B| - |A UNION B|.  Timon's article on intersection is
important to read though.  The usefulness of HLL intersection depends
on the features of the HLLs you are intersecting.'
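
The inclusion-exclusion arithmetic quoted above can be sketched with a toy HLL. This is a minimal illustration, not StreamLib's API: the `TinyHLL` class, the SHA-1 based hash, and the precision of 14 are all assumptions made for this example.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog for illustration only (hypothetical, not StreamLib)."""

    def __init__(self, p=14):
        self.p = p                      # precision: number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from SHA-1 (an assumption; any good hash works).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        bits = 64 - self.p
        idx = h >> bits                         # top p bits pick a register
        rest = h & ((1 << bits) - 1)            # remaining bits
        rank = bits - rest.bit_length() + 1     # position of the first 1 bit
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Union is lossless: take the register-wise maximum.
        out = TinyHLL(self.p)
        out.registers = [max(x, y) for x, y in zip(self.registers, other.registers)]
        return out

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting for small sets
        return raw


def intersection_estimate(a, b):
    # |A INTERSECT B| = |A| + |B| - |A UNION B|
    return a.count() + b.count() - a.union(b).count()


a, b = TinyHLL(), TinyHLL()
for i in range(100_000):
    a.add(i)
for i in range(50_000, 150_000):
    b.add(i)
print("estimated overlap:", round(intersection_estimate(a, b)), "(true overlap: 50000)")
```

The error of the result is the combined error of three estimates, so it stays roughly constant in absolute terms. When the true overlap is small relative to the sets, that error can be as large as the overlap itself, which is why the usefulness depends on the HLLs being intersected.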
hyperloglog  hll  hyperloglogplus  streamlib  intersections  sets  estimation  algorithms 
april 2014 by jm
Druid | How We Scaled HyperLogLog: Three Real-World Optimizations
Three optimizations have been made to the HLL algorithm to scale it up for production use at Metamarkets: compacting registers (fixes a bug with unions of multiple HLLs); a sparse storage format (to optimize space); and faster lookups using a lookup table.
metamarkets  scaling  hyperloglog  hll  algorithms  performance  optimization  counting  estimation 
april 2014 by jm
A new and interesting sketching algorithm, with a Java implementation:
Recordinality is unique in that it provides cardinality estimation like HLL, but also offers "distinct value sampling." This means that Recordinality can allow us to fetch a random sample of distinct elements in a stream, invariant to cardinality. Put more succinctly, given a stream of elements containing 1,000,000 occurrences of 'A' and one occurrence each of 'B' - 'Z', the probability of any letter appearing in our sample is equal. Moreover, we can also efficiently store the number of times elements in our distinct sample have been observed. This can help us to understand the distribution of occurrences of elements in our stream. With it, we can answer questions like "do the elements we've sampled present in a power law-like pattern, or is the distribution of occurrences relatively even across the set?"
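
A rough sketch of the "distinct value sampling" idea only, assuming the sample is the k distinct elements with the largest hash values. The function name `distinct_sample`, the SHA-1 based hash, and the shortened stream are assumptions for this example; it is not the linked Java implementation and does not include the cardinality estimator.

```python
import hashlib
import heapq
import random


def distinct_sample(stream, k):
    """Keep the k distinct elements with the largest hash values, with counts."""
    heap = []     # min-heap of (hash, element); heap[0] holds the smallest kept hash
    counts = {}   # element -> number of times it has been observed
    for x in stream:
        if x in counts:
            counts[x] += 1          # already sampled: just count the occurrence
            continue
        h = int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")
        if len(heap) < k:
            heapq.heappush(heap, (h, x))
            counts[x] = 1
        elif h > heap[0][0]:
            # New element beats the smallest kept hash: evict that element.
            _, evicted = heapq.heapreplace(heap, (h, x))
            del counts[evicted]
            counts[x] = 1
    return counts


# 10,000 occurrences of 'A' (shortened from the 1,000,000 in the quote)
# plus one occurrence each of 'B' to 'Z'.
stream = ["A"] * 10_000 + [chr(c) for c in range(ord("B"), ord("Z") + 1)]
random.shuffle(stream)
print(distinct_sample(stream, k=5))
```

Membership depends only on an element's hash, not on how often it occurs, so 'A' is no more likely to be sampled than any other letter. The smallest kept hash only ever rises, so an element is either admitted at its first occurrence or never, and the stored counts for sampled elements are exact.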
sketching  coding  algorithms  recordinality  cardinality  estimation  hll  hashing  murmurhash  java 
august 2013 by jm
Good UI for exploration of HyperLogLog set intersections and unions.
One of the first things that we wanted to do with HyperLogLog when we first started playing with it was to support and expose it natively in the browser. The thought of allowing users to directly interact with these structures -- perform arbitrary unions and intersections on effectively unbounded sets all on the client -- was exhilarating to us. [...] we are pleased to announce the open-source release of AK’s HyperLogLog implementation for JavaScript, js-hll. We are releasing this code under the Apache License, Version 2.0.

We knew that we couldn’t just release a bunch of JavaScript code without allowing you to see it in action — that would be a crime. We passed a few ideas around and the one that kept bubbling to the top was a way to kill two birds with one stone. We wanted something that would showcase what you can do with HLL in the browser and give us a tool for explaining HLLs. It is typical for us to explain how HLL intersections work using a Venn diagram. You draw some overlapping circles with a border that represents the error and you talk about how if that border is close to or larger than the intersection then you can’t say much about the size of that intersection. This works just ok on a whiteboard but what you really want is to just build a visualization that allows you to select from some sets and see the overlap. Maybe even play with the precision a little bit to see how that changes the result. Well, we did just that!
javascript  ui  hll  hyperloglog  algorithms  sketching  js  sets  intersection  union  apache  open-source 
june 2013 by jm
a high-performance C server which is used to expose HyperLogLog sets and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached.

HyperLogLogs are a relatively new sketching data structure. They are used to estimate cardinality, i.e. the unique number of items in a set. They are based on the observation that any bit in a "good" hash function is independent of any other bit and that the probability of getting a string of N bits all set to the same value is 1/(2^N). There is a lot more in the math, but that is the basic intuition. What is even more incredible is that the storage required to do the counting is log(log(N)). So with a 6 bit register, we can count well into the trillions. For more information, it's best to read the papers referenced at the end. TL;DR: HyperLogLogs enable you to have a set with about 1.6% variance, using 3280 bytes, and estimate sizes in the trillions.
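
The 1.6% and 3280-byte figures can be checked with a little arithmetic. This is a sketch under stated assumptions: a precision of 12 (4096 registers) and a packing of five 6-bit registers per 32-bit word are guesses that happen to reproduce the quoted numbers, not details confirmed by the text above. The function names are made up for this example.

```python
import math


def standard_error(precision):
    # Standard HLL relative error: 1.04 / sqrt(m), with m = 2^precision registers.
    return 1.04 / math.sqrt(1 << precision)


def dense_bytes(precision, bits_per_register=6):
    # Registers packed back to back with no padding.
    return (1 << precision) * bits_per_register // 8


def word_packed_bytes(precision, registers_per_word=5, word_bytes=4):
    # Assumed layout: five 6-bit registers in each 32-bit word (2 bits unused).
    words = math.ceil((1 << precision) / registers_per_word)
    return words * word_bytes


p = 12  # assumed precision
print(f"error ~ {standard_error(p):.2%}")
print("tightly packed bytes:", dense_bytes(p))
print("word-packed bytes:   ", word_packed_bytes(p))
```

A 6-bit register can hold a rank of up to 63, which is where the "count well into the trillions" headroom comes from.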

hyper-log-log  hlld  hll  data-structures  memcached  daemons  sketching  estimation  big-data  cardinality  algorithms  via:cscotta 
june 2013 by jm
