Iterating over hash sets quickly in Java

march 2018 by jm

Interesting datum:

performance
java
locality
memory
cache-friendly
data-structures
hashsets
linkedhashsets
sets
My numbers are clear: in my tests, it is three times faster to sum up the values in a LinkedHashSet [than a HashSet].

march 2018 by jm

Fast Forward Labs: Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters

november 2016 by jm

Nice comparison of a counting Bloom filter and a Cuckoo Filter, implemented in Python:

algorithms
probabilistic
approximation
bloom-filters
cuckoo-filters
sets
estimation
python
This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).

november 2016 by jm

Roaring Bitmaps

bitmaps
bitsets
sets
data-structures
bits
compression
lucene
spark
daniel-lemire
algorithms

november 2014 by jm

Bitsets, also called bitmaps, are commonly used as fast data structures. Unfortunately, they can use too much memory. To compensate, we often use compressed bitmaps. Roaring bitmaps are compressed bitmaps which tend to outperform conventional compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can be hundreds of times faster and they often offer significantly better compression.

Roaring bitmaps are used in Apache Lucene (as of version 5.0 using an independent implementation) and Apache Spark (as of version 1.2).

november 2014 by jm

Cuckoo Filters

october 2014 by jm

'In many networking systems, Bloom filters are used for high-speed set membership tests. They permit a small fraction of false positive answers with very good space efficiency. However, they do not permit deletion of items from the set, and previous attempts to extend “standard” Bloom filters to support deletion all degrade either space or performance. We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set member- ship tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters out-perform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'

algorithms
cs
coding
cuckoo-filters
bloom-filters
sets
data-structures
october 2014 by jm

HyperLogLog - Intersection Arithmetic

hyperloglog
hll
hyperloglogplus
streamlib
intersections
sets
estimation
algorithms

april 2014 by jm

'In general HLL intersection in StreamLib works. |A INTERSECT B|

= |A| + |B| - |A UNION B|. Timon's article on intersection is

important to read though. The usefulness of HLL intersection depends

on the features of the HLLs you are intersecting.'

april 2014 by jm

Sketch of the Day: K-Minimum Values

june 2013 by jm

Another sketching algorithm -- this one supports set union and intersection operations more easily than HyperLogLog when there are more than 2 sets

algorithms
coding
space-saving
cardinality
streams
stream-processing
estimation
sets
sketching
june 2013 by jm

js-hll

june 2013 by jm

Good UI for exploration of HyperLogLog set intersections and unions.

javascript
ui
hll
hyperloglog
algorithms
sketching
js
sets
intersection
union
apache
open-source
One of the first things that we wanted to do with HyperLogLog when we first started playing with it was to support and expose it natively in the browser. The thought of allowing users to directly interact with these structures -- perform arbitrary unions and intersections on effectively unbounded sets all on the client -- was exhilarating to us. [...] we are pleased to announce the open-source release of AK’s HyperLogLog implementation for JavaScript, js-hll. We are releasing this code under the Apache License, Version 2.0.

We knew that we couldn’t just release a bunch of JavaScript code without allowing you to see it in action — that would be a crime. We passed a few ideas around and the one that kept bubbling to the top was a way to kill two birds with one stone. We wanted something that would showcase what you can do with HLL in the browser and give us a tool for explaining HLLs. It is typical for us to explain how HLL intersections work using a Venn diagram. You draw some overlapping circles with a border that represents the error and you talk about how if that border is close to or larger than the intersection then you can’t say much about the size of that intersection. This works just ok on a whiteboard but what you really want is to just build a visualization that allows you to select from some sets and see the overlap. Maybe even play with the precision a little bit to see how that changes the result. Well, we did just that!

june 2013 by jm

aaw/hyperloglog-redis - GitHub

january 2013 by jm

'This gem is a pure Ruby implementation of the HyperLogLog algorithm for estimating cardinalities of sets observed via a stream of events. A Redis instance is used for storing the counters.'

cardinality
sets
redis
algorithms
ruby
gems
hyperloglog
january 2013 by jm

Google Guava BloomFIlter

march 2012 by jm

neat, Guava now has a builtin Bloom filter implementation using the murmur hash. that'll potentially save a little hassle in the future

guava
coding
java
bloom-filters
data-structures
sets
march 2012 by jm

**related tags**

Copy this bookmark: