jm + bloom-filters   19

A general purpose counting filter
This paper introduces a new AMQ data structure, a Counting Quotient Filter, which addresses all of these shortcomings and performs extremely well in both time and space: CQF performs in-memory inserts and queries up to an order of magnitude faster than the original quotient filter structure from which it takes its inspiration, several times faster than a Bloom filter, and similarly to a cuckoo filter. The CQF structure is comparable or more space efficient than all of them too. Moreover, CQF does all of this while supporting counting, outperforming all of the other forms in both dimensions even though they do not. In short, CQF is a big deal!
cqf  counting-quotient-filters  data-structures  via:acolyer  coding  approximate  bloom-filters 
6 weeks ago by jm
Fast Forward Labs: Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters
Nice comparison of a counting Bloom filter and a Cuckoo Filter, implemented in Python:
This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).
algorithms  probabilistic  approximation  bloom-filters  cuckoo-filters  sets  estimation  python 
november 2016 by jm
"Cuckoo Filter: Practically Better Than Bloom"
'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership
tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than
Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have
lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
algorithms  paper  bloom-filters  cuckoo-filters  cuckoo-hashing  data-structures  false-positives  big-data  probabilistic  hashing  set-membership  approximation 
march 2015 by jm
Cuckoo Filters
'In many networking systems, Bloom filters are used for high-speed set membership tests. They permit a small fraction of false positive answers with very good space efficiency. However, they do not permit deletion of items from the set, and previous attempts to extend “standard” Bloom filters to support deletion all degrade either space or performance. We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set member- ship tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters out-perform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
algorithms  cs  coding  cuckoo-filters  bloom-filters  sets  data-structures 
october 2014 by jm
"Invertible Bloom Lookup Tables" [paper]
'We present a version of the Bloom filter data structure that supports not only the insertion, deletion, and lookup of key-value pairs, but also allows a complete listing of the pairs it contains with high probability, as long the number of key- value pairs is below a designed threshold. Our structure allows the number of key-value pairs to greatly exceed this threshold during normal operation. Exceeding the threshold simply temporarily prevents content listing and reduces the probability of a successful lookup. If entries are later deleted to return the structure below the threshold, everything again functions appropriately. We also show that simple variations of our structure are robust to certain standard errors, such as the deletion of a key without a corresponding insertion or the insertion of two distinct values for a key. The properties of our structure make it suitable for several applications, including database and networking applications that we highlight.'
iblt  bloom-filters  data-structures  performance  algorithms  coding  papers  probabilistic 
september 2014 by jm
3 Rules of thumb for Bloom Filters
I often need to do rough back-of-the-envelope reasoning about things, and I find that doing a bit of work to develop an intuition for how a new technique performs is usually worthwhile. So, here are three broad rules of thumb to remember when discussing Bloom filters down the pub:

One byte per item in the input set gives about a 2% false positive rate.

The optimal number of hash functions is about 0.7 times the number of bits per item.

3 - The number of hashes dominates performance.

But see also http://stackoverflow.com/a/9554448 , http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf (thanks Tony Finch!)
bloom-filters  algorithm  probabilistic  rules  reasoning  via:norman-maurer  false-positives  hashing  coding 
august 2014 by jm
bloomd
a high-performance C server which is used to expose bloom filters and operations over them to networked clients. It uses a simple ASCII protocol which is human readable, and similar to memcached.
(via Tony Finch)
via:fanf  memcached  bloomd  open-source  bloom-filters 
march 2013 by jm
clearspring / stream-lib
ASL-licensed open source library of stream-processing/approximation algorithms: count-min sketch, space-saving top-k, cardinality estimation, LogLog, HyperLogLog, MurmurHash, lookup3 hash, Bloom filters, q-digest, stochastic top-k
algorithms  coding  streams  cep  stream-processing  approximation  probabilistic  space-saving  top-k  cardinality  estimation  bloom-filters  q-digest  loglog  hyperloglog  murmurhash  lookup3 
february 2013 by jm
Google Guava BloomFIlter
neat, Guava now has a builtin Bloom filter implementation using the murmur hash. that'll potentially save a little hassle in the future
guava  coding  java  bloom-filters  data-structures  sets 
march 2012 by jm
Golomb-coded sets
'a probabilistic data structure conceptually similar to a Bloom filter, but with a more compact in-memory representation, and a slower query time.' could come in handy
gcs  bloom-filters  probabilistic  data-structures  memory  algorithms 
september 2011 by jm
Lucene Utilities and Bloom Filters - Greplin:tech
'Storing 50,000 2.5KB items in a traditional hash set requires over 125MB, but if you're willing to accept a 1-in-10,000 false positive rate on lookups, [this] bloom filter requires under 500KB' - interesting variation on the basic concept.  Java, Apache-licensed
search  bloom-filters  greplin  open-source  apache  false-positives  from delicious
april 2011 by jm
Tony Finch - Some notes on Bloom filters
more good Bloom Filter tips. he says: 'I take a slightly different tack, starting with a target population in mind which determines the size of the filter. Also there's a minor error regarding performance in the corte.si post. You only need to calculate two hash functions, and use a linear combination of them to index the Bloom filter. This simplifies the coding a lot, and if hash calculation dominates filter indexing, it's also a lot faster.'
bloom-filters  tips  coding  via:fanf  false-positives  from delicious
november 2010 by jm
/~colmmacc/ » Prime and Proper
algorithm to perform set membership tests on enumerated sets quickly and memory-efficiently, using multiplication by primes. Nice trick
hacks  colmmacc  prime-numbers  set-membership  bloom-filters  bignums  algorithms  programming  from delicious
september 2010 by jm
_Fast Cache for Your Text: Accelerating Exact Pattern Matching with Feed-Forward Bloom Filters_ [PDF]
intriguing application of a Bloom Filter optimised for modern CPUs (2-level, with a cache-partitioned first level), providing massive speedups vs GNU grep or trie-based approaches like Aho-Corasick -- or possibly re2c, as used in "sa-compile". On the other hand, a perl implementation of Rabin-Karp, which is similar, didn't perform as well. Still, may be worth investigating
bloom-filters  grep  filtering  spamassassin  sa-compile  text-matching  caches  aho-corasick  from delicious
september 2010 by jm

Copy this bookmark:



description:


tags: