**jm + probabilistic**
14

_Optimal Probabilistic Cache Stampede Prevention_ [pdf]

15 days ago by jm

'When a frequently-accessed cache item expires, multiple requests to that item can trigger a cache miss and start regenerating that same item at the same time. This phenomenon, known as cache stampede, severely limits the performance of databases and web servers. A natural countermeasure to this issue is to let the processes that perform such requests randomly ask for a regeneration before the expiration time of the item. In this paper we give optimal algorithms for performing such probabilistic early expirations. Our algorithms are theoretically optimal and have much better performance than other solutions used in real-world applications.'

(via Marc Brooker)
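The idea in the abstract can be sketched roughly as follows. This is an illustrative wrapper, not the paper's exact pseudocode: the `cache`, `recompute` and `beta` names are assumptions, with `delta` being the time the last regeneration took and `beta > 1` biasing toward earlier recomputation.

```python
import math
import random
import time

def probabilistic_fetch(key, ttl, recompute, cache, beta=1.0):
    """Probabilistic early expiration: occasionally regenerate a
    cached item *before* its expiry, with a probability that grows
    as the expiry time approaches, so concurrent readers are very
    unlikely to all regenerate at once."""
    entry = cache.get(key)
    now = time.time()
    if entry is not None:
        value, delta, expiry = entry
        # 1.0 - random.random() lies in (0, 1], so the log is finite
        # and the subtracted term is a positive, exponentially
        # distributed slack scaled by delta * beta.
        if now - delta * beta * math.log(1.0 - random.random()) < expiry:
            return value  # cache hit; no early regeneration this time
    start = time.time()
    value = recompute()
    delta = time.time() - start
    cache[key] = (value, delta, now + ttl)
    return value
```

The longer a regeneration takes (`delta`), the earlier the filter starts rolling the dice, which is exactly the behaviour you want to avoid a stampede at the expiry instant.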

via:marcbrooker
caching
caches
algorithm
probabilistic
expiration
vldb
papers
expiry
cache-miss
stampedes

15 days ago by jm

Fast Forward Labs: Probabilistic Data Structure Showdown: Cuckoo Filters vs. Bloom Filters

november 2016 by jm

Nice comparison of a counting Bloom filter and a Cuckoo Filter, implemented in Python:

algorithms
probabilistic
approximation
bloom-filters
cuckoo-filters
sets
estimation
python
This post provides an update by exploring Cuckoo filters, a new probabilistic data structure that improves upon the standard Bloom filter. The Cuckoo filter provides a few advantages: 1) it enables dynamic deletion and addition of items 2) it can be easily implemented compared to Bloom filter variants with similar capabilities, and 3) for similar space constraints, the Cuckoo filter provides lower false positives, particularly at lower capacities. We provide a python implementation of the Cuckoo filter here, and compare it to a counting Bloom filter (a Bloom filter variant).
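A counting Bloom filter, the Bloom variant used in this comparison, fits in a few lines of Python. This is a minimal sketch rather than the post's implementation; the sizes `m`, `k` and the SHA-256-based hashing are arbitrary choices.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: each slot holds a small counter
    rather than a single bit, so items can be deleted as well as
    added, at the cost of several bits per slot."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for ix in self._indexes(item):
            self.counts[ix] += 1

    def remove(self, item):
        # Only safe for items that were actually added; removing an
        # absent item underflows counters and causes false negatives.
        for ix in self._indexes(item):
            self.counts[ix] -= 1

    def __contains__(self, item):
        return all(self.counts[ix] > 0 for ix in self._indexes(item))
```

The cuckoo filter's claimed advantage is doing the same deletion trick with less space; see the "Cuckoo Filter: Practically Better Than Bloom" entry below.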

november 2016 by jm

Extracting Structured Data From Recipes Using Conditional Random Fields

april 2015 by jm

nice probabilistic/ML approach to recipe parsing

nytimes
recipes
parsing
text
nlp
machine-learning
probabilistic
crf++
algorithms
feature-extraction
april 2015 by jm

"Cuckoo Filter: Practically Better Than Bloom"

march 2015 by jm

'We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.'
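The core mechanism, partial-key cuckoo hashing, can be sketched as below. This is an illustration under stated assumptions, not the paper's reference implementation: the bucket count must be a power of two so that the XOR trick makes each bucket derivable from the other plus the fingerprint, which is what enables deletion.

```python
import random
import hashlib

class CuckooFilter:
    """Minimal cuckoo filter sketch: each item's short fingerprint
    lives in one of two candidate buckets; on overflow, a resident
    fingerprint is evicted to its alternate bucket, cuckoo-style."""
    def __init__(self, num_buckets=256, bucket_size=4, max_kicks=500):
        self.num_buckets = num_buckets  # must be a power of two here
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, data):
        return int(hashlib.sha256(data.encode()).hexdigest(), 16)

    def _fingerprint(self, item):
        return self._hash("fp:" + item) % 255 + 1  # 1..255, never 0

    def _indexes(self, item):
        fp = self._fingerprint(item)
        i1 = self._hash(item) % self.num_buckets
        # XOR with a hash of the fingerprint; self-inverse because
        # num_buckets is a power of two.
        i2 = (i1 ^ self._hash(str(fp))) % self.num_buckets
        return fp, i1, i2

    def add(self, item):
        fp, i1, i2 = self._indexes(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets full: evict a random resident and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ self._hash(str(fp))) % self.num_buckets
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # table is considered full

    def __contains__(self, item):
        fp, i1, i2 = self._indexes(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def remove(self, item):
        fp, i1, i2 = self._indexes(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Real implementations pack fingerprints into bit arrays; the list-of-lists here trades space for readability.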

algorithms
paper
bloom-filters
cuckoo-filters
cuckoo-hashing
data-structures
false-positives
big-data
probabilistic
hashing
set-membership
approximation

march 2015 by jm

"Invertible Bloom Lookup Tables" [paper]

september 2014 by jm

'We present a version of the Bloom filter data structure that supports not only the insertion, deletion, and lookup of key-value pairs, but also allows a complete listing of the pairs it contains with high probability, as long as the number of key-value pairs is below a designed threshold. Our structure allows the number of key-value pairs to greatly exceed this threshold during normal operation. Exceeding the threshold simply temporarily prevents content listing and reduces the probability of a successful lookup. If entries are later deleted to return the structure below the threshold, everything again functions appropriately. We also show that simple variations of our structure are robust to certain standard errors, such as the deletion of a key without a corresponding insertion or the insertion of two distinct values for a key. The properties of our structure make it suitable for several applications, including database and networking applications that we highlight.'
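A toy IBLT fits in a page: each cell keeps a count, an XOR of keys, and an XOR of values, and listing repeatedly "peels" pure cells (count 1). This sketch assumes integer keys and values so XOR works directly; the cell count and hash choices are illustrative, not from the paper.

```python
class IBLT:
    """Invertible Bloom Lookup Table sketch: supports insert, delete,
    and (while under the design load) a full listing of contents."""
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.cells = [[0, 0, 0] for _ in range(m)]  # count, keySum, valSum

    def _indexes(self, key):
        return [hash((i, key)) % self.m for i in range(self.k)]

    def insert(self, key, value):
        for ix in self._indexes(key):
            c = self.cells[ix]
            c[0] += 1
            c[1] ^= key
            c[2] ^= value

    def delete(self, key, value):
        for ix in self._indexes(key):
            c = self.cells[ix]
            c[0] -= 1
            c[1] ^= key
            c[2] ^= value

    def list_entries(self):
        """Peel cells with count == 1; each peel may expose new pure
        cells, so loop until no progress. Works with high probability
        while the number of stored pairs is below the threshold."""
        cells = [list(c) for c in self.cells]
        out = []
        progress = True
        while progress:
            progress = False
            for c in cells:
                if c[0] == 1:
                    key, value = c[1], c[2]
                    out.append((key, value))
                    for ix in self._indexes(key):
                        d = cells[ix]
                        d[0] -= 1
                        d[1] ^= key
                        d[2] ^= value
                    progress = True
        return out
```

Note how delete is just insert with the signs flipped, which is why deleting below the threshold restores full listability, as the abstract says.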

iblt
bloom-filters
data-structures
performance
algorithms
coding
papers
probabilistic
september 2014 by jm

3 Rules of thumb for Bloom Filters

august 2014 by jm

I often need to do rough back-of-the-envelope reasoning about things, and I find that doing a bit of work to develop an intuition for how a new technique performs is usually worthwhile. So, here are three broad rules of thumb to remember when discussing Bloom filters down the pub:

1. One byte per item in the input set gives about a 2% false positive rate.

2. The optimal number of hash functions is about 0.7 times the number of bits per item.

3. The number of hashes dominates performance.

But see also http://stackoverflow.com/a/9554448 , http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf (thanks Tony Finch!)

bloom-filters
algorithm
probabilistic
rules
reasoning
via:norman-maurer
false-positives
hashing
coding

august 2014 by jm
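The first two rules of thumb drop straight out of the standard Bloom filter formulas, p ≈ (1 − e^(−kn/m))^k and k_opt = (m/n)·ln 2; a few lines of Python confirm them for the one-byte-per-item case:

```python
import math

def bloom_false_positive_rate(bits_per_item, k):
    """Standard Bloom filter false-positive approximation:
    p ~ (1 - e^(-k / (m/n)))^k, where m/n is bits per item."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k

def optimal_hashes(bits_per_item):
    # k_opt = (m/n) * ln 2, i.e. about 0.69 hashes per bit-per-item
    return bits_per_item * math.log(2)

bits = 8                                   # one byte per item
k = optimal_hashes(bits)                   # about 5.5 hash functions
p = bloom_false_positive_rate(bits, k)     # about 0.021, i.e. ~2%
print(round(k, 2), round(p, 4))
```

So "0.7 times the bits per item" is just ln 2 rounded up, and 8 bits per item at the optimal k lands almost exactly on the quoted 2% false positive rate.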

Redis adds support for HyperLogLog

april 2014 by jm

good comment thread on HN, discussing hlld and bloomd as well
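For intuition on what Redis is doing under the hood, here is a toy HyperLogLog in Python. It is a bare sketch with assumed parameters (`b` register-index bits, SHA-256 as the hash) and omits the small- and large-range bias corrections a production implementation needs.

```python
import hashlib

def hll_estimate(items, b=10):
    """Tiny HyperLogLog sketch: m = 2^b registers each remember the
    maximum "rank" (position of the first 1-bit) seen in the hashed
    items routed to them; a harmonic mean over 2^-register values
    yields a cardinality estimate with ~1.04/sqrt(m) relative error."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        x = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        idx = x & (m - 1)   # low b bits choose a register
        w = x >> b          # remaining bits provide the rank
        rank = 1
        while w & 1 == 0 and rank <= 64:
            w >>= 1
            rank += 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # standard bias constant, m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

With b=10 (1024 registers, roughly 1 KB of state) the relative error is around 3%, which is the trade the HN thread is discussing.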

hll
bloom-filters
hyperloglog
redis
data-structures
estimation
cardinality
probabilistic
probability
hashing
random
april 2014 by jm

Probabilistic Scraping of Plain Text Tables

september 2013 by jm

a nifty hack.

(via proggit)

scraping
tables
ocr
probabilistic
linear-programming
optimization
machine-learning
via:proggit
Recently I have been banging my head trying to import a ton of OCR acquired data expressed in tabular form. I think I have come up with a neat approach using probabilistic reasoning combined with mixed integer programming. The method is pretty robust to all sorts of real world issues. In particular, the method leverages topological understanding of tables, encodes it declaratively into a mixed integer/linear program, and integrates weak probabilistic signals to classify the whole table in one go (at sub second speeds). This method can be used for any kind of classification where you have strong logical constraints but noisy data.

september 2013 by jm

Flower Filter

july 2013 by jm

'A simple time-decaying approximate membership filter' -- like a Bloom filter with time decay. See also http://eng.42go.com/flower-filter-an-update/ for some notes on the non-independence of survival probabilities, and how that imposes negligible differences in practice.
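The general shape of a time-decaying membership filter can be illustrated as below. Note this is not the flower filter's actual mechanism (which decays bits probabilistically); it is a simpler assumed variant that stores a last-set timestamp per slot, so stale items fade out of the set after the window elapses.

```python
import time
import hashlib

class TimeDecayingFilter:
    """Illustrative time-decaying approximate membership filter:
    like a Bloom filter, but each slot records when it was last set,
    and lookups only count slots refreshed within the decay window."""
    def __init__(self, m=1024, k=4, window=60.0):
        self.m, self.k, self.window = m, k, window
        self.slots = [0.0] * m

    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        now = time.time()
        for ix in self._indexes(item):
            self.slots[ix] = now

    def __contains__(self, item):
        now = time.time()
        return all(now - self.slots[ix] < self.window
                   for ix in self._indexes(item))
```

The timestamp-per-slot approach costs more memory than decaying bits, which is exactly the kind of trade-off the linked follow-up post analyses.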

bloom-filter
algorithms
coding
probabilistic
approximate
time
decay
july 2013 by jm

Google Translate of "Lorem ipsum"

june 2013 by jm

The perils of unsupervised machine learning... here's what GTranslate reckons "lorem ipsum" translates to:

lorem-ipsum
boilerplate
machine-learning
translation
google
translate
probabilistic
tomato-sauce
cisco
funny
We will be sure to post a comment. Add tomato sauce, no tank or a traditional or online. Until outdoor environment, and not just any competition, reduce overall pain. Cisco Security, they set up in the throat develop the market beds of Cura; Employment silently churn-class by our union, very beginner himenaeos. Monday gate information. How long before any meaningful development. Until mandatory functional requirements to developers. But across the country in the spotlight in the notebook. The show was shot. Funny lion always feasible, innovative policies hatred assured. Information that is no corporate Japan

june 2013 by jm

clearspring / stream-lib

february 2013 by jm

ASL-licensed open source library of stream-processing/approximation algorithms: count-min sketch, space-saving top-k, cardinality estimation, LogLog, HyperLogLog, MurmurHash, lookup3 hash, Bloom filters, q-digest, stochastic top-k
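Of the algorithms in that list, the count-min sketch is perhaps the simplest to demystify. Here is a minimal Python sketch (not stream-lib's Java implementation); the width/depth defaults and SHA-256 row hashing are arbitrary choices.

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: d rows of w counters. Each update increments
    one counter per row; a point query takes the minimum across rows,
    so estimates can only over-count, never under-count."""
    def __init__(self, w=256, d=4):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))
```

The error guarantee is additive: with width w and depth d, the over-count stays below (e/w)·total-mass with probability 1 − e^(−d), which is why the sketch works well for heavy hitters and badly for rare items.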

algorithms
coding
streams
cep
stream-processing
approximation
probabilistic
space-saving
top-k
cardinality
estimation
bloom-filters
q-digest
loglog
hyperloglog
murmurhash
lookup3
february 2013 by jm

Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog

may 2012 by jm

Stream summary, count-min sketches, loglog counting, linear counters. Some nifty algorithms for probabilistic estimation of element frequencies and data-set cardinality (via proggit)
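Linear counting, the simplest of the techniques covered, is worth seeing in full; this is a hedged minimal sketch (bitmap size `m` and SHA-256 hashing are assumed choices, and the saturation handling is simplified).

```python
import math
import hashlib

def linear_count(items, m=4096):
    """Linear counting sketch: hash each item into an m-bit bitmap
    and estimate cardinality from the fraction of bits still zero,
    n ~ -m * ln(zeros / m). Accurate while the bitmap is far from
    full; beyond that, switch to a larger m or another estimator."""
    bitmap = [False] * m
    for item in items:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        bitmap[h % m] = True
    zeros = bitmap.count(False)
    if zeros == 0:
        return float(m)  # bitmap saturated; the estimate breaks down
    return -m * math.log(zeros / m)
```

Compared with loglog counting, linear counting needs memory proportional to the cardinality, but it is more accurate at small cardinalities, which is why the two are often combined.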

via:proggit
algorithms
probability
probabilistic
count-min
stream-summary
loglog-counting
linear-counting
estimation
big-data
may 2012 by jm

Golomb-coded sets

september 2011 by jm

'a probabilistic data structure conceptually similar to a Bloom filter, but with a more compact in-memory representation, and a slower query time.' could come in handy
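The construction can be sketched in Python: hash each of N items into [0, N·P), sort, then Golomb-Rice-encode the gaps, which are geometrically distributed with mean about P. This is an unpacked illustration (a string of '0'/'1' characters rather than real bit-packing), with a false positive rate of roughly 1/P and the slower, sequential-decode queries the description mentions.

```python
import hashlib

def _hash_to_range(item, n, p):
    h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
    return h % (n * p)

def gcs_build(items, p=64):
    """Build a Golomb-coded set: sorted hash values, delta-encoded
    with Rice coding (unary quotient + fixed-width remainder)."""
    n = len(items)
    values = sorted(_hash_to_range(x, n, p) for x in items)
    rbits = p.bit_length() - 1  # log2(p) remainder bits; p power of 2
    bits = []
    prev = 0
    for v in values:
        q, r = divmod(v - prev, p)
        prev = v
        bits.append("1" * q + "0")          # quotient in unary
        bits.append(format(r, f"0{rbits}b"))  # remainder, fixed width
    return "".join(bits), n

def gcs_contains(encoded, item, p=64):
    """Query by decoding gaps until we pass the target hash value."""
    bitstr, n = encoded
    target = _hash_to_range(item, n, p)
    rbits = p.bit_length() - 1
    pos, value = 0, 0
    while pos < len(bitstr):
        q = 0
        while bitstr[pos] == "1":
            q += 1
            pos += 1
        pos += 1  # skip the terminating '0'
        r = int(bitstr[pos:pos + rbits], 2)
        pos += rbits
        value += q * p + r
        if value == target:
            return True
        if value > target:
            return False
    return False
```

The whole-structure scan on every query is the "slower query time" trade-off; the payoff is that the gap encoding sits close to the information-theoretic minimum, beating a Bloom filter's space by roughly 30%.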

gcs
bloom-filters
probabilistic
data-structures
memory
algorithms
september 2011 by jm
