Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog

199 bookmarks. First posted by jmeagher may 2012.

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets oft...

data-mining
june 2015 by kijowski

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, the raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computing more advanced metrics, like the number of unique visitors or the most frequent items, is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade estimation precision for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact (sometimes astonishingly compact) replacement for raw data in stream-based computing.

september 2013 by tguemes

Count-Min Sketch and similar techniques are not the only family of structures that allow one to estimate frequency-related metrics. Another large family of algorithms and structures for frequency estimation is the counter-based techniques. The Stream-Summary algorithm [8] belongs to this family. Stream-Summary allows one to detect the most frequent items in a dataset and estimate their frequencies with an explicitly tracked estimation error.

algorithms
datamining
probability
countmin
sketch
cs
august 2013 by euler
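
The counter-based idea in the Stream-Summary excerpt above can be illustrated with a minimal sketch. It assumes the Space-Saving replacement rule (evict the smallest counter and inherit its count as the error bound) and uses plain dictionaries with a linear scan for the minimum, rather than the linked Stream-Summary structure from [8]. The class and method names are hypothetical and not taken from the article.

```python
from typing import Dict, Hashable, List, Tuple


class SpaceSaving:
    """Counter-based frequent-items sketch (illustrative, not the article's code).

    Keeps at most `capacity` counters. For each tracked item it stores an
    estimated count and an error bound, so that
    count - error <= true frequency <= count.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts: Dict[Hashable, int] = {}
        self.errors: Dict[Hashable, int] = {}

    def add(self, item: Hashable) -> None:
        if item in self.counts:
            # Already tracked: just increment its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Free slot: start tracking with no error.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # No free slot: replace the item with the smallest counter.
            # Assumption: a linear scan stands in for the O(1) minimum
            # lookup that a real Stream-Summary structure provides.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor  # the new item may be overcounted by this much

    def top(self, k: int) -> List[Tuple[Hashable, int, int]]:
        """Return up to k (item, estimated count, error bound) triples."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [(item, count, self.errors[item]) for item, count in ranked[:k]]


sketch = SpaceSaving(capacity=3)
for token in ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]:
    sketch.add(token)
print(sketch.top(2))
```

With only three counters, the late-arriving item "d" takes over the slot of "c" and carries an error bound of 1, which is how the estimation error stays explicit.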

A nice visual showing the difference in the amount of memory needed to estimate answers versus compute exact ones when analyzing stream-based data.

architecture
analytics
metrics
memory_footprint
algorithms
may 2013 by countfloortiles

"Probabilistic Data Structures for Web Analytics and Data Mining" - recommended via @Prismatic

from twitter
march 2013 by peschkaj

Probabilistic Data Structures for Web Analytics and Data Mining http://t.co/TkXTvNJW

from instapaper
february 2013 by apas

Probabilistic Data Structures for Web Analytics and Data Mining

from twitter_favs
december 2012 by andrewbrown

Probabilistic Data Structures for Web Analytics and Data Mining

from twitter_favs
december 2012 by michaeltri

Probabilistic Data Structures for Web Analytics and Data Mining

from twitter_favs
november 2012 by ngpestelos

At the same time, the length of the estimator is a very slowly growing function of the capacity; 5-bit buckets are enough

analytics
november 2012 by zdwalter
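
The "5-bit buckets" remark above can be checked with some back-of-the-envelope arithmetic. The sketch below assumes a LogLog-style counter in which each bucket stores a rank no larger than about log2 of the capacity, so the bucket width grows like log2(log2(capacity)). The excerpt does not spell this model out, and the function name `bucket_bits` is hypothetical.

```python
import math


def bucket_bits(max_cardinality: int) -> int:
    """Bits needed per bucket under the assumed model.

    Assumption: a bucket holds a rank in 0..ceil(log2(max_cardinality)),
    so it needs enough bits to represent that many distinct values.
    """
    if max_cardinality < 2:
        raise ValueError("max_cardinality must be at least 2")
    max_rank = math.ceil(math.log2(max_cardinality))
    return math.ceil(math.log2(max_rank + 1))


for capacity in (10**3, 10**6, 10**9, 10**18):
    print(capacity, bucket_bits(capacity))
```

Under this assumption, going from a thousand to a billion distinct items adds a single bit per bucket, and 5 bits cover capacities up to about a billion.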

I should read this... I haven't.

probabilisticcomputing
todo
programming
algorithm
algorithms
bigdata
via:HackerNews
toread
september 2012 by mcherm

Probabilistic Data Structures for Web Analytics and Data Mining

probabilistic-data-structures
probabilistic-programming
from twitter_favs
august 2012 by leecarrot

Mining big data on a budget: use probabilistic data structures for limited memory footprint in counting and queries

later
prob
ds
refresh:1
from delicious
august 2012 by chl-archive

Computing more advanced metrics, like the number of unique visitors or the most frequent items, is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade estimation precision for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact (sometimes astonishingly compact) replacement for raw data in stream-based computing.

algorithms
datamining
probability
machinelearning
august 2012 by charman

RT @ikatsov: Probabilistic Data Structures for Web Analytics and Data Mining

from twitter
june 2012 by kangaroo5383

@rgarver: Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog http://t.co/iQDG319c

ifttt
twitter
may 2012 by rgarver

Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases.

bigdata
algorithm
algorithms
datamining
probability
compsci
via:fourshortlinks
may 2012 by thadk

Algorithmic porn for several problems with large data sets

statistics
mapreduce
algorithms
from twitter_favs
may 2012 by peterb

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, the raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computing more advanced metrics, like the number of unique visitors or the most frequent items, is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade estimation precision for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact (sometimes astonishingly compact) replacement for raw data in stream-based computing.

datastructures
probalistic
bigdata
may 2012 by sstrobel

tags: @data_mining @web_scraping algorithm algorithms analytics architecture archive bigdata big_data bloom bloomflilter bookmarks_bar cardinality compression compsci computerscience counting countmin cs data-analytics data-mining data-science data-stream data-structure data datamining datascience datastructure datastructures data_mining ds e estimation estimator filter https://twitter.com/datajunkie/status/280197969427451906) ifttt later machine-learning machinelearning mapreduce math mathematics memory_footprint metrics mining ml old performance prob probabilistic-data-structures probabilistic-programming probabilistic probabilisticcomputing probability probalistic processing programming random referen refresh:1 research scalability scale semantic sketch software statistics stats stream streaming structe structures system:unfiled todo toread twitter twitter_analysis via:fourshortlinks via:hackernews website work