Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog

177 bookmarks. First posted by jmeagher may 2012.

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to real-time use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, the raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade precision of the estimates for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact – sometimes astonishingly compact – replacement for raw data in stream-based computing.

september 2013 by tguemes

Count-Min Sketch and similar techniques are not the only family of structures that allow one to estimate frequency-related metrics. Another large family of algorithms and structures for frequency estimation is counter-based techniques. The Stream-Summary algorithm [8] belongs to this family. Stream-Summary allows one to detect the most frequent items in a dataset and estimate their frequencies with an explicitly tracked estimation error.

algorithms
datamining
probability
countmin
sketch
cs
august 2013 by euler
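
The counter-based idea in the excerpt above can be illustrated with a minimal sketch. It assumes the Space-Saving update rule, which is commonly paired with the Stream-Summary structure; the class name `SpaceSaving`, its methods, and the plain-dictionary storage are hypothetical simplifications for illustration, not the article's implementation.

```python
class SpaceSaving:
    """Simplified counter-based frequent-items tracker (illustrative only).

    Keeps at most `capacity` counters. Each counter stores an estimated
    count plus the maximum amount by which it may overestimate the truth.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counts = {}  # item -> estimated count
        self.errors = {}  # item -> maximum possible overestimation

    def add(self, item):
        if item in self.counts:
            # Already monitored: just increment its counter.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Free slot: start an exact counter.
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # No free slot: evict the smallest counter and inherit its
            # count as the new item's error bound.
            # A linear scan is used here for brevity; the real
            # Stream-Summary structure avoids it with linked buckets.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, n):
        """Return up to n (item, estimated_count, max_error) tuples."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return [(item, count, self.errors[item]) for item, count in ranked[:n]]


tracker = SpaceSaving(capacity=3)
for page in ["a"] * 5 + ["b"] * 3 + ["c", "d"]:
    tracker.add(page)
print(tracker.top(2))
```

For any monitored item, the true frequency lies between `count - error` and `count`, which is one way to read the "explicitly tracked estimation error" in the excerpt.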

A nice visual showing how much memory it takes to estimate answers versus compute them exactly when analyzing stream-based data.

architecture
analytics
metrics
memory_footprint
algorithms
may 2013 by countfloortiles

"Probabilistic Data Structures for Web Analytics and Data Mining" - recommended via @Prismatic

from twitter
march 2013 by peschkaj

Probabilistic Data Structures for Web Analytics and Data Mining http://t.co/TkXTvNJW

from instapaper
february 2013 by apas

Probabilistic Data Structures for Web Analytics and Data Mining

from twitter_favs
december 2012 by andrewbrown

Probabilistic Data Structures for Web Analytics and Data Mining

from twitter_favs
december 2012 by michaeltri

Probabilistic Data Structures for Web Analytics and Data Mining

from twitter_favs
november 2012 by tdhopper

Probabilistic Data Structures for Web Analytics and Data Mining

from twitter_favs
november 2012 by ngpestelos

At the same time, the length of the estimator is a very slowly growing function of the capacity; 5-bit buckets are enough

analytics
november 2012 by zdwalter

I should read this... I haven't.

probabilisticcomputing
todo
programming
algorithm
algorithms
bigdata
via:HackerNews
toread
september 2012 by mcherm

Probabilistic Data Structures for Web Analytics and Data Mining

probabilistic-data-structures
probabilistic-programming
from twitter_favs
august 2012 by leecarrot

Mining big data on a budget: use probabilistic data structures for limited memory footprint in counting and queries

later
prob
ds
from delicious
august 2012 by chl

Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade precision of the estimates for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact – sometimes astonishingly compact – replacement for raw data in stream-based computing.

algorithms
datamining
probability
machinelearning
august 2012 by charman

RT @ikatsov: Probabilistic Data Structures for Web Analytics and Data Mining

from twitter
june 2012 by kangaroo5383

@rgarver: Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog http://t.co/iQDG319c

ifttt
twitter
may 2012 by rgarver

Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to real-time use cases.

bigdata
algorithm
algorithms
datamining
probability
compsci
via:fourshortlinks
may 2012 by thadk

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to real-time use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, the raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade precision of the estimates for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact – sometimes astonishingly compact – replacement for raw data in stream-based computing.

datastructures
probalistic
bigdata
may 2012 by sstrobel

RT @newsycombinator: Probabilistic Data Structures for Web Analytics and Data Mining

from twitter
may 2012 by ma51ne64

We'll have blog posts on most of these at @Kiip soon, as we use most. Probabilistic Data Structures for Web Analytics http://t.co/LxS9zyfA

from instapaper
may 2012 by indirect

Stream summary, count-min sketches, loglog counting, linear counters. Some nifty algorithms for probabilistic estimation of element frequencies and data-set cardinality (via proggit). Source: http://pinboard.in/ (from Pinboard Network RSS Improver, http://pipes.yahoo.com/pipes/pipe.info?_id=b22b9c9acee5906aab7e8a7645a247a9)

iftttGR
may 2012 by earth2marsh
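
Of the structures listed in the note above, the linear counter is the simplest to sketch for cardinality estimation. The following is a minimal illustration, not the article's code: the class name `LinearCounter`, the choice of SHA-1 as the hash, and the one-byte-per-bit storage are all assumptions made for readability.

```python
import hashlib
import math


class LinearCounter:
    """Illustrative linear counter for estimating distinct elements."""

    def __init__(self, num_bits):
        if num_bits < 1:
            raise ValueError("num_bits must be at least 1")
        self.num_bits = num_bits
        # One byte per logical bit keeps the example simple; a real
        # implementation would pack bits to save memory.
        self.bits = bytearray(num_bits)

    def add(self, item):
        # Hash the item to a single bit position and set it.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self.bits[index] = 1

    def estimate(self):
        # Estimate n as -m * ln(V), where V is the fraction of bits
        # that are still zero.
        zeros = self.num_bits - sum(self.bits)
        if zeros == 0:
            raise ValueError("bitmap is saturated; use more bits")
        return -self.num_bits * math.log(zeros / self.num_bits)


counter = LinearCounter(num_bits=8192)
for visit in range(3000):
    counter.add("user-%d" % (visit % 1000))  # 1000 distinct visitors
print(round(counter.estimate()))
```

Repeated items set the same bit again, so duplicates do not change the estimate; accuracy degrades as the bitmap fills, which is why the bitmap size has to be chosen with the expected cardinality in mind.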
