Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog


183 bookmarks. First posted by jmeagher may 2012.


Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade estimation precision for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact, sometimes astonishingly compact, replacement for raw data in stream-based computing.
bigdata  analytics  machinelearning 
september 2013 by tguemes
Count-Min Sketch and similar techniques are not the only family of structures that allow one to estimate frequency-related metrics. Another large family of algorithms and structures for frequency estimation is the counter-based techniques. The Stream-Summary algorithm [8] belongs to this family. Stream-Summary allows one to detect the most frequent items in a dataset and to estimate their frequencies with an explicitly tracked estimation error.
algorithms  datamining  probability  countmin  sketch  cs 
august 2013 by euler
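For readers unfamiliar with the Count-Min Sketch mentioned in the note above, here is a minimal sketch of the idea in Python. It is an illustration under stated assumptions, not the article's implementation: the class name, the default width and depth, and the use of a salted BLAKE2b hash in place of a family of independent hash functions are all choices made here for brevity.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min sketch: `depth` rows of `width` counters each."""

    def __init__(self, width=1000, depth=5):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting one hash with the row number stands in for a family of
        # independent hash functions (an assumption made for brevity).
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Every row gets the increment, each at its own hashed position.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true frequency.
        return min(
            self.rows[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch()
for page in ["/home", "/home", "/pricing", "/home"]:
    sketch.add(page)
print(sketch.estimate("/home"))
```

The estimate never falls below the true count; it can exceed it when other items collide with the queried one in every row, which becomes less likely as width and depth grow.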
A nice visual representing the difference in the amount of memory it takes to estimate versus compute exact answers when analyzing stream-based data.
architecture  analytics  metrics  memory_footprint  algorithms 
may 2013 by countfloortiles
"Probabilistic Data Structures for Web Analytics and Data Mining" - recommended via
from twitter
march 2013 by peschkaj
Probabilistic Data Structures for Web Analytics and Data Mining http://t.co/TkXTvNJW
from instapaper
february 2013 by apas
Probabilistic Data Structures for Web Analytics and Data Mining -
from twitter_favs
december 2012 by andrewbrown
Probabilistic Data Structures for Web Analytics and Data Mining -
from twitter_favs
december 2012 by michaeltri
Probabilistic Data Structures for Web Analytics and Data Mining
from twitter_favs
november 2012 by tdhopper
Probabilistic Data Structures for Web Analytics and Data Mining
from twitter_favs
november 2012 by ngpestelos
At the same time, the length of the estimator is a very slowly growing function of the capacity; 5-bit buckets are enough
analytics 
november 2012 by zdwalter
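The quoted remark about 5-bit buckets can be checked with a little arithmetic. The sketch below assumes the quote refers to a LogLog-style counter, where each bucket stores only the largest "rank" (position of the leftmost 1-bit) seen among the hash values routed to it. The function names and the example hash widths are hypothetical, chosen only for illustration.

```python
def rank(hash_value, hash_bits=32):
    """Return the 1-based position of the leftmost 1-bit in a hash_bits-wide value."""
    if hash_value == 0:
        return hash_bits + 1  # no 1-bit at all: one past the last position
    return hash_bits - hash_value.bit_length() + 1


def register_bits_needed(max_rank):
    """Bits required for a bucket that must hold any rank in 0..max_rank."""
    return max_rank.bit_length()


# A bucket that sees w hash bits can observe ranks up to w + 1.
for w in (14, 22, 30, 62):
    print(w, register_bits_needed(w + 1))
```

Under these assumptions a 5-bit bucket holds ranks up to 31, which covers 30 bits of hash per bucket, and roughly doubling that to 62 bits costs only one more bit per bucket. That is the sense in which the bucket size grows very slowly with capacity.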
counting algorithms
algorithms  probability 
october 2012 by klogram
Probabilistic data structures for the web. Tolerating a small error makes very efficient processing possible.
from twitter_favs
september 2012 by strongberry
Probabilistic Data Structures for Web Analytics and Data Mining:
probabilistic-data-structures  probabilistic-programming  from twitter_favs
august 2012 by leecarrot
Mining big data on a budget: use probabilistic data structures for limited memory footprint in counting and queries
later  prob  ds  from delicious
august 2012 by chl
Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade estimation precision for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact, sometimes astonishingly compact, replacement for raw data in stream-based computing.
algorithms  datamining  probability  machinelearning 
august 2012 by charman
RT : Probabilistic Data Structures for Web Analytics and Data Mining
from twitter
june 2012 by kangaroo5383
Really interesting article
website  research  software 
june 2012 by rjkroege
@rgarver: Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog http://t.co/iQDG319c
ifttt  twitter 
may 2012 by rgarver
Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to realtime use cases.
bigdata  algorithm  algorithms  datamining  probability  compsci  via:fourshortlinks 
may 2012 by thadk
Algorithmic porn for several problems with large data sets
from twitter_favs
may 2012 by myfreeweb
Algorithmic porn for several problems with large data sets
from twitter_favs
may 2012 by liqweed
Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight, high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and to trade estimation precision for memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more importantly, as a compact, sometimes astonishingly compact, replacement for raw data in stream-based computing.
datastructures  probalistic  bigdata 
may 2012 by sstrobel
RT : Probabilistic Data Structures for Web Analytics and Data Mining
from twitter
may 2012 by ma51ne64
on data structures for data mining
datamining  algorithm  programming 
may 2012 by jshwlkr
Probabilistic Data Structures for Web Analytics and Data Mining -
from twitter
may 2012 by heapdump
We'll have blog posts on most of these at @Kiip soon, as we use most. Probabilistic Data Structures for Web Analytics http://t.co/LxS9zyfA
from instapaper
may 2012 by indirect