Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog
22 days ago by jm
Stream summary, count-min sketches, loglog counting, linear counters. Some nifty algorithms for probabilistic estimation of element frequencies and data-set cardinality (via proggit)
via:proggit
algorithms
probability
probabilistic
count-min
stream-summary
loglog-counting
linear-counting
estimation
big-data
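For a flavour of the count-min sketch mentioned above, here is a minimal Python sketch. It is not taken from the linked post: the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=1000, depth=5):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(
            self.rows[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for word in ["hadoop", "storm", "hadoop", "hadoop"]:
    sketch.add(word)
print(sketch.estimate("hadoop"))  # at least 3; overestimates are possible
```

Memory stays fixed at `width * depth` counters however many distinct items turn up, which is the whole appeal; the price is that estimates can run high when items collide.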
Operations, machine learning and premature babies - O'Reilly Radar
6 weeks ago by jm
Good post about applying ML techniques to ops data. 'At a recent meetup about finance, Abhi Mehta encouraged people to capture and save "everything." He was talking about financial data, but the same applies here. We'd need to build Hadoop clusters to monitor our server farms; we'd need Hadoop clusters to monitor our Hadoop clusters. It's a big investment of time and resources. If we could make that investment, what would we find out? I bet that we'd be surprised.' Let's just say that if you like the sound of that, our SDE team in Amazon's Dublin office is hiring ;)
ops
big-data
machine-learning
hadoop
ibm
Occursions
10 weeks ago by jm
'Our goal is to create the world's fastest extendable, non-transactional time series database for big data (you know, for kids)! Log file indexing is our initial focus. For example append only ASCII files produced by libraries like Log4J, or containing FIX messages or JSON objects. Occursions was built by a small team sick of creating hacks to remotely copy and/or grep through tons of large log files. We use it to index around a terabyte of new log data per day. Occursions asynchronously tails log files and indexes the individual lines in each log file as each line is written to disk so you don't even have to wait for a second after an event happens to search for it. Occursions uses custom disk backed data structures to create and search its indexes so it is very efficient at using CPU, memory and disk.'
logs
search
tsd
big-data
log4j
via:proggit
How to beat the CAP theorem
october 2011 by jm
Nathan "Storm" Marz on building a dual realtime/batch stack. This lines up with something I've been building in work, so I'm happy ;)
nathan-marz
realtime
batch
hadoop
storm
big-data
cap
The Secrets of Building Realtime Big Data Systems
may 2011 by jm
Great slides, via HN. Recommends a canonical long-term store in Hadoop, plus a separate, quick realtime datastore for data "not yet processed by Hadoop".
hadoop
big-data
data
scalability
datamining
realtime
slides
presentations
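The split described in the slides note above can be sketched roughly as follows. This is a toy in-memory illustration, not the design from the slides: `DualStore`, its method names, and the use of plain Python lists and `Counter` objects in place of Hadoop and a realtime datastore are all hypothetical.

```python
from collections import Counter


class DualStore:
    """Toy model: a canonical batch store plus a small realtime store."""

    def __init__(self):
        self.master = []                 # stand-in for the long-term Hadoop store
        self.batch_view = Counter()      # counts as of the last batch run
        self.realtime_view = Counter()   # counts for events the batch job has not seen

    def record(self, key):
        # Every event goes to the canonical store and to the realtime view.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # Recompute the batch view from scratch over all canonical data,
        # then discard the realtime counts it now covers.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def count(self, key):
        # Queries merge both views.
        return self.batch_view[key] + self.realtime_view[key]


store = DualStore()
store.record("page-a")
store.record("page-a")
store.run_batch()
store.record("page-a")
print(store.count("page-a"))  # 3
```

The realtime side only ever holds what has arrived since the last batch run, so it can stay small and fast, and a bug there is wiped out by the next recomputation from the canonical data.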