jm + via:proggit   4

Probabalistic Scraping of Plain Text Tables
a nifty hack.
Recently I have been banging my head trying to import a ton of OCR-acquired data expressed in tabular form. I think I have come up with a neat approach using probabilistic reasoning combined with mixed integer programming. The method is pretty robust to all sorts of real-world issues. In particular, the method leverages topological understanding of tables, encodes it declaratively into a mixed integer/linear program, and integrates weak probabilistic signals to classify the whole table in one go (at sub-second speeds). This method can be used for any kind of classification where you have strong logical constraints but noisy data.


(via proggit)
scraping  tables  ocr  probabilistic  linear-programming  optimization  machine-learning  via:proggit 
september 2013 by jm
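A toy sketch of the "strong logical constraints but noisy data" idea from the table-scraping link above. This is not the linked article's method: it swaps the mixed integer program for a plain exhaustive search, and the constraint (every header row precedes every data row), the function name `classify_rows`, and the example probabilities are all assumptions made up for illustration.

```python
import math


def classify_rows(p_header):
    """Jointly label table rows as 'header' or 'data'.

    p_header[i] is a noisy, per-row probability that row i is a header.
    Assumed constraint (illustrative only): all header rows come before
    all data rows, so a labelling is fully described by one split point.
    """
    n = len(p_header)
    best_split, best_score = 0, -math.inf
    for split in range(n + 1):  # rows [0, split) are headers, the rest data
        score = 0.0
        for i, p in enumerate(p_header):
            q = p if i < split else 1.0 - p
            score += math.log(max(q, 1e-12))  # clamp to avoid log(0)
        if score > best_score:
            best_split, best_score = split, score
    return ["header"] * best_split + ["data"] * (n - best_split)


# Row 3 looks slightly header-like (0.6) on its own, but labelling it a
# header would break the ordering constraint, so the joint optimum says data.
labels = classify_rows([0.9, 0.4, 0.2, 0.6, 0.1])
print(labels)
```

Thresholding each row independently at 0.5 would give header, data, data, header, data, which is not a valid table layout under the assumed constraint; scoring whole labellings at once avoids that. A real solver-based version would express the same constraint as linear inequalities over 0/1 variables rather than enumerating split points.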
Probabilistic Data Structures for Web Analytics and Data Mining « Highly Scalable Blog
Stream summary, count-min sketches, loglog counting, linear counters. Some nifty algorithms for probabilistic estimation of element frequencies and data-set cardinality (via proggit)
via:proggit  algorithms  probability  probabilistic  count-min  stream-summary  loglog-counting  linear-counting  estimation  big-data 
may 2012 by jm
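For a feel of what a count-min sketch from the link above does, here is a minimal sketch in Python. It is not taken from the linked post; the class name, the default `width` and `depth`, and the use of SHA-256 to derive one hash per row are assumptions chosen to keep the example short and deterministic.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter that uses a fixed amount of memory."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


cms = CountMinSketch()
for word in ["get", "get", "post", "get"]:
    cms.add(word)
print(cms.estimate("get"))
```

The estimate never undercounts; it can overcount when other items collide with the queried one in every row, which becomes less likely as `width` and `depth` grow.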
Occursions
'Our goal is to create the world's fastest extendable, non-transactional time series database for big data (you know, for kids)! Log file indexing is our initial focus. For example append only ASCII files produced by libraries like Log4J, or containing FIX messages or JSON objects. Occursions was built by a small team sick of creating hacks to remotely copy and/or grep through tons of large log files. We use it to index around a terabyte of new log data per day. Occursions asynchronously tails log files and indexes the individual lines in each log file as each line is written to disk so you don't even have to wait for a second after an event happens to search for it. Occursions uses custom disk backed data structures to create and search its indexes so it is very efficient at using CPU, memory and disk.'
logs  search  tsd  big-data  log4j  via:proggit 
march 2012 by jm
A few git tips you didn't know about
'git checkout -t' alone is worth the bookmark
git  tips  coding  unix  reference  tricks  via:proggit 
september 2011 by jm
