How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code
october 2015 by jm
Using Spark, Tesseract, HBase, Solr and Leptonica. Actually pretty feasible.
spark
tesseract
hbase
solr
leptonica
pdfs
scanning
cloudera
hadoop
architecture
Scraping for Journalism: A Guide for Collecting Data - ProPublica
january 2011 by jm
modern web-scraping tech, using Ruby and Nokogiri
via:waxy
web
scraping
ruby
nokogiri
pdf
flash
tesseract
ocr
from delicious