Data Mining OCR PDFs — Using pdftabextract to liberate tabular data from scanned documents
Over the last few months I often had to deal with the problem of extracting tabular data from scanned documents. These documents included quite old sources, like catalogs of German newspapers from the 1920s and 30s, and newer ones, like lists of schools in Germany from the 1990s. All sources were of mixed scanning quality (including rotated or skewed pages) and had very different table layouts. Some had visible table column borders, others only table header borders, so the actual table cells were only visually separated by “white-space”. Automated data extraction with tools from ABBYY or with Tabula failed in most cases. Because of the wide variety of scanning quality and table layouts, a single general-purpose approach didn’t work out. Hence I created a set of common tools that detect table layouts on scanned pages in OCR PDFs, enable visual verification of the detected layouts and finally extract the data in the tables.
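To make the "white-space" case concrete, here is a minimal sketch of one way column separators could be guessed when a table has no visible borders. It is not pdftabextract's API and not necessarily the author's method. The function name `find_column_gaps`, the `(left, right)` box format and the `min_gap` parameter are all assumptions made for illustration. The idea is to merge the horizontal extents of the OCR text boxes and treat every sufficiently wide empty stretch between them as a column boundary.

```python
from typing import List, Tuple


def find_column_gaps(boxes: List[Tuple[float, float]], min_gap: float) -> List[float]:
    """Guess x-positions of column separators from OCR text boxes.

    boxes:   hypothetical (left, right) x-extents of text boxes on one page.
    min_gap: smallest empty horizontal stretch that counts as a separator.
    """
    if not boxes:
        return []

    # Merge overlapping x-intervals so each merged interval covers one column of text.
    intervals = sorted(boxes)
    merged = [list(intervals[0])]
    for left, right in intervals[1:]:
        if left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])

    # A separator sits in the middle of every gap that is wide enough.
    separators = []
    for previous, following in zip(merged, merged[1:]):
        gap_start, gap_end = previous[1], following[0]
        if gap_end - gap_start >= min_gap:
            separators.append((gap_start + gap_end) / 2)
    return separators
```

On a skewed or rotated scan the boxes would need to be deskewed first, otherwise the gaps between columns close up. That is one reason visual verification of the detected layout matters.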
Take Control of Your Paperless Office: The Online Appendixes
Online companion to the ebook "Your Paperless Office" by Joe Kissell.
Live notes from 'Machine learning for enhancing cultural heritage collections' meetup, Monday Jan 8th
Topics mentioned in the introductions: working with contemporary collections; discoverability (quite a few times); infrastructure; remixing and games; artists and ML; image clean-up, transition detection (moving images); creating structured data, term extraction, ontologies, gazetteers; topic modelling; labelling artworks; identifying file formats (particularly text based formats that don't lend themselves to signature/magic number identification); applying lexical models; feature detection in images; image classification including (near) duplicate image detection; image matching; computer vision; open metadata; natural language processing; linking datasets, term matching; wikidata; the user experience of ML-generated data; integrating crowdsourcing and ML; face recognition/detection
Scan, index, and archive all of your paper documents. Mayan EDMS is a wonderful product with a lot of features. However, its sheer number of features and capabilities can be a bit intimidating for the average user. This is where Open Paperless comes in. Open Paperless is a re-think of the user interface and user experience for Mayan EDMS. The goal is to reduce the complexity and make it more suitable for home users. Think of Open Paperless as a lightweight version of Mayan EDMS.

Free Online OCR - convert scanned PDF and images to Word, JPEG to Word
Comparing the best image text recognition APIs
In our experiment, the Google Vision APIs were miles ahead of Microsoft's and AWS's at detecting text in images, in both precision and recall. On the other hand, even today the state of the art in computer vision has a long way to go, and there are many gaps to be filled, as seen from the less than satisfactory performance of these APIs on a fairly simple use case.
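For readers unfamiliar with the two metrics, the sketch below shows how precision and recall could be computed for a text-detection result. It is an illustration only: the function name `precision_recall`, the word-level exact matching and the sample words are assumptions, not the method or data used in the comparison.

```python
from typing import Iterable, Tuple


def precision_recall(detected: Iterable[str], expected: Iterable[str]) -> Tuple[float, float]:
    """Word-level precision and recall, using exact string matches (an assumption)."""
    detected_set, expected_set = set(detected), set(expected)
    true_positives = len(detected_set & expected_set)

    # Precision: share of detected words that are really in the image.
    precision = true_positives / len(detected_set) if detected_set else 0.0
    # Recall: share of words in the image that were detected.
    recall = true_positives / len(expected_set) if expected_set else 0.0
    return precision, recall


# Hypothetical example: the API misreads "50%" as "5O%" and misses "OFF".
precision, recall = precision_recall(
    detected=["STOP", "SALE", "5O%"],
    expected=["STOP", "SALE", "50%", "OFF"],
)
```

Here two of the three detected words are correct (precision 2/3) and two of the four words present were found (recall 1/2).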

