ocr   8897

Data Mining OCR PDFs — Using pdftabextract to liberate tabular data from scanned documents | WZB Data Science Blog
Over the past few months I often had to deal with the problem of extracting tabular data from scanned documents. These documents included quite old sources, like catalogs of German newspapers from the 1920s and 30s, and newer sources, like lists of schools in Germany from the 1990s. All sources were of mixed scanning quality (including rotated or skewed pages) and had very different table layouts. Some had visible table column borders; others had only table header borders, so the actual table cells were separated only visually by “white-space”. Automated data extraction with tools from ABBYY or with Tabula failed in most cases. Because of the wide variety of scanning quality and table layouts, a single general solution didn’t work out. Hence I created a set of common tools that make it possible to detect table layouts on scanned pages in OCR PDFs, visually verify the detected layouts, and finally extract the data in the tables.
PDF  OCR  scan  tables  tabular  detect 
4 days ago by fjordaan
Take Control of Your Paperless Office: The Online Appendixes
Online companion to the ebook "Your Paperless Office" by Joe Kissell.
ebook  JoeKissell  paperless  scanners  OCR 
9 days ago by amoore
Live notes from 'Machine learning for enhancing cultural heritage collections' meetup, Monday Jan 8th - Google Docs
Topics mentioned in the introductions: working with contemporary collections; discoverability (quite a few times); infrastructure; remixing and games; artists and ML; image clean-up, transition detection (moving images); creating structured data, term extraction, ontologies, gazetteers; topic modelling; labelling artworks; identifying file formats (particularly text based formats that don’t lend themselves to signature/magic number identification); applying lexical models; feature detection in images; image classification including (near) duplicate image detection; image matching; computer vision; open metadata; natural language processing; linking datasets, term matching; wikidata; the user experience of ML-generated data; integrating crowdsourcing and ML; face recognition/detection
machine_learning  collection  image_recognition  ocr  google_docs 
9 days ago by stacker
Scan, index, and archive all of your paper documents https://github.com/zhoubear/open-pape…
Mayan EDMS is a wonderful product with a lot of features. However, its sheer number of features and capabilities can be a bit intimidating for the average user. This is where Open Paperless comes in. Open Paperless is a re-think of the user interface and user experience for Mayan EDMS. The goal is to reduce the complexity and make it more suitable for home users. Think of Open Paperless as a lightweight version of Mayan EDMS.

paperless  ocr  scanner  productivity  organization 
10 days ago by michaelfox
Free Online OCR - convert scanned PDF and images to Word, JPEG to Word
ocr  text  converter 
11 days ago by clearstyle
Comparing the best image text recognition APIs
In our experiment, when it comes to detecting text in images, the Google vision APIs are miles ahead of Microsoft and AWS in both precision and recall. On the other hand, even today the state of the art in computer vision has a long way to go, and there are many gaps to be filled, as seen from the less-than-satisfactory performance of these APIs on a fairly simple use case.

Get dataset and code
ocr  imageProcessing  text 
15 days ago by euler
