Scraping   11718


alephdata/memorious: Distributed crawling framework for documents and structured data.
memorious is a distributed web scraping toolkit. It is a lightweight tool that schedules, monitors and supports scrapers that collect structured or unstructured data. This includes the following use cases:
crawler  scraping  tool 
2 days ago by davidbenque
médialab Tools
Home of Sigma.js and a multitude of other cool tools.
visualization  network-analysis  scraping  sna 
3 days ago by mjlassila
The Brutal Fight to Mine Your Data and Sell It to Your Boss - Bloomberg
If that argument is only somewhat reassuring, HiQ’s argument is effectively that we’re on our own, and that this is the price we pay for today’s internet. “There’s probably lots and lots of applications that might make someone feel a little queasy, right?” Gupta told Judge Chen. “But the thing is, we can’t sit here today and police every possible business model that some entrepreneur in Silicon Valley might come up with. It’s public information. It’s the marketplace of ideas. It’s the engine of our country’s growth.”
personalData  dataMining  scraping  legal  surveillance  LinkedIn  HiQ  analytics  employment  management  metrics  recruitment  retention 
7 days ago by petej
How to build a scaleable crawler to crawl million pages with a single machine in just 2 hours
Added March 02, 2017 at 03:00PM
read2of  scraping  software-architecture 
7 days ago by xenocid
Extracting content from .pdf files | Data Science Services
"The example above was relatively easy, because the pdf contained information stored as text. For many older pdfs (especially old scanned documents) the information will instead be stored as images. This makes life much more difficult, but with a little work the data can be liberated. This example pdf file contains a code-book for old employment data sets. Let's see if this information can be extracted into a machine-readable form.

As mentioned in Overview of available tools, there are several options to choose from. In this example I'm going to use tesseract because it is free and easily script-able. The tesseract program cannot process pdf files directly, so the first step is to convert each page of the pdf to an image. This can be done using the pdftocairo utility (part of the poppler project). The information I want is on pages 32 to 186, so I'll convert just those pages.

cd ../files/example_files/blog/pdf_extraction
pdftocairo -png BLS_employment_costs_documentation.pdf -f 32 -l 186

Once the pdf pages have been converted to an image format (.png in this example) they can be converted to text using tesseract. The quality of the conversion depends on lots of things, but mostly the quality of the original images. In this example the quality is variable and generally poor, but useful information can still be extracted.

cd ../files/example_files/blog/pdf_extraction
# OCR each page image; tesseract appends .txt to the output base it is given
for imageFile in *.png
do
  tesseract \
    -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 :/()-" \
    "$imageFile" "$imageFile"
done
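
One possible variation on the loop above, offered as a sketch and not as the original author's method: because tesseract appends .txt to whatever output base it receives, passing the image name as the base produces files ending in .png.txt. The helper `ocr_base` below is a hypothetical name introduced here for illustration; it strips a trailing .png using shell parameter expansion. The loop also skips the literal `*.png` pattern when no images match and stops quietly if tesseract is not installed. The character whitelist option from the excerpt is left out for brevity.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the quoted post):
# derive the tesseract output base by dropping a trailing ".png".
ocr_base() {
    printf '%s\n' "${1%.png}"
}

for imageFile in *.png; do
    # When nothing matches, the pattern stays literal; skip it.
    [ -e "$imageFile" ] || continue
    # Stop without error if tesseract is not available on this machine.
    command -v tesseract >/dev/null 2>&1 || break
    # Output lands in "<base>.txt", not "<base>.png.txt".
    tesseract "$imageFile" "$(ocr_base "$imageFile")"
done
```

Names that do not end in .png pass through `ocr_base` unchanged, so the helper is safe to call on any file name.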

data-scraping  scraping  pdfs  **** 
8 days ago by MarcK
GitHub - MechanicalSoup/MechanicalSoup: A Python library for automating interaction with websites.
A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript.

Mechanize is incompatible with Python 3 and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants Requests (for HTTP sessions) and BeautifulSoup (for document navigation).
scraping  python  browser  testing  webdev 
8 days ago by lena


