web-crawler   29

MsdnApiExtractor
Web crawler which builds Windows API header files based on the publicly available documentation on MSDN
windows  programming  web-crawler 
july 2016 by markscottwright
ludios/grab-site
this one's hard to tag...

grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling.

grab-site gives you:

a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

the ability to add ignore patterns when the crawl is already running. This allows you to skip the crawling of junk URLs that would otherwise prevent your crawl from ever finishing. See below.

an extensively tested default ignore set (global) as well as additional (optional) ignore sets for blogs, forums, etc.

duplicate page detection: links are not followed on pages whose content duplicates an already-seen page.
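
The two features above (ignore patterns that can be added mid-crawl, and duplicate page detection) can be sketched in a few lines. This is only an illustration of the general idea, not grab-site's or wpull's actual code: the `CrawlFilter` class, its method names, and the choice of SHA-256 over the raw page body are all assumptions made for the example.

```python
import hashlib
import re


class CrawlFilter:
    """Toy crawl filter. Hypothetical names; not the grab-site or wpull API."""

    def __init__(self, ignore_patterns=()):
        # Regular expressions matched against each candidate URL.
        self.ignore_patterns = [re.compile(p) for p in ignore_patterns]
        # Digests of page bodies that have already been seen.
        self.seen_digests = set()

    def add_ignore(self, pattern):
        # Patterns can be appended at any time, including mid-crawl.
        self.ignore_patterns.append(re.compile(pattern))

    def should_fetch(self, url):
        # Skip any URL that matches at least one ignore pattern.
        return not any(p.search(url) for p in self.ignore_patterns)

    def should_follow_links(self, body):
        # Assumption: two pages count as duplicates when their bytes hash
        # to the same SHA-256 digest. Links are followed only the first time.
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_digests:
            return False
        self.seen_digests.add(digest)
        return True
```

A crawl loop would call `should_fetch(url)` before downloading a page and `should_follow_links(body)` before queueing the links found on it. A real crawler would likely normalize the content before hashing so that trivially different copies still match.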
web-crawler  internet-culture  backup 
august 2015 by dhartunian
felipecsl/wombat
Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

http://felipecsl.com/wombat/
ruby  web-crawler  web-scraping 
october 2014 by goofrider
Darcy Ripper - Fast, Efficient Website Downloader
Darcy Ripper is a free offline website downloader that both casual users and programmers can use to download web-related resources on the fly.
Darcy Ripper runs on Windows (XP, Vista, 7, 8.1 and older versions), Linux and other flavors of Unix, and Mac OS X 10.9 (Mavericks) and later.
web-crawler  site-suck  mirror  app  java  crossplatform 
october 2014 by goofrider
CommonCrawl
"Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education, and research." -- interesting.
web  web-crawler  dataset  aws  opensource 
july 2012 by arthegall

