crawlers   209


Mapping controversies can be greatly facilitated by studying them through the prism of the web. Analysing the websites of the actors in a controversy and building a map from the links between them can be a rich source of knowledge, although it can be quite complex to carry out, especially for social scientists. Built as free software available on GitHub, Hyphe was designed to offer researchers and students a web corpus curation tool featuring a research-driven web crawler. It provides users with a method to build web corpora combining granularity, flexibility and simple curation principles. Rather than websites, Hyphe manipulates WebEntities, which can be defined as a single page, a subdomain, a combination of websites, and so on. The webpages belonging to these WebEntities can then be crawled in order to collect all out-bound links and text within the pages of the entity. The most-cited discovered WebEntities can then be prospected to enlarge the corpus before visualizing it as a network and exporting it for refinement in Gephi and publication with manylines.
crawlers  automation  preservation  archiving  open-source  software 
january 2018 by mikael
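The crawl step described above — collecting the out-bound links and text from the pages of a WebEntity — can be sketched with Python's standard-library `HTMLParser`. This is an illustration of the idea only, not Hyphe's actual implementation:

```python
from html.parser import HTMLParser

class LinkAndTextExtractor(HTMLParser):
    """Collect href targets and visible text from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._ignored_depth = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._ignored_depth += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._ignored_depth:
            self._ignored_depth -= 1

    def handle_data(self, data):
        if not self._ignored_depth and data.strip():
            self.text_parts.append(data.strip())

def extract(html):
    """Return (out-bound links, concatenated visible text) for a page."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    return parser.links, " ".join(parser.text_parts)

page = '<p>Actor site</p><a href="https://example.org/">partner</a>'
links, text = extract(page)
# links -> ['https://example.org/'], text -> 'Actor site partner'
```

In Hyphe the extracted links are what make newly discovered WebEntities "prospectable": the more often an entity is cited, the stronger the hint that it belongs in the corpus.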
Choosing a web crawler
There are many more scraper projects, mainly written in Python and built either to scrape one specific website or to index pages. But clearly none is as easy to use as Scrapy, and most have smaller communities.
python  search  lists  scrapers  crawlers 
october 2017 by liberatr


