crawl   1030


Tiny Endian
How to scrape the web and not get caught
This article will be just a quick one. It's a few-lines-of-code recipe for mitigating IP restrictions and WAFs when crawling the web. If you're reading this you've probably already tried web scraping. It's all easy breezy until one day someone managing the website you're harvesting data from realizes what's happening and blocks your IP. If you're running your scrapers in an automated way, you'll start seeing them fail miserably. You'll probably want to solve this problem fast, before any of your precious data slips through your fingers.
Say hello to proxies
While it might be tempting to use one of the paid providers of such services, it isn't that hard to craft a home-baked solution that costs you no money, thanks to an awesome project: scrapy-rotating-proxies.
Just add it to your project as described in the documentation:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Either list the proxies inline...
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
# ...or point to a file with one proxy per line (this takes precedence):
ROTATING_PROXY_LIST_PATH = 'proxies.txt'
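With that in place the spider itself needs no changes. For completeness, here's a minimal sketch; the spider name and start URL are hypothetical placeholders, not anything from the original recipe:
# spiders/mycrawler.py -- minimal sketch; name and start URL are placeholders.
import scrapy

class MyCrawler(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Proxy rotation and ban detection happen in the middleware layer,
        # so the parsing code stays oblivious to them.
        yield {'url': response.url, 'title': response.css('title::text').get()}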
So, where do you get this proxies.txt list from? It's easier than you think. I was not able to find a Python project that provides a list of free proxies out of the box, but there is a proxy-lists node module made exactly for that!
Installation is extremely simple, and so is usage:
proxy-lists getProxies --sources-white-list="gatherproxy,sockslist"
This will save a bulky list of proxies in your proxies.txt file.
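The Makefile below assumes proxy-lists is installed locally (e.g. with yarn add proxy-lists) so that yarn run can find it. Free lists are also noisy, so a tiny optional sketch like this one (my own addition, not part of the original recipe) can deduplicate the file before scrapy sees it:
# dedupe_proxies.py -- optional: deduplicate proxies.txt in place.
# Assumes proxy-lists has written one host:port entry per line.
from pathlib import Path

path = Path('proxies.txt')
lines = {line.strip() for line in path.read_text().splitlines() if line.strip()}
path.write_text('\n'.join(sorted(lines)) + '\n')
print(f'{len(lines)} unique proxies kept')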
Say hello to Makefiles
Now you're essentially running a mixed-language project (Python for scrapy, JS for proxy-lists), so you need a way to orchestrate the two tools. What could be better than the lingua franca of builds and orchestration - the Makefile?
Just create a target (remember that Makefile recipe lines are indented with a tab):
all:
	yarn run proxy-lists getProxies --sources-white-list=$$PROXIES_SOURCE_LIST
	scrapy crawl mycrawler -o myoutput.csv
	rm -f proxies.txt
And after you're done with that, your build step in Jenkins becomes just:
make all
Things to consider
Of course there's an overhead to pay for using this - after introducing proxies my crawl times grew by an order of magnitude, from minutes to hours! But hey, it works and it's free, so if you're not willing to pay for data in cash, you need to pay for it with time. Luckily for you, with this sweet hack it's the build server's time, not yours.
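If the slowdown bites, one knob worth experimenting with (my suggestion, not part of the original recipe) is scrapy's concurrency, so more proxies are exercised in parallel; the values below are illustrative assumptions:
# settings.py -- optional tuning; values are illustrative, tune per target site.
CONCURRENT_REQUESTS = 32             # scrapy default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # scrapy default is 8
# scrapy-rotating-proxies: how many times to retry a page with
# different proxies before giving up (default is 5).
ROTATING_PROXY_PAGE_RETRY_TIMES = 10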
4 days ago by hendry
zcrawl/zcrawl: An open source web crawling platform
GitHub is where people build software. More than 28 million people use GitHub to discover, fork, and contribute to over 80 million projects.
golang  crawl 
6 weeks ago by geetarista
Dungeon Crawl Stone Soup - WebTiles - Georgia
DCSS WebTiles Server in Georgia. Play DCSS online.
dcss  game  dungeon  rougelike  video  crawl  server 
8 weeks ago by eNonsense
yujiosaka/headless-chrome-crawler: Distributed crawler powered by Headless Chrome
headless-chrome-crawler - Distributed crawler powered by Headless Chrome
headless  chrome  crawl 
9 weeks ago by geetarista


