Tiny Endian
How to scrape the web and not get caught
This article will be just a quick one. It's a recipe of a few lines of code on how to mitigate IP restrictions and WAFs when crawling the web. If you're reading this, you've probably already tried web scraping. It's all easy breezy until one day someone managing the website you're harvesting data from realizes what's happening and blocks your IP. If you're running your scrapers in an automated way, you'll start seeing them fail miserably. You'll probably want to solve this problem fast, before any precious data slips through your fingers.
Say hello to proxies
While it might be tempting to use one of the paid providers of such services, it isn't that hard to craft a home-baked solution that will cost you no money. This is thanks to an awesome project, scrapy-rotating-proxies.
Just add it to your project as described in the documentation:
# settings.py
# ...
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
ROTATING_PROXY_LIST_PATH = 'proxies.txt'
# ...
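The project's README also asks you to enable two downloader middlewares in the same file, which the snippet above leaves out behind the "# ..." markers. A minimal sketch, assuming the priorities suggested there (610 and 620); double-check the names and numbers against the version you install:

```python
# settings.py (sketch; verify against the scrapy-rotating-proxies README)

# File with one proxy per line, e.g. the proxies.txt used in this article.
ROTATING_PROXY_LIST_PATH = 'proxies.txt'

# Enable proxy rotation and ban detection.
# The priorities 610 and 620 are the ones suggested in the README (assumption).
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```

According to the README, ROTATING_PROXY_LIST_PATH takes precedence when both it and ROTATING_PROXY_LIST are set, so the file-based setting alone should be enough for this recipe.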
So, where do you get this proxies.txt list from? This is easier than you think. I was not able to find a Python project that would provide a list of free proxies out of the box, but there is a proxy-lists node module made exactly for that!
Installation is extremely simple, as is usage:
proxy-lists getProxies --sources-white-list="gatherproxy,sockslist"
This will save a bulky list of proxies in your proxies.txt file.
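Free proxy lists tend to contain duplicates and the odd malformed line, so it may be worth cleaning the file before handing it to scrapy. A minimal sketch, assuming proxies.txt holds one host:port entry per line (check what your proxy-lists version actually writes); clean_proxies and clean_proxy_file are hypothetical helpers, not part of either project:

```python
import re
from pathlib import Path

# Assumed line format: host:port, e.g. "203.0.113.7:8080".
# Entries with a scheme (http://...) would be rejected by this pattern.
PROXY_RE = re.compile(r'^[A-Za-z0-9.-]+:(\d{1,5})$')


def clean_proxies(lines):
    """Return unique, well-formed host:port entries in their original order."""
    seen = set()
    cleaned = []
    for raw in lines:
        entry = raw.strip()
        match = PROXY_RE.match(entry)
        if not match:
            continue  # skip blank and malformed lines
        if not 1 <= int(match.group(1)) <= 65535:
            continue  # skip ports outside the valid range
        if entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned


def clean_proxy_file(path='proxies.txt'):
    """Rewrite the proxy file in place and return how many entries were kept."""
    file = Path(path)
    entries = clean_proxies(file.read_text().splitlines())
    file.write_text('\n'.join(entries) + '\n')
    return len(entries)
```

Calling clean_proxy_file() between the proxy-lists step and the crawl would keep obvious junk out of the rotation pool. It does not test whether a proxy is alive; that part is left to the ban detection in scrapy-rotating-proxies.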
Say hello to Makefiles
Now you're essentially running a mixed-language project (with Python for scrapy and JS for proxy-lists). You need a way to synchronize these two tools. What would be better than the lingua franca of builds and orchestration - the Makefile?
Just create a target:
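# "all" is a command, not a file, so mark it as phony
.PHONY: all
all:
	# fetch fresh proxies into proxies.txt, crawl through them, then clean up
	yarn run proxy-lists getProxies --sources-white-list=$$PROXIES_SOURCE_LIST
	scrapy crawl mycrawler -o myoutput.csv
	rm -f proxies.txt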
And after you're done with that, your build step in Jenkins becomes just:
make all
Things to consider
Of course there's an overhead to pay for using this - after introducing proxies my crawl times grew by an order of magnitude, from minutes to hours! But hey, it works and it's free, so if you're not willing to pay for data in cash, you need to pay for it with time. Luckily for you, with this sweet hack it's the build server's time, not yours.
april 2018 by hendry
