IBM’s photo-scraping scandal shows what a weird bubble AI researchers live in - MIT Technology Review
scraping data from publicly available sources is so much of an industry standard that it’s taught as a foundational skill (sans ethics) in most data science and machine-learning training.

[...] this story highlights the need for the tech industry to adapt its cultural norms and standard practices to keep pace with the rapid evolution of the technology itself, as well as the public’s awareness of how their data is used.
scraping  privacy  data  ai  big-data  data-privacy  flickr  photos  machine-learning 
6 days ago by jm
AWS re:Invent 2015 Video & Slide Presentation Links with Easy Index
Andrew Spyker's roundup:
my quick index of all re:Invent sessions.  Please wait for a few days and I'll keep running the tool to fill in the index.  It usually takes Amazon a few weeks to fully upload all the videos and slideshares.

Pretty definitive, full text descriptions of all sessions (and there are an awful lot of 'em).
aws  reinvent  andrew-spyker  scraping  slides  presentations  ec2  video 
october 2015 by jm
'Turn websites into structured APIs from your browser in seconds' -- next-generation web scraping, recommended by conoro
via:conoro  scraping  web  http  kimono  rss  json  csv  data 
january 2015 by jm
Probabilistic Scraping of Plain Text Tables
a nifty hack.
Recently I have been banging my head trying to import a ton of OCR-acquired data expressed in tabular form. I think I have come up with a neat approach using probabilistic reasoning combined with mixed integer programming. The method is pretty robust to all sorts of real-world issues. In particular, the method leverages topological understanding of tables, encodes it declaratively into a mixed integer/linear program, and integrates weak probabilistic signals to classify the whole table in one go (at sub-second speeds). This method can be used for any kind of classification where you have strong logical constraints but noisy data.
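
To make the "strong logical constraints plus noisy signals" idea concrete, here is a minimal sketch, and not the linked author's method. It assumes a much simpler setting: each line of a text table gets a weak per-line probability of being a header, a data row or a footer, and the only structural constraint is that those labels appear in that order. Because that constraint is so simple, the sketch swaps the mixed integer program for plain dynamic programming; the label names, the classify_lines function and the example probabilities are all hypothetical.

```python
import math

# Hypothetical labels; the structural constraint is that they appear in this order.
LABELS = ["header", "row", "footer"]


def classify_lines(probs):
    """Return the most likely label sequence that never moves backwards in LABELS.

    probs is a list with one dict per line, mapping each label to a weak
    per-line probability (assumed to come from some noisy upstream signal).
    """
    n = len(probs)
    if n == 0:
        return []
    k = len(LABELS)
    neg_inf = float("-inf")

    def logp(p):
        return math.log(p) if p > 0 else neg_inf

    # best[i][j]: best log-likelihood of lines 0..i when line i has label j.
    best = [[neg_inf] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for j in range(k):
        best[0][j] = logp(probs[0][LABELS[j]])
    for i in range(1, n):
        for j in range(k):
            # The previous line may carry any label at or before j.
            prev = max(range(j + 1), key=lambda q: best[i - 1][q])
            best[i][j] = best[i - 1][prev] + logp(probs[i][LABELS[j]])
            back[i][j] = prev

    # Trace back from the best final label.
    j = max(range(k), key=lambda q: best[n - 1][q])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [LABELS[j] for j in path]


if __name__ == "__main__":
    # Made-up scores: line 3 looks like a header in isolation but sits among rows.
    noisy = [
        {"header": 0.7, "row": 0.2, "footer": 0.1},
        {"header": 0.2, "row": 0.7, "footer": 0.1},
        {"header": 0.6, "row": 0.3, "footer": 0.1},
        {"header": 0.1, "row": 0.8, "footer": 0.1},
        {"header": 0.1, "row": 0.3, "footer": 0.6},
    ]
    print(classify_lines(noisy))
```

Labelling each line independently would call the third line a header, which breaks the table's structure; scoring the whole table at once under the ordering constraint overrides that one noisy signal. Richer constraints (column alignment, cell types) are where a real MIP/LP formulation like the one described would earn its keep.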

(via proggit)
scraping  tables  ocr  probabilistic  linear-programming  optimization  machine-learning  via:proggit 
september 2013 by jm
extract the non-boilerplate part of a web page
boilerplate  web  html  page  text  scraping  from delicious
november 2010 by jm
Lift View First
explaining Lift's code-free "display only" templating system. I like it. Very similar concept to WebMake's "scraped templates", nearly 10 years old now!
java  scala  lift  templates  templating  scraping  from delicious
february 2010 by jm
Humblog - Philip Kirwan Ripped Off My iPhone App Content
ouch, nasty allegations. Strikes me that there's a chicken/egg problem: scraping the Dublin Bus website to build a database which you then sell as part of a commercial iPhone app is probably pretty shaky ground to start with
ip  databases  collation  collections  dublin-bus  iphone  apps  scraping  from delicious
january 2010 by jm
'free, open, developer-generated APIs for a wide variety of websites. is a place to create and share them. [..] Check out [..] ways to use parselets from our web service, Ruby, Python, C/C++, or the *nix command-line.'
parselets  scraping  html  web  regexps  sitescooper  json  from delicious
december 2009 by jm

