These datasets can be used for benchmarking deep learning algorithms
3 days ago by rrraul
1 Billion Word Language Model Benchmark
The purpose of the project is to make available a standard training and test setup for language modeling experiments.

The training/held-out data was produced from the WMT 2011 News Crawl data using a combination of Bash shell and Perl scripts distributed here.

This also means that your results on this data set are reproducible by the research community at large.

Besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the following baseline models:

unpruned Katz (1.1B n-grams),
pruned Katz (~15M n-grams),
unpruned Interpolated Kneser-Ney (1.1B n-grams),
pruned Interpolated Kneser-Ney (~15M n-grams)
Happy benchmarking!
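
The per-word log-probabilities mentioned above are the usual input for a perplexity score. The following is a minimal sketch of that calculation, not the project's own tooling: the function name `perplexity` is hypothetical, and the logarithm base is an assumption, since the description does not say which base the published values use.

```python
import math


def perplexity(log_probs, base=10.0):
    """Return the perplexity implied by per-word log-probabilities.

    `base` is the logarithm base of the input values. Base 10 is only
    an assumed default; check how the values were actually produced.
    """
    if not log_probs:
        raise ValueError("need at least one log-probability")
    # Average log-probability per word.
    avg_log_prob = sum(log_probs) / len(log_probs)
    # Perplexity is the base raised to the negative average.
    return base ** (-avg_log_prob)


if __name__ == "__main__":
    # Toy values, not taken from the benchmark: every word has
    # probability 0.25, expressed here as natural logarithms.
    toy = [math.log(0.25)] * 4
    print(perplexity(toy, base=math.e))
```

Passing the wrong base changes the result, so the base should be confirmed against the benchmark's documentation before comparing numbers with published baselines.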
3 days ago by ttpro1995
Amherst-Statistics/Cars-Scraping-Webinar: scraping and multivariate analysis CAUSE activity webinar
A full classroom example for scraping data from I can update my Honda example!
3 days ago by sburer
Our first public datasets: Host-level WebGraph and PageRank! - Common Search
"Common Search is building an open source search engine with transparent rankings, and analyzing the hyperlinks on the web is a major part of this effort. To make that possible, we are going to publish datasets that will let contributors, students and researchers reproduce the rankings, submit improvements and hopefully use the underlying data for their own work."
4 days ago by arsyed
ML-friendly Public Datasets | Kaggle
This Kaggle website has some clean data sets, e.g., for regression.
5 days ago by sburer
WebVectors: Models
part-of-speech tagged models
5 days ago by arnicas

