Introduction to Data Mining
Introduction to Data Mining (Second Edition)
QMiner is a data analytics platform for processing large-scale real-time streams containing structured and unstructured data.
Weka 3 - Data Mining with Open Source Machine Learning Software in Java
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
Mining of Massive Datasets
The book is based on the Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining).
Orange – Data Mining Fruitful & Fun
Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.
Apache Mahout: Scalable machine learning and data mining
The Apache Mahout™ project's goal is to build a scalable machine learning library.
Jaccard index - Wikipedia, the free encyclopedia
A statistic used for comparing the similarity and diversity of sample sets.
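As a rough illustration of the statistic described above: the Jaccard index of two sets A and B is the size of their intersection divided by the size of their union, J(A, B) = |A ∩ B| / |A ∪ B|. The sketch below is one minimal way to compute it in plain Python. The function names and the sample word lists are hypothetical, and treating two empty sets as having similarity 1.0 is a convention assumed here, not something the linked page is quoted on.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are considered identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Dissimilarity derived from the index: 1 - J(A, B)."""
    return 1.0 - jaccard_index(a, b)


# Hypothetical sample data: two short descriptions split into words.
doc1 = "data mining with open source software".split()
doc2 = "open source machine learning software".split()

# Shared words: open, source, software (3); distinct words overall: 8.
similarity = jaccard_index(doc1, doc2)
print(similarity)  # 0.375
```

A value near 1 means the two sets share most of their members, and a value near 0 means they share few.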
Web mining module for Python
Pattern is a web mining module for the Python programming language. It bundles tools for data mining (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern.
Python Data Analysis Library — pandas: Python Data Analysis Library
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
A collection of command-line tools for researchers in machine learning, data mining, and related fields. All of the functionality is also provided in a clean C++ class library. Demo apps are included to show how to use the class library.
