feature-selection   14

[1112.6045] Comparing intermittency and network measurements of words and their dependency on authorship
Many features from texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books from 8 authors who lived in the 19th and 20th centuries, for which the following network measurements were obtained: clustering coefficient, average shortest path lengths, and betweenness. We found that the two factors with stronger dependency on the authors were the skewness in the distribution of word intermittency and the average shortest paths. Other factors such as the betweenness and the Zipf's law exponent show only weak dependency on authorship. Also assessed was the contribution from each measurement to authorship recognition using three machine learning methods. The best performance was a ca. 65% accuracy upon combining complex network and intermittency features with the nearest neighbor algorithm. From a detailed analysis of the interdependence of the various metrics it is concluded that the methods used here are complementary for providing short- and long-scale perspectives of texts, which are useful for applications such as identification of topical words and information retrieval.
natural-language-processing  document-clustering  clustering  feature-selection  algorithms  nudge-targets 
january 2012 by Vaguery
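The measurements named in the abstract above can be illustrated with a minimal sketch. It assumes a co-occurrence network that links adjacent words only, and it uses the coefficient of variation of the gaps between occurrences as the intermittency measure; both are assumptions for illustration, and the paper's exact definitions may differ. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations
from statistics import mean, pstdev


def cooccurrence_graph(tokens):
    """Link every pair of adjacent, distinct words with an undirected edge."""
    graph = {word: set() for word in tokens}
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def average_clustering(graph):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    scores = []
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            scores.append(0.0)
            continue
        links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
        scores.append(2 * links / (k * (k - 1)))
    return mean(scores)


def average_shortest_path(graph):
    """Mean BFS distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def intermittency(tokens, word):
    """Coefficient of variation of the gaps between occurrences of `word`
    (assumed definition: 0 for evenly spaced words, larger for bursty ones)."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return 0.0  # too few occurrences to estimate a spread
    return pstdev(gaps) / mean(gaps)


tokens = "a b c a b d".split()
graph = cooccurrence_graph(tokens)
print(average_clustering(graph), average_shortest_path(graph))
```

Per-book values of such measurements could then serve as the feature vectors for the authorship classifiers the abstract mentions.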
MLboost: Machine Learning boost library in Python
"MLboost's main goal is to speed up any Machine Learning project by simplifying data preprocessing, feature selection and data visualisation."
python  libs  machine-learning  boosting  visualization  feature-selection 
november 2010 by arsyed
When should I use lasso vs ridge? - Statistical Analysis
"Keep in mind that ridge regression can't zero out coefficients; thus, you either end up including all the coefficients in the model, or none of them. In contrast, the LASSO does both parameter shrinkage and variable selection automatically. If some of your covariates are highly correlated, you may want to look at the Elastic Net [3] instead of the LASSO.

I'd personally recommend using the Non-negative Garrote (NNG) [1] as it's consistent in terms of estimation and variable selection [2]. Unlike LASSO and ridge regression, NNG requires an initial estimate that is then shrunk towards the origin. In the original paper, Breiman recommends the least squares solution for the initial estimate (you may however want to start the search from a ridge regression solution and use something like GCV to select the penalty parameter)."
statistics  regression  penalized  lasso  ridge  garrotte  feature-selection 
november 2010 by arsyed
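The contrast drawn in the quoted answer above is easiest to see in the special case of an orthonormal design, where each estimator has a closed form in terms of the least squares coefficients. The sketch below assumes that special case; the function names are hypothetical and the exact constants depend on how the penalty is scaled.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge: scale every coefficient by 1 / (1 + lam); none become exactly zero."""
    return [b / (1.0 + lam) for b in beta_ols]


def lasso_shrink(beta_ols, lam):
    """Lasso: soft-threshold, so coefficients with |b| <= lam become exactly zero."""
    shrunk = []
    for b in beta_ols:
        magnitude = max(abs(b) - lam, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if b > 0 else -magnitude)
    return shrunk


def garrote_shrink(beta_ols, lam):
    """Non-negative garrote: rescale each initial (least squares) coefficient
    by the factor max(0, 1 - lam / b**2)."""
    shrunk = []
    for b in beta_ols:
        factor = max(0.0, 1.0 - lam / (b * b)) if b != 0 else 0.0
        shrunk.append(b * factor if factor > 0 else 0.0)
    return shrunk


beta_ols = [3.0, -0.5, 1.0]
print(ridge_shrink(beta_ols, 1.0))    # every entry stays non-zero
print(lasso_shrink(beta_ols, 1.0))    # small entries are set to zero
print(garrote_shrink(beta_ols, 1.0))  # small entries are set to zero, large ones shrink less
```

Ridge keeps all three coefficients, while the lasso and the garrote both drop the two small ones; the garrote shrinks the large coefficient less than the lasso does.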
Feature Selection with the Boruta Package (Kursa, Rudnicki)
"This article describes an R package Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are proved by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented."
R  pkg  feature-selection  random-forest 
november 2010 by arsyed
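The "random probes" idea in the Boruta abstract above can be sketched in a few lines. This is not the package's implementation: absolute Pearson correlation stands in for random-forest importance, no statistical test or iterative removal is performed, and the names `abs_correlation` and `shadow_hits` are hypothetical. The sketch only counts how often each feature beats the best shuffled ("shadow") copy.

```python
import random
from math import sqrt


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; a stand-in for random-forest importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / sqrt(vx * vy)


def shadow_hits(columns, target, rounds=20, seed=0):
    """Count, per feature, the rounds in which it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(rounds):
        # Shadows are shuffled copies of the real columns, so any importance
        # they show is due to chance alone.
        best_shadow = 0.0
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_correlation(shadow, target))
        for name, values in columns.items():
            if abs_correlation(values, target) > best_shadow:
                hits[name] += 1
    return hits


target = [0, 1] * 20
columns = {"signal": list(target), "noise": [0, 0, 1, 1] * 10}
print(shadow_hits(columns, target))
```

In the approach the abstract describes, hit counts like these would feed a statistical test that decides which features are confirmed and which are removed before the next iteration.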
learning to select features using their properties
"[...] we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances."
feature-selection  ml  meta-features 
november 2008 by chl
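One way to picture the meta-feature idea quoted above: treat each feature's properties as a vector, measure quality for a few features, and estimate the quality of the rest from their nearest measured neighbours. The sketch below assumes pixel features whose meta-features are (row, column) coordinates and uses a simple k-nearest-neighbour average; the predictor, the data and the name `predict_quality` are illustrative assumptions, not the paper's method.

```python
def predict_quality(meta, known_quality, k=2):
    """Estimate the quality of unmeasured features from their k nearest
    measured neighbours in meta-feature space (squared Euclidean distance)."""
    predictions = {}
    for name, vector in meta.items():
        if name in known_quality:
            continue  # already measured, nothing to predict
        ranked = sorted(
            known_quality,
            key=lambda other: sum((a - b) ** 2 for a, b in zip(vector, meta[other])),
        )
        nearest = ranked[:k]
        predictions[name] = sum(known_quality[n] for n in nearest) / len(nearest)
    return predictions


# Hypothetical data: meta-features are pixel coordinates, quality is made up.
meta = {
    "p00": (0, 0), "p01": (0, 1), "p10": (1, 0), "p11": (1, 1),
    "p55": (5, 5), "p56": (5, 6), "p66": (6, 6),
}
known_quality = {"p00": 0.9, "p01": 0.7, "p10": 0.8, "p55": 0.1, "p56": 0.3}
print(predict_quality(meta, known_quality))
```

Here `p11` and `p66` receive a quality estimate without ever being evaluated on training instances, which is the point the quoted description makes.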
"Eliminating the Birthday Paradox for Universal Features" (John Langford, Machine Learning (Theory))
I don't quite understand, but I only skimmed the post. At the same time, I think he's missing the point about Bloomier filters in the comments? Maybe?
machinelearning  bloom-filters  feature-selection  online-algorithms 
april 2008 by arthegall
Feature selection - Wikipedia, the free encyclopedia
might be useful to investigate some alternative feature-selection algorithms for SpamAssassin rules
feature-selection  spamassassin  statistics  rule-dev  rule-qa 
april 2007 by jmason