cshalizi + variable_selection   64

[1612.08468] Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models
"When fitting black box supervised learning models (e.g., complex trees, neural networks, boosted trees, random forests, nearest neighbors, local kernel-weighted methods, etc.), visualizing the main effects of the individual predictor variables and their low-order interaction effects is often important, and partial dependence (PD) plots are the most popular approach for accomplishing this. However, PD plots involve a serious pitfall if the predictor variables are far from independent, which is quite common with large observational data sets. Namely, PD plots require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data, which can render the PD plots unreliable. Although marginal plots (M plots) do not require such extrapolation, they produce substantially biased and misleading results when the predictors are dependent, analogous to the omitted variable bias in regression. We present a new visualization approach that we term accumulated local effects (ALE) plots, which inherits the desirable characteristics of PD and M plots, without inheriting their preceding shortcomings. Like M plots, ALE plots do not require extrapolation; and like PD plots, they are not biased by the omitted variable phenomenon. Moreover, ALE plots are far less computationally expensive than PD plots."
to:NB  variable_selection  visual_display_of_quantitative_information  statistics  regression
3 days ago by cshalizi
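A minimal sketch of a first-order ALE curve, as the abstract above describes it: within each bin of the chosen feature, average the change in prediction when that feature alone is moved from the bin's lower edge to its upper edge, then accumulate and center. The function name `ale_1d`, the quantile binning, and the centering rule are illustrative assumptions, not the authors' implementation.

```python
from bisect import bisect_left


def ale_1d(predict, rows, j, n_bins=4):
    """Return (bin edges, centered ALE values at those edges) for feature j.

    `predict` maps one row (a list of floats) to a number; `rows` is the data.
    """
    values = sorted(r[j] for r in rows)
    n = len(values)
    # Quantile-style edges taken from the data, deduplicated so bins have width.
    edges = sorted({values[(k * (n - 1)) // n_bins] for k in range(n_bins + 1)})
    if len(edges) < 2:
        raise ValueError("feature j needs at least two distinct values")
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for r in rows:
        # Bin k covers (edges[k], edges[k+1]]; the minimum value goes in bin 0.
        k = max(0, bisect_left(edges, r[j]) - 1)
        lo, hi = list(r), list(r)
        lo[j], hi[j] = edges[k], edges[k + 1]
        # Local effect: only feature j moves, and only within its own bin,
        # so the model is never evaluated far from the observed data.
        sums[k] += predict(hi) - predict(lo)
        counts[k] += 1
    ale = [0.0]
    for s, c in zip(sums, counts):
        ale.append(ale[-1] + s / c)  # every bin holds its upper-edge point
    # Center so the count-weighted average over bin midpoints is zero.
    center = sum(c * (ale[k] + ale[k + 1]) / 2 for k, c in enumerate(counts)) / n
    return edges, [a - center for a in ale]
```

For a linear model such as `predict = lambda r: 2 * r[0] + 3 * r[1]`, the curve for feature 0 has slope 2 even when the two features are strongly dependent, which is the "no omitted-variable bias" property the abstract claims.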
High-Dimensional Adaptive Minimax Sparse Estimation With Interactions - IEEE Journals & Magazine
"High-dimensional linear regression with interaction effects is broadly applied in research fields such as bioinformatics and social science. In this paper, first, we investigate the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions. Here, we derive matching upper and lower bounds under three types of heredity conditions: strong heredity, weak heredity, and no heredity. From the results: 1) A stronger heredity condition may or may not drastically improve the minimax rate of convergence. In fact, in some situations, the minimax rates of convergence are the same under all three heredity conditions; 2) The minimax rate of convergence is determined by the maximum of the total price of estimating the main effects and that of estimating the interaction effects, which goes beyond purely comparing the order of the number of non-zero main effects r1 and non-zero interaction effects r2 ; and 3) Under any of the three heredity conditions, the estimation of the interaction terms may be the dominant part in determining the rate of convergence. This is due to either the dominant number of interaction effects over main effects or the higher interaction estimation price induced by a large ambient dimension. Second, we construct an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices r1,r2 ."
to:NB  statistics  high-dimensional_statistics  regression  sparsity  variable_selection  linear_regression
4 days ago by cshalizi
[1507.03133] Best Subset Selection via a Modern Optimization Lens
"In the last twenty-five years (1990-2014), algorithmic advances in integer optimization combined with hardware improvements have resulted in an astonishing 200 billion factor speedup in solving Mixed Integer Optimization (MIO) problems. We present a MIO approach for solving the classical best subset selection problem of choosing k out of p features in linear regression given n observations. We develop a discrete extension of modern first order continuous optimization methods to find high quality feasible solutions that we use as warm starts to a MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with n in the 1000s and p in the 100s in minutes to provable optimality, and finds near optimal solutions for n in the 100s and p in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than {\texttt {Lasso}} and other popularly used sparse learning procedures, in terms of achieving sparse solutions with good predictive power."
to:NB  optimization  variable_selection  statistics  via:tslumley
4 days ago by cshalizi
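One common form of a "discrete first-order method" for this problem is projected gradient descent with hard thresholding: take a gradient step on the least-squares loss, then keep only the k largest coefficients. The sketch below assumes that form; whether it matches the paper's algorithm in detail is an assumption, and the names `hard_threshold` and `best_subset_warm_start` are hypothetical. It produces only a k-sparse feasible solution of the kind a MIO solver could take as a warm start, with no optimality certificate.

```python
def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero the rest."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def best_subset_warm_start(X, y, k, n_iter=500):
    """Heuristic k-sparse least-squares fit via thresholded gradient steps.

    X is a list of n rows (each a list of p floats), y a list of n floats.
    """
    n, p = len(X), len(X[0])
    # The squared Frobenius norm of X upper-bounds the largest eigenvalue of
    # X^T X, so 1/L is a safe (if conservative) step size. Assumes X != 0.
    L = sum(v * v for row in X for v in row)
    beta = [0.0] * p
    for _ in range(n_iter):
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        beta = hard_threshold([beta[j] - grad[j] / L for j in range(p)], k)
    return beta
```

With an identity design and `y = [3.0, 0.1, -2.0, 0.05]`, asking for `k=2` returns coefficients supported on positions 0 and 2.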
[1907.07384] Feature Selection via Mutual Information: New Theoretical Insights
"Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevancy of a subset of features in predicting the target variable and the redundancy with respect to other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging on these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and compare to common heuristic methods."
in_NB  variable_selection  information_theory  statistics  to_teach:data-mining
5 weeks ago by cshalizi
[1906.01990] A Model-free Approach to Linear Least Squares Regression with Exact Probabilities and Applications to Covariate Selection
"The classical model for linear regression is ${\mathbold Y}={\mathbold x}{\mathbold \beta} +\sigma{\mathbold \varepsilon}$ with i.i.d. standard Gaussian errors. Much of the resulting statistical inference is based on Fisher's F-distribution. In this paper we give two approaches to least squares regression which are model free. The results hold forall data $({\mathbold y},{\mathbold x})$. The derived probabilities are not only exact, they agree with those using the F-distribution based on the classical model. This is achieved by replacing questions about the size of βj, for example βj=0, by questions about the degree to which the covariate ${\mathbold x}_j$ is better than Gaussian white noise or, alternatively, a random orthogonal rotation of ${\mathbold x}_j$. The idea can be extended to choice of covariates, post selection inference PoSI, step-wise choice of covariates, the determination of dependency graphs and to robust regression and non-linear regression. In the latter two cases the probabilities are no longer exact but are based on the chi-squared distribution. The step-wise choice of covariates is of particular interest: it is a very simple, very fast, very powerful, it controls the number of false positives and does not over fit even in the case where the number of covariates far exceeds the sample size"
in_NB  linear_regression  regression  statistics  to_be_shot_after_a_fair_trial  variable_selection
11 weeks ago by cshalizi
[1811.00645] The Holdout Randomization Test: Principled and Easy Black Box Feature Selection
"We consider the problem of feature selection using black box predictive models. For example, high-throughput devices in science are routinely used to gather thousands of features for each sample in an experiment. The scientist must then sift through the many candidate features to find explanatory signals in the data, such as which genes are associated with sensitivity to a prospective therapy. Often, predictive models are used for this task: the model is fit, error on held out data is measured, and strong performing models are assumed to have discovered some fundamental properties of the system. A model-specific heuristic is then used to inspect the model parameters and rank important features, with top features reported as "discoveries." However, such heuristics provide no statistical guarantees and can produce unreliable results. We propose the holdout randomization test (HRT) as a principled approach to feature selection using black box predictive models. The HRT is model agnostic and produces a valid p-value for each feature, enabling control over the false discovery rate (or Type I error) for any predictive model. Further, the HRT is computationally efficient and, in simulations, has greater power than a competing knockoffs-based approach."
in_NB  cross-validation  variable_selection  statistics  blei.david  have_read
12 weeks ago by cshalizi
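The abstract gives no algorithmic detail, so the following is only a sketch under stated assumptions: score an already-fitted model on held-out data, replace one feature with null draws, and report how often the null loss is at least as good as the observed loss. The sketch ASSUMES feature j is independent of the other features, so that shuffling its held-out column can stand in for sampling from its conditional distribution; with dependent features a real conditional model would be needed. The name `hrt_p_value` and its parameters are hypothetical.

```python
import random


def hrt_p_value(predict, X_hold, y_hold, j, n_draws=99, seed=0):
    """Randomization p-value for feature j, using held-out data only.

    `predict` is an already-fitted model mapping one row (a list) to a number.
    """
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y_hold)) / len(rows)

    observed = mse(X_hold)
    column = [r[j] for r in X_hold]
    as_good = 0
    for _ in range(n_draws):
        draw = column[:]
        # ASSUMPTION: feature j is independent of the rest, so a permutation
        # of its column is a valid draw from the null.
        rng.shuffle(draw)
        null_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(X_hold, draw)]
        if mse(null_rows) <= observed:
            as_good += 1
    # The +1 terms count the observed data as one of the draws.
    return (1 + as_good) / (1 + n_draws)
```

A feature the model ignores gets a p-value of exactly 1, since shuffling it never changes the held-out loss; the model is never refit, which is where the computational saving comes from.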
[1905.10573] Selective inference after variable selection via multiscale bootstrap
"A general resampling approach is considered for selective inference problem after variable selection in regression analysis. Even after variable selection, it is important to know whether the selected variables are actually useful by showing p-values and confidence intervals of regression coefficients. In the classical approach, significance levels for the selected variables are usually computed by t-test but they are subject to selection bias. In order to adjust the bias in this post-selection inference, most existing studies of selective inference consider the specific variable selection algorithm such as Lasso for which the selection event can be explicitly represented as a simple region in the space of the response variable. Thus, the existing approach cannot handle more complicated algorithm such as MCP (minimax concave penalty). Moreover, most existing approaches set an event, that a specific model is selected, as the selection event. This selection event is too restrictive and may reduce the statistical power, because the hypothesis selection with a specific variable only depends on whether the variable is selected or not. In this study, we consider more appropriate selection event such that the variable is selected, and propose a new bootstrap method to compute an approximately unbiased selective p-value for the selected variable. Our method is applicable to a wide class of variable selection algorithms. In addition, the computational cost of our method is the same order as the classical bootstrap method. Through the numerical experiments, we show the usefulness of our selective inference approach."

--- As always, why not just use data-splitting? (They may have an answer.)
in_NB  variable_selection  post-selection_inference  statistics
12 weeks ago by cshalizi
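For reference, the data-splitting alternative raised in the note above can be sketched as follows: choose the variable on one random half of the data, then test it on the other half, so the test never sees the data that drove the selection. The selection rule (largest absolute correlation) and the permutation test are illustrative assumptions, as are the names `corr` and `split_select_test`.

```python
import random


def corr(a, b):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5


def split_select_test(X, y, n_perm=199, seed=0):
    """Select a feature on one half; return it with a p-value from the other."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    half = len(idx) // 2
    sel, inf = idx[:half], idx[half:]
    y_sel = [y[i] for i in sel]
    y_inf = [y[i] for i in inf]
    # Selection step: uses only the first half.
    chosen = max(
        range(len(X[0])),
        key=lambda j: abs(corr([X[i][j] for i in sel], y_sel)),
    )
    # Inference step: uses only the second half, so no selection bias.
    x_inf = [X[i][chosen] for i in inf]
    observed = abs(corr(x_inf, y_inf))
    hits = 0
    for _ in range(n_perm):
        perm = y_inf[:]
        rng.shuffle(perm)
        if abs(corr(x_inf, perm)) >= observed:
            hits += 1
    return chosen, (1 + hits) / (1 + n_perm)
```

The price is that each step sees only half the sample, which is presumably the power loss that selective-inference methods aim to avoid.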
[1801.03896] Robust inference with knockoffs
"We consider the variable selection problem, which seeks to identify important variables influencing a response Y out of many candidate features X1,…,Xp. We wish to do so while offering finite-sample guarantees about the fraction of false positives - selected variables Xj that in fact have no effect on Y after the other features are known. When the number of features p is large (perhaps even larger than the sample size n), and we have no prior knowledge regarding the type of dependence between Y and X, the model-X knockoffs framework nonetheless allows us to select a model with a guaranteed bound on the false discovery rate, as long as the distribution of the feature vector X=(X1,…,Xp) is exactly known. This model selection procedure operates by constructing "knockoff copies'" of each of the p features, which are then used as a control group to ensure that the model selection algorithm is not choosing too many irrelevant features. In this work, we study the practical setting where the distribution of X could only be estimated, rather than known exactly, and the knockoff copies of the Xj's are therefore constructed somewhat incorrectly. Our results, which are free of any modeling assumption whatsoever, show that the resulting model selection procedure incurs an inflation of the false discovery rate that is proportional to our errors in estimating the distribution of each feature Xj conditional on the remaining features {Xk:k≠j}. The model-X knockoff framework is therefore robust to errors in the underlying assumptions on the distribution of X, making it an effective method for many practical applications, such as genome-wide association studies, where the underlying distribution on the features X1,…,Xp is estimated accurately but not known exactly."
in_NB  regression  variable_selection  statistics  samworth.richard_j.  knockoffs  to_teach:linear_models
september 2018 by cshalizi
Oracle M-Estimation for Time Series Models - Giurcanu - 2016 - Journal of Time Series Analysis - Wiley Online Library
"We propose a thresholding M-estimator for multivariate time series. Our proposed estimator has the oracle property that its large-sample properties are the same as of the classical M-estimator obtained under the a priori information that the zero parameters were known. We study the consistency of the standard block bootstrap, the centred block bootstrap and the empirical likelihood block bootstrap distributions of the proposed M-estimator. We develop automatic selection procedures for the thresholding parameter and for the block length of the bootstrap methods. We present the results of a simulation study of the proposed methods for a sparse vector autoregressive VAR(2) time series model. The analysis of two real-world data sets illustrate applications of the methods in practice."
bootstrap  time_series  statistics  estimation  in_NB  sparsity  variable_selection  high-dimensional_statistics
april 2017 by cshalizi
[0906.4391] KNIFE: Kernel Iterative Feature Extraction
"Selecting important features in non-linear or kernel spaces is a difficult challenge in both classification and regression problems. When many of the features are irrelevant, kernel methods such as the support vector machine and kernel ridge regression can sometimes perform poorly. We propose weighting the features within a kernel with a sparse set of weights that are estimated in conjunction with the original classification or regression problem. The iterative algorithm, KNIFE, alternates between finding the coefficients of the original problem and finding the feature weights through kernel linearization. In addition, a slight modification of KNIFE yields an efficient algorithm for finding feature regularization paths, or the paths of each feature's weight. Simulation results demonstrate the utility of KNIFE for both kernel regression and support vector machines with a variety of kernels. Feature path realizations also reveal important non-linear correlations among features that prove useful in determining a subset of significant variables. Results on vowel recognition data, Parkinson's disease data, and microarray data are also given."
in_NB  statistics  regression  variable_selection  data_mining  to_teach:data-mining  kernel_methods  heard_the_talk
november 2016 by cshalizi
[1507.05315] Confidence Sets Based on the Lasso Estimator
"In a linear regression model with fixed dimension, we construct confidence sets for the unknown parameter vector based on the Lasso estimator in finite samples as well as in an asymptotic setup, thereby quantifying estimation uncertainty of this estimator. In finite samples with Gaussian errors and asymptotically in the case where the Lasso estimator is tuned to perform conservative model-selection, we derive formulas for computing the minimal coverage probability over the entire parameter space for a large class of shapes for the confidence sets, thus enabling the construction of valid confidence sets based on the Lasso estimator in these settings. The choice of shape for the confidence sets and comparison with the confidence ellipse based on the least-squares estimator is also discussed. Moreover, in the case where the Lasso estimator is tuned to enable consistent model-selection, we give a simple confidence set with minimal coverage probability converging to one."
in_NB  lasso  regression  confidence_sets  model_selection  variable_selection  statistics
august 2015 by cshalizi
[1406.0052] Variable selection in high-dimensional additive models based on norms of projections
"We consider the problem of variable selection in high-dimensional sparse additive models. The proposed method is motivated by geometric considerations in Hilbert spaces, and consists in comparing the norms of the projections of the data on various additive subspaces. Our main results are concentration inequalities which lead to conditions making variable selection possible. In special cases these conditions are known to be optimal. As an application we consider the problem of estimating single components. We show that, up to first order, one can estimate a single component as well as if the other components were known."
july 2014 by cshalizi
[1404.2007] A Permutation Approach for Selecting the Penalty Parameter in Penalized Model Selection
"We describe a simple, efficient, permutation based procedure for selecting the penalty parameter in the LASSO. The procedure, which is intended for applications where variable selection is the primary focus, can be applied in a variety of structural settings, including generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of three real data sets in which permutation selection is compared with cross-validation (CV), the Bayesian information criterion (BIC), and a selection method based on recently developed testing procedures for the LASSO."
in_NB  variable_selection  model_selection  lasso  high-dimensional_statistics  nobel.andrew  statistics
april 2014 by cshalizi
[1403.7063] A Significance Test for Covariates in Nonparametric Regression
"We consider testing the significance of a subset of covariates in a nonparametric regression. These covariates can be continuous and/or discrete. We propose a new kernel-based test that smoothes only over the covariates appearing under the null hypothesis, so that the curse of dimensionality is mitigated. The test statistic is asymptotically pivotal and the rate of which the test detects local alternatives depends only on the dimension of the covariates under the null hypothesis. We show the validity of wild bootstrap for the test. In small samples, our test is competitive compared to existing procedures."
april 2014 by cshalizi
[1403.7023] Worst possible sub-directions in high-dimensional models
"We examine the rate of convergence of the Lasso estimator of lower dimensional components of the high-dimensional parameter. Under bounds on the ℓ1-norm on the worst possible sub-direction these rates are of order |J|logp/n‾‾‾‾‾‾‾‾‾√ where p is the total number of parameters, J⊂{1,…,p} represents a subset of the parameters and n is the number of observations. We also derive rates in sup-norm in terms of the rate of convergence in ℓ1-norm. The irrepresentable condition on a set J requires that the ℓ1-norm of the worst possible sub-direction is sufficiently smaller than one. In that case sharp oracle results can be obtained. Moreover, if the coefficients in J are small enough the Lasso will put these coefficients to zero. This extends known results which say that the irrepresentable condition on the inactive set (the set where coefficients are exactly zero) implies no false positives. We further show that by de-sparsifying one obtains fast rates in supremum norm without conditions on the worst possible sub-direction. The main assumption here is that approximate sparsity is of order o(n‾‾√/logp). The results are extended to M-estimation with ℓ1-penalty for generalized linear models and exponential families for example. For the graphical Lasso this leads to an extension of known results to the case where the precision matrix is only approximately sparse. The bounds we provide are non-asymptotic but we also present asymptotic formulations for ease of interpretation."
to:NB  high-dimensional_statistics  lasso  sparsity  variable_selection  statistics  van_de_geer.sara
april 2014 by cshalizi
[1403.4296] Inference for feature selection using the Lasso with high-dimensional data
"Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the numbers of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference of the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute p-values of the expected magnitude with simulated data using a multitude of scenarios that involve various effects strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method is found to provide a powerful way to make inference for feature selection even for small samples and when the number of predictors are several orders of magnitude larger than the number of observations. The algorithm is implemented in the MESS package in R and is freely available."
in_NB  lasso  regression  variable_selection  re:what_is_the_right_null_model_for_linear_regression  high-dimensional_statistics
march 2014 by cshalizi
[1403.4544] On the Sensitivity of the Lasso to the Number of Predictor Variables
"The Lasso is a computationally efficient procedure that can produce sparse estimators when the number of predictors (p) is large. Oracle inequalities provide probability loss bounds for the Lasso estimator at a deterministic choice of the regularization parameter. These bounds tend to zero if p is appropriately controlled, and are thus commonly cited as theoretical justification for the Lasso and its ability to handle high-dimensional settings. Unfortunately, in practice the regularization parameter is not selected to be a deterministic quantity, but is instead chosen using a random, data-dependent procedure. To address this shortcoming of previous theoretical work, we study the loss of the Lasso estimator when tuned optimally for prediction. Assuming orthonormal predictors and a sparse true model, we prove that the probability that the best possible predictive performance of the Lasso deteriorates as p increases can be arbitrarily close to one given a sufficiently high signal to noise ratio and sufficiently large p. We further demonstrate empirically that the deterioration in performance can be far worse than is commonly suggested in the literature and provide a real data example where deterioration is observed."
in_NB  lasso  regression  variable_selection  high-dimensional_statistics  cross-validation  statistics
march 2014 by cshalizi
[1401.8097] An Algorithm for Nonlinear, Nonparametric Model Choice and Prediction
"We introduce an algorithm which, in the context of nonlinear regression on vector-valued explanatory variables, chooses those combinations of vector components that provide best prediction. The algorithm devotes particular attention to components that might be of relatively little predictive value by themselves, and so might be ignored by more conventional methodology for model choice, but which, in combination with other difficult-to-find components, can be particularly beneficial for prediction. Additionally the algorithm avoids choosing vector components that become redundant once appropriate combinations of other, more relevant components are selected. It is suitable for very high dimensional problems, where it keeps computational labour in check by using a novel sequential argument, and also for more conventional prediction problems, where dimension is relatively low. We explore properties of the algorithm using both theoretical and numerical arguments."
in_NB  model_selection  regression  nonparametrics  variable_selection  statistics  hall.peter
february 2014 by cshalizi
[1312.1706] Swapping Variables for High-Dimensional Sparse Regression from Correlated Measurements
"We consider the high-dimensional sparse linear regression problem of accurately estimating a sparse vector using a small number of linear measurements that are contaminated by noise. It is well known that standard computationally tractable sparse regression algorithms, such as the Lasso, OMP, and their various extensions, perform poorly when the measurement matrix contains highly correlated columns. We develop a simple greedy algorithm, called SWAP, that iteratively swaps variables until a desired loss function cannot be decreased any further. SWAP is surprisingly effective in handling measurement matrices with high correlations. In particular, we prove that (i) SWAP outputs the true support, the location of the non-zero entries in the sparse vector, when initialized with the true support, and (ii) SWAP outputs the true support under a relatively mild condition on the measurement matrix when initialized with a support other than the true support. These theoretical results motivate the use of SWAP as a wrapper around various sparse regression algorithms for improved performance. We empirically show the advantages of using SWAP in sparse regression problems by comparing SWAP to several state-of-the-art sparse regression algorithms."
to:NB  high-dimensional_statistics  lasso  sparsity  variable_selection  vats.divyanshu  statistics
january 2014 by cshalizi
[1312.1473] Oracle Properties and Finite Sample Inference of the Adaptive Lasso for Time Series Regression Models
"We derive new theoretical results on the properties of the adaptive least absolute shrinkage and selection operator (adaptive lasso) for time series regression models. In particular, we investigate the question of how to conduct finite sample inference on the parameters given an adaptive lasso model for some fixed value of the shrinkage parameter. Central in this study is the test of the hypothesis that a given adaptive lasso parameter equals zero, which therefore tests for a false positive. To this end we construct a simple testing procedure and show, theoretically and empirically through extensive Monte Carlo simulations, that the adaptive lasso combines efficient parameter estimation, variable selection, and valid finite sample inference in one step. Moreover, we analytically derive a bias correction factor that is able to significantly improve the empirical coverage of the test on the active variables. Finally, we apply the introduced testing procedure to investigate the relation between the short rate dynamics and the economy, thereby providing a statistical foundation (from a model choice perspective) to the classic Taylor rule monetary policy model."
in_NB  lasso  time_series  variable_selection  statistics  re:your_favorite_dsge_sucks
december 2013 by cshalizi
[1312.5556] Hierarchical Testing in the High-Dimensional Setting with Correlated Variables
"We propose a method for testing whether hierarchically ordered groups of potentially correlated variables are significant for explaining a response in a high-dimensional linear model. In presence of highly correlated variables, as is very common in high-dimensional data, it seems indispensable to go beyond an approach of inferring individual regression coefficients. Thanks to the hierarchy among the groups of variables, powerful multiple testing adjustment is possible which leads to a data-driven choice of the resolution level for the groups. Our procedure, based on repeated sample splitting, is shown to asymptotically control the familywise error rate and we provide empirical results for simulated and real data which complement the theoretical analysis."
to:NB  to_read  high-dimensional_statistics  variable_selection  buhlmann.peter  hierarchical_statistical_models  hierarchical_structure
december 2013 by cshalizi
[1310.4887] Variable Selection Inference for Bayesian Additive Regression Trees
"The variable selection problem is especially challenging in high dimensional data, where it is difficult to detect subtle individual effects and interactions between factors. Bayesian additive regression trees (BART, Chipman et al., 2010) provides a novel nonparametric exploratory alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. To move from the exploratory to the confirmatory, we here provide a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably with lasso regression and random forests adapted for variable selection in a variety of data settings. To demonstrate the potential of our approach, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). In this application, our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods."
in_NB  statistics  high-dimensional_statistics  variable_selection  regression  kith_and_kin  jensen.shane  george.ed  gene_expression_data_analysis
october 2013 by cshalizi
[1309.2068] Modified Cross-Validation for Penalized High-Dimensional Linear Regression Models
"In this paper, for Lasso penalized linear regression models in high-dimensional settings, we propose a modified cross-validation method for selecting the penalty parameter. The methodology is extended to other penalties, such as Elastic Net. We conduct extensive simulation studies and real data analysis to compare the performance of the modified cross-validation method with other methods. It is shown that the popular $K$-fold cross-validation method includes many noise variables in the selected model, while the modified cross-validation works well in a wide range of coefficient and correlation settings. Supplemental materials containing the computer code are available online."
in_NB  cross-validation  lasso  regression  statistics  high-dimensional_statistics  variable_selection
september 2013 by cshalizi
[1306.6557] Optimal Feature Selection in High-Dimensional Discriminant Analysis
"We consider the high-dimensional discriminant analysis problem. For this problem, different methods have been proposed and justified by establishing exact convergence rates for the classification risk, as well as the l2 convergence results to the discriminative rule. However, sharp theoretical analysis for the variable selection performance of these procedures have not been established, even though model interpretation is of fundamental importance in scientific data analysis. This paper bridges the gap by providing sharp sufficient conditions for consistent variable selection using the sparse discriminant analysis (Mai et al., 2012). Through careful analysis, we establish rates of convergence that are significantly faster than the best known results and admit an optimal scaling of the sample size n, dimensionality p, and sparsity level s in the high-dimensional setting. Sufficient conditions are complemented by the necessary information theoretic limits on the variable selection problem in the context of high-dimensional discriminant analysis. Exploiting a numerical equivalence result, our method also establish the optimal results for the ROAD estimator (Fan et al., 2012) and the sparse optimal scaling estimator (Clemmensen et al., 2011). Furthermore, we analyze an exhaustive search procedure, whose performance serves as a benchmark, and show that it is variable selection consistent under weaker conditions. Extensive simulations demonstrating the sharpness of the bounds are also provided."
in_NB  classifiers  high-dimensional_statistics  sparsity  variable_selection  statistics  liu.han  kolar.mladen
june 2013 by cshalizi
[1306.5505] Asymptotic Properties of Lasso+mLS and Lasso+Ridge in Sparse High-dimensional Linear Regression
"We study the asymptotic properties of Lasso+mLS and Lasso+Ridge under the sparse high-dimensional linear regression model: Lasso selecting predictors and then modified Least Squares (mLS) or Ridge estimating their coefficients. First, we propose a valid inference procedure for parameter estimation based on parametric residual bootstrap after Lasso+mLS and Lasso+Ridge. Second, we derive the asymptotic unbiasedness of Lasso+mLS and Lasso+Ridge. More specifically, we show that their biases decay at an exponential rate and they can achieve the oracle convergence rate of $s/n$ (where $s$ is the number of nonzero regression coefficients and $n$ is the sample size) for mean squared error (MSE). Third, we show that Lasso+mLS and Lasso+Ridge are asymptotically normal. They have an oracle property in the sense that they can select the true predictors with probability converging to 1 and the estimates of nonzero parameters have the same asymptotic normal distribution that they would have if the zero parameters were known in advance. In fact, our analysis is not limited to adopting Lasso in the selection stage, but is applicable to any other model selection criteria with exponentially decay rates of the probability of selecting wrong models."
to:NB  lasso  regression  variable_selection  high-dimensional_statistics  statistics  estimation  yu.bin
june 2013 by cshalizi
[1304.5678] Analytic Feature Selection for Support Vector Machines
"Support vector machines (SVMs) rely on the inherent geometry of a data set to classify training data. Because of this, we believe SVMs are an excellent candidate to guide the development of an analytic feature selection algorithm, as opposed to the more commonly used heuristic methods. We propose a filter-based feature selection algorithm based on the inherent geometry of a feature set. Through observation, we identified six geometric properties that differ between optimal and suboptimal feature sets, and have statistically significant correlations to classifier performance. Our algorithm is based on logistic and linear regression models using these six geometric properties as predictor variables. The proposed algorithm achieves excellent results on high dimensional text data sets, with features that can be organized into a handful of feature types; for example, unigrams, bigrams or semantic structural features. We believe this algorithm is a novel and effective approach to solving the feature selection problem for linear SVMs."
to:NB  variable_selection  data_mining  to_teach:data-mining  text_mining  classifiers
april 2013 by cshalizi
[1304.5245] Feature Elimination in empirical risk minimization and support vector machines
"We develop an approach for feature elimination in empirical risk minimization and support vector machines, based on recursive elimination of features. We present theoretical properties of this method and show that this is uniformly consistent in finding the correct feature space under certain generalized assumptions. We present case studies to show that the assumptions are met in most practical situations and also present simulation studies to demonstrate performance of the proposed approach."
to:NB  variable_selection  classifiers  learning_theory
april 2013 by cshalizi
[0906.4391] KNIFE: Kernel Iterative Feature Extraction
"Selecting important features in non-linear or kernel spaces is a difficult challenge in both classification and regression problems. When many of the features are irrelevant, kernel methods such as the support vector machine and kernel ridge regression can sometimes perform poorly. We propose weighting the features within a kernel with a sparse set of weights that are estimated in conjunction with the original classification or regression problem. The iterative algorithm, KNIFE, alternates between finding the coefficients of the original problem and finding the feature weights through kernel linearization. In addition, a slight modification of KNIFE yields an efficient algorithm for finding feature regularization paths, or the paths of each feature's weight. Simulation results demonstrate the utility of KNIFE for both kernel regression and support vector machines with a variety of kernels. Feature path realizations also reveal important non-linear correlations among features that prove useful in determining a subset of significant variables. Results on vowel recognition data, Parkinson's disease data, and microarray data are also given."

to_teach tags are tentative
november 2012 by cshalizi
Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR
"Variable selection can be challenging, particularly in situations with a large number of predic- tors with possibly high correlations, such as gene expression data. In this article, a new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters. In addition to improving prediction accuracy and interpretation, these resulting groups can then be investigated further to discover what contributes to the group having a similar behavior. The technique is based on penalized least squares with a geometrically in- tuitive penalty function that shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably to the existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information."
to:NB  regression  variable_selection  statistics  via:ryantibs
september 2012 by cshalizi
[1208.2572] Nonparametric sparsity and regularization
"In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods."
to:NB  to_read  nonparametrics  regression  variable_selection  sparsity  statistics
september 2012 by cshalizi
[1205.6843] Significance Testing and Group Variable Selection
"Let X; Z be r and s-dimensional covariates, respectively, used to model the response variable Y as Y = m(X;Z) + sigma(X;Z)epsilon. We develop an ANOVA-type test for the null hypothesis that Z has no influence on the regression function, based on residuals obtained from local polynomial ?fitting of the null model. Using p-values from this test, a group variable selection method based on multiple testing ideas is proposed. Simulations studies suggest that the proposed test procedure outperforms the generalized likelihood ratio test when the alternative is non-additive or there is heteroscedasticity. Additional simulation studies, with data generated from linear, non-linear and logistic regression, reveal that the proposed group variable selection procedure performs competitively against Group Lasso, and outperforms it in selecting groups having nonlinear effects. The proposed group variable selection procedure is illustrated on a real data set."
june 2012 by cshalizi
[1206.2696] Flexible Variable Selection for Recovering Sparsity in Nonadditive Nonparametric Models
"Variable selection for recovering sparsity in nonadditive nonparametric models has been challenging. This problem becomes even more difficult due to complications in modeling unknown interaction terms among high dimensional variables. There is currently no variable selection method to overcome these limitations. Hence, in this paper we propose a variable selection approach that is developed by connecting a kernel machine with the nonparametric multiple regression model. The advantages of our approach are that it can: (1) recover the sparsity, (2) automatically model unknown and complicated interactions, (3) connect with several existing approaches including linear nonnegative garrote, kernel learning and automatic relevant determinants (ARD), and (4) provide flexibility for both additive and nonadditive nonparametric models. Our approach may be viewed as a nonlinear version of a nonnegative garrote method. We model the smoothing function by a least squares kernel machine and construct the nonnegative garrote objective function as the function of the similarity matrix. Since the multiple regression similarity matrix can be written as an additive form of univariate similarity matrices corresponding to input variables, applying a sparse scale parameter on each univariate similarity matrix can reveal its relevance to the response variable. We also derive the asymptotic properties of our approach, and show that it provides a square root consistent estimator of the scale parameters. Furthermore, we prove that sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions and give the necessary and sufficient conditions for sparsistency. An efficient coordinate descent/backfitting algorithm is developed. A resampling procedure for our variable selection methodology is also proposed to improve power."

to_teach tag is tentative; I do a lot with additive models, and this might be worth mentioning if it's good.
june 2012 by cshalizi
A Confidence Region Approach to Tuning for Variable Selection - Journal of Computational and Graphical Statistics - Volume 21, Issue 2
"We develop an approach to tuning of penalized regression variable selection methods by calculating the sparsest estimator contained in a confidence region of a specified level. Because confidence intervals/regions are generally understood, tuning penalized regression methods in this way is intuitive and more easily understood by scientists and practitioners. More importantly, our work shows that tuning to a fixed confidence level often performs better than tuning via the common methods based on Akaike information criterion (AIC), Bayesian information criterion (BIC), or cross-validation (CV) over a wide range of sample sizes and levels of sparsity. Additionally, we prove that by tuning with a sequence of confidence levels converging to one, asymptotic selection consistency is obtained, and with a simple two-stage procedure, an oracle property is achieved. The confidence-region-based tuning parameter is easily calculated using output from existing penalized regression computer packages. Our work also shows how to map any penalty parameter to a corresponding confidence coefficient. This mapping facilitates comparisons of tuning parameter selection methods such as AIC, BIC, and CV, and reveals that the resulting tuning parameters correspond to confidence levels that are extremely low, and can vary greatly across datasets. Supplemental materials for the article are available online."
to:NB  variable_selection  regression  statistics  confidence_sets  lasso
june 2012 by cshalizi
Variable selection with error control: another look at stability selection - Shah - 2012 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library
"Stability selection was recently introduced by Meinshausen and Bühlmann as a very general technique designed to improve the performance of a variable selection algorithm. It is based on aggregating the results of applying a selection procedure to subsamples of the data. We introduce a variant, called complementary pairs stability selection, and derive bounds both on the expected number of variables included by complementary pairs stability selection that have low selection probability under the original procedure, and on the expected number of high selection probability variables that are excluded. These results require no (e.g. exchangeability) assumptions on the underlying model or on the quality of the original selection procedure. Under reasonable shape restrictions, the bounds can be further tightened, yielding improved error control, and therefore increasing the applicability of the methodology."
to:NB  variable_selection  statistics
june 2012 by cshalizi
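A rough sketch of the complementary-pairs idea from the stability selection entry above: repeatedly split the data into two disjoint halves, run a base selector on each half, and keep the variables chosen in a large fraction of the subsamples. The base selector here (top-k absolute marginal correlation), the number of pairs, and the threshold `tau` are all illustrative assumptions, not choices taken from the paper, and the function names are hypothetical.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def top_k_by_correlation(X, y, k):
    """Assumed base selector: the k columns most correlated with y."""
    p = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(p)]
    return set(sorted(range(p), key=lambda j: -scores[j])[:k])


def cpss(X, y, k, n_pairs=20, tau=0.9, seed=0):
    """Complementary-pairs stability selection (sketch).

    Returns the set of variables whose selection frequency over the
    2 * n_pairs half-samples is at least tau, plus all frequencies.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    half = n // 2
    counts = [0] * p
    for _ in range(n_pairs):
        perm = list(range(n))
        rng.shuffle(perm)
        # Two disjoint ("complementary") halves of the same shuffle.
        for idx in (perm[:half], perm[half:2 * half]):
            chosen = top_k_by_correlation(
                [X[i] for i in idx], [y[i] for i in idx], k
            )
            for j in chosen:
                counts[j] += 1
    freq = [c / (2 * n_pairs) for c in counts]
    return {j for j in range(p) if freq[j] >= tau}, freq
```

Any selection procedure could stand in for `top_k_by_correlation`; the aggregation step does not depend on how the base selector works.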
[1206.4682] Copula-based Kernel Dependency Measures
"The paper presents a new copula based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions."
in_NB  information_theory  entropy_estimation  poczos.barnabas  variable_selection  machine_learning  copulas  kernel_methods
june 2012 by cshalizi
[1206.4680] Fast Prediction of New Feature Utility
"We study the new feature utility prediction problem: statistically testing whether adding a new feature to the data representation can improve predictive accuracy on a supervised learning task. In many applications, identifying new informative features is the primary pathway for improving performance. However, evaluating every potential feature by re-training the predictor with it can be costly. The paper describes an efficient, learner-independent technique for estimating new feature utility without re-training based on the current predictor's outputs. The method is obtained by deriving a connection between loss reduction potential and the new feature's correlation with the loss gradient of the current predictor. This leads to a simple yet powerful hypothesis testing procedure, for which we prove consistency. Our theoretical analysis is accompanied by empirical evaluation on standard benchmarks and a large-scale industrial dataset."
in_NB  machine_learning  prediction  regression  classifiers  variable_selection  have_read
june 2012 by cshalizi
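A minimal sketch of the idea in the feature-utility entry above: score a candidate feature by its correlation with the loss gradient of the current predictor, without re-training. It assumes squared-error loss, so the gradient with respect to the prediction is just prediction minus target, and it uses a plain permutation test for the p-value. The permutation test is a stand-in chosen for simplicity, not the paper's test, and `new_feature_utility` is a hypothetical name.

```python
import random


def correlation(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa == 0 or sbb == 0:
        return 0.0
    return sab / (saa * sbb) ** 0.5


def new_feature_utility(new_feature, predictions, targets, n_perm=200, seed=0):
    """Return (|correlation with loss gradient|, permutation p-value).

    Assumes squared-error loss 0.5 * (pred - y)**2, whose gradient with
    respect to the prediction is (pred - y).
    """
    grad = [p - t for p, t in zip(predictions, targets)]
    observed = abs(correlation(new_feature, grad))
    rng = random.Random(seed)
    shuffled = list(new_feature)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between feature and gradient
        if abs(correlation(shuffled, grad)) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

A small p-value suggests the current predictor's errors are systematically related to the candidate feature, so adding it may reduce the loss.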
[0801.1158] Hierarchical selection of variables in sparse high-dimensional regression
"We study a regression model with a huge number of interacting variables. We consider a specific approximation of the regression function under two ssumptions: (i) there exists a sparse representation of the regression function in a suggested basis, (ii) there are no interactions outside of the set of the corresponding main effects. We suggest an hierarchical randomized search procedure for selection of variables and of their interactions. We show that given an initial estimator, an estimator with a similar prediction loss but with a smaller number of non-zero coordinates can be found."
to:NB  variable_selection  high-dimensional_statistics  regression  statistics  bickel.peter_j.  re:what_is_the_right_null_model_for_linear_regression
june 2012 by cshalizi
[1205.6761] Nonparametric Model Checking and Variable Selection
"Let X be a d dimensional vector of covariates and Y be the response variable. Under the nonparametric model Y = m(X) + {sigma}(X) in we develop an ANOVA-type test for the null hypothesis that a particular coordinate of X has no influence on the regression function. The asymptotic distribution of the test statistic, using residuals based on Nadaraya-Watson type kernel estimator and d leq 4, is established under the null hypothesis and local alternatives. Simulations suggest that under a sparse model, the applicability of the test extends to arbitrary d through sufficient dimension reduction. Using p-values from this test, a variable selection method based on multiple testing ideas is proposed. The proposed test outperforms existing procedures, while additional simulations reveal that the proposed variable selection method performs competitively against well established procedures. A real data set is analyzed."
in_NB  variable_selection  regression  nonparametrics  statistics
june 2012 by cshalizi
Feature Selection via Dependence Maximization
"We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice."
to:NB  variable_selection  hilbert_space  machine_learning  information_theory
june 2012 by cshalizi
[1206.1024] Conditional Sure Independence Screening
"Independence screening is a powerful method for variable selection for `Big Data' when the number of variables is massive. Commonly used independence screening methods are based on marginal correlations or variations of it. In many applications, researchers often have some prior knowledge that a certain set of variables is related to the response. In such a situation, a natural assessment on the relative importance of the other predictors is the conditional contributions of the individual predictors in presence of the known set of variables. This results in conditional sure independence screening (CSIS). Conditioning helps for reducing the false positive and the false negative rates in the variable selection process. In this paper, we propose and study CSIS in the context of generalized linear models. For ultrahigh-dimensional statistical problems, we give conditions under which sure screening is possible and derive an upper bound on the number of selected variables. We also spell out the situation under which CSIS yields model selection consistency. Moreover, we provide two data-driven methods to select the thresholding parameter of conditional screening. The utility of the procedure is illustrated by simulation studies and analysis of two real data sets."
to:NB  variable_selection  high-dimensional_statistics  statistics
june 2012 by cshalizi
[0805.1179] Autoregressive Process Modeling via the Lasso Procedure
"The Lasso is a popular model selection and estimation procedure for linear models that enjoys nice theoretical properties. In this paper, we study the Lasso estimator for fitting autoregressive time series models. We adopt a double asymptotic framework where the maximal lag may increase with the sample size. We derive theoretical results establishing various types of consistency. In particular, we derive conditions under which the Lasso estimator for the autoregressive coefficients is model selection consistent, estimation consistent and prediction consistent. Simulation study results are reported."
in_NB  time_series  statistics  lasso  sparsity  variable_selection  kith_and_kin  heard_the_talk  rinaldo.alessandro  nardi.yuval
march 2012 by cshalizi
[1102.3616] Tight conditions for consistent variable selection in high dimensional nonparametric regression
"We address the issue of variable selection in the regression model with very high ambient dimension, i.e., when the number of covariates is very large. The main focus is on the situation where the number of relevant covariates, called intrinsic dimension, is much smaller than the ambient dimension. Without assuming any parametric form of the underlying regression function, we get tight conditions making it possible to consistently estimate the set of relevant variables. These conditions relate the intrinsic dimension to the ambient dimension and to the sample size. The procedure that is provably consistent under these tight conditions is simple and is based on comparing the empirical Fourier coefficients with an appropriately chosen threshold value."
in_NB  regression  variable_selection  nonparametrics  statistics
february 2012 by cshalizi
Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
"We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this, we adopt a different strategy than is usual in the feature selection literature−instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature 'relevancy' and 'redundancy', our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples."
in_NB  information_theory  statistics  variable_selection  model_selection  to_teach:data-mining  to:blog  machine_learning  classifiers  have_read  graphical_models
february 2012 by cshalizi
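For the JMI criterion that the conditional-likelihood entry above singles out, a small greedy sketch may help. It assumes discrete features and labels, uses plug-in (empirical frequency) mutual information, picks the first feature by plain mutual information with the labels, and then scores each remaining candidate by the sum, over already-selected features, of the joint mutual information between the (candidate, selected) pair and the labels. Function names are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def jmi_select(columns, labels, k):
    """Greedy forward selection of k feature indices under the JMI score."""
    remaining = list(range(len(columns)))
    # First pick: the single most informative feature (ties go to lowest index).
    first = max(remaining, key=lambda j: mutual_information(columns[j], labels))
    selected = [first]
    remaining.remove(first)
    while len(selected) < k and remaining:
        def jmi_score(j):
            # Sum over selected s of I((X_j, X_s); Y), treating the pair as
            # one joint discrete variable.
            return sum(
                mutual_information(list(zip(columns[j], columns[s])), labels)
                for s in selected
            )
        best = max(remaining, key=jmi_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because a duplicate of an already-selected feature adds nothing to the joint term, this score tends to pass over redundant features that a ranking by individual mutual information would keep.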
Nonparametric estimation of the link function including variable selection - Gerhard Tutz and Sebastian Petry - Statistics and Computing, Volume 22, Number 2
"Nonparametric methods for the estimation of the link function in generalized linear models are able to avoid bias in the regression parameters. But for the estimation of the link typically the full model, which includes all predictors, has been used. When the number of predictors is large these methods fail since the full model cannot be estimated. In the present article a boosting type method is proposed that simultaneously selects predictors and estimates the link function. The method performs quite well in simulations and real data examples." (The "to teach" tag is conjectural.)
december 2011 by cshalizi
Variance estimation using refitted cross-validation in ultrahigh dimensional regression - Fan - 2011 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library
"Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noises are actually predicted when extra irrelevant variables are selected, leading to a serious underestimate of the level of noise. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows in advance the mean regression function. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared. Their performances can be improved by the refitted cross-validation method proposed."
statistics  regression  variable_selection  cross-validation  estimation  in_NB  fan.jianqing  variance_estimation
october 2011 by cshalizi
[1106.5242] High Dimensional Sparse Econometric Models: An Introduction
I love how they just flat-out identify "econometrics" with "linear regression with Gaussian noise"; but it looks like a clean exposition with proofs.
regression  lasso  variable_selection  econometrics
june 2011 by cshalizi
[1009.2302] The Predictive Lasso
"We propose a shrinkage procedure for simultaneous variable selection and estimation in generalized linear models (GLMs) with an explicit predictive motivation. The procedure estimates the coefficients by minimizing the Kullback-Leibler divergence of a set of predictive distributions to the corresponding predictive distributions for the full model, subject to an $l_1$ constraint on the coefficient vector. This results in selection of a parsimonious model with similar predictive performance to the full model. Thanks to its similar form to the original lasso problem for GLMs, our procedure can benefit from available $l_1$-regularization path algorithms. Simulation studies and real-data examples confirm the efficiency of our method in terms of predictive performance on future observations."
regression  lasso  variable_selection  sparsity  information_theory  statistics
september 2010 by cshalizi
"Partial Generalized Additive Models: An Information-Theoretic Approach for Dealing With Concurvity and Selecting Variables" (Gu, Kenny, Zhu, 2010)
"Scientists [want to know] which covariates are important, and how [they] affect the response variable, rather than just making predictions. ... Generalized additive models (GAMs) are a class of interpretable, multivariate nonparametric regression models which are very useful ... for these purposes, but concurvity among covariates (the nonlinear analogue of collinearity for linear regression) can ... produce unstable or even wrong estimates of the covariates’ functional effects. We develop a new procedure called partial generalized additive models (pGAM), based on mutual information ... Our procedure is similar in spirit to the Gram–Schmidt method for linear least squares. By building a GAM on a selected set of transformed variables, pGAM produces more stable models, selects variables parsimoniously, and provides insight into the nature of concurvity between the covariates by calculating functional dependencies among them. ... R code for fitting pGAMs is available online"
september 2010 by cshalizi
Penalized regression with correlation-based penalty
But do I _want_ to exclude _all_ of a bundle of correlated input variables from my regression? Surely it'd be better to include just _one_ of them...
regression  variable_selection  statistics
june 2009 by cshalizi
[0906.3590] Forest Garrote
We have got to do something about the names of techniques in this area. I don't mind the whimsy, it's just that combinations like this don't work, metaphorically.
ensemble_methods  classifiers  statistics  machine_learning  sparsity  variable_selection  lasso
june 2009 by cshalizi
[0901.3202] Model-Consistent Sparse Estimation through the Bootstrap
"if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection"
lasso  linear_regression  model_selection  variable_selection  bootstrap
january 2009 by cshalizi
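The quoted sentence in the bootstrap entry above describes a procedure concrete enough to sketch: fit the Lasso on several bootstrap resamples and intersect the supports. The version below is a minimal illustration using only the standard library. The coordinate-descent solver, the fixed penalty `lam`, the iteration count, and the number of resamples are assumptions made for the example, not settings from the paper.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via cyclic coordinate descent.

    Minimizes (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that excludes j.
            rho = sum(
                X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def bolasso_support(X, y, lam, n_boot=10, seed=0):
    """Intersect the Lasso supports over n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    support = set(range(len(X[0])))
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        support &= {j for j, b in enumerate(beta) if b != 0.0}
    return support
```

The intersection only ever shrinks, so a noise variable has to be selected in every resample to survive, while a variable with a strong effect is selected each time.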
