cshalizi + sparsity   142

Testing Sparsity-Inducing Penalties: Journal of Computational and Graphical Statistics: Vol 0, No 0
"Many penalized maximum likelihood estimators correspond to posterior mode estimators under specific prior distributions. Appropriateness of a particular class of penalty functions can therefore be interpreted as the appropriateness of a prior for the parameters. For example, the appropriateness of a lasso penalty for regression coefficients depends on the extent to which the empirical distribution of the regression coefficients resembles a Laplace distribution. We give a testing procedure of whether or not a Laplace prior is appropriate and accordingly, whether or not using a lasso penalized estimate is appropriate. This testing procedure is designed to have power against exponential power priors which correspond to ℓqℓq penalties. Via simulations, we show that this testing procedure achieves the desired level and has enough power to detect violations of the Laplace assumption when the numbers of observations and unknown regression coefficients are large. We then introduce an adaptive procedure that chooses a more appropriate prior and corresponding penalty from the class of exponential power priors when the null hypothesis is rejected. We show that this can improve estimation of the regression coefficients both when they are drawn from an exponential power distribution and when they are drawn from a spike-and-slab distribution. Supplementary materials for this article are available online."

--- I feel like I fundamentally disagree with this approach. Those priors are merely (to quote Jamie Robins and Larry Wasserman) "frequentist pursuit", and have no bearing on whether (say) the Lasso will give a good sparse, linear approximation to the underlying regression function (see https://normaldeviate.wordpress.com/2013/09/11/consistency-sparsistency-and-presistency/). All of which said, Hoff is always worth listening to, so the last tag applies with special force.
to:NB  model_checking  sparsity  regression  hypothesis_testing  bayesianism  re:phil-of-bayes_paper  hoff.peter  to_besh 
29 days ago by cshalizi
High-Dimensional Adaptive Minimax Sparse Estimation With Interactions - IEEE Journals & Magazine
"High-dimensional linear regression with interaction effects is broadly applied in research fields such as bioinformatics and social science. In this paper, first, we investigate the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions. Here, we derive matching upper and lower bounds under three types of heredity conditions: strong heredity, weak heredity, and no heredity. From the results: 1) A stronger heredity condition may or may not drastically improve the minimax rate of convergence. In fact, in some situations, the minimax rates of convergence are the same under all three heredity conditions; 2) The minimax rate of convergence is determined by the maximum of the total price of estimating the main effects and that of estimating the interaction effects, which goes beyond purely comparing the order of the number of non-zero main effects r1 and non-zero interaction effects r2 ; and 3) Under any of the three heredity conditions, the estimation of the interaction terms may be the dominant part in determining the rate of convergence. This is due to either the dominant number of interaction effects over main effects or the higher interaction estimation price induced by a large ambient dimension. Second, we construct an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices r1,r2 ."
to:NB  statistics  high-dimensional_statistics  regression  sparsity  variable_selection  linear_regression 
29 days ago by cshalizi
[1806.03120] Variational inference for sparse network reconstruction from count data
"In multivariate statistics, the question of finding direct interactions can be formulated as a problem of network inference - or network reconstruction - for which the Gaussian graphical model (GGM) provides a canonical framework. Unfortunately, the Gaussian assumption does not apply to count data which are encountered in domains such as genomics, social sciences or ecology.
"To circumvent this limitation, state-of-the-art approaches use two-step strategies that first transform counts to pseudo Gaussian observations and then apply a (partial) correlation-based approach from the abundant literature of GGM inference. We adopt a different stance by relying on a latent model where we directly model counts by means of Poisson distributions that are conditional to latent (hidden) Gaussian correlated variables. In this multivariate Poisson lognormal-model, the dependency structure is completely captured by the latent layer. This parametric model enables to account for the effects of covariates on the counts.
"To perform network inference, we add a sparsity inducing constraint on the inverse covariance matrix of the latent Gaussian vector. Unlike the usual Gaussian setting, the penalized likelihood is generally not tractable, and we resort instead to a variational approach for approximate likelihood maximization. The corresponding optimization problem is solved by alternating a gradient ascent on the variational parameters and a graphical-Lasso step on the covariance matrix.
"We show that our approach is highly competitive with the existing methods on simulation inspired from microbiological data. We then illustrate on three various data sets how accounting for sampling efforts via offsets and integrating external covariates (which is mostly never done in the existing literature) drastically changes the topology of the inferred network."
to:NB  sparsity  graphical_models  statistics  re:6dfb 
may 2019 by cshalizi
[1903.04641] Generalized Sparse Additive Models
"We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework."
to:NB  sparsity  regression  additive_models  high-dimensional_statistics  statistics 
april 2019 by cshalizi
[1712.05630] Sparse principal component analysis via random projections
"We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully-selected random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so is not vulnerable to a bad choice of initialisation. Our theory provides great detail on the statistical and computational trade-off in our procedure, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite-sample performance."
to:NB  principal_components  sparsity  random_projections  statistics  samworth.richard_j. 
september 2018 by cshalizi
Large numbers of explanatory variables, a semi-descriptive analysis
"Data with a relatively small number of study individuals and a very large number of potential explanatory features arise particularly, but by no means only, in genomics. A powerful method of analysis, the lasso [Tibshirani R (1996) J Roy Stat Soc B 58:267–288], takes account of an assumed sparsity of effects, that is, that most of the features are nugatory. Standard criteria for model fitting, such as the method of least squares, are modified by imposing a penalty for each explanatory variable used. There results a single model, leaving open the possibility that other sparse choices of explanatory features fit virtually equally well. The method suggested in this paper aims to specify simple models that are essentially equally effective, leaving detailed interpretation to the specifics of the particular study. The method hinges on the ability to make initially a very large number of separate analyses, allowing each explanatory feature to be assessed in combination with many other such features. Further stages allow the assessment of more complex patterns such as nonlinear and interactive dependences. The method has formal similarities to so-called partially balanced incomplete block designs introduced 80 years ago [Yates F (1936) J Agric Sci 26:424–455] for the study of large-scale plant breeding trials. The emphasis in this paper is strongly on exploratory analysis; the more formal statistical properties obtained under idealized assumptions will be reported separately."

--- Contributed, which is a bad sign, but by Cox, so...
to:NB  statistics  regression  sparsity  lasso  high-dimensional_statistics  cox.d.r. 
august 2017 by cshalizi
Oracle M-Estimation for Time Series Models - Giurcanu - 2016 - Journal of Time Series Analysis - Wiley Online Library
"We propose a thresholding M-estimator for multivariate time series. Our proposed estimator has the oracle property that its large-sample properties are the same as of the classical M-estimator obtained under the a priori information that the zero parameters were known. We study the consistency of the standard block bootstrap, the centred block bootstrap and the empirical likelihood block bootstrap distributions of the proposed M-estimator. We develop automatic selection procedures for the thresholding parameter and for the block length of the bootstrap methods. We present the results of a simulation study of the proposed methods for a sparse vector autoregressive VAR(2) time series model. The analysis of two real-world data sets illustrate applications of the methods in practice."
bootstrap  time_series  statistics  estimation  in_NB  sparsity  variable_selection  high-dimensional_statistics 
april 2017 by cshalizi
[1311.5768] An RKHS Approach to Estimation with Sparsity Constraints
"The investigation of the effects of sparsity or sparsity constraints in signal processing problems has received considerable attention recently. Sparsity constraints refer to the a priori information that the object or signal of interest can be represented by using only few elements of a predefined dictionary. Within this thesis, sparsity refers to the fact that a vector to be estimated has only few nonzero entries. One specific field concerned with sparsity constraints has become popular under the name Compressed Sensing (CS). Within CS, the sparsity is exploited in order to perform (nearly) lossless compression. Moreover, this compression is carried out jointly or simultaneously with the process of sensing a physical quantity. In contrast to CS, one can alternatively use sparsity to enhance signal processing methods. Obviously, sparsity constraints can only improve the obtainable estimation performance since the constraints can be interpreted as an additional prior information about the unknown parameter vector which is to be estimated. Our main focus will be on this aspect of sparsity, i.e., we analyze how much we can gain in estimation performance due to the sparsity constraints."
to:NB  sparsity  compressed_sensing  hilbert_space  estimation  statistics 
december 2016 by cshalizi
[1209.1508] Confidence sets in sparse regression
"The problem of constructing confidence sets in the high-dimensional linear model with n response variables and p parameters, possibly p≥n, is considered. Full honest adaptive inference is possible if the rate of sparse estimation does not exceed n−1/4, otherwise sparse adaptive confidence sets exist only over strict subsets of the parameter spaces for which sparse estimators exist. Necessary and sufficient conditions for the existence of confidence sets that adapt to a fixed sparsity level of the parameter vector are given in terms of minimal ℓ2-separation conditions on the parameter space. The design conditions cover common coherence assumptions used in models for sparsity, including (possibly correlated) sub-Gaussian designs."
to:NB  confidence_sets  regression  high-dimensional_statistics  linear_regression  statistics  sparsity  van_de_geer.sara  nickl.richard 
november 2016 by cshalizi
Estimation and Testing Under Sparsity | Sara van de Geer | Springer
"Taking the Lasso method as its starting point, this book describes the main ingredients needed to study general loss functions and sparsity-inducing regularizers. It also provides a semi-parametric approach to establishing confidence intervals and tests. Sparsity-inducing methods have proven to be very useful in the analysis of high-dimensional data. Examples include the Lasso and group Lasso methods, and the least squares method with other norm-penalties, such as the nuclear norm. The illustrations provided include generalized linear models, density estimation, matrix completion and sparse principal components. Each chapter ends with a problem section. The book can be used as a textbook for a graduate or PhD course."
to:NB  books:noted  statistics  sparsity  high-dimensional_statistics  lasso  hypothesis_testing  confidence_sets  van_de_geer.sara  to_read  empirical_processes 
july 2016 by cshalizi
Cat Basis Pursuit
Will the statistical machine learning reading group meet on 1 April?
machine_learning  cats  funny:geeky  principal_components  sparsity 
february 2016 by cshalizi
[1602.00355] A Spectral Series Approach to High-Dimensional Nonparametric Regression
"A key question in modern statistics is how to make fast and reliable inferences for complex, high-dimensional data. While there has been much interest in sparse techniques, current methods do not generalize well to data with nonlinear structure. In this work, we present an orthogonal series estimator for predictors that are complex aggregate objects, such as natural images, galaxy spectra, trajectories, and movies. Our series approach ties together ideas from kernel machine learning, and Fourier methods. We expand the unknown regression on the data in terms of the eigenfunctions of a kernel-based operator, and we take advantage of orthogonality of the basis with respect to the underlying data distribution, P, to speed up computations and tuning of parameters. If the kernel is appropriately chosen, then the eigenfunctions adapt to the intrinsic geometry and dimension of the data. We provide theoretical guarantees for a radial kernel with varying bandwidth, and we relate smoothness of the regression function with respect to P to sparsity in the eigenbasis. Finally, using simulated and real-world data, we systematically compare the performance of the spectral series approach with classical kernel smoothing, k-nearest neighbors regression, kernel ridge regression, and state-of-the-art manifold and local regression methods."
to:NB  have_read  statistics  regression  nonparametrics  sparsity  kernel_methods  kith_and_kin  lee.ann_b. 
february 2016 by cshalizi
An Introduction to Sparse Stochastic Processes | Communications and Signal Processing | Cambridge University Press
"Providing a novel approach to sparsity, this comprehensive book presents the theory of stochastic processes that are ruled by linear stochastic differential equations, and that admit a parsimonious representation in a matched wavelet-like basis. Two key themes are the statistical property of infinite divisibility, which leads to two distinct types of behaviour - Gaussian and sparse - and the structural link between linear stochastic processes and spline functions, which is exploited to simplify the mathematical analysis. The core of the book is devoted to investigating sparse processes, including a complete description of their transform-domain statistics. The final part develops practical signal-processing algorithms that are based on these models, with special emphasis on biomedical image reconstruction. This is an ideal reference for graduate students and researchers with an interest in signal/image processing, compressed sensing, approximation theory, machine learning, or statistics."
to:NB  books:noted  sparsity  stochastic_processes  compressed_sensing  splines  statistics  levy_processes 
december 2015 by cshalizi
[1507.05185] Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees
"In this paper, we study a fast approximation method for {\it large-scale high-dimensional} sparse least-squares regression problem by exploiting the Johnson-Lindenstrauss (JL) transforms, which embed a set of high-dimensional vectors into a low-dimensional space. In particular, we propose to apply the JL transforms to the data matrix and the target vector and then to solve a sparse least-squares problem on the compressed data with a {\it slightly larger regularization parameter}. Theoretically, we establish the optimization error bound of the learned model for two different sparsity-inducing regularizers, i.e., the elastic net and the ℓ1 norm. Compared with previous relevant work, our analysis is {\it non-asymptotic and exhibits more insights} on the bound, the sample complexity and the regularization. As an illustration, we also provide an error bound of the {\it Dantzig selector} under JL transforms."
to:NB  regression  random_projections  computational_statistics  statistics  sparsity 
august 2015 by cshalizi
[1507.00280] Network Lasso: Clustering and Optimization in Large Graphs
"Convex optimization is an essential tool for modern data analysis, as it provides a framework to formulate and solve many problems in machine learning and data mining. However, general convex optimization solvers do not scale well, and scalable solvers are often specialized to only work on a narrow class of problems. Therefore, there is a need for simple, scalable algorithms that can solve many common optimization problems. In this paper, we introduce the \emph{network lasso}, a generalization of the group lasso to a network setting that allows for simultaneous clustering and optimization on graphs. We develop an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in a distributed and scalable manner, which allows for guaranteed global convergence even on large graphs. We also examine a non-convex extension of this approach. We then demonstrate that many types of problems can be expressed in our framework. We focus on three in particular - binary classification, predicting housing prices, and event detection in time series data - comparing the network lasso to baseline approaches and showing that it is both a fast and accurate method of solving large optimization problems."
to:NB  optimization  sparsity  prediction  regression  networks 
august 2015 by cshalizi
[1506.03850] Generalized Additive Model Selection
"We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter, producing a regularization path of additive models. We demonstrate the performance of our method on both real and simulated data examples, and compare it with existing techniques for additive model selection."
to:NB  statistics  sparsity  additive_models  regression  kith_and_kin  chouldechova.alexandra  hastie.trevor 
july 2015 by cshalizi
How a well-adapted immune system is organized
"The repertoire of lymphocyte receptors in the adaptive immune system protects organisms from diverse pathogens. A well-adapted repertoire should be tuned to the pathogenic environment to reduce the cost of infections. We develop a general framework for predicting the optimal repertoire that minimizes the cost of infections contracted from a given distribution of pathogens. The theory predicts that the immune system will have more receptors for rare antigens than expected from the frequency of encounters; individuals exposed to the same infections will have sparse repertoires that are largely different, but nevertheless exploit cross-reactivity to provide the same coverage of antigens; and the optimal repertoires can be reached via the dynamics of competitive binding of antigens by receptors and selective amplification of stimulated receptors. Our results follow from a tension between the statistics of pathogen detection, which favor a broader receptor distribution, and the effects of cross-reactivity, which tend to concentrate the optimal repertoire onto a few highly abundant clones. Our predictions can be tested in high-throughput surveys of receptor and pathogen diversity."
to:NB  immunology  distributed_systems  adaptive_behavior  sparsity 
may 2015 by cshalizi
[1405.5103] Estimation in high dimensions: a geometric perspective
"This tutorial paper provides an exposition of a flexible geometric framework for high dimensional estimation problems with constraints. The paper develops geometric intuition about high dimensional sets, justifies it with some results of asymptotic convex geometry, and demonstrates connections between geometric results and estimation problems. The theory is illustrated with applications to sparse recovery, matrix completion, quantization, linear and logistic regression and generalized linear models."
to:NB  statistics  high-dimensional_statistics  estimation  sparsity  convexity  geometry  compressed_sensing  to_read 
june 2014 by cshalizi
Information-Theoretic Characterization of Sparse Recovery | AISTATS 2014 | JMLR W&CP
"We formulate sparse support recovery as a salient set identification problem and use information-theoretic analyses to characterize the recovery performance and sample complexity. We consider a very general framework where we are not restricted to linear models or specific distributions. We state non-asymptotic bounds on recovery probability and a tight mutual information formula for sample complexity. We evaluate our bounds for applications such as sparse linear regression and explicitly characterize effects of correlation or noisy features on recovery performance. We show improvements upon previous work and identify gaps between the performance of recovery algorithms and fundamental information. This illustrates a trade-off between computational complexity and sample complexity, contrasting the recovery of the support as a discrete object with signal estimation approaches."
to:NB  to_read  information_theory  statistics  learning_theory  sparsity  compressed_sensing 
april 2014 by cshalizi
<span>Path Thresholding: Asymptotically Tuning-Free High-Dimensional Sparse Regression</span> | AISTATS 2014 | JMLR W&CP
"In this paper, we address the challenging problem of selecting tuning parameters for high-dimensional sparse regression. We propose a simple and computationally efficient method, called path thresholding PaTh, that transforms any tuning parameter-dependent sparse regression algorithm into an asymptotically tuning-free sparse regression algorithm. More specifically, we prove that, as the problem size becomes large (in the number of variables and in the number of observations), PaTh performs accurate sparse regression, under appropriate conditions, without specifying a tuning parameter. In finite-dimensional settings, we demonstrate that PaTh can alleviate the computational burden of model selection algorithms by significantly reducing the search space of tuning parameters."
to:NB  to_read  regression  sparsity  high-dimensional_statistics  statistics  vats.divyanshu 
april 2014 by cshalizi
IEEE Xplore Abstract - Minimum Complexity Pursuit for Universal Compressed Sensing
"The nascent field of compressed sensing is founded on the fact that high-dimensional signals with simple structure can be recovered accurately from just a small number of randomized samples. Several specific kinds of structures have been explored in the literature, from sparsity and group sparsity to low-rankness. However, two fundamental questions have been left unanswered. What are the general abstract meanings of structure and simplicity? Do there exist universal algorithms for recovering such simple structured objects from fewer samples than their ambient dimension? In this paper, we address these two questions. Using algorithmic information theory tools such as the Kolmogorov complexity, we provide a unified definition of structure and simplicity. Leveraging this new definition, we develop and analyze an abstract algorithm for signal recovery motivated by Occam's Razor. Minimum complexity pursuit (MCP) requires approximately 2κ randomized samples to recover a signal of complexity κ and ambient dimension n. We also discuss the performance of the MCP in the presence of measurement noise and with approximately simple signals."

Ungated version: http://arxiv.org/abs/1208.5814
in_NB  compressed_sensing  sparsity  information_theory  algorithmic_information_theory  statistics  to_be_shot_after_a_fair_trial 
april 2014 by cshalizi
[1403.7023] Worst possible sub-directions in high-dimensional models
"We examine the rate of convergence of the Lasso estimator of lower dimensional components of the high-dimensional parameter. Under bounds on the ℓ1-norm on the worst possible sub-direction these rates are of order |J|logp/n‾‾‾‾‾‾‾‾‾√ where p is the total number of parameters, J⊂{1,…,p} represents a subset of the parameters and n is the number of observations. We also derive rates in sup-norm in terms of the rate of convergence in ℓ1-norm. The irrepresentable condition on a set J requires that the ℓ1-norm of the worst possible sub-direction is sufficiently smaller than one. In that case sharp oracle results can be obtained. Moreover, if the coefficients in J are small enough the Lasso will put these coefficients to zero. This extends known results which say that the irrepresentable condition on the inactive set (the set where coefficients are exactly zero) implies no false positives. We further show that by de-sparsifying one obtains fast rates in supremum norm without conditions on the worst possible sub-direction. The main assumption here is that approximate sparsity is of order o(n‾‾√/logp). The results are extended to M-estimation with ℓ1-penalty for generalized linear models and exponential families for example. For the graphical Lasso this leads to an extension of known results to the case where the precision matrix is only approximately sparse. The bounds we provide are non-asymptotic but we also present asymptotic formulations for ease of interpretation."
to:NB  high-dimensional_statistics  lasso  sparsity  variable_selection  statistics  van_de_geer.sara 
april 2014 by cshalizi
Introduction to Compressed Sensing
"In recent years, compressed sensing (CS) has attracted considerable attention in areas of applied mathematics, computer science, and electrical engineering by suggesting that it may be possible to surpass the traditional limits of sam- pling theory. CS builds upon the fundamental fact that we can represent many signals using only a few non-zero coefficients in a suitable basis or dictionary. Nonlinear optimization can then enable recovery of such signals from very few measurements. In this chapter, we provide an up-to-date review of the basic theory underlying CS. After a brief historical overview, we begin with a dis- cussion of sparsity and other low-dimensional signal models. We then treat the central question of how to accurately recover a high-dimensional signal from a small set of measurements and provide performance guarantees for a variety of sparse recovery algorithms. We conclude with a discussion of some extensions of the sparse recovery framework. In subsequent chapters of the book, we will see how the fundamentals presented in this chapter are extended in many excit- ing directions, including new models for describing structure in both analog and discrete-time signals, new sensing design techniques, more advanced recovery results, and emerging applications."
to:NB  compressed_sensing  sparsity  estimation  statistics  linear_algebra  re:network_differences  entableted 
march 2014 by cshalizi
Link Prediction in Graphs with Autoregressive Features
"In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices. On the adjacency matrix it takes into account both sparsity and low rank properties and on the VAR it encodes the sparsity. The analysis involves oracle inequalities that illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank. The estimate is computed efficiently using proximal methods, and evaluated through numerical experiments."
to:NB  time_series  network_data_analysis  statistics  sparsity  low-rank_approximation 
march 2014 by cshalizi
[1402.7349] Learning Graphical Models With Hubs
"We consider the problem of learning a high-dimensional graphical model in which certain hub nodes are highly-connected to many other nodes. Many authors have studied the use of an l1 penalty in order to learn a sparse graph in high-dimensional setting. However, the l1 penalty implicitly assumes that each edge is equally likely and independent of all other edges. We propose a general framework to accommodate more realistic networks with hub nodes, using a convex formulation that involves a row-column overlap norm penalty. We apply this general framework to three widely-used probabilistic graphical models: the Gaussian graphical model, the covariance graph model, and the binary Ising model. An alternating direction method of multipliers algorithm is used to solve the corresponding convex optimization problems. On synthetic data, we demonstrate that our proposed framework outperforms competitors that do not explicitly model hub nodes. We illustrate our proposal on a webpage data set and a gene expression data set."
to:NB  graphical_models  statistics  sparsity  network_data_analysis  re:6dfb 
march 2014 by cshalizi
Nickl , van de Geer : Confidence sets in sparse regression
"The problem of constructing confidence sets in the high-dimensional linear model with n response variables and p parameters, possibly p≥n, is considered. Full honest adaptive inference is possible if the rate of sparse estimation does not exceed n−1/4, otherwise sparse adaptive confidence sets exist only over strict subsets of the parameter spaces for which sparse estimators exist. Necessary and sufficient conditions for the existence of confidence sets that adapt to a fixed sparsity level of the parameter vector are given in terms of minimal ℓ2-separation conditions on the parameter space. The design conditions cover common coherence assumptions used in models for sparsity, including (possibly correlated) sub-Gaussian designs."

--- Ungated version: http://arxiv.org/abs/1209.1508
in_NB  high-dimensional_statistics  sparsity  confidence_sets  regression  statistics  van_de_geer.sara  nickl.richard  lasso  model_selection  linear_regression 
february 2014 by cshalizi
Vu , Lei : Minimax sparse principal subspace estimation in high dimensions
"We study sparse principal components analysis in high dimensions, where p (the number of variables) can be much larger than n (the number of observations), and analyze the problem of estimating the subspace spanned by the principal eigenvectors of the population covariance matrix. We introduce two complementary notions of ℓq subspace sparsity: row sparsity and column sparsity. We prove nonasymptotic lower and upper bounds on the minimax subspace estimation error for 0≤q≤1. The bounds are optimal for row sparse subspaces and nearly optimal for column sparse subspaces, they apply to general classes of covariance matrices, and they show that ℓq constrained estimates can achieve optimal minimax rates without restrictive spiked covariance conditions. Interestingly, the form of the rates matches known results for sparse regression when the effective noise variance is defined appropriately. Our proof employs a novel variational sinΘ theorem that may be useful in other regularized spectral estimation problems."
to:NB  principal_components  sparsity  dimension_reduction  kith_and_kin  vu.vincent  statistics  to_read  minimax  high-dimensional_statistics 
february 2014 by cshalizi
[1205.0953] Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization
"Least squares fitting is in general not useful for high-dimensional linear models, in which the number of predictors is of the same or even larger order of magnitude than the number of samples. Theory developed in recent years has coined a paradigm according to which sparsity-promoting regularization is regarded as a necessity in such setting. Deviating from this paradigm, we show that non-negativity constraints on the regression coefficients may be similarly effective as explicit regularization if the design matrix has additional properties, which are met in several applications of non-negative least squares (NNLS). We show that for these designs, the performance of NNLS with regard to prediction and estimation is comparable to that of the lasso. We argue further that in specific cases, NNLS may have a better ℓ∞-rate in estimation and hence also advantages with respect to support recovery when combined with thresholding. From a practical point of view, NNLS does not depend on a regularization parameter and is hence easier to use."
to:NB  regression  high-dimensional_statistics  sparsity 
february 2014 by cshalizi
Annu. Rev. Neurosci. 2012. 35:485–508 First published online as a Review in Advance on April 5, 2012 The Annual Review of Neuroscience is online at neuro.annualreviews.org This article’s doi: 10.1146/annurev-neuro-062111–150410 Copyright ⃝c 2012 b
"The curse of dimensionality poses severe challenges to both techni- cal and conceptual progress in neuroscience. In particular, it plagues our ability to acquire, process, and model high-dimensional data sets. Moreover, neural systems must cope with the challenge of processing data in high dimensions to learn and operate successfully within a com- plex world. We review recent mathematical advances that provide ways to combat dimensionality in specific situations. These advances shed light on two dual questions in neuroscience. First, how can we as neu- roscientists rapidly acquire high-dimensional data from the brain and subsequently extract meaningful models from limited amounts of these data? And second, how do brains themselves process information in their intrinsically high-dimensional patterns of neural activity as well as learn meaningful, generalizable models of the external world from limited experience?"
to:NB  high-dimensional_statistics  neural_data_analysis  neuroscience  compressed_sensing  sparsity 
february 2014 by cshalizi
[1401.6978] Sparsistency and Agnostic Inference in Sparse PCA
"The presence of a sparse "truth" has been a constant assumption in the theoretical analysis of sparse PCA and is often implicit in its methodological development. This naturally raises questions about the properties of sparse PCA methods and how they depend on the assumption of sparsity. Under what conditions can the relevant variables be selected consistently if the truth is assumed to be sparse? If the truth is not sparse, let alone unique, what can be said about the results of sparse PCA? We answer these questions by investigating the properties of the recently proposed Fantope projection and selection (FPS) method in the high dimensional setting. Our results provide general sufficient conditions for sparsistency of the FPS estimator. These conditions are weak and can hold in situations where other estimators are known to fail. On the other hand, without assuming sparsity or identifiability, we show that FPS provides a sparse, linear dimension-reducing transformation that is close to the best possible in terms of maximizing the predictive covariance."
to:NB  principal_components  sparsity  dimension_reduction  kith_and_kin  statistics  vu.vincent  lei.jing 
february 2014 by cshalizi
[1312.1706] Swapping Variables for High-Dimensional Sparse Regression from Correlated Measurements
"We consider the high-dimensional sparse linear regression problem of accurately estimating a sparse vector using a small number of linear measurements that are contaminated by noise. It is well known that standard computationally tractable sparse regression algorithms, such as the Lasso, OMP, and their various extensions, perform poorly when the measurement matrix contains highly correlated columns. We develop a simple greedy algorithm, called SWAP, that iteratively swaps variables until a desired loss function cannot be decreased any further. SWAP is surprisingly effective in handling measurement matrices with high correlations. In particular, we prove that (i) SWAP outputs the true support, the location of the non-zero entries in the sparse vector, when initialized with the true support, and (ii) SWAP outputs the true support under a relatively mild condition on the measurement matrix when initialized with a support other than the true support. These theoretical results motivate the use of SWAP as a wrapper around various sparse regression algorithms for improved performance. We empirically show the advantages of using SWAP in sparse regression problems by comparing SWAP to several state-of-the-art sparse regression algorithms."
to:NB  high-dimensional_statistics  lasso  sparsity  variable_selection  vats.divyanshu  statistics 
january 2014 by cshalizi
[1311.6238] Exact inference after model selection via the Lasso
"We develop a framework for inference after model selection based on the Lasso. At the core of this framework is a result that characterizes the exact (non-asymptotic) distribution of a pivot computed from the Lasso solution. This pivot allows us to (i) devise a test statistic that has an exact (non-asymptotic) $\unif(0,1)$ distribution under the null hypothesis that all relevant variables have been included in the model, and (ii) construct valid confidence intervals for the selected coefficients that account for the selection procedure."
in_NB  model_selection  confidence_sets  sparsity  lasso  hypothesis_testing  regression  statistics 
november 2013 by cshalizi
[1311.6425] Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
"Graph matching is a challenging problem with very important applications in a wide range of fields, from image and video analysis to biological and biomedical problems. We propose a robust graph matching algorithm inspired in sparsity-related techniques. We cast the problem, resembling group or collaborative sparsity formulations, as a non-smooth convex optimization problem that can be efficiently solved using augmented Lagrangian techniques. The method can deal with weighted or unweighted graphs, as well as multimodal data, where different graphs represent different types of data. The proposed approach is also naturally integrated with collaborative graph inference techniques, solving general network inference problems where the observed variables, possibly coming from different modalities, are not in correspondence. The algorithm is tested and compared with state-of-the-art graph matching techniques in both synthetic and real graphs. We also present results on multimodal graphs and applications to collaborative inference of brain connectivity from alignment-free functional magnetic resonance imaging (fMRI) data. The code is publicly available."
in_NB  to_read  network_data_analysis  optimization  sparsity  re:network_differences  entableted 
november 2013 by cshalizi
[1311.4175] Estimation in High-dimensional Vector Autoregressive Models
"Vector Autoregression (VAR) is a widely used method for learning complex interrelationship among the components of multiple time series. Over the years it has gained popularity in the fields of control theory, statistics, economics, finance, genetics and neuroscience. We consider the problem of estimating stable VAR models in a high-dimensional setting, where both the number of time series and the VAR order are allowed to grow with sample size. In addition to the ``curse of dimensionality" introduced by a quadratically growing dimension of the parameter space, VAR estimation poses considerable challenges due to the temporal and cross-sectional dependence in the data. Under a sparsity assumption on the model transition matrices, we establish estimation and prediction consistency of ℓ1-penalized least squares and likelihood based methods. Exploiting spectral properties of stationary VAR processes, we develop novel theoretical techniques that provide deeper insight into the effect of dependence on the convergence rates of the estimates. We study the impact of error correlations on the estimation problem and develop fast, parallelizable algorithms for penalized likelihood based VAR estimates."
in_NB  time_series  sparsity  statistics  re:your_favorite_dsge_sucks 
november 2013 by cshalizi
Taylor & Francis Online :: Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space - Journal of the American Statistical Association - Volume 108, Issue 503
"High-dimensional data analysis has motivated a spectrum of regularization methods for variable selection and sparse modeling, with two popular methods being convex and concave ones. A long debate has taken place on whether one class dominates the other, an important question both in theory and to practitioners. In this article, we characterize the asymptotic equivalence of regularization methods, with general penalty functions, in a thresholded parameter space under the generalized linear model setting, where the dimensionality can grow exponentially with the sample size. To assess their performance, we establish the oracle inequalities—as in Bickel, Ritov, and Tsybakov (2009)—of the global minimizer for these methods under various prediction and variable selection losses. These results reveal an interesting phase transition phenomenon. For polynomially growing dimensionality, the L 1-regularization method of Lasso and concave methods are asymptotically equivalent, having the same convergence rates in the oracle inequalities. For exponentially growing dimensionality, concave methods are asymptotically equivalent but have faster convergence rates than the Lasso. We also establish a stronger property of the oracle risk inequalities of the regularization methods, as well as the sampling properties of computable solutions. Our new theoretical results are illustrated and justified by simulation and real data examples."
to:NB  lasso  sparsity  high-dimensional_statistics  regression  statistics 
september 2013 by cshalizi
[1309.6702] Statistical paleoclimate reconstructions via Markov random fields
"Understanding centennial scale climate variability requires datasets that are accurate, long, continuous, and of broad spatial coverage. Since instrumental measurements are generally only available after 1850, temperature fields must be reconstructed using paleoclimate archives, known as proxies. Various climate field reconstructions (CFR) methods have been proposed to relate past temperature and multiproxy networks, most notably the regularized EM algorithm (RegEM). In this work, we propose a new CFR method, called GraphEM, based on Gaussian Markov random fields (GMRF) embedded within RegEM. GMRFs provide a natural and flexible framework for modeling the inherent spatial heterogeneities of high-dimensional spatial fields, which would in general be more difficult with standard parametric covariance models. At the same time, they provide the parameter reduction necessary for obtaining precise and well-conditioned estimates of the covariance structure of the field, even when the sample size is much smaller than the number of variables (as is typically the case in paleoclimate applications). We demonstrate how the graphical structure of the field can be estimated from the data via l1-penalization methods, and how the GraphEM algorithm can be used to reconstruct past climate variations. The performance of GraphEM is then compared to a popular CFR method (RegEM TTLS) using synthetic data. Our results show that GraphEM can yield significant improvements over existing methods, with gains uniformly over space, and far better risk properties. We proceed to demonstrate that the increase in performance is directly related to recovering the underlying sparsity in the covariance of the spatial field. In particular, we show that spatial points with fewer neighbors in the recovered graph tend to be the ones where there are higher improvements in the reconstructions."
to:NB  climatology  variance_estimation  random_fields  sparsity  statistics 
september 2013 by cshalizi
[1308.2408] Group Lasso for generalized linear models in high dimension
"Nowadays an increasing amount of data is available and we have to deal with models in high dimension (number of covariates much larger than the sample size). Under sparsity assumption it is reasonable to hope that we can make a good estimation of the regression parameter. This sparsity assumption as well as a block structuration of the covariates into groups with similar modes of behavior is for example quite natural in genomics. A huge amount of scientific literature exists for Gaussian linear models including the Lasso estimator and also the Group Lasso estimator which promotes group sparsity under an a priori knowledge of the groups. We extend this Group Lasso procedure to generalized linear models and we study the properties of this estimator for sparse high-dimensional generalized linear models to find convergence rates. We provide oracle inequalities for the prediction and estimation error under assumptions on the joint distribution of the pair observable covariables and under a condition on the design matrix. We show the ability of this estimator to recover good sparse approximation of the true model. At last we extend these results to the case of an Elastic net penalty and we apply them to the so-called Poisson regression case which has not been studied in this context contrary to the logistic regression."

--- Isn't this already done in Buhlmann and van de Geer's book?
to:NB  linear_regression  regression  sparsity  statistics  high-dimensional_statistics 
september 2013 by cshalizi
[1309.2895] Sparse and Functional Principal Components Analysis
"Regularized principal components analysis, especially Sparse PCA and Functional PCA, has become widely used for dimension reduction in high-dimensional settings. Many examples of massive data, however, may benefit from estimating both sparse AND functional factors. These include neuroimaging data where there are discrete brain regions of activation (sparsity) but these regions tend to be smooth spatially (functional). Here, we introduce an optimization framework that can encourage both sparsity and smoothness of the row and/or column PCA factors. This framework generalizes many of the existing approaches to Sparse PCA, Functional PCA and two-way Sparse PCA and Functional PCA, as these are all special cases of our method. In particular, our method permits flexible combinations of sparsity and smoothness that lead to improvements in feature selection and signal recovery as well as more interpretable PCA factors. We demonstrate our method on simulated data and a neuroimaging example on EEG data. This work provides a unified framework for regularized PCA that can form the foundation for a cohesive approach to regularization in high-dimensional multivariate analysis."
to:NB  sparsity  functional_data_analysis  principal_components  statistics  dimension_reduction  allen.genevera_i. 
september 2013 by cshalizi
Nonparametric Sparsity and Regularization
"In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods."
to:NB  sparsity  hilbert_space  regression  statistics 
september 2013 by cshalizi
[1307.5381] A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees
"Sparse high dimensional graphical model selection is a topic of much interest in modern day statistics. A popular approach is to apply l1-penalties to either (1) parametric likelihoods, or, (2) regularized regression/pseudo-likelihoods, with the latter having the distinct advantage that they do not explicitly assume Gaussianity. As none of the popular methods proposed for solving pseudo-likelihood based objective functions have provable convergence guarantees, it is not clear if corresponding estimators exist or are even computable, or if they actually yield correct partial correlation graphs. This paper proposes a new pseudo-likelihood based graphical model selection method that aims to overcome some of the shortcomings of current methods, but at the same time retain all their respective strengths. In particular, we introduce a novel framework that leads to a convex formulation of the partial covariance regression graph problem, resulting in an objective function comprised of quadratic forms. The objective is then optimized via a coordinate-wise approach. The specific functional form of the objective function facilitates rigorous convergence analysis leading to convergence guarantees; an important property that cannot be established using standard results, when the dimension is larger than the sample size, as is often the case in high dimensional applications. These convergence guarantees ensure that estimators are well-defined under very general conditions, and are always computable. In addition, the approach yields estimators that have good large sample properties and also respect symmetry. Furthermore, application to simulated/real data, timing comparisons and numerical convergence is demonstrated. We also present a novel unifying framework that places all graphical pseudo-likelihood methods as special cases of a more general formulation, leading to important insights."
to:NB  convexity  graphical_models  sparsity  optimization  lasso  likelihood  statistics 
july 2013 by cshalizi
[1307.5339] The Cluster Graphical Lasso for improved estimation of Gaussian graphical models
"We consider the task of estimating a Gaussian graphical model in the high-dimensional setting. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to an l1 penalty, is a well-studied approach for this task. We begin by introducing a surprising connection between the graphical lasso and hierarchical clustering: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) an l1-penalized log likelihood is maximized on the subset of variables within each connected component. In other words, the graphical lasso determines the connected components of the estimated network via single linkage clustering. Unfortunately, single linkage clustering is known to perform poorly in certain settings. Therefore, we propose the cluster graphical lasso, which involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on the subset of variables within each cluster. We establish model selection consistency for this technique, and demonstrate its improved performance relative to the graphical lasso in a simulation study, as well as in applications to an equities data set, a university webpage data set, and a gene expression data set."
to:NB  sparsity  graphical_models  lasso  clustering  statistics 
july 2013 by cshalizi
[1205.2081] The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing
"This paper deals with the computational complexity of conditions which guarantee that the NP-hard problem of finding the sparsest solution to an underdetermined linear system can be solved by efficient algorithms. In the literature, several such conditions have been introduced. The most well-known ones are the mutual coherence, the restricted isometry property (RIP), and the nullspace property (NSP). While evaluating the mutual coherence of a given matrix is easy, it has been suspected for some time that evaluating RIP and NSP is computationally intractable in general. We confirm these conjectures by showing that for a given matrix A and positive integer k, computing the best constants for which the RIP or NSP hold is, in general, NP-hard. These results are based on the fact that determining the spark of a matrix is NP-hard, which is also established in this paper. Furthermore, we also give several complexity statements about problems related to the above concepts."
to:NB  computational_complexity  lasso  sparsity  statistics 
july 2013 by cshalizi
[1307.4765] Adaptive testing for the graphical lasso
"We consider tests of significance in the setting of the graphical lasso for inverse covariance matrix estimation. We propose a simple test statistic based on a subsequence of the knots in the graphical lasso path. We show that this statistic has an exponential asymptotic null distribution, under the null hypothesis that the model contains the true connected components.
"Though the null distribution is asymptotic, we show through simulation that it provides a close approximation to the true distribution at reasonable sample sizes. Thus the test provides a simple, tractable test for the significance of new edges as they are introduced into the model. Finally, we show connections between our results and other results for regularized regression, as well as extensions of our results to other correlation matrix based methods like single-linkage clustering."
to:NB  hypothesis_testing  lasso  graphical_models  statistics  tibshirani.robert  sparsity 
july 2013 by cshalizi
Mayrink , Lucas : Sparse latent factor models with interactions: Analysis of gene expression data
"Sparse latent multi-factor models have been used in many exploratory and predictive problems with high-dimensional multivariate observations. Because of concerns with identifiability, the latent factors are almost always assumed to be linearly related to measured feature variables. Here we explore the analysis of multi-factor models with different structures of interactions between latent factors, including multiplicative effects as well as a more general framework for nonlinear interactions introduced via the Gaussian Process. We utilize sparsity priors to test whether the factors and interaction terms have significant effect. The performance of the models is evaluated through simulated and real data applications in genomics. Variation in the number of copies of regions of the genome is a well-known and important feature of most cancers. We examine interactions between factors directly associated with different chromosomal regions detected with copy number alteration in breast cancer data. In this context, significant interaction effects for specific genes suggest synergies between duplications and deletions in different regions of the chromosome."
in_NB  gene_expression_data_analysis  factor_analysis  statistics  sparsity 
july 2013 by cshalizi
[1307.2342] Model Selection with Piecewise Regular Gauges
"Regularization plays a pivotal role when facing the challenge of solving ill-posed inverse problems, where the number of observations is smaller than the ambient dimension of the object to be estimated. A line of recent work has studied regularization models with various types of low-dimensional structures. In such settings, the general approach is to solve a regularized optimization problem, which combines a data fidelity term and some regularization penalty that promotes the assumed low-dimensional/simple structure. This paper provides a general framework to capture this low-dimensional structure through what we coin piecewise regular gauges. These are convex, non-negative, closed, bounded and positively homogenous functions that will promote objects living on low-dimensional subspaces. This class of regularizers encompasses many popular examples such as the L^1 norm, L^1-L^2 norm (group sparsity), nuclear norm, as well as several others including the L^inf norm. We will show that the set of piecewise regular gauges is closed under addition and pre-composition by a linear operator, which allows to cover mixed regularization (e.g. sparse+low-rank), and the so-called analysis-type priors (e.g. total variation, fused Lasso, trace Lasso, bounded polyhedral gauges). Our main result presents a unified sharp analysis of exact and robust recovery of the low-dimensional subspace model associated to the object to recover from partial measurements. This analysis is illustrated on a number of special and previously studied cases."

--- A general theory of low-dimensional structure recovery would be _very nice_
to:NB  statistics  optimization  sparsity  lasso  high-dimensional_statistics  inverse_problems  to_read 
july 2013 by cshalizi
[1307.0366] Learning directed acyclic graphs based on sparsest permutations
"We consider the problem of learning a directed acyclic graph (DAG) model based on conditional independence testing. The most widely used approaches to this problem are variants of the PC algorithm. One of the drawbacks of the PC algorithm is that it requires the strong-faithfulness assumption, which has been show to be restrictive especially for graphs with cycles in the skeleton. In this paper, we propose an alternative method based on finding the permutation of the variables that yields the sparsest DAG. We prove that the constraints required for our sparsest permutation (SP) algorithm are strictly weaker than faithfulness and are necessary for any causal inference algorithm based on conditional independence testing. Through specific examples and simulations we show that the SP algorithm has better performance than the PC algorithm. In the Gaussian setting, we prove that our algorithm boils down to finding the permutation of the variables with sparsest Cholesky decomposition for the inverse covariance matrix. Using this connection, we show that in the oracle setting, where the true covariance matrix is known, the SP algorithm is in fact equivalent to $\ell_0$-penalized maximum likelihood estimation."

- but isn't L0 penalization intractable?
to:NB  causal_inference  graphical_models  statistics  sparsity  to_read 
july 2013 by cshalizi
Sparse Additive Functionals and Kernel CCA
"Canonical Correlation Analysis (CCA) is a classical tool for finding correlations among the components of two random vectors. In recent years, CCA has been widely applied to the analysis of genomic data, where it is common for researchers to perform mul- tiple assays on a single set of patient sam- ples. Recent work has proposed sparse vari- ants of CCA to address the high dimension- ality of such data. However, classical and sparse CCA are based on linear models, and are thus limited in their ability to find gen- eral correlations. In this paper, we present two approaches to high-dimensional nonpara- metric CCA, building on recent developments in high-dimensional nonparametric regres- sion. We present estimation procedures for both approaches, and analyze their theoret- ical properties in the high-dimensional set- ting. We demonstrate the effectiveness of these procedures in discovering nonlinear cor- relations via extensive simulations, as well as through experiments with genomic data."
to:NB  sparsity  nonparametrics  high-dimensional_statistics  correlation  to_teach:undergrad-ADA  kith_and_kin  balakrishnan.sivaraman  lafferty.john  have_read 
july 2013 by cshalizi
[1306.6557] Optimal Feature Selection in High-Dimensional Discriminant Analysis
"We consider the high-dimensional discriminant analysis problem. For this problem, different methods have been proposed and justified by establishing exact convergence rates for the classification risk, as well as the l2 convergence results to the discriminative rule. However, sharp theoretical analysis for the variable selection performance of these procedures have not been established, even though model interpretation is of fundamental importance in scientific data analysis. This paper bridges the gap by providing sharp sufficient conditions for consistent variable selection using the sparse discriminant analysis (Mai et al., 2012). Through careful analysis, we establish rates of convergence that are significantly faster than the best known results and admit an optimal scaling of the sample size n, dimensionality p, and sparsity level s in the high-dimensional setting. Sufficient conditions are complemented by the necessary information theoretic limits on the variable selection problem in the context of high-dimensional discriminant analysis. Exploiting a numerical equivalence result, our method also establish the optimal results for the ROAD estimator (Fan et al., 2012) and the sparse optimal scaling estimator (Clemmensen et al., 2011). Furthermore, we analyze an exhaustive search procedure, whose performance serves as a benchmark, and show that it is variable selection consistent under weaker conditions. Extensive simulations demonstrating the sharpness of the bounds are also provided."
in_NB  classifiers  high-dimensional_statistics  sparsity  variable_selection  statistics  liu.han  kolar.mladen 
june 2013 by cshalizi
[1306.3343] A Theory of High-dimensional Sparse Estimation via Non-Convex Regularized Regression
"Non-convex regularized regression improves the performance of high-dimensional sparse estimation. Compared with convex regularizers, one of the important improvements is the weaker requirement on design matrix or, rather, weaker estimation condition. Estimation condition is a core issue for high-dimensional sparse estimation. However, previous works demanded the same estimation conditions as the convex regularized regression, which cannot explain the superiority of non-convex regularizers from the view of estimation condition and limits the further applications of them.
"This paper fills the gap between theory and experience by proposing new sparse eigenvalue based estimation conditions. For a general family of regularizers, named \xi-sharp concave regularizers, our conditions are weaker than that the convex regularizers need. Moreover, consistent sparse estimations are available not only for the global solutions of regularized regression, but also for the so-called approximate global and approximate stationary (AGAS) solutions. Our results on AGAS solutions are useful for application since we show the robustness of the non-convex regularized regression to the inaccuracy of the solutions and give a theoretical guarantee for the numerical solutions. Also, we give a quality guarantee for any solution that is regarded as an approximate global solution and prove that the desired approximate stationary solutions can be obtained simply by coordinate descent methods. This paper provides a general theory to non-convex high-dimensional sparse estimation and can serve as a guideline for selecting regularizers and developing algorithms for non-convex regularized regression."
to:NB  regression  optimization  high-dimensional_statistics  statistics  estimation  sparsity 
june 2013 by cshalizi
[1306.1970] High-dimensional Fused Lasso Regression using Majorization-Minimization and Parallel Processing
"In this paper, we propose a majorization-minimization (MM) algorithm for high-dimensional fused lasso regression (FLR) suitable for parallelization using graphics processing units (GPUs). The MM algorithm is stable and flexible as it can solve the FLR problems with various types of design matrices and penalty structures within a few tens of iterations. We also show that the convergence of the proposed algorithm is guaranteed. We conduct numerical studies to compare our algorithm with other existing algorithms. We demonstrate that the proposed MM algorithm is competitive in general settings. The merit of GPU parallelization is also exhibited."
to:NB  optimization  regression  lasso  sparsity  partial_pooling  statistics 
june 2013 by cshalizi
[1305.7477] On model selection consistency of M-estimators with geometrically decomposable penalties
"Penalized M-estimators are used in many areas of science and engineering to fit models with some low-dimensional structure in high-dimensional settings. In many problems arising in bioinformatics, signal processing, and statistical learning, the penalties are geometrically decomposable, i.e. can be expressed as a sum of support functions. We generalize the notion of irrepresentability and develop a general framework for establishing the model selection consistency of M-estimators with these penalties. We then use this framework to derive results for some special cases of interest in bioinformatics and statistical learning."
to:NB  estimation  model_selection  sparsity  statistics 
june 2013 by cshalizi
[1305.3235] Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices
"This paper considers sparse spiked covariance matrix models in the high-dimensional setting and studies the minimax estimation of the covariance matrix and the principal subspace as well as the minimax rank detection. The optimal rate of convergence for estimating the spiked covariance matrix under the spectral norm is established, which requires significantly different techniques from those for estimating other structured covariance matrices such as bandable or sparse covariance matrices. We also establish the minimax rate under the spectral norm for estimating the principal subspace, the primary object of interest in principal component analysis. In addition, the optimal rate for the rank detection boundary is obtained. This result also resolves the gap in a recent paper by Berthet and Rigollet [1] where the special case of rank one is considered."
in_NB  statistics  variance_estimation  factor_analysis  high-dimensional_statistics  estimation  sparsity  to_read  re:g_paper  to_teach:undergrad-ADA  low-rank_approximation 
may 2013 by cshalizi
Ma : Sparse principal component analysis and iterative thresholding
"Principal component analysis (PCA) is a classical dimension reduction method which projects data onto the principal subspace spanned by the leading eigenvectors of the covariance matrix. However, it behaves poorly when the number of features p is comparable to, or even much larger than, the sample size n. In this paper, we propose a new iterative thresholding approach for estimating principal subspaces in the setting where the leading eigenvectors are sparse. Under a spiked covariance model, we find that the new approach recovers the principal subspace and leading eigenvectors consistently, and even optimally, in a range of high-dimensional sparse settings. Simulated examples also demonstrate its competitive performance."
to:NB  sparsity  principal_components  statistics 
may 2013 by cshalizi
[1305.0355] Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
"In the high-dimensional regression model a response variable is linearly related to $p$ covariates, but the sample size $n$ is smaller than $p$. We assume that only a small subset of covariates is `active' (i.e., the corresponding coefficients are non-zero), and consider the model-selection problem of identifying the active covariates. A popular approach is to estimate the regression coefficients through the Lasso ($\ell_1$-regularized least squares). This is known to correctly identify the active set only if the irrelevant covariates are roughly orthogonal to the relevant ones, as quantified through the so called `irrepresentability' condition. In this paper we study the `Gauss-Lasso' selector, a simple two-stage method that first solves the Lasso, and then performs ordinary least squares restricted to the Lasso active set. We formulate `generalized irrepresentability condition' (GIC), an assumption that is substantially weaker than irrepresentability. We prove that, under GIC, the Gauss-Lasso correctly recovers the active set."
to:NB  regression  statistics  high-dimensional_statistics  lasso  sparsity 
may 2013 by cshalizi
van de Geer , Bühlmann : $ell_{0}$-penalized maximum likelihood for sparse directed acyclic graphs
"We consider the problem of regularized maximum likelihood estimation for the structure and parameters of a high-dimensional, sparse directed acyclic graphical (DAG) model with Gaussian distribution, or equivalently, of a Gaussian structural equation model. We show that the ℓ0-penalized maximum likelihood estimator of a DAG has about the same number of edges as the minimal-edge I-MAP (a DAG with minimal number of edges representing the distribution), and that it converges in Frobenius norm. We allow the number of nodes p to be much larger than sample size n but assume a sparsity condition and that any representation of the true DAG has at least a fixed proportion of its nonzero edge weights above the noise level. Our results do not rely on the faithfulness assumption nor on the restrictive strong faithfulness condition which are required for methods based on conditional independence testing such as the PC-algorithm."

--- Huh.
to:NB  to_read  graph_discovery  graphical_models  causal_inference  sparsity  statistics  van_de_geer.sara  buhlmann.peter  high-dimensional_statistics  to_teach:undergrad-ADA 
april 2013 by cshalizi
[1304.4549] Learning Heteroscedastic Models by Convex Programming under Group Sparsity
"Popular sparse estimation methods based on $\ell_1$-relaxation, such as the Lasso and the Dantzig selector, require the knowledge of the variance of the noise in order to properly tune the regularization parameter. This constitutes a major obstacle in applying these methods in several frameworks---such as time series, random fields, inverse problems---for which the noise is rarely homoscedastic and its level is hard to know in advance. In this paper, we propose a new approach to the joint estimation of the conditional mean and the conditional variance in a high-dimensional (auto-) regression setting. An attractive feature of the proposed estimator is that it is efficiently computable even for very large scale problems by solving a second-order cone program (SOCP). We present theoretical analysis and numerical results assessing the performance of the proposed procedure."
to:NB  lasso  sparsity  variance_estimation  time_series  regression  statistics 
april 2013 by cshalizi
[1304.5802] Nonlinear Basis Pursuit
"In compressive sensing, the basis pursuit algorithm aims to find the sparsest solution to an underdetermined linear equation system. In this paper, we generalize basis pursuit to finding the sparsest solution to higher order nonlinear systems of equations, called nonlinear basis pursuit. In contrast to the existing nonlinear compressive sensing methods, the new algorithm that solves the nonlinear basis pursuit problem is convex and not greedy. The novel algorithm enables the compressive sensing approach to be used for a broader range of applications where there are nonlinear relationships between the measurements and the unknowns."
to:NB  sparsity  dimension_reduction  information_theory 
april 2013 by cshalizi
[1304.2986] Adaptive Piecewise Polynomial Estimation via Trend Filtering
"We study trend filtering, a recently proposed tool of Kim et al. (2009) for nonparametric regression. The trend filtering estimate is defined as the minimizer of a penalized least squares criterion, in which the penalty term sums the absolute kth order discrete derivatives over the input points. Perhaps not surprisingly, trend filtering estimates appear to have the structure of kth degree spline functions, with adaptively chosen knot points (we say "appear" here as trend filtering estimates are not really functions over continuous domains, and are only defined over the discrete set of inputs). This brings to mind comparisons to other nonparametric regression tools that also produce adaptive splines; in particular, we compare trend filtering to smoothing splines, which penalize the sum of squared derivatives across input points, and to locally adaptive regression splines (Mammen & van de Geer 1997), which penalize the total variation of the kth derivative. Empirically, we discover that trend filtering estimates adapt to the local level of smoothness much better than smoothing splines, and further, they exhibit a remarkable similarity to locally adaptive regression splines. We also provide theoretical support for these empirical findings; most notably, we prove that (with the right choice of tuning parameter) the trend filtering estimate converges to the true underlying function at the minimax rate for functions whose kth derivative is of bounded variation. This is done via an asymptotic pairing of trend filtering and locally adaptive regression splines, which have already been shown to converge at the minimax rated (Mammen & van de Geer 1997). At the core of this argument is a new result tying together the fitted values of two lasso problems that share the same outcome vector, but have different predictor matrices."
to:NB  filtering  regression  statistics  splines  sparsity  lasso  kith_and_kin  tibshirani.ryan  have_read 
april 2013 by cshalizi
Alfons , Croux , Gelper : Sparse least trimmed squares regression for analyzing high-dimensional large data sets
"Sparse model estimation is a topic of high importance in modern data analysis due to the increasing availability of data sets with a large number of variables. Another common problem in applied statistics is the presence of outliers in the data. This paper combines robust regression and sparse model estimation. A robust and sparse estimator is introduced by adding an L1 penalty on the coefficient estimates to the well-known least trimmed squares (LTS) estimator. The breakdown point of this sparse LTS estimator is derived, and a fast algorithm for its computation is proposed. In addition, the sparse LTS is applied to protein and gene expression data of the NCI-60 cancer cell panel. Both a simulation study and the real data application show that the sparse LTS has better prediction performance than its competitors in the presence of leverage points."
to:NB  regression  sparsity  statistics  robust_statistics 
april 2013 by cshalizi
Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
"We investigate a curious relationship between the structure of a discrete graphical model and the support of the inverse of a generalized covariance matrix. We show that for certain graph structures, the support of the inverse covariance matrix of indicator variables on the vertices of a graph reflects the conditional independence structure of the graph. Our work extends results that have previously been es- tablished only in the context of multivariate Gaussian graphical models, thereby addressing an open question about the significance of the inverse covariance matrix of a non-Gaussian distribution. Based on our population-level results, we show how the graphical Lasso may be used to recover the edge structure of cer- tain classes of discrete graphical models, and present simulations to verify our theoretical results."
to:NB  graphical_models  sparsity  lasso  statistics  wainwright.martin_j. 
april 2013 by cshalizi
Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
"We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design."
in_NB  low-regret_learning  individual_sequence_prediction  regression  machine_learning  sparsity 
april 2013 by cshalizi
[1304.0682] Sparse Signal Processing with Linear and Non-Linear Observations: A Unified Shannon Theoretic Approach
"In this work we derive fundamental limits for many linear and non-linear sparse signal processing models including group testing, quantized compressive sensing, multivariate regression and observations with missing features. In general sparse signal processing problems can be characterized in terms of the following Markovian property. We are given a set of $N$ variables $X_1,X_2,\ldots,X_N$, and there is an unknown subset $S \subset \{1,\,2,\,\ldots, N\}$ that are \emph{relevant} for predicting outcomes/outputs $Y$. In other words, when $Y$ is conditioned on $\{X_k\}_{k\in S}$ it is conditionally independent of the other variables, $\{X_k\}_{k \not \in S}$.
"Our goal is to identify the set $S$ from samples of the variables $X$ and the associated outcomes $Y$. We characterize this problem as a version of the noisy channel coding theorem. Using asymptotic information theoretic analyses, we describe mutual information formulas that provide sufficient and necessary conditions on the number of samples required to successfully recover the salient variables. This mutual information expression unifies conditions for both linear and non-linear observations. We then compute sample complexity bounds based on the mutual information expressions for different settings including group testing, quantized compressive sensing, multivariate regression and observations with missing features."
to:NB  to_read  information_theory  model_selection  sparsity  compressed_sensing  statistics  machine_learning 
april 2013 by cshalizi
[1303.7409] Homogeneity in Regression
"This paper explores the homogeneity of coefficients in high-dimensional regression, which extends the sparsity concept and is more general and suitable for many applications. Homogeneity arises when one expects regression coefficients corresponding to neighboring geographical regions or a similar cluster of covariates to be approximately the same. Sparsity corresponds to a special case of homogeneity with a known atom zero. In this article, we propose a new method called clustering algorithm in regression via data-driven segmentation (CARDS) to explore homogeneity. New mathematics are provided on the gain that can be achieved by exploring homogeneity. Statistical properties of two versions of CARDS are analyzed. In particular, the asymptotic normality of our proposed CARDS estimator is established, which reveals better estimation accuracy for homogeneous parameters than that without homogeneity exploration. When our methods are combined with sparsity exploration, further efficiency can be achieved beyond the exploration of sparsity alone. This provides additional insights into the power of exploring low-dimensional strucuture in high-dimensional regression: homogeneity and sparsity. The newly developed method is further illustrated by simulation studies and applications to real data."

--- God damn it. How long has that been on my get-to-it-someday list?
to:NB  have_read  regression  sparsity  statistics  scooped  fan.jianqing  memo_to_self_real_scholars_write 
april 2013 by cshalizi
[1303.6811] Sparse approximation and recovery by greedy algorithms in Banach spaces
"We study sparse approximation by greedy algorithms. We prove the Lebesgue-type inequalities for the Weak Chebyshev Greedy Algorithm (WCGA), a generalization of the Weak Orthogonal Matching Pursuit to the case of a Banach space. The main novelty of these results is a Banach space setting instead of a Hilbert space setting. The results are proved for redundant dictionaries satisfying certain conditions. Then we apply these general results to the case of bases. In particular, we prove that the WCGA provides almost optimal sparse approximation for the trigonometric system in $L_p$, $2\le p<\infty$."
to:NB  optimization  sparsity 
march 2013 by cshalizi
[1303.5817] Assumptionless consistency of the Lasso
"The Lasso is a popular statistical tool invented by Robert Tibshirani for linear regression when the number of covariates is greater than or comparable to the number of observations. The validity of the Lasso procedure has been theoretically established under a variety of complicated-looking assumptions by various authors. This article shows that for the loss function considered in Tibshirani's original paper, the Lasso is consistent under almost no assumptions at all."

--- But, umm, didn't we know this already? See Buhlmann & van de Geer, or even just http://normaldeviate.wordpress.com/2012/08/07/rip-rip-restricted-isometry-property-rest-in-peace/
to:NB  regression  lasso  sparsity  statistics  have_read 
march 2013 by cshalizi
[1303.5685] Sparse Factor Analysis for Learning and Content Analytics
"We develop a new model and algorithms for machine learning-based learning analytics, which estimate a learner's knowledge of the concepts underlying a domain, and content analytics, which estimate the relationships among a collection of questions and those concepts. Our model represents the probability that a learner provides the correct response to a question in terms of three factors: their understanding of a set of underlying concepts, the concepts involved in each question, and each question's intrinsic difficulty. We estimate these factors given the graded responses to a collection of questions. The underlying estimation problem is ill-posed in general, especially when only a subset of the questions are answered. The key observation that enables a well-posed solution is the fact that typical educational domains of interest involve only a small number of key concepts. Leveraging this observation, we develop both a bi-convex maximum-likelihood and a Bayesian solution to the resulting SPARse Factor Analysis (SPARFA) problem. We also incorporate user-defined tags on questions to facilitate the interpretability of the estimated factors. Experiments with synthetic and real-world data demonstrate the efficacy of our approach. Finally, we make a connection between SPARFA and noisy, binary-valued (1-bit) dictionary learning that is of independent interest."
to:NB  sparsity  factor_analysis  statistics 
march 2013 by cshalizi
Learning Sparse Causal Gaussian Networks With Experimental Intervention: Regularization and Coordinate Descent - Journal of the American Statistical Association - Volume 108, Issue 501
"Causal networks are graphically represented by directed acyclic graphs (DAGs). Learning causal networks from data is a challenging problem due to the size of the space of DAGs, the acyclicity constraint placed on the graphical structures, and the presence of equivalence classes. In this article, we develop an L 1-penalized likelihood approach to estimate the structure of causal Gaussian networks. A blockwise coordinate descent algorithm, which takes advantage of the acyclicity constraint, is proposed for seeking a local maximizer of the penalized likelihood. We establish that model selection consistency for causal Gaussian networks can be achieved with the adaptive lasso penalty and sufficient experimental interventions. Simulation and real data examples are used to demonstrate the effectiveness of our method. In particular, our method shows satisfactory performance for DAGs with 200 nodes, which have about 20,000 free parameters. Supplementary materials for this article are available online."

- But _why_ would I assume things are Gaussian?
to:NB  sparsity  causal_inference  graphical_models  statistics 
march 2013 by cshalizi
[1302.2758] Sparsity and the Bayesian Perspective
"Sparsity has been recently introduced in cosmology for weak-lensing and CMB data analysis for different applications such as denoising, component separation or inpainting (i.e. filling the missing data or the mask). Although it gives very nice numerical results, CMB sparse inpainting has been severely criticized by top researchers in cosmology, based on arguments derived from a Bayesian perspective. Trying to understand their point of view, we realize that interpreting a regularization penalty term as a prior in a Bayesian framework can lead to erroneous conclusions. This paper is by no means against the Bayesian approach, which has proven to be very useful for many applications, but warns about a Bayesian-only interpretation in data analysis, which can be misleading in some cases."
to:NB  bayesianism  statistics  sparsity  data_analysis 
february 2013 by cshalizi
[1212.6232] High-Dimensional Sparse Additive Hazards Regression
"High-dimensional sparse modeling with censored survival data is of great practical importance, as exemplified by modern applications in high-throughput genomic data analysis and credit risk analysis. In this article, we propose a class of regularization methods for simultaneous variable selection and estimation in the additive hazards model, by combining the nonconcave penalized likelihood approach and the pseudoscore method. In a high-dimensional setting where the dimensionality can grow fast, polynomially or nonpolynomially, with the sample size, we establish the weak oracle property and oracle property under mild, interpretable conditions, thus providing strong performance guarantees for the proposed methodology. Moreover, we show that the regularity conditions required by the $L_1$ method are substantially relaxed by a certain class of sparsity-inducing concave penalties. As a result, concave penalties such as the smoothly clipped absolute deviation (SCAD), minimax concave penalty (MCP), and smooth integration of counting and absolute deviation (SICA) can significantly improve on the $L_1$ method and yield sparser models with better prediction performance. We present a coordinate descent algorithm for efficient implementation and rigorously investigate its convergence properties. The practical utility and effectiveness of the proposed methods are demonstrated by simulation studies and a real data example."
to:NB  statistics  survival_analysis  additive_models  high-dimensional_statistics  regression  sparsity 
december 2012 by cshalizi
[1212.6088] Bayesian shrinkage
"Penalized regression methods, such as $L_1$ regularization, are routinely used in high-dimensional applications, and there is a rich literature on optimality properties under sparsity assumptions. In the Bayesian paradigm, sparsity is routinely induced through two-component mixture priors having a probability mass at zero, but such priors encounter daunting computational problems in high dimensions. This has motivated an amazing variety of continuous shrinkage priors, which can be expressed as global-local scale mixtures of Gaussians, facilitating computation. In sharp contrast to the corresponding frequentist literature, very little is known about the properties of such priors. Focusing on a broad class of shrinkage priors, we provide precise results on prior and posterior concentration. Interestingly, we demonstrate that most commonly used shrinkage priors, including the Bayesian Lasso, are suboptimal in high-dimensional settings. A new class of Dirichlet Laplace (DL) priors are proposed, which are optimal and lead to efficient posterior computation exploiting results from normalized random measure theory. Finite sample performance of Dirichlet Laplace priors relative to alternatives is assessed in simulations."
to:NB  sparsity  regression  estimation  high-dimensional_statistics  pillai.natesh 
december 2012 by cshalizi
Mazumder , Hastie : The graphical lasso: New insights and alternatives
"The graphical lasso [5] is an algorithm for learning the structure in an undirected Gaussian graphical model, using ℓ1 regularization to control the number of zeros in the precision matrix Θ=Σ−1 [2, 11]. The R package GLASSO [5] is popular, fast, and allows one to efficiently build a path of models for different values of the tuning parameter. Convergence of GLASSO can be tricky; the converged precision matrix might not be the inverse of the estimated covariance, and occasionally it fails to converge with warm starts. In this paper we explain this behavior, and propose new algorithms that appear to outperform GLASSO.
"By studying the “normal equations” we see that, GLASSO is solving the dual of the graphical lasso penalized likelihood, by block coordinate ascent; a result which can also be found in [2]. In this dual, the target of estimation is Σ, the covariance matrix, rather than the precision matrix Θ. We propose similar primal algorithms P-GLASSO and DP-GLASSO, that also operate by block-coordinate descent, where Θ is the optimization target. We study all of these algorithms, and in particular different approaches to solving their coordinate sub-problems. We conclude that DP-GLASSO is superior from several points of view."
to:NB  graphical_models  statistics  sparsity  lasso  optimization 
november 2012 by cshalizi
Chandrasekaran , Parrilo , Willsky : Latent variable graphical model selection via convex optimization
"Suppose we observe samples of a subset of a collection of random variables. No additional information is provided about the number of latent variables, nor of the relationship between the latent and observed variables. Is it possible to discover the number of latent components, and to learn a statistical model over the entire collection of variables? We address this question in the setting in which the latent and observed variables are jointly Gaussian, with the conditional statistics of the observed variables conditioned on the latent variables being specified by a graphical model. As a first step we give natural conditions under which such latent-variable Gaussian graphical models are identifiable given marginal statistics of only the observed variables. Essentially these conditions require that the conditional graphical model among the observed variables is sparse, while the effect of the latent variables is “spread out” over most of the observed variables. Next we propose a tractable convex program based on regularized maximum-likelihood for model selection in this latent-variable setting; the regularizer uses both the ℓ1 norm and the nuclear norm. Our modeling framework can be viewed as a combination of dimensionality reduction (to identify latent variables) and graphical modeling (to capture remaining statistical structure not attributable to the latent variables), and it consistently estimates both the number of latent components and the conditional graphical model structure among the observed variables. These results are applicable in the high-dimensional setting in which the number of latent/observed variables grows with the number of samples of the observed variables. The geometric properties of the algebraic varieties of sparse matrices and of low-rank matrices play an important role in our analysis."

- With lots of discussants. When reading, I should think about how Thomson sampling (http://bactra.org/notebooks/thomson-sampling-model.html) would or would not break their assumptions.

- Ungated: http://arxiv.org/abs/1008.1290
in_NB  factor_analysis  graphical_models  inference_to_latent_objects  statistics  sparsity  high-dimensional_statistics 
october 2012 by cshalizi
Castillo , van der Vaart : Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences
"We consider full Bayesian inference in the multivariate normal mean model in the situation that the mean vector is sparse. The prior distribution on the vector of means is constructed hierarchically by first choosing a collection of nonzero means and next a prior on the nonzero values. We consider the posterior distribution in the frequentist set-up that the observations are generated according to a fixed mean vector, and are interested in the posterior distribution of the number of nonzero components and the contraction of the posterior distribution to the true mean vector. We find various combinations of priors on the number of nonzero coefficients and on these coefficients that give desirable performance. We also find priors that give suboptimal convergence, for instance, Gaussian priors on the nonzero coefficients. We illustrate the results by simulations."
in_NB  statistics  estimation  bayesian_consistency  sparsity  van_der_vaart.aad 
october 2012 by cshalizi
ℓ1-regularized linear regression: persistence and oracle inequalities
"We study the predictive performance of ell_1-regularized linear regression in a model-free setting, including the case where the number of covariates is substantially larger than the sample size. We introduce a new analysis method that avoids the boundedness problems that typically arise in model-free empirical minimization. Our technique provides an answer to a conjecture of Greenshtein and Ritov (Bernoulli 10(6):971–988, 2004) regarding the “persistence” rate for linear regression and allows us to prove an oracle inequality for the error of the regularized minimizer. It also demonstrates that empirical risk minimization gives optimal rates (up to log factors) of convex aggregation of a set of estimators of a regression function."
to:NB  regression  statistics  lasso  sparsity 
october 2012 by cshalizi
[1209.1557] Learning Model-Based Sparsity via Projected Gradient Descent
"Several convex formulation methods have been proposed previously for statistical estimation with structured sparsity as the prior. These methods often require a carefully tuned regularization parameter, often a cumbersome or heuristic exercise. Furthermore, the estimate that these methods produce might not belong to the desired sparsity model, albeit accurately approximating the true parameter. Therefore, greedy-type algorithms could often be more desirable in estimating structured-sparse parameters. So far, these greedy methods have mostly focused on linear statistical models. In this paper we study the projected gradient descent with non-convex structured-sparse parameter model as the constraint set. Should the cost function have a Stable Model-Restricted Hessian the algorithm converges to the desired minimizer up to an approximation error. As an example we elaborate on application of the main results to estimation in Generalized Linear Model."
to:NB  sparsity  optimization  statistics  estimation 
september 2012 by cshalizi
Lee , MacEachern , Jung : Regularization of Case-Specific Parameters for Robustness and Efficiency
"Regularization methods allow one to handle a variety of inferential problems where there are more covariates than cases. This allows one to consider a potentially enormous number of covariates for a problem. We exploit the power of these techniques, supersaturating models by augmenting the “natural” covariates in the problem with an additional indicator for each case in the data set. We attach a penalty term for these case-specific indicators which is designed to produce a desired effect. For regression methods with squared error loss, an ℓ1 penalty produces a regression which is robust to outliers and high leverage cases; for quantile regression methods, an ℓ2 penalty decreases the variance of the fit enough to overcome an increase in bias. The paradigm thus allows us to robustify procedures which lack robustness and to increase the efficiency of procedures which are robust.
"We provide a general framework for the inclusion of case-specific parameters in regularization problems, describing the impact on the effective loss for a variety of regression and classification problems. We outline a computational strategy by which existing software can be modified to solve the augmented regularization problem, providing conditions under which such modification will converge to the optimum solution. We illustrate the benefits of including case-specific parameters in the context of mean regression and quantile regression through analysis of NHANES and linguistic data sets."
to:NB  statistics  regression  sparsity  outliers 
september 2012 by cshalizi
Loh , Wainwright : High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity
"Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependence, as well. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able to both analyze the statistical error associated with any global optimum, and more surprisingly, to prove that a simple algorithm based on projected gradient descent will converge in polynomial time to a small neighborhood of the set of all global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing close agreement with the predicted scalings."

Ungated; http://arxiv.org/abs/1109.3714
in_NB  to_read  regression  linear_regression  sparsity  high-dimensional_statistics  optimization  wainwright.martin_j.  statistics 
september 2012 by cshalizi
[1208.2572] Nonparametric sparsity and regularization
"In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods."
to:NB  to_read  nonparametrics  regression  variable_selection  sparsity  statistics 
september 2012 by cshalizi
« earlier      
per page:    204080120160

related tags

adaptive_behavior  additive_models  algorithmic_information_theory  allen.genevera_i.  anomaly_detection  balakrishnan.sivaraman  bayesianism  bayesian_consistency  biochemical_networks  books:noted  books:owned  books:recommended  bootstrap  buhlmann.peter  cats  causal_inference  chouldechova.alexandra  classifiers  climatology  clustering  compressed_sensing  computational_complexity  computational_statistics  conditional_random_fields  confidence_sets  convexity  copulas  correlation  cox.d.r.  data_analysis  degrees_of_freedom  density_estimation  dimension_reduction  distributed_systems  econometrics  empirical_processes  ensemble_methods  entableted  estimation  exponential_families  factor_analysis  fan.jianqing  filtering  flack.jessica  functional_data_analysis  funny:geeky  gene_expression_data_analysis  genovese.chris  geometry  graphical_models  graph_discovery  hansen.christian  hastie.trevor  have_read  heard_the_talk  hierarchical_structure  high-dimensional_statistics  hilbert_space  hoff.peter  hypothesis_testing  immunology  individual_sequence_prediction  inference_to_latent_objects  information_theory  instrumental_variables  inverse_problems  in_NB  ising_model  jin.jiashun  kernel_methods  kith_and_kin  kolar.mladen  krakauer.david  lafferty.john  lasso  learning_theory  lee.ann_b.  lei.jing  levy_processes  likelihood  linear_algebra  linear_regression  liu.han  logistic_regression  low-rank_approximation  low-regret_learning  machine_learning  markov_models  meier.lukas  memo_to_self_real_scholars_write  method_of_sieves  minimax  missing_data  mixture_models  model_checking  model_selection  nardi.yuval  networks  network_data_analysis  neural_data_analysis  neuroscience  nickl.richard  nonparametrics  optimization  outliers  partial_pooling  pillai.natesh  prediction  primates  principal_components  random_fields  random_projections  ravikumar.pradeep  re:6dfb  re:g_paper  re:network_differences  re:phil-of-bayes_paper  re:what_is_the_right_null_model_for_linear_regression  re:your_favorite_dsge_sucks  regression  ridge_regression  rinaldo.alessandro  robust_statistics  samworth.richard_j.  scooped  semi-supervised_learning  signal_processing  sociology  sparsity  splines  statistical_mechanics  statistics  stochastic_processes  survival_analysis  text_mining  tibshirani.robert  tibshirani.ryan  time_series  to:NB  to_besh  to_be_shot_after_a_fair_trial  to_read  to_teach:data-mining  to_teach:undergrad-ADA  track_down_references  van_der_vaart.aad  van_de_geer.sara  variable_selection  variance_estimation  vats.divyanshu  via:klk  via:shivak  vu.vincent  wainwright.martin_j.  wasserman.larry  xing.eric  yu.bin  zhang.tong 

Copy this bookmark: