**sparsity** (423)

[1603.07632] Statistical inference in sparse high-dimensional additive models

28 days ago by cshalizi

"In this paper we discuss the estimation of a nonparametric component f1 of a nonparametric additive model Y=f1(X1)+...+fq(Xq)+ϵ. We allow the number q of additive components to grow to infinity and we make sparsity assumptions about the number of nonzero additive components. We compare this estimation problem with that of estimating f1 in the oracle model Z=f1(X1)+ϵ, for which the additive components f2,…,fq are known. We construct a two-step presmoothing-and-resmoothing estimator of f1 and state finite-sample bounds for the difference between our estimator and some smoothing estimators f̂1^(oracle) in the oracle model. In an asymptotic setting these bounds can be used to show asymptotic equivalence of our estimator and the oracle estimators; the paper thus shows that, asymptotically, under strong enough sparsity conditions, knowledge of f2,…,fq has no effect on estimation accuracy. Our first step is to estimate f1 with an undersmoothed estimator based on near-orthogonal projections with a group Lasso bias correction. We then construct pseudo responses Ŷ by evaluating a debiased modification of our undersmoothed estimator of f1 at the design points. In the second step the smoothing method of the oracle estimator f̂1^(oracle) is applied to a nonparametric regression problem with responses Ŷ and covariates X1. Our mathematical exposition centers primarily on establishing properties of the presmoothing estimator. We present simulation results demonstrating close-to-oracle performance of our estimator in practical applications."
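The paper's two-step estimator is intricate, but the oracle-comparison setup it studies is easy to simulate. The sketch below is my own toy illustration, not the authors' presmoothing-and-resmoothing method: it compares a crude one-pass backfitting estimate of f1 against the oracle fit of Z = Y − f2(X2) on X1, with degree-7 polynomial regression standing in for the smoother.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X1, X2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
f1 = np.sin(np.pi * X1)        # mean-zero additive components
f2 = X2 ** 2 - 1.0 / 3.0
Y = f1 + f2 + 0.1 * rng.standard_normal(n)

# Oracle model Z = f1(X1) + eps: f2 is known and subtracted off.
oracle = np.polynomial.Polynomial.fit(X1, Y - f2, 7)

# f2 unknown: one crude backfitting pass as a stand-in estimator.
g2 = np.polynomial.Polynomial.fit(X2, Y, 7)   # absorbs f2, since f1 has mean 0
g1 = np.polynomial.Polynomial.fit(X1, Y - g2(X2), 7)

grid = np.linspace(-0.9, 0.9, 50)
gap = np.max(np.abs(g1(grid) - oracle(grid)))  # distance to the oracle fit
```

Even this crude version lands close to the oracle fit here; the paper's point is that, under sparsity, a properly constructed estimator is asymptotically *equivalent* to it.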

to:NB
additive_models
statistics
regression
nonparametrics
sparsity
28 days ago by cshalizi

[1905.13657] Approximate Cross-Validation in High Dimensions with Guarantees

28 days ago by cshalizi

"Leave-one-out cross validation (LOOCV) can be particularly accurate among CV variants for estimating out-of-sample error. Unfortunately, LOOCV requires re-fitting a model N times for a dataset of size N. To avoid this prohibitive computational expense, a number of authors have proposed approximations to LOOCV. These approximations work well when the unknown parameter is of small, fixed dimension but suffer in high dimensions; they incur a running time roughly cubic in the dimension, and, in fact, we show their accuracy significantly deteriorates in high dimensions. We demonstrate that these difficulties can be surmounted in ℓ1-regularized generalized linear models when we assume that the unknown parameter, while high dimensional, has a small support. In particular, we show that, under interpretable conditions, the support of the recovered parameter does not change as each datapoint is left out. This result implies that the previously proposed heuristic of only approximating CV along the support of the recovered parameter has running time and error that scale with the (small) support size even when the full dimension is large. Experiments on synthetic and real data support the accuracy of our approximations."
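The support-restriction idea is easy to see in a toy linear setting. This sketch is not the paper's GLM machinery: it simply assumes the support is known and uses the exact leave-one-out identity e_i/(1 − h_ii) for least squares on the support columns, so the cost scales with the (small) support size rather than the ambient dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 50, 200, 3                 # n samples, ambient dim d, support size s
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:s] = [2.0, -1.5, 1.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

support = np.flatnonzero(beta)       # pretend the lasso recovered this support
Xs = X[:, support]

# Exact LOO residuals for least squares on the support, via the
# leverage shortcut e_i / (1 - h_ii): cost depends on s, not d.
H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
resid = y - H @ y
loo = resid / (1.0 - np.diag(H))
cv_err = np.mean(loo ** 2)

# Brute-force check: refit n times with one point held out.
brute = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(Xs[mask], y[mask], rcond=None)[0]
    brute.append(y[i] - Xs[i] @ b)
```

The shortcut matches the n refits exactly; the paper's contribution is showing when the support itself is stable under leaving one point out, so this restriction is justified for ℓ1-regularized GLMs.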

to:NB
cross-validation
sparsity
high-dimensional_statistics
computational_statistics
statistics
28 days ago by cshalizi

[1909.13189] Learning Sparse Nonparametric DAGs

7 weeks ago by cshalizi

"We develop a framework for learning sparse nonparametric directed acyclic graphs (DAGs) from data. Our approach is based on a recent algebraic characterization of DAGs that led to the first fully continuous optimization for score-based learning of DAG models parametrized by a linear structural equation model (SEM). We extend this algebraic characterization to nonparametric SEM by leveraging nonparametric sparsity based on partial derivatives, resulting in a continuous optimization problem that can be applied to a variety of nonparametric and semiparametric models including GLMs, additive noise models, and index models as special cases. We also explore the use of neural networks and orthogonal basis expansions to model nonlinearities for general nonparametric models. Extensive empirical study confirms the necessity of nonlinear dependency and the advantage of continuous optimization for score-based learning."

to:NB
causal_discovery
graphical_models
optimization
statistics
ravikumar.pradeep
xing.eric
sparsity
to_read
7 weeks ago by cshalizi

Testing Sparsity-Inducing Penalties: Journal of Computational and Graphical Statistics

august 2019 by cshalizi

"Many penalized maximum likelihood estimators correspond to posterior mode estimators under specific prior distributions. Appropriateness of a particular class of penalty functions can therefore be interpreted as the appropriateness of a prior for the parameters. For example, the appropriateness of a lasso penalty for regression coefficients depends on the extent to which the empirical distribution of the regression coefficients resembles a Laplace distribution. We give a testing procedure of whether or not a Laplace prior is appropriate and accordingly, whether or not using a lasso penalized estimate is appropriate. This testing procedure is designed to have power against exponential power priors which correspond to ℓq penalties. Via simulations, we show that this testing procedure achieves the desired level and has enough power to detect violations of the Laplace assumption when the numbers of observations and unknown regression coefficients are large. We then introduce an adaptive procedure that chooses a more appropriate prior and corresponding penalty from the class of exponential power priors when the null hypothesis is rejected. We show that this can improve estimation of the regression coefficients both when they are drawn from an exponential power distribution and when they are drawn from a spike-and-slab distribution. Supplementary materials for this article are available online."

--- I feel like I fundamentally disagree with this approach. Those priors are merely (to quote Jamie Robins and Larry Wasserman) "frequentist pursuit", and have no bearing on whether (say) the Lasso will give a good sparse, linear approximation to the underlying regression function (see https://normaldeviate.wordpress.com/2013/09/11/consistency-sparsistency-and-presistency/). All of which said, Hoff is always worth listening to, so the last tag applies with special force.

to:NB
model_checking
sparsity
regression
hypothesis_testing
bayesianism
re:phil-of-bayes_paper
hoff.peter
to_besh
august 2019 by cshalizi

High-Dimensional Adaptive Minimax Sparse Estimation With Interactions - IEEE Journals & Magazine

august 2019 by cshalizi

"High-dimensional linear regression with interaction effects is broadly applied in research fields such as bioinformatics and social science. In this paper, first, we investigate the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions. Here, we derive matching upper and lower bounds under three types of heredity conditions: strong heredity, weak heredity, and no heredity. From the results: 1) A stronger heredity condition may or may not drastically improve the minimax rate of convergence. In fact, in some situations, the minimax rates of convergence are the same under all three heredity conditions; 2) The minimax rate of convergence is determined by the maximum of the total price of estimating the main effects and that of estimating the interaction effects, which goes beyond purely comparing the order of the number of non-zero main effects r1 and non-zero interaction effects r2; and 3) Under any of the three heredity conditions, the estimation of the interaction terms may be the dominant part in determining the rate of convergence. This is due to either the dominant number of interaction effects over main effects or the higher interaction estimation price induced by a large ambient dimension. Second, we construct an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices r1, r2."

to:NB
statistics
high-dimensional_statistics
regression
sparsity
variable_selection
linear_regression
august 2019 by cshalizi

[1806.03120] Variational inference for sparse network reconstruction from count data

may 2019 by cshalizi

"In multivariate statistics, the question of finding direct interactions can be formulated as a problem of network inference - or network reconstruction - for which the Gaussian graphical model (GGM) provides a canonical framework. Unfortunately, the Gaussian assumption does not apply to count data which are encountered in domains such as genomics, social sciences or ecology.

"To circumvent this limitation, state-of-the-art approaches use two-step strategies that first transform counts to pseudo Gaussian observations and then apply a (partial) correlation-based approach from the abundant literature of GGM inference. We adopt a different stance by relying on a latent model where we directly model counts by means of Poisson distributions that are conditional to latent (hidden) Gaussian correlated variables. In this multivariate Poisson lognormal-model, the dependency structure is completely captured by the latent layer. This parametric model enables to account for the effects of covariates on the counts.

"To perform network inference, we add a sparsity inducing constraint on the inverse covariance matrix of the latent Gaussian vector. Unlike the usual Gaussian setting, the penalized likelihood is generally not tractable, and we resort instead to a variational approach for approximate likelihood maximization. The corresponding optimization problem is solved by alternating a gradient ascent on the variational parameters and a graphical-Lasso step on the covariance matrix.

"We show that our approach is highly competitive with the existing methods on simulation inspired from microbiological data. We then illustrate on three various data sets how accounting for sampling efforts via offsets and integrating external covariates (which is mostly never done in the existing literature) drastically changes the topology of the inferred network."
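The GGM premise the abstract relies on — a zero in the precision (inverse covariance) matrix means no direct interaction — can be checked in a few lines; the sparsity constraint on the latent precision matrix is what turns estimation into network reconstruction. A toy chain X1 → X2 → X3:

```python
import numpy as np

# In a Gaussian graphical model, edge (i, j) is absent exactly when entry
# (i, j) of the precision matrix is zero. On a chain X1 -> X2 -> X3,
# X1 and X3 are marginally dependent but conditionally independent
# given X2, so Theta[0, 2] should be ~0 while corr(X1, X3) is not.
rng = np.random.default_rng(0)
n = 200_000
x1 = rng.standard_normal(n)
x2 = x1 + rng.standard_normal(n)
x3 = x2 + rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])

Theta = np.linalg.inv(np.cov(X, rowvar=False))  # empirical precision matrix
```

The Poisson-lognormal model in the paper pushes this same structure into a latent Gaussian layer, since the zero-pattern logic fails for raw counts.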

to:NB
sparsity
graphical_models
statistics
re:6dfb
may 2019 by cshalizi

SpecAugment – Keunwoo Choi

may 2019 by arsyed

"This means we need all the combinations of those features to completely cover the instances of /uh/ sound. If there are large numbers of features (if N is large), chances are we don’t have them all. Therefore, this {N dimension vector}-to-{number of phonemes} classifier is hard to train. Data sparsity. Maybe the network even has to extrapolate, which it sucks at.

By masking, we’re providing the hints in a more separate way. The network can learn to classify with smaller numbers of features. It also breaks the N-dim feature vectors into something smaller. The N features don’t need to co-exist."
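The masking itself is a few lines of array slicing. Here is a minimal numpy sketch in the spirit of SpecAugment (mask sizes drawn uniformly up to a cap; the parameter names are mine, not the paper's):

```python
import numpy as np

def spec_augment(spec, n_freq_masks=1, freq_width=8,
                 n_time_masks=1, time_width=16, rng=None):
    """Zero out random frequency bands and time spans of a spectrogram.

    spec is a (freq_bins, time_steps) array; each mask's size is drawn
    uniformly between 0 and the given maximum width.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        w = rng.integers(0, freq_width + 1)
        f0 = rng.integers(0, max(1, n_freq - w))
        out[f0:f0 + w, :] = 0.0        # frequency mask
    for _ in range(n_time_masks):
        w = rng.integers(0, time_width + 1)
        t0 = rng.integers(0, max(1, n_time - w))
        out[:, t0:t0 + w] = 0.0        # time mask
    return out

spec = np.ones((80, 100))
masked = spec_augment(spec, rng=np.random.default_rng(0))
```

Each training pass sees a different random mask, so the network cannot rely on any one band of features always being present.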

speech
data-augmentation
spectrogram
masking
sparsity
may 2019 by arsyed

[1903.04641] Generalized Sparse Additive Models

april 2019 by cshalizi

"We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework."
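The group-lasso-style penalties underlying such frameworks zero out whole blocks of coefficients at once, which is what removes an additive component entirely. A minimal sketch of the block soft-thresholding (proximal) step — the generic mechanism, not this paper's specific algorithm:

```python
import numpy as np

def block_soft_threshold(beta, groups, lam):
    """Proximal operator of the group-lasso penalty: shrink each
    coefficient group toward zero, zeroing whole groups at once."""
    out = beta.copy()
    for g in groups:
        nrm = np.linalg.norm(beta[g])
        out[g] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * beta[g]
    return out

beta = np.array([3.0, 4.0, 0.3, -0.4])
groups = [np.array([0, 1]), np.array([2, 3])]
shr = block_soft_threshold(beta, groups, lam=1.0)
# group 1 (norm 5) shrinks to norm 4; group 2 (norm 0.5) is zeroed entirely
```

In a sparse additive model each group holds the basis coefficients of one component, so a zeroed group means that component drops out of the fit.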

to:NB
sparsity
regression
additive_models
high-dimensional_statistics
statistics
april 2019 by cshalizi

Bayes Sparse Regression

december 2018 by tarakc02

In this case study I’ll review how sparsity arises in frequentist and Bayesian analyses and discuss the often subtle challenges in implementing sparsity in practical Bayesian analyses.

bayesian-inference
sparsity
sparse-regression
regression
statistical-modeling
bayes
shrinkage
penalized-regression
lasso
december 2018 by tarakc02

[1811.01159] Understanding and Comparing Scalable Gaussian Process Regression for Big Data

november 2018 by arsyed

As a non-parametric Bayesian model which produces an informative predictive distribution, the Gaussian process (GP) has been widely used in various fields, like regression, classification and optimization. The cubic complexity of the standard GP, however, leads to poor scalability, which poses challenges in the era of big data. Hence, various scalable GPs have been developed in the literature in order to improve scalability while retaining desirable prediction accuracy. This paper investigates the methodological characteristics and performance of representative global and local scalable GPs, including sparse approximations and local aggregations, from four main perspectives: scalability, capability, controllability and robustness. The numerical experiments on two toy examples and five real-world datasets with up to 250K points offer the following findings. In terms of scalability, most of the scalable GPs have a time complexity that is linear in the training size. In terms of capability, the sparse approximations capture the long-term spatial correlations, while the local aggregations capture local patterns but suffer from over-fitting in some scenarios. In terms of controllability, we can improve the performance of sparse approximations simply by increasing the inducing size, but this is not the case for local aggregations. In terms of robustness, local aggregations are robust to various initializations of hyperparameters due to the local attention mechanism. Finally, we highlight that a proper hybrid of global and local scalable GPs may be a promising way to improve both model capability and scalability for big data.
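As a concrete instance of the sparse-approximation family, here is a minimal Subset-of-Regressors/Nyström predictive mean with M inducing points, costing O(NM²) instead of the O(N³) of exact GP regression (a generic sketch, not tied to this paper's experiments):

```python
import numpy as np

def rbf(A, B, ls=0.5):
    """Squared-exponential kernel between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(0)
N, M = 2000, 30                          # training points, inducing points
X = rng.uniform(-3, 3, (N, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(N)
Z = np.linspace(-3, 3, M)[:, None]       # inducing inputs on a grid
noise = 0.1 ** 2

# Subset-of-Regressors predictive mean:
#   mean(x*) = K_{*z} (noise * K_zz + K_zx K_xz)^{-1} K_zx y
Kzz = rbf(Z, Z)
Kzx = rbf(Z, X)
A = noise * Kzz + Kzx @ Kzx.T + 1e-8 * np.eye(M)   # jitter for stability
Xtest = np.linspace(-3, 3, 100)[:, None]
mean = rbf(Z, Xtest).T @ np.linalg.solve(A, Kzx @ y)
```

The only N-sized operations are the K_zx products; everything inverted is M × M, which is what makes the inducing-point family scale.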

gaussian-processes
sparsity
scaling
big-data
november 2018 by arsyed

[1712.05630] Sparse principal component analysis via random projections

september 2018 by arsyed

We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully-selected random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so is not vulnerable to a bad choice of initialisation. Our theory provides great detail on the statistical and computational trade-off in our procedure, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite-sample performance.
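A loose sketch of the flavor of the procedure (my simplification, not the authors' exact algorithm or tuning): take the top eigenvector of many random axis-aligned submatrices of the sample covariance, keep the projections with the largest leading eigenvalues, average them after fixing signs, and hard-threshold to k coordinates.

```python
import numpy as np

def sparse_pc_via_projections(S, k, d=10, n_proj=500, n_keep=25, rng=None):
    """Toy sketch: estimate a k-sparse leading eigenvector of covariance S
    by aggregating top eigenvectors of random d x d principal submatrices."""
    rng = rng or np.random.default_rng()
    p = S.shape[0]
    cands = []
    for _ in range(n_proj):
        idx = rng.choice(p, size=d, replace=False)   # random axis-aligned projection
        vals, vecs = np.linalg.eigh(S[np.ix_(idx, idx)])
        v = np.zeros(p)
        v[idx] = vecs[:, -1]
        cands.append((vals[-1], v))
    cands.sort(key=lambda t: -t[0])                  # select by leading eigenvalue
    kept = np.array([v for _, v in cands[:n_keep]])
    kept *= np.sign(kept @ kept[0]).reshape(-1, 1)   # resolve sign ambiguity
    v_bar = kept.mean(axis=0)
    out = np.zeros(p)
    keep = np.argsort(np.abs(v_bar))[-k:]            # hard-threshold to k entries
    out[keep] = v_bar[keep]
    return out / np.linalg.norm(out)

p, k = 30, 3
v_true = np.zeros(p)
v_true[:k] = 1.0 / np.sqrt(k)
S = 5.0 * np.outer(v_true, v_true) + np.eye(p)       # spiked covariance
v_hat = sparse_pc_via_projections(S, k, rng=np.random.default_rng(0))
```

Note there is no iteration and hence no initialisation to get wrong, which is the selling point highlighted in the abstract.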

pca
sparsity
random-projections
september 2018 by arsyed

[1712.05630] Sparse principal component analysis via random projections

september 2018 by cshalizi

"We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully-selected random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so is not vulnerable to a bad choice of initialisation. Our theory provides great detail on the statistical and computational trade-off in our procedure, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite-sample performance."

to:NB
principal_components
sparsity
random_projections
statistics
samworth.richard_j.
september 2018 by cshalizi

Everything you did and didn't know about PCA · Its Neuronal

july 2018 by arsyed

"The question then becomes, how do we choose where to truncate? This used to be one of those classic questions with an unsatisfying answer… Basically, eyeball it.

Gavish & Donoho (2014) present a long overdue result on this problem and their answer is surprisingly simple and concrete. Essentially, the optimal[7] procedure boils down to estimating the noise in the dataset, [σ], and then throwing away all components whose singular values are below a specified threshold. For a square matrix, this threshold is [(4/√3)·√n·σ].

There is a similar threshold for non-square datasets explained in the paper. As with any theoretical study, the result comes with a few assumptions and caveats,[8] but their work appears robust and useful in practice.

Edit: Thanks to Jonathan Pillow for pointing out a Bayesian alternative outlined here: Minka (2000). Automatic choice of dimensionality for PCA"
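For the square case the recipe is short; the sketch below applies the known-noise threshold (4/√3)·√n·σ from Gavish & Donoho (2014) to a synthetic low-rank-plus-noise matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank, sigma = 200, 5, 1.0

# Planted low-rank signal plus i.i.d. Gaussian noise.
U = rng.standard_normal((n, rank))
V = rng.standard_normal((n, rank))
X = 3.0 * (U @ V.T) + sigma * rng.standard_normal((n, n))

s = np.linalg.svd(X, compute_uv=False)
# Gavish & Donoho (2014): for a square n x n matrix with known noise
# level sigma, the optimal hard threshold is (4 / sqrt(3)) * sqrt(n) * sigma.
tau = (4.0 / np.sqrt(3.0)) * np.sqrt(n) * sigma
k = int(np.sum(s > tau))   # number of components to keep
```

With the signal well above the noise floor, the rule recovers the planted rank with no eyeballing.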

pca
dimensionality-reduction
sparsity
optimization
july 2018 by arsyed

[1705.07704] A Regularized Framework for Sparse and Structured Neural Attention

january 2018 by arsyed

Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max operator. We show that the gradient of this operator defines a mapping from real values to probabilities, suitable as an attention mechanism. Our framework includes softmax and a slight generalization of the recently-proposed sparsemax as special cases. However, we also show how our framework can incorporate modern structured penalties, resulting in more interpretable attention mechanisms, that focus on entire segments or groups of an input. We derive efficient algorithms to compute the forward and backward passes of our attention mechanisms, enabling their use in a neural network trained with backpropagation. To showcase their potential as a drop-in replacement for existing ones, we evaluate our attention mechanisms on three large-scale tasks: textual entailment, machine translation, and sentence summarization. Our attention mechanisms improve interpretability without sacrificing performance; notably, on textual entailment and summarization, we outperform the standard attention mechanisms based on softmax and sparsemax.
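Sparsemax, the special case mentioned here, is just the Euclidean projection onto the probability simplex; unlike softmax it can return exact zeros. A small numpy sketch of the standard sort-based algorithm:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (Martins & Astudillo, 2016) -- yields sparse attention weights."""
    z = np.asarray(z, dtype=float)
    zs = np.sort(z)[::-1]                 # sort scores descending
    css = np.cumsum(zs)
    ks = np.arange(1, len(z) + 1)
    k = ks[1 + ks * zs > css][-1]         # size of the support
    tau = (css[k - 1] - 1.0) / k          # threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax([2.0, 1.0, -1.0])           # puts zero weight on the last score
```

The framework in the paper generalizes this by swapping the squared-Euclidean regularizer for structured penalties, giving attention over whole segments.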

papers
regularization
attention
sparsity
january 2018 by arsyed

[1710.06276] Smooth and Sparse Optimal Transport

january 2018 by arsyed

Entropic regularization is quickly emerging as a new standard in optimal transport (OT). It allows the OT computation to be cast as a differentiable and unconstrained convex optimization problem, which can be efficiently solved using the Sinkhorn algorithm. However, the entropy term keeps the transportation plan strictly positive and therefore completely dense, unlike unregularized OT. This lack of sparsity can be problematic for applications where the transportation plan itself is of interest. In this paper, we explore regularizing both the primal and dual original formulations with an arbitrary strongly convex term. We show that this corresponds to relaxing dual and primal constraints with approximate smooth constraints. We show how to incorporate squared 2-norm and group lasso regularizations within that framework, leading to sparse and group-sparse transportation plans. On the theoretical side, we are able to bound the approximation error introduced by smoothing the original primal and dual formulations. Our results suggest that, for the smoothed dual, the approximation error can often be smaller with squared 2-norm regularization than with entropic regularization. We showcase our proposed framework on the task of color transfer.
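For contrast, plain entropic OT via Sinkhorn — the baseline whose strictly positive (dense) plan motivates this paper — can be sketched in a few lines (generic textbook version, not the authors' smoothed framework):

```python
import numpy as np

def sinkhorn(a, b, C, eps=1.0, n_iter=1000):
    """Entropic-regularized OT plan via Sinkhorn scaling.
    a, b: source/target histograms; C: cost matrix; eps: regularization."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

a = np.ones(4) / 4
b = np.ones(4) / 4
C = (np.arange(4)[:, None] - np.arange(4)[None, :]) ** 2.0
P = sinkhorn(a, b, C)
# Every entry of P is strictly positive: the entropic plan is dense,
# which is exactly the drawback the squared 2-norm regularizer avoids.
```

Swapping the entropy term for a squared 2-norm, as the paper proposes, produces plans with exact zeros while keeping the problem smooth.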

papers
optimal-transport
regularization
sparsity
january 2018 by arsyed

The sum of its parts: Reducing sparsity in click estimation with query segments

december 2017 by foodbaby

The critical task of predicting clicks on search advertisements is typically addressed by learning from historical click data. When enough history is observed for a given query-ad pair, future clicks can be accurately modeled. However, based on the empirical distribution of queries, sufficient historical information is unavailable for many query-ad pairs. The sparsity of data for new and rare queries makes it difficult to accurately estimate clicks for a significant portion of typical search engine traffic. In this paper we provide analysis to motivate modeling approaches that can reduce the sparsity of the large space of user search queries. We then propose methods to improve click and relevance models for sponsored search by mining click behavior for partial user queries. We aggregate click history for individual query words, as well as for phrases extracted with a CRF model. The new models show significant improvement in clicks and revenue compared to state-of-the-art baselines trained on several months of query logs. Results are reported on live traffic of a commercial search engine, in addition to results from offline evaluation.

CTR
prediction
sparsity
papers
2011
december 2017 by foodbaby
