sparsity   423


[1603.07632] Statistical inference in sparse high-dimensional additive models
"In this paper we discuss the estimation of a nonparametric component f1 of a nonparametric additive model Y = f1(X1) + ... + fq(Xq) + ϵ. We allow the number q of additive components to grow to infinity and we make sparsity assumptions about the number of nonzero additive components. We compare this estimation problem with that of estimating f1 in the oracle model Z = f1(X1) + ϵ, for which the additive components f2, …, fq are known. We construct a two-step presmoothing-and-resmoothing estimator of f1 and state finite-sample bounds for the difference between our estimator and some smoothing estimators f̂_1^(oracle) in the oracle model. In an asymptotic setting these bounds can be used to show asymptotic equivalence of our estimator and the oracle estimators; the paper thus shows that, asymptotically, under strong enough sparsity conditions, knowledge of f2, …, fq has no effect on estimation accuracy. Our first step is to estimate f1 with an undersmoothed estimator based on near-orthogonal projections with a group Lasso bias correction. We then construct pseudo responses Ŷ by evaluating a debiased modification of our undersmoothed estimator of f1 at the design points. In the second step the smoothing method of the oracle estimator f̂_1^(oracle) is applied to a nonparametric regression problem with responses Ŷ and covariates X1. Our mathematical exposition centers primarily on establishing properties of the presmoothing estimator. We present simulation results demonstrating close-to-oracle performance of our estimator in practical applications."
to:NB  additive_models  statistics  regression  nonparametrics  sparsity 
28 days ago by cshalizi
[1905.13657] Approximate Cross-Validation in High Dimensions with Guarantees
"Leave-one-out cross validation (LOOCV) can be particularly accurate among CV variants for estimating out-of-sample error. Unfortunately, LOOCV requires re-fitting a model N times for a dataset of size N. To avoid this prohibitive computational expense, a number of authors have proposed approximations to LOOCV. These approximations work well when the unknown parameter is of small, fixed dimension but suffer in high dimensions; they incur a running time roughly cubic in the dimension, and, in fact, we show their accuracy significantly deteriorates in high dimensions. We demonstrate that these difficulties can be surmounted in ℓ1-regularized generalized linear models when we assume that the unknown parameter, while high dimensional, has a small support. In particular, we show that, under interpretable conditions, the support of the recovered parameter does not change as each datapoint is left out. This result implies that the previously proposed heuristic of only approximating CV along the support of the recovered parameter has running time and error that scale with the (small) support size even when the full dimension is large. Experiments on synthetic and real data support the accuracy of our approximations."
to:NB  cross-validation  sparsity  high-dimensional_statistics  computational_statistics  statistics 
28 days ago by cshalizi
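--- The "only approximate CV along the recovered support" heuristic rests on the support being stable under leave-one-out, and that premise is easy to sanity-check. A minimal numpy sketch (a plain coordinate-descent lasso, not the paper's actual approximation; the synthetic setup and all names are mine): on an easy sparse problem, the active set does not change as points are left out.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j removed, then soft-threshold
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]          # sparse truth: only 2 active coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

b_full = lasso_cd(X, y, lam=0.1)
support_full = set(np.flatnonzero(np.abs(b_full) > 1e-8))

# spot-check a few leave-one-out refits: does the active set ever change?
stable = all(
    set(np.flatnonzero(np.abs(lasso_cd(np.delete(X, i, 0), np.delete(y, i), 0.1)) > 1e-8))
    == support_full
    for i in range(5)
)
```

With a strong signal and well-separated support this holds easily; the paper's contribution is interpretable conditions under which it holds in general.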
[1909.13189] Learning Sparse Nonparametric DAGs
"We develop a framework for learning sparse nonparametric directed acyclic graphs (DAGs) from data. Our approach is based on a recent algebraic characterization of DAGs that led to the first fully continuous optimization for score-based learning of DAG models parametrized by a linear structural equation model (SEM). We extend this algebraic characterization to nonparametric SEM by leveraging nonparametric sparsity based on partial derivatives, resulting in a continuous optimization problem that can be applied to a variety of nonparametric and semiparametric models including GLMs, additive noise models, and index models as special cases. We also explore the use of neural networks and orthogonal basis expansions to model nonlinearities for general nonparametric models. Extensive empirical study confirms the necessity of nonlinear dependency and the advantage of continuous optimization for score-based learning."
to:NB  causal_discovery  graphical_models  optimization  statistics  ravikumar.pradeep  xing.eric  sparsity  to_read 
7 weeks ago by cshalizi
Testing Sparsity-Inducing Penalties: Journal of Computational and Graphical Statistics: Vol 0, No 0
"Many penalized maximum likelihood estimators correspond to posterior mode estimators under specific prior distributions. Appropriateness of a particular class of penalty functions can therefore be interpreted as the appropriateness of a prior for the parameters. For example, the appropriateness of a lasso penalty for regression coefficients depends on the extent to which the empirical distribution of the regression coefficients resembles a Laplace distribution. We give a testing procedure of whether or not a Laplace prior is appropriate and accordingly, whether or not using a lasso penalized estimate is appropriate. This testing procedure is designed to have power against exponential power priors which correspond to ℓq penalties. Via simulations, we show that this testing procedure achieves the desired level and has enough power to detect violations of the Laplace assumption when the numbers of observations and unknown regression coefficients are large. We then introduce an adaptive procedure that chooses a more appropriate prior and corresponding penalty from the class of exponential power priors when the null hypothesis is rejected. We show that this can improve estimation of the regression coefficients both when they are drawn from an exponential power distribution and when they are drawn from a spike-and-slab distribution. Supplementary materials for this article are available online."

--- I feel like I fundamentally disagree with this approach. Those priors are merely (to quote Jamie Robins and Larry Wasserman) "frequentist pursuit", and have no bearing on whether (say) the Lasso will give a good sparse, linear approximation to the underlying regression function (see …). All of which said, Hoff is always worth listening to, so the last tag applies with special force.
to:NB  model_checking  sparsity  regression  hypothesis_testing  bayesianism  re:phil-of-bayes_paper  hoff.peter  to_besh 
august 2019 by cshalizi
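--- The penalty-as-prior correspondence the abstract starts from can be checked numerically: in a one-parameter sketch (all numbers below are mine), the lasso objective and the negative log-posterior under a Laplace prior differ only by a constant, so they share a minimizer.

```python
import numpy as np

# Single-coefficient model y_i = b + noise. The lasso objective and the
# negative log-posterior under a Laplace(0, scale 1/lam) prior differ only
# by the constant -log(lam/2), so the posterior mode IS the lasso estimate.
rng = np.random.default_rng(1)
y = 0.3 + rng.normal(size=50)
lam = 10.0
grid = np.linspace(-2.0, 2.0, 4001)

rss = 0.5 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)
lasso_obj = rss + lam * np.abs(grid)
neg_log_post = rss - (np.log(lam / 2.0) - lam * np.abs(grid))  # minus log prior

b_lasso = grid[lasso_obj.argmin()]
b_map = grid[neg_log_post.argmin()]

# closed form for this one-dimensional objective: soft-threshold the mean
b_exact = np.sign(y.mean()) * max(abs(y.mean()) - lam / len(y), 0.0)
```

Hoff's point is about testing whether that Laplace form is empirically appropriate; the identity above is only the starting correspondence.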
High-Dimensional Adaptive Minimax Sparse Estimation With Interactions - IEEE Journals & Magazine
"High-dimensional linear regression with interaction effects is broadly applied in research fields such as bioinformatics and social science. In this paper, first, we investigate the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions. Here, we derive matching upper and lower bounds under three types of heredity conditions: strong heredity, weak heredity, and no heredity. From the results: 1) A stronger heredity condition may or may not drastically improve the minimax rate of convergence. In fact, in some situations, the minimax rates of convergence are the same under all three heredity conditions; 2) The minimax rate of convergence is determined by the maximum of the total price of estimating the main effects and that of estimating the interaction effects, which goes beyond purely comparing the order of the number of non-zero main effects r1 and non-zero interaction effects r2; and 3) Under any of the three heredity conditions, the estimation of the interaction terms may be the dominant part in determining the rate of convergence. This is due to either the dominant number of interaction effects over main effects or the higher interaction estimation price induced by a large ambient dimension. Second, we construct an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices r1, r2."
to:NB  statistics  high-dimensional_statistics  regression  sparsity  variable_selection  linear_regression 
august 2019 by cshalizi
[1806.03120] Variational inference for sparse network reconstruction from count data
"In multivariate statistics, the question of finding direct interactions can be formulated as a problem of network inference - or network reconstruction - for which the Gaussian graphical model (GGM) provides a canonical framework. Unfortunately, the Gaussian assumption does not apply to count data which are encountered in domains such as genomics, social sciences or ecology.
"To circumvent this limitation, state-of-the-art approaches use two-step strategies that first transform counts to pseudo Gaussian observations and then apply a (partial) correlation-based approach from the abundant literature of GGM inference. We adopt a different stance by relying on a latent model where we directly model counts by means of Poisson distributions that are conditional to latent (hidden) Gaussian correlated variables. In this multivariate Poisson lognormal-model, the dependency structure is completely captured by the latent layer. This parametric model enables to account for the effects of covariates on the counts.
"To perform network inference, we add a sparsity inducing constraint on the inverse covariance matrix of the latent Gaussian vector. Unlike the usual Gaussian setting, the penalized likelihood is generally not tractable, and we resort instead to a variational approach for approximate likelihood maximization. The corresponding optimization problem is solved by alternating a gradient ascent on the variational parameters and a graphical-Lasso step on the covariance matrix.
"We show that our approach is highly competitive with the existing methods on simulation inspired from microbiological data. We then illustrate on three various data sets how accounting for sampling efforts via offsets and integrating external covariates (which is mostly never done in the existing literature) drastically changes the topology of the inferred network."
to:NB  sparsity  graphical_models  statistics  re:6dfb 
may 2019 by cshalizi
SpecAugment – Keunwoo Choi
"This means we need all the combinations of those features to completely cover the instances of /uh/ sound. If there are large numbers of features, (if N is large), chances are we don’t have them all. Therefore, this {N dimension vector}-to-{number of phonemes} classifier is hard to train. Data sparsity. Maybe the network even has to extrapolate which it sucks at.

By masking, we’re providing the hints in a more separate way. The network can learn to classify with smaller numbers of features. It also breaks the N-dim feature vectors into something smaller. The N features don’t need to co-exist."
speech  data-augmentation  spectrogram  masking  sparsity 
may 2019 by arsyed
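--- The masking being discussed is only a few lines of numpy. A hedged sketch of SpecAugment-style frequency/time masking (function name, parameter names, and defaults are mine, not the paper's published policy):

```python
import numpy as np

def mask_spectrogram(spec, rng, max_f=8, max_t=16, n_masks=2):
    """Zero out random frequency bands and time spans, SpecAugment-style.

    Parameter names and defaults are illustrative, not the paper's.
    """
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_masks):
        f = int(rng.integers(0, max_f + 1))        # band height (may be 0)
        f0 = int(rng.integers(0, n_freq - f + 1))  # band start
        out[f0:f0 + f, :] = 0.0
        t = int(rng.integers(0, max_t + 1))        # span width (may be 0)
        t0 = int(rng.integers(0, n_time - t + 1))  # span start
        out[:, t0:t0 + t] = 0.0
    return out

spec = np.ones((80, 100))  # stand-in for a (freq x time) log-mel spectrogram
masked = mask_spectrogram(spec, np.random.default_rng(0))
```

The point made above is exactly what this enforces: the classifier cannot rely on all features co-existing, since whole bands/spans are routinely absent.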
[1903.04641] Generalized Sparse Additive Models
"We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework."
to:NB  sparsity  regression  additive_models  high-dimensional_statistics  statistics 
april 2019 by cshalizi
Bayes Sparse Regression
In this case study I’ll review how sparsity arises in frequentist and Bayesian analyses and discuss the often subtle challenges in implementing sparsity in practical Bayesian analyses.
bayesian-inference  sparsity  sparse-regression  regression  statistical-modeling  bayes  shrinkage  penalized-regression  lasso 
december 2018 by tarakc02
[1811.01159] Understanding and Comparing Scalable Gaussian Process Regression for Big Data
As a non-parametric Bayesian model which produces informative predictive distributions, the Gaussian process (GP) has been widely used in various fields, such as regression, classification and optimization. The cubic complexity of standard GP inference, however, leads to poor scalability, which poses challenges in the era of big data. Hence, various scalable GPs have been developed in the literature to improve scalability while retaining desirable prediction accuracy. This paper investigates the methodological characteristics and performance of representative global and local scalable GPs, including sparse approximations and local aggregations, from four main perspectives: scalability, capability, controllability and robustness. Numerical experiments on two toy examples and five real-world datasets with up to 250K points offer the following findings. In terms of scalability, most of the scalable GPs have a time complexity that is linear in the training size. In terms of capability, the sparse approximations capture long-term spatial correlations, while the local aggregations capture local patterns but suffer from over-fitting in some scenarios. In terms of controllability, the performance of sparse approximations can be improved simply by increasing the inducing size, but this is not the case for local aggregations. In terms of robustness, local aggregations are robust to various initializations of hyperparameters due to the local attention mechanism. Finally, we highlight that a proper hybrid of global and local scalable GPs may be a promising way to improve both model capability and scalability for big data.
gaussian-processes  sparsity  scaling  big-data 
november 2018 by arsyed
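--- A minimal illustration of the inducing-point idea behind the sparse approximations surveyed above: the Nyström low-rank reconstruction of an RBF Gram matrix from m ≪ N inducing points. This is only a sketch of the kernel approximation, not a full scalable-GP implementation; all sizes and the lengthscale are made up.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel matrix between two point sets."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
m = 40                                   # number of inducing points
Z = X[rng.choice(len(X), size=m, replace=False)]

K = rbf_kernel(X, X)                     # exact N x N Gram matrix
Knm = rbf_kernel(X, Z)
Kmm = rbf_kernel(Z, Z) + 1e-6 * np.eye(m)  # jitter for conditioning
K_nystrom = Knm @ np.linalg.solve(Kmm, Knm.T)  # K ~= K_nm K_mm^{-1} K_mn

rel_err = np.linalg.norm(K - K_nystrom) / np.linalg.norm(K)
```

Working with the m-rank factorization instead of K is what buys the linear-in-N training cost the survey reports, at the price of the approximation error measured here.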
[1712.05630] Sparse principal component analysis via random projections
We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully-selected random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so is not vulnerable to a bad choice of initialisation. Our theory provides great detail on the statistical and computational trade-off in our procedure, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite-sample performance.
pca  sparsity  random-projections 
september 2018 by arsyed
[1712.05630] Sparse principal component analysis via random projections
"We introduce a new method for sparse principal component analysis, based on the aggregation of eigenvector information from carefully-selected random projections of the sample covariance matrix. Unlike most alternative approaches, our algorithm is non-iterative, so is not vulnerable to a bad choice of initialisation. Our theory provides great detail on the statistical and computational trade-off in our procedure, revealing a subtle interplay between the effective sample size and the number of random projections that are required to achieve the minimax optimal rate. Numerical studies provide further insight into the procedure and confirm its highly competitive finite-sample performance."
to:NB  principal_components  sparsity  random_projections  statistics  samworth.richard_j. 
september 2018 by cshalizi
Everything you did and didn't know about PCA · Its Neuronal
"The question then becomes, how do we choose where to truncate? This used to be one of those classic questions with an unsatisfying answer… Basically, eyeball it.

Gavish & Donoho (2014) present a long overdue result on this problem and their answer is surprisingly simple and concrete. Essentially, the optimal[7] procedure boils down to estimating the noise level σ in the dataset, and then throwing away all components whose singular values are below a specified threshold. For a square n × n matrix, this threshold is (4/√3)·√n·σ ≈ 2.309·√n·σ.

There is a similar threshold for non-square datasets explained in the paper. As with any theoretical study, the result comes with a few assumptions and caveats,[8] but their work appears robust and useful in practice.

Edit: Thanks to Jonathan Pillow for pointing out a Bayesian alternative outlined here: Minka (2000). Automatic choice of dimensionality for PCA"
pca  dimensionality-reduction  sparsity  optimization 
july 2018 by arsyed
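--- A small numpy check of the square-matrix rule: hard-thresholding singular values at (4/√3)·√n·σ recovers the rank of a noisy low-rank matrix (the synthetic rank, signal strengths, and seed below are mine).

```python
import numpy as np

def gavish_donoho_rank(A, sigma):
    """Count singular values above the square-matrix hard threshold
    (4/sqrt(3)) * sqrt(n) * sigma from Gavish & Donoho (2014)."""
    n = A.shape[0]
    s = np.linalg.svd(A, compute_uv=False)
    tau = (4.0 / np.sqrt(3.0)) * np.sqrt(n) * sigma
    return int((s > tau).sum())

rng = np.random.default_rng(2)
n, sigma = 100, 1.0
# rank-3 signal with singular values well above the threshold (~23 here)
U, _ = np.linalg.qr(rng.normal(size=(n, 3)))
V, _ = np.linalg.qr(rng.normal(size=(n, 3)))
A = U @ np.diag([100.0, 80.0, 60.0]) @ V.T + sigma * rng.normal(size=(n, n))
```

The noise-only singular values sit near 2σ√n ≈ 20, safely below the ≈23 cutoff, which is why "eyeball it" can be replaced by this one-liner.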
[1705.07704] A Regularized Framework for Sparse and Structured Neural Attention
Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max operator. We show that the gradient of this operator defines a mapping from real values to probabilities, suitable as an attention mechanism. Our framework includes softmax and a slight generalization of the recently-proposed sparsemax as special cases. However, we also show how our framework can incorporate modern structured penalties, resulting in more interpretable attention mechanisms, that focus on entire segments or groups of an input. We derive efficient algorithms to compute the forward and backward passes of our attention mechanisms, enabling their use in a neural network trained with backpropagation. To showcase their potential as a drop-in replacement for existing ones, we evaluate our attention mechanisms on three large-scale tasks: textual entailment, machine translation, and sentence summarization. Our attention mechanisms improve interpretability without sacrificing performance; notably, on textual entailment and summarization, we outperform the standard attention mechanisms based on softmax and sparsemax.
papers  regularization  attention  sparsity 
january 2018 by arsyed
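--- Sparsemax, which the abstract generalizes, is just Euclidean projection onto the probability simplex; unlike softmax it can return exact zeros. A self-contained numpy version of the standard sort-based projection:

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex
    (the sparsemax of Martins & Astudillo, 2016)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)
    # support size: largest k with 1 + k * z_(k) > sum of the top-k entries
    k_max = k[1.0 + k * z_sorted > cssv][-1]
    tau = (cssv[k_max - 1] - 1.0) / k_max
    return np.maximum(z - tau, 0.0)

p_peaked = sparsemax(np.array([2.0, 0.1, -1.0]))  # concentrates, zeros out the rest
p_flat = sparsemax(np.zeros(3))                   # ties give the uniform distribution
```

Attention weights computed this way focus on a few inputs and assign literal zero elsewhere, which is the interpretability angle the paper's structured penalties extend to whole segments.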
[1710.06276] Smooth and Sparse Optimal Transport
Entropic regularization is quickly emerging as a new standard in optimal transport (OT). It enables to cast the OT computation as a differentiable and unconstrained convex optimization problem, which can be efficiently solved using the Sinkhorn algorithm. However, the entropy term keeps the transportation plan strictly positive and therefore completely dense, unlike unregularized OT. This lack of sparsity can be problematic for applications where the transportation plan itself is of interest. In this paper, we explore regularizing both the primal and dual original formulations with an arbitrary strongly convex term. We show that this corresponds to relaxing dual and primal constraints with approximate smooth constraints. We show how to incorporate squared 2-norm and group lasso regularizations within that framework, leading to sparse and group-sparse transportation plans. On the theoretical side, we are able to bound the approximation error introduced by smoothing the original primal and dual formulations. Our results suggest that, for the smoothed dual, the approximation error can often be smaller with squared 2-norm regularization than with entropic regularization. We showcase our proposed framework on the task of color transfer.
papers  optimal-transport  regularization  sparsity 
january 2018 by arsyed
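--- The density complaint about entropic OT is easy to see in code: a bare-bones Sinkhorn (ε and the iteration count below are arbitrary choices of mine) returns a transport plan whose entries are all strictly positive, whereas unregularized OT plans are sparse.

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps=0.05, n_iter=2000):
    """Entropic-regularized OT plan via Sinkhorn matrix scaling."""
    K = np.exp(-C / eps)        # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):     # alternate the two marginal constraints
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

x = np.arange(4, dtype=float)
C = (x[:, None] - x[None, :]) ** 2   # squared-distance cost on a line
a = np.full(4, 0.25)
b = np.full(4, 0.25)
P = sinkhorn_plan(a, b, C)
```

Every entry of P is positive, however small; the paper's squared-2-norm and group-lasso regularizers are designed to give plans with exact zeros instead.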
The sum of its parts: Reducing sparsity in click estimation with query segments
The critical task of predicting clicks on search advertisements is typically addressed by learning from historical click data. When enough history is observed for a given query-ad pair, future clicks can be accurately modeled. However, based on the empirical distribution of queries, sufficient historical information is unavailable for many query-ad pairs. The sparsity of data for new and rare queries makes it difficult to accurately estimate clicks for a significant portion of typical search engine traffic. In this paper we provide analysis to motivate modeling approaches that can reduce the sparsity of the large space of user search queries. We then propose methods to improve click and relevance models for sponsored search by mining click behavior for partial user queries. We aggregate click history for individual query words, as well as for phrases extracted with a CRF model. The new models show significant improvement in clicks and revenue compared to state-of-the-art baselines trained on several months of query logs. Results are reported on live traffic of a commercial search engine, in addition to results from offline evaluation.
CTR  prediction  sparsity  papers  2011 
december 2017 by foodbaby


related tags

2011  accuracy  acm  acmtariat  activation  active-learning  adaptive_behavior  additive_models  advanced  advice  ai  albion  algorithmic-econ  algorithms  andrew_gelman  aphorism  applications  applied  approximation  article  asr  atoms  attention  bandits  bayes  bayesian-inference  bayesian  bayesianism  behavioral-gen  ben-recht  best-practices  big-data  big-list  big-picture  bio  biodet  bioinformatics  bits  boltzmann  books  books:noted  bootstrap  brain  cace  cats  causal_discovery  chicago  chouldechova.alexandra  circuits  clarity  classification  cnn  code  coding-theory  combo-optimization  commentary  communication-complexity  comparison  complexity  compressed-sensing  compressed  compressed_sensing  compression  computational_statistics  computer-vision  concentration-of-measure  concept  confidence_sets  confusion  consistency  convexity-curvature  convnet  coreset  course  courses  cox.d.r.  cross-validation  crypto  ctr  curiosity  curvature  data-augmentation  data-selection  data-structures  debate  deep-learning  definition  dft  differential-privacy  dimensionality-reduction  dimensionality  direction  discovery  discrete-optimization  discrete  distributed_systems  ece285  economics  elastic-net  elegance  embedding  embeddings  empirical_processes  essay  estimation  evaluation  examples  exemplar  expanders  explanans  explanation  exploratory  exposition  extrema  facility-location  feature-selection  features  feedback-loop  fields  fourier  funny:geeky  game-theory  gaussian-processes  gcta  generalization  generative  genetics  genomics  geometry  glm  gowers  gradient-descent  graph  graphdb  graphical-models  graphical_models  graphs  grokkability-clarity  gwas  hamming  hardness  hashing  hastie.trevor  have_read  hierarchical  high-dimension  high-dimensional_statistics  highdim  hilbert_space  hmm  hoff.peter  homepage  hsu  human-capital  hypothesis_testing  ideas  ieee  immunology  in_nb  inner-product  innovation  
interpretability  intuition  ipynbs  iq  isotropy  kernel-methods  kernel_methods  kith_and_kin  ksvd  l0  l1  l2  language-model  language  laplacian  lasso  lda  learning-theory  learning  lecture-notes  lee.ann_b.  levers  levy_processes  libs  linear-algebra  linear-models  linear-programming  linear_regression  linearity  liner-notes  links  list  local-global  logistic-regression  low-hanging  low-rank  low-resource  lower-bounds  machine-learning  machine_learning  madhu-sudan  markov  masking  math  mathematics  mathtariat  matlab  matrix-factorization  measure  measurement  metabuch  methodology  metric-learning  metric-space  microfoundations  mihai  mit  ml  model-class  model-selection  model_checking  monte-carlo  multi  networks  neural-net  neuroscience  nibble  nickl.richard  nlp  nn  no-go  nonlinearity  nonparametric  nonparametrics  norms  notes  novelty  off-convex  offline  omp  online-learning  optimal-transport  optimization  org:bleg  org:edu  org:popup  overflow  p:***  p:**  p:someday  pac  paper  papers  pca  pdf  penalized-regression  phase-transition  photography  pic  pigeonhole-markov  pin2vec  pinterest  polynomials  population-genetics  position-bias  practical  prediction  preprint  presentation  princeton  principal_components  probabilistic-method  probability  proofs  pseudorandomness  python  pytorch  q-n-a  qra  qtl  quixotic  random-matrices  random-projections  random_projections  ranking  ravikumar.pradeep  re:6dfb  re:phil-of-bayes_paper  rectifier  reference  reflection  regression  regularization  relu  representation  representative  research-program  research  rhetoric  rigidity  rigorous-crypto  robust  roots  s:*  sampling  samworth.richard_j.  
sanjeev-arora  sapiens  scaling-up  scaling  scitariat  search  sensing  separation  serving  sgd  shannon  shrinkage  signal-processing  similarity-learning  similarity  simplex  sketching  soft-question  softmax  space-complexity  sparksee  sparse-coding  sparse-filtering  sparse-learning  sparse-regression  sparsemax  spearhead  spectral-methods  spectral  spectrogram  speculation  speech  splines  stanford  state-of-art  statistical-modeling  statistical  statistics  stats  stochastic-processes  stochastic_processes  stories  stream  sublinear  submodular  submodularity  subset-selection  summarization  summary  synthesis  talks  tcs  tcstariat  the-trenches  thesis  thinking  tidbits  tim-roughgarden  time_series  to:nb  to_besh  to_read  toolkit  topic-models  topics  unit  unsupervised  valiant  van_de_geer.sara  variable_selection  video  videos  visual-understanding  volo-avolo  wiki  word-embedding  wormholes  xavier  xing.eric  yoga  🌞  🎩  👳  🔬 
