cshalizi + linear_regression   58

[1509.09169] Lecture notes on ridge regression
"The linear regression model cannot be fitted to high-dimensional data, as the high-dimensionality brings about empirical non-identifiability. Penalized regression overcomes this non-identifiability by augmentation of the loss function by a penalty (i.e. a function of regression coefficients). The ridge penalty is the sum of squared regression coefficients, giving rise to ridge regression. Here many aspects of ridge regression are reviewed, e.g. moments, mean squared error, its equivalence to constrained estimation, and its relation to Bayesian regression. Finally, its behaviour and use are illustrated in simulation and on omics data. Subsequently, ridge regression is generalized to allow for a more general penalty. The ridge penalization framework is then translated to logistic regression and its properties are shown to carry over. To contrast ridge penalized estimation, the final chapter introduces its lasso counterpart."
to:NB  regression  linear_regression  ridge_regression  statistics
24 days ago by cshalizi
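--- A minimal sketch of the ridge estimator summarized in the abstract above, not code from the lecture notes. The simulated data, the helper name `ridge_fit`, and the penalty values are all assumptions chosen for illustration. With p > n the matrix X'X is rank-deficient, so least squares is not identified, while (X'X + lam I) b = X'y has a unique solution for any lam > 0.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: minimizes ||y - X b||^2 + lam * ||b||^2 (hypothetical helper)."""
    p = X.shape[1]
    # Solve the penalized normal equations instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated high-dimensional data (p > n); purely illustrative.
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
y = X @ beta + rng.normal(scale=0.1, size=n)

# X'X has rank at most n < p, so ordinary least squares has no unique solution.
rank_XtX = np.linalg.matrix_rank(X.T @ X)

# The ridge solution is unique, and a larger penalty shrinks it further.
b_small = ridge_fit(X, y, lam=0.1)
b_large = ridge_fit(X, y, lam=100.0)
print(rank_XtX, np.linalg.norm(b_small), np.linalg.norm(b_large))
```

The norm of the ridge solution decreases as the penalty grows, which is the shrinkage behaviour the abstract refers to.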
[1907.01954] An Econometric View of Algorithmic Subsampling
"Datasets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset of rows preserve the features of the original data? This paper reviews a line of work that is grounded in theoretical computer science and numerical linear algebra, and which finds that an algorithmically desirable sketch of the data must have a subspace embedding property. Building on this work, we study how prediction and inference is affected by data sketching within a linear regression setup. The sketching error is small compared to the sample size effect which is within the control of the researcher. As a sketch size that is algorithmically optimal may not be suitable for prediction and inference, we use statistical arguments to provide 'inference conscious' guides to the sketch size. When appropriately implemented, an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample."
to:NB  computational_statistics  statistics  linear_regression  random_projections
4 weeks ago by cshalizi
[1807.11408] Local Linear Forests
"Random forests are a powerful method for non-parametric regression, but are limited in their ability to fit smooth signals, and can show poor predictive performance in the presence of strong, smooth effects. Taking the perspective of random forests as an adaptive kernel method, we pair the forest kernel with a local linear regression adjustment to better capture smoothness. The resulting procedure, local linear forests, enables us to improve on asymptotic rates of convergence for random forests with smooth signals, and provides substantial gains in accuracy on both real and simulated data. We prove a central limit theorem valid under regularity conditions on the forest and smoothness constraints, and propose a computationally efficient construction for confidence intervals. Moving to a causal inference application, we discuss the merits of local regression adjustments for heterogeneous treatment effect estimation, and give an example on a dataset exploring the effect word choice has on attitudes to the social safety net. Last, we include simulation results on real and generated data."
to:NB  linear_regression  ensemble_methods  decision_trees  athey.susan  statistics
7 weeks ago by cshalizi
Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics
"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand. "
to:NB  computational_statistics  linear_regression  regression  databases  lumley.thomas  to_teach:statcomp
7 weeks ago by cshalizi
[1906.01990] A Model-free Approach to Linear Least Squares Regression with Exact Probabilities and Applications to Covariate Selection
"The classical model for linear regression is $\mathbf{Y} = \mathbf{x}\boldsymbol{\beta} + \sigma\boldsymbol{\varepsilon}$ with i.i.d. standard Gaussian errors. Much of the resulting statistical inference is based on Fisher's F-distribution. In this paper we give two approaches to least squares regression which are model free. The results hold for all data $(\mathbf{y}, \mathbf{x})$. The derived probabilities are not only exact, they agree with those using the F-distribution based on the classical model. This is achieved by replacing questions about the size of $\beta_j$, for example $\beta_j = 0$, by questions about the degree to which the covariate $\mathbf{x}_j$ is better than Gaussian white noise or, alternatively, a random orthogonal rotation of $\mathbf{x}_j$. The idea can be extended to choice of covariates, post selection inference PoSI, step-wise choice of covariates, the determination of dependency graphs and to robust regression and non-linear regression. In the latter two cases the probabilities are no longer exact but are based on the chi-squared distribution. The step-wise choice of covariates is of particular interest: it is a very simple, very fast, very powerful, it controls the number of false positives and does not over fit even in the case where the number of covariates far exceeds the sample size"
in_NB  linear_regression  regression  statistics  to_be_shot_after_a_fair_trial  variable_selection
10 weeks ago by cshalizi
Robust Regression on Stationary Time Series: A Self‐Normalized Resampling Approach - Akashi - 2018 - Journal of Time Series Analysis - Wiley Online Library
"This article extends the self‐normalized subsampling method of Bai et al. (2016) to the M‐estimation of linear regression models, where the covariate and the noise are stationary time series which may have long‐range dependence or heavy tails. The method yields an asymptotic confidence region for the unknown coefficients of the linear regression. The determination of these regions does not involve unknown parameters such as the intensity of the dependence or the heaviness of the distributional tail of the time series. Additional simulations can be found in a supplement. The computer codes are available from the authors."
to:NB  time_series  statistics  linear_regression  heavy_tails  long-range_dependence
may 2018 by cshalizi
[1611.05401] Bootstrapping and Sample Splitting For High-Dimensional, Assumption-Free Inference
"Several new methods have been proposed for performing valid inference after model selection. An older method is sample splitting: use part of the data for model selection and part for inference. In this paper we revisit sample splitting combined with the bootstrap (or the Normal approximation). We show that this leads to a simple, assumption-free approach to inference and we establish results on the accuracy of the method. In fact, we find new bounds on the accuracy of the bootstrap and the Normal approximation for general nonlinear parameters with increasing dimension which we then use to assess the accuracy of regression inference. We show that an alternative, called the image bootstrap, has higher coverage accuracy at the cost of more computation. We define new parameters that measure variable importance and that can be inferred with greater accuracy than the usual regression coefficients. There is an inference-prediction tradeoff: splitting increases the accuracy and robustness of inference but can decrease the accuracy of the predictions."
to:NB  heard_the_talk  linear_regression  model_selection  bootstrap  kith_and_kin  wasserman.larry  rinaldo.alessandro  g'sell.max  lei.jing  high-dimensional_statistics  statistics  to_teach:linear_models  post-selection_inference
april 2018 by cshalizi
Sufficient Dimension Reduction via Direct Estimation of the Gradients of Logarithmic Conditional Densities | Neural Computation | MIT Press Journals
"Sufficient dimension reduction (SDR) is aimed at obtaining the low-rank projection matrix in the input space such that information about output data is maximally preserved. Among various approaches to SDR, a promising method is based on the eigendecomposition of the outer product of the gradient of the conditional density of output given input. In this letter, we propose a novel estimator of the gradient of the logarithmic conditional density that directly fits a linear-in-parameter model to the true gradient under the squared loss. Thanks to this simple least-squares formulation, its solution can be computed efficiently in a closed form. Then we develop a new SDR method based on the proposed gradient estimator. We theoretically prove that the proposed gradient estimator, as well as the SDR solution obtained from it, achieves the optimal parametric convergence rate. Finally, we experimentally demonstrate that our SDR method compares favorably with existing approaches in both accuracy and computational efficiency on a variety of artificial and benchmark data sets."
to:NB  dimension_reduction  sufficiency  density_estimation  linear_regression  statistics
january 2018 by cshalizi
The importance of the normality assumption in large public health data sets. - PubMed - NCBI
"It is widely but incorrectly believed that the t-test and linear regression are valid only for Normally distributed outcomes. The t-test and linear regression compare the mean of an outcome variable for different subjects. While these are valid even in very small samples if the outcome variable is Normally distributed, their major usefulness comes from the fact that in large samples they are valid for any distribution. We demonstrate this validity by simulation in extremely non-Normal data. We discuss situations in which other methods such as the Wilcoxon rank sum test and ordinal logistic regression (proportional odds model) have been recommended, and conclude that the t-test and linear regression often provide a convenient and practical alternative. The major limitation on the t-test and linear regression for inference about associations is not a distributional one, but whether detecting and estimating a difference in the mean of the outcome answers the scientific question at hand."
to:NB  linear_regression  statistics  to_teach:linear_models  lumley.thomas
july 2017 by cshalizi
[1611.05923] "Influence Sketching": Finding Influential Samples In Large-Scale Regressions
"There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence General Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware."
to:NB  regression  linear_regression  computational_statistics  random_projections  via:vaguery
december 2016 by cshalizi
[1209.1508] Confidence sets in sparse regression
"The problem of constructing confidence sets in the high-dimensional linear model with n response variables and p parameters, possibly p≥n, is considered. Full honest adaptive inference is possible if the rate of sparse estimation does not exceed n^{-1/4}, otherwise sparse adaptive confidence sets exist only over strict subsets of the parameter spaces for which sparse estimators exist. Necessary and sufficient conditions for the existence of confidence sets that adapt to a fixed sparsity level of the parameter vector are given in terms of minimal ℓ2-separation conditions on the parameter space. The design conditions cover common coherence assumptions used in models for sparsity, including (possibly correlated) sub-Gaussian designs."
to:NB  confidence_sets  regression  high-dimensional_statistics  linear_regression  statistics  sparsity  van_de_geer.sara  nickl.richard
november 2016 by cshalizi
Vector Generalized Linear and Additive Models - With an | Thomas W. Yee | Springer
"This book presents a greatly enlarged statistical framework compared to generalized linear models (GLMs) with which to approach regression modelling. Comprising about half a dozen major classes of statistical models, and fortified with necessary infrastructure to make the models more fully operable, the framework allows analyses based on many semi-traditional applied statistics models to be performed as a coherent whole.
"Since their advent in 1972, GLMs have unified important distributions under a single umbrella with enormous implications. However, GLMs are not flexible enough to cope with the demands of practical data analysis. And data-driven GLMs, in the form of generalized additive models (GAMs), are also largely confined to the exponential family. The methodology here and accompanying software (the extensive VGAM R package) are directed at these limitations and are described comprehensively for the first time in one volume. This book treats distributions and classical models as generalized regression models, and the result is a much broader application base for GLMs and GAMs.
"The book can be used in senior undergraduate or first-year postgraduate courses on GLMs or categorical data analysis and as a methodology resource for VGAM users. In the second part of the book, the R package VGAM allows readers to grasp immediately applications of the methodology. R code is integrated in the text, and datasets are used throughout. Potential applications include ecology, finance, biostatistics, and social sciences. The methodological contribution of this book stands alone and does not require use of the VGAM package."

--- Hopefully this means the VGAM package is less user-hostile than it was...
october 2015 by cshalizi
Bhatia, R.: Positive Definite Matrices (eBook and Paperback).
"This book represents the first synthesis of the considerable body of new research into positive definite matrices. These matrices play the same role in noncommutative analysis as positive real numbers do in classical analysis. They have theoretical and computational uses across a broad spectrum of disciplines, including calculus, electrical engineering, statistics, physics, numerical analysis, quantum information theory, and geometry. Through detailed explanations and an authoritative and inspiring writing style, Rajendra Bhatia carefully develops general techniques that have wide applications in the study of such matrices.
"Bhatia introduces several key topics in functional analysis, operator theory, harmonic analysis, and differential geometry--all built around the central theme of positive definite matrices. He discusses positive and completely positive linear maps, and presents major theorems with simple and direct proofs. He examines matrix means and their applications, and shows how to use positive definite functions to derive operator inequalities that he and others proved in recent years. He guides the reader through the differential geometry of the manifold of positive definite matrices, and explains recent work on the geometric mean of several matrices."
to:NB  books:noted  mathematics  algebra  linear_regression  re:g_paper  statistics
september 2015 by cshalizi
[1507.01173] Model Diagnostics Based on Cumulative Residuals: The R-package gof
"The generalized linear model is widely used in all areas of applied statistics and while correct asymptotic inference can be achieved under misspecification of the distributional assumptions, a correctly specified mean structure is crucial to obtain interpretable results. Usually the linearity and functional form of predictors are checked by inspecting various scatter plots of the residuals, however, the subjective task of judging these can be challenging. In this paper we present an implementation of model diagnostics for the generalized linear model as well as structural equation models, based on aggregates of the residuals where the asymptotic behavior under the null is imitated by simulations. A procedure for checking the proportional hazard assumption in the Cox regression is also implemented."
to:NB  model_checking  regression  linear_regression  statistics  to_teach:linear_models
august 2015 by cshalizi
[1503.06426] High-dimensional inference in misspecified linear models
"We consider high-dimensional inference when the assumed linear model is misspecified. We describe some correct interpretations and corresponding sufficient assumptions for valid asymptotic inference of the model parameters, which still have a useful meaning when the model is misspecified. We largely focus on the de-sparsified Lasso procedure but we also indicate some implications for (multiple) sample splitting techniques. In view of available methods and software, our results contribute to robustness considerations with respect to model misspecification."
to:NB  statistics  linear_regression  high-dimensional_statistics  lasso  buhlmann.peter  van_de_geer.sara
may 2015 by cshalizi
On the Interpretation of Instrumental Variables in the Presence of Specification Errors
"The method of instrumental variables (IV) and the generalized method of moments (GMM), and their applications to the estimation of errors-in-variables and simultaneous equations models in econometrics, require data on a sufficient number of instrumental variables that are both exogenous and relevant. We argue that, in general, such instruments (weak or strong) cannot exist."

--- I think they are too quick to dismiss non-parametric IV; if what one wants is consistent estimates of the partial derivatives at a given point, you _can_ get that by (e.g.) splines or locally linear regression. Need to think through this in terms of Pearl's graphical definition of IVs.
february 2015 by cshalizi
[1404.1578] Models as Approximations: How Random Predictors and Model Violations Invalidate Classical Inference in Regression
"We review and interpret the early insights of Halbert White who over thirty years ago inaugurated a form of statistical inference for regression models that is asymptotically correct even under "model misspecification," that is, under the assumption that models are approximations rather than generative truths. This form of inference, which is pervasive in econometrics, relies on the "sandwich estimator" of standard error. Whereas linear models theory in statistics assumes models to be true and predictors to be fixed, White's theory permits models to be approximate and predictors to be random. Careful reading of his work shows that the deepest consequences for statistical inference arise from a synergy --- a "conspiracy" --- of nonlinearity and randomness of the predictors which invalidates the ancillarity argument that justifies conditioning on the predictors when they are random. Unlike the standard error of linear models theory, the sandwich estimator provides asymptotically correct inference in the presence of both nonlinearity and heteroskedasticity. An asymptotic comparison of the two types of standard error shows that discrepancies between them can be of arbitrary magnitude. If there exist discrepancies, standard errors from linear models theory are usually too liberal even though occasionally they can be too conservative as well. A valid alternative to the sandwich estimator is provided by the "pairs bootstrap"; in fact, the sandwich estimator can be shown to be a limiting case of the pairs bootstrap. We conclude by giving meaning to regression slopes when the linear model is an approximation rather than a truth. --- In this review we limit ourselves to linear least squares regression, but many qualitative insights hold for most forms of regression."

--- Very close to what I teach in my class, though I haven't really talked about sandwich variances.
in_NB  have_read  statistics  regression  linear_regression  bootstrap  misspecification  estimation  approximation
february 2015 by cshalizi
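--- A minimal sketch of the contrast described in the abstract above between the classical standard error and the sandwich standard error, not the authors' code. It uses the simplest sandwich form, (X'X)^{-1} X' diag(e^2) X (X'X)^{-1} (often called HC0); the helper name, the simulated design, and the noise model are assumptions. The noise standard deviation is set to |x|, so the error variance is largest where the predictor has most leverage and the classical formula understates the slope's uncertainty.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """OLS fit with classical and HC0 sandwich standard errors (hypothetical helper)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    # Classical variance: sigma^2 (X'X)^{-1}, assuming a constant error variance.
    sigma2 = resid @ resid / (n - p)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))
    # Sandwich variance: (X'X)^{-1} [X' diag(e_i^2) X] (X'X)^{-1}.
    meat = (X * resid[:, None] ** 2).T @ X
    se_sandwich = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta_hat, se_classical, se_sandwich

# Simulated heteroskedastic data; purely illustrative.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)

beta_hat, se_classical, se_sandwich = ols_with_standard_errors(X, y)
print(beta_hat, se_classical, se_sandwich)
```

In this setup the sandwich standard error for the slope comes out larger than the classical one, an instance of the classical error being "too liberal".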
[1406.5986] A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares
"We consider statistical aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input X and Y, and instead it has assumed that the data (X,Y) are fixed and worst-case. In this paper, we adopt a statistical perspective, and we consider the mean-squared error performance of randomized sketching algorithms, when data (X,Y) are generated according to a statistical linear model Y=Xβ+ϵ, where ϵ is a noise process. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (SPE) and the statistical residual efficiency (SRE) of the sketched LS estimator; and we use our framework to provide results for several types of random projection and random sampling sketching algorithms. Among other results, we show that the SRE can be bounded when p≲r≪n but that the SPE typically requires the sample size r to be substantially larger. Our theoretical results reveal that, depending on the specifics of the situation, leverage-based sampling methods can perform as well as or better than projection methods. Our empirical results reveal that when r is only slightly greater than p and much less than n, projection-based methods out-perform sampling-based methods, but as r grows, sampling methods start to out-perform projection methods."
to:NB  computational_statistics  regression  linear_regression  random_projections  statistics
january 2015 by cshalizi
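--- A minimal sketch of a sketched least-squares estimator of the kind the abstract above analyzes, not the paper's code. It uses a dense Gaussian random projection only; leverage-based sampling, which the abstract also discusses, is not shown. The helper name `sketched_ols`, the sizes n, p, r, and the simulated data are assumptions.

```python
import numpy as np

def sketched_ols(X, y, r, rng):
    """Least squares on a Gaussian sketch (S X, S y) with r rows (hypothetical helper)."""
    n = X.shape[0]
    # S has i.i.d. N(0, 1/r) entries, so E[S'S] is the n x n identity.
    S = rng.normal(scale=1.0 / np.sqrt(r), size=(r, n))
    beta_s, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta_s

# Simulated data from Y = X beta + eps with p << r << n; purely illustrative.
rng = np.random.default_rng(2)
n, p, r = 2000, 5, 200
X = rng.normal(size=(n, p))
beta = np.arange(1.0, p + 1.0)
y = X @ beta + rng.normal(size=n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = sketched_ols(X, y, r, rng)

# Compare residual sums of squares on the full data.
rss_full = np.sum((y - X @ beta_full) ** 2)
rss_sketch = np.sum((y - X @ beta_sketch) ** 2)
print(rss_full, rss_sketch, rss_sketch / rss_full)
```

The ratio of the two residual sums of squares is in the spirit of the residual-efficiency comparison the abstract describes; the paper's exact definitions of SRE and SPE should be taken from the paper itself.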
[1412.3730] Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It
"We empirically show that Bayesian inference can be inconsistent under misspecification in simple linear regression problems, both in a model averaging/selection and in a Bayesian ridge regression setting. We use the standard linear model, which assumes homoskedasticity, whereas the data are heteroskedastic, and observe that the posterior puts its mass on ever more high-dimensional models as the sample size increases. To remedy the problem, we equip the likelihood in Bayes' theorem with an exponent called the learning rate, and we propose the Safe Bayesian method to learn the learning rate from the data. SafeBayes tends to select small learning rates as soon as the standard posterior is not 'cumulatively concentrated', and its results on our data are quite encouraging."
in_NB  to_read  linear_regression  bayesianism  bayesian_consistency  misspecification  statistics  grunwald.peter
december 2014 by cshalizi
Political Language in Economics
"Does political ideology influence economic research? We rely upon purely inductive methods in natural language processing and machine learning to examine patterns of implicit political ideology in economic articles. Using observed political behavior of economists and the phrases from their academic articles, we construct a high-dimensional predictor of political ideology by article, economist, school, and journal. In addition to field, journal, and editor ideology, we look at the correlation of author ideology with magnitudes of reported policy relevant elasticities. Overall our results suggest that there is substantial sorting by ideology into fields, departments, and methodologies, and that political ideology influences the results of economic research."
in_NB  economics  ideology  political_economy  text_mining  naidu.suresh  jelveh.zubin  to:blog  topic_models  linear_regression  to_teach:data-mining
december 2014 by cshalizi
Nickl, van de Geer: Confidence sets in sparse regression
"The problem of constructing confidence sets in the high-dimensional linear model with n response variables and p parameters, possibly p≥n, is considered. Full honest adaptive inference is possible if the rate of sparse estimation does not exceed n^{-1/4}, otherwise sparse adaptive confidence sets exist only over strict subsets of the parameter spaces for which sparse estimators exist. Necessary and sufficient conditions for the existence of confidence sets that adapt to a fixed sparsity level of the parameter vector are given in terms of minimal ℓ2-separation conditions on the parameter space. The design conditions cover common coherence assumptions used in models for sparsity, including (possibly correlated) sub-Gaussian designs."

--- Ungated version: http://arxiv.org/abs/1209.1508
in_NB  high-dimensional_statistics  sparsity  confidence_sets  regression  statistics  van_de_geer.sara  nickl.richard  lasso  model_selection  linear_regression
february 2014 by cshalizi
[1112.3450] The sparse Laplacian shrinkage estimator for high-dimensional regression
"We propose a new penalized method for variable selection and estimation that explicitly incorporates the correlation patterns among predictors. This method is based on a combination of the minimax concave penalty and Laplacian quadratic associated with a graph as the penalty function. We call it the sparse Laplacian shrinkage (SLS) method. The SLS uses the minimax concave penalty for encouraging sparsity and Laplacian quadratic penalty for promoting smoothness among coefficients associated with the correlated predictors. The SLS has a generalized grouping property with respect to the graph represented by the Laplacian quadratic. We show that the SLS possesses an oracle property in the sense that it is selection consistent and equal to the oracle Laplacian shrinkage estimator with high probability. This result holds in sparse, high-dimensional settings with p >> n under reasonable conditions. We derive a coordinate descent algorithm for computing the SLS estimates. Simulation studies are conducted to evaluate the performance of the SLS method and a real data example is used to illustrate its application."
to:NB  have_read  regression  linear_regression  high-dimensional_statistics  statistics
september 2013 by cshalizi
[1308.2408] Group Lasso for generalized linear models in high dimension
"Nowadays an increasing amount of data is available and we have to deal with models in high dimension (number of covariates much larger than the sample size). Under sparsity assumption it is reasonable to hope that we can make a good estimation of the regression parameter. This sparsity assumption as well as a block structuration of the covariates into groups with similar modes of behavior is for example quite natural in genomics. A huge amount of scientific literature exists for Gaussian linear models including the Lasso estimator and also the Group Lasso estimator which promotes group sparsity under an a priori knowledge of the groups. We extend this Group Lasso procedure to generalized linear models and we study the properties of this estimator for sparse high-dimensional generalized linear models to find convergence rates. We provide oracle inequalities for the prediction and estimation error under assumptions on the joint distribution of the pair observable covariables and under a condition on the design matrix. We show the ability of this estimator to recover good sparse approximation of the true model. At last we extend these results to the case of an Elastic net penalty and we apply them to the so-called Poisson regression case which has not been studied in this context contrary to the logistic regression."

--- Isn't this already done in Buhlmann and van de Geer's book?
to:NB  linear_regression  regression  sparsity  statistics  high-dimensional_statistics
september 2013 by cshalizi
Let's Put Garbage-Can Regressions and Garbage-Can Probits Where They Belong
"Many social scientists believe that dumping long lists of explanatory variables into linear regression, probit, logit, and other statistical equations will successfully “control” for the effects of auxiliary factors. Encouraged by convenient software and ever more powerful computing, researchers also believe that this conventional approach gives the true explanatory variables the best chance to emerge. The present paper argues that these beliefs are false, and that statistical models with more than a few independent variables are likely to be inaccurate. Instead, a quite different research methodology is needed, one that integrates contemporary powerful statistical methods with classic data-analytic techniques of creative engagement with the data."
september 2013 by cshalizi
[1307.7963] Efficient variational inference for generalized linear mixed models with large datasets
"The article develops a hybrid Variational Bayes algorithm that combines the mean-field and fixed-form Variational Bayes methods. The new estimation algorithm can be used to approximate any posterior without relying on conjugate priors. We propose a divide and recombine strategy for the analysis of large datasets, which partitions a large dataset into smaller pieces and then combines the variational distributions that have been learnt in parallel on each separate piece using the hybrid Variational Bayes algorithm. The proposed method is applied to fitting generalized linear mixed models. The computational efficiency of the parallel and hybrid Variational Bayes algorithm is demonstrated on several simulated and real datasets."
to:NB  computational_statistics  linear_regression  regression  statistics  estimation  variational_inference
august 2013 by cshalizi
Linear Models: A Useful “Microscope” for Causal Analysis : Journal of Causal Inference
"This note reviews basic techniques of linear path analysis and demonstrates, using simple examples, how causal phenomena of non-trivial character can be understood, exemplified and analyzed using diagrams and a few algebraic steps. The techniques allow for swift assessment of how various features of the model impact the phenomenon under investigation. This includes: Simpson’s paradox, case–control bias, selection bias, missing data, collider bias, reverse regression, bias amplification, near instruments, and measurement errors."
june 2013 by cshalizi
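One of the phenomena listed in the path-analysis note above, collider bias, can be checked numerically. This is a minimal sketch under an assumed toy model that is not taken from the paper: X and Y are independent standard normals and Z = X + Y + noise. With those unit variances, a couple of lines of algebra give a population coefficient on X of -1/2 once Z enters the regression, even though X has no effect on Y. The variable names and the helper `ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed toy linear model: X and Y are independent causes of Z.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)   # Z is a collider on the path X -> Z <- Y

def ols(design, response):
    """Least-squares coefficients; the intercept is the first entry."""
    design = np.column_stack([np.ones(len(response)), design])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

slope_marginal = ols(x, y)[1]                         # regress Y on X
slope_given_z = ols(np.column_stack([x, z]), y)[1]    # regress Y on X and Z

print(f"Y ~ X     : slope on X = {slope_marginal:+.3f}")   # close to 0
print(f"Y ~ X + Z : slope on X = {slope_given_z:+.3f}")    # close to -0.5
```

Conditioning on the common effect Z manufactures a dependence between X and Y that is absent marginally.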
Berk , Brown , Buja , Zhang , Zhao : Valid post-selection inference
"It is common practice in statistical data analysis to perform data-driven variable selection and derive statistical inference from the resulting model. Such inference enjoys none of the guarantees that classical statistical theory provides for tests and confidence intervals when the model has been chosen a priori. We propose to produce valid “post-selection inference” by reducing the problem to one of simultaneous inference and hence suitably widening conventional confidence and retention intervals. Simultaneity is required for all linear functions that arise as coefficient estimates in all submodels. By purchasing “simultaneity insurance” for all possible submodels, the resulting post-selection inference is rendered universally valid under all possible model selection procedures. This inference is therefore generally conservative for particular selection procedures, but it is always less conservative than full Scheffé protection. Importantly it does not depend on the truth of the selected submodel, and hence it produces valid inference even in wrong models. We describe the structure of the simultaneous inference problem and give some asymptotic results."

--- I find this abstract very puzzling given some of the strong negative results on post-selection inference....
model_selection  to_read  linear_regression  regression  statistics  confidence_sets  in_NB  post-selection_inference
may 2013 by cshalizi
Müller , Scealy , Welsh : Model Selection in Linear Mixed Models
"Linear mixed effects models are highly flexible in handling a broad range of data types and are therefore widely used in applications. A key part in the analysis of data is model selection, which often aims to choose a parsimonious model with other desirable properties from a possibly very large set of candidate statistical models. Over the last 5–10 years the literature on model selection in linear mixed models has grown extremely rapidly. The problem is much more complicated than in linear regression because selection on the covariance structure is not straightforward due to computational issues and boundary problems arising from positive semidefinite constraints on covariance matrices. To obtain a better understanding of the available methods, their properties and the relationships between them, we review a large body of literature on linear mixed model selection. We arrange, implement, discuss and compare model selection methods based on four major approaches: information criteria such as AIC or BIC, shrinkage methods based on penalized loss functions such as LASSO, the Fence procedure and Bayesian techniques."
to:NB  variance_estimation  hierarchical_statistical_models  model_selection  regression  linear_regression
may 2013 by cshalizi
Leeb : On the conditional distributions of low-dimensional projections from high-dimensional data
"We study the conditional distribution of low-dimensional projections from high-dimensional data, where the conditioning is on other low-dimensional projections. To fix ideas, consider a random d-vector Z that has a Lebesgue density and that is standardized so that 𝔼Z=0 and 𝔼ZZ′=Id. Moreover, consider two projections defined by unit-vectors α and β, namely a response y=α′Z and an explanatory variable x=β′Z. It has long been known that the conditional mean of y given x is approximately linear in x, under some regularity conditions; cf. Hall and Li [Ann. Statist. 21 (1993) 867–889]. However, a corresponding result for the conditional variance has not been available so far. We here show that the conditional variance of y given x is approximately constant in x (again, under some regularity conditions). These results hold uniformly in α and for most β’s, provided only that the dimension of Z is large. In that sense, we see that most linear submodels of a high-dimensional overall model are approximately correct. Our findings provide new insights in a variety of modeling scenarios. We discuss several examples, including sliced inverse regression, sliced average variance estimation, generalized linear models under potential link violation, and sparse linear modeling."

Free version: http://arxiv.org/abs/1304.5943
to:NB  regression  linear_regression  random_projections  high-dimensional_probability  re:what_is_the_right_null_model_for_linear_regression  leeb.hannes
april 2013 by cshalizi
[1303.7092] Pivotal uniform inference in high-dimensional regression with random design in wide classes of models via linear programming
"We propose a new method of estimation in high-dimensional linear regression model. It allows for very weak distributional assumptions including heteroscedasticity, and does not require the knowledge of the variance of random errors. The method is based on linear programming only, so that its numerical implementation is faster than for previously known techniques using conic programs, and it allows one to deal with higher dimensional models. We provide upper bounds for estimation and prediction errors of the proposed estimator showing that it achieves the same rate as in the more restrictive situation of fixed design and i.i.d. Gaussian errors with known variance. Following Gautier and Tsybakov (2011), we obtain the results under weaker sensitivity assumptions than the restricted eigenvalue or assimilated conditions."
to:NB  regression  linear_regression  optimization  high-dimensional_statistics  statistics
march 2013 by cshalizi
[1302.5831] On Testing Independence and Goodness-of-fit in Linear Models
"We consider a linear regression model and propose an omnibus test to simultaneously check the assumption of independence between the error and the predictor variables, and the goodness-of-fit of the parametric model. Our approach is based on testing for independence between the residual and the predictor using the recently developed Hilbert-Schmidt independence criterion, see Gretton et al. (2008). The proposed method requires no user-defined regularization, is simple to compute, based merely on pairwise distances between points in the sample, and is consistent against all alternatives. We develop the distribution theory of the proposed test-statistic, both under the null and the alternative hypotheses, and devise a bootstrap scheme to approximate its null distribution. We prove the consistency of the bootstrap procedure. The superior finite sample performance of our procedure is illustrated through a simulation study."
march 2013 by cshalizi
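The idea in the abstract above (test for dependence between OLS residuals and the predictor with HSIC) can be sketched roughly as follows. Several choices here are assumptions rather than the authors' procedure: the Gaussian kernel with a median-distance bandwidth, the biased HSIC estimate trace(KHLH)/n^2, and especially the permutation reference distribution, which stands in for the paper's bootstrap and ignores the fact that the residuals are estimated. The function names are hypothetical.

```python
import numpy as np

def centered_gram(v):
    """Centered Gaussian-kernel Gram matrix; median-distance bandwidth is an assumption."""
    d2 = (v[:, None] - v[None, :]) ** 2
    gram = np.exp(-d2 / np.median(d2[d2 > 0]))
    h = np.eye(len(v)) - 1.0 / len(v)          # centering matrix I - 11'/n
    return h @ gram @ h

def residual_hsic_test(x, y, n_perm=200, seed=0):
    """HSIC between x and the OLS residuals of y on x, with a permutation p-value."""
    rng = np.random.default_rng(seed)
    n = len(x)
    design = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    kx, kr = centered_gram(x), centered_gram(resid)
    stat = np.sum(kx * kr) / n**2              # equals trace(K H L H) / n^2
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(n)              # break any x-residual dependence
        null.append(np.sum(kx * kr[np.ix_(perm, perm)]) / n**2)
    p_value = (1 + sum(s >= stat for s in null)) / (1 + n_perm)
    return stat, p_value

rng = np.random.default_rng(1)
x = rng.normal(size=200)
noise = rng.normal(scale=0.5, size=200)
stat_ok, p_ok = residual_hsic_test(x, 1.0 + 2.0 * x + noise)   # linear model is correct
stat_bad, p_bad = residual_hsic_test(x, x**2 + noise)          # linear model is wrong
print(stat_ok, p_ok)
print(stat_bad, p_bad)
```

When the linear fit is misspecified, the residuals still carry structure in x, so the statistic is much larger than under the permuted reference draws.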
Loh , Wainwright : High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity
"Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependence, as well. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able to both analyze the statistical error associated with any global optimum, and more surprisingly, to prove that a simple algorithm based on projected gradient descent will converge in polynomial time to a small neighborhood of the set of all global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing close agreement with the predicted scalings."

Ungated; http://arxiv.org/abs/1109.3714
in_NB  to_read  regression  linear_regression  sparsity  high-dimensional_statistics  optimization  wainwright.martin_j.  statistics
september 2012 by cshalizi
The Duck of Minerva: Professionalization and the Poverty of IR Theory
"Having spent far too many years on my Department's admissions committee--which I currently chair--I have to agree with part of PM's response: it is simply now a fact of life that prior mathematical and statistical training improves one's chances of getting into most of the first- and second-tier IR programs in the United States. But that, as PM also notes, begs the "should it be this way?" question.
"My sense is that over-professionalization of graduate students is an enormous threat to the vibrancy and innovativeness of International Relations (IR). I am far from alone in this assessment. But I think the structural pressures for over-professionalization are awfully powerful; in conjunction with the triumph of behavioralism (or what PTJ reconstructs as neo-positivism), this means that "theory testing" via large-n regression analysis will only grow in dominance over time. ...
"I am not claiming that neopositivist work is "bad" or making substantive claims about the merits of statistical work. I do believe that general-linear-reality (GLR) approaches -- both qualitative and quantitative -- are overused at the expense of non-GLR frameworks--again, both qualitative and quantitative. I am also concerned with the general devaluation of singular-causal analysis...
"What I am claiming is this: that the conjunction of over-professionalization, GLR-style statistical work, and environmental factors is diminishing the overall quality of theorization, circumscribing the audience for good theoretical work, and otherwise working in the direction of impoverishing IR theory."
september 2012 by cshalizi
Transcending General Linear Reality (Abbott, 1988)
"This paper argues that the dominance of linear models has led many sociologists to construe the social world in terms of a "general linear reality." This reality assumes (1) that the social world consists of fixed entities with variable attributes, (2) that cause cannot flow from "small" to "large" attributes/events, (3) that causal attributes have only one causal pattern at once, (4) that the sequence of events does not influence their outcome, (5) that the "careers" of entities are largely independent, and (6) that causal attributes are generally independent of each other. The paper discusses examples of these assumptions in empirical work, considers standard and new methods addressing them, and briefly explores alternative models for reality that employ demographic, sequential, and network perspectives."
august 2012 by cshalizi
Mahoney: Randomized Algorithms for Matrices and Data
"Randomized algorithms for very large matrix problems have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis, largely since matrices are popular structures with which to model data drawn from a wide range of application domains, and this work was performed by individuals from many different research communities. While the most obvious benefit of randomization is that it can lead to faster algorithms, either in worst-case asymptotic theory and/or numerical implementation, there are numerous other benefits that are at least as important. For example, the use of randomization can lead to simpler algorithms that are easier to analyze or reason about when applied in counterintuitive settings; it can lead to algorithms with more interpretable output, which is of interest in applications where analyst time rather than just computational time is of interest; it can lead implicitly to regularization and more robust output; and randomized algorithms can often be organized to exploit modern computational architectures better than classical numerical methods.

"This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis. Throughout this review, an emphasis will be placed on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data applications. Crucial in this context is the connection with the concept of statistical leverage. This concept has long been used in statistical regression diagnostics to identify outliers; and it has recently proved crucial in the development of improved worst-case matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists. This connection arises naturally when one explicitly decouples the effect of randomization in these matrix algorithms from the underlying linear algebraic structure. This decoupling also permits much finer control in the application of randomization, as well as the easier exploitation of domain knowledge.

"Most of the review will focus on random sampling algorithms and random projection algorithms for versions of the linear least-squares problem and the low-rank matrix approximation problem. These two problems are fundamental in theory and ubiquitous in practice. Randomized methods solve these problems by constructing and operating on a randomized sketch of the input matrix A — for random sampling methods, the sketch consists of a small number of carefully-sampled and rescaled columns/rows of A, while for random projection methods, the sketch consists of a small number of linear combinations of the columns/rows of A. Depending on the specifics of the situation, when compared with the best previously-existing deterministic algorithms, the resulting randomized algorithms have worst-case running time that is asymptotically faster; their numerical implementations are faster in terms of clock-time; or they can be implemented in parallel computing environments where existing numerical algorithms fail to run at all. Numerous examples illustrating these observations will be described in detail."
to:NB  data_analysis  linear_regression  computational_complexity  computational_statistics  random_projections
january 2012 by cshalizi
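The two sketching schemes described in the Mahoney abstract above, random projection and leverage-based random sampling for least squares, can be illustrated on simulated data. This is a minimal sketch under my own assumptions: a dense Gaussian projection (other projections exist), exact leverage scores computed from a thin QR factorization (which defeats the speed-up in practice but keeps the example short), and illustrative sizes `n`, `d`, `m`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5_000, 10, 500           # rows, columns, sketch size (all illustrative)

A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + rng.normal(size=n)

def lstsq(M, v):
    return np.linalg.lstsq(M, v, rcond=None)[0]

beta_full = lstsq(A, b)            # full-data least squares, for comparison

# Random projection: each sketch row is a random linear combination of all rows.
S = rng.normal(size=(m, n)) / np.sqrt(m)
beta_proj = lstsq(S @ A, S @ b)

# Random sampling: draw rows with probability proportional to statistical
# leverage, then rescale each sampled row by 1 / sqrt(m * p_i).
Q, _ = np.linalg.qr(A)             # thin QR; leverage_i = squared norm of row i of Q
leverage = np.sum(Q**2, axis=1)
p = leverage / leverage.sum()
idx = rng.choice(n, size=m, replace=True, p=p)
w = 1.0 / np.sqrt(m * p[idx])
beta_samp = lstsq(A[idx] * w[:, None], b[idx] * w)

def rss(beta):
    """Residual sum of squares on the full data."""
    return np.sum((b - A @ beta) ** 2)

print("RSS ratio, projection:", rss(beta_proj) / rss(beta_full))
print("RSS ratio, sampling:  ", rss(beta_samp) / rss(beta_full))
```

Both ratios are at least 1 by construction, since `beta_full` minimizes the full-data residual sum of squares; how close they get to 1 depends on the sketch size `m`.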
Audibert , Catoni : Robust linear least squares regression
"We consider the problem of robustly predicting as well as the best linear combination of d given functions in least squares regression, and variants of this problem including constraints on the parameters of the linear combination. For the ridge estimator and the ordinary least squares estimator, and their variants, we provide new risk bounds of order d/n without logarithmic factor unlike some standard results, where n is the size of the training data. We also provide a new estimator with better deviations in the presence of heavy-tailed noise. It is based on truncating differences of losses in a min–max framework and satisfies a d/n risk bound both in expectation and in deviations. The key common surprising factor of these results is the absence of exponential moment condition on the output distribution while achieving exponential deviations. All risk bounds are obtained through a PAC-Bayesian analysis on truncated differences of losses. Experimental results strongly back up our truncated min–max estimator."
in_NB  regression  statistics  linear_regression  learning_theory  catoni.olivier
december 2011 by cshalizi
Lehmann: On the history and use of some standard statistical models
"This paper tries to tell the story of the general linear model, which saw the light of day 200 years ago, and the assumptions underlying it. We distinguish three principal stages (ignoring earlier more isolated instances). The model was first proposed in the context of astronomical and geodesic observations, where the main source of variation was observational error. This was the main use of the model during the 19th century.

In the 1920’s it was developed in a new direction by R.A. Fisher whose principal applications were in agriculture and biology. Finally, beginning in the 1930’s and 40’s it became an important tool for the social sciences. As new areas of applications were added, the assumptions underlying the model tended to become more questionable, and the resulting statistical techniques more prone to misuse."
december 2009 by cshalizi
36-707: Regression Analysis, Fall 2007
Larry's class notes on regression. He finished with standard linear regression about 1/3 of the way through the semester, which seems to me to be about the amount of time the subject warrants, and then went to town...
kith_and_kin  regression  linear_regression  nonparametrics  causal_inference  graphical_models  support-vector_machines  statistics  wasserman.larry
february 2009 by cshalizi
[0901.3202] Model-Consistent Sparse Estimation through the Bootstrap
"if we run the Lasso for several bootstrapped replications of a given sample, then intersecting the supports of the Lasso bootstrap estimates leads to consistent model selection"
lasso  linear_regression  model_selection  variable_selection  bootstrap
january 2009 by cshalizi
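The procedure quoted in the entry above (run the Lasso on bootstrap resamples and intersect the supports) is short enough to sketch. Assumptions on my part: a plain coordinate-descent Lasso written in NumPy rather than any particular library solver, a single fixed penalty `lam` chosen by hand, 32 bootstrap replications, and simulated data in which only the first two columns matter. The function names are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]             # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]             # put the updated coordinate back
    return beta

def bootstrap_lasso_support(X, y, lam, n_boot=32, seed=0):
    """Intersect the Lasso supports over bootstrap resamples of the rows."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    support = np.ones(p, dtype=bool)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample rows with replacement
        Xb, yb = X[idx], y[idx]
        beta = lasso_cd(Xb - Xb.mean(axis=0), yb - yb.mean(), lam)
        support &= beta != 0                       # keep only always-selected columns
    return np.flatnonzero(support)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # only columns 0 and 1 matter
selected = bootstrap_lasso_support(X, y, lam=0.2)
print(selected)
```

A noise column can slip into any single Lasso fit, but it has to be selected in every replication to survive the intersection, which is what makes the intersected support stricter than one fit.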
"A Note on the Cobb-Douglas Function": The Review of Economic Studies, Vol. 30, No. 2, (1963 ), pp. 93-94
Shorter Simon & Levy (1963): I am sickened by the weakness of your model's goodness-of-fit test. (Does make me reconsider the many papers I still see using Cobb-Douglas...)
april 2008 by cshalizi
[0802.3364] Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process
"In regression with random design, we study the problem of selecting a model that performs well for out-of-sample prediction. We do not assume that any of the candidate models under consideration are correct. Our analysis is based on explicit finite-sample results. Our main findings differ from those of other analyses that are based on traditional large-sample limit approximations because we consider a situation where the sample size is small relative to the complexity of the data-generating process, in the sense that the number of parameters in a `good' model is of the same order as sample size. Also, we allow for the case where the number of candidate models is (much) larger than sample size."
february 2008 by cshalizi
Fast and robust bootstrap
"recent developments on a bootstrap method for robust estimators which is computationally faster and more resistant to outliers than the classical bootstrap"
bootstrap  statistics  linear_regression
february 2008 by cshalizi
interaction models « orgtheory.net
Good - but why use a linear regression model in the first place? (Memo to self: write teaching note about testing for interactions with nonparametric regressions)