cshalizi + regression   529

Sadhanala , Tibshirani : Additive models with trend filtering
"We study additive models built with trend filtering, that is, additive models whose components are each regularized by the (discrete) total variation of their kth (discrete) derivative, for a chosen integer k ≥ 0. This results in kth degree piecewise polynomial components (e.g., k = 0 gives piecewise constant components, k = 1 gives piecewise linear, k = 2 gives piecewise quadratic, etc.). Analogous to its advantages in the univariate case, additive trend filtering has favorable theoretical and computational properties, thanks in large part to the localized nature of the (discrete) total variation regularizer that it uses. On the theory side, we derive fast error rates for additive trend filtering estimates, and show these rates are minimax optimal when the underlying function is additive and has component functions whose derivatives are of bounded variation. We also show that these rates are unattainable by additive smoothing splines (and by additive models built from linear smoothers, in general). On the computational side, we use backfitting to leverage fast univariate trend filtering solvers; we also describe a new backfitting algorithm whose iterations can be run in parallel, which (as far as we can tell) is the first of its kind. Lastly, we present a number of experiments to examine the empirical performance of trend filtering."
to:NB  regression  additive_models  statistics  kith_and_kin  tibshirani.ryan
4 weeks ago by cshalizi
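--- The backfitting loop they build on is simple enough to sketch. A minimal illustration (mine, not the paper's code; a k-nearest-neighbor running-mean smoother stands in for the univariate trend filtering solver):

```python
import numpy as np

def smooth_1d(x, r, k=10):
    """k-nearest-neighbor running-mean smoother, a stand-in for the
    univariate trend filtering solver the paper actually uses."""
    fitted = np.empty_like(r)
    for i in range(len(x)):
        idx = np.argsort(np.abs(x - x[i]))[:k]
        fitted[i] = r[idx].mean()
    return fitted - fitted.mean()  # center each component for identifiability

def backfit(X, y, n_iter=20, k=10):
    """Backfitting for an additive model y = mu + sum_j f_j(x_j) + noise."""
    n, p = X.shape
    mu = y.mean()
    F = np.zeros((n, p))  # current fitted values of each component
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove every component except the j-th
            partial_resid = y - mu - F.sum(axis=1) + F[:, j]
            F[:, j] = smooth_1d(X[:, j], partial_resid, k)
    return mu, F

rng = np.random.default_rng(0)
n = 300
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
mu, F = backfit(X, y)
mse = np.mean((y - mu - F.sum(axis=1)) ** 2)
```

Swapping an actual trend filtering solver in for `smooth_1d` would give (one version of) the paper's estimator.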
[1907.02306] Consistent Regression using Data-Dependent Coverings
"In this paper, we introduce a novel method to generate interpretable regression function estimators. The idea is based on so-called data-dependent coverings. The aim is to extract from the data a covering of the feature space instead of a partition. The estimator predicts the empirical conditional expectation over the cells of the partitions generated from the coverings. Thus, such an estimator has the same form as those issued from data-dependent partitioning algorithms. We give sufficient conditions to ensure consistency, avoiding the sufficient condition of shrinkage of the cells that appears in the earlier literature. Doing so, we reduce the number of covering elements. We show that such coverings are interpretable and each element of the covering is tagged as significant or insignificant. The proof of consistency is based on a control of the error of the empirical estimation of conditional expectations, which is interesting on its own."
to:NB  statistics  regression  nonparametrics  to_read
5 weeks ago by cshalizi
[1603.07632] Statistical inference in sparse high-dimensional additive models
"In this paper we discuss the estimation of a nonparametric component f1 of a nonparametric additive model Y=f1(X1)+...+fq(Xq)+ϵ. We allow the number q of additive components to grow to infinity and we make sparsity assumptions about the number of nonzero additive components. We compare this estimation problem with that of estimating f1 in the oracle model Z=f1(X1)+ϵ, for which the additive components f2,…,fq are known. We construct a two-step presmoothing-and-resmoothing estimator of f1 and state finite-sample bounds for the difference between our estimator and some smoothing estimators f̂_1^(oracle) in the oracle model. In an asymptotic setting these bounds can be used to show asymptotic equivalence of our estimator and the oracle estimators; the paper thus shows that, asymptotically, under strong enough sparsity conditions, knowledge of f2,…,fq has no effect on estimation accuracy. Our first step is to estimate f1 with an undersmoothed estimator based on near-orthogonal projections with a group Lasso bias correction. We then construct pseudo responses Ŷ by evaluating a debiased modification of our undersmoothed estimator of f1 at the design points. In the second step the smoothing method of the oracle estimator f̂_1^(oracle) is applied to a nonparametric regression problem with responses Ŷ and covariates X1. Our mathematical exposition centers primarily on establishing properties of the presmoothing estimator. We present simulation results demonstrating close-to-oracle performance of our estimator in practical applications."
to:NB  additive_models  statistics  regression  nonparametrics  sparsity
6 weeks ago by cshalizi
[1910.09227] Safe-Bayesian Generalized Linear Regression
"We study generalized Bayesian inference under misspecification, i.e. when the model is 'wrong but useful'. Generalized Bayes equips the likelihood with a learning rate η. We show that for generalized linear models (GLMs), η-generalized Bayes concentrates around the best approximation of the truth within the model for specific η ≠ 1, even under severely misspecified noise, as long as the tails of the true distribution are exponential. We then derive MCMC samplers for generalized Bayesian lasso and logistic regression, and give examples of both simulated and real-world data in which generalized Bayes outperforms standard Bayes by a vast margin."
to:NB  regression  bayesian_consistency  statistics  grunwald.peter  misspecification
6 weeks ago by cshalizi
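--- The η-generalized posterior is just the usual posterior with the likelihood raised to the power η. A toy conjugate-normal sketch of what the learning rate does (my illustration, not the paper's samplers):

```python
import numpy as np

def eta_posterior(y, sigma2, prior_mean, prior_var, eta):
    """Generalized-Bayes posterior for a normal mean: the N(theta, sigma2)
    likelihood is raised to the power eta before combining with a
    N(prior_mean, prior_var) prior. eta = 1 recovers standard Bayes."""
    n = len(y)
    post_prec = 1 / prior_var + eta * n / sigma2
    post_var = 1 / post_prec
    post_mean = post_var * (prior_mean / prior_var + eta * y.sum() / sigma2)
    return post_mean, post_var

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=50)
m1, v1 = eta_posterior(y, 1.0, 0.0, 10.0, eta=1.0)  # standard Bayes
m5, v5 = eta_posterior(y, 1.0, 0.0, 10.0, eta=0.5)  # tempered likelihood
```

A smaller η down-weights the (possibly misspecified) likelihood, giving a wider, more cautious posterior around essentially the same center.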
[1806.04823] Plug-in Regularized Estimation of High-Dimensional Parameters in Nonlinear Semiparametric Models
"We propose an l1-regularized M-estimator for a high-dimensional sparse parameter that is identified by a class of semiparametric conditional moment restrictions (CMR). We estimate the nonparametric nuisance parameter by modern machine learning methods. Plugging the first-stage estimate into the CMR, we construct the M-estimator loss function for the target parameter so that its gradient is insensitive (formally, Neyman-orthogonal) with respect to the first-stage regularization bias. As a result, the estimator achieves the oracle convergence rate √(k log p/n), where the oracle knows the true first stage and solves only a parametric problem. We apply our results to conditional moment models with missing data, games of incomplete information and treatment effects in regression models with non-linear link functions."
to:NB  regression  high-dimensional_statistics  statistics
6 weeks ago by cshalizi
[1910.06443] Measurement error as a missing data problem
"This article focuses on measurement error in covariates in regression analyses in which the aim is to estimate the association between one or more covariates and an outcome, adjusting for confounding. Error in covariate measurements, if ignored, results in biased estimates of parameters representing the associations of interest. Studies with variables measured with error can be considered as studies in which the true variable is missing, for either some or all study participants. We make the link between measurement error and missing data and describe methods for correcting for bias due to covariate measurement error with reference to this link, including regression calibration (conditional mean imputation), maximum likelihood and Bayesian methods, and multiple imputation. The methods are illustrated using data from the Third National Health and Nutrition Examination Survey (NHANES III) to investigate the association between the error-prone covariate systolic blood pressure and the hazard of death due to cardiovascular disease, adjusted for several other variables including those subject to missing data. We use multiple imputation and Bayesian approaches that can address both measurement error and missing data simultaneously. Example data and R code are provided in supplementary materials."
to:NB  measurement  statistics  regression  missing_data  to_be_shot_after_a_fair_trial
7 weeks ago by cshalizi
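--- Regression calibration (conditional mean imputation) is easy to sketch. A toy simulation (mine, not the paper's NHANES analysis) showing the attenuation bias and its correction:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(0, 1, n)               # true covariate
w = x + rng.normal(0, 1, n)           # error-prone measurement of x
y = 2.0 * x + rng.normal(0, 0.5, n)   # outcome depends on the true x

# naive analysis: regressing y on w attenuates the slope by the
# reliability ratio 1/(1+1), so we get roughly 1 instead of 2
naive = np.polyfit(w, y, 1)[0]

# regression calibration: learn E[x | w] on a validation subsample
# where the true x was also recorded, impute it for everyone,
# and regress y on the imputed values
val = slice(0, 500)
slope, intercept = np.polyfit(w[val], x[val], 1)
xhat = intercept + slope * w
calibrated = np.polyfit(xhat, y, 1)[0]
```

This is the "true variable is missing for (almost) all participants" framing of the paper, with the validation subsample playing the role of the complete cases.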
[1910.06386] All of Linear Regression
"Least squares linear regression is one of the oldest and most widely used data analysis tools. Although the theoretical analysis of the ordinary least squares (OLS) estimator is as old, several fundamental questions are yet to be answered. Suppose regression observations (X1,Y1),…,(Xn,Yn) ∈ ℝ^d × ℝ (not necessarily independent) are available. Some of the questions we deal with are as follows: under what conditions, does the OLS estimator converge and what is the limit? What happens if the dimension is allowed to grow with n? What happens if the observations are dependent with dependence possibly strengthening with n? How to do statistical inference under these kinds of misspecification? What happens to the OLS estimator under variable selection? How to do inference under misspecification and variable selection?
"We answer all the questions raised above with one simple deterministic inequality which holds for any set of observations and any sample size. This implies that all our results are finite-sample (non-asymptotic) in nature. In the end, one only needs to bound certain random quantities under specific settings of interest to get concrete rates and we derive these bounds for the case of independent observations. In particular, the problem of inference after variable selection is studied, for the first time, when d, the number of covariates, increases (almost exponentially) with sample size n. We provide comments on the "right" statistic to consider for inference under variable selection and efficient computation of quantiles."
to:NB  regression  statistics  to_read  re:TALR  to_teach:linear_models
7 weeks ago by cshalizi
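--- The misspecification-robust view of OLS (estimating the best linear approximation, not a "true" linear model) goes with sandwich variance estimates. A minimal HC0 sketch (my illustration, not the paper's inequality):

```python
import numpy as np

def ols_sandwich(X, y):
    """OLS with heteroskedasticity-robust (HC0 sandwich) standard errors,
    which remain valid for the best linear approximation even when the
    linear model is misspecified."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    Xe = X * resid[:, None]          # rows x_i * e_i
    meat = Xe.T @ Xe                 # sum of e_i^2 x_i x_i^T
    cov = XtX_inv @ meat @ XtX_inv   # bread * meat * bread
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 3.0 * x + x * rng.standard_normal(n)  # heteroskedastic noise
beta, se = ols_sandwich(X, y)
```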
[1910.04743] The Implicit Regularization of Ordinary Least Squares Ensembles
"Ensemble methods that average over a collection of independent predictors that are each limited to a subsampling of both the examples and features of the training data command a significant presence in machine learning, such as the ever-popular random forest, yet the nature of the subsampling effect, particularly of the features, is not well understood. We study the case of an ensemble of linear predictors, where each individual predictor is fit using ordinary least squares on a random submatrix of the data matrix. We show that, under standard Gaussianity assumptions, when the number of features selected for each predictor is optimally tuned, the asymptotic risk of a large ensemble is equal to the asymptotic ridge regression risk, which is known to be optimal among linear predictors in this setting. In addition to eliciting this implicit regularization that results from subsampling, we also connect this ensemble to the dropout technique used in training deep (neural) networks, another strategy that has been shown to have a ridge-like regularizing effect."
to:NB  ensemble_methods  regression  statistics
7 weeks ago by cshalizi
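--- The ensemble under study is easy to simulate. A small sketch (mine; the shrinkage it exhibits is the implicit ridge-like regularization the paper quantifies):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 50
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d) / np.sqrt(d)
y = X @ beta + rng.standard_normal(n)

def ols_subset_ensemble(X, y, n_feats, n_models=200):
    """Average of OLS fits, each restricted to a random subset of the
    columns; the feature subsampling acts as implicit shrinkage."""
    n, d = X.shape
    coef = np.zeros(d)
    for _ in range(n_models):
        S = rng.choice(d, size=n_feats, replace=False)
        b, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        coef[S] += b / n_models  # coordinates outside S contribute zero
    return coef

coef_full, *_ = np.linalg.lstsq(X, y, rcond=None)
coef_ens = ols_subset_ensemble(X, y, n_feats=20)
# the ensemble's averaged coefficient vector is shrunk relative to full OLS,
# much as a ridge penalty would shrink it
```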
[1905.03353] Regression from Dependent Observations
"The standard linear and logistic regression models assume that the response variables are independent, but share the same linear relationship to their corresponding vectors of covariates. The assumption that the response variables are independent is, however, too strong. In many applications, these responses are collected on nodes of a network, or some spatial or temporal domain, and are dependent. Examples abound in financial and meteorological applications, and dependencies naturally arise in social networks through peer effects. Regression with dependent responses has thus received a lot of attention in the Statistics and Economics literature, but there are no strong consistency results unless multiple independent samples of the vectors of dependent responses can be collected from these models. We present computationally and statistically efficient methods for linear and logistic regression models when the response variables are dependent on a network. Given one sample from a networked linear or logistic regression model and under mild assumptions, we prove strong consistency results for recovering the vector of coefficients and the strength of the dependencies, recovering the rates of standard regression under independent observations. We use projected gradient descent on the negative log-likelihood, or negative log-pseudolikelihood, and establish their strong convexity and consistency using concentration of measure for dependent random variables."

--- Umm? Spatial statistics is a thing? From a very quick skim, they seem to just be reproducing the usual asymptotics about M-estimators, but maybe the concentration results will be of interest. (Note: published in a CS theory conference, not a stats journal or even an ML conference.)
to:NB  network_data_analysis  regression  statistics  to_be_shot_after_a_fair_trial
7 weeks ago by cshalizi
[1806.03467] Orthogonal Random Forest for Causal Inference
"We propose the orthogonal random forest, an algorithm that combines Neyman-orthogonality to reduce sensitivity with respect to estimation error of nuisance parameters with generalized random forests (Athey et al., 2017)--a flexible non-parametric method for statistical estimation of conditional moment models using random forests. We provide a consistency rate and establish asymptotic normality for our estimator. We show that under mild assumptions on the consistency rate of the nuisance estimator, we can achieve the same error rate as an oracle with a priori knowledge of these nuisance parameters. We show that when the nuisance functions have a locally sparse parametrization, then a local ℓ1-penalized regression achieves the required rate. We apply our method to estimate heterogeneous treatment effects from observational data with discrete treatments or continuous treatments, and we show that, unlike prior work, our method provably allows to control for a high-dimensional set of variables under standard sparsity conditions. We also provide a comprehensive empirical evaluation of our algorithm on both synthetic and real data."
to:NB  decision_trees  ensemble_methods  regression  causal_inference  statistics  nonparametrics  random_forests
9 weeks ago by cshalizi
[1909.09370] Consensual aggregation of clusters based on Bregman divergences to improve predictive models
"A new procedure to construct predictive models in supervised learning problems by paying attention to the clustering structure of the input data is introduced. We are interested in situations where the input data consists of more than one unknown cluster, and where there exist different underlying models on these clusters. Thus, instead of constructing a single predictive model on the whole dataset, we propose to use a K-means clustering algorithm with different options of Bregman divergences, to recover the clustering structure of the input data. Then one dedicated predictive model is fit per cluster. For each divergence, we construct a simple local predictor on each observed cluster. We obtain one estimator, the collection of the K simple local predictors, per divergence, and we propose to combine them in a smart way based on a consensus idea. Several versions of consensual aggregation in both classification and regression problems are considered. A comparison of the performances of all constructed estimators on different simulated and real data assesses the excellent performance of our method. In a large variety of prediction problems, the consensual aggregation procedure outperforms all the other models."

--- Compare to Gershenfeld's old cluster-weighted modeling...
to:NB  clustering  regression  statistics
10 weeks ago by cshalizi
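--- The cluster-then-predict recipe, minus the consensual aggregation step, in a few lines (my toy illustration; squared Euclidean distance is the simplest Bregman divergence, and the extreme-point initialization is just a convenience for this 1-d example):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
# two latent clusters carrying different linear relationships
x = np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)])
y = np.where(x < 0, 2 * x + 1, -x + 4) + 0.3 * rng.standard_normal(n)

# Lloyd's K-means with squared Euclidean distance
centers = np.array([x.min(), x.max()])
for _ in range(20):
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    centers = np.array([x[labels == j].mean() for j in (0, 1)])

# one dedicated local linear predictor per recovered cluster
preds = np.empty(n)
for j in (0, 1):
    m = labels == j
    s, b = np.polyfit(x[m], y[m], 1)
    preds[m] = s * x[m] + b

gs, gb = np.polyfit(x, y, 1)          # single global model, for contrast
mse_local = np.mean((y - preds) ** 2)
mse_global = np.mean((y - (gs * x + gb)) ** 2)
```

When the clusters really do carry different regression functions, the local predictors beat any single global fit, which is the situation the paper targets.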
[1909.09138] Uncovering Sociological Effect Heterogeneity using Machine Learning
"Individuals do not respond uniformly to treatments, events, or interventions. Sociologists routinely partition samples into subgroups to explore how the effects of treatments vary by covariates like race, gender, and socioeconomic status. In so doing, analysts determine the key subpopulations based on theoretical priors. Data-driven discoveries are also routine, yet the analyses by which sociologists typically go about them are problematic and seldom move us beyond our expectations, and biases, to explore new meaningful subgroups. Emerging machine learning methods allow researchers to explore sources of variation that they may not have previously considered, or envisaged. In this paper, we use causal trees to recursively partition the sample and uncover sources of treatment effect heterogeneity. We use honest estimation, splitting the sample into a training sample to grow the tree and an estimation sample to estimate leaf-specific effects. Assessing a central topic in the social inequality literature, college effects on wages, we compare what we learn from conventional approaches for exploring variation in effects to causal trees. Given our use of observational data, we use leaf-specific matching and sensitivity analyses to address confounding and offer interpretations of effects based on observed and unobserved heterogeneity. We encourage researchers to follow similar practices in their work on variation in sociological effects."
to:NB  causal_inference  statistics  regression  nonparametrics
10 weeks ago by cshalizi
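--- Honest estimation in miniature (my toy sketch of a depth-one causal tree, not the authors' code): grow the split on one half of the sample, estimate leaf-specific effects on the other half.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = rng.uniform(0, 1, n)              # candidate moderator
t = rng.integers(0, 2, n)             # randomized binary treatment
tau = np.where(x < 0.5, 0.0, 2.0)     # true heterogeneous effect
y = tau * t + rng.standard_normal(n)

idx = rng.permutation(n)
tr, es = idx[: n // 2], idx[n // 2:]  # honest split: grow vs. estimate

def effect(ix):
    """Difference in treated and control means on the subsample ix."""
    return y[ix][t[ix] == 1].mean() - y[ix][t[ix] == 0].mean()

# "grow" the tree on the training half: pick the split point that
# maximizes the gap between the two leaves' estimated effects
cands = np.linspace(0.1, 0.9, 17)
best = max(cands,
           key=lambda c: abs(effect(tr[x[tr] < c]) - effect(tr[x[tr] >= c])))

# honest step: leaf effects come from the held-out estimation half,
# so the split search cannot bias them
left = effect(es[x[es] < best])
right = effect(es[x[es] >= best])
```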
IPAD: Stable Interpretable Forecasting with Knockoffs Inference: Journal of the American Statistical Association: Vol 0, No 0
"Interpretability and stability are two important features that are desired in many contemporary big data applications arising in statistics, economics, and finance. While the former is enjoyed to some extent by many existing forecasting approaches, the latter in the sense of controlling the fraction of wrongly discovered features which can enhance greatly the interpretability is still largely underdeveloped. To this end, in this article, we exploit the general framework of model-X knockoffs introduced recently in Candès, Fan, Janson and Lv (2018), "Panning for Gold: 'Model-X' Knockoffs for High Dimensional Controlled Variable Selection," Journal of the Royal Statistical Society, Series B, 80, 551–577, which is nonconventional for reproducible large-scale inference in that the framework is completely free of the use of p-values for significance testing, and suggest a new method of intertwined probabilistic factors decoupling (IPAD) for stable interpretable forecasting with knockoffs inference in high-dimensional models. The recipe of the method is constructing the knockoff variables by assuming a latent factor model that is exploited widely in economics and finance for the association structure of covariates. Our method and work are distinct from the existing literature in that we estimate the covariate distribution from data instead of assuming that it is known when constructing the knockoff variables, our procedure does not require any sample splitting, we provide theoretical justifications on the asymptotic false discovery rate control, and the theory for the power analysis is also established. Several simulation examples and the real data analysis further demonstrate that the newly suggested method has appealing finite-sample performance with desired interpretability and stability compared to some popularly used forecasting methods."
to:NB  statistics  regression  high-dimensional_statistics  factor_analysis  variable_selection
11 weeks ago by cshalizi
[1908.11140] Deep Learning and MARS: A Connection
"We consider least squares regression estimates using deep neural networks. We show that these estimates satisfy an oracle inequality, which implies that (up to a logarithmic factor) the error of these estimates is at least as small as the optimal possible error bound which one would expect for MARS in case that this procedure would work in the optimal way. As a result we show that our neural networks are able to achieve a dimensionality reduction in case that the regression function locally has low dimensionality. This assumption seems to be realistic in real-world applications, since selected high-dimensional data are often confined to locally-low-dimensional distributions. In our simulation study we provide numerical experiments to support our theoretical results and to compare our estimate with other conventional nonparametric regression estimates, especially with MARS. The use of our estimates is illustrated through a real data analysis."
to:NB  regression  nonparametrics  neural_networks  statistics
11 weeks ago by cshalizi
[1906.00232] Kernel Instrumental Variable Regression
"Instrumental variable regression is a strategy for learning causal relationships in observational data. If measurements of input X and output Y are confounded, the causal relationship can nonetheless be identified if an instrumental variable Z is available that influences X directly, but is conditionally independent of Y given X and the unmeasured confounder. The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which the convergence rate achieves the minimax optimal rate for unconfounded, one-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state of the art alternatives for nonparametric instrumental variable regression. Of independent interest, we provide a more general theory of conditional mean embedding regression in which the RKHS has infinite dimension."
to:NB  instrumental_variables  kernel_estimators  regression  nonparametrics  causal_inference  statistics  re:ADAfaEPoV  to_read
11 weeks ago by cshalizi
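--- The linear baseline that KIV generalizes is two-stage least squares, which fits in a few lines (my illustration, not the paper's kernel method):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
z = rng.standard_normal(n)                # instrument
u = rng.standard_normal(n)                # unobserved confounder
x = z + u + 0.5 * rng.standard_normal(n)  # treatment, confounded by u
y = 1.5 * x + 2.0 * u + rng.standard_normal(n)

naive = np.polyfit(x, y, 1)[0]  # biased upward by the confounder

# two-stage least squares: stage 1 predicts x from z,
# stage 2 regresses y on that prediction
s1, i1 = np.polyfit(z, x, 1)
xhat = i1 + s1 * z
tsls = np.polyfit(xhat, y, 1)[0]  # recovers the causal slope 1.5
```

KIV replaces both linear stages with RKHS regressions, so the same logic applies when the structural function is nonlinear.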
[1909.03968] Tree-based Control Methods: Consequences of Moving the US Embassy
"We recast the synthetic controls for evaluating policies as a counterfactual prediction problem and replace its linear regression with a non-parametric model inspired by machine learning. The proposed method enables us to achieve more accurate counterfactual predictions. We apply our method to a highly-debated policy: the movement of the US embassy to Jerusalem. In Israel and Palestine, we find that the average number of weekly conflicts has increased by roughly 103 % over 48 weeks since the movement was announced on December 6, 2017. Using conformal inference tests, we justify our model and find the increase to be statistically significant."

--- I am very skeptical of the application, but interested in the methodology.
to:NB  causal_inference  statistics  economics  nonparametrics  regression  re:ADAfaEPoV  synthetic_controls
11 weeks ago by cshalizi
[1909.05495] Optimal choice of $k$ for $k$-nearest neighbor regression
"The k-nearest neighbor algorithm (k-NN) is a widely used non-parametric method for classification and regression. We study the mean squared error of the k-NN estimator when k is chosen by leave-one-out cross-validation (LOOCV). Although it was known that this choice of k is asymptotically consistent, it was not known previously that it is an optimal k. We show, with high probability, the mean squared error of this estimator is close to the minimum mean squared error using the k-NN estimate, where the minimum is over all choices of k."

--- Looks legit on first pass (and we know that LOOCV is generally _predictively_ good).
to:NB  regression  nearest_neighbors  statistics  cross-validation  to_teach:data-mining  have_skimmed
11 weeks ago by cshalizi
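--- The procedure they analyze is easy to implement: for k-NN regression the leave-one-out prediction at a training point is just the average of its k nearest *other* points (my sketch):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

dist = np.abs(x[:, None] - x[None, :])
order = np.argsort(dist, axis=1)  # column 0 is each point itself

def loocv_mse(k):
    # leave-one-out k-NN prediction: skip column 0 (the point itself)
    preds = y[order[:, 1 : k + 1]].mean(axis=1)
    return np.mean((y - preds) ** 2)

errs = [loocv_mse(k) for k in range(1, 60)]
best_k = 1 + int(np.argmin(errs))
```

The paper's result says the k chosen this way achieves (with high probability) nearly the minimal MSE over all choices of k, not just consistency.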
[1909.02088] On Least Squares Estimation under Heteroscedastic and Heavy-Tailed Errors
"We consider least squares estimation in a general nonparametric regression model. The rate of convergence of the least squares estimator (LSE) for the unknown regression function is well studied when the errors are sub-Gaussian. We find upper bounds on the rates of convergence of the LSE when the errors have uniformly bounded conditional variance and have only finitely many moments. We show that the interplay between the moment assumptions on the error, the metric entropy of the class of functions involved, and the "local" structure of the function class around the truth drives the rate of convergence of the LSE. We find sufficient conditions on the errors under which the rate of the LSE matches the rate of the LSE under sub-Gaussian error. Our results are finite sample and allow for heteroscedastic and heavy-tailed errors."
to:NB  regression  empirical_processes  statistics  heavy_tails
12 weeks ago by cshalizi
"Dyadic data, where outcomes reflecting pairwise interaction among sampled units are of primary interest, arise frequently in social science research. Regression analyses with such data feature prominently in many research literatures (e.g., gravity models of trade). The dependence structure associated with dyadic data raises special estimation and, especially, inference issues. This chapter reviews currently available methods for (parametric) dyadic regression analysis and presents guidelines for empirical researchers."
to:NB  regression  network_data_analysis  statistics
august 2019 by cshalizi
[1908.08596] Regression Analysis of Unmeasured Confounding
"When studying the causal effect of x on y, researchers may conduct regression and report a confidence interval for the slope coefficient βx. This common confidence interval provides an assessment of uncertainty from sampling error, but it does not assess uncertainty from confounding. An intervention on x may produce a response in y that is unexpected, and our misinterpretation of the slope happens when there are confounding factors w. When w are measured we may conduct multiple regression, but when w are unmeasured it is common practice to include a precautionary statement when reporting the confidence interval, warning against unwarranted causal interpretation. If the goal is robust causal interpretation then we can do something more informative. Uncertainty in the specification of three confounding parameters can be propagated through an equation to produce a confounding interval. Here we develop supporting mathematical theory and describe an example application. Our proposed methodology applies well to studies of a continuous response or rare outcome. It is a general method for quantifying error from model uncertainty. Whereas confidence intervals are used to assess uncertainty from unmeasured individuals, confounding intervals can be used to assess uncertainty from unmeasured attributes."

--- How is this not just sensitivity analysis?
to:NB  regression  statistics  misspecification  causal_inference
august 2019 by cshalizi
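--- The arithmetic of a confounding interval, as I understand it (my sketch with made-up numbers, not the paper's example): propagate assumed bounds on the omitted-variable-bias parameters through the usual bias formula.

```python
# With a single unmeasured confounder w, the simple-regression slope of
# y on x equals beta_x + beta_w * delta, where beta_w is w's effect on y
# and delta is the slope of w on x. Ranging those two confounding
# parameters over assumed bounds turns a confidence interval into a
# wider "confounding interval". All numbers below are hypothetical.
beta_hat = 1.8                # estimated slope of y on x
ci = (1.6, 2.0)               # sampling-based confidence interval
beta_w_range = (-0.5, 0.5)    # assumed bounds on w's effect on y
delta_range = (-0.4, 0.4)     # assumed bounds on the slope of w on x

biases = [bw * d for bw in beta_w_range for d in delta_range]
confounding_interval = (ci[0] - max(biases), ci[1] - min(biases))
# -> (1.4, 2.2): wider than the confidence interval, reflecting
# uncertainty from unmeasured attributes, not unmeasured individuals
```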
[1908.07193] Counterfactual Distribution Regression for Structured Inference
"We consider problems in which a system receives external \emph{perturbations} from time to time. For instance, the system can be a train network in which particular lines are repeatedly disrupted without warning, having an effect on passenger behavior. The goal is to predict changes in the behavior of the system at particular points of interest, such as passenger traffic around stations at the affected rails. We assume that the data available provides records of the system functioning at its "natural regime" (e.g., the train network without disruptions) and data on cases where perturbations took place. The inference problem is how information concerning perturbations, with particular covariates such as location and time, can be generalized to predict the effect of novel perturbations. We approach this problem from the point of view of a mapping from the counterfactual distribution of the system behavior without disruptions to the distribution of the disrupted system. A variant on \emph{distribution regression} is developed for this setup."
to:NB  causal_inference  regression  statistics
august 2019 by cshalizi
[1612.08468] Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models
"When fitting black box supervised learning models (e.g., complex trees, neural networks, boosted trees, random forests, nearest neighbors, local kernel-weighted methods, etc.), visualizing the main effects of the individual predictor variables and their low-order interaction effects is often important, and partial dependence (PD) plots are the most popular approach for accomplishing this. However, PD plots involve a serious pitfall if the predictor variables are far from independent, which is quite common with large observational data sets. Namely, PD plots require extrapolation of the response at predictor values that are far outside the multivariate envelope of the training data, which can render the PD plots unreliable. Although marginal plots (M plots) do not require such extrapolation, they produce substantially biased and misleading results when the predictors are dependent, analogous to the omitted variable bias in regression. We present a new visualization approach that we term accumulated local effects (ALE) plots, which inherits the desirable characteristics of PD and M plots, without inheriting their preceding shortcomings. Like M plots, ALE plots do not require extrapolation; and like PD plots, they are not biased by the omitted variable phenomenon. Moreover, ALE plots are far less computationally expensive than PD plots."
to:NB  variable_selection  visual_display_of_quantitative_information  statistics  regression
august 2019 by cshalizi
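--- A minimal first-order ALE implementation (mine, written from the paper's description): bin the feature, average local prediction differences within each bin, accumulate, and center. Unlike a PD plot, the prediction is only ever evaluated near observed data.

```python
import numpy as np

def ale_1d(predict, X, j, n_bins=20):
    """First-order accumulated local effects of feature j."""
    xj = X[:, j]
    edges = np.quantile(xj, np.linspace(0, 1, n_bins + 1))
    which = np.clip(np.searchsorted(edges, xj, side="right") - 1,
                    0, n_bins - 1)
    local = np.zeros(n_bins)
    for b in range(n_bins):
        m = which == b
        if m.any():
            lo, hi = X[m].copy(), X[m].copy()
            lo[:, j], hi[:, j] = edges[b], edges[b + 1]
            # average prediction change across the bin, at observed points
            local[b] = (predict(hi) - predict(lo)).mean()
    ale = np.concatenate([[0.0], np.cumsum(local)])
    return edges, ale - ale.mean()

rng = np.random.default_rng(9)
X = rng.standard_normal((1000, 2))
X[:, 1] = 0.8 * X[:, 0] + 0.6 * rng.standard_normal(1000)  # dependent features
f = lambda Z: Z[:, 0] ** 2 + Z[:, 1]   # stand-in for a black-box model
edges, ale = ale_1d(f, X, j=0)
# the ALE curve recovers the x_0**2 shape despite the correlation with x_1
```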
Testing Sparsity-Inducing Penalties: Journal of Computational and Graphical Statistics: Vol 0, No 0
"Many penalized maximum likelihood estimators correspond to posterior mode estimators under specific prior distributions. Appropriateness of a particular class of penalty functions can therefore be interpreted as the appropriateness of a prior for the parameters. For example, the appropriateness of a lasso penalty for regression coefficients depends on the extent to which the empirical distribution of the regression coefficients resembles a Laplace distribution. We give a testing procedure of whether or not a Laplace prior is appropriate and accordingly, whether or not using a lasso penalized estimate is appropriate. This testing procedure is designed to have power against exponential power priors which correspond to ℓq penalties. Via simulations, we show that this testing procedure achieves the desired level and has enough power to detect violations of the Laplace assumption when the numbers of observations and unknown regression coefficients are large. We then introduce an adaptive procedure that chooses a more appropriate prior and corresponding penalty from the class of exponential power priors when the null hypothesis is rejected. We show that this can improve estimation of the regression coefficients both when they are drawn from an exponential power distribution and when they are drawn from a spike-and-slab distribution. Supplementary materials for this article are available online."

--- I feel like I fundamentally disagree with this approach. Those priors are merely (to quote Jamie Robins and Larry Wasserman) "frequentist pursuit", and have no bearing on whether (say) the Lasso will give a good sparse, linear approximation to the underlying regression function (see https://normaldeviate.wordpress.com/2013/09/11/consistency-sparsistency-and-presistency/). All of which said, Hoff is always worth listening to, so the last tag applies with special force.
to:NB  model_checking  sparsity  regression  hypothesis_testing  bayesianism  re:phil-of-bayes_paper  hoff.peter  to_besh
august 2019 by cshalizi
High-Dimensional Adaptive Minimax Sparse Estimation With Interactions - IEEE Journals & Magazine
"High-dimensional linear regression with interaction effects is broadly applied in research fields such as bioinformatics and social science. In this paper, first, we investigate the minimax rate of convergence for regression estimation in high-dimensional sparse linear models with two-way interactions. Here, we derive matching upper and lower bounds under three types of heredity conditions: strong heredity, weak heredity, and no heredity. From the results: 1) A stronger heredity condition may or may not drastically improve the minimax rate of convergence. In fact, in some situations, the minimax rates of convergence are the same under all three heredity conditions; 2) The minimax rate of convergence is determined by the maximum of the total price of estimating the main effects and that of estimating the interaction effects, which goes beyond purely comparing the order of the number of non-zero main effects r1 and non-zero interaction effects r2; and 3) Under any of the three heredity conditions, the estimation of the interaction terms may be the dominant part in determining the rate of convergence. This is due to either the dominant number of interaction effects over main effects or the higher interaction estimation price induced by a large ambient dimension. Second, we construct an adaptive estimator that achieves the minimax rate of convergence regardless of the true heredity condition and the sparsity indices r1, r2."
to:NB  statistics  high-dimensional_statistics  regression  sparsity  variable_selection  linear_regression
august 2019 by cshalizi
[1908.05355] The generalization error of random features regression: Precise asymptotics and double descent curve
"Deep learning methods operate in regimes that defy the traditional statistical mindset. The neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data.
"This phenomenon has been rationalized in terms of a so-called 'double descent' curve. As the model complexity increases, the generalization error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the generalization error is found in this overparametrized regime, often when the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates.
"In this paper we consider the problem of learning an unknown function over the d-dimensional sphere 𝕊^{d-1}, from n i.i.d. samples (x_i, y_i) ∈ 𝕊^{d-1} × ℝ, i ≤ n. We perform ridge regression on N random features of the form σ(w_a^T x), a ≤ N. This can be equivalently described as a two-layer neural network with random first-layer weights. We compute the precise asymptotics of the generalization error, in the limit N, n, d → ∞ with N/d and n/d fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon."
to:NB  learning_theory  regression  random_projections  statistics  montanari.andrea
august 2019 by cshalizi
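--- A minimal numpy sketch of the setup described above (ridge regression on random ReLU features, i.e. a two-layer network with random first-layer weights). The data, feature count, and penalty below are my illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features_ridge(X, y, N, lam, rng):
    """Fit ridge regression on N random ReLU features sigma(w_a^T x)."""
    d = X.shape[1]
    W = rng.normal(size=(N, d)) / np.sqrt(d)   # random first-layer weights
    Z = np.maximum(X @ W.T, 0.0)               # n x N feature matrix
    # ridge solution: (Z'Z + lam I)^{-1} Z'y
    a = np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T @ y)
    return W, a

def predict(X, W, a):
    return np.maximum(X @ W.T, 0.0) @ a

# toy data on the sphere: y is a smooth function of x plus noise
n, d = 200, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # project onto S^{d-1}
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# N > n: the overparametrized regime the paper studies
W, a = random_features_ridge(X, y, N=300, lam=1e-2, rng=rng)
train_mse = np.mean((predict(X, W, a) - y) ** 2)
```

With N > n and a small penalty the model (nearly) interpolates the training labels, which is the regime where the double descent analysis applies.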
[1611.03015] Honest confidence sets in nonparametric IV regression and other ill-posed models
"This paper develops inferential methods for a very general class of ill-posed models in econometrics encompassing the nonparametric instrumental regression, various functional regressions, and the density deconvolution. We focus on uniform confidence sets for the parameter of interest estimated with Tikhonov regularization, as in Darolles, Fan, Florens, and Renault (2011). Since it is impossible to have inferential methods based on the central limit theorem, we develop two alternative approaches relying on the concentration inequality and bootstrap approximations. We show that expected diameters and coverage properties of resulting sets have uniform validity over a large class of models, i.e., constructed confidence sets are honest. Monte Carlo experiments illustrate that introduced confidence sets have reasonable width and coverage properties. Using the U.S. data, we provide uniform confidence sets for Engel curves for various commodities."
to:NB  confidence_sets  nonparametrics  instrumental_variables  regression  causal_inference
august 2019 by cshalizi
[1908.04427] A Groupwise Approach for Inferring Heterogeneous Treatment Effects in Causal Inference
"There is a growing literature in nonparametric estimation of the conditional average treatment effect given a specific value of covariates. However, this estimate is often difficult to interpret if covariates are high dimensional, and in practice effect heterogeneity is discussed in terms of subgroups of individuals with similar attributes. The paper proposes to study treatment heterogeneity under the groupwise framework. Our method is simple, based only on linear regression and sample splitting, and is semiparametrically efficient under assumptions. We also discuss ways to conduct multiple testing. We conclude by reanalyzing a get-out-the-vote experiment during the 2014 U.S. midterm elections."
to:NB  causal_inference  regression  statistics  nonparametrics
august 2019 by cshalizi
[1605.02214] On cross-validated Lasso
"In this paper, we derive non-asymptotic error bounds for the Lasso estimator when the penalty parameter for the estimator is chosen using K-fold cross-validation. Our bounds imply that the cross-validated Lasso estimator has nearly optimal rates of convergence in the prediction, L2, and L1 norms. For example, we show that in the model with the Gaussian noise and under fairly general assumptions on the candidate set of values of the penalty parameter, the estimation error of the cross-validated Lasso estimator converges to zero in the prediction norm with the √(s log p / n) × √log(pn) rate, where n is the sample size of available data, p is the number of covariates, and s is the number of non-zero coefficients in the model. Thus, the cross-validated Lasso estimator achieves the fastest possible rate of convergence in the prediction norm up to a small logarithmic factor √log(pn), and similar conclusions apply for the convergence rate both in L2 and in L1 norms. Importantly, our results cover the case when p is (potentially much) larger than n and also allow for the case of non-Gaussian noise. Our paper therefore serves as a justification for the widely spread practice of using cross-validation as a method to choose the penalty parameter for the Lasso estimator."
to:NB  cross-validation  lasso  regression  statistics
august 2019 by cshalizi
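--- A toy illustration of the practice the paper justifies: choosing the Lasso penalty by K-fold cross-validation. The coordinate-descent solver and the candidate penalty grid are my own illustrative choices, not anything from the paper:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on (1/2n)||y - Xb||^2 + lam ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                      # residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]       # add back j's contribution
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def cv_lasso(X, y, lams, K=5, seed=0):
    """Pick the penalty minimizing K-fold cross-validated prediction error."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errs = []
    for lam in lams:
        e = 0.0
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[i] for i in range(K) if i != k])
            b = lasso_cd(X[train], y[train], lam)
            e += np.mean((y[test] - X[test] @ b) ** 2)
        errs.append(e / K)
    return lams[int(np.argmin(errs))], errs

rng = np.random.default_rng(7)
n, p, s = 100, 20, 3                  # s non-zero coefficients
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:s] = [3.0, -2.0, 1.5]
y = X @ beta + 0.5 * rng.normal(size=n)

lams = [0.01, 0.05, 0.1, 0.5]
best_lam, errs = cv_lasso(X, y, lams)
b_hat = lasso_cd(X, y, best_lam)
```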
[1908.02399] Estimation of Conditional Average Treatment Effects with High-Dimensional Data
"Given the unconfoundedness assumption, we propose new nonparametric estimators for the reduced dimensional conditional average treatment effect (CATE) function. In the first stage, the nuisance functions necessary for identifying CATE are estimated by machine learning methods, allowing the number of covariates to be comparable to or larger than the sample size. This is a key feature since identification is generally more credible if the full vector of conditioning variables, including possible transformations, is high-dimensional. The second stage consists of a low-dimensional kernel regression, reducing CATE to a function of the covariate(s) of interest. We consider two variants of the estimator depending on whether the nuisance functions are estimated over the full sample or over a hold-out sample. Building on Belloni at al. (2017) and Chernozhukov et al. (2018), we derive functional limit theory for the estimators and provide an easy-to-implement procedure for uniform inference based on the multiplier bootstrap."
to:NB  causal_inference  regression  statistics  high-dimensional_statistics  nonparametrics  kernel_estimators
august 2019 by cshalizi
[1908.02718] A Characterization of Mean Squared Error for Estimator with Bagging
"Bagging can significantly improve the generalization performance of unstable machine learning algorithms such as trees or neural networks. Though bagging is now widely used in practice and many empirical studies have explored its behavior, we still know little about the theoretical properties of bagged predictions. In this paper, we theoretically investigate how the bagging method can reduce the Mean Squared Error (MSE) when applied on a statistical estimator. First, we prove that for any estimator, increasing the number of bagged estimators N in the average can only reduce the MSE. This intuitive result, observed empirically and discussed in the literature, has not yet been rigorously proved. Second, we focus on the standard estimator of variance called unbiased sample variance and we develop an exact analytical expression of the MSE for this estimator with bagging.
"This allows us to rigorously discuss the number of iterations N and the batch size m of the bagging method. From this expression, we state that the MSE of the variance estimator can be reduced with bagging only if the kurtosis of the distribution is greater than 3/2. This result is important because it demonstrates that for distributions with low kurtosis, bagging can only deteriorate the performance of a statistical prediction. Finally, we propose a novel general-purpose algorithm to estimate with high precision the variance of a sample."
to:NB  ensemble_methods  prediction  regression  statistics
august 2019 by cshalizi
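--- An illustrative Monte Carlo of the object studied above: bagging the unbiased sample variance over bootstrap batches, with the MSE of the plain and bagged estimators estimated by simulation. The lognormal example, batch size, and replication counts are arbitrary choices of mine; this sets up the comparison but does not verify the paper's 3/2 kurtosis threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

def bagged_variance(x, n_bags, m, rng):
    """Average the unbiased sample variance over n_bags bootstrap batches of size m."""
    ests = [np.var(rng.choice(x, size=m, replace=True), ddof=1)
            for _ in range(n_bags)]
    return np.mean(ests)

def mc_mse(sampler, true_var, n, estimator, reps, rng):
    """Monte Carlo estimate of MSE of an estimator of the variance."""
    errs = [(estimator(sampler(n, rng)) - true_var) ** 2 for _ in range(reps)]
    return np.mean(errs)

# a heavy-tailed example: lognormal(0, 1), kurtosis well above 3/2
lognorm = lambda n, rng: rng.lognormal(0.0, 1.0, size=n)
true_var = (np.e - 1) * np.e          # Var of lognormal(0, 1)

plain = mc_mse(lognorm, true_var, 50,
               lambda x: np.var(x, ddof=1), 400, rng)
bagged = mc_mse(lognorm, true_var, 50,
                lambda x: bagged_variance(x, n_bags=50, m=25, rng=rng),
                400, rng)
```

Swapping in a low-kurtosis sampler (e.g. Bernoulli(1/2), kurtosis 1) is the natural way to probe the other side of the paper's condition.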
[1907.12732] Local Inference in Additive Models with Decorrelated Local Linear Estimator
"Additive models, as a natural generalization of linear regression, have played an important role in studying nonlinear relationships. Despite a rich literature and many recent advances on the topic, the statistical inference problem in additive models is still relatively poorly understood. Motivated by the inference for the exposure effect and other applications, we tackle in this paper the statistical inference problem for f_1'(x_0) in additive models, where f_1 denotes the univariate function of interest and f_1'(x_0) denotes its first order derivative evaluated at a specific point x_0. The main challenge for this local inference problem is the understanding and control of the additional uncertainty due to the need of estimating other components in the additive model as nuisance functions. To address this, we propose a decorrelated local linear estimator, which is particularly useful in reducing the effect of the nuisance function estimation error on the estimation accuracy of f_1'(x_0). We establish the asymptotic limiting distribution for the proposed estimator and then construct confidence interval and hypothesis testing procedures for f_1'(x_0). The variance level of the proposed estimator is of the same order as that of the local least squares in nonparametric regression, or equivalently the additive model with one component, while the bias of the proposed estimator is jointly determined by the statistical accuracies in estimating the nuisance functions and the relationship between the variable of interest and the nuisance variables. The method is developed for general additive models and is demonstrated in the high-dimensional sparse setting."
to:NB  additive_models  regression  statistics
august 2019 by cshalizi
Tan , Zhang : Doubly penalized estimation in additive regression with high-dimensional data
"Additive regression provides an extension of linear regression by modeling the signal of a response as a sum of functions of covariates of relatively low complexity. We study penalized estimation in high-dimensional nonparametric additive regression where functional semi-norms are used to induce smoothness of component functions and the empirical L2 norm is used to induce sparsity. The functional semi-norms can be of Sobolev or bounded variation types and are allowed to be different amongst individual component functions. We establish oracle inequalities for the predictive performance of such methods under three simple technical conditions: a sub-Gaussian condition on the noise, a compatibility condition on the design and the functional classes under consideration and an entropy condition on the functional classes. For random designs, the sample compatibility condition can be replaced by its population version under an additional condition to ensure suitable convergence of empirical norms. In homogeneous settings where the complexities of the component functions are of the same order, our results provide a spectrum of minimax convergence rates, from the so-called slow rate without requiring the compatibility condition to the fast rate under the hard sparsity or certain Lq sparsity to allow many small components in the true regression function. These results significantly broaden and sharpen existing ones in the literature."
to:NB  statistics  regression  additive_models  nonparametrics  empirical_processes
august 2019 by cshalizi
[1907.09244] Fast rates for empirical risk minimization with cadlag losses with bounded sectional variation norm
"Empirical risk minimization over sieves of the class of cadlag functions with bounded variation norm has a long history, starting with Total Variation Denoising (Rudin et al., 1992), and has been considered by several recent articles, in particular Fang et al. (2019) and van der Laan (2015).
"In this article, we show how a certain representation of cadlag functions with bounded sectional variation, also called Hardy-Krause variation, allows one to bound the bracketing entropy of such sieves and therefore derive fast rates of convergence in nonparametric function estimation. Specifically, for any sequence a_n that (slowly) diverges to ∞, we show that we can construct an estimator with rate of convergence O_P(2^{d/3} n^{-1/3} (log n)^{d/3} a_n^{2/3}) over these sieves, under some fairly general assumptions. Remarkably, the dimension only affects the rate in n through the logarithmic factor, making this method especially appropriate for high dimensional problems.
"In particular, we show that in the case of nonparametric regression over sieves of cadlag functions with bounded sectional variation norm, this upper bound on the rate of convergence holds for least-squares estimators, under the random design, sub-exponential errors setting."
to:NB  learning_theory  method_of_sieves  regression  empirical_processes  statistics  van_der_laan.mark
july 2019 by cshalizi
Liu , Shih , Strawderman , Zhang , Johnson , Chai : Statistical Analysis of Zero-Inflated Nonnegative Continuous Data: A Review
"Zero-inflated nonnegative continuous (or semicontinuous) data arise frequently in biomedical, economical, and ecological studies. Examples include substance abuse, medical costs, medical care utilization, biomarkers (e.g., CD4 cell counts, coronary artery calcium scores), single cell gene expression rates, and (relative) abundance of microbiome. Such data are often characterized by the presence of a large portion of zero values and positive continuous values that are skewed to the right and heteroscedastic. Both of these features suggest that no simple parametric distribution may be suitable for modeling such type of outcomes. In this paper, we review statistical methods for analyzing zero-inflated nonnegative outcome data. We will start with the cross-sectional setting, discussing ways to separate zero and positive values and introducing flexible models to characterize right skewness and heteroscedasticity in the positive values. We will then present models of correlated zero-inflated nonnegative continuous data, using random effects to tackle the correlation on repeated measures from the same subject and that across different parts of the model. We will also discuss expansion to related topics, for example, zero-inflated count and survival data, nonlinear covariate effects, and joint models of longitudinal zero-inflated nonnegative continuous data and survival. Finally, we will present applications to three real datasets (i.e., microbiome, medical costs, and alcohol drinking) to illustrate these methods. Example code will be provided to facilitate applications of these methods."
to:NB  statistics  regression  zero-inflation
july 2019 by cshalizi
[1509.09169] Lecture notes on ridge regression
"The linear regression model cannot be fitted to high-dimensional data, as the high-dimensionality brings about empirical non-identifiability. Penalized regression overcomes this non-identifiability by augmentation of the loss function by a penalty (i.e., a function of the regression coefficients). The ridge penalty is the sum of squared regression coefficients, giving rise to ridge regression. Here many aspects of ridge regression are reviewed, e.g., moments, mean squared error, its equivalence to constrained estimation, and its relation to Bayesian regression. Finally, its behaviour and use are illustrated in simulation and on omics data. Subsequently, ridge regression is generalized to allow for a more general penalty. The ridge penalization framework is then translated to logistic regression and its properties are shown to carry over. To contrast ridge penalized estimation, the final chapter introduces its lasso counterpart."
to:NB  regression  linear_regression  ridge_regression  statistics
july 2019 by cshalizi
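--- The basic ridge estimator from the notes, (X'X + λI)^{-1} X'y, in a few lines of numpy; the p > n design below (where plain least squares is non-identifiable but ridge is fine) is my illustrative choice:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator: minimizes ||y - X b||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
n, p = 20, 50                      # p > n: OLS is non-identifiable, ridge is not
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.0, 0.5]
y = X @ beta + 0.1 * rng.normal(size=n)

b_small = ridge(X, y, 0.1)         # light penalty: near-interpolating fit
b_big = ridge(X, y, 1e6)           # heavy penalty: coefficients shrunk toward 0
```

The two fits illustrate the penalty's role: the coefficient norm shrinks monotonically as λ grows, which is the constrained-estimation view the notes develop.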
[1402.2734] Graph-based Multivariate Conditional Autoregressive Models
"The conditional autoregressive model is a routinely used statistical model for areal data that arise from, for instance, epidemiological, socio-economic or ecological studies. Various multivariate conditional autoregressive models have also been extensively studied in the literature and it has been shown that extending from the univariate case to the multivariate case is not trivial. The difficulties lie in many aspects, including validity, interpretability, flexibility and computational feasibility of the model. In this paper, we approach the multivariate modeling from an element-based perspective instead of the traditional vector-based perspective. We focus on the joint adjacency structure of elements and discuss graphical structures for both the spatial and non-spatial domains. We assume that the graph for the spatial domain is generally known and fixed while the graph for the non-spatial domain can be unknown and random. We propose a very general specification for the multivariate conditional modeling and then focus on three special cases, which are linked to well known models in the literature. Bayesian inference for parameter learning and graph learning is provided for the focused cases, and finally, an example with public health data is illustrated."
to:NB  spatial_statistics  regression  statistics  network_data_analysis
july 2019 by cshalizi
[1703.04467] spmoran: An R package for Moran's eigenvector-based spatial regression analysis
"This study illustrates how to use "spmoran," which is an R package for Moran's eigenvector-based spatial regression analysis for up to millions of observations. This package estimates fixed or random effects eigenvector spatial filtering models and their extensions including a spatially varying coefficient model, a spatial unconditional quantile regression model, and low rank spatial econometric models. These models are estimated computationally efficiently."

--- ETA after reading: The approach sounds interesting enough that I want to track down the references that actually explain it, rather than just the software.
in_NB  spatial_statistics  regression  statistics  to_teach:data_over_space_and_time  R  have_read
july 2019 by cshalizi
The Standard Errors of Persistence
"A large literature on persistence finds that many modern outcomes strongly reflect characteristics of the same places in the distant past. However, alongside unusually high t statistics, these regressions display severe spatial auto-correlation in residuals, and the purpose of this paper is to examine whether these two properties might be connected. We start by running artificial regressions where both variables are spatial noise and find that, even for modest ranges of spatial correlation between points, t statistics become severely inflated leading to significance levels that are in error by several orders of magnitude. We analyse 27 persistence studies in leading journals and find that in most cases if we replace the main explanatory variable with spatial noise the fit of the regression commonly improves; and if we replace the dependent variable with spatial noise, the persistence variable can still explain it at high significance levels. We can predict in advance which persistence results might be the outcome of fitting spatial noise from the degree of spatial autocorrelation in their residuals measured by a standard Moran statistic. Our findings suggest that the results of persistence studies, and of spatial regressions more generally, might be treated with some caution in the absence of reported Moran statistics and noise simulations."
in_NB  have_read  econometrics  regression  spatial_statistics  to_teach:data_over_space_and_time  via:jbdelong  to_teach:linear_models
july 2019 by cshalizi
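--- A quick simulation in the spirit of the paper's artificial regressions: regress one spatially correlated noise field on another, independent one, and watch the naive OLS t statistic reject far more often than the nominal 5%. The exponential covariogram and its range parameter are my choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(5)

def spatial_noise(L, rng):
    """One draw of a Gaussian field with covariance L L'."""
    return L @ rng.normal(size=L.shape[0])

def ols_t(x, y):
    """Naive OLS t statistic for the slope of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    s2 = r @ r / (len(y) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return b[1] / np.sqrt(cov[1, 1])

n, reps = 100, 200
pts = rng.random((n, 2))                              # sites in the unit square
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
C = np.exp(-dists / 0.5) + 1e-8 * np.eye(n)           # exponential covariogram
L = np.linalg.cholesky(C)

# fraction of nominal-5% rejections when both fields are pure spatial noise
rej = np.mean([abs(ols_t(spatial_noise(L, rng), spatial_noise(L, rng))) > 1.96
               for _ in range(reps)])
```

With a range parameter this large relative to the study area, the rejection rate comes out far above 0.05, which is exactly the inflation the paper documents.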
Cheng , Chen : Nonparametric inference via bootstrapping the debiased estimator
"In this paper, we propose to construct confidence bands by bootstrapping the debiased kernel density estimator (for density estimation) and the debiased local polynomial regression estimator (for regression analysis). The idea of using a debiased estimator was recently employed by Calonico et al. (2018b) to construct a confidence interval of the density function (and regression function) at a given point by explicitly estimating stochastic variations. We extend their ideas of using the debiased estimator and further propose a bootstrap approach for constructing simultaneous confidence bands. This modified method has an advantage that we can easily choose the smoothing bandwidth from conventional bandwidth selectors and the confidence band will be asymptotically valid. We prove the validity of the bootstrap confidence band and generalize it to density level sets and inverse regression problems. Simulation studies confirm the validity of the proposed confidence bands/sets. We apply our approach to an Astronomy dataset to show its applicability."
to:NB  to_read  statistics  bootstrap  confidence_sets  regression  density_estimation  re:ADAfaEPoV
july 2019 by cshalizi
Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics: Vol 0, No 0
"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand. "
to:NB  computational_statistics  linear_regression  regression  databases  lumley.thomas  to_teach:statcomp
june 2019 by cshalizi
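--- A sketch of the sample-then-polish idea, with logistic regression standing in for the GLM: fit on a subsample of roughly n^{1/2+δ} rows, then take one Fisher-scoring step on the full data. The subsample size and simulated data are illustrative choices of mine; in the actual method the two stages run as one sampling query and one aggregation query against the database:

```python
import numpy as np

def logistic_newton(X, y, beta0, steps):
    """Fisher-scoring (Newton) steps for logistic regression."""
    beta = beta0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        grad = X.T @ (y - p)
        H = X.T @ (X * W[:, None])       # Fisher information
        beta += np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(3)
n, p = 20000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 1.0, -1.0, 0.25])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

# stage 1: fit on a subsample of roughly n^{0.6} rows (the "sampling query")
m = int(n ** 0.6)
idx = rng.choice(n, size=m, replace=False)
beta_sub = logistic_newton(X[idx], y[idx], np.zeros(p), steps=25)

# stage 2: one Newton step on the full data (the "aggregation query")
beta_polished = logistic_newton(X, y, beta_sub, steps=1)

# full MLE, for comparison only; the point is we never needed it
beta_mle = logistic_newton(X, y, np.zeros(p), steps=25)
```

The single polishing step moves the subsample estimate essentially onto the full-data MLE, which is the asymptotic-equivalence claim in the abstract.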
[1901.03719] Non-Parametric Inference Adaptive to Intrinsic Dimension
"We consider non-parametric estimation and inference of conditional moment models in high dimensions. We show that even when the dimension D of the conditioning variable is larger than the sample size n, estimation and inference is feasible as long as the distribution of the conditioning variable has small intrinsic dimension d, as measured by locally low doubling measures. Our estimation is based on a sub-sampled ensemble of the k-nearest neighbors (k-NN) Z-estimator. We show that if the intrinsic dimension of the covariate distribution is equal to d, then the finite sample estimation error of our estimator is of order n^{-1/(d+2)} and our estimate is n^{1/(d+2)}-asymptotically normal, irrespective of D. The sub-sampling size required for achieving these results depends on the unknown intrinsic dimension d. We propose an adaptive data-driven approach for choosing this parameter and prove that it achieves the desired rates. We discuss extensions and applications to heterogeneous treatment effect estimation."
to:NB  regression  high-dimensional_statistics  statistics
june 2019 by cshalizi
[1906.07177] (f)RFCDE: Random Forests for Conditional Density Estimation and Functional Data
"Random forests is a common non-parametric regression technique which performs well for mixed-type unordered data and irrelevant features, while being robust to monotonic variable transformations. Standard random forests, however, do not efficiently handle functional data and run into a curse of dimensionality when presented with high-resolution curves and surfaces. Furthermore, in settings with heteroskedasticity or multimodality, a regression point estimate with standard errors does not fully capture the uncertainty in our predictions. A more informative quantity is the conditional density p(y | x), which describes the full extent of the uncertainty in the response y given covariates x. In this paper we show how random forests can be efficiently leveraged for conditional density estimation, functional covariates, and multiple responses without increasing computational complexity. We provide open-source software for all procedures with R and Python versions that call a common C++ library."
to:NB  ensemble_methods  regression  density_estimation  statistics  kith_and_kin  decision_trees  lee.ann_b.  random_forests
june 2019 by cshalizi
[1702.03377] Uniform confidence bands for nonparametric errors-in-variables regression
"This paper develops a method to construct uniform confidence bands for a nonparametric regression function where a predictor variable is subject to a measurement error. We allow for the distribution of the measurement error to be unknown, but assume the availability of validation data or repeated measurements on the latent predictor variable. The proposed confidence band builds on the deconvolution kernel estimation and a novel application of the multiplier bootstrap method. We establish asymptotic validity of the proposed confidence band. To our knowledge, this is the first paper to derive asymptotically valid uniform confidence bands for nonparametric errors-in-variables regression."
to:NB  regression  confidence_sets  nonparametrics  statistics  errors-in-variables
june 2019 by cshalizi
[1801.06229] Anchor regression: heterogeneous data meets causality
"Estimating causal parameters from observational data is notoriously difficult. Popular approaches such as regression adjustment or the instrumental variables approach only work under relatively strong assumptions and are prone to mistakes. Furthermore, causal parameters can exhibit conservative predictive performance which can limit their usefulness in practice. Causal parameters can be written as the solution to a minimax risk problem, where the maximum is taken over a range of interventional (or perturbed) distributions. This motivates anchor regression, a method that makes use of exogenous variables to solve a relaxation of the "causal" minimax problem. The procedure naturally provides an interpolation between the solution to ordinary least squares and two-stage least squares, but also has predictive guarantees if the instrumental variables assumptions are violated. We derive guarantees of the proposed procedure for predictive performance under perturbations for the population case and for high-dimensional data. An additional characterization of the procedure is given in terms of quantiles: If the data follow a Gaussian distribution, the method minimizes quantiles of the conditional mean squared error. If anchor regression and least squares provide the same answer ("anchor stability"), the relationship between targets and predictors is unconfounded and the coefficients have a causal interpretation. Furthermore, we show under which conditions anchor regression satisfies replicability among different experiments. Anchor regression is shown empirically to improve replicability and protect against distributional shifts."
to:NB  statistics  causal_inference  heard_the_talk  peters.jonas  buhlmann.peter  regression  instrumental_variables
june 2019 by cshalizi
An oracle property of the Nadaraya–Watson kernel estimator for high‐dimensional nonparametric regression - Conn - - Scandinavian Journal of Statistics - Wiley Online Library
"The Nadaraya–Watson estimator is among the most studied nonparametric regression methods. A classical result is that its convergence rate depends on the number of covariates and deteriorates quickly as the dimension grows. This underscores the “curse of dimensionality” and has limited its use in high‐dimensional settings. In this paper, however, we show that the Nadaraya–Watson estimator has an oracle property such that when the true regression function is single‐ or multi‐index, it discovers the low‐rank dependence structure between the response and the covariates, mitigating the curse of dimensionality. Specifically, we prove that, using K‐fold cross‐validation and a positive‐semidefinite bandwidth matrix, the Nadaraya–Watson estimator has a convergence rate that depends on the number of indices rather than on the number of covariates. This result follows by allowing the bandwidths to diverge to infinity rather than restricting them all to converge to zero at certain rates, as in previous theoretical studies."
to:NB  regression  kernel_estimators  smoothing  high-dimensional_statistics
june 2019 by cshalizi
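--- The Nadaraya-Watson estimator with a positive-semidefinite bandwidth matrix, in numpy. The single-index data-generating process mirrors the setting of the oracle result above, but the (diagonal) bandwidth here is hand-picked rather than chosen by K-fold cross-validation as in the paper:

```python
import numpy as np

def nadaraya_watson(x0, X, y, H):
    """NW estimate at x0 with a Gaussian kernel and bandwidth matrix H."""
    Hinv = np.linalg.inv(H)
    d = X - x0
    # Gaussian kernel weights exp(-(x_i - x0)' H^{-1} (x_i - x0) / 2)
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', d, Hinv, d))
    return (w @ y) / w.sum()

rng = np.random.default_rng(4)
n, D = 500, 5
X = rng.normal(size=(n, D))
# single-index truth: y depends on X only through its first coordinate
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

H = 0.25 * np.eye(D)               # hand-picked diagonal bandwidth matrix
x0 = np.zeros(D)
est = nadaraya_watson(x0, X, y, H)  # truth at x0 is sin(0) = 0
```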
Lugosi , Mendelson : Regularization, sparse recovery, and median-of-means tournaments
"We introduce a regularized risk minimization procedure for regression function estimation. The procedure is based on median-of-means tournaments, introduced by the authors in Lugosi and Mendelson (2018) and achieves near optimal accuracy and confidence under general conditions, including heavy-tailed predictor and response variables. It outperforms standard regularized empirical risk minimization procedures such as LASSO or SLOPE in heavy-tailed problems."
to:NB  regression  statistics  heavy_tails  mendelson.sahar  lugosi.gabor
june 2019 by cshalizi
[1906.05746] Nonlinear System Identification via Tensor Completion
"Function approximation from input and output data pairs constitutes a fundamental problem in supervised learning. Deep neural networks are currently the most popular method for learning to mimic the input-output relationship of a generic nonlinear system, as they have proven to be very effective in approximating complex highly nonlinear functions. In this work, we propose low-rank tensor completion as an appealing alternative for modeling and learning complex nonlinear systems. We model the interactions between the N input variables and the scalar output of a system by a single N-way tensor, and setup a weighted low-rank tensor completion problem with smoothness regularization which we tackle using a block coordinate descent algorithm. We extend our method to the multi-output setting and the case of partially observed data, which cannot be readily handled by neural networks. Finally, we demonstrate the effectiveness of the approach using several regression tasks including some standard benchmarks and a challenging student grade prediction task."
to:NB  approximation  tensors  regression  computational_statistics  statistics
june 2019 by cshalizi
[1906.01990] A Model-free Approach to Linear Least Squares Regression with Exact Probabilities and Applications to Covariate Selection
"The classical model for linear regression is Y = Xβ + σε with i.i.d. standard Gaussian errors. Much of the resulting statistical inference is based on Fisher's F-distribution. In this paper we give two approaches to least squares regression which are model free. The results hold for all data (y, X). The derived probabilities are not only exact, they agree with those using the F-distribution based on the classical model. This is achieved by replacing questions about the size of β_j, for example β_j = 0, by questions about the degree to which the covariate x_j is better than Gaussian white noise or, alternatively, a random orthogonal rotation of x_j. The idea can be extended to choice of covariates, post-selection inference (PoSI), step-wise choice of covariates, the determination of dependency graphs, and to robust regression and non-linear regression. In the latter two cases the probabilities are no longer exact but are based on the chi-squared distribution. The step-wise choice of covariates is of particular interest: it is very simple, very fast, and very powerful; it controls the number of false positives and does not overfit even when the number of covariates far exceeds the sample size."
in_NB  linear_regression  regression  statistics  to_be_shot_after_a_fair_trial  variable_selection
june 2019 by cshalizi
[1905.11436] Kalman Filter, Sensor Fusion, and Constrained Regression: Equivalences and Insights
"The Kalman filter (KF) is one of the most widely used tools for data assimilation and sequential estimation. In this paper, we show that the state estimates from the KF in a standard linear dynamical system setting are exactly equivalent to those given by the KF in a transformed system, with infinite process noise (a "flat prior") and an augmented measurement space. This reformulation--which we refer to as augmented measurement sensor fusion (SF)--is conceptually interesting, because the transformed system here is seemingly static (as there is effectively no process model), but we can still capture the state dynamics inherent to the KF by folding the process model into the measurement space. Apart from being interesting, this reformulation of the KF turns out to be useful in problem settings in which past states are eventually observed (at some lag). In such problems, when we use the empirical covariance to estimate the measurement noise covariance, we show that the state predictions from augmented measurement SF are exactly equivalent to those from a regression of past states on past measurements, subject to particular linear constraints (reflecting the relationships encoded in the measurement map). This allows us to port standard ideas (say, regularization methods) in regression over to dynamical systems. For example, we can posit multiple candidate process models, fold all of them into the measurement model, transform to the regression perspective, and apply ℓ1 penalization to perform process model selection. We give various empirical demonstrations, and focus on an application to nowcasting the weekly incidence of influenza in the US."
state_estimation  kalman_filter  regression  rosenfeld.roni  tibshirani.ryan  kith_and_kin  statistics  to_teach:data_over_space_and_time  in_NB  have_read
may 2019 by cshalizi
[1905.10634] Adaptive, Distribution-Free Prediction Intervals for Deep Neural Networks
"This paper addresses the problem of assessing the variability of predictions from deep neural networks. There is a growing literature on using and improving the predictive accuracy of deep networks, but a concomitant improvement in the quantification of their uncertainty is lacking. We provide a prediction interval network (PI-Network) which is a transparent, tractable modification of the standard predictive loss used to train deep networks. The PI-Network outputs three values instead of a single point estimate and optimizes a loss function inspired by quantile regression. We go beyond merely motivating the construction of these networks and provide two prediction interval methods with provable, finite sample coverage guarantees without any assumptions on the underlying distribution from which our data is drawn. We only require that the observations are independent and identically distributed. Furthermore, our intervals adapt to heteroskedasticity and asymmetry in the conditional distribution of the response given the covariates. The first method leverages the conformal inference framework and provides average coverage. The second method provides a new, stronger guarantee by conditioning on the observed data. Lastly, our loss function does not compromise the predictive accuracy of the network like other prediction interval methods. We demonstrate the ease of use of the PI-Network as well as its improvements over other methods on both simulated and real data. As the PI-Network can be used with a host of deep learning methods with only minor modifications, its use should become standard practice, much like reporting standard errors along with mean estimates."
to:NB  prediction  confidence_sets  neural_networks  regression  leeb.hannes  statistics  uncertainty_for_neural_networks
may 2019 by cshalizi
[1905.10176] Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments
"We consider the estimation of heterogeneous treatment effects with arbitrary machine learning methods in the presence of unobserved confounders with the aid of a valid instrument. Such settings arise in A/B tests with an intent-to-treat structure, where the experimenter randomizes over which user will receive a recommendation to take an action, and we are interested in the effect of the downstream action. We develop a statistical learning approach to the estimation of heterogeneous effects, reducing the problem to the minimization of an appropriate loss function that depends on a set of auxiliary models (each corresponding to a separate prediction task). The reduction enables the use of all recent algorithmic advances (e.g. neural nets, forests). We show that the estimated effect model is robust to estimation errors in the auxiliary models, by showing that the loss satisfies a Neyman orthogonality criterion. Our approach can be used to estimate projections of the true effect model on simpler hypothesis spaces. When these spaces are parametric, then the parameter estimates are asymptotically normal, which enables construction of confidence sets. We applied our method to estimate the effect of membership on downstream webpage engagement on TripAdvisor, using as an instrument an intent-to-treat A/B test among 4 million TripAdvisor users, where some users received an easier membership sign-up process. We also validate our method on synthetic data and on public datasets for the effects of schooling on income."
to:NB  instrumental_variables  nonparametrics  regression  causal_inference  statistics
may 2019 by cshalizi
Bauer , Kohler : On deep learning as a remedy for the curse of dimensionality in nonparametric regression
"Assuming that a smoothness condition and a suitable restriction on the structure of the regression function hold, it is shown that least squares estimates based on multilayer feedforward neural networks are able to circumvent the curse of dimensionality in nonparametric regression. The proof is based on new approximation results concerning multilayer feedforward neural networks with bounded weights and a bounded number of hidden neurons. The estimates are compared with various other approaches by using simulated data."

!!!

ETA: "circumvent[ing] the curse of dimensionality" here means that the effective dimensionality is the largest order of interaction in the true regression function, which (I think) corresponds to the largest in-degree of nodes in the network. So it's more like saying "additive models circumvent the curse of dimensionality" than "deep learning is magic". (The paper is clear about this.)
to:NB  regression  neural_networks  learning_theory  statistics  have_skimmed
may 2019 by cshalizi
Han , Wellner : Convergence rates of least squares regression estimators with heavy-tailed errors
"We study the performance of the least squares estimator (LSE) in a general nonparametric regression model, when the errors are independent of the covariates but may only have a pth moment (p≥1). In such a heavy-tailed regression setting, we show that if the model satisfies a standard “entropy condition” with exponent α∈(0,2), then the L2 loss of the LSE converges at a rate
O_P(n^{−1/(2+α)} ∨ n^{−1/2+1/(2p)}).
Such a rate cannot be improved under the entropy condition alone.
"This rate quantifies both some positive and negative aspects of the LSE in a heavy-tailed regression setting. On the positive side, as long as the errors have p≥1+2/α moments, the L2 loss of the LSE converges at the same rate as if the errors are Gaussian. On the negative side, if p<1+2/α, there are (many) hard models at any entropy level α for which the L2 loss of the LSE converges at a strictly slower rate than other robust estimators.
"The validity of the above rate relies crucially on the independence of the covariates and the errors. In fact, the L2 loss of the LSE can converge arbitrarily slowly when the independence fails.
"The key technical ingredient is a new multiplier inequality that gives sharp bounds for the “multiplier empirical process” associated with the LSE. We further give an application to the sparse linear regression model with heavy-tailed covariates and errors to demonstrate the scope of this new inequality."
to:NB  regression  empirical_processes  statistics  heavy_tails
may 2019 by cshalizi
[1303.2236] COBRA: A Combined Regression Strategy
"A new method for combining several initial estimators of the regression function is introduced. Instead of building a linear or convex optimized combination over a collection of basic estimators r1,…,rM, we use them as a collective indicator of the proximity between the training data and a test observation. This local distance approach is model-free and very fast. More specifically, the resulting nonparametric/nonlinear combined estimator is shown to perform asymptotically at least as well in the L2 sense as the best combination of the basic estimators in the collective. A companion R package called COBRA (standing for COmBined Regression Alternative) is presented (downloadable at this http URL). Substantial numerical evidence is provided on both synthetic and real data sets to assess the excellent performance and velocity of our method in a large variety of prediction problems."
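--- The aggregation rule is easy to state in code. A rough sketch of mine, not the COBRA package; the two "machines" (polynomial fits), the degrees, and the threshold eps are arbitrary illustrative choices:

```python
import numpy as np

def cobra_predict(preds_train, y_train, preds_query, eps=0.1):
    """COBRA-style aggregation: average the training responses whose
    basic-machine predictions all lie within eps of the machines'
    predictions at the query point."""
    close = np.all(np.abs(preds_train - preds_query) <= eps, axis=1)
    if not close.any():
        return y_train.mean()   # fallback when no training point qualifies
    return y_train[close].mean()

# Toy use with two crude "machines" (here, polynomial fits).
rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=500)
m1 = np.poly1d(np.polyfit(x, y, 3))   # machine 1: cubic fit
m2 = np.poly1d(np.polyfit(x, y, 7))   # machine 2: degree-7 fit
P = np.column_stack([m1(x), m2(x)])
pred = cobra_predict(P, y, np.array([m1(0.25), m2(0.25)]))
# pred should sit near sin(pi/2) = 1
```

Note the point of the method: no distances in x are ever computed; the machines' agreement defines the neighborhood.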
to:NB  ensemble_methods  statistics  regression
may 2019 by cshalizi
Bootstrap of residual processes in regression: to smooth or not to smooth? | Biometrika | Oxford Academic
"In this paper we consider regression models with centred errors, independent of the covariates. Given independent and identically distributed data and given an estimator of the regression function, which can be parametric or nonparametric in nature, we estimate the distribution of the error term by the empirical distribution of estimated residuals. To approximate the distribution of this estimator, Koul & Lahiri (1994) and Neumeyer (2009) proposed bootstrap procedures based on smoothing the residuals before drawing bootstrap samples. So far it has been an open question as to whether a classical nonsmooth residual bootstrap is asymptotically valid in this context. Here we solve this open problem and show that the nonsmooth residual bootstrap is consistent. We illustrate the theoretical result by means of simulations, which demonstrate the accuracy of this bootstrap procedure for various models, testing procedures and sample sizes."
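--- The nonsmooth residual bootstrap the paper vindicates is the textbook procedure; a minimal sketch of mine, on made-up simulated data:

```python
import numpy as np

# Made-up linear-model data; the point is the loop, which resamples
# *raw* centred residuals with no kernel smoothing.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
resid -= resid.mean()          # centre, matching the centred-error model

B = 2000
slopes = np.empty(B)
for b in range(B):
    y_star = X @ beta_hat + rng.choice(resid, size=n, replace=True)
    slopes[b] = np.linalg.lstsq(X, y_star, rcond=None)[0][1]
se_boot = slopes.std()         # bootstrap standard error of the slope
```

The smoothed variants of Koul & Lahiri and Neumeyer would perturb each resampled residual with kernel noise; the result here is that the plain version above is already consistent.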
to:NB  regression  bootstrap  statistics
may 2019 by cshalizi
Nonparametric Estimation of Triangular Simultaneous Equations Models on JSTOR
"This paper presents a simple two-step nonparametric estimator for a triangular simultaneous equation model. Our approach employs series approximations that exploit the additive structure of the model. The first step comprises the nonparametric estimation of the reduced form and the corresponding residuals. The second step is the estimation of the primary equation via nonparametric regression with the reduced form residuals included as a regressor. We derive consistency and asymptotic normality results for our estimator, including optimal convergence rates. Finally we present an empirical example, based on the relationship between the hourly wage rate and annual hours worked, which illustrates the utility of our approach."
to:NB  nonparametrics  instrumental_variables  causal_inference  statistics  regression  econometrics  re:ADAfaEPoV
april 2019 by cshalizi
AEA Web - American Economic Review - 103(3):550 - Abstract
"In many economic models, objects of interest are functions which satisfy conditional moment restrictions. Economics does not restrict the functional form of these models, motivating nonparametric methods. In this paper we review identification results and describe a simple nonparametric instrumental variables (NPIV) estimator. We also consider a simple method of inference. In addition we show how the ability to uncover nonlinearities with conditional moment restrictions is related to the strength of the instruments. We point to applications where important nonlinearities can be found with NPIV and applications where they cannot."
to:NB  nonparametrics  instrumental_variables  regression  causal_inference  statistics  econometrics  re:ADAfaEPoV
april 2019 by cshalizi
Nonparametric Instrumental Regression - Darolles - 2011 - Econometrica - Wiley Online Library
"The focus of this paper is the nonparametric estimation of an instrumental regression function ϕ defined by conditional moment restrictions that stem from a structural econometric model E[Y−ϕ(Z)|W]=0, and involve endogenous variables Y and Z and instruments W. The function ϕ is the solution of an ill‐posed inverse problem and we propose an estimation procedure based on Tikhonov regularization. The paper analyzes identification and overidentification of this model, and presents asymptotic properties of the estimated nonparametric instrumental regression function."
to:NB  nonparametrics  instrumental_variables  causal_inference  statistics  inverse_problems  regression  econometrics  re:ADAfaEPoV
april 2019 by cshalizi
Instrumental Variable Estimation of Nonparametric Models - Newey - 2003 - Econometrica - Wiley Online Library
"In econometrics there are many occasions where knowledge of the structural relationship among dependent variables is required to answer questions of interest. This paper gives identification and estimation results for nonparametric conditional moment restrictions. We characterize identification of structural functions as completeness of certain conditional distributions, and give sufficient identification conditions for exponential families and discrete variables. We also give a consistent, nonparametric estimator of the structural function. The estimator is nonparametric two‐stage least squares based on series approximation, which overcomes an ill‐posed inverse problem by placing bounds on integrals of higher‐order derivatives."
to:NB  instrumental_variables  nonparametrics  regression  causal_inference  statistics  econometrics
april 2019 by cshalizi
A Note on Parametric and Nonparametric Regression in the Presence of Endogenous Control Variables by Markus Frölich :: SSRN
"This note argues that nonparametric regression not only relaxes functional form assumptions vis-a-vis parametric regression, but that it also permits endogenous control variables. To control for selection bias or to make an exclusion restriction in instrumental variables regression valid, additional control variables are often added to a regression. If any of these control variables is endogenous, OLS or 2SLS would be inconsistent and would require further instrumental variables. Nonparametric approaches are still consistent, though. A few examples are examined and it is found that the asymptotic bias of OLS can indeed be very large."
to:NB  causal_inference  instrumental_variables  nonparametrics  regression  statistics  re:ADAfaEPoV
april 2019 by cshalizi
Nonparametric Instrumental Regression
"The focus of the paper is the nonparametric estimation of an instrumental regression function P defined by conditional moment restrictions stemming from a structural econometric model : E[Y-P(Z)|W]=0 and involving endogenous variables Y and Z and instruments W. The function P is the solution of an ill-posed inverse problem and we propose an estimation procedure based on Tikhonov regularization. The paper analyses identification and overidentification of this model and presents asymptotic properties of the estimated nonparametric instrumental regression function."

--- Was this ever published? It definitely seems like the most elegant approach to nonparametric IVs I've seen (French econometricians!).
to:NB  have_read  regression  instrumental_variables  nonparametrics  inverse_problems  causal_inference  re:ADAfaEPoV  econometrics
april 2019 by cshalizi
[1903.04641] Generalized Sparse Additive Models
"We present a unified framework for estimation and analysis of generalized additive models in high dimensions. The framework defines a large class of penalized regression estimators, encompassing many existing methods. An efficient computational algorithm for this class is presented that easily scales to thousands of observations and features. We prove minimax optimal convergence bounds for this class under a weak compatibility condition. In addition, we characterize the rate of convergence when this compatibility condition is not met. Finally, we also show that the optimal penalty parameters for structure and sparsity penalties in our framework are linked, allowing cross-validation to be conducted over only a single tuning parameter. We complement our theoretical results with empirical studies comparing some existing methods within this framework."
to:NB  sparsity  regression  additive_models  high-dimensional_statistics  statistics
april 2019 by cshalizi
[1903.08560] Surprises in High-Dimensional Ridgeless Least Squares Interpolation
"Interpolators---estimators that achieve zero training error---have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2 norm ("ridgeless") interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors xi∈ℝp are obtained by applying a linear transform to a vector of i.i.d. entries, xi=Σ1/2zi (with zi∈ℝp); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, xi=φ(Wzi) (with zi∈ℝd, W∈ℝp×d a matrix of i.i.d. entries, and φ an activation function acting componentwise on Wzi). We recover---in a precise quantitative way---several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization."
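--- "Ridgeless" is just the minimum-ℓ2-norm interpolator, i.e., the pseudoinverse solution; a toy numpy illustration (mine, not the paper's setup) in the n < p regime:

```python
import numpy as np

# Toy overparametrized least squares: n observations, p > n features.
rng = np.random.default_rng(3)
n, p = 50, 200
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-l2-norm interpolator, i.e. the lambda -> 0+ limit of ridge.
beta = np.linalg.pinv(X) @ y

# Any other interpolator differs from beta by a null-space vector of X,
# which (being orthogonal to the row space) can only increase the norm.
v = rng.normal(size=p)
v_null = v - np.linalg.pinv(X) @ (X @ v)   # project v onto null space of X
beta_other = beta + v_null                 # still fits the data exactly
```

Both vectors achieve zero training error; the "ridgeless" one is the shortest such vector, which is what the risk analysis in the paper is about.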

--- "Heard the talk" = "Ryan came into my office to explain it all because he was so enthused".
to:NB  to_read  regression  high-dimensional_statistics  interpolation  kith_and_kin  tibshirani.ryan  rosset.saharon  montanari.andrea  hastie.trevor  statistics  neural_networks  heard_the_talk
april 2019 by cshalizi
[1904.01058] Tree Boosted Varying Coefficient Models
"This paper investigates the integration of gradient boosted decision trees and varying coefficient models. We introduce the tree boosted varying coefficient framework which justifies the implementation of decision tree boosting as the nonparametric effect modifiers in varying coefficient models. This framework requires no structural assumptions in the space containing the varying coefficient covariates, is easy to implement, and keeps a balance between model complexity and interpretability. To provide statistical guarantees, we prove the asymptotic consistency of the proposed method under the regression settings with L2 loss. We further conduct a thorough empirical study to show that the proposed method is capable of providing accurate predictions as well as intelligible visual explanations."
to:NB  regression  ensemble_methods  nonparametrics  statistics  hooker.giles
april 2019 by cshalizi
Generalized additive models with flexible response functions | SpringerLink
"Common generalized linear models depend on several assumptions: (i) the specified linear predictor, (ii) the chosen response distribution that determines the likelihood and (iii) the response function that maps the linear predictor to the conditional expectation of the response. Generalized additive models (GAM) provide a convenient way to overcome the restriction to purely linear predictors. Therefore, the covariates may be included as flexible nonlinear or spatial functions to avoid potential bias arising from misspecification. Single index models, on the other hand, utilize flexible specifications of the response function and therefore avoid the deteriorating impact of a misspecified response function. However, such single index models are usually restricted to a linear predictor and aim to compensate for potential nonlinear structures only via the estimated response function. We will show that this is insufficient in many cases and present a solution by combining a flexible approach for response function estimation using monotonic P-splines with additive predictors as in GAMs. Our approach is based on maximum likelihood estimation and also allows us to provide confidence intervals of the estimated effects. To compare our approach with existing ones, we conduct extensive simulation studies and apply our approach on two empirical examples, namely the mortality rate in São Paulo due to respiratory diseases based on the Poisson distribution and credit scoring of a German bank with binary responses."
to:NB  additive_models  regression  statistics
february 2019 by cshalizi
[1608.00696] Can we trust the bootstrap in high-dimension?
"We consider the performance of the bootstrap in high-dimensions for the setting of linear regression, where p<n but p/n is not close to zero. We consider ordinary least-squares as well as robust regression methods and adopt a minimalist performance requirement: can the bootstrap give us good confidence intervals for a single coordinate of β? (where β is the true regression vector).
"We show through a mix of numerical and theoretical work that the bootstrap is fraught with problems. Both of the most commonly used methods of bootstrapping for regression -- residual bootstrap and pairs bootstrap -- give very poor inference on β as the ratio p/n grows. We find that the residual bootstrap tends to give anti-conservative estimates (inflated Type I error), while the pairs bootstrap gives very conservative estimates (severe loss of power) as the ratio p/n grows. We also show that the jackknife resampling technique for estimating the variance of β̂ severely overestimates the variance in high dimensions.
"We contribute alternative bootstrap procedures based on our theoretical results that mitigate these problems. However, the corrections depend on assumptions regarding the underlying data-generation model, suggesting that in high-dimensions it may be difficult to have universal, robust bootstrapping techniques."
to:NB  bootstrap  high-dimensional_statistics  statistics  regression
february 2019 by cshalizi
Kernel Smoothing | Wiley Online Books
"Comprehensive theoretical overview of kernel smoothing methods with motivating examples
"Kernel smoothing is a flexible nonparametric curve estimation method that is applicable when parametric descriptions of the data are not sufficiently adequate. This book explores theory and methods of kernel smoothing in a variety of contexts, considering independent and correlated data e.g. with short-memory and long-memory correlations, as well as non-Gaussian data that are transformations of latent Gaussian processes. These types of data occur in many fields of research, e.g. the natural and the environmental sciences, and others. Nonparametric density estimation, nonparametric and semiparametric regression, trend and surface estimation in particular for time series and spatial data and other topics such as rapid change points, robustness etc. are introduced alongside a study of their theoretical properties and optimality issues, such as consistency and bandwidth selection.
"Addressing a variety of topics, Kernel Smoothing: Principles, Methods and Applications offers a user-friendly presentation of the mathematical content so that the reader can directly implement the formulas using any appropriate software. The overall aim of the book is to describe the methods and their theoretical backgrounds, while maintaining an analytically simple approach and including motivating examples—making it extremely useful in many sciences such as geophysics, climate research, forestry, ecology, and other natural and life sciences, as well as in finance, sociology, and engineering."
to:NB  books:noted  downloaded  kernel_estimators  smoothing  regression  statistics
january 2019 by cshalizi
Least Squares Data Fitting with Applications
"As one of the classical statistical regression techniques, and often the first to be taught to new students, least squares fitting can be a very effective tool in data analysis. Given measured data, we establish a relationship between independent and dependent variables so that we can use the data predictively. The main concern of Least Squares Data Fitting with Applications is how to do this on a computer with efficient and robust computational methods for linear and nonlinear relationships. The presentation also establishes a link between the statistical setting and the computational issues.
"In a number of applications, the accuracy and efficiency of the least squares fit is central, and Per Christian Hansen, Víctor Pereyra, and Godela Scherer survey modern computational methods and illustrate them in fields ranging from engineering and environmental sciences to geophysics. Anyone working with problems of linear and nonlinear least squares fitting will find this book invaluable as a hands-on guide, with accessible text and carefully explained problems."
to:NB  books:noted  regression  statistics
december 2018 by cshalizi
Confidence intervals for GLMs
For the trick about finding the inverse link function.
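--- The trick, as I read it: form the Wald interval on the linear-predictor scale, then push both endpoints through the inverse link. A self-contained numpy sketch for a logistic GLM, with the IRLS fit written out by hand (made-up data; the coefficients and evaluation points are illustrative, not from the linked page):

```python
import numpy as np

# Made-up logistic-regression data.
rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
expit = lambda z: 1 / (1 + np.exp(-z))   # inverse of the logit link
y = rng.binomial(1, expit(0.5 + x))

beta = np.zeros(2)
for _ in range(25):                      # Fisher scoring / IRLS
    mu = expit(X @ beta)
    W = mu * (1 - mu)
    z = X @ beta + (y - mu) / W          # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

mu = expit(X @ beta)
W = mu * (1 - mu)
cov = np.linalg.inv(X.T @ (W[:, None] * X))  # inverse Fisher information

# The trick: Wald interval on the linear-predictor scale, then apply
# the inverse link to both endpoints, so the interval stays in (0, 1)
# and inherits the asymmetry of the link.
Xnew = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
eta = Xnew @ beta
se = np.sqrt(np.einsum("ij,jk,ik->i", Xnew, cov, Xnew))
lo, hi = expit(eta - 1.96 * se), expit(eta + 1.96 * se)
```

The naive alternative (interval on the probability scale, ± 1.96 standard errors of the fitted probability) can stick out of [0, 1]; this one cannot.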
december 2018 by cshalizi
Object-oriented Computation of Sandwich Estimators | Zeileis | Journal of Statistical Software
"Sandwich covariance matrix estimators are a popular tool in applied regression modeling for performing inference that is robust to certain types of model misspecification. Suitable implementations are available in the R system for statistical computing for certain model fitting functions only (in particular lm()), but not for other standard regression functions, such as glm(), nls(), or survreg(). Therefore, conceptual tools and their translation to computational tools in the package sandwich are discussed, enabling the computation of sandwich estimators in general parametric models. Object orientation can be achieved by providing a few extractor functions (most importantly for the empirical estimating functions) from which various types of sandwich estimators can be computed."
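--- The bread-meat-bread computation itself is simple; a hedged numpy sketch of the HC0 sandwich for OLS (mine, not the package's object-oriented machinery), on made-up heteroskedastic data:

```python
import numpy as np

def hc0_sandwich(X, y):
    """OLS coefficients plus the HC0 sandwich covariance:
    bread @ meat @ bread."""
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    meat = X.T @ (X * resid[:, None] ** 2)   # sum_i e_i^2 x_i x_i'
    return beta, bread @ meat @ bread

# Made-up data where the error variance depends on the covariate,
# which is exactly when the sandwich matters.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + np.abs(x) * rng.normal(size=n)   # error sd grows with |x|
beta, V = hc0_sandwich(X, y)
se_slope = np.sqrt(V[1, 1])
```

The point of the paper is that in R this computation only needs an estimating-function extractor per model class, so the same code serves lm(), glm(), etc.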
to:NB  computational_statistics  R  estimation  regression  statistics  to_teach
october 2018 by cshalizi
Quantile Regression
"Quantile regression, as introduced by Koenker and Bassett (1978), may be viewed as an extension of classical least squares estimation of conditional mean models to the estimation of an ensemble of models for several conditional quantile functions. The central special case is the median regression estimator which minimizes a sum of absolute errors. Other conditional quantile functions are estimated by minimizing an asymmetrically weighted sum of absolute errors. Quantile regression methods are illustrated with applications to models for CEO pay, food expenditure, and infant birthweight."
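--- The asymmetric weighting is easy to see in code; a minimal sketch (mine, not Koenker and Bassett's) of the "pinball" loss, checking that minimizing it over a constant recovers the empirical quantile:

```python
import numpy as np

def pinball_loss(y, pred, tau):
    """The asymmetrically weighted absolute error of the abstract:
    tau on positive residuals, (1 - tau) on negative ones."""
    u = y - pred
    return np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

# Minimizing over a constant recovers the empirical tau-quantile;
# tau = 0.5 is the median / sum-of-absolute-errors special case.
rng = np.random.default_rng(0)
y = rng.exponential(size=10_000)
grid = np.linspace(0, 5, 2001)
best = grid[np.argmin([pinball_loss(y, c, 0.5) for c in grid])]
```

Replacing the constant with a linear function of covariates, and the grid search with linear programming, gives the regression estimator of the paper.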
to:NB  have_read  regression  statistics  econometrics
october 2018 by cshalizi
[1809.05651] Omitted and Included Variable Bias in Tests for Disparate Impact
"Policymakers often seek to gauge discrimination against groups defined by race, gender, and other protected attributes. One popular strategy is to estimate disparities after controlling for observed covariates, typically with a regression model. This approach, however, suffers from two statistical challenges. First, omitted-variable bias can skew results if the model does not control for all relevant factors; second, and conversely, included-variable bias can skew results if the set of controls includes irrelevant factors. Here we introduce a simple three-step strategy---which we call risk-adjusted regression---that addresses both concerns in settings where decision makers have clearly measurable objectives. In the first step, we use all available covariates to estimate the utility of possible decisions. In the second step, we measure disparities after controlling for these utility estimates alone, mitigating the problem of included-variable bias. Finally, in the third step, we examine the sensitivity of results to unmeasured confounding, addressing concerns about omitted-variable bias. We demonstrate this method on a detailed dataset of 2.2 million police stops of pedestrians in New York City, and show that traditional statistical tests of discrimination can yield misleading results. We conclude by discussing implications of our statistical approach for questions of law and policy."
to:NB  to_read  discrimination  racism  regression  statistics  goel.sharad
october 2018 by cshalizi
Generalized least squares can overcome the critical threshold in respondent-driven sampling | PNAS
"To sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like O(n^{−1}), where n is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is O(n^{−1}). We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from a random walk sample of the nodes. These theoretical results point the way to entirely different classes of estimators that account for the network structure beyond node degree. Diagnostic plots help to identify situations where feasible GLS estimators are more appropriate. The computational experiments show the potential benefits and also indicate that there is room to further develop these estimators in practical settings."
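--- Generic GLS for reference, since the whole argument turns on it: a sketch of mine with hypothetical AR(1)-correlated errors standing in for the dependence induced by referral (the paper's feasible estimators instead model the network itself):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Hypothetical correlated errors: AR(1) covariance, rho^|i-j|.
rho = 0.7
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Sigma)
y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=n)

# GLS: minimize (y - Xb)' Sigma^{-1} (y - Xb).
Si = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
var_gls = np.linalg.inv(X.T @ Si @ X)            # GLS sampling covariance

# OLS for comparison; its true sampling covariance under Sigma.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv
```

By the Gauss-Markov argument, the GLS variances are never larger than the OLS ones; the hard part, which the paper addresses, is getting a usable estimate of Sigma from the referral process.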
in_NB  respondent-driven_sampling  regression  rohe.karl  network_data_analysis  statistics
october 2018 by cshalizi
[1801.03896] Robust inference with knockoffs
"We consider the variable selection problem, which seeks to identify important variables influencing a response Y out of many candidate features X1,…,Xp. We wish to do so while offering finite-sample guarantees about the fraction of false positives - selected variables Xj that in fact have no effect on Y after the other features are known. When the number of features p is large (perhaps even larger than the sample size n), and we have no prior knowledge regarding the type of dependence between Y and X, the model-X knockoffs framework nonetheless allows us to select a model with a guaranteed bound on the false discovery rate, as long as the distribution of the feature vector X=(X1,…,Xp) is exactly known. This model selection procedure operates by constructing "knockoff copies" of each of the p features, which are then used as a control group to ensure that the model selection algorithm is not choosing too many irrelevant features. In this work, we study the practical setting where the distribution of X could only be estimated, rather than known exactly, and the knockoff copies of the Xj's are therefore constructed somewhat incorrectly. Our results, which are free of any modeling assumption whatsoever, show that the resulting model selection procedure incurs an inflation of the false discovery rate that is proportional to our errors in estimating the distribution of each feature Xj conditional on the remaining features {Xk:k≠j}. The model-X knockoff framework is therefore robust to errors in the underlying assumptions on the distribution of X, making it an effective method for many practical applications, such as genome-wide association studies, where the underlying distribution on the features X1,…,Xp is estimated accurately but not known exactly."
in_NB  regression  variable_selection  statistics  samworth.richard_j.  knockoffs  to_teach:linear_models
september 2018 by cshalizi
Archive ouverte HAL - The Great Regression. Machine Learning, Econometrics, and the Future of Quantitative Social Sciences
"What can machine learning do for (social) scientific analysis, and what can it do to it? A contribution to the emerging debate on the role of machine learning for the social sciences, this article offers an introduction to this class of statistical techniques. It details its premises, logic, and the challenges it faces. This is done by comparing machine learning to more classical approaches to quantification – most notably parametric regression– both at a general level and in practice. The article is thus an intervention in the contentious debates about the role and possible consequences of adopting statistical learning in science. We claim that the revolution announced by many and feared by others will not happen any time soon, at least not in the terms that both proponents and critics of the technique have spelled out. The growing use of machine learning is not so much ushering in a radically new quantitative era as it is fostering an increased competition between the newly termed classic method and the learning approach. This, in turn, results in more uncertainty with respect to quantified results. Surprisingly enough, this may be good news for knowledge overall."

--- The correct line here is that 90%+ of "machine learning" is rebranded non-parametric regression, which is what the social sciences should have been doing all along anyway, because they have no good theories which suggest particular parametric forms. (Partial exceptions: demography and epidemiology.) If the resulting confidence sets are bigger than they'd like, that's still the actual range of uncertainty they need to live with, until they can reduce it with more and better empirical information, or additional constraints from well-supported theories. (Arguably, this was all in Haavelmo.) I look forward to seeing whether this paper grasps these obvious truths.
to:NB  to_read  regression  social_science_methodology  machine_learning  via:phnk  econometrics  to_be_shot_after_a_fair_trial
august 2018 by cshalizi
[1806.06850] Polynomial Regression As an Alternative to Neural Nets
"Despite the success of neural networks (NNs), there is still a concern among many over their "black box" nature. Why do they work? Here we present a simple analytic argument that NNs are in fact essentially polynomial regression models. This view will have various implications for NNs, e.g. providing an explanation for why convergence problems arise in NNs, and it gives rough guidance on avoiding overfitting. In addition, we use this phenomenon to predict and confirm a multicollinearity property of NNs not previously reported in the literature. Most importantly, given this loose correspondence, one may choose to routinely use polynomial models instead of NNs, thus avoiding some major problems of the latter, such as having to set many tuning parameters and dealing with convergence issues. We present a number of empirical results; in each case, the accuracy of the polynomial approach matches or exceeds that of NN approaches. A many-featured, open-source software package, polyreg, is available."

--- Matloff is the author of my favorite "R programming for n00bs" textbook...
--- ETA after reading: the argument that multi-layer neural networks "are essentially" polynomial regression is a bit weak. It would be true, exactly, if activation functions were exactly polynomial, which however they rarely are in practice. If non-polynomial activations happen to be implemented in computational practice by polynomials (e.g., Taylor approximations), well, either we get different hardware or we crank up the degree of approximation as much as we like. (Said a little differently, if you buy this line of argument, you should buy that _every_ smooth statistical model "is essentially" polynomial regression, which seems a bit much.) It is, also, an argument about the function-approximation properties of the model classes, and not the fitting processes, despite the explicit disclaimers.
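--- To make the function-approximation point concrete, here is a small sketch (arbitrary weights, one hidden layer): with a polynomial activation the network output is exactly a polynomial in the inputs, and swapping tanh for its degree-3 Taylor polynomial changes the output only slightly when pre-activations are small.

```python
# Sketch of the polynomial-activation argument: replace tanh by its
# degree-3 Taylor polynomial z - z**3/3, so the network output becomes an
# exact polynomial in the inputs. Weights are arbitrary and kept small so
# pre-activations stay near zero, where the Taylor approximation is tight.
import numpy as np

rng = np.random.default_rng(1)
W1 = 0.1 * rng.standard_normal((4, 2))   # hidden-layer weights (small scale)
W2 = rng.standard_normal((1, 4))         # output-layer weights

def net(x, act):
    return (W2 @ act(W1 @ x))[0]

x = 0.5 * rng.standard_normal(2)
out_tanh = net(x, np.tanh)
out_poly = net(x, lambda z: z - z**3 / 3)   # exact cubic polynomial in x
gap = abs(out_tanh - out_poly)              # tiny for small pre-activations
```

Nothing here involves fitting; it only illustrates the approximation claim, and cranking up the Taylor degree shrinks `gap` further, which is exactly why the argument proves too much.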
to:NB  your_favorite_deep_neural_network_sucks  regression  neural_networks  statistics  matloff.norman  approximation  computational_statistics  have_read
july 2018 by cshalizi
[1706.08576] Invariant Causal Prediction for Nonlinear Models
"An important problem in many domains is to predict how a system will respond to interventions. This task is inherently linked to estimating the system's underlying causal structure. To this end, 'invariant causal prediction' (ICP) (Peters et al., 2016) has been proposed which learns a causal model exploiting the invariance of causal relations using data from different environments. When considering linear models, the implementation of ICP is relatively straight-forward. However, the nonlinear case is more challenging due to the difficulty of performing nonparametric tests for conditional independence. In this work, we present and evaluate an array of methods for nonlinear and nonparametric versions of ICP for learning the causal parents of given target variables. We find that an approach which first fits a nonlinear model with data pooled over all environments and then tests for differences between the residual distributions across environments is quite robust across a large variety of simulation settings. We call this procedure "Invariant residual distribution test". In general, we observe that the performance of all approaches is critically dependent on the true (unknown) causal structure and it becomes challenging to achieve high power if the parental set includes more than two variables. As a real-world example, we consider fertility rate modelling which is central to world population projections. We explore predicting the effect of hypothetical interventions using the accepted models from nonlinear ICP. The results reaffirm the previously observed central causal role of child mortality rates."
may 2018 by cshalizi
[1501.01332] Causal inference using invariant prediction: identification and confidence intervals
"What is the difference of a prediction that is made with a causal model and a non-causal model? Suppose we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (for example various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments."