**cshalizi + statistics**

The Identification Zoo: Meanings of Identification in Econometrics

4 days ago by cshalizi

"Over two dozen different terms for identification appear in the econometrics literature, including set identification, causal identification, local identification, generic identification, weak identification, identification at infinity, and many more. This survey (i) gives a new framework unifying existing definitions of point identification; (ii) summarizes and compares the zooful of different terms associated with identification that appear in the literature; and (iii) discusses concepts closely related to identification, such as normalizations and the differences in identification between structural models and causal, reduced form models."

to:NB
identifiability
statistics
econometrics
4 days ago by cshalizi

Lei: Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces

7 days ago by cshalizi

"We provide upper bounds of the expected Wasserstein distance between a probability measure and its empirical version, generalizing recent results for finite dimensional Euclidean spaces and bounded functional spaces. Such a generalization can cover Euclidean spaces with large dimensionality, with the optimal dependence on the dimensionality. Our method also covers the important case of Gaussian processes in separable Hilbert spaces, with rate-optimal upper bounds for functional data distributions whose coordinates decay geometrically or polynomially. Moreover, our bounds of the expected value can be combined with mean-concentration results to yield improved exponential tail probability bounds for the Wasserstein error of empirical measures under Bernstein-type or log Sobolev-type conditions."

to:NB
concentration_of_measure
empirical_processes
statistics
7 days ago by cshalizi

Fréchet analysis of variance for random objects | Biometrika | Oxford Academic

7 days ago by cshalizi

"Fréchet mean and variance provide a way of obtaining a mean and variance for metric space-valued random variables, and can be used for statistical analysis of data objects that lie in abstract spaces devoid of algebraic structure and operations. Examples of such data objects include covariance matrices, graph Laplacians of networks and univariate probability distribution functions. We derive a central limit theorem for the Fréchet variance under mild regularity conditions, using empirical process theory, and also provide a consistent estimator of the asymptotic variance. These results lead to a test for comparing k populations of metric space-valued data objects in terms of Fréchet means and variances. We examine the finite-sample performance of this novel inference procedure through simulation studies on several special cases that include probability distributions and graph Laplacians, leading to a test for comparing populations of networks. The proposed approach has good finite-sample performance in simulations for different kinds of random objects. We illustrate the proposed methods by analysing data on mortality profiles of various countries and resting-state functional magnetic resonance imaging data."
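Given a metric, the sample Fréchet mean and variance are easy to compute for a finite data set. A minimal sketch, restricting the minimizer to the observed objects themselves (a common finite approximation; all names here are ours, not the paper's):

```python
import numpy as np

def frechet_mean_var(objects, dist):
    # Sample Frechet mean = the candidate minimizing the average squared
    # distance to the data; that minimum average is the Frechet variance.
    d2 = np.array([[dist(a, b) ** 2 for b in objects] for a in objects])
    avg = d2.mean(axis=1)         # mean squared distance from each candidate
    i = int(np.argmin(avg))
    return objects[i], float(avg[i])

# Sanity check: for real numbers with the usual distance, this recovers
# (up to the finite-candidate restriction) the ordinary mean.
xs = [0.0, 1.0, 2.0, 3.0]
m, v = frechet_mean_var(xs, lambda a, b: abs(a - b))
```

The same function works unchanged for covariance matrices or distribution functions once `dist` is swapped for a Frobenius or Wasserstein distance.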

to:NB
statistics
statistics_on_manifolds
7 days ago by cshalizi

Network dependence testing via diffusion maps and distance-based correlations | Biometrika | Oxford Academic

7 days ago by cshalizi

"Deciphering the associations between network connectivity and nodal attributes is one of the core problems in network science. The dependency structure and high dimensionality of networks pose unique challenges to traditional dependency tests in terms of theoretical guarantees and empirical performance. We propose an approach to test network dependence via diffusion maps and distance-based correlations. We prove that the new method yields a consistent test statistic under mild distributional assumptions on the graph structure, and demonstrate that it is able to efficiently identify the most informative graph embedding with respect to the diffusion time. The methodology is illustrated on both simulated and real data."
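The distance-correlation half of the method is only a few lines; a sketch of the biased (V-statistic) sample version for scalar variables, with the diffusion-map embedding step omitted:

```python
import numpy as np

def _centered(a):
    # Doubly centered pairwise-distance matrix of a 1-D sample.
    d = np.abs(a[:, None] - a[None, :])
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def dcor(x, y):
    # Sample distance correlation in [0, 1]; it vanishes (asymptotically)
    # exactly when x and y are independent.
    A, B = _centered(x), _centered(y)
    v = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(v / denom)) if denom > 0 else 0.0

rng = np.random.default_rng(0)
x = rng.normal(size=500)
```

In the paper's setting one of the two samples would be the rows of a diffusion-map embedding of the graph rather than raw scalars.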

to:NB
network_data_analysis
statistics
to_read
to_teach:baby-nets
7 days ago by cshalizi

Causal inference with confounders missing not at random | Biometrika | Oxford Academic

7 days ago by cshalizi

"It is important to draw causal inference from observational studies, but this becomes challenging if the confounders have missing values. Generally, causal effects are not identifiable if the confounders are missing not at random. In this article we propose a novel framework for nonparametric identification of causal effects with confounders subject to an outcome-independent missingness, which means that the missing data mechanism is independent of the outcome, given the treatment and possibly missing confounders. We then propose a nonparametric two-stage least squares estimator and a parametric estimator for causal effects."

to:NB
causal_inference
statistics
missing_data
7 days ago by cshalizi

On causal discovery with an equal-variance assumption | Biometrika | Oxford Academic

7 days ago by cshalizi

"Prior work has shown that causal structure can be uniquely identified from observational data when these follow a structural equation model whose error terms have equal variance. We show that this fact is implied by an ordering among conditional variances. We demonstrate that ordering estimates of these variances yields a simple yet state-of-the-art method for causal structure learning that is readily extendable to high-dimensional problems."
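The ordering idea in the abstract can be sketched directly: under equal error variances, a greedy pass that repeatedly appends the variable with the smallest conditional variance given the variables already chosen recovers a causal ordering. A toy version (simplified from the paper's estimators; names are ours):

```python
import numpy as np

def equal_variance_order(X):
    # Greedy top-down ordering: the next variable in the causal order is
    # the one with the smallest conditional variance given those already
    # ordered (marginal variance for the first pick).
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    order, rest = [], list(range(p))
    while rest:
        best, best_v = None, np.inf
        for j in rest:
            if order:
                A = S[np.ix_(order, order)]
                b = S[np.ix_(order, [j])]
                v = S[j, j] - float(b.T @ np.linalg.solve(A, b))
            else:
                v = S[j, j]
            if v < best_v:
                best, best_v = j, v
        order.append(best)
        rest.remove(best)
    return order

# Toy linear SEM with equal error variances: X0 -> X1 -> X2.
rng = np.random.default_rng(1)
n = 5000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
```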

to:NB
causal_discovery
statistics
drton.mathias
7 days ago by cshalizi

Bootstrapping spectral statistics in high dimensions | Biometrika | Oxford Academic

8 days ago by cshalizi

"Statistics derived from the eigenvalues of sample covariance matrices are called spectral statistics, and they play a central role in multivariate testing. Although bootstrap methods are an established approach to approximating the laws of spectral statistics in low-dimensional problems, such methods are relatively unexplored in the high-dimensional setting. The aim of this article is to focus on linear spectral statistics as a class of prototypes for developing a new bootstrap in high dimensions, a method we refer to as the spectral bootstrap. In essence, the proposed method originates from the parametric bootstrap and is motivated by the fact that in high dimensions it is difficult to obtain a nonparametric approximation to the full data-generating distribution. From a practical standpoint, the method is easy to use and allows the user to circumvent the difficulties of complex asymptotic formulas for linear spectral statistics. In addition to proving the consistency of the proposed method, we present encouraging empirical results in a variety of settings. Lastly, and perhaps most interestingly, we show through simulations that the method can be applied successfully to statistics outside the class of linear spectral statistics, such as the largest sample eigenvalue and others."
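The parametric-bootstrap idea the abstract describes can be sketched as follows: resample Gaussian data whose covariance equals the sample covariance and recompute the spectral statistic on each draw. This is only the basic idea, not the paper's exact procedure:

```python
import numpy as np

def spectral_bootstrap(X, stat, B=100, rng=None):
    # Parametric bootstrap for a spectral statistic: draw Gaussian data
    # with covariance equal to the sample covariance, recompute stat.
    rng = rng or np.random.default_rng()
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(S + 1e-10 * np.eye(p))   # factor for sampling
    draws = []
    for _ in range(B):
        Xb = rng.normal(size=(n, p)) @ L.T          # rows ~ N(0, S)
        draws.append(stat(np.cov(Xb, rowvar=False)))
    return np.array(draws)

# An example linear spectral statistic: the average squared eigenvalue.
lss = lambda S: float(np.mean(np.linalg.eigvalsh(S) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
boot = spectral_bootstrap(X, lss, B=100, rng=rng)
```

The bootstrap distribution `boot` then approximates the sampling law of the statistic without any asymptotic formula.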

to:NB
spectral_methods
bootstrap
high-dimensional_statistics
statistics
8 days ago by cshalizi

Kleijn , Zhao : Criteria for posterior consistency and convergence at a rate

9 days ago by cshalizi

"Frequentist conditions for asymptotic consistency of Bayesian procedures with i.i.d. data focus on lower bounds for prior mass in Kullback-Leibler neighbourhoods of the data distribution. The goal of this paper is to investigate the flexibility in these criteria. We derive a versatile new posterior consistency theorem, which is used to consider Kullback-Leibler consistency and indicate when it is sufficient to have a prior that charges metric balls instead of KL-neighbourhoods. We generalize our proposal to sieved models with Barron’s negligible prior mass condition and to separable models with variations on Walker’s condition. Results are also applied in semi-parametric consistency: support boundary estimation is considered explicitly and consistency is proved in a model for which Kullback-Leibler priors do not exist. As a further demonstration of applicability, we consider metric consistency at a rate: under a mild integrability condition, the second-order Ghosal-Ghosh-van der Vaart prior mass condition can be relaxed to a lower bound for ordinary KL-neighbourhoods. The posterior rate is derived in a parametric model for heavy-tailed distributions in which the Ghosal-Ghosh-van der Vaart condition cannot be satisfied by any prior."

to:NB
statistics
bayesian_consistency
nonparametrics
9 days ago by cshalizi

[1909.01464] Rates of Convergence for Large-scale Nearest Neighbor Classification

11 days ago by cshalizi

"Nearest neighbor is a popular class of classification methods with many desirable properties. For a large data set which cannot be loaded into the memory of a single machine due to computation, communication, privacy, or ownership limitations, we consider the divide and conquer scheme: the entire data set is divided into small subsamples, on which nearest neighbor predictions are made, and then a final decision is reached by aggregating the predictions on subsamples by majority voting. We name this method the big Nearest Neighbor (bigNN) classifier, and provide its rates of convergence under minimal assumptions, in terms of both the excess risk and the classification instability, which are proven to be the same rates as the oracle nearest neighbor classifier and cannot be improved. To significantly reduce the prediction time that is required for achieving the optimal rate, we also consider the pre-training acceleration technique applied to the bigNN method, with proven convergence rate. We find that in the distributed setting, the optimal choice of the neighbor k should scale with both the total sample size and the number of partitions, and there is a theoretical upper limit for the latter. Numerical studies have verified the theoretical findings."
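The divide-and-conquer scheme is simple enough to sketch end to end: partition the data, run kNN on each subsample, and majority-vote the subsample predictions. A toy binary version (function names are ours):

```python
import numpy as np

def bignn_predict(X, y, x0, n_splits, k, seed=0):
    # bigNN: split the data into subsamples, classify x0 by kNN on each,
    # then take a majority vote over the per-subsample predictions.
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_splits)
    votes = []
    for ix in parts:
        d = np.linalg.norm(X[ix] - x0, axis=1)
        nn = np.argsort(d)[:k]                    # k nearest in this part
        votes.append(y[ix][nn].mean() > 0.5)      # local majority (binary y)
    return int(np.mean(votes) > 0.5)              # majority over machines

# Two well-separated Gaussian classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (300, 2)), rng.normal(2, 1, (300, 2))])
y = np.array([0] * 300 + [1] * 300)
```

The paper's theory concerns how k should scale with the total sample size and the number of partitions; the sketch just fixes both.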

to:NB
nearest_neighbors
prediction
learning_theory
statistics
distributed_systems
11 days ago by cshalizi

The accuracy, fairness, and limits of predicting recidivism | Science Advances

12 days ago by cshalizi

"Algorithms for predicting recidivism are commonly used to assess a criminal defendant’s likelihood of committing a crime. These predictions are used in pretrial, parole, and sentencing decisions. Proponents of these systems argue that big data and advanced machine learning make these analyses more accurate and less biased than humans. We show, however, that the widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise. In addition, despite COMPAS’s collection of 137 features, the same accuracy can be achieved with a simple linear classifier with only two features."

--- Pretty sure that The Kids matched the last result using decision trees (with the same two features).
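The "simple linear classifier with only two features" (reportedly age and number of prior convictions) needs almost no machinery. A hypothetical stand-in on synthetic data, with a plain logistic regression fit by gradient descent:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    # Logistic regression via gradient descent on the average log-loss.
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(w, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return (Xb @ w > 0).astype(int)

# Synthetic stand-in for two standardized features ("age", "priors");
# the labels follow a logistic model, so this is illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X @ np.array([-1.0, 2.0]) + rng.logistic(size=1000) > 0).astype(int)
w = fit_logistic(X, y)
acc = (predict(w, X) == y).mean()
```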

to:NB
have_skimmed
crime
prediction
algorithmic_fairness
statistics
classifiers
to_teach:data-mining
12 days ago by cshalizi

Unsupervised identification of the internal states that shape natural behavior | Nature Neuroscience

14 days ago by cshalizi

"Internal states shape stimulus responses and decision-making, but we lack methods to identify them. To address this gap, we developed an unsupervised method to identify internal states from behavioral data and applied it to a dynamic social interaction. During courtship, Drosophila melanogaster males pattern their songs using feedback cues from their partner. Our model uncovers three latent states underlying this behavior and is able to predict moment-to-moment variation in song-patterning decisions. These states correspond to different sensorimotor strategies, each of which is characterized by different mappings from feedback cues to song modes. We show that a pair of neurons previously thought to be command neurons for song production are sufficient to drive switching between states. Our results reveal how animals compose behavior from previously unidentified internal states, which is a necessary step for quantitative descriptions of animal behavior that link environmental cues, internal needs, neuronal activity and motor outputs."

--- Interesting to see how they select an HMM architecture. (I have Opinions about the Right Way To Do It, naturally.)

to:NB
neuroscience
state-space_models
time_series
psychology
inference_to_latent_objects
statistics
to_read
14 days ago by cshalizi

The relationship between external variables and common factors | SpringerLink

14 days ago by cshalizi

"A theorem is presented which gives the range of possible correlations between a common factor and an external variable (i.e., a variable not included in the test battery factor analyzed). Analogous expressions for component (and regression component) theory are also derived. Some situations involving external correlations are then discussed which dramatize the theoretical differences between components and common factors."

in_NB
have_read
factor_analysis
inference_to_latent_objects
psychometrics
statistics
re:g_paper
14 days ago by cshalizi

Some new results on factor indeterminacy | SpringerLink

14 days ago by cshalizi

"Some relations between maximum likelihood factor analysis and factor indeterminacy are discussed. Bounds are derived for the minimum average correlation between equivalent sets of correlated factors which depend on the latent roots of the factor intercorrelation matrix ψ. Empirical examples are presented to illustrate some of the theory and indicate the extent to which it can be expected to be relevant in practice."

in_NB
have_read
a_long_time_ago
factor_analysis
low-rank_approximation
statistics
re:g_paper
14 days ago by cshalizi

Teacher Effects on Student Achievement and Height: A Cautionary Tale

15 days ago by cshalizi

"Estimates of teacher “value-added” suggest teachers vary substantially in their ability to promote student learning. Prompted by this finding, many states and school districts have adopted value-added measures as indicators of teacher job performance. In this paper, we conduct a new test of the validity of value-added models. Using administrative student data from New York City, we apply commonly estimated value-added models to an outcome teachers cannot plausibly affect: student height. We find the standard deviation of teacher effects on height is nearly as large as that for math and reading achievement, raising obvious questions about validity. Subsequent analysis finds these “effects” are largely spurious variation (noise), rather than bias resulting from sorting on unobserved factors related to achievement. Given the difficulty of differentiating signal from noise in real-world teacher effect estimates, this paper serves as a cautionary tale for their use in practice."

to:NB
value-added_measures
statistics
education
social_measurement
bad_data_analysis
15 days ago by cshalizi

Random Forests | SpringerLink

15 days ago by cshalizi

"Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression."
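Breiman's recipe (bootstrap samples plus random feature selection at each split, majority vote at the end) can be sketched with depth-one trees as the base learners; real forests grow the trees much deeper, so this is a minimal illustration, not his algorithm in full:

```python
import numpy as np

def fit_stump(X, y, feats):
    # Best single split over a random feature subset (the random feature
    # selection step) and a few candidate thresholds; each leaf predicts
    # its majority class.
    best, best_err = None, np.inf
    for j in feats:
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            mask = X[:, j] <= t
            if mask.all() or not mask.any():
                continue
            pl = int(y[mask].mean() > 0.5)
            pr = int(y[~mask].mean() > 0.5)
            err = np.sum(y[mask] != pl) + np.sum(y[~mask] != pr)
            if err < best_err:
                best_err, best = err, (j, t, pl, pr)
    return best

def random_forest(X, y, n_trees=100, mtry=3, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        boot = rng.integers(0, len(X), len(X))             # bootstrap rows
        feats = rng.choice(X.shape[1], size=mtry, replace=False)
        trees.append(fit_stump(X[boot], y[boot], feats))
    return trees

def forest_predict(trees, X):
    votes = np.array([np.where(X[:, j] <= t, pl, pr) for j, t, pl, pr in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)          # majority vote

# Toy problem: the label depends only on the first of five features.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
trees = random_forest(X, y)
acc = (forest_predict(trees, X) == y).mean()
```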

to:NB
have_read
breiman.leo
ensemble_methods
decision_trees
random_forests
to_teach:data-mining
machine_learning
statistics
prediction
15 days ago by cshalizi

Chen, Wang: Distributional properties and estimation in spatial image clustering

15 days ago by cshalizi

"Clusters of different objects are of great interest in many fields, such as agriculture and ecology. One kind of clustering methods is very different from the traditional statistical clustering analysis, which is based on discrete data points. This method of clustering defines clusters as the connected areas where a well-defined spatial random field is above certain threshold. The statistical properties, especially the distributional properties, of the defined clusters are vital for the studies of related fields. However, the available statistical techniques for analyzing clustering models are not applicable to these problems. We study the distribution properties of the clusters by defining a distribution function of the clusters rigorously and providing methods to estimate the spatial distribution function. Our results are illustrated by numerical experiments and an application to a real world problem."

spatial_statistics
random_fields
to_teach:data_over_space_and_time
statistics
15 days ago by cshalizi

Alquier, Marie: Matrix factorization for multivariate time series analysis

15 days ago by cshalizi

"Matrix factorization is a powerful data analysis tool. It has been used in multivariate time series analysis, leading to the decomposition of the series in a small set of latent factors. However, little is known on the statistical performances of matrix factorization for time series. In this paper, we extend the results known for matrix estimation in the i.i.d setting to time series. Moreover, we prove that when the series exhibit some additional structure like periodicity or smoothness, it is possible to improve on the classical rates of convergence."
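The basic decomposition is the familiar low-rank one: stack the series as columns of a (time x series) matrix and factor it into a few latent factor series times loadings. A truncated-SVD sketch (the paper studies more general estimators and their rates):

```python
import numpy as np

def factorize(Y, k):
    # Rank-k factorization Y ~ F @ W: columns of F are k latent factor
    # series, rows of W the loadings of each observed series on them.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    F = U[:, :k] * s[:k]          # latent factors, (time x k)
    W = Vt[:k]                    # loadings, (k x series)
    return F, W

# Toy data: 10 series driven by 2 smooth latent factors plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 300)
F_true = np.column_stack([np.sin(t), np.cos(t / 2)])
Y = F_true @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(300, 10))
F, W = factorize(Y, 2)
rel_err = np.linalg.norm(Y - F @ W) / np.linalg.norm(Y)
```

The smoothness of the recovered factors here is exactly the kind of "additional structure" the paper exploits for faster rates.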

in_NB
low-rank_approximation
time_series
factor_analysis
statistics
to_read
to_teach:data_over_space_and_time
15 days ago by cshalizi

[1911.00535] Think-aloud interviews: A tool for exploring student statistical reasoning

22 days ago by cshalizi

"As statistics educators revise introductory courses to cover new topics and reach students from more diverse academic backgrounds, they need assessments to test if new teaching strategies and new curricula are meeting their goals. But assessing student understanding of statistics concepts can be difficult: conceptual questions are difficult to write clearly, and students often interpret questions in unexpected ways and give answers for unexpected reasons. Assessment results alone also do not clearly indicate the reasons students pick specific answers.

"We describe think-aloud interviews with students as a powerful tool to ensure that draft questions fulfill their intended purpose, uncover unexpected misconceptions or surprising readings of questions, and suggest new questions or further pedagogical research. We have conducted more than 40 hour-long think-aloud interviews to develop over 50 assessment questions, and have collected pre- and post-test assessment data from hundreds of introductory statistics students at two institutions.

"Think-alouds and assessment data have helped us refine draft questions and explore student misunderstandings. Our findings include previously under-reported statistical misconceptions about sampling distributions and causation. These results suggest directions for future statistics education research and show how think-aloud interviews can be effectively used to develop assessments and improve our understanding of student learning."

to:NB
to_read
heard_the_talk
kith_and_kin
statistics
cognitive_science
education
protocol_analysis
expertise
22 days ago by cshalizi

[1910.12820] Empirical Differential Privacy

29 days ago by cshalizi

"We show how to achieve differential privacy with no or reduced added noise, based on the empirical noise in the data itself. Unlike previous works on noiseless privacy, the empirical viewpoint avoids making any explicit assumptions about the random process generating the data."

to:NB
differential_privacy
statistics
29 days ago by cshalizi

[1910.14238] Learning Disentangled Representations for Recommendation

29 days ago by cshalizi

"User behavior data in recommender systems are driven by the complex interactions of many latent factors behind the users' decision making processes. The factors are highly entangled, and may range from high-level ones that govern user intentions, to low-level ones that characterize a user's preference when executing an intention. Learning representations that uncover and disentangle these latent factors can bring enhanced robustness, interpretability, and controllability. However, learning such disentangled representations from user behavior is challenging, and remains largely neglected by the existing literature. In this paper, we present the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations from user behavior. Our approach achieves macro disentanglement by inferring the high-level concepts associated with user intentions (e.g., to buy a shirt or a cellphone), while capturing the preference of a user regarding the different concepts separately. A micro-disentanglement regularizer, stemming from an information-theoretic interpretation of VAEs, then forces each dimension of the representations to independently reflect an isolated low-level factor (e.g., the size or the color of a shirt). Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. We further demonstrate that the learned representations are interpretable and controllable, which can potentially lead to a new paradigm for recommendation where users are given fine-grained control over targeted aspects of the recommendation lists."

to:NB
information_bottleneck
inference_to_latent_objects
recommender_systems
statistics
29 days ago by cshalizi

[1910.14212] Sobolev Independence Criterion

29 days ago by cshalizi

"We propose the Sobolev Independence Criterion (SIC), an interpretable dependency measure between a high dimensional random variable X and a response variable Y . SIC decomposes to the sum of feature importance scores and hence can be used for nonlinear feature selection. SIC can be seen as a gradient regularized Integral Probability Metric (IPM) between the joint distribution of the two random variables and the product of their marginals. We use sparsity inducing gradient penalties to promote input sparsity of the critic of the IPM. In the kernel version we show that SIC can be cast as a convex optimization problem by introducing auxiliary variables that play an important role in feature selection as they are normalized feature importance scores. We then present a neural version of SIC where the critic is parameterized as a homogeneous neural network, improving its representation power as well as its interpretability. We conduct experiments validating SIC for feature selection in synthetic and real-world experiments. We show that SIC enables reliable and interpretable discoveries, when used in conjunction with the holdout randomization test and knockoffs to control the False Discovery Rate. Code is available at this http URL."

to:NB
dependence_measures
statistics
variable_selection
29 days ago by cshalizi

[1910.02505] Boosting Local Causal Discovery in High-Dimensional Expression Data

29 days ago by cshalizi

"We study the performance of Local Causal Discovery (LCD), a simple and efficient constraint-based method for causal discovery, in predicting causal effects in large-scale gene expression data. We construct practical estimators specific to the high-dimensional regime. Inspired by the ICP algorithm, we use an optional preselection method and two different statistical tests. Empirically, the resulting LCD estimator is seen to closely approach the accuracy of ICP, the state-of-the-art method, while it is algorithmically simpler and computationally more efficient."

to:NB
causal_discovery
statistics
29 days ago by cshalizi

[1905.00175] Boosting: Why You Can Use the HP Filter

29 days ago by cshalizi

"The Hodrick-Prescott (HP) filter is one of the most widely used econometric methods in applied macroeconomic research. Like all nonparametric methods, the HP filter depends critically on a tuning parameter that controls the degree of smoothing. Yet in contrast to modern nonparametric methods and applied work with these procedures, empirical practice with the HP filter almost universally relies on standard settings for the tuning parameter that have been suggested largely by experimentation with macroeconomic data and heuristic reasoning. As recent research (Phillips and Jin, 2015) has shown, standard settings may not be adequate in removing trends, particularly stochastic trends, in economic data.

"This paper proposes an easy-to-implement practical procedure of iterating the HP smoother that is intended to make the filter a smarter smoothing device for trend estimation and trend elimination. We call this iterated HP technique the boosted HP filter in view of its connection to L2-boosting in machine learning. The paper develops limit theory to show that the boosted HP (bHP) filter asymptotically recovers trend mechanisms that involve unit root processes, deterministic polynomial drifts, and polynomial drifts with structural breaks. A stopping criterion is used to automate the iterative HP algorithm, making it a data-determined method that is ready for modern data-rich environments in economic research. The methodology is illustrated using three real data examples that highlight the differences between simple HP filtering, the data-determined boosted filter, and an alternative autoregressive approach. These examples show that the bHP filter is helpful in analyzing a large collection of heterogeneous macroeconomic time series that manifest various degrees of persistence, trend behavior, and volatility."
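The iteration the abstract describes is straightforward: apply the HP smoother, then re-smooth the residuals and add the result to the trend, and repeat. A sketch with a fixed iteration count standing in for the paper's data-driven stopping rule (lambda = 1600 is the conventional quarterly setting):

```python
import numpy as np

def hp_trend(y, lam=1600.0):
    # One pass of the Hodrick-Prescott smoother:
    # argmin_f ||y - f||^2 + lam * ||D2 f||^2, solved in closed form.
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)          # second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

def boosted_hp(y, lam=1600.0, m=5):
    # Boosted HP (bHP): L2-boosting with the HP smoother as base learner.
    # Each pass smooths the current residuals and adds the fit to the trend.
    trend = np.zeros_like(y)
    resid = y.copy()
    for _ in range(m):
        f = hp_trend(resid, lam)
        trend += f
        resid -= f
    return trend

# Toy series: linear drift plus a cycle plus noise.
t = np.arange(200.0)
rng = np.random.default_rng(0)
y = 0.05 * t + np.sin(t / 4) + 0.2 * rng.normal(size=200)
trend = boosted_hp(y)
```

Because second differences annihilate linear functions, each HP pass preserves the drift exactly; the boosting iterations progressively pull more persistent components out of the residual, which is how bHP handles stochastic trends that a single pass leaves behind.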

to:NB
boosting
time_series
splines
statistics
29 days ago by cshalizi

[1911.00190] Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

4 weeks ago by cshalizi

"Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided."

to:NB
random_forests
statistics
decision_trees
ensemble_methods
4 weeks ago by cshalizi

[1605.00252] Fast Rates for General Unbounded Loss Functions: from ERM to Generalized Bayes

4 weeks ago by cshalizi

"We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to η-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate η is set correctly. For general loss functions, our bounds rely on two separate conditions: the v-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter v in the v-GRIP conditions determines the achievable rate and is akin to the exponent in the Tsybakov margin condition and the Bernstein condition for bounded losses, which the v-GRIP conditions generalize; favorable v in combination with small model complexity leads to Õ (1/n) rates. The witness condition allows us to connect the excess risk to an "annealed" version thereof, by which we generalize several previous results connecting Hellinger and Rényi divergence to KL divergence."

to:NB
learning_theory
grunwald.peter
statistics
4 weeks ago by cshalizi

[1807.05748] Learning Stochastic Differential Equations With Gaussian Processes Without Gradient Matching

4 weeks ago by cshalizi

"We introduce a novel paradigm for learning non-parametric drift and diffusion functions for stochastic differential equation (SDE). The proposed model learns to simulate path distributions that match observations with non-uniform time increments and arbitrary sparseness, which is in contrast with gradient matching that does not optimize simulated responses. We formulate sensitivity equations for learning and demonstrate that our general stochastic distribution optimisation leads to robust and efficient learning of SDE systems."

to:NB
stochastic_differential_equations
statistical_inference_for_stochastic_processes
gaussian_processes
statistics
nonparametrics
4 weeks ago by cshalizi

Phylogenetic Comparative Methods and the Evolution of Multivariate Phenotypes | Annual Review of Ecology, Evolution, and Systematics

4 weeks ago by cshalizi

"Evolutionary biology is multivariate, and advances in phylogenetic comparative methods for multivariate phenotypes have surged to accommodate this fact. Evolutionary trends in multivariate phenotypes are derived from distances and directions between species in a multivariate phenotype space. For these patterns to be interpretable, phenotypes should be characterized by traits in commensurate units and scale. Visualizing such trends, as is achieved with phylomorphospaces, should continue to play a prominent role in macroevolutionary analyses. Evaluating phylogenetic generalized least squares (PGLS) models (e.g., phylogenetic analysis of variance and regression) is valuable, but using parametric procedures is limited to only a few phenotypic variables. In contrast, nonparametric, permutation-based PGLS methods provide a flexible alternative and are thus preferred for high-dimensional multivariate phenotypes. Permutation-based methods for evaluating covariation within multivariate phenotypes are also well established and can test evolutionary trends in phenotypic integration. However, comparing evolutionary rates and modes in multivariate phenotypes remains an important area of future development."

to:NB
phylogenetics
statistics
4 weeks ago by cshalizi

[1911.01483] Statistical Inference for Model Parameters in Stochastic Gradient Descent via Batch Means

4 weeks ago by cshalizi

"Statistical inference of true model parameters based on stochastic gradient descent (SGD) has started receiving attention in recent years. In this paper, we study a simple algorithm to construct asymptotically valid confidence regions for model parameters using the batch means method. The main idea is to cancel out the covariance matrix which is hard/costly to estimate. In the process of developing the algorithm, we establish process-level function central limit theorem for Polyak-Ruppert averaging based SGD estimators. We also extend the batch means method to accommodate more general batch size specifications."

to:NB
optimization
estimation
statistics
re:HEAS
4 weeks ago by cshalizi
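--- A minimal numpy sketch of the batch-means idea on a toy scalar problem (my own illustration, not the authors' code): run SGD to estimate a mean, take the Polyak-Ruppert average of the iterates, and get a standard error from the spread of batch averages, so no covariance matrix ever has to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: estimate theta* = 2.0 by SGD on squared loss; the
# stochastic gradient at theta for a sample x is (theta - x).
theta_star, n = 2.0, 20000
data = rng.normal(theta_star, 1.0, size=n)

theta = 0.0
iterates = np.empty(n)
for t, x in enumerate(data, start=1):
    theta -= t ** -0.51 * (theta - x)        # Robbins-Monro step sizes
    iterates[t - 1] = theta

theta_bar = iterates.mean()                  # Polyak-Ruppert average

# Batch means: split the iterate sequence into M batches and use the
# variability of the batch averages as a standard error.
M = 20
batch_means = iterates.reshape(M, -1).mean(axis=1)
se = batch_means.std(ddof=1) / np.sqrt(M)
ci = (theta_bar - 2.09 * se, theta_bar + 2.09 * se)   # t quantile, 19 df
```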

[1911.01429] The frontier of simulation-based inference

4 weeks ago by cshalizi

"Many domains of science have developed complex simulations to describe phenomena of interest. While these simulations provide high-fidelity models, they are poorly suited for inference and lead to challenging inverse problems. We review the rapidly developing field of simulation-based inference and identify the forces giving new momentum to the field. Finally, we describe how the frontier is expanding so that a broad audience can appreciate the profound change these developments may have on science."

to:NB
statistics
simulation
to_read
4 weeks ago by cshalizi

Modeling Uncertainty in Latent Class Membership: A Case Study in Criminology on JSTOR

4 weeks ago by cshalizi

"Social scientists are commonly interested in relating a latent trait (e.g., criminal tendency) to measurable individual covariates (e.g., poor parenting) to understand what defines or perhaps causes the latent trait. In this article we develop an efficient and convenient method for answering such questions. The basic model presumes that two types of variables have been measured: response variables (possibly longitudinal) that partially determine the latent class membership, and covariates or risk factors that we wish to relate to these latent class variables. The model assumes that these observable variables are conditionally independent, given the latent class variable. We use a mixture model for the joint distribution of the observables. We apply this model to a longitudinal dataset assembled as part of the Cambridge Study of Delinquent Development to test a fundamental theory of criminal development. This theory holds that crime is committed by two distinct groups within the population: adolescent-limited offenders and life-course-persistent offenders. As these labels suggest, the two groups are distinguished by the longevity of their offending careers. The theory also predicts that life-course-persistent offenders are disproportionately comprised of individuals born with neurological deficits and reared by caregivers without the skills and resources to effectively socialize a difficult child."

to:NB
crime
time_series
mixture_models
kith_and_kin
roeder.kathryn
statistics
inference_to_latent_objects
have_skimmed
4 weeks ago by cshalizi

[1705.04867] Nearest Neighbors for Matrix Estimation Interpreted as Blind Regression for Latent Variable Model

5 weeks ago by cshalizi

"We consider the setup of nonparametric {\em blind regression} for estimating the entries of a large m×n matrix, when provided with a small, random fraction of noisy measurements. We assume that all rows u∈[m] and columns i∈[n] of the matrix are associated to latent features xrow(u) and xcol(i) respectively, and the (u,i)-th entry of the matrix, A(u,i) is equal to f(xrow(u),xcol(i)) for a latent function f. Given noisy observations of a small, random subset of the matrix entries, our goal is to estimate the unobserved entries of the matrix as well as to "de-noise" the observed entries. As the main result of this work, we introduce a nearest-neighbor-based estimation algorithm, and establish its consistency when the underlying latent function f is Lipschitz, the underlying latent space is a bounded diameter Polish space, and the random fraction of observed entries in the matrix is at least max(m−1+δ,n−1/2+δ), for any δ>0. As an important byproduct, our analysis sheds light into the performance of the classical collaborative filtering algorithm for matrix completion, which has been widely utilized in practice. Experiments with the MovieLens and Netflix datasets suggest that our algorithm provides a principled improvement over basic collaborative filtering and is competitive with matrix factorization methods. Our algorithm has a natural extension to the setting of tensor completion via flattening the tensor to matrix. When applied to the setting of image in-painting, which is a 3-order tensor, we find that our approach is competitive with respect to state-of-art tensor completion algorithms across benchmark images."

to:NB
matrix_factorization
collaborative_filtering
nearest_neighbors
statistics
5 weeks ago by cshalizi
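--- A toy numpy sketch of the nearest-neighbor idea (my illustration; the paper's algorithm and guarantees are more refined): compare rows on their commonly observed entries, and predict a missing entry by averaging over the most similar rows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy latent-variable matrix: A(u, i) = f(x_row(u), x_col(i)) for a
# Lipschitz f (here f(a, b) = a + b), half observed, with noise.
m, n = 60, 60
x_row, x_col = rng.uniform(0, 1, m), rng.uniform(0, 1, n)
A = np.add.outer(x_row, x_col)
mask = rng.random((m, n)) < 0.5
Y = np.where(mask, A + 0.05 * rng.normal(size=A.shape), np.nan)

def predict(u, i, k=5):
    """Estimate A(u, i) by averaging Y(v, i) over the k rows v most
    similar to row u, where similarity is the mean squared difference
    over commonly observed columns."""
    scored = []
    for v in range(m):
        if v == u or not mask[v, i]:
            continue
        common = mask[u] & mask[v]
        common[i] = False                    # don't peek at the target
        if common.sum() < 2:
            continue
        d = np.mean((Y[u, common] - Y[v, common]) ** 2)
        scored.append((d, Y[v, i]))
    scored.sort(key=lambda s: s[0])
    return np.mean([y for _, y in scored[:k]])
```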

Sadhanala, Tibshirani: Additive models with trend filtering

5 weeks ago by cshalizi

"We study additive models built with trend filtering, that is, additive models whose components are each regularized by the (discrete) total variation of their kkth (discrete) derivative, for a chosen integer k≥0k≥0. This results in kkth degree piecewise polynomial components, (e.g., k=0k=0 gives piecewise constant components, k=1k=1 gives piecewise linear, k=2k=2 gives piecewise quadratic, etc.). Analogous to its advantages in the univariate case, additive trend filtering has favorable theoretical and computational properties, thanks in large part to the localized nature of the (discrete) total variation regularizer that it uses. On the theory side, we derive fast error rates for additive trend filtering estimates, and show these rates are minimax optimal when the underlying function is additive and has component functions whose derivatives are of bounded variation. We also show that these rates are unattainable by additive smoothing splines (and by additive models built from linear smoothers, in general). On the computational side, we use backfitting, to leverage fast univariate trend filtering solvers; we also describe a new backfitting algorithm whose iterations can be run in parallel, which (as far as we can tell) is the first of its kind. Lastly, we present a number of experiments to examine the empirical performance of trend filtering."

to:NB
regression
additive_models
statistics
kith_and_kin
tibshirani.ryan
5 weeks ago by cshalizi
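--- A minimal backfitting sketch (mine, with a kernel smoother standing in for the univariate trend filtering solvers the paper actually leverages): cycle through the components, each time smoothing the partial residual.

```python
import numpy as np

rng = np.random.default_rng(7)

# Additive truth: y = sin(3 x1) + x2^2 + noise.
n = 300
X = rng.uniform(-1, 1, (n, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

def smooth(x, r, bw=0.15):
    # Univariate Nadaraya-Watson smoother (a stand-in for a trend
    # filtering solver): kernel-weighted average of the residuals r.
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bw) ** 2)
    return (w @ r) / w.sum(axis=1)

# Backfitting: repeatedly smooth each component's partial residual.
f = np.zeros((n, 2))
for _ in range(20):
    for j in range(2):
        partial = y - y.mean() - f[:, 1 - j]
        f[:, j] = smooth(X[:, j], partial)
        f[:, j] -= f[:, j].mean()            # center for identifiability
fit = y.mean() + f.sum(axis=1)
```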

Veitch, Roy: Sampling and estimation for (sparse) exchangeable graphs

5 weeks ago by cshalizi

"Sparse exchangeable graphs on ℝ+R+, and the associated graphex framework for sparse graphs, generalize exchangeable graphs on ℕN, and the associated graphon framework for dense graphs. We develop the graphex framework as a tool for statistical network analysis by identifying the sampling scheme that is naturally associated with the models of the framework, formalizing two natural notions of consistent estimation of the parameter (the graphex) underlying these models, and identifying general consistent estimators in each case. The sampling scheme is a modification of independent vertex sampling that throws away vertices that are isolated in the sampled subgraph. The estimators are variants of the empirical graphon estimator, which is known to be a consistent estimator for the distribution of dense exchangeable graphs; both can be understood as graph analogues to the empirical distribution in the i.i.d. sequence setting. Our results may be viewed as a generalization of consistent estimation via the empirical graphon from the dense graph regime to also include sparse graphs."

to:NB
graph_limits
network_data_analysis
graphons
nonparametrics
statistics
veitch.victor
to_teach:graphons
5 weeks ago by cshalizi

Rinaldo, Wasserman, G’Sell: Bootstrapping and sample splitting for high-dimensional, assumption-lean inference

5 weeks ago by cshalizi

"Several new methods have been recently proposed for performing valid inference after model selection. An older method is sample splitting: use part of the data for model selection and the rest for inference. In this paper, we revisit sample splitting combined with the bootstrap (or the Normal approximation). We show that this leads to a simple, assumption-lean approach to inference and we establish results on the accuracy of the method. In fact, we find new bounds on the accuracy of the bootstrap and the Normal approximation for general nonlinear parameters with increasing dimension which we then use to assess the accuracy of regression inference. We define new parameters that measure variable importance and that can be inferred with greater accuracy than the usual regression coefficients. Finally, we elucidate an inference-prediction trade-off: splitting increases the accuracy and robustness of inference but can decrease the accuracy of the predictions."

to:NB
linear_regression
sample_splitting
statistics
kith_and_kin
rinaldo.alessandro
wasserman.larry
g'sell.max
5 weeks ago by cshalizi
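--- The sample-splitting recipe in a toy numpy sketch (my illustration, not the paper's code): select a variable on one half of the data, then form a Normal-approximation confidence interval for its coefficient on the other half, so the selection step cannot invalidate the inference.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy sparse regression: only the first of 50 covariates matters.
n, d = 400, 50
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# Half 1: model selection (pick the covariate most correlated with y).
half = n // 2
corrs = [abs(np.corrcoef(X[:half, j], y[:half])[0, 1]) for j in range(d)]
selected = int(np.argmax(corrs))

# Half 2: inference for the selected coefficient via a simple
# Normal approximation on fresh data.
x2, y2 = X[half:, selected], y[half:]
beta = np.sum(x2 * y2) / np.sum(x2 ** 2)
resid = y2 - beta * x2
se = np.sqrt(np.sum(resid ** 2) / (len(y2) - 1) / np.sum(x2 ** 2))
ci = (beta - 1.96 * se, beta + 1.96 * se)
```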

[1910.13289] Optimal nonparametric multivariate change point detection and localization

5 weeks ago by cshalizi

"We study the multivariate nonparametric change point detection problem, where the data are a collection of independent p-dimensional random vectors, whose distributions are assumed to have piecewise-constant and uniformly Lipschitz densities, changing at unknown times, called change points. We quantify the magnitude of the distributional change at any change point with the supremum norm of the difference between the corresponding densities. We are concerned with the localization task of estimating the positions of the change points. In our analysis, we allow for the model parameters to vary with the total number of time points, including the minimal spacing between consecutive change points and the magnitude of the smallest distributional change. We provide information-theoretic lower bounds on both the localization rate and the minimal signal-to-noise ratio required to guarantee consistent localization. We formulate a novel algorithm based on kernel density estimation that nearly achieves the minimax lower bound, save for logarithm factors. We have provided extensive numerical evidence to support our theoretical findings."

to:NB
change-point_problem
statistics
kith_and_kin
rinaldo.alessandro
5 weeks ago by cshalizi
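--- A 1-d toy sketch of the kernel-density approach (mine; the paper handles multivariate data, multiple change points, and minimax rates): score each candidate split by the sup-norm distance between the kernel density estimates of the two segments, and localize at the maximizer.

```python
import numpy as np

rng = np.random.default_rng(6)

# 1-d stream whose distribution changes from N(0,1) to N(2,1) at t = 150.
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(2, 1, 150)])
grid = np.linspace(-4.0, 6.0, 200)
h = 0.5                                       # kernel bandwidth

def kde(sample):
    # Gaussian kernel density estimate evaluated on the grid.
    z = (grid[:, None] - sample) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

# Sup-norm distance between segment KDEs, maximized over candidate splits.
candidates = list(range(30, len(x) - 30))
scores = [np.max(np.abs(kde(x[:t]) - kde(x[t:]))) for t in candidates]
t_hat = candidates[int(np.argmax(scores))]
```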

[1905.11255] Kernel Conditional Density Operators

5 weeks ago by cshalizi

"We introduce a novel conditional density estimation model termed the conditional density operator (CDO). It naturally captures multivariate, multimodal output densities and shows performance that is competitive with recent neural conditional density models and Gaussian processes. The proposed model is based on a novel approach to the reconstruction of probability densities from their kernel mean embeddings by drawing connections to estimation of Radon-Nikodym derivatives in the reproducing kernel Hilbert space (RKHS). We prove finite sample bounds for the estimation error in a standard density reconstruction scenario, independent of problem dimensionality. Interestingly, when a kernel is used that is also a probability density, the CDO allows us to both evaluate and sample the output density efficiently. We demonstrate the versatility and performance of the proposed model on both synthetic and real-world data."

to:NB
kernel_estimators
hilbert_space
density_estimation
statistics
to_read
5 weeks ago by cshalizi

[1907.03783] Comparing EM with GD in Mixture Models of Two Components

5 weeks ago by cshalizi

"The expectation-maximization (EM) algorithm has been widely used in minimizing the negative log likelihood (also known as cross entropy) of mixture models. However, little is understood about the goodness of the fixed points it converges to. In this paper, we study the regions where one component is missing in two-component mixture models, which we call one-cluster regions. We analyze the propensity of such regions to trap EM and gradient descent (GD) for mixtures of two Gaussians and mixtures of two Bernoullis. In the case of Gaussian mixtures, EM escapes one-cluster regions exponentially fast, while GD escapes them linearly fast. In the case of mixtures of Bernoullis, we find that there exist one-cluster regions that are stable for GD and therefore trap GD, but those regions are unstable for EM, allowing EM to escape. Those regions are local minima that appear universally in experiments and can be arbitrarily bad. This work implies that EM is less likely than GD to converge to certain bad local optima in mixture models."

to:NB
mixture_models
em_algorithm
optimization
computational_statistics
statistics
5 weeks ago by cshalizi
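--- A minimal EM sketch for the two-Gaussian case (my toy example, not the paper's setup): even when both means are initialized inside one cluster, EM pulls them apart, illustrating the claim that one-cluster regions are unstable for EM.

```python
import numpy as np

rng = np.random.default_rng(2)

# Equal mixture of two unit-variance Gaussians at -2 and +2.
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])

def em_two_means(x, mu0, mu1, steps=100):
    """EM for a 50/50 mixture of unit-variance Gaussians, updating
    only the two means (a deliberately minimal version)."""
    for _ in range(steps):
        # E-step: responsibility of component 1 for each point.
        d0 = np.exp(-0.5 * (x - mu0) ** 2)
        d1 = np.exp(-0.5 * (x - mu1) ** 2)
        r = d1 / (d0 + d1)
        # M-step: responsibility-weighted means.
        mu0 = np.sum((1 - r) * x) / np.sum(1 - r)
        mu1 = np.sum(r * x) / np.sum(r)
    return mu0, mu1

# Initialize both means inside the right-hand cluster ("one-cluster
# region"); EM still escapes and finds both components.
mu0, mu1 = em_two_means(x, 1.9, 2.1)
```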

[1910.12993] Characterizing Distribution Equivalence for Cyclic and Acyclic Directed Graphs

5 weeks ago by cshalizi

"The main way for defining equivalence among acyclic directed graphs is based on the conditional independencies of the distributions that they can generate. However, it is known that when cycles are allowed in the structure, conditional independence is not a suitable notion for equivalence of two structures, as it does not reflect all the information in the distribution that can be used for identification of the underlying structure. In this paper, we present a general, unified notion of equivalence for linear Gaussian directed graphs. Our proposed definition for equivalence is based on the set of distributions that the structure is able to generate. We take a first step towards devising methods for characterizing the equivalence of two structures, which may be cyclic or acyclic. Additionally, we propose a score-based method for learning the structure from observational data."

to:NB
graphical_models
causal_discovery
zhang.kun
statistics
5 weeks ago by cshalizi

[1809.05014] Statistical Estimation of Ergodic Markov Chain Kernel over Discrete State Space

5 weeks ago by cshalizi

"We investigate the statistical complexity of estimating the parameters of a discrete-state Markov chain kernel from a single long sequence of state observations. In the finite case, we characterize (modulo logarithmic factors) the minimax sample complexity of estimation with respect to the operator infinity norm, while in the countably infinite case, we analyze the problem with respect to a natural entry-wise norm derived from total variation. We show that in both cases, the sample complexity is governed by the mixing properties of the unknown chain, for which, in the finite-state case, there are known finite-sample estimators with fully empirical confidence intervals."

to:NB
markov_models
statistical_inference_for_stochastic_processes
statistics
time_series
kith_and_kin
kontorovich.aryeh
to_teach:data_over_space_and_time
5 weeks ago by cshalizi
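--- The estimator in question is just row-normalized transition counts from a single trajectory; a toy numpy sketch (mine):

```python
import numpy as np

rng = np.random.default_rng(3)

# A small ergodic chain and one long observed trajectory from it.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
T = 20000
states = np.empty(T, dtype=int)
states[0] = 0
for t in range(1, T):
    states[t] = rng.choice(3, p=P[states[t - 1]])

# Plug-in estimator: row-normalized transition counts; its accuracy is
# governed by how often each state is visited, i.e. by mixing.
counts = np.zeros((3, 3))
for a, b in zip(states[:-1], states[1:]):
    counts[a, b] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)
```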

[1910.13427] Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications

5 weeks ago by cshalizi

"We develop techniques to quantify the degree to which a given (training or testing) example is an outlier in the underlying distribution. We evaluate five methods to score examples in a dataset by how well-represented the examples are, for different plausible definitions of "well-represented", and apply these to four common datasets: MNIST, Fashion-MNIST, CIFAR-10, and ImageNet. Despite being independent approaches, we find all five are highly correlated, suggesting that the notion of being well-represented can be quantified. Among other uses, we find these methods can be combined to identify (a) prototypical examples (that match human expectations); (b) memorized training examples; and, (c) uncommon submodes of the dataset. Further, we show how we can utilize our metrics to determine an improved ordering for curriculum learning, and impact adversarial robustness. We release all metric values on training and test sets we studied."

--- Interesting to see if they look at earlier work on outliers at all.

in_NB
outliers
adversarial_examples
statistics
5 weeks ago by cshalizi

[1909.05442] Nonstationary Nonparametric Online Learning: Balancing Dynamic Regret and Model Parsimony

6 weeks ago by cshalizi

"An open challenge in supervised learning is \emph{conceptual drift}: a data point begins as classified according to one label, but over time the notion of that label changes. Beyond linear autoregressive models, transfer and meta learning address drift, but require data that is representative of disparate domains at the outset of training. To relax this requirement, we propose a memory-efficient \emph{online} universal function approximator based on compressed kernel methods. Our approach hinges upon viewing non-stationary learning as online convex optimization with dynamic comparators, for which performance is quantified by dynamic regret.

"Prior works control dynamic regret growth only for linear models. In contrast, we hypothesize actions belong to reproducing kernel Hilbert spaces (RKHS). We propose a functional variant of online gradient descent (OGD) operating in tandem with greedy subspace projections. Projections are necessary to surmount the fact that RKHS functions have complexity proportional to time.

"For this scheme, we establish sublinear dynamic regret growth in terms of both loss variation and functional path length, and that the memory of the function sequence remains moderate. Experiments demonstrate the usefulness of the proposed technique for online nonlinear regression and classification problems with non-stationary data."

to:NB
non-stationarity
time_series
online_learning
statistics
re:growing_ensemble_project
"Prior works control dynamic regret growth only for linear models. In contrast, we hypothesize actions belong to reproducing kernel Hilbert spaces (RKHS). We propose a functional variant of online gradient descent (OGD) operating in tandem with greedy subspace projections. Projections are necessary to surmount the fact that RKHS functions have complexity proportional to time.

"For this scheme, we establish sublinear dynamic regret growth in terms of both loss variation and functional path length, and that the memory of the function sequence remains moderate. Experiments demonstrate the usefulness of the proposed technique for online nonlinear regression and classification problems with non-stationary data."

6 weeks ago by cshalizi

[1810.01509] Hierarchical community detection by recursive partitioning

6 weeks ago by cshalizi

"The problem of community detection in networks is usually formulated as finding a single partition of the network into some "correct" number of communities. We argue that it is more interpretable and in some regimes more accurate to construct a hierarchical tree of communities instead. This can be done with a simple top-down recursive partitioning algorithm, starting with a single community and separating the nodes into two communities by spectral clustering repeatedly, until a stopping rule suggests there are no further communities. This class of algorithms is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule. We show that there are regimes where this approach outperforms K-way spectral clustering, and propose a natural framework for analyzing the algorithm's theoretical performance, the binary tree stochastic block model. Under this model, we prove that the algorithm correctly recovers the entire community tree under relatively mild assumptions. We also apply the algorithm to a dataset of statistics papers to construct a hierarchical tree of statistical research communities."

to:NB
community_discovery
statistics
network_data_analysis
levina.elizaveta
sarkar.purna
bickel.peter_j.
6 weeks ago by cshalizi
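--- One level of the recursion, sketched on a two-block stochastic block model (my illustration; the actual algorithm recurses on each half until a stopping rule fires): split by the sign of the second eigenvector of the adjacency matrix.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-block stochastic block model: 30 + 30 nodes.
n = 60
z = np.repeat([0, 1], 30)                     # true communities
p_in, p_out = 0.5, 0.05
prob = np.where(z[:, None] == z[None, :], p_in, p_out)
upper = np.triu(rng.random((n, n)) < prob, k=1)
A = (upper + upper.T).astype(float)           # symmetric, no self-loops

# Spectral bipartition: the eigenvector for the second-largest
# eigenvalue separates the two blocks by sign.
vals, vecs = np.linalg.eigh(A)                # eigenvalues in ascending order
split = (vecs[:, -2] > 0).astype(int)
```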

[1905.03806] Think Globally, Act Locally: A Deep Neural Network Approach to High-Dimensional Time Series Forecasting

6 weeks ago by cshalizi

"Forecasting high-dimensional time series plays a crucial role in many applications such as demand forecasting and financial predictions. Modern datasets can have millions of correlated time-series that evolve together, i.e they are extremely high dimensional (one dimension for each individual time-series). There is a need for exploiting global patterns and coupling them with local calibration for better prediction. However, most recent deep learning approaches in the literature are one-dimensional, i.e, even though they are trained on the whole dataset, during prediction, the future forecast for a single dimension mainly depends on past values from the same dimension. In this paper, we seek to correct this deficiency and propose DeepGLO, a deep forecasting model which thinks globally and acts locally. In particular, DeepGLO is a hybrid model that combines a global matrix factorization model regularized by a temporal convolution network, along with another temporal network that can capture local properties of each time-series and associated covariates. Our model can be trained effectively on high-dimensional but diverse time series, where different time series can have vastly different scales, without a priori normalization or rescaling. Empirical results demonstrate that DeepGLO can outperform state-of-the-art approaches; for example, we see more than 25% improvement in WAPE over other methods on a public dataset that contains more than 100K-dimensional time series."

to:NB
time_series
statistics
prediction
6 weeks ago by cshalizi

[1910.12207] An Active Approach for Model Interpretation

6 weeks ago by cshalizi

"Model interpretation, or explanation of a machine learning classifier, aims to extract generalizable knowledge from a trained classifier into a human-understandable format, for various purposes such as model assessment, debugging and trust. From a computaional viewpoint, it is formulated as approximating the target classifier using a simpler interpretable model, such as rule models like a decision set/list/tree. Often, this approximation is handled as standard supervised learning and the only difference is that the labels are provided by the target classifier instead of ground truth. This paradigm is particularly popular because there exists a variety of well-studied supervised algorithms for learning an interpretable classifier. However, we argue that this paradigm is suboptimal for it does not utilize the unique property of the model interpretation problem, that is, the ability to generate synthetic instances and query the target classifier for their labels. We call this the active-query property, suggesting that we should consider model interpretation from an active learning perspective. Following this insight, we argue that the active-query property should be employed when designing a model interpretation algorithm, and that the generation of synthetic instances should be integrated seamlessly with the algorithm that learns the model interpretation. In this paper, we demonstrate that by doing so, it is possible to achieve more faithful interpretation with simpler model complexity. As a technical contribution, we present an active algorithm Active Decision Set Induction (ADS) to learn a decision set, a set of if-else rules, for model interpretation. ADS performs a local search over the space of all decision sets. In every iteration, ADS computes confidence intervals for the value of the objective function of all local actions and utilizes active-query to determine the best one."

to:NB
experimental_design
statistics
data_mining
6 weeks ago by cshalizi

[1905.13195] Modeling Uncertainty by Learning a Hierarchy of Deep Neural Connections

6 weeks ago by cshalizi

"Modeling uncertainty in deep neural networks, despite recent important advances, is still an open problem. Bayesian neural networks are a powerful solution, where the prior over network weights is a design choice, often a normal distribution or other distribution encouraging sparsity. However, this prior is agnostic to the generative process of the input data, which might lead to unwarranted generalization for out-of-distribution tested data. We suggest the presence of a confounder for the relation between the input data and the discriminative function given the target label. We propose an approach for modeling this confounder by sharing neural connectivity patterns between the generative and discriminative networks. This approach leads to a new deep architecture, where networks are sampled from the posterior of local causal structures, and coupled into a compact hierarchy. We demonstrate that sampling networks from this hierarchy, proportionally to their posterior, is efficient and enables estimating various types of uncertainties. Empirical evaluations of our method demonstrate significant improvement compared to state-of-the-art calibration and out-of-distribution detection methods."

--- This seems totally ad-hoc.

to:NB
uncertainty_for_neural_networks
statistics
to_be_shot_after_a_fair_trial
6 weeks ago by cshalizi

[1902.02979] Fair Decisions Despite Imperfect Predictions

6 weeks ago by cshalizi

"Consequential decisions are increasingly informed by sophisticated data-driven predictive models. However, consistently learning accurate predictive models requires access to ground truth labels. Unfortunately, in practice, labels may only exist conditional on certain decisions---if a loan is denied, there is not even an option for the individual to pay back the loan. In this paper, we show that, in this selective labels setting, learning to predict is suboptimal in terms of both fairness and utility. To avoid this undesirable behavior, we propose to directly learn stochastic decision policies that maximize utility under fairness constraints. In the context of fair machine learning, our results suggest the need for a paradigm shift from "learning to predict" to "learning to decide". Experiments on synthetic and real-world data illustrate the favorable properties of learning to decide, in terms of both utility and fairness."

to:NB
missing_data
classifiers
algorithmic_fairness
statistics
to_teach:data-mining
6 weeks ago by cshalizi

[1810.11953] Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

6 weeks ago by cshalizi

"We might hope that when faced with unexpected inputs, well-designed software systems would fire off warnings. Machine learning (ML) systems, however, which depend strongly on properties of their inputs (e.g. the i.i.d. assumption), tend to fail silently. This paper explores the problem of building ML systems that fail loudly, investigating methods for detecting dataset shift, identifying exemplars that most typify the shift, and quantifying shift malignancy. We focus on several datasets and various perturbations to both covariates and label distributions with varying magnitudes and fractions of data affected. Interestingly, we show that across the dataset shifts that we explore, a two-sample-testing-based approach, using pre-trained classifiers for dimensionality reduction, performs best. Moreover, we demonstrate that domain-discriminating approaches tend to be helpful for characterizing shifts qualitatively and determining if they are harmful."

to:NB
dataset_shift
machine_learning
model_checking
lipton.zachary
two-sample_tests
statistics
6 weeks ago by cshalizi
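--- A toy sketch of the winning recipe (mine; I use a top principal component where the paper uses pre-trained classifiers for dimensionality reduction): reduce the pooled data to one dimension, then run a permutation two-sample test on the projections.

```python
import numpy as np

rng = np.random.default_rng(5)

# "Source" data and a mean-shifted "target" sample in 20 dimensions.
d = 20
source = rng.normal(0.0, 1.0, (500, d))
target = rng.normal(0.3, 1.0, (500, d))

# Dimensionality reduction: project the pooled data onto its top
# principal component (a stand-in for a pre-trained classifier).
pooled = np.vstack([source, target])
centered = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[0]

# Permutation two-sample test on the 1-d projections.
labels = np.repeat([0, 1], 500)
def gap(lab):
    return abs(proj[lab == 0].mean() - proj[lab == 1].mean())
observed = gap(labels)
p_value = np.mean([gap(rng.permutation(labels)) >= observed
                   for _ in range(500)])
```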

[1910.12618] Textual Data for Time Series Forecasting

6 weeks ago by cshalizi

"While ubiquitous, textual sources of information such as company reports, social media posts, etc. are hardly included in prediction algorithms for time series, despite the relevant information they may contain. In this work, openly accessible daily weather reports from France and the United-Kingdom are leveraged to predict time series of national electricity consumption, average temperature and wind-speed with a single pipeline. Two methods of numerical representation of text are considered, namely traditional Term Frequency - Inverse Document Frequency (TF-IDF) as well as our own neural word embedding. Using exclusively text, we are able to predict the aforementioned time series with sufficient accuracy to be used to replace missing data. Furthermore the proposed word embeddings display geometric properties relating to the behavior of the time series and context similarity between words."

to:NB
text_mining
natural_language_processing
prediction
time_series
statistics
6 weeks ago by cshalizi
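--- A hand-rolled TF-IDF-to-series sketch (entirely my toy data and code, not the paper's pipeline): vectorize tiny invented "weather reports" and fit a ridge regression to a numeric series.

```python
import numpy as np

# Made-up miniature "weather reports" and a numeric series to predict
# (loosely, electricity demand); all six documents are invented.
docs = ["cold wind snow", "warm sun", "cold rain wind",
        "warm sun calm", "cold snow", "warm calm"]
y = np.array([9.0, 3.0, 8.0, 2.5, 8.5, 2.0])

# Hand-rolled TF-IDF: term frequency times inverse document frequency.
vocab = sorted({w for doc in docs for w in doc.split()})
tf = np.array([[doc.split().count(w) / len(doc.split()) for w in vocab]
               for doc in docs])
idf = np.log(len(docs) / (tf > 0).sum(axis=0))
X = tf * idf

# Ridge regression from text features to the series.
lam = 0.01
w = np.linalg.solve(X.T @ X + lam * np.eye(len(vocab)), X.T @ y)
pred = X @ w
```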

[1910.12478] Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes

6 weeks ago by cshalizi

"Wide neural networks with random weights and biases are Gaussian processes, as observed by Neal (1995) for shallow networks, and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks. We show that this Neural Network-Gaussian Process correspondence surprisingly extends to all modern feedforward or recurrent neural networks composed of multilayer perceptron, RNNs (e.g. LSTMs, GRUs), (nD or graph) convolution, pooling, skip connection, attention, batch normalization, and/or layer normalization. More generally, we introduce a language for expressing neural network computations, and our result encompasses all such expressible neural networks. This work serves as a tutorial on the *tensor programs* technique formulated in Yang (2019) and elucidates the Gaussian Process results obtained there. We provide open-source implementations of the Gaussian Process kernels of simple RNN, GRU, transformer, and batchnorm+ReLU network at this http URL."

to:NB
gaussian_processes
neural_networks
statistics
6 weeks ago by cshalizi

[1910.12414] Locality-Sensitive Hashing for f-Divergences: Mutual Information Loss and Beyond

6 weeks ago by cshalizi

"Computing approximate nearest neighbors in high dimensional spaces is a central problem in large-scale data mining with a wide range of applications in machine learning and data science. A popular and effective technique in computing nearest neighbors approximately is the locality-sensitive hashing (LSH) scheme. In this paper, we aim to develop LSH schemes for distance functions that measure the distance between two probability distributions, particularly for f-divergences as well as a generalization to capture mutual information loss. First, we provide a general framework to design LHS schemes for f-divergence distance functions and develop LSH schemes for the generalized Jensen-Shannon divergence and triangular discrimination in this framework. We show a two-sided approximation result for approximation of the generalized Jensen-Shannon divergence by the Hellinger distance, which may be of independent interest. Next, we show a general method of reducing the problem of designing an LSH scheme for a Krein kernel (which can be expressed as the difference of two positive definite kernels) to the problem of maximum inner product search. We exemplify this method by applying it to the mutual information loss, due to its several important applications such as model compression."

to:NB
entropy_estimation
locality-sensitive_hashing
computational_statistics
statistics
6 weeks ago by cshalizi

[1907.02306] Consistent Regression using Data-Dependent Coverings

6 weeks ago by cshalizi

"In this paper, we introduce a novel method to generate interpretable regression function estimators. The idea is based on called data-dependent coverings. The aim is to extract from the data a covering of the feature space instead of a partition. The estimator predicts the empirical conditional expectation over the cells of the partitions generated from the coverings. Thus, such estimator has the same form as those issued from data-dependent partitioning algorithms. We give sufficient conditions to ensure the consistency, avoiding the sufficient condition of shrinkage of the cells that appears in the former literature. Doing so, we reduce the number of covering elements. We show that such coverings are interpretable and each element of the covering is tagged as significant or insignificant. The proof of the consistency is based on a control of the error of the empirical estimation of conditional expectations which is interesting on its own."

to:NB
statistics
regression
nonparametrics
to_read
6 weeks ago by cshalizi

[1910.11397] The covariate-adjusted residual estimator and its use in both randomized trials and observational settings

6 weeks ago by cshalizi

"We often seek to estimate the causal effect of an exposure on a particular outcome in both randomized and observational settings. One such estimation method is the covariate-adjusted residuals estimator, which was designed for individually or cluster randomized trials. In this manuscript, we study the properties of this estimator and develop a new estimator that utilizes both covariate adjustment and inverse probability weighting We support our theoretical results with a simulation study and an application in an infectious disease setting. The covariate-adjusted residuals estimator is an efficient and unbiased estimator of the average treatment effect in randomized trials; however, it is not guaranteed to be unbiased in observational studies. Our novel estimator, the covariate-adjusted residuals estimator with inverse probability weighting, is unbiased in randomized and observational settings, under a reasonable set of assumptions. Furthermore, when these assumptions hold, it provides efficiency gains over inverse probability weighting in observational studies. The covariate-adjusted residuals estimator is valid for use in randomized trials, but should not be used in observational studies. The covariate-adjusted residuals estimator with inverse probability weighting provides an efficient alternative for use in randomized and observational settings."

--- I'll be interested to see how this differs from the usual "doubly-robust" estimator.

to:NB
causal_inference
statistics
6 weeks ago by cshalizi

[1910.11410] Almost Politically Acceptable Criminal Justice Risk Assessment

6 weeks ago by cshalizi

"In criminal justice risk forecasting, one can prove that it is impossible to optimize accuracy and fairness at the same time. One can also prove that it is impossible optimize at once all of the usual group definitions of fairness. In the policy arena, one is left with tradeoffs about which many stakeholders will adamantly disagree. In this paper, we offer a different approach. We do not seek perfectly accurate and fair risk assessments. We seek politically acceptable risk assessments. We describe and apply to data on 300,000 offenders a machine learning approach that responds to many of the most visible charges of "racial bias." Regardless of whether such claims are true, we adjust our procedures to compensate. We begin by training the algorithm on White offenders only and computing risk with test data separately for White offenders and Black offenders. Thus, the fitted algorithm structure is exactly the same for both groups; the algorithm treats all offenders as if they are White. But because White and Black offenders can bring different predictors distributions to the white-trained algorithm, we provide additional adjustments if needed. Insofar are conventional machine learning procedures do not produce accuracy and fairness that some stakeholders require, it is possible to alter conventional practice to respond explicitly to many salient stakeholder claims even if they are unsupported by the facts. The results can be a politically acceptable risk assessment tools."

to:NB
prediction
crime
algorithmic_fairness
statistics
data_mining
to_teach:data-mining
berk.richard
not_often_you_see_an_abstract_framed_as_"fine_have_it_your_way"
6 weeks ago by cshalizi

[1910.11743] Boosting heritability: estimating the genetic component of phenotypic variation with multiple sample splitting

6 weeks ago by cshalizi

"Heritability is a central measure in genetics quantifying how much of the variability observed in a trait is attributable to genetic differences. Existing methods for estimating heritability are most often based on random-effect models, typically for computational reasons. The alternative of using a fixed-effect model has received much more limited attention in the literature. In this paper, we propose a generic strategy for heritability inference, termed as \textit{"boosting heritability"}, by combining several advantageous features of different recent methods to produce an estimate of the heritability with a high-dimensional linear model. Boosting heritability uses in particular a multiple sample splitting strategy which leads to a more stable estimate. We use antibiotic resistance data from a major human pathogen, \textit{Sptreptococcus pneumoniae}, to demonstrate the applicability of our inference strategy."

to:NB
heritability
statistics
genetics
ensemble_methods
6 weeks ago by cshalizi

[1910.11445] Finite Mixtures of ERGMs for Ensembles of Networks

6 weeks ago by cshalizi

"Ensembles of networks arise in many scientific fields, but currently there are few statistical models aimed at understanding their generative processes. To fill in this gap, we propose characterizing network ensembles via finite mixtures of exponential family random graph models, employing a Metropolis-within-Gibbs algorithm to conduct Bayesian inference. Simulation studies show that the proposed procedure can recover the true cluster assignments and cluster-specific parameters. We demonstrate the utility of the proposed approach using an ensemble of political co-voting networks among U.S. Senators."

to:NB
exponential_family_random_graphs
ensemble_methods
network_data_analysis
statistics
butts.carter
to_teach:baby-nets
6 weeks ago by cshalizi

[1910.11537] Unified model selection approach based on minimum description length principle in Granger causality analysis

6 weeks ago by cshalizi

"Granger causality analysis (GCA) provides a powerful tool for uncovering the patterns of brain connectivity mechanism using neuroimaging techniques. Conventional GCA applies two different mathematical theories in a two-stage scheme: (1) the Bayesian information criterion (BIC) or Akaike information criterion (AIC) for the regression model orders associated with endogenous and exogenous information; (2) F-statistics for determining the causal effects of exogenous variables. While specifying endogenous and exogenous effects are essentially the same model selection problem, this could produce different benchmarks in the two stages and therefore degrade the performance of GCA. In this course, we present a unified model selection approach based on the minimum description length (MDL) principle for GCA in the context of the general regression model paradigm. Compared with conventional methods, our approach emphasize that a single mathematical theory should be held throughout the GCA process. Under this framework, all candidate models within the model space might be compared freely in the context of the code length, without the need for an intermediate model. We illustrate its advantages over conventional two-stage GCA approach in a 3-node network and a 5-node network synthetic experiments. The unified model selection approach is capable of identifying the actual connectivity while avoiding the false influences of noise. More importantly, the proposed approach obtained more consistent results in a challenge fMRI dataset for causality investigation, mental calculation network under visual and auditory stimulus, respectively. The proposed approach has potential to accommodate other Granger causality representations in other function space. The comparison between different GC representations in different function spaces can also be naturally deal with in the framework."

to:NB
time_series
granger_causality
minimum_description_length
statistics
6 weeks ago by cshalizi

[1910.11540] Descriptive Dimensionality and Its Characterization of MDL-based Learning and Change Detection

6 weeks ago by cshalizi

"This paper introduces a new notion of dimensionality of probabilistic models from an information-theoretic view point. We call it the "descriptive dimension"(Ddim). We show that Ddim coincides with the number of independent parameters for the parametric class, and can further be extended to real-valued dimensionality when a number of models are mixed. The paper then derives the rate of convergence of the MDL (Minimum Description Length) learning algorithm which outputs a normalized maximum likelihood (NML) distribution with model of the shortest NML codelength. The paper proves that the rate is governed by Ddim. The paper also derives error probabilities of the MDL-based test for multiple model change detection. It proves that they are also governed by Ddim. Through the analysis, we demonstrate that Ddim is an intrinsic quantity which characterizes the performance of the MDL-based learning and change detection."

to:NB
minimum_description_length
statistics
estimation
prediction
change-point_problem
6 weeks ago by cshalizi

[1910.11492] Causal inference for climate change events from satellite image time series using computer vision and deep learning

6 weeks ago by cshalizi

"We propose a method for causal inference using satellite image time series, in order to determine the treatment effects of interventions which impact climate change, such as deforestation. Simply put, the aim is to quantify the 'before versus after' effect of climate related human driven interventions, such as urbanization; as well as natural disasters, such as hurricanes and forest fires. As a concrete example, we focus on quantifying forest tree cover change/ deforestation due to human led causes. The proposed method involves the following steps. First, we uae computer vision and machine learning/deep learning techniques to detect and quantify forest tree coverage levels over time, at every time epoch. We then look at this time series to identify changepoints. Next, we estimate the expected (forest tree cover) values using a Bayesian structural causal model and projecting/forecasting the counterfactual. This is compared to the values actually observed post intervention, and the difference in the two values gives us the effect of the intervention (as compared to the non intervention scenario, i.e. what would have possibly happened without the intervention). As a specific use case, we analyze deforestation levels before and after the hyperinflation event (intervention) in Brazil (which ended in 1993-94), for the Amazon rainforest region, around Rondonia, Brazil. For this deforestation use case, using our causal inference framework can help causally attribute change/reduction in forest tree cover and increasing deforestation rates due to human activities at various points in time."

to:NB
causal_inference
climate_change
statistics
spatio-temporal_statistics
to_teach:data_over_space_and_time
6 weeks ago by cshalizi

[1910.12343] Disentangling synchrony from serial dependency in paired event time series

6 weeks ago by cshalizi

"Quantifying synchronization phenomena based on the timing of events has recently attracted a great deal of interest in various disciplines such as neuroscience or climatology. A multitude of similarity measures has been proposed for thus purpose, including Event Synchronization (ES) and Event Coincidence Analysis (ECA) as two widely applicable examples. While ES defines synchrony in a data adaptive local way that does not distinguish between different time scales, ECA requires selecting a specific scale for analysis. In this paper, we use slightly modified versions of both ES and ECA that address previous issues with respect to proper normalization and boundary treatment, which are particularly relevant for short time series with low temporal resolution. By numerically studying threshold crossing events in coupled autoregressive processes, we identify a practical limitation of ES when attempting to study synchrony between serially dependent event sequences exhibiting event clustering in time. Practical implications of this observation are demonstrated for the case of functional network representations of climate extremes based on both ES and ECA, while no marked differences between both measures are observed for the case of epileptic electroencephalogram (EEG) data. Our findings suggest that careful event detection along with diligent preprocessing is recommended when applying ES while less crucial for ECA. Despite the lack of a general \emph{modus operandi} for both, event and synchronization detection, we suggest ECA as a widely robust method, especially for time resolved synchronization analyses of event time series from various disciplines."

to:NB
synchronization
time_series
statistics
to_teach:data_over_space_and_time
6 weeks ago by cshalizi

[1910.04851] Addressing Failure Prediction by Learning Model Confidence

6 weeks ago by cshalizi

"Assessing reliably the confidence of a deep neural network and predicting its failures is of primary importance for the practical deployment of these models. In this paper, we propose a new target criterion for model confidence, corresponding to the True Class Probability (TCP). We show how using the TCP is more suited than relying on the classic Maximum Class Probability (MCP). We provide in addition theoretical guarantees for TCP in the context of failure prediction. Since the true class is by essence unknown at test time, we propose to learn TCP criterion on the training set, introducing a specific learning scheme adapted to this context. Extensive experiments are conducted for validating the relevance of the proposed approach. We study various network architectures, small and large scale datasets for image classification and semantic segmentation. We show that our approach consistently outperforms several strong methods, from MCP to Bayesian uncertainty, as well as recent approaches specifically designed for failure prediction."

to:NB
prediction
uncertainty_for_neural_networks
statistics
6 weeks ago by cshalizi

[1810.11571] Analysis of KNN Information Estimators for Smooth Distributions

6 weeks ago by cshalizi

"KSG mutual information estimator, which is based on the distances of each sample to its k-th nearest neighbor, is widely used to estimate mutual information between two continuous random variables. Existing work has analyzed the convergence rate of this estimator for random variables whose densities are bounded away from zero in its support. In practice, however, KSG estimator also performs well for a much broader class of distributions, including not only those with bounded support and densities bounded away from zero, but also those with bounded support but densities approaching zero, and those with unbounded support. In this paper, we analyze the convergence rate of the error of KSG estimator for smooth distributions, whose support of density can be both bounded and unbounded. As KSG mutual information estimator can be viewed as an adaptive recombination of KL entropy estimators, in our analysis, we also provide convergence analysis of KL entropy estimator for a broad class of distributions."

to:NB
entropy_estimation
statistics
6 weeks ago by cshalizi

Audits as Evidence

6 weeks ago by cshalizi

"We develop tools for utilizing correspondence experiments to detect illegal discrimination by individual employers. Employers violate US employment law if their propensity to contact applicants depends on protected characteristics such as race or sex. We establish identification of higher moments of the causal effects of protected characteristics on callback rates as a function of the number of fictitious applications sent to each job ad. These moments are used to bound the fraction of jobs that illegally discriminate. Applying our results to three experimental datasets, we find evidence of significant employer heterogeneity in discriminatory behavior, with the standard deviation of gaps in job-specific callback probabilities across protected groups averaging roughly twice the mean gap. In a recent experiment manipulating racially distinctive names, we estimate that at least 85% of jobs that contact both of two white applications and neither of two black applications are engaged in illegal discrimination. To assess more carefully the tradeoff between type I and II errors presented by these behavioral patterns, we consider the performance of a series of decision rules for investigating suspicious callback behavior under a simple two-type model that rationalizes the experimental data. Though, in our preferred specification, only 17% of employers are estimated to discriminate on the basis of race, we find that an experiment sending 10 applications to each job would enable accurate detection of 7-10% of discriminators while falsely accusing fewer than 0.2% of non-discriminators. A minimax decision rule acknowledging partial identification of the joint distribution of callback rates yields higher error rates but more investigations than our baseline two-type model. Our results suggest illegal labor market discrimination can be reliably monitored with relatively small modifications to existing audit designs."

to:NB
racism
sociology
statistics
audit_studies
via:jbdelong
6 weeks ago by cshalizi

[1910.12327] A simple measure of conditional dependence

6 weeks ago by cshalizi

"We propose a coefficient of conditional dependence between two random variables Y and Z given a set of other variables X1,…,Xp, based on an i.i.d. sample. The coefficient has a long list of desirable properties, the most important of which is that under absolutely no distributional assumptions, it converges to a limit in [0,1], where the limit is 0 if and only if Y and Z are conditionally independent given X1,…,Xp, and is 1 if and only if Y is equal to a measurable function of Z given X1,…,Xp. Using this statistic, we devise a new variable selection algorithm, called Feature Ordering by Conditional Independence (FOCI), which is model-free, has no tuning parameters, and is provably consistent under sparsity assumptions. A number of applications to synthetic and real datasets are worked out."

to:NB
dependence_measures
statistics
variable_selection
6 weeks ago by cshalizi

[1711.02141] Optimal rates of entropy estimation over Lipschitz balls

6 weeks ago by cshalizi

"We consider the problem of minimax estimation of the entropy of a density over Lipschitz balls. Dropping the usual assumption that the density is bounded away from zero, we obtain the minimax rates (nlnn)−s/(s+d)+n−1/2 for 0<s≤2 for densities supported on [0,1]d, where s is the smoothness parameter and n is the number of independent samples. We generalize the results to densities with unbounded support: given an Orlicz functions Ψ of rapid growth (such as the sub-exponential and sub-Gaussian classes), the minimax rates for densities with bounded Ψ-Orlicz norm increase to (nlnn)−s/(s+d)(Ψ−1(n))d(1−d/p(s+d))+n−1/2, where p is the norm parameter in the Lipschitz ball. We also show that the integral-form plug-in estimators with kernel density estimates fail to achieve the minimax rates, and characterize their worst case performances over the Lipschitz ball.

"One of the key steps in analyzing the bias relies on a novel application of the Hardy-Littlewood maximal inequality, which also leads to a new inequality on the Fisher information that may be of independent interest."

to:NB
entropy_estimation
minimax
statistics
"One of the key steps in analyzing the bias relies on a novel application of the Hardy-Littlewood maximal inequality, which also leads to a new inequality on the Fisher information that may be of independent interest."

6 weeks ago by cshalizi

[1910.12128] Joint Analysis of Social and Item Response Networks with Latent Space Models

6 weeks ago by cshalizi

"The adjustment of students to a school environment is fundamentally linked to the friendship networks they form with their peers. Consequently, the complete picture of a student's adjustment can only be obtained by taking into account both their friendship network and their reported perceptions of the school environment. However, there is a lack of flexible statistical models and methods that can jointly analyze a social network with an item-response data matrix. In this paper, we propose a latent space model for heterogeneous (multimodal) networks (LSMH) and its extension LSMH-Item, which combine the framework of latent space modeling in network analysis with item response theory in psychometrics. Using LSMH, we summarize the information from the social network and the item responses in a person-item joint latent space. We use a Variational Bayesian Expectation-Maximization estimation algorithm to estimate the item and person locations in the joint latent space. This methodology allows for effective integration, informative visualization, and prediction of social networks and item responses. We apply the proposed methodology to data collected from 16 third-grade classrooms comprised of 451 third-grade students' self-reported friendships and school liking, which were collected as part of the Early Learning Ohio project. Through the person-item joint latent space, we are able to identify students with potential adjustment difficulties and found consistent connection between students' friendship networks and their well-being. We believe that using LSMH, researchers will be able to easily identify students in need of intervention and revolutionize the understanding of social behaviors."

to:NB
social_networks
network_data_analysis
psychometrics
statistics
6 weeks ago by cshalizi

[1602.02201] The Rate-Distortion Risk in Estimation from Compressed Data

6 weeks ago by cshalizi

"We consider the problem of estimating a latent signal from a lossy compressed version of the data. We assume that the data is generated by an underlying signal and compressed using a lossy compression scheme that is agnostic to this signal. In reconstruction, the underlying signal is estimated so as to minimize a prescribed loss measure. For the above setting and an arbitrary distortion measure between the data and its compressed version, we define the rate-distortion (RD) risk of an estimator as its risk with respect to the distribution achieving Shannon's RD function with respect to this distortion. We derive conditions under which the RD risk describes the risk in estimating from the compressed data. The main theoretical tools to obtain these conditions are transportation-cost inequalities in conjunction with properties of source codes achieving Shannon's RD function. We show that these conditions hold in various settings, including settings where the alphabet of the underlying signal is finite or when the RD achieving distribution is multivariate normal. We evaluate the RD risk in special cases under these settings. This risk provides an achievable loss in compress-and-estimate settings, i.e., when the data is first compressed, communicated or stored using a procedure that is agnostic to the underlying signal, which is later estimated from the compressed version of the data. Our results imply the following general procedure for designing estimators from datasets undergoing lossy compression without specifying the actual compression technique; train the estimator based on a perturbation of the data according to the RD achieving distribution. Under general conditions, this estimator achieves the RD risk when applied to the lossy compressed version of the data."

to:NB
estimation
information_theory
compression
statistics
6 weeks ago by cshalizi

[1910.11299] Fraud Detection in Networks: State-of-the-art

6 weeks ago by cshalizi

"Financial fraud detection represents the challenge of finding anomalies in networks of financial transactions. In general, the anomaly detection (AD) is the problem of distinguishing between normal data samples with well defined patterns or signatures and those that do not conform to the expected profiles. The fraudulent behaviour in money laundering may manifest itself through unusual patterns in financial transaction networks. In such networks, nodes represents customers and the edges are transactions: a directed edge between two nodes illustrates that there is a money transfer in the respective direction, where the weight on the edge is the transferred amount. In this paper we present a survey on the fundamental anomaly detection techniques and then present briefly the relevant literature in connection with fraud detection context."

to:NB
fraud
classifiers
network_data_analysis
relational_learning
statistics
6 weeks ago by cshalizi

[1910.10871] Preventing Adversarial Use of Datasets through Fair Core-Set Construction

6 weeks ago by cshalizi

"We propose improving the privacy properties of a dataset by publishing only a strategically chosen "core-set" of the data containing a subset of the instances. The core-set allows strong performance on primary tasks, but forces poor performance on unwanted tasks. We give methods for both linear models and neural networks and demonstrate their efficacy on data."

data_mining
computational_statistics
statistics
prediction
compression
6 weeks ago by cshalizi

[1909.10831] Entropy from Machine Learning

6 weeks ago by cshalizi

"We translate the problem of calculating the entropy of a set of binary configurations/signals into a sequence of supervised classification tasks. Subsequently, one can use virtually any machine learning classification algorithm for computing entropy. This procedure can be used to compute entropy, and consequently the free energy directly from a set of Monte Carlo configurations at a given temperature. As a test of the proposed method, using an off-the-shelf machine learning classifier we reproduce the entropy and free energy of the 2D Ising model from Monte Carlo configurations at various temperatures throughout its phase diagram. Other potential applications include computing the entropy of spiking neurons or any other multidimensional binary signals."

to:NB
classifiers
entropy_estimation
statistics
to_be_shot_after_a_fair_trial
6 weeks ago by cshalizi

[1910.10382] How well can we learn large factor models without assuming strong factors?

6 weeks ago by cshalizi

"In this paper, we consider the problem of learning models with a latent factor structure. The focus is to find what is possible and what is impossible if the usual strong factor condition is not imposed. We study the minimax rate and adaptivity issues in two problems: pure factor models and panel regression with interactive fixed effects. For pure factor models, if the number of factors is known, we develop adaptive estimation and inference procedures that attain the minimax rate. However, when the number of factors is not specified a priori, we show that there is a tradeoff between validity and efficiency: any confidence interval that has uniform validity for arbitrary factor strength has to be conservative; in particular its width is bounded away from zero even when the factors are strong. Conversely, any data-driven confidence interval that does not require as an input the exact number of factors (including weak ones) and has shrinking width under strong factors does not have uniform coverage and the worst-case coverage probability is at most 1/2. For panel regressions with interactive fixed effects, the tradeoff is much better. We find that the minimax rate for learning the regression coefficient does not depend on the factor strength and propose a simple estimator that achieves this rate."

to:NB
factor_analysis
high-dimensional_statistics
confidence_sets
statistics
6 weeks ago by cshalizi

[1906.06583] Linear regression with stationary errors: the R package slm

6 weeks ago by cshalizi

"This paper introduces the R package slm which stands for Stationary Linear Models. The package contains a set of statistical procedures for linear regression in the general context where the error process is strictly stationary with short memory. We work in the setting of Hannan (1973), who proved the asymptotic normality of the (normalized) least squares estimators (LSE) under very mild conditions on the error process. We propose different ways to estimate the asymptotic covariance matrix of the LSE, and then to correct the type I error rates of the usual tests on the parameters (as well as confidence intervals). The procedures are evaluated through different sets of simulations, and two examples of real datasets are studied."

to:NB
linear_regression
time_series
statistics
to_teach:data_over_space_and_time
to_teach:linear_models
6 weeks ago by cshalizi

[1910.02301] Change Detection in Noisy Dynamic Networks: A Spectral Embedding Approach

6 weeks ago by cshalizi

"Change detection in dynamic networks is an important problem in many areas, such as fraud detection, cyber intrusion detection and health care monitoring. It is a challenging problem because it involves a time sequence of graphs, each of which is usually very large and sparse with heterogeneous vertex degrees, resulting in a complex, high dimensional mathematical object. Spectral embedding methods provide an effective way to transform a graph to a lower dimensional latent Euclidean space that preserves the underlying structure of the network. Although change detection methods that use spectral embedding are available, they do not address sparsity and degree heterogeneity that usually occur in noisy real-world graphs and a majority of these methods focus on changes in the behaviour of the overall network.

"In this paper, we adapt previously developed techniques in spectral graph theory and propose a novel concept of applying Procrustes techniques to embedded points for vertices in a graph to detect changes in entity behaviour. Our spectral embedding approach not only addresses sparsity and degree heterogeneity issues, but also obtains an estimate of the appropriate embedding dimension. We call this method CDP (change detection using Procrustes analysis). We demonstrate the performance of CDP through extensive simulation experiments and a real-world application. CDP successfully detects various types of vertex-based changes including (i) changes in vertex degree, (ii) changes in community membership of vertices, and (iii) unusual increase or decrease in edge weight between vertices. The change detection performance of CDP is compared with two other baseline methods that employ alternative spectral embedding approaches. In both cases, CDP generally shows superior performance."

to:NB
network_data_analysis
spectral_methods
re:network_differences
change-point_problem
statistics
6 weeks ago by cshalizi
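
The core idea in the abstract above — embed each graph snapshot spectrally, align the two embeddings with orthogonal Procrustes, and flag the vertices that moved the most — can be sketched as follows. This is an illustrative toy, not the paper's CDP implementation; the embedding dimension is fixed by hand rather than estimated, and the SBM example is my own.

```python
# Sketch (not the paper's CDP): align spectral embeddings of two graph
# snapshots via orthogonal Procrustes; large per-vertex displacement
# suggests a behavioural change at that vertex.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def spectral_embedding(adj, dim):
    """Embed vertices via the top-|eigenvalue| eigenvectors of the adjacency matrix."""
    vals, vecs = np.linalg.eigh(adj)
    idx = np.argsort(np.abs(vals))[::-1][:dim]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def vertex_change_scores(adj_old, adj_new, dim=2):
    """Per-vertex displacement after Procrustes alignment of the two embeddings."""
    X = spectral_embedding(adj_old, dim)
    Y = spectral_embedding(adj_new, dim)
    R, _ = orthogonal_procrustes(Y, X)   # rotation/reflection aligning Y to X
    return np.linalg.norm(Y @ R - X, axis=1)

# Toy example: a two-block stochastic block model; vertex 0 switches community.
rng = np.random.default_rng(0)
n = 40
labels = np.repeat([0, 1], n // 2)

def sample_sbm(labels):
    p = np.where(labels[:, None] == labels[None, :], 0.5, 0.05)
    a = (rng.random((n, n)) < p).astype(float)
    a = np.triu(a, 1)
    return a + a.T

A0 = sample_sbm(labels)
labels2 = labels.copy()
labels2[0] = 1                       # vertex 0 changes community membership
A1 = sample_sbm(labels2)
scores = vertex_change_scores(A0, A1)   # vertex 0 should typically score highest
```

The Procrustes step matters because independent spectral embeddings are only identified up to an orthogonal transformation; without alignment, every vertex would appear to move.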

How to Use Economic Theory to Improve Estimators: Shrinking Toward Theoretical Restrictions | The Review of Economics and Statistics | MIT Press Journals

6 weeks ago by cshalizi

"We propose to use economic theories to construct shrinkage estimators that perform well when the theories' empirical implications are approximately correct but perform no worse than unrestricted estimators when the theories' implications do not hold. We implement this construction in various settings, including labor demand and wage inequality, and estimation of consumer demand. We provide asymptotic and finite sample characterizations of the behavior of the proposed estimators. Our approach is an alternative to the use of theory as something to be tested or to be imposed on estimates. Our approach complements uses of theory for identification and extrapolation."

to:NB
economics
econometrics
statistics
estimation
shrinkage
6 weeks ago by cshalizi
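
The logic described above — do well when the theory's implications approximately hold, and no worse than the unrestricted estimator when they don't — can be illustrated with a generic Stein-type construction (this is my own toy example, not the authors' estimator): shrink an unrestricted estimate toward a theory-implied restriction with a data-driven weight that vanishes when the data contradict the restriction.

```python
# Sketch (not the paper's estimator): shrink an unrestricted mean toward a
# hypothetical theoretical restriction (here: all coordinates equal) with a
# positive-part Stein weight. When the restriction nearly holds, shrinkage
# lowers risk; when it is badly violated, the weight goes to zero and we
# fall back to the unrestricted estimate.
import numpy as np

def shrink_toward_restriction(x, sigma2):
    """Stein-type shrinkage of the row-mean of x toward 'all coordinates equal'."""
    n, d = x.shape
    theta_hat = x.mean(axis=0)                 # unrestricted estimate
    theta_res = np.full(d, theta_hat.mean())   # restricted (theory-implied) estimate
    dist2 = np.sum((theta_hat - theta_res) ** 2)
    # positive-part Stein weight on the deviation from the restriction
    w = max(0.0, 1.0 - (d - 3) * sigma2 / (n * dist2))
    return theta_res + w * (theta_hat - theta_res)

rng = np.random.default_rng(1)
d, n, sigma2 = 20, 50, 1.0
theta = np.full(d, 2.0) + 0.05 * rng.standard_normal(d)  # restriction ~holds
x = theta + rng.standard_normal((n, d))
est = shrink_toward_restriction(x, sigma2)
mse_shrunk = np.mean((est - theta) ** 2)
mse_plain = np.mean((x.mean(axis=0) - theta) ** 2)       # unrestricted MSE
```

Because the true means here are only approximately equal, the shrinkage estimator typically beats the unrestricted sample mean; replacing `theta` with widely dispersed means drives the weight `w` toward 1 and the two estimators coincide.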

[1910.10174] Leveraging directed causal discovery to detect latent common causes

6 weeks ago by cshalizi

"The discovery of causal relationships is a fundamental problem in science and medicine. In recent years, many elegant approaches to discovering causal relationships between two variables from uncontrolled data have been proposed. However, most of these deal only with purely directed causal relationships and cannot detect latent common causes. Here, we devise a general method which takes a purely directed causal discovery algorithm and modifies it so that it can also detect latent common causes. The identifiability of the modified algorithm depends on the identifiability of the original, as well as an assumption that the strength of noise be relatively small. We apply our method to two directed causal discovery algorithms, the Information Geometric Causal Inference of (Daniusis et al., 2010) and the Kernel Conditional Deviance for Causal Inference of (Mitrovic, Sejdinovic, and Teh, 2018), and extensively test on synthetic data---detecting latent common causes in additive, multiplicative and complex noise regimes---and on real data, where we are able to detect known common causes. In addition to detecting latent common causes, our experiments demonstrate that both modified algorithms preserve the performance of the original directed algorithm in distinguishing directed causal relations."

to:NB
causal_discovery
inference_to_latent_objects
statistics
6 weeks ago by cshalizi

Anonymous Heterogeneous Distributed Detection: Optimal Decision Rules, Error Exponents, and the Price of Anonymity - IEEE Journals & Magazine

6 weeks ago by cshalizi

"We explore the fundamental limits of heterogeneous distributed detection in an anonymous sensor network with $n$ sensors and a single fusion center. The fusion center collects the single observation from each of the $n$ sensors to detect a binary parameter. The sensors are clustered into multiple groups, and different groups follow different distributions under a given hypothesis. The key challenge for the fusion center is the anonymity of sensors—although it knows the exact number of sensors and the distribution of observations in each group, it does not know which group each sensor belongs to. It is hence natural to consider it as a composite hypothesis testing problem. First, we propose an optimal test called mixture likelihood ratio test, which is a randomized threshold test based on the ratio of the uniform mixture of all the possible distributions under one hypothesis to that under the other hypothesis. Optimality is shown by first arguing that there exists an optimal test that is symmetric, that is, it does not depend on the order of observations across the sensors, and then proving that the mixture likelihood ratio test is optimal among all symmetric tests. Second, we focus on the Neyman–Pearson setting and characterize the error exponent of the worst-case type-II error probability as $n$ tends to infinity, assuming the number of sensors in each group is proportional to $n$. Finally, we generalize our result to find the collection of all achievable type-I and type-II error exponents, showing that the boundary of the region can be obtained by solving an optimization problem. Our results elucidate the price of anonymity in heterogeneous distributed detection, and can be extended to $M$-ary hypothesis testing with heterogeneous observations generated according to hidden latent variables."

to:NB
hypothesis_testing
distributed_systems
statistics
information_theory
6 weeks ago by cshalizi
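
A simplified version of the mixture likelihood ratio test idea can be sketched as follows. Caveat: the paper mixes uniformly over sensor-to-group assignments with fixed group sizes, whereas this toy treats group labels as i.i.d. with known proportions, so each anonymous observation has a known mixture density under each hypothesis; the Gaussian groups and all numbers are my own illustration.

```python
# Simplified sketch of a mixture likelihood ratio test for anonymous sensors
# (assumptions: two groups with known proportions, unit-variance Gaussian
# observations; labels treated as i.i.d. rather than fixed-count as in the paper).
import numpy as np
from scipy.stats import norm

def mixture_llr(x, props, means0, means1, sd=1.0):
    """Log-likelihood ratio of H1 vs H0 with group labels marginalized out."""
    def mix_logpdf(x, means):
        dens = sum(p * norm.pdf(x, loc=m, scale=sd) for p, m in zip(props, means))
        return np.log(dens)
    return np.sum(mix_logpdf(x, means1) - mix_logpdf(x, means0))

rng = np.random.default_rng(2)
props = [0.6, 0.4]                             # known group proportions
means0, means1 = [0.0, 1.0], [2.0, 3.0]        # group means under H0 / H1
labels = rng.random(200) < props[1]            # hidden group membership
x_h0 = np.where(labels, means0[1], means0[0]) + rng.standard_normal(200)
x_h1 = np.where(labels, means1[1], means1[0]) + rng.standard_normal(200)
# Decide H1 when the statistic exceeds a threshold (0 for equal priors).
llr_null = mixture_llr(x_h0, props, means0, means1)
llr_alt = mixture_llr(x_h1, props, means0, means1)
```

The statistic is symmetric in the observations, matching the paper's observation that an optimal test need not depend on sensor ordering.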

[1910.04832] Learning interaction kernels in heterogeneous systems of agents from multiple trajectories

6 weeks ago by cshalizi

"Systems of interacting particles or agents have wide applications in many disciplines such as Physics, Chemistry, Biology and Economics. These systems are governed by interaction laws, which are often unknown: estimating them from observation data is a fundamental task that can provide meaningful insights and accurate predictions of the behaviour of the agents. In this paper, we consider the inverse problem of learning interaction laws given data from multiple trajectories, in a nonparametric fashion, when the interaction kernels depend on pairwise distances. We establish a condition for learnability of interaction kernels, and construct estimators that are guaranteed to converge in a suitable L2 space, at the optimal min-max rate for 1-dimensional nonparametric regression. We propose an efficient learning algorithm based on least squares, which can be implemented in parallel for multiple trajectories and is therefore well-suited for the high dimensional, big data regime. Numerical simulations on a variety of examples, including opinion dynamics, predator-swarm dynamics and heterogeneous particle dynamics, suggest that the learnability condition is satisfied in models used in practice, and the rate of convergence of our estimator is consistent with the theory. These simulations also suggest that our estimators are robust to noise in the observations, and produce accurate predictions of dynamics over relatively large time intervals, even when they are learned from data collected in short time intervals."

to:NB
interacting_particle_systems
statistical_inference_for_stochastic_processes
statistics
6 weeks ago by cshalizi
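
The least-squares approach described above can be sketched in a minimal 1-D setting. Assumptions (mine, not the paper's exact setup): a first-order model dx_i/dt = (1/N) Σ_j φ(|x_j − x_i|)(x_j − x_i), φ approximated by a piecewise-constant basis on distance bins, velocities obtained by finite differences from a single trajectory.

```python
# Sketch: nonparametric least-squares recovery of a pairwise interaction
# kernel phi from an observed trajectory of a first-order agent system.
import numpy as np

def simulate(phi, x0, dt, steps):
    """Euler simulation of dx_i/dt = (1/N) sum_j phi(|x_j - x_i|)(x_j - x_i)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        diff = x[None, :] - x[:, None]            # diff[i, j] = x_j - x_i
        vel = (phi(np.abs(diff)) * diff).mean(axis=1)
        xs.append(x + dt * vel)
    return np.array(xs)

def fit_kernel(xs, dt, edges):
    """Least squares for the bin values of phi from one observed trajectory."""
    rows, targets = [], []
    for t in range(len(xs) - 1):
        x = xs[t]
        vel = (xs[t + 1] - x) / dt                # finite-difference velocity
        diff = x[None, :] - x[:, None]
        bins = np.digitize(np.abs(diff), edges) - 1
        for i in range(len(x)):
            row = np.zeros(len(edges) - 1)
            for j in range(len(x)):
                if 0 <= bins[i, j] < len(row):
                    row[bins[i, j]] += diff[i, j] / len(x)
            rows.append(row)
            targets.append(vel[i])
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return coef                                    # estimated phi on each bin

rng = np.random.default_rng(3)
phi_true = lambda r: np.exp(-r)                   # attraction decaying with distance
x0 = rng.uniform(-2, 2, size=12)
xs = simulate(phi_true, x0, dt=0.01, steps=200)
edges = np.linspace(0.0, 4.0, 9)                  # 8 distance bins on [0, 4]
phi_hat = fit_kernel(xs, dt=0.01, edges=edges)    # should decay across bins
```

The regression is linear in the basis coefficients, which is what makes the least-squares formulation (and its parallelization over trajectories) natural.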
