**cshalizi + computational_statistics**

[1802.08012] Learning Topic Models by Neighborhood Aggregation

21 hours ago by cshalizi

"Topic models are frequently used in machine learning owing to their high interpretability and modular structure. However, extending a topic model to include a supervisory signal, to incorporate pre-trained word embedding vectors and to include a nonlinear output function is not an easy task because one has to resort to a highly intricate approximate inference procedure. The present paper shows that topic modeling with pre-trained word embedding vectors can be viewed as implementing a neighborhood aggregation algorithm where messages are passed through a network defined over words. From the network view of topic models, nodes correspond to words in a document and edges correspond to either a relationship describing co-occurring words in a document or a relationship describing the same word in the corpus. The network view allows us to extend the model to include supervisory signals, incorporate pre-trained word embedding vectors and include a nonlinear output function in a simple manner. In experiments, we show that our approach outperforms the state-of-the-art supervised Latent Dirichlet Allocation implementation in terms of held-out document classification tasks."

to:NB
topic_models
computational_statistics
text_mining
21 hours ago by cshalizi

[1909.04791] Techniques All Classifiers Can Learn from Deep Networks: Models, Optimizations, and Regularization

3 days ago by cshalizi

"Deep neural networks have introduced novel and useful tools to the machine learning community, and other types of classifiers can make use of these tools to improve their performance and generality. This paper reviews the current state of the art for deep learning classifier technologies that are being used outside of deep neural networks. Many components of existing deep neural network architectures can be employed by non-network classifiers. In this paper, we review the feature learning, optimization, and regularization methods that form a core of deep network technologies. We then survey non-neural network learning algorithms that make innovative use of these methods to improve classification. We conclude by discussing directions that can be pursued to expand the area of deep learning for a variety of classification algorithms."

to:NB
classifiers
machine_learning
optimization
computational_statistics
statistics
neural_networks
3 days ago by cshalizi

[1909.05097] Spectral Non-Convex Optimization for Dimension Reduction with Hilbert-Schmidt Independence Criterion

3 days ago by cshalizi

"The Hilbert Schmidt Independence Criterion (HSIC) is a kernel dependence measure that has applications in various aspects of machine learning. Conveniently, the objectives of different dimensionality reduction applications using HSIC often reduce to the same optimization problem. However, the nonconvexity of the objective function arising from non-linear kernels poses a serious challenge to optimization efficiency and limits the potential of HSIC-based formulations. As a result, only linear kernels have been computationally tractable in practice. This paper proposes a spectral-based optimization algorithm that extends beyond the linear kernel. The algorithm identifies a family of suitable kernels and provides the first and second-order local guarantees when a fixed point is reached. Furthermore, we propose a principled initialization strategy, thereby removing the need to repeat the algorithm at random initialization points. Compared to state-of-the-art optimization algorithms, our empirical results on real data show a run-time improvement by as much as a factor of 10^5 while consistently achieving lower cost and classification/clustering errors. The implementation source code is publicly available on this https URL."
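
For reference, the HSIC statistic itself is short to write down. A minimal sketch of the biased empirical estimator with Gaussian kernels (the kernel choice and bandwidth here are illustrative assumptions, not the paper's optimization algorithm):

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    # Gaussian (RBF) kernel Gram matrix of a 1-d sample.
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # Biased empirical HSIC: tr(K H L H) / (n - 1)^2, where
    # H = I - (1/n) 11^T centers the two Gram matrices.
    n = x.size
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
h_indep = hsic(x, rng.normal(size=200))          # independent pair
h_dep = hsic(x, x + 0.1 * rng.normal(size=200))  # strongly dependent pair
```

Both centered Gram matrices are positive semidefinite, so the statistic is nonnegative, and it is much larger for the dependent pair.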

to:NB
kernel_methods
hilbert_space
probability
dimension_reduction
statistics
computational_statistics
spectral_methods
3 days ago by cshalizi

[1909.00835] Evolutionary reinforcement learning of dynamical large deviations

3 days ago by cshalizi

"We show how to calculate dynamical large deviations using evolutionary reinforcement learning. An agent, a stochastic model, propagates a continuous-time Monte Carlo trajectory, and receives a reward conditioned upon the values of certain path-extensive quantities. Evolution produces progressively fitter agents, allowing the calculation of a piece of a large-deviation rate function for a particular model and path-extensive quantity. For models with small state spaces the evolutionary process acts directly on rates, and for models with large state spaces the process acts on the weights of a neural network that parameterizes the model's rates. The present approach shows how path-extensive physics problems can be considered within a framework widely used in machine learning."

to:NB
reinforcement_learning
large_deviations
computational_statistics
probability
re:fitness_sampling
3 days ago by cshalizi

[1909.03320] Matrix Calculations for Moments of Markov Processes

3 days ago by cshalizi

"Matryoshka dolls, the traditional Russian nesting figurines, are known world-wide for each doll's encapsulation of a sequence of smaller dolls. In this paper, we identify a large class of Markov processes whose moments are easy to compute by exploiting the structure of a new sequence of nested matrices we call Matryoshkhan matrices. We characterize the salient properties of Matryoshkhan matrices that allow us to compute these moments in closed form at a specific time without computing the entire path of the process. This speeds up the computation of the Markov process moments significantly in comparison to traditional differential equation methods, which we demonstrate through numerical experiments. Through our method, we derive explicit expressions for both transient and steady-state moments of this class of Markov processes. We demonstrate the applicability of this method through explicit examples such as shot-noise processes, growth-collapse processes, linear birth-death-immigration processes, and affine stochastic differential equations from the finance literature. We also show that we can derive explicit expressions for the self-exciting Hawkes process, for which finding closed form moment expressions has been an open problem since its introduction."

to:NB
markov_models
stochastic_processes
computational_statistics
3 days ago by cshalizi

[1908.06437] Block Nearest Neighbor Gaussian processes for large datasets

29 days ago by cshalizi

"This work develops a valid spatial block-Nearest Neighbor Gaussian process (block-NNGP) for estimation and prediction of location-referenced large spatial datasets. The key idea behind our approach is to subdivide the spatial domain into several blocks which are dependent under some constraints. The cross-blocks capture the large-scale spatial variation, while each block captures the small-scale dependence. The block-NNGP is embedded as a sparsity-inducing prior within a hierarchical modeling framework. Markov chain Monte Carlo (MCMC) algorithms are executed without storing or decomposing large matrices, while the sparse block precision matrix is efficiently computed through parallel computing. We also consider alternate MCMC algorithms through composite sampling for faster computing time, and more reproducible Bayesian inference. The performance of the block-NNGP is illustrated using simulation studies and applications with massive real data, for locations in the order of 10^4."

to:NB
spatial_statistics
prediction
computational_statistics
statistics
to_teach:data_over_space_and_time
29 days ago by cshalizi

[1908.06936] ExaGeoStatR: A Package for Large-Scale Geostatistics in R

29 days ago by cshalizi

"Parallel computing in Gaussian process calculation becomes a necessity for avoiding computational and memory restrictions associated with Geostatistics applications. The evaluation of the Gaussian log-likelihood function requires O(n^2) storage and O(n^3) operations where n is the number of geographical locations. In this paper, we present ExaGeoStatR, a package for large-scale Geostatistics in R that supports parallel computation of the maximum likelihood function on shared memory, GPU, and distributed systems. The parallelization depends on breaking down the numerical linear algebra operations into a set of tasks and rendering them for a task-based programming model. ExaGeoStatR supports several maximum likelihood computation variants such as exact, Diagonal Super Tile (DST), and Tile Low-Rank (TLR) approximation besides providing a tool to generate large-scale synthetic datasets which can be used to test and compare different approximation methods. The package can be used directly through the R environment without any C, CUDA, or MPI knowledge. Here, we demonstrate the ExaGeoStatR package by illustrating its implementation details, analyzing its performance on various parallel architectures, and assessing its accuracy using both synthetic datasets and a sea surface temperature dataset. The performance evaluation involves spatial datasets with up to 250K observations."

to:NB
spatial_statistics
prediction
computational_statistics
R
statistics
to_teach:data_over_space_and_time
29 days ago by cshalizi

Parallelising Particle Filters with Butterfly Interactions - Heine - Scandinavian Journal of Statistics

4 weeks ago by cshalizi

"Bootstrap particle filter (BPF) is the cornerstone of many algorithms used for solving generally intractable inference problems with Hidden Markov models. The long term stability of BPF arises from particle interactions that typically make parallel implementations of BPF nontrivial.

"We propose a method whereby the particle interaction is done in several stages. With the proposed method, full interaction can be accomplished even if we allow only pairwise communications between processing elements at each stage. We show that our method preserves the consistency and the long term stability of the BPF, although our analysis suggests that the constraints on the stagewise interactions introduce error leading to a lower convergence rate than standard Monte Carlo. The proposed method also suggests a new, more flexible, adaptive resampling scheme, which according to our numerical experiments is the method of choice, displaying a notable gain in efficiency in certain parallel computing scenarios."

to:NB
computational_statistics
particle_filters
time_series
state-space_models
state_estimation
re:fitness_sampling

4 weeks ago by cshalizi

[1907.10597] Green AI

4 weeks ago by cshalizi

"The computations required for deep learning research have been doubling every few months, resulting in an estimated 300,000x increase from 2012 to 2018 [2]. These computations have a surprisingly large carbon footprint [38]. Ironically, deep learning was inspired by the human brain, which is remarkably energy efficient. Moreover, the financial cost of the computations can make it difficult for academics, students, and researchers, in particular those from emerging economies, to engage in deep learning research.

"This position paper advocates a practical solution by making efficiency an evaluation criterion for research alongside accuracy and related measures. In addition, we propose reporting the financial cost or "price tag" of developing, training, and running models to provide baselines for the investigation of increasingly efficient methods. Our goal is to make AI both greener and more inclusive---enabling any inspired undergraduate with a laptop to write high-quality research papers. Green AI is an emerging focus at the Allen Institute for AI."

to:NB
machine_learning
computational_statistics
smith.noah_a.
energy

4 weeks ago by cshalizi

[1507.02646] Pareto Smoothed Importance Sampling

6 weeks ago by cshalizi

"Importance weighting is a general way to adjust Monte Carlo integration to account for draws from the wrong distribution, but the resulting estimate can be noisy when the importance ratios have a heavy right tail. This routinely occurs when there are aspects of the target distribution that are not well captured by the approximating distribution, in which case more stable estimates can be obtained by modifying extreme importance ratios. We present a new method for stabilizing importance weights using a generalized Pareto distribution fit to the upper tail of the distribution of the simulated importance ratios. The method, which empirically performs better than existing methods for stabilizing importance sampling estimates, includes stabilized effective sample size estimates, Monte Carlo error estimates and convergence diagnostics."
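
The smoothing step is simple to sketch. Below, a toy version that fits a generalized Pareto distribution to the largest importance ratios by method of moments (a simplification; the paper uses a more careful estimator of the shape parameter) and replaces the tail with fitted quantiles:

```python
import numpy as np

def psis_weights(ratios, tail_frac=0.2):
    # Pareto-smooth the right tail of raw importance ratios: fit a
    # generalized Pareto distribution (GPD) to the exceedances over a
    # cutoff, then replace the tail ratios with quantiles of the fit.
    r = np.asarray(ratios, float)
    S = r.size
    M = int(min(tail_frac * S, 3 * np.sqrt(S)))  # tail size
    order = np.argsort(r)
    u = r[order[-M - 1]]                 # cutoff: (M+1)-th largest ratio
    exc = r[order[-M:]] - u              # exceedances over the cutoff
    m, v = exc.mean(), exc.var()
    k = 0.5 * (1.0 - m * m / v)          # moment estimates of GPD(k, sigma)
    sigma = m * (1.0 - k)
    q = (np.arange(1, M + 1) - 0.5) / M  # plotting positions
    gpd_quantiles = sigma / k * ((1.0 - q) ** (-k) - 1.0)
    smoothed = r.copy()
    smoothed[order[-M:]] = u + gpd_quantiles
    return smoothed, k                   # large k-hat flags an unreliable fit

rng = np.random.default_rng(1)
raw = np.exp(rng.normal(0.0, 2.0, size=1000))   # heavy-tailed toy ratios
w, k_hat = psis_weights(raw)
```

Only the M largest ratios are altered; the rest of the sample passes through untouched, and the estimated shape parameter doubles as a diagnostic.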

to:NB
monte_carlo
importance_sampling
heavy_tails
computational_statistics
statistics
re:fitness_sampling
gelman.andrew
6 weeks ago by cshalizi

[1908.01251] Measuring the Algorithmic Convergence of Randomized Ensembles: The Regression Setting

6 weeks ago by cshalizi

"When randomized ensemble methods such as bagging and random forests are implemented, a basic question arises: Is the ensemble large enough? In particular, the practitioner desires a rigorous guarantee that a given ensemble will perform nearly as well as an ideal infinite ensemble (trained on the same data). The purpose of the current paper is to develop a bootstrap method for solving this problem in the context of regression --- which complements our companion paper in the context of classification (Lopes 2019). In contrast to the classification setting, the current paper shows that theoretical guarantees for the proposed bootstrap can be established under much weaker assumptions. In addition, we illustrate the flexibility of the method by showing how it can be adapted to measure algorithmic convergence for variable selection. Lastly, we provide numerical results demonstrating that the method works well in a range of situations."

to:NB
ensemble_methods
computational_statistics
statistics
to_teach:data-mining
6 weeks ago by cshalizi

[1908.01087] Network Shrinkage Estimation

6 weeks ago by cshalizi

"Networks are a natural representation of complex systems across the sciences, and higher-order dependencies are central to the understanding and modeling of these systems. However, in many practical applications such as online social networks, networks are massive, dynamic, and naturally streaming, where pairwise interactions become available one at a time in some arbitrary order. The massive size and streaming nature of these networks allow only partial observation, since it is infeasible to analyze the entire network. Under such scenarios, it is challenging to study the higher-order structural and connectivity patterns of streaming networks. In this work, we consider the fundamental problem of estimating the higher-order dependencies using adaptive sampling. We propose a novel adaptive, single-pass sampling framework and unbiased estimators for higher-order network analysis of large streaming networks. Our algorithms exploit adaptive techniques to identify edges that are highly informative for efficiently estimating the higher-order structure of streaming networks from small sample data. We also introduce a novel James-Stein-type shrinkage estimator to minimize the estimation error. Our approach is fully analytic with theoretical guarantees, computationally efficient, and can be incrementally updated in a streaming setting. Numerical experiments on large networks show that our approach is superior to baseline methods."
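
The shrinkage ingredient is the classical one. For orientation only, a textbook positive-part James-Stein estimator (the paper's estimator is a streaming James-Stein-type construction for subgraph statistics, not this):

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    # Positive-part James-Stein estimator: shrink a d-dimensional
    # (d >= 3) noisy mean estimate toward the origin by a data-driven
    # factor, which dominates the raw estimate in total squared error.
    x = np.asarray(x, float)
    d = x.size
    factor = max(0.0, 1.0 - (d - 2) * sigma2 / float(x @ x))
    return factor * x

rng = np.random.default_rng(2)
x = rng.normal(size=10)      # one noisy observation of a zero mean vector
shrunk = james_stein(x)
```

Since the shrinkage factor lies in [0, 1), the shrunken estimate never has larger norm than the raw one.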

to:NB
network_data_analysis
computational_statistics
statistics
6 weeks ago by cshalizi

[1805.11908] Who Learns Better Bayesian Network Structures: Accuracy and Speed of Structure Learning Algorithms

6 weeks ago by cshalizi

"Three classes of algorithms to learn the structure of Bayesian networks from data are common in the literature: constraint-based algorithms, which use conditional independence tests to learn the dependence structure of the data; score-based algorithms, which use goodness-of-fit scores as objective functions to maximise; and hybrid algorithms that combine both approaches. Constraint-based and score-based algorithms have been shown to learn the same structures when conditional independence and goodness of fit are both assessed using entropy and the topological ordering of the network is known (Cowell, 2001).

"In this paper, we investigate how these three classes of algorithms perform outside the assumptions above in terms of speed and accuracy of network reconstruction for both discrete and Gaussian Bayesian networks. We approach this question by recognising that structure learning is defined by the combination of a statistical criterion and an algorithm that determines how the criterion is applied to the data. Removing the confounding effect of different choices for the statistical criterion, we find using both simulated and real-world complex data that constraint-based algorithms are often less accurate than score-based algorithms, but are seldom faster (even at large sample sizes); and that hybrid algorithms are neither faster nor more accurate than constraint-based algorithms. This suggests that commonly held beliefs on structure learning in the literature are strongly influenced by the choice of particular statistical criteria rather than just by the properties of the algorithms themselves."

to:NB
causal_discovery
graphical_models
statistics
computational_statistics

6 weeks ago by cshalizi

Scalable Visualization Methods for Modern Generalized Additive Models: Journal of Computational and Graphical Statistics

8 weeks ago by cshalizi

"In the last two decades, the growth of computational resources has made it possible to handle generalized additive models (GAMs) that formerly were too costly for serious applications. However, the growth in model complexity has not been matched by improved visualizations for model development and results presentation. Motivated by an industrial application in electricity load forecasting, we identify the areas where the lack of modern visualization tools for GAMs is particularly severe, and we address the shortcomings of existing methods by proposing a set of visual tools that (a) are fast enough for interactive use, (b) exploit the additive structure of GAMs, (c) scale to large data sets, and (d) can be used in conjunction with a wide range of response distributions. The new visual methods proposed here are implemented by the mgcViz R package, available on the Comprehensive R Archive Network. Supplementary materials for this article are available online."

to:NB
additive_models
visual_display_of_quantitative_information
computational_statistics
statistics
R
to_teach:undergrad-ADA
8 weeks ago by cshalizi

[1907.08742] Estimating the Algorithmic Variance of Randomized Ensembles via the Bootstrap

8 weeks ago by cshalizi

"Although the methods of bagging and random forests are some of the most widely used prediction methods, relatively little is known about their algorithmic convergence. In particular, there are not many theoretical guarantees for deciding when an ensemble is "large enough" --- so that its accuracy is close to that of an ideal infinite ensemble. Due to the fact that bagging and random forests are randomized algorithms, the choice of ensemble size is closely related to the notion of "algorithmic variance" (i.e. the variance of prediction error due only to the training algorithm). In the present work, we propose a bootstrap method to estimate this variance for bagging, random forests, and related methods in the context of classification. To be specific, suppose the training dataset is fixed, and let the random variable Err_t denote the prediction error of a randomized ensemble of size t. Working under a "first-order model" for randomized ensembles, we prove that the centered law of Err_t can be consistently approximated via the proposed method as t→∞. Meanwhile, the computational cost of the method is quite modest, by virtue of an extrapolation technique. As a consequence, the method offers a practical guideline for deciding when the algorithmic fluctuations of Err_t are negligible."
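
The bootstrap here resamples trained ensemble members, not data points. A toy sketch with stub base classifiers (the member pool and its accuracy are made-up illustrations, and the paper's extrapolation trick is omitted):

```python
import numpy as np

def algorithmic_variance(member_preds, y, t, n_boot=500, seed=0):
    # Bootstrap over the trained *members* (the data stay fixed):
    # resample t base classifiers with replacement, form the
    # majority-vote ensemble, record its test error; the variance of
    # these replicates estimates the algorithmic variance of Err_t.
    rng = np.random.default_rng(seed)
    B, n = member_preds.shape
    errs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, B, size=t)
        vote = member_preds[idx].mean(axis=0) > 0.5
        errs[b] = np.mean(vote != y)
    return errs.var()

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=100)
# stub base classifiers: each is independently right on ~70% of points
members = np.where(rng.random((50, 100)) < 0.7, y, 1 - y)
v5 = algorithmic_variance(members, y, t=5)
v40 = algorithmic_variance(members, y, t=40)
```

As the abstract suggests, the algorithmic fluctuations shrink as the ensemble grows: the variance estimate at t=40 is smaller than at t=5.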

to:NB
ensemble_methods
computational_statistics
statistics
prediction
to_teach:data-mining
8 weeks ago by cshalizi

[1907.01954] An Econometric View of Algorithmic Subsampling

9 weeks ago by cshalizi

"Datasets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset of rows preserve the features of the original data? This paper reviews a line of work that is grounded in theoretical computer science and numerical linear algebra, and which finds that an algorithmically desirable sketch of the data must have a subspace embedding property. Building on this work, we study how prediction and inference are affected by data sketching within a linear regression setup. The sketching error is small compared to the sample size effect which is within the control of the researcher. As a sketch size that is algorithmically optimal may not be suitable for prediction and inference, we use statistical arguments to provide 'inference conscious' guides to the sketch size. When appropriately implemented, an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample."
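
The core idea, regress a sketched response on a sketched design, fits in a few lines. A dense Gaussian sketch is used below for brevity; the paper's survey covers faster sketches and the inference-conscious choice of sketch size:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 5000, 3, 300
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# A dense Gaussian sketch: S has iid N(0, 1/m) entries, so that S X is
# a subspace embedding of the column span of X with high probability.
# Regressing S y on S X then approximates the full least-squares fit
# while touching only m << n (sketched) rows in the solve.
S = rng.normal(size=(m, n)) / np.sqrt(m)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_sketch = np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]
```

With m = 300 sketched rows standing in for n = 5000 originals, the two coefficient vectors agree to within the sketching noise.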

to:NB
computational_statistics
statistics
linear_regression
random_projections
9 weeks ago by cshalizi

Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics

12 weeks ago by cshalizi

"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand. "
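
The sample-then-polish recipe translates directly into code. A pure-numpy sketch for logistic regression (the paper executes the sampling and the gradient/Hessian aggregation as database queries, which this toy ignores):

```python
import numpy as np

def newton_steps(X, y, beta, steps):
    # Newton-Raphson for the logistic-regression log-likelihood; note
    # the full-data pass needs only an aggregated gradient and Hessian.
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(5)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
truth = np.array([0.5, 1.0, -1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ truth))).astype(float)

sub = rng.choice(n, size=500, replace=False)           # the "sampling query"
beta_sub = newton_steps(X[sub], y[sub], np.zeros(3), steps=10)
beta_polished = newton_steps(X, y, beta_sub, steps=1)  # one full-data polish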

to:NB
computational_statistics
linear_regression
regression
databases
lumley.thomas
to_teach:statcomp
12 weeks ago by cshalizi

An Expectation Conditional Maximization Approach for Gaussian Graphical Models: Journal of Computational and Graphical Statistics

12 weeks ago by cshalizi

"Bayesian graphical models are a useful tool for understanding dependence relationships among many variables, particularly in situations with external prior information. In high-dimensional settings, the space of possible graphs becomes enormous, rendering even state-of-the-art Bayesian stochastic search computationally infeasible. We propose a deterministic alternative to estimate Gaussian and Gaussian copula graphical models using an expectation conditional maximization (ECM) algorithm, extending the EM approach from Bayesian variable selection to graphical model estimation. We show that the ECM approach enables fast posterior exploration under a sequence of mixture priors, and can incorporate multiple sources of information."

to:NB
graphical_models
causal_discovery
em_algorithm
computational_statistics
statistics
mccormick.tyler
12 weeks ago by cshalizi

An asymptotic analysis of distributed nonparametric methods

june 2019 by cshalizi

"We investigate and compare the fundamental performance of several distributed learning methods that have been proposed recently. We do this in the context of a distributed version of the classical signal-in-Gaussian-white-noise model, which serves as a benchmark model for studying performance in this setting. The results show how the design and tuning of a distributed method can have great impact on convergence rates and validity of uncertainty quantification. Moreover, we highlight the difficulty of designing nonparametric distributed procedures that automatically adapt to smoothness."

to:NB
statistics
nonparametrics
computational_statistics
distributed_systems
june 2019 by cshalizi

[1906.05746] Nonlinear System Identification via Tensor Completion

june 2019 by cshalizi

"Function approximation from input and output data pairs constitutes a fundamental problem in supervised learning. Deep neural networks are currently the most popular method for learning to mimic the input-output relationship of a generic nonlinear system, as they have proven to be very effective in approximating complex highly nonlinear functions. In this work, we propose low-rank tensor completion as an appealing alternative for modeling and learning complex nonlinear systems. We model the interactions between the N input variables and the scalar output of a system by a single N-way tensor, and set up a weighted low-rank tensor completion problem with smoothness regularization which we tackle using a block coordinate descent algorithm. We extend our method to the multi-output setting and the case of partially observed data, which cannot be readily handled by neural networks. Finally, we demonstrate the effectiveness of the approach using several regression tasks including some standard benchmarks and a challenging student grade prediction task."

to:NB
approximation
tensors
regression
computational_statistics
statistics
june 2019 by cshalizi

Phys. Rev. E 99, 053311 (2019) - Randomization algorithms for large sparse networks

june 2019 by cshalizi

"In many domains it is necessary to generate surrogate networks, e.g., for hypothesis testing of different properties of a network. Generating surrogate networks typically requires that different properties of the network are preserved, e.g., edges may not be added or deleted and edge weights may be restricted to certain intervals. In this paper we present an efficient property-preserving Markov chain Monte Carlo method termed CycleSampler for generating surrogate networks in which (1) edge weights are constrained to intervals and vertex strengths are preserved exactly, and (2) edge and vertex strengths are both constrained to intervals. These two types of constraints cover a wide variety of practical use cases. The method is applicable to both undirected and directed graphs. We empirically demonstrate the efficiency of the CycleSampler method on real-world data sets. We provide an implementation of CycleSampler in R, with parts implemented in C."
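
For contrast with the paper's constraint-preserving sampler, the standard baseline is easy to state: degree-preserving randomization by double-edge swaps (this is not CycleSampler, which additionally preserves vertex strengths and interval constraints on edge weights):

```python
import random

def double_edge_swaps(edges, n_swaps, seed=0):
    # Pick two edges (a,b) and (c,d) and rewire them to (a,d) and
    # (c,b) whenever that creates no self-loop or multi-edge; every
    # vertex keeps its degree exactly.
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = set(edges) | {(b, a) for a, b in edges}
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(edges, 2)
        if len({a, b, c, d}) < 4:
            continue                      # would create a self-loop
        if (a, d) in present or (c, b) in present:
            continue                      # would create a multi-edge
        i, j = edges.index((a, b)), edges.index((c, d))
        for e in ((a, b), (b, a), (c, d), (d, c)):
            present.discard(e)
        edges[i], edges[j] = (a, d), (c, b)
        present |= {(a, d), (d, a), (c, b), (b, c)}
    return edges

def degrees(es):
    # Undirected degree sequence as a dict.
    d = {}
    for a, b in es:
        d[a] = d.get(a, 0) + 1
        d[b] = d.get(b, 0) + 1
    return d

ring = [(i, (i + 1) % 12) for i in range(12)]   # toy graph: a 12-cycle
surrogate = double_edge_swaps(ring, 200)
```

The surrogate has exactly the original degree sequence but a shuffled edge set.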

to:NB
network_sampling
computational_statistics
network_data_analysis
june 2019 by cshalizi

[1806.06802] Almost-Exact Matching with Replacement for Causal Inference

june 2019 by cshalizi

"We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. Matching methods are heavily used in the social sciences due to their interpretability, but most matching methods do not pass basic sanity checks: they fail when irrelevant variables are introduced, and tend to be either computationally slow or produce low-quality matches. The method proposed in this work aims to match units on a weighted Hamming distance, taking into account the relative importance of the covariates; the algorithm aims to match units on as many relevant variables as possible. To do this, the algorithm creates a hierarchy of covariate combinations on which to match (similar to downward closure), in the process solving an optimization problem for each unit in order to construct the optimal matches. The algorithm uses a single dynamic program to solve all of the optimization problems simultaneously. Notable advantages of our method over existing matching procedures are its high-quality matches, versatility in handling different data distributions that may have irrelevant variables, and ability to handle missing data by matching on as many available covariates as possible."
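
The matching criterion is concrete enough to sketch. A brute-force nearest-control match on weighted Hamming distance (the covariates and weights are made up; the paper's contribution is solving the covariate-subset hierarchy efficiently with dynamic programming, not this naive search):

```python
import numpy as np

def weighted_hamming_match(treated, controls, weights):
    # For each treated unit, return the index of the control unit
    # minimizing the weighted Hamming distance on categorical
    # covariates (matching with replacement).
    matches = []
    for t in treated:
        dist = ((controls != t) * weights).sum(axis=1)
        matches.append(int(np.argmin(dist)))
    return matches

rng = np.random.default_rng(7)
controls = rng.integers(0, 3, size=(100, 5))   # 5 categorical covariates
treated = controls[[4, 10]].copy()             # treated units with exact twins
weights = np.array([5.0, 3.0, 1.0, 1.0, 0.1])  # relative covariate importance
m = weighted_hamming_match(treated, controls, weights)
```

Because each treated unit here has an exact twin among the controls, the minimizing distance is zero.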

to:NB
causal_inference
matching
statistics
computational_statistics
june 2019 by cshalizi

Refining the Concept of Scientific Inference When Working with Big Data: Proceedings of a Workshop | The National Academies Press

may 2019 by cshalizi

"The concept of utilizing big data to enable scientific discovery has generated tremendous excitement and investment from both private and public sectors over the past decade, and expectations continue to grow. Using big data analytics to identify complex patterns hidden inside volumes of data that have never been combined could accelerate the rate of scientific discovery and lead to the development of beneficial technologies and products. However, producing actionable scientific knowledge from such large, complex data sets requires statistical models that produce reliable inferences (NRC, 2013). Without careful consideration of the suitability of both available data and the statistical models applied, analysis of big data may result in misleading correlations and false discoveries, which can potentially undermine confidence in scientific research if the results are not reproducible. In June 2016 the National Academies of Sciences, Engineering, and Medicine convened a workshop to examine critical challenges and opportunities in performing scientific inference reliably when working with big data. Participants explored new methodologic developments that hold significant promise and potential research program areas for the future. This publication summarizes the presentations and discussions from the workshop."

--- I am happy with my contributions.

to:NB
books:recommended
books:contributed-to
statistics
computational_statistics
machine_learning
data_mining
may 2019 by cshalizi

[1902.00640] Particle Flow Bayes' Rule

may 2019 by cshalizi

"We present a particle flow realization of Bayes' rule, where an ODE-based neural operator is used to transport particles from a prior to its posterior after a new observation. We prove that such an ODE operator exists. Its neural parameterization can be trained in a meta-learning framework, allowing this operator to reason about the effect of an individual observation on the posterior, and thus generalize across different priors, observations and to sequential Bayesian inference. We demonstrated the generalization ability of our particle flow Bayes operator in several canonical and high dimensional examples."

to:NB
bayesianism
neural_networks
computational_statistics
statistics
re:fitness_sampling
may 2019 by cshalizi

[1905.10845] On Coresets for Regularized Loss Minimization

may 2019 by cshalizi

"We design and mathematically analyze sampling-based algorithms for regularized loss minimization problems that are implementable in popular computational models for large data, in which the access to the data is restricted in some way. Our main result is that if the regularizer's effect does not become negligible as the norm of the hypothesis scales, and as the data scales, then a uniform sample of modest size is with high probability a coreset. In the case that the loss function is either logistic regression or soft-margin support vector machines, and the regularizer is one of the common recommended choices, this result implies that a uniform sample of size O(dn‾√) is with high probability a coreset of n points in ℜd. We contrast this upper bound with two lower bounds. The first lower bound shows that our analysis of uniform sampling is tight; that is, a smaller uniform sample will likely not be a core set. The second lower bound shows that in some sense uniform sampling is close to optimal, as significantly smaller core sets do not generally exist."

to:NB
optimization
learning_theory
statistics
computational_statistics
randomized_algorithms
may 2019 by cshalizi

[1805.10927] Scalable and Robust Community Detection with Randomized Sketching

may 2019 by cshalizi

"This paper explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This algorithm improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion."

to:NB
community_discovery
random_projections
statistics
computational_statistics
may 2019 by cshalizi

Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery : Journal of Causal Inference

may 2019 by cshalizi

"Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales at least quadratically with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which both approximate KCIT by utilizing random Fourier features. In practice, both of the proposed tests scale linearly with sample size and return accurate p-values much faster than KCIT in the large sample size context. CCD algorithms run with RCIT or RCoT also return graphs at least as accurate as the same algorithms run with KCIT but with large reductions in run time."

to:NB
kernel_methods
dependence_measures
causal_discovery
statistics
computational_statistics
may 2019 by cshalizi

Modern Statistics for Modern Biology | Statistics for life sciences, medicine and health | Cambridge University Press

february 2019 by cshalizi

"If you are a biologist and want to get the best out of the powerful methods of modern computational statistics, this is your book. You can visualize and analyze your own data, apply unsupervised and supervised learning, integrate datasets, apply hypothesis testing, and make publication-quality figures using the power of R/Bioconductor and ggplot2. This book will teach you 'cooking from scratch', from raw data to beautiful illuminating output, as you learn to write your own scripts in the R language and to use advanced statistics packages from CRAN and Bioconductor. It covers a broad range of basic and advanced topics important in the analysis of high-throughput biological data, including principal component analysis and multidimensional scaling, clustering, multiple testing, unsupervised and supervised learning, resampling, the pitfalls of experimental design, and power simulations using Monte Carlo, and it even reaches networks, trees, spatial statistics, image data, and microbial ecology. Using a minimum of mathematical notation, it builds understanding from well-chosen examples, simulation, visualization, and above all hands-on interaction with data and code."

to:NB
books:noted
statistics
computational_statistics
biology
genomics
R
february 2019 by cshalizi

[1812.00681] Numerical computation of rare events via large deviation theory

december 2018 by cshalizi

"An overview of rare events algorithms based on large deviation theory (LDT) is presented. It covers a range of numerical schemes to compute the large deviation minimizer in various setups, and discusses best practices, common pitfalls, and implementation trade-offs. Generalizations, extensions, and improvements of the minimum action methods are proposed. These algorithms are tested on example problems which illustrate several common difficulties which arise e.g. when the forcing is degenerate or multiplicative, or the systems are infinite-dimensional. Generalizations to processes driven by non-Gaussian noises or random initial data and parameters are also discussed, along with the connection between the LDT-based approach reviewed here and other methods, such as stochastic field theory and optimal control. Finally, the integration of this approach in importance sampling methods using e.g. genealogical algorithms is explored."

to:NB
large_deviations
rare-event_simulation
simulation
computational_statistics
probability
via:rvenkat
december 2018 by cshalizi

Solving Differential Equations in R: Package deSolve | Soetaert | Journal of Statistical Software

december 2018 by cshalizi

"In this paper we present the R package deSolve to solve initial value problems (IVP) written as ordinary differential equations (ODE), differential algebraic equations (DAE) of index 0 or 1 and partial differential equations (PDE), the latter solved using the method of lines approach. The differential equations can be represented in R code or as compiled code. In the latter case, R is used as a tool to trigger the integration and post-process the results, which facilitates model development and application, whilst the compiled code significantly increases simulation speed. The methods implemented are efficient, robust, and well documented public-domain Fortran routines. They include four integrators from the ODEPACK package (LSODE, LSODES, LSODA, LSODAR), DVODE and DASPK2.0. In addition, a suite of Runge-Kutta integrators and special-purpose solvers to efficiently integrate 1-, 2- and 3-dimensional partial differential equations are available. The routines solve both stiff and non-stiff systems, and include many options, e.g., to deal in an efficient way with the sparsity of the Jacobian matrix, or finding the root of equations. In this article, our objectives are threefold: (1) to demonstrate the potential of using R for dynamic modeling, (2) to highlight typical uses of the different methods implemented and (3) to compare the performance of models specified in R code and in compiled code for a number of test cases. These comparisons demonstrate that, if the use of loops is avoided, R code can efficiently integrate problems comprising several thousands of state variables. Nevertheless, the same problem may be solved from 2 to more than 50 times faster by using compiled code compared to an implementation using only R code. Still, amongst the benefits of R are a more flexible and interactive implementation, better readability of the code, and access to R’s high-level procedures. 
deSolve is the successor of package odesolve which will be deprecated in the future; it is free software and distributed under the GNU General Public License, as part of the R software project."
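
--- deSolve is an R package; as a language-neutral sketch of the kind of initial-value problem its integrators handle, here is a hand-rolled RK4 step applied to the logistic ODE, checked against the closed-form solution (a toy fixed-step scheme, not deSolve's adaptive ODEPACK routines):

```python
import numpy as np

def rk4(f, y0, t0, t1, n_steps):
    # classical fourth-order Runge-Kutta with a fixed step size
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y = y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t = t + h
    return y

# logistic growth dy/dt = r y (1 - y/K), y(0) = 0.5
r, K = 1.0, 10.0
f = lambda t, y: r * y * (1 - y / K)
y_num = rk4(f, 0.5, 0.0, 5.0, 1000)
y_exact = K / (1 + (K / 0.5 - 1) * np.exp(-r * 5.0))   # closed-form solution
print(y_num, y_exact)
```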

to:NB
dynamical_systems
computational_statistics
R
to_teach:data_over_space_and_time
have_read
december 2018 by cshalizi

[1808.04739] Simulating Markov random fields with a conclique-based Gibbs sampler

december 2018 by cshalizi

"For spatial and network data, we consider models formed from a Markov random field (MRF) structure and the specification of a conditional distribution for each observation. At issue, fast simulation from such MRF models is often an important consideration, particularly when repeated generation of large numbers of data sets is required (e.g., for approximating sampling distributions). However, a standard Gibbs strategy for simulating from MRF models involves single-updates, performed with the conditional distribution of each observation in a sequential manner, whereby a Gibbs iteration may become computationally involved even for relatively small samples. As an alternative, we describe a general way to simulate from MRF models using Gibbs sampling with "concliques" (i.e., groups of non-neighboring observations). Compared to standard Gibbs sampling, this simulation scheme can be much faster by reducing Gibbs steps and by independently updating all observations per conclique at once. We detail the simulation method, establish its validity, and assess its computational performance through numerical studies, where speed advantages are shown for several spatial and network examples."

--- Slides: http://andeekaplan.com/phd-thesis/slides/public.pdf

--- There's an R package on Github but I couldn't get it to compile...

random_fields
simulation
stochastic_processes
spatial_statistics
network_data_analysis
markov_models
statistics
computational_statistics
to_teach:data_over_space_and_time
have_read
in_NB
december 2018 by cshalizi

[1710.05013] A Case Study Competition Among Methods for Analyzing Large Spatial Data

october 2018 by cshalizi

"The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has lead to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given location and each which was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online."

in_NB
spatial_statistics
prediction
computational_statistics
statistics
to_teach:data_over_space_and_time
october 2018 by cshalizi

Object-oriented Computation of Sandwich Estimators | Zeileis | Journal of Statistical Software

october 2018 by cshalizi

"Sandwich covariance matrix estimators are a popular tool in applied regression modeling for performing inference that is robust to certain types of model misspecification. Suitable implementations are available in the R system for statistical computing for certain model fitting functions only (in particular lm()), but not for other standard regression functions, such as glm(), nls(), or survreg(). Therefore, conceptual tools and their translation to computational tools in the package sandwich are discussed, enabling the computation of sandwich estimators in general parametric models. Object orientation can be achieved by providing a few extractor functions' most importantly for the empirical estimating functions' from which various types of sandwich estimators can be computed."

to:NB
computational_statistics
R
estimation
regression
statistics
to_teach
october 2018 by cshalizi

General Resampling Infrastructure • rsample

august 2018 by cshalizi

"rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:

"traditional resampling techniques for estimating the sampling distribution of a statistic and

"estimating model performance using a holdout set

"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."

to:NB
R
computational_statistics
to_teach:statcomp
to_teach:undergrad-ADA
via:?
"traditional resampling techniques for estimating the sampling distribution of a statistic and

"estimating model performance using a holdout set

"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."

august 2018 by cshalizi

[1806.06850] Polynomial Regression As an Alternative to Neural Nets

july 2018 by cshalizi

"Despite the success of neural networks (NNs), there is still a concern among many over their "black box" nature. Why do they work? Here we present a simple analytic argument that NNs are in fact essentially polynomial regression models. This view will have various implications for NNs, e.g. providing an explanation for why convergence problems arise in NNs, and it gives rough guidance on avoiding overfitting. In addition, we use this phenomenon to predict and confirm a multicollinearity property of NNs not previously reported in the literature. Most importantly, given this loose correspondence, one may choose to routinely use polynomial models instead of NNs, thus avoiding some major problems of the latter, such as having to set many tuning parameters and dealing with convergence issues. We present a number of empirical results; in each case, the accuracy of the polynomial approach matches or exceeds that of NN approaches. A many-featured, open-source software package, polyreg, is available."

--- Matloff is the author of my favorite "R programming for n00bs" textbook...

--- ETA after reading: the argument that multi-layer neural networks "are essentially" polynomial regression is a bit weak. It would be true, exactly, if activation functions were exactly polynomial, which however they rarely are in practice. If non-polynomial activations happen to be implemented in computational practice by polynomials (e.g., Taylor approximations), well, either we get different hardware or we crank up the degree of approximation as much as we like. (Said a little differently, if you buy this line of argument, you should buy that _every_ smooth statistical model "is essentially" polynomial regression, which seems a bit much.) It is, also, an argument about the function-approximation properties of the model classes, and not the fitting processes, despite the explicit disclaimers.

to:NB
your_favorite_deep_neural_network_sucks
regression
neural_networks
statistics
matloff.norman
approximation
computational_statistics
have_read
july 2018 by cshalizi

now publishers - Community Detection and Stochastic Block Models

june 2018 by cshalizi

"The stochastic block model (SBM) is a random graph model with different group of vertices connecting differently. It is widely employed as a canonical model to study clustering and community detection, and provides a fertile ground to study the information-theoretic and computational tradeoffs that arise in combinatorial statistics and more generally data science. This monograph surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational tradeoffs, and for various recovery requirements such as exact, partial and weak recovery. The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal SNR-mutual information tradeoff for partial recovery, and the gap between information-theoretic and computational thresholds."

to:NB
community_discovery
network_data_analysis
information_theory
computational_statistics
statistics
june 2018 by cshalizi

[1406.2661] Generative Adversarial Networks

june 2018 by cshalizi

"We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples."

--- Kind of astonishing that the phrases "actor-critic" or "co-evolution" do not appear anywhere in the paper. The bit about KL divergence is nice, though. (But even then, it doesn't involve multi-layer perceptrons at all, it's all at the level of probability distributions and could apply to any learning system.)

to:NB
have_read
neural_networks
machine_learning
computational_statistics
statistics
june 2018 by cshalizi

Large-scale kernel methods for independence testing | SpringerLink

may 2018 by cshalizi

"Representations of probability measures in reproducing kernel Hilbert spaces provide a flexible framework for fully nonparametric hypothesis tests of independence, which can capture any type of departure from independence, including nonlinear associations and multivariate interactions. However, these approaches come with an at least quadratic computational cost in the number of observations, which can be prohibitive in many applications. Arguably, it is exactly in such large-scale datasets that capturing any type of dependence is of interest, so striking a favourable trade-off between computational efficiency and test performance for kernel independence tests would have a direct impact on their applicability in practice. In this contribution, we provide an extensive study of the use of large-scale kernel approximations in the context of independence testing, contrasting block-based, Nyström and random Fourier feature approaches. Through a variety of synthetic data experiments, it is demonstrated that our large-scale methods give comparable performance with existing methods while using significantly less computation time and memory."

to:NB
computational_statistics
dependence_measures
hypothesis_testing
statistics
kernel_methods
hilbert_space
gretton.arthur
may 2018 by cshalizi

[1803.01422] DAGs with NO TEARS: Smooth Optimization for Structure Learning

march 2018 by cshalizi

"Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint and are not well-suited to general purpose optimization packages for their solution. In this paper, we introduce a fundamentally different strategy: We formulate the structure learning problem as a smooth, constrained optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting nonconvex, constrained program involves smooth functions whose gradients are easy to compute and only involve elementary matrix operations. By using existing black-box optimization routines, our method uses global search to find an optimal DAG and can be implemented in about 50 lines of Python and outperforms existing methods without imposing any structural constraints."

!!!

to:NB
to_read
graphical_models
optimization
computational_statistics
xing.eric
ravikumar.pradeep
march 2018 by cshalizi

[1802.01396] To understand deep learning we need to understand kernel learning

march 2018 by cshalizi

"Generalization performance of classifiers in deep learning has recently become a subject of intense study. Heavily over-parametrized deep models tend to fit training data exactly. Despite overfitting, they perform well on test data, a phenomenon not yet fully understood.

"The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using real-world and synthetic datasets, we establish that kernel classifiers trained to have zero classification error (overfitting) or even zero regression error (interpolation) perform very well on test data.

"We proceed to prove lower bounds on the norm of overfitted solutions for smooth kernels, showing that they increase nearly exponentially with the data size. Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers.

"We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results reported for ReLU neural networks. In contrast, fitting noisy data requires many more epochs for smooth Gaussian kernels. The observation that the performance of overfitted Laplacian and Gaussian classifiers on the test is quite similar, suggests that generalization is tied to the properties of the kernel function rather than the optimization process.

"We see that some key phenomena of deep learning are manifested similarly in kernel methods in the overfitted regime. We argue that progress on understanding deep learning will be difficult, until more analytically tractable "shallow" kernel methods are better understood. The experimental and theoretical results presented in this paper indicate a need for new theoretical ideas for understanding classical kernel methods."

--- Of course, this also makes me wonder whether there really are practical advantages to deep networks, over and above what we'd get by throwing resources at kernels...

to:NB
neural_networks
kernel_methods
classifiers
regression
statistics
computational_statistics
learning_theory
belkin.mikhail
"The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using real-world and synthetic datasets, we establish that kernel classifiers trained to have zero classification error (overfitting) or even zero regression error (interpolation) perform very well on test data.

"We proceed to prove lower bounds on the norm of overfitted solutions for smooth kernels, showing that they increase nearly exponentially with the data size. Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers.

"We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results reported for ReLU neural networks. In contrast, fitting noisy data requires many more epochs for smooth Gaussian kernels. The observation that the performance of overfitted Laplacian and Gaussian classifiers on the test is quite similar, suggests that generalization is tied to the properties of the kernel function rather than the optimization process.

"We see that some key phenomena of deep learning are manifested similarly in kernel methods in the overfitted regime. We argue that progress on understanding deep learning will be difficult, until more analytically tractable "shallow" kernel methods are better understood. The experimental and theoretical results presented in this paper indicate a need for new theoretical ideas for understanding classical kernel methods."

--- Of course, this also makes me wonder whether there really are practical advantages to deep networks, over and above what we'd get by throwing resources at kernels...

march 2018 by cshalizi

Perturbations, Optimization, and Statistics | The MIT Press

october 2017 by cshalizi

"In nearly all machine learning, decisions must be made given current knowledge. Surprisingly, making what is believed to be the best decision is not always the best strategy, even when learning in a supervised learning setting. An emerging body of work on learning under different rules applies perturbations to decision and learning procedures. These methods provide simple and highly efficient learning rules with improved theoretical guarantees. This book describes perturbation-based methods developed in machine learning to augment novel optimization methods with strong statistical guarantees, offering readers a state-of-the-art overview.

"Chapters address recent modeling ideas that have arisen within the perturbations framework, including Perturb & MAP, herding, and the use of neural networks to map generic noise to distribution over highly structured data. They describe new learning procedures for perturbation models, including an improved EM algorithm and a learning algorithm that aims to match moments of model samples to moments of data. They discuss understanding the relation of perturbation models to their traditional counterparts, with one chapter showing that the perturbations viewpoint can lead to new algorithms in the traditional setting. And they consider perturbation-based regularization in neural networks, offering a more complete understanding of dropout and studying perturbations in the context of deep neural networks."

to:NB
books:noted
computational_statistics
optimization
stochastic_approximation
learning_theory
statistics
"Chapters address recent modeling ideas that have arisen within the perturbations framework, including Perturb & MAP, herding, and the use of neural networks to map generic noise to distribution over highly structured data. They describe new learning procedures for perturbation models, including an improved EM algorithm and a learning algorithm that aims to match moments of model samples to moments of data. They discuss understanding the relation of perturbation models to their traditional counterparts, with one chapter showing that the perturbations viewpoint can lead to new algorithms in the traditional setting. And they consider perturbation-based regularization in neural networks, offering a more complete understanding of dropout and studying perturbations in the context of deep neural networks."

october 2017 by cshalizi

Deep Learning | The MIT Press

december 2016 by cshalizi

"Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning.

"The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models.

"Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors."

--- Bengio is worth listening to.

in_NB
books:noted
neural_networks
machine_learning
computational_statistics
bengio.yoshua
books:owned
"The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models.

"Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors."

--- Bengio is worth listening to.

december 2016 by cshalizi

[1611.05923] "Influence Sketching": Finding Influential Samples In Large-Scale Regressions

december 2016 by cshalizi

"There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence General Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware."

to:NB
regression
linear_regression
computational_statistics
random_projections
via:vaguery
december 2016 by cshalizi

[1311.4500] Time series prediction via aggregation : an oracle bound including numerical cost

december 2016 by cshalizi

"We address the problem of forecasting a time series meeting the Causal Bernoulli Shift model, using a parametric set of predictors. The aggregation technique provides a predictor with well established and quite satisfying theoretical properties expressed by an oracle inequality for the prediction risk. The numerical computation of the aggregated predictor usually relies on a Markov chain Monte Carlo method whose convergence should be evaluated. In particular, it is crucial to bound the number of simulations needed to achieve a numerical precision of the same order as the prediction risk. In this direction we present a fairly general result which can be seen as an oracle inequality including the numerical cost of the predictor computation. The numerical cost appears by letting the oracle inequality depend on the number of simulations required in the Monte Carlo approximation. Some numerical experiments are then carried out to support our findings."

to:NB
prediction
time_series
statistics
computational_statistics
monte_carlo
ensemble_methods
december 2016 by cshalizi

[1206.7051] Stochastic Variational Inference

november 2016 by cshalizi

"We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets."

in_NB
statistics
computational_statistics
variational_inference
stochastic_approximation
blei.david
november 2016 by cshalizi

Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning

february 2016 by cshalizi

"Randomized neural networks are immortalized in this AI Koan: _In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. What are you doing?'' asked Minsky. I am training a randomly wired neural net to play tic-tac-toe,'' Sussman replied. Why is the net wired randomly?'' asked Minsky. Sussman replied, I do not want it to have any preconceptions of how to play.'' Minsky then shut his eyes. Why do you close your eyes?'' Sussman asked his teacher. So that the room will be empty,'' replied Minsky. At that moment, Sussman was enlightened._ We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities."

to:NB
have_read
random_projections
kernel_methods
prediction
computational_statistics
statistics
classifiers
february 2016 by cshalizi

Fastfood - Computing Hilbert Space Expansions in loglinear time | ICML 2013 | JMLR W&CP

february 2016 by cshalizi

"Fast nonlinear function classes are crucial for nonparametric estimation, such as in kernel methods. This paper proposes an improvement to random kitchen sinks that offers significantly faster computation in log-linear time without sacrificing accuracy. Furthermore, we show how one may adjust the regularization properties of the kernel simply by changing the spectral distribution of the projection matrix. We provide experimental results which show that even for for moderately small problems we already achieve two orders of magnitude faster computation and three orders of magnitude lower memory footprint."

to:NB
regression
nonparametrics
computational_statistics
hilbert_space
kernel_methods
smola.alex
le.quoc
prediction
statistics
random_projections
february 2016 by cshalizi

[1601.00670] Variational Inference: A Review for Statisticians

february 2016 by cshalizi

"One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms."

to:NB
variational_inference
statistics
computational_statistics
approximation
blei.david
february 2016 by cshalizi

Sequential Testing for Large Scale Learning

december 2015 by cshalizi

"We argue that when faced with big data sets, learning and inference algorithms should compute updates using only subsets of data items. We introduce algorithms that use sequential hypothesis tests to adaptively select such a subset of data points. The statistical properties of this subsampling process can be used to control the efficiency and accuracy of learning or inference. In the context of learning by optimization, we test for the probability that the update direction is no more than 90 degrees in the wrong direction. In the context of posterior inference using Markov chain Monte Carlo, we test for the probability that our decision to accept or reject a sample is wrong. We experimentally evaluate our algorithms on a number of models and data sets."

to:NB
computational_statistics
hypothesis_testing
machine_learning
optimization
monte_carlo
december 2015 by cshalizi

[1507.01059] Remarks on kernel Bayes' rule

august 2015 by cshalizi

"Kernel Bayes' rule has been proposed as a nonparametric kernel-based method to realize Bayesian inference in reproducing kernel Hilbert spaces. However, we demonstrate both theoretically and experimentally that the prediction result by kernel Bayes' rule is in some cases unnatural. We consider that this phenomenon is in part due to the fact that the assumptions in kernel Bayes' rule do not hold in general."

--- The point about the "kernel posterior" being insensitive to the prior seems the strongest to me; so strong that despite reading the proof I am not sure it's right, as opposed to an algebraic mistake I'm missing.

to:NB
bayesianism
kernel_methods
hilbert_space
computational_statistics
statistics
have_read
august 2015 by cshalizi

[1507.05185] Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees

august 2015 by cshalizi

"In this paper, we study a fast approximation method for {\it large-scale high-dimensional} sparse least-squares regression problem by exploiting the Johnson-Lindenstrauss (JL) transforms, which embed a set of high-dimensional vectors into a low-dimensional space. In particular, we propose to apply the JL transforms to the data matrix and the target vector and then to solve a sparse least-squares problem on the compressed data with a {\it slightly larger regularization parameter}. Theoretically, we establish the optimization error bound of the learned model for two different sparsity-inducing regularizers, i.e., the elastic net and the ℓ1 norm. Compared with previous relevant work, our analysis is {\it non-asymptotic and exhibits more insights} on the bound, the sample complexity and the regularization. As an illustration, we also provide an error bound of the {\it Dantzig selector} under JL transforms."

to:NB
regression
random_projections
computational_statistics
statistics
sparsity
august 2015 by cshalizi

[1507.00066] Fast Cross-Validation for Incremental Learning

august 2015 by cshalizi

"Cross-validation (CV) is one of the main tools for performance estimation and parameter tuning in machine learning. The general recipe for computing CV estimate is to run a learning algorithm separately for each CV fold, a computationally expensive process. In this paper, we propose a new approach to reduce the computational burden of CV-based performance estimation. As opposed to all previous attempts, which are specific to a particular learning model or problem domain, we propose a general method applicable to a large class of incremental learning algorithms, which are uniquely fitted to big data problems. In particular, our method applies to a wide range of supervised and unsupervised learning tasks with different performance criteria, as long as the base learning algorithm is incremental. We show that the running time of the algorithm scales logarithmically, rather than linearly, in the number of CV folds. Furthermore, the algorithm has favorable properties for parallel and distributed implementation. Experiments with state-of-the-art incremental learning algorithms confirm the practicality of the proposed method."

to:NB
cross-validation
computational_statistics
statistics
august 2015 by cshalizi

[1507.04553] Approximate Maximum Likelihood Estimation

august 2015 by cshalizi

"In recent years, methods of approximate parameter estimation have attracted considerable interest in complex problems where exact likelihoods are hard to obtain. In their most basic form, Bayesian methods such as Approximate Bayesian Computation (ABC) involve sampling from the parameter space and keeping those parameters that produce data that fit sufficiently well to the actually observed data. Exploring the whole parameter space, however, makes this approach inefficient in high dimensional problems. This led to the proposal of more sophisticated iterative methods of inference such as particle filters.

"Here, we propose an alternative approach that is based on stochastic gradient methods and applicable both in a frequentist and a Bayesian setting. By moving along a simulated gradient, the algorithm produces a sequence of estimates that will eventually converge either to the maximum likelihood estimate or to the maximum of the posterior distribution, in each case under a set of observed summary statistics. To avoid reaching only a local maximum, we propose to run the algorithm from a set of random starting values.

"As good tuning of the algorithm is important, we explored several tuning strategies, and propose a set of guidelines that worked best in our simulations. We investigate the performance of our approach in simulation studies, and also apply the algorithm to two models with intractable likelihood functions. First, we present an application to inference in the context of queuing systems. We also re-analyze population genetic data and estimate parameters describing the demographic history of Sumatran and Bornean orang-utan populations."

in_NB
statistics
computational_statistics
stochastic_approximation
likelihood
estimation
primates
"Here, we propose an alternative approach that is based on stochastic gradient methods and applicable both in a frequentist and a Bayesian setting. By moving along a simulated gradient, the algorithm produces a sequence of estimates that will eventually converge either to the maximum likelihood estimate or to the maximum of the posterior distribution, in each case under a set of observed summary statistics. To avoid reaching only a local maximum, we propose to run the algorithm from a set of random starting values.

"As good tuning of the algorithm is important, we explored several tuning strategies, and propose a set of guidelines that worked best in our simulations. We investigate the performance of our approach in simulation studies, and also apply the algorithm to two models with intractable likelihood functions. First, we present an application to inference in the context of queuing systems. We also re-analyze population genetic data and estimate parameters describing the demographic history of Sumatran and Bornean orang-utan populations."

august 2015 by cshalizi

[1507.06346] Evaluation of Spectral Learning for the Identification of Hidden Markov Models

august 2015 by cshalizi

"Hidden Markov models have successfully been applied as models of discrete time series in many fields. Often, when applied in practice, the parameters of these models have to be estimated. The currently predominating identification methods, such as maximum-likelihood estimation and especially expectation-maximization, are iterative and prone to have problems with local minima. A non-iterative method employing a spectral subspace-like approach has recently been proposed in the machine learning literature. This paper evaluates the performance of this algorithm, and compares it to the performance of the expectation-maximization algorithm, on a number of numerical examples. We find that the performance is mixed; it successfully identifies some systems with relatively few available observations, but fails completely for some systems even when a large amount of observations is available. An open question is how this discrepancy can be explained. We provide some indications that it could be related to how well-conditioned some system parameters are."

to:NB
spectral_methods
markov_models
state-space_models
statistics
time_series
computational_statistics
august 2015 by cshalizi

Importance Sampling Over Sets: A New Probabilistic Inference Scheme

july 2015 by cshalizi

"Computing expectations in high-dimensional spaces is a key challenge in probabilistic infer- ence and machine learning. Monte Carlo sam- pling, and importance sampling in particular, is one of the leading approaches. We propose a generalized importance sampling scheme based on randomly selecting (exponentially large) sub- sets of states rather than individual ones. By col- lecting a small number of extreme states in the sampled sets, we obtain estimates of statistics of interest, such as the partition function of an undirected graphical model. We incorporate this idea into a novel maximum likelihood learning algorithm based on cutting planes. We demon- strate empirically that our scheme provides accu- rate answers and scales to problems with up to a million variables."

--- Applicable to the fitness-breeding scheme?

to:NB
computational_statistics
monte_carlo
statistics
approximation
re:fitness_sampling
july 2015 by cshalizi

Stable Spectral Learning Based on Schur Decomposition

july 2015 by cshalizi

"Spectral methods are a powerful tool for inferring the parameters of certain classes of probability distributions by means of standard eigenvalue- eigenvector decompositions. Spectral algorithms can be orders of magnitude faster than log- likelihood based and related iterative methods, and, thanks to the uniqueness of the spectral de- composition, they enjoy global optimality guar- antees. In practice, however, the applicability of spectral methods is limited due to their sensitiv- ity to model misspecification, which can cause instability issues in the case of non-exact models. We present a new spectral approach that is based on the Schur triangularization of an observable matrix, and we carry out the corresponding theo- retical analysis. Our main result is a bound on the estimation error that is shown to depend linearly on the condition number of the ground-truth con- ditional probability matrix and inversely on the eigengap of an observable matrix. Numerical ex- periments show that the proposed method is more stable, and performs better in general, than the classical spectral approach using direct matrix di- agonalization."

to:NB
to_read
spectral_methods
mixture_models
statistics
computational_statistics
estimation
misspecification
july 2015 by cshalizi

Random Features for Large-Scale Kernel Machines

july 2015 by cshalizi

"To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shift- invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning al- gorithms applied to these features outperform state-of-the-art large-scale kernel machines."

in_NB
to_read
machine_learning
computational_statistics
approximation
random_projections
fourier_analysis
kernel_methods
statistics
re:hyperbolic_networks
july 2015 by cshalizi

Training generative neural networks via maximum mean discrepancy optimization

july 2015 by cshalizi

"We consider training a deep neural network to generate samples from an unknown distribu- tion given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic—informally speaking, a good genera- tor network produces samples that cause a two- sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an un- biased estimate of the maximum mean discrep- ancy, which is the centerpiece of the nonpara- metric kernel two-sample test proposed by Gret- ton et al. [2]. We compare to the adversar- ial nets framework introduced by Goodfellow et al. [1], in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. From this perspective, the MMD statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical MMD."

--- On first glance, there's no obvious limitation to neural networks, and indeed it's rather suggestive of indirect inference (to me)

to:NB
simulation
stochastic_models
neural_networks
machine_learning
two-sample_tests
hypothesis_testing
nonparametrics
kernel_methods
statistics
computational_statistics
ghahramani.zoubin
july 2015 by cshalizi

Estimating the Partition Function by Discriminance Sampling

july 2015 by cshalizi

"Importance sampling (IS) and its variant, an- nealed IS (AIS) have been widely used for es- timating the partition function in graphical mod- els, such as Markov random fields and deep gen- erative models. However, IS tends to underesti- mate the partition function and is subject to high variance when the proposal distribution is more peaked than the target distribution. On the other hand, “reverse” versions of IS and AIS tend to overestimate the partition function, and degener- ate when the target distribution is more peaked than the proposal distribution. In this work, we present a simple, general method that gives much more reliable and robust estimates than either IS (AIS) or reverse IS (AIS). Our method works by converting the estimation problem into a simple classification problem that discriminates between the samples drawn from the target and the pro- posal. We give extensive theoretical and empir- ical justification; in particular, we show that an annealed version of our method significantly out- performs both AIS and reverse AIS as proposed by Burda et al. (2015), which has been the state- of-the-art for likelihood evaluation in deep gen- erative models."

in_NB
heard_the_talk
computational_statistics
exponential_families
statistics
monte_carlo
classifiers
july 2015 by cshalizi

[1505.00869] On the Feasibility of Distributed Kernel Regression for Big Data

july 2015 by cshalizi

"In modern scientific research, massive datasets with huge numbers of observations are frequently encountered. To facilitate the computational process, a divide-and-conquer scheme is often used for the analysis of big data. In such a strategy, a full dataset is first split into several manageable segments; the final output is then averaged from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown that whether such a distributive strategy provides valid theoretical inferences to the original data. In this paper, we address this fundamental issue for the distributed kernel regression (DKR), where the algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed for bounding the generalization error over the individual outputs, which brings new and challenging issues in the big data setup. Under mild conditions, we show that, with a proper number of segments, DKR leads to an estimator that is generalization consistent to the unknown regression function. The obtained results justify the method of DKR and shed light on the feasibility of using other distributed algorithms for processing big data. The promising preference of the method is supported by both simulation and real data examples."

to:NB
computational_statistics
kernel_estimators
regression
statistics
july 2015 by cshalizi

A Deterministic Partition Function Approximation for Exponential Random Graph Models

june 2015 by cshalizi

"Exponential Random Graphs Models (ERGM) are common, simple statistical models for social net- work and other network structures. Unfortunately, inference and learning with them is hard even for small networks because their partition functions are intractable for precise computation. In this paper, we introduce a new quadratic time deterministic approximation to these partition functions. Our main insight enabling this advance is that subgraph statistics is sufficient to derive a lower bound for partition functions given that the model is not dom- inated by a few graphs. The proposed method dif- fers from existing methods in its ways of exploiting asymptotic properties of subgraph statistics. Com- pared to the current Monte Carlo simulation based methods, the new method is scalable, stable, and precise enough for inference tasks."

--- To look at carefully. Doesn't do anything to address the lack of projectivity, but may suggest useful techniques for related models. (From a quick scan, it looks like it's some sort of Laplace approximation, thus presumably related to large deviations.)

in_NB
exponential_family_random_graphs
computational_statistics
statistics
network_data_analysis
june 2015 by cshalizi

[1504.04941] Fast Moment-Based Estimation for Hierarchical Models

may 2015 by cshalizi

"Hierarchical models allow for heterogeneous behaviours in a population while simultaneously borrowing estimation strength across all subpopulations. Unfortunately, existing likelihood-based methods for fitting hierarchical models have high computational demands, and these demands have limited their adoption in large-scale prediction and inference problems. This paper proposes a moment-based procedure for estimating the parameters of a hierarchical model which has its roots in a method originally introduced by Cochran in 1937. The method trades statistical efficiency for computational efficiency. It gives consistent parameter estimates, competitive prediction error performance, and substantial computational improvements. When applied to a large-scale recommender system application and compared to a standard maximum likelihood procedure, the method delivers competitive prediction performance while reducing the sequential computation time from hours to minutes."

have_read
hierarchical_statistical_models
computational_statistics
statistics
perry.patrick_o.
in_NB
may 2015 by cshalizi

On parallel implementation of sequential Monte Carlo methods: the island particle model - Springer

february 2015 by cshalizi

"The approximation of the Feynman-Kac semigroups by systems of interacting particles is a very active research field, with applications in many different areas. In this paper, we study the parallelization of such approximations. The total population of particles is divided into sub-populations, referred to as islands. The particles within each island follow the usual selection/mutation dynamics. We show that the evolution of each island is also driven by a Feynman-Kac semigroup, whose transition and potential can be explicitly related to ones of the original problem. Therefore, the same genetic type approximation of the Feynman-Kac semi-group may be used at the island level; each island might undergo selection/mutation algorithm. We investigate the impact of the population size within each island and the number of islands, and study different type of interactions. We find conditions under which introducing interactions between islands is beneficial. The theoretical results are supported by some Monte Carlo experiments."

to:NB
particle_filters
monte_carlo
computational_statistics
stochastic_processes
interacting_particle_systems
re:fitness_sampling
february 2015 by cshalizi

[1501.06794] Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations

february 2015 by cshalizi

"We describe a method to perform functional operations on probability distributions of random variables. The method uses reproducing kernel Hilbert space representations of probability distributions, and it is applicable to all operations which can be applied to points drawn from the respective distributions. We refer to our approach as {\em kernel probabilistic programming}. We illustrate it on synthetic data, and show how it can be used for nonparametric structural equation models, with an application to causal inference."

to:NB
kernel_methods
hilbert_space
computational_statistics
causal_inference
statistics
february 2015 by cshalizi

Crisan , Míguez : Particle-kernel estimation of the filter density in state-space models

january 2015 by cshalizi

"Sequential Monte Carlo (SMC) methods, also known as particle filters, are simulation-based recursive algorithms for the approximation of the a posteriori probability measures generated by state-space dynamical models. At any given time t, a SMC method produces a set of samples over the state space of the system of interest (often termed “particles”) that is used to build a discrete and random approximation of the posterior probability distribution of the state variables, conditional on a sequence of available observations. One potential application of the methodology is the estimation of the densities associated to the sequence of a posteriori distributions. While practitioners have rather freely applied such density approximations in the past, the issue has received less attention from a theoretical perspective. In this paper, we address the problem of constructing kernel-based estimates of the posterior probability density function and its derivatives, and obtain asymptotic convergence results for the estimation errors. In particular, we find convergence rates for the approximation errors that hold uniformly on the state space and guarantee that the error vanishes almost surely as the number of particles in the filter grows. Based on this uniform convergence result, we first show how to build continuous measures that converge almost surely (with known rate) toward the posterior measure and then address a few applications. The latter include maximum a posteriori estimation of the system state using the approximate derivatives of the posterior density and the approximation of functionals of it, for example, Shannon’s entropy."

in_NB
particle_filters
kernel_estimators
density_estimation
filtering
state_estimation
state-space_models
statistics
computational_statistics
january 2015 by cshalizi

[1405.0058] Underestimating extreme events in power-law behavior due to machine-dependent cutoffs

january 2015 by cshalizi

"Power-law distributions are typical macroscopic features occurring in almost all complex systems observable in nature. As a result, researchers in quantitative analyses must often generate random synthetic variates obeying power-law distributions. The task is usually performed through standard methods that map uniform random variates into the desired probability space. Whereas all these algorithms are theoretically solid, in this paper we show that they are subject to severe machine-dependent limitations. As a result, two dramatic consequences arise: (i) the sampling in the tail of the distribution is not random but deterministic; (ii) the moments of the sample distribution, which are theoretically expected to diverge as functions of the sample sizes, converge instead to finite values. We provide quantitative indications for the range of distribution parameters that can be safely handled by standard libraries used in computational analyses. Whereas our findings indicate possible reinterpretations of numerical results obtained through flawed sampling methodologies, they also pave the way for the search for a concrete solution to this central issue shared by all quantitative sciences dealing with complexity."

to:NB
to_read
heavy_tails
approximation
computational_statistics
have_skimmed
january 2015 by cshalizi

[1406.5986] A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares

january 2015 by cshalizi

"We consider statistical aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior work has typically adopted an \emph{algorithmic perspective}, in that it has made no statistical assumptions on the input X and Y, and instead it has assumed that the data (X,Y) are fixed and worst-case. In this paper, we adopt a \emph{statistical perspective}, and we consider the mean-squared error performance of randomized sketching algorithms, when data (X,Y) are generated according to a statistical linear model Y=Xβ+ϵ, where ϵ is a noise process. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical predicition efficiency (SPE) and the statistical residual efficiency (SRE) of the sketched LS estimator; and we use our framework to provide results for several types of random projection and random sampling sketching algorithms. Among other results, we show that the SRE can be bounded when p≲r≪n but that the SPE typically requires the sample size r to be substantially larger. Our theoretical results reveal that, depending on the specifics of the situation, leverage-based sampling methods can perform as well as or better than projection methods. Our empirical results reveal that when r is only slightly greater than p and much less than n, projection-based methods out-perform sampling-based methods, but as r grows, sampling methods start to out-perform projection methods."

to:NB
computational_statistics
regression
linear_regression
random_projections
statistics
january 2015 by cshalizi

[1409.3821] Computational Implications of Reducing Data to Sufficient Statistics

january 2015 by cshalizi

"Given a large dataset and an estimation task, it is common to pre-process the data by reducing them to a set of sufficient statistics. This step is often regarded as straightforward and advantageous (in that it simplifies statistical analysis). I show that -on the contrary- reducing data to sufficient statistics can change a computationally tractable estimation problem into an intractable one. I discuss connections with recent work in theoretical computer science, and implications for some techniques to estimate graphical models."

!!!

to:NB
to_read
sufficiency
computational_statistics
theoretical_computer_science
graphical_models
statistics
january 2015 by cshalizi

[1409.2638] Magging: maximin aggregation for inhomogeneous large-scale data

january 2015 by cshalizi

"Large-scale data analysis poses both statistical and computational problems which need to be addressed simultaneously. A solution is often straightforward if the data are homogeneous: one can use classical ideas of subsampling and mean aggregation to get a computationally efficient solution with acceptable statistical accuracy, where the aggregation step simply averages the results obtained on distinct subsets of the data. However, if the data exhibit inhomogeneities (and typically they do), the same approach will be inadequate, as it will be unduly influenced by effects that are not persistent across all the data due to, for example, outliers or time-varying effects. We show that a tweak to the aggregation step can produce an estimator of effects which are common to all data, and hence interesting for interpretation and often leading to better prediction than pooled effects."

to:NB
ensemble_methods
computational_statistics
statistics
prediction
buhlmann.peter
january 2015 by cshalizi

[1412.3779] Biips: Software for Bayesian Inference with Interacting Particle Systems

january 2015 by cshalizi

"Biips is a software platform for automatic Bayesian inference with interacting particle systems. Biips allows users to define their statistical model in the probabilistic programming BUGS language, as well as to add custom functions or samplers within this language. Then it runs sequential Monte Carlo based algorithms (particle filters, particle independent Metropolis-Hastings, particle marginal Metropolis-Hastings) in a black-box manner so that to approximate the posterior distribution of interest as well as the marginal likelihood. The software is developed in C++ with interfaces with the softwares R, Matlab and Octave."

to:NB
monte_carlo
particle_filters
computational_statistics
statistics
del_moral.pierre
january 2015 by cshalizi

[1409.5827] Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones

january 2015 by cshalizi

"The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly-applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations."

in_NB
to_read
distributed_systems
computational_statistics
statistics
matloff.norm
to_teach:statcomp
january 2015 by cshalizi

[1407.2724] On the Optimality of Averaging in Distributed Statistical Learning

january 2015 by cshalizi

"A common approach to statistical learning on big data is to randomly distribute it among m machines and calculate the parameter of interest by merging their m individual estimates. Two key questions related to this approach are: What is the optimal aggregation procedure, and what is the accuracy loss in comparison to centralized computation. We make several contributions to these questions, under the general framework of empirical risk minimization, a.k.a. M-estimation. As data is abundant, we assume the number of samples per machine, n, is large and study two asymptotic settings: one where n→∞ but the number of estimated parameters p is fixed, and a second high-dimensional case where both p,n→∞ with p/n→κ∈(0,1). Our main results include asymptotically exact expressions for the loss incurred by splitting the data, where only bounds were previously available. These are derived independently of the learning algorithm. Consequently, under suitable assumptions in the fixed-p setting, averaging is {\em first-order equivalent} to a centralized solution, and thus inherits statistical properties like efficiency and robustness. In the high-dimension setting, studied here for the first time in the context of parallelization, a qualitatively different behaviour appears. Parallelized computation generally incurs an accuracy loss, for which we derive a simple approximate formula. We conclude with several practical implications of our results."

to:NB
distributed_systems
computational_statistics
high-dimensional_statistics
statistics
january 2015 by cshalizi

[1412.1735] Speeding up bootstrap computations: a vectorized implementation for statistics based on sample moments

january 2015 by cshalizi

"In this note we propose a vectorized implementation of the non-parametric bootstrap for statistics based on sample moments. Basically, we adopt the multinomial sampling formulation of the non-parametric bootstrap, and compute bootstrap replications of sample moment statistics by simply weighting the observed data according to multinomial counts, instead of evaluating the statistic on a re-sampled version of the observed data. Using this formulation we can generate a matrix of bootstrap weights and compute the entire vector of bootstrap replications with a few matrix multiplications. Vectorization is particularly important for matrix-oriented programming languages such as R, where matrix/vector calculations tend to be faster than scalar operations implemented in a loop. We illustrate the gain in computational speed achieved by the vectorized implementation in real and simulated data sets, when bootstrapping Pearson's sample correlation coefficient."

--- Interesting, but presumes that weighting is enough to make things work.

to:NB
bootstrap
computational_statistics
statistics
january 2015 by cshalizi

[1411.0306] Fast Randomized Kernel Methods With Statistical Guarantees

january 2015 by cshalizi

"One approach to improving the running time of kernel-based machine learning methods is to build a small sketch of the input and use it in lieu of the full kernel matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of \emph{statistical leverage scores} to the setting of kernel ridge regression, our main statistical result is to identify an importance sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the \emph{effective dimensionality} of the problem. This quantity is often much smaller than previous bounds that depend on the \emph{maximal degrees of freedom}. Our main algorithmic result is to present a fast algorithm to compute approximations to these scores. This algorithm runs in time that is linear in the number of samples---more precisely, the running time is O(np2), where the parameter p depends only on the trace of the kernel matrix and the regularization parameter---and it can be applied to the matrix of feature vectors, without having to form the full kernel matrix. This is obtained via a variant of length-squared sampling that we adapt to the kernel setting in a way that is of independent interest. Lastly, we provide empirical results illustrating our theory, and we discuss how this new notion of the statistical leverage of a data point captures in a fine way the difficulty of the original statistical learning problem."

to:NB
kernel_methods
low-rank_approximation
randomized_algorithms
computational_statistics
statistics
january 2015 by cshalizi

[1501.01265] The ABC of Simulation Estimation with Auxiliary Statistics

january 2015 by cshalizi

"This paper provides a synthesis of the simulation based minimum distance estimators used in economics with the method of ABC (Approximate Bayesian Computation) used in other disciplines. While the two strands of work are seemingly related, the relation between them is not well understood beyond the fact that they all replace the likelihood by auxiliary statistics that are informative about the data. We connect the two methods by using a reverse sampler to engineer the ABC posterior distribution from a sequence of simulated minimum distance estimates. Focusing on the exactly identified case, we find that Bayesian and frequentist estimators have different bias properties for an arbitrary choice of prior. The difference can be traced to whether we match the sample auxiliary statistics to each or to the average of the simulated ones. In principle, the likelihood-free Bayesian estimators can completely eliminate the order 1T bias with suitable choice of the prior. But under a specific relation between the structural parameters and auxiliary statistics, the frequentist simulated distance estimators are automatically second order unbiased. These differences are illustrated using an analytical example and a simulation study of the dynamic panel model."

in_NB
indirect_inference
approximate_bayesian_computation
statistics
computational_statistics
simulation
re:stacs
to_read
january 2015 by cshalizi

[1501.03326] Unbiased Bayes for Big Data: Paths of Partial Posteriors

january 2015 by cshalizi

"Bayesian inference proceeds based on expectations of certain functions with respect to the posterior. Markov Chain Monte Carlo is a fundamental tool to compute these expectations. However, its feasibility is being challenged in the era of so called Big Data as all data needs to be processed in every iteration. Realising that such simulation is an unnecessarily hard problem if the goal is estimation, we construct a computationally scalable methodology that allows unbiased estimation of the required expectations -- without explicit simulation from the full posterior. The average computational complexity of our scheme is sub-linear in the size of the dataset and its variance is straightforward to control, leading to algorithms that are provably unbiased and naturally arrive at a desired error tolerance. We demonstrate the utility and generality of the methodology on a range of common statistical models applied to large scale benchmark datasets."

to:NB
monte_carlo
bayesianism
computational_statistics
statistics
january 2015 by cshalizi

A general construction for parallelizing Metropolis-Hastings algorithms

december 2014 by cshalizi

"Markov chain Monte Carlo methods (MCMC) are essential tools for solving many modern-day statistical and computational problems; however, a major limitation is the inherently sequential nature of these algorithms. In this paper, we propose a natural generalization of the Metropolis−Hastings algorithm that allows for parallelizing a single chain using existing MCMC methods. We do so by proposing multiple points in parallel, then constructing and sampling from a finite-state Markov chain on the proposed points such that the overall procedure has the correct target density as its stationary distribution. Our approach is generally applicable and straightforward to implement. We demonstrate how this construction may be used to greatly increase the computational speed and statistical efficiency of a variety of existing MCMC methods, including Metropolis-Adjusted Langevin Algorithms and Adaptive MCMC. Furthermore, we show how it allows for a principled way of using every integration step within Hamiltonian Monte Carlo methods; our approach increases robustness to the choice of algorithmic parameters and results in increased accuracy of Monte Carlo estimates with little extra computational cost."

to:NB
to_read
monte_carlo
parallelism
computational_statistics
statistics
december 2014 by cshalizi
