cshalizi + computational_statistics   197

[1802.08012] Learning Topic Models by Neighborhood Aggregation
"Topic models are frequently used in machine learning owing to their high interpretability and modular structure. However, extending a topic model to include a supervisory signal, to incorporate pre-trained word embedding vectors and to include a nonlinear output function is not an easy task because one has to resort to a highly intricate approximate inference procedure. The present paper shows that topic modeling with pre-trained word embedding vectors can be viewed as implementing a neighborhood aggregation algorithm where messages are passed through a network defined over words. From the network view of topic models, nodes correspond to words in a document and edges correspond to either a relationship describing co-occurring words in a document or a relationship describing the same word in the corpus. The network view allows us to extend the model to include supervisory signals, incorporate pre-trained word embedding vectors and include a nonlinear output function in a simple manner. In experiments, we show that our approach outperforms the state-of-the-art supervised Latent Dirichlet Allocation implementation in terms of held-out document classification tasks."
to:NB  topic_models  computational_statistics  text_mining 
21 hours ago by cshalizi
[1909.04791] Techniques All Classifiers Can Learn from Deep Networks: Models, Optimizations, and Regularization
"Deep neural networks have introduced novel and useful tools to the machine learning community, and other types of classifiers can make use of these tools to improve their performance and generality. This paper reviews the current state of the art for deep learning classifier technologies that are being used outside of deep neural networks. Many components of existing deep neural network architectures can be employed by non-network classifiers. In this paper, we review the feature learning, optimization, and regularization methods that form a core of deep network technologies. We then survey non-neural network learning algorithms that make innovative use of these methods to improve classification. We conclude by discussing directions that can be pursued to expand the area of deep learning for a variety of classification algorithms."
to:NB  classifiers  machine_learning  optimization  computational_statistics  statistics  neural_networks 
3 days ago by cshalizi
[1909.05097] Spectral Non-Convex Optimization for Dimension Reduction with Hilbert-Schmidt Independence Criterion
"The Hilbert Schmidt Independence Criterion (HSIC) is a kernel dependence measure that has applications in various aspects of machine learning. Conveniently, the objectives of different dimensionality reduction applications using HSIC often reduce to the same optimization problem. However, the nonconvexity of the objective function arising from non-linear kernels poses a serious challenge to optimization efficiency and limits the potential of HSIC-based formulations. As a result, only linear kernels have been computationally tractable in practice. This paper proposes a spectral-based optimization algorithm that extends beyond the linear kernel. The algorithm identifies a family of suitable kernels and provides the first and second-order local guarantees when a fixed point is reached. Furthermore, we propose a principled initialization strategy, thereby removing the need to repeat the algorithm at random initialization points. Compared to state-of-the-art optimization algorithms, our empirical results on real data show a run-time improvement by as much as a factor of 10^5 while consistently achieving lower cost and classification/clustering errors. The implementation source code is publicly available on this https URL."
to:NB  kernel_methods  hilbert_space  probability  dimension_reduction  statistics  computational_statistics  spectral_methods 
3 days ago by cshalizi
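The empirical HSIC that these dimension-reduction objectives build on has a compact closed form, HSIC = tr(KHLH)/(n-1)^2, where K and L are kernel matrices on the two variables and H is the centering matrix. A minimal NumPy sketch of that statistic (Gaussian kernels and the bandwidth are illustrative choices, not the paper's optimization algorithm):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Gaussian kernel matrix from pairwise squared distances
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def hsic(X, Y, gamma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n-1)^2."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    L = rbf_kernel(Y, gamma)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
dependent = hsic(x, x**2)                        # strong nonlinear dependence
independent = hsic(x, rng.normal(size=(200, 1))) # no dependence
```

On dependent pairs (here x versus x^2) the statistic sits well above its value for independent draws, which is the signal the HSIC-based formulations try to maximize or minimize.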
[1909.00835] Evolutionary reinforcement learning of dynamical large deviations
"We show how to calculate dynamical large deviations using evolutionary reinforcement learning. An agent, a stochastic model, propagates a continuous-time Monte Carlo trajectory, and receives a reward conditioned upon the values of certain path-extensive quantities. Evolution produces progressively fitter agents, allowing the calculation of a piece of a large-deviation rate function for a particular model and path-extensive quantity. For models with small state spaces the evolutionary process acts directly on rates, and for models with large state spaces the process acts on the weights of a neural network that parameterizes the model's rates. The present approach shows how path-extensive physics problems can be considered within a framework widely used in machine learning."
to:NB  reinforcement_learning  large_deviations  computational_statistics  probability  re:fitness_sampling 
3 days ago by cshalizi
[1909.03320] Matrix Calculations for Moments of Markov Processes
"Matryoshka dolls, the traditional Russian nesting figurines, are known world-wide for each doll's encapsulation of a sequence of smaller dolls. In this paper, we identify a large class of Markov processes whose moments are easy to compute by exploiting the structure of a new sequence of nested matrices we call Matryoshkhan matrices. We characterize the salient properties of Matryoshkhan matrices that allow us to compute these moments in closed form at a specific time without computing the entire path of the process. This speeds up the computation of the Markov process moments significantly in comparison to traditional differential equation methods, which we demonstrate through numerical experiments. Through our method, we derive explicit expressions for both transient and steady-state moments of this class of Markov processes. We demonstrate the applicability of this method through explicit examples such as shot-noise processes, growth-collapse processes, linear birth-death-immigration processes, and affine stochastic differential equations from the finance literature. We also show that we can derive explicit expressions for the self-exciting Hawkes process, for which finding closed form moment expressions has been an open problem since its introduction."
to:NB  markov_models  stochastic_processes  computational_statistics 
3 days ago by cshalizi

[1908.06437] Block Nearest Neighbor Gaussian processes for large datasets
"This work develops a valid spatial block-Nearest Neighbor Gaussian process (block-NNGP) for estimation and prediction of location-referenced large spatial datasets. The key idea behind our approach is to subdivide the spatial domain into several blocks which are dependent under some constraints. The cross-blocks capture the large-scale spatial variation, while each block captures the small-scale dependence. The block-NNGP is embedded as a sparsity-inducing prior within a hierarchical modeling framework. Markov chain Monte Carlo (MCMC) algorithms are executed without storing or decomposing large matrices, while the sparse block precision matrix is efficiently computed through parallel computing. We also consider alternate MCMC algorithms through composite sampling for faster computing time, and more reproducible Bayesian inference. The performance of the block-NNGP is illustrated using simulation studies and applications with massive real data, for locations in the order of 10^4."
to:NB  spatial_statistics  prediction  computational_statistics  statistics  to_teach:data_over_space_and_time 
29 days ago by cshalizi
[1908.06936] ExaGeoStatR: A Package for Large-Scale Geostatistics in R
"Parallel computing in Gaussian process calculation becomes a necessity for avoiding computational and memory restrictions associated with Geostatistics applications. The evaluation of the Gaussian log-likelihood function requires O(n^2) storage and O(n^3) operations where n is the number of geographical locations. In this paper, we present ExaGeoStatR, a package for large-scale Geostatistics in R that supports parallel computation of the maximum likelihood function on shared memory, GPU, and distributed systems. The parallelization depends on breaking down the numerical linear algebra operations into a set of tasks and rendering them for a task-based programming model. ExaGeoStatR supports several maximum likelihood computation variants such as exact, Diagonal Super Tile (DST), and Tile Low-Rank (TLR) approximation besides providing a tool to generate large-scale synthetic datasets which can be used to test and compare different approximation methods. The package can be used directly through the R environment without any C, CUDA, or MPI knowledge. Here, we demonstrate the ExaGeoStatR package by illustrating its implementation details, analyzing its performance on various parallel architectures, and assessing its accuracy using both synthetic datasets and a sea surface temperature dataset. The performance evaluation involves spatial datasets with up to 250K observations."
to:NB  spatial_statistics  prediction  computational_statistics  R  statistics  to_teach:data_over_space_and_time 
29 days ago by cshalizi
Parallelising Particle Filters with Butterfly Interactions - Heine - Scandinavian Journal of Statistics - Wiley Online Library
"Bootstrap particle filter (BPF) is the cornerstone of many algorithms used for solving generally intractable inference problems with Hidden Markov models. The long term stability of BPF arises from particle interactions that typically make parallel implementations of BPF nontrivial.
"We propose a method whereby the particle interaction is done in several stages. With the proposed method, full interaction can be accomplished even if we allow only pairwise communications between processing elements at each stage. We show that our method preserves the consistency and the long term stability of the BPF, although our analysis suggests that the constraints on the stagewise interactions introduce error leading to a lower convergence rate than standard Monte Carlo. The proposed method also suggests a new, more flexible, adaptive resampling scheme, which according to our numerical experiments is the method of choice, displaying a notable gain in efficiency in certain parallel computing scenarios."
to:NB  computational_statistics  particle_filters  time_series  state-space_models  state_estimation  re:fitness_sampling 
4 weeks ago by cshalizi
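For reference, the bootstrap particle filter whose resampling step forces the particle interaction is only a few lines. This sketch runs it on a linear-Gaussian state-space model; the model, parameters, and plain multinomial resampling are illustrative, not Heine et al.'s staged butterfly scheme:

```python
import numpy as np

def bootstrap_pf(y, n=500, phi=0.9, sigma=1.0, tau=1.0, seed=0):
    """Bootstrap particle filter for x_t = phi x_{t-1} + N(0, sigma^2),
    y_t = x_t + N(0, tau^2). The resampling step couples all particles,
    which is what makes naive parallelization hard."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), n)  # stationary start
    means = []
    for yt in y:
        x = phi * x + rng.normal(0.0, sigma, n)   # propagate from the prior
        logw = -0.5 * ((yt - x) / tau) ** 2       # weight by the likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * x))
        x = x[rng.choice(n, n, p=w)]              # resample: the interaction step
    return np.array(means)

# Simulate a trajectory and filter it
rng = np.random.default_rng(1)
T, phi = 100, 0.9
xs = np.zeros(T)
for t in range(1, T):
    xs[t] = phi * xs[t - 1] + rng.normal()
ys = xs + rng.normal(size=T)
filt = bootstrap_pf(ys)
corr = np.corrcoef(filt, xs)[0, 1]
```

The filtered means track the latent state closely; the paper's contribution is restructuring the resampling interaction into pairwise stages without losing this stability.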
[1907.10597] Green AI
"The computations required for deep learning research have been doubling every few months, resulting in an estimated 300,000x increase from 2012 to 2018 [2]. These computations have a surprisingly large carbon footprint [38]. Ironically, deep learning was inspired by the human brain, which is remarkably energy efficient. Moreover, the financial cost of the computations can make it difficult for academics, students, and researchers, in particular those from emerging economies, to engage in deep learning research.
"This position paper advocates a practical solution by making efficiency an evaluation criterion for research alongside accuracy and related measures. In addition, we propose reporting the financial cost or "price tag" of developing, training, and running models to provide baselines for the investigation of increasingly efficient methods. Our goal is to make AI both greener and more inclusive---enabling any inspired undergraduate with a laptop to write high-quality research papers. Green AI is an emerging focus at the Allen Institute for AI."
to:NB  machine_learning  computational_statistics  smith.noah_a.  energy 
4 weeks ago by cshalizi
[1507.02646] Pareto Smoothed Importance Sampling
"Importance weighting is a general way to adjust Monte Carlo integration to account for draws from the wrong distribution, but the resulting estimate can be noisy when the importance ratios have a heavy right tail. This routinely occurs when there are aspects of the target distribution that are not well captured by the approximating distribution, in which case more stable estimates can be obtained by modifying extreme importance ratios. We present a new method for stabilizing importance weights using a generalized Pareto distribution fit to the upper tail of the distribution of the simulated importance ratios. The method, which empirically performs better than existing methods for stabilizing importance sampling estimates, includes stabilized effective sample size estimates, Monte Carlo error estimates and convergence diagnostics."
to:NB  monte_carlo  importance_sampling  heavy_tails  computational_statistics  statistics  re:fitness_sampling  gelman.andrew 
6 weeks ago by cshalizi
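A simplified sketch of the tail-smoothing idea: sort the importance ratios, fit a generalized Pareto distribution to the exceedances over a cutoff, and replace the tail ratios by the fitted distribution's quantiles (expected order statistics), truncated at the observed maximum. The fixed tail fraction and plain maximum-likelihood fit via SciPy's `genpareto` are cruder than the paper's recommendations:

```python
import numpy as np
from scipy.stats import genpareto

def psis_smooth(ratios, tail_frac=0.2):
    """Replace the largest importance ratios by quantiles of a generalized
    Pareto distribution fit to the upper tail (simplified PSIS sketch)."""
    r = np.sort(np.asarray(ratios, dtype=float))
    n = len(r)
    m = max(int(tail_frac * n), 5)                # tail size: a crude rule of thumb
    cutoff = r[n - m - 1]
    k, _, scale = genpareto.fit(r[n - m:] - cutoff, floc=0.0)  # fit exceedances
    q = (np.arange(1, m + 1) - 0.5) / m           # quantile levels for the tail
    smoothed_tail = cutoff + genpareto.ppf(q, k, loc=0.0, scale=scale)
    out = r.copy()
    out[n - m:] = np.minimum(smoothed_tail, r[-1])  # truncate at the observed max
    return out, k

rng = np.random.default_rng(0)
raw = rng.pareto(1.5, size=2000) + 1.0            # heavy right tail
smoothed, khat = psis_smooth(raw)
```

The fitted shape parameter k doubles as a diagnostic: in the paper, large estimates of k warn that the importance sampling estimate is unreliable even after smoothing.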
[1908.01251] Measuring the Algorithmic Convergence of Randomized Ensembles: The Regression Setting
"When randomized ensemble methods such as bagging and random forests are implemented, a basic question arises: Is the ensemble large enough? In particular, the practitioner desires a rigorous guarantee that a given ensemble will perform nearly as well as an ideal infinite ensemble (trained on the same data). The purpose of the current paper is to develop a bootstrap method for solving this problem in the context of regression --- which complements our companion paper in the context of classification (Lopes 2019). In contrast to the classification setting, the current paper shows that theoretical guarantees for the proposed bootstrap can be established under much weaker assumptions. In addition, we illustrate the flexibility of the method by showing how it can be adapted to measure algorithmic convergence for variable selection. Lastly, we provide numerical results demonstrating that the method works well in a range of situations."
to:NB  ensemble_methods  computational_statistics  statistics  to_teach:data-mining 
6 weeks ago by cshalizi
[1908.01087] Network Shrinkage Estimation
"Networks are a natural representation of complex systems across the sciences, and higher-order dependencies are central to the understanding and modeling of these systems. However, in many practical applications such as online social networks, networks are massive, dynamic, and naturally streaming, where pairwise interactions become available one at a time in some arbitrary order. The massive size and streaming nature of these networks allow only partial observation, since it is infeasible to analyze the entire network. Under such scenarios, it is challenging to study the higher-order structural and connectivity patterns of streaming networks. In this work, we consider the fundamental problem of estimating the higher-order dependencies using adaptive sampling. We propose a novel adaptive, single-pass sampling framework and unbiased estimators for higher-order network analysis of large streaming networks. Our algorithms exploit adaptive techniques to identify edges that are highly informative for efficiently estimating the higher-order structure of streaming networks from small sample data. We also introduce a novel James-Stein-type shrinkage estimator to minimize the estimation error. Our approach is fully analytic with theoretical guarantees, computationally efficient, and can be incrementally updated in a streaming setting. Numerical experiments on large networks show that our approach is superior to baseline methods."
to:NB  network_data_analysis  computational_statistics  statistics 
6 weeks ago by cshalizi
[1805.11908] Who Learns Better Bayesian Network Structures: Accuracy and Speed of Structure Learning Algorithms
"Three classes of algorithms to learn the structure of Bayesian networks from data are common in the literature: constraint-based algorithms, which use conditional independence tests to learn the dependence structure of the data; score-based algorithms, which use goodness-of-fit scores as objective functions to maximise; and hybrid algorithms that combine both approaches. Constraint-based and score-based algorithms have been shown to learn the same structures when conditional independence and goodness of fit are both assessed using entropy and the topological ordering of the network is known (Cowell, 2001).
"In this paper, we investigate how these three classes of algorithms perform outside the assumptions above in terms of speed and accuracy of network reconstruction for both discrete and Gaussian Bayesian networks. We approach this question by recognising that structure learning is defined by the combination of a statistical criterion and an algorithm that determines how the criterion is applied to the data. Removing the confounding effect of different choices for the statistical criterion, we find using both simulated and real-world complex data that constraint-based algorithms are often less accurate than score-based algorithms, but are seldom faster (even at large sample sizes); and that hybrid algorithms are neither faster nor more accurate than constraint-based algorithms. This suggests that commonly held beliefs on structure learning in the literature are strongly influenced by the choice of particular statistical criteria rather than just by the properties of the algorithms themselves."
to:NB  causal_discovery  graphical_models  statistics  computational_statistics 
6 weeks ago by cshalizi
Scalable Visualization Methods for Modern Generalized Additive Models: Journal of Computational and Graphical Statistics: Vol 0, No 0
"In the last two decades, the growth of computational resources has made it possible to handle generalized additive models (GAMs) that formerly were too costly for serious applications. However, the growth in model complexity has not been matched by improved visualizations for model development and results presentation. Motivated by an industrial application in electricity load forecasting, we identify the areas where the lack of modern visualization tools for GAMs is particularly severe, and we address the shortcomings of existing methods by proposing a set of visual tools that (a) are fast enough for interactive use, (b) exploit the additive structure of GAMs, (c) scale to large data sets, and (d) can be used in conjunction with a wide range of response distributions. The new visual methods proposed here are implemented by the mgcViz R package, available on the Comprehensive R Archive Network. Supplementary materials for this article are available online."
to:NB  additive_models  visual_display_of_quantitative_information  computational_statistics  statistics  R  to_teach:undergrad-ADA 
8 weeks ago by cshalizi
[1907.08742] Estimating the Algorithmic Variance of Randomized Ensembles via the Bootstrap
"Although the methods of bagging and random forests are some of the most widely used prediction methods, relatively little is known about their algorithmic convergence. In particular, there are not many theoretical guarantees for deciding when an ensemble is "large enough" --- so that its accuracy is close to that of an ideal infinite ensemble. Due to the fact that bagging and random forests are randomized algorithms, the choice of ensemble size is closely related to the notion of "algorithmic variance" (i.e. the variance of prediction error due only to the training algorithm). In the present work, we propose a bootstrap method to estimate this variance for bagging, random forests, and related methods in the context of classification. To be specific, suppose the training dataset is fixed, and let the random variable Err_t denote the prediction error of a randomized ensemble of size t. Working under a "first-order model" for randomized ensembles, we prove that the centered law of Err_t can be consistently approximated via the proposed method as t→∞. Meanwhile, the computational cost of the method is quite modest, by virtue of an extrapolation technique. As a consequence, the method offers a practical guideline for deciding when the algorithmic fluctuations of Err_t are negligible."
to:NB  ensemble_methods  computational_statistics  statistics  prediction  to_teach:data-mining 
8 weeks ago by cshalizi
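The core move — resample the *ensemble members* with replacement to estimate how much prediction variability is due only to the randomized training algorithm — can be sketched with a toy bagged-stump regressor. The base learner and sizes are illustrative, and the paper's extrapolation trick and classification setting are omitted:

```python
import numpy as np

def member_predictions(X, y, Xtest, t, rng):
    """Toy randomized ensemble: each member is a decision stump trained on a
    bootstrap sample with a random split point (a stand-in for a tree)."""
    preds = np.empty((t, len(Xtest)))
    for i in range(t):
        idx = rng.integers(0, len(X), len(X))
        Xi, yi = X[idx], y[idx]
        thr = Xi[rng.integers(0, len(Xi))]
        left, right = yi[Xi <= thr], yi[Xi > thr]
        lmean = left.mean() if len(left) else yi.mean()
        rmean = right.mean() if len(right) else yi.mean()
        preds[i] = np.where(Xtest <= thr, lmean, rmean)
    return preds

def algorithmic_variance(preds, n_boot=200, seed=0):
    """Bootstrap over members to estimate the variance of the ensemble-average
    prediction attributable to the randomized algorithm alone."""
    rng = np.random.default_rng(seed)
    t = preds.shape[0]
    boot_means = np.empty((n_boot, preds.shape[1]))
    for b in range(n_boot):
        boot_means[b] = preds[rng.integers(0, t, t)].mean(axis=0)
    return boot_means.var(axis=0).mean()

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 300)
y = np.sin(3 * X) + rng.normal(0, 0.1, 300)
Xtest = np.linspace(-1, 1, 50)
small = algorithmic_variance(member_predictions(X, y, Xtest, 20, rng))
large = algorithmic_variance(member_predictions(X, y, Xtest, 200, rng))
```

Since the algorithmic variance of the ensemble average scales roughly like 1/t, the estimate for the 200-member ensemble comes out about an order of magnitude below the 20-member one, which is exactly the "is the ensemble large enough?" signal.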
[1907.01954] An Econometric View of Algorithmic Subsampling
"Datasets that are terabytes in size are increasingly common, but computer bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset of rows preserve the features of the original data? This paper reviews a line of work that is grounded in theoretical computer science and numerical linear algebra, and which finds that an algorithmically desirable {\em sketch} of the data must have a {\em subspace embedding} property. Building on this work, we study how prediction and inference are affected by data sketching within a linear regression setup. The sketching error is small compared to the sample size effect which is within the control of the researcher. As a sketch size that is algorithmically optimal may not be suitable for prediction and inference, we use statistical arguments to provide `inference conscious' guides to the sketch size. When appropriately implemented, an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample."
to:NB  computational_statistics  statistics  linear_regression  random_projections 
9 weeks ago by cshalizi
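A minimal illustration of the statistical idea behind data sketching for least squares: compress the n rows to m << n rows with a random matrix S and solve the small problem. A dense Gaussian sketch is used here for clarity; practical sketching algorithms use structured or sparse embeddings for speed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10000, 5, 400                 # n rows sketched down to m
X = rng.normal(size=(n, d))
beta = np.arange(1.0, d + 1.0)
y = X @ beta + rng.normal(size=n)

# Full-sample least squares
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Gaussian sketch: S X is a subspace embedding with high probability
S = rng.normal(size=(m, n)) / np.sqrt(m)
beta_sketch = np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]
err = np.linalg.norm(beta_sketch - beta_full)
```

The sketched coefficients land close to the full-sample ones, but their sampling variability is governed by m rather than n, which is the inference-versus-computation trade-off the paper studies.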
Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics: Vol 0, No 0
"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand. "
to:NB  computational_statistics  linear_regression  regression  databases  lumley.thomas  to_teach:statcomp 
12 weeks ago by cshalizi
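The sample-then-polish recipe can be sketched for logistic regression: fit the model on an in-memory subsample (the "sampling query"), then take one Newton step whose score and Fisher information are computable in a single aggregation pass over the full data (the "aggregation query"). All sizes and the plain Newton solver here are illustrative, not Lumley's database implementation:

```python
import numpy as np

def logistic_newton(X, y, beta, steps=1):
    """Newton-Raphson steps for the logistic regression log-likelihood.
    Score and information are sums over rows, so one step needs only
    a single aggregation pass over the data."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - p)                  # gradient
        info = X.T @ (X * (p * (1 - p))[:, None])  # Fisher information
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(0)
n, d = 50000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([0.5, 1.0, -1.0, 0.25])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

m = int(n ** 0.6)                               # subsample that fits in memory
idx = rng.choice(n, m, replace=False)           # the "sampling query"
beta0 = logistic_newton(X[idx], y[idx], np.zeros(d), steps=8)  # fit in memory
beta1 = logistic_newton(X, y, beta0, steps=1)   # one-step polish on full data

mle = logistic_newton(X, y, np.zeros(d), steps=10)  # reference full MLE
gap = np.linalg.norm(beta1 - mle)
```

One Newton step from a root-n-consistent starting value is asymptotically equivalent to the full MLE, so the polished estimate sits much closer to the MLE than the subsample fit does.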
An Expectation Conditional Maximization Approach for Gaussian Graphical Models: Journal of Computational and Graphical Statistics: Vol 0, No 0
"Bayesian graphical models are a useful tool for understanding dependence relationships among many variables, particularly in situations with external prior information. In high-dimensional settings, the space of possible graphs becomes enormous, rendering even state-of-the-art Bayesian stochastic search computationally infeasible. We propose a deterministic alternative to estimate Gaussian and Gaussian copula graphical models using an expectation conditional maximization (ECM) algorithm, extending the EM approach from Bayesian variable selection to graphical model estimation. We show that the ECM approach enables fast posterior exploration under a sequence of mixture priors, and can incorporate multiple sources of information."
to:NB  graphical_models  causal_discovery  em_algorithm  computational_statistics  statistics  mccormick.tyler 
12 weeks ago by cshalizi
An asymptotic analysis of distributed nonparametric methods
"We investigate and compare the fundamental performance of several distributed learning methods that have been proposed recently. We do this in the context of a distributed version of the classical signal-in-Gaussian-white-noise model, which serves as a benchmark model for studying performance in this setting. The results show how the design and tuning of a distributed method can have great impact on convergence rates and validity of uncertainty quantification. Moreover, we highlight the difficulty of designing nonparametric distributed procedures that automatically adapt to smoothness."
to:NB  statistics  nonparametrics  computational_statistics  distributed_systems 
june 2019 by cshalizi
[1906.05746] Nonlinear System Identification via Tensor Completion
"Function approximation from input and output data pairs constitutes a fundamental problem in supervised learning. Deep neural networks are currently the most popular method for learning to mimic the input-output relationship of a generic nonlinear system, as they have proven to be very effective in approximating complex highly nonlinear functions. In this work, we propose low-rank tensor completion as an appealing alternative for modeling and learning complex nonlinear systems. We model the interactions between the N input variables and the scalar output of a system by a single N-way tensor, and setup a weighted low-rank tensor completion problem with smoothness regularization which we tackle using a block coordinate descent algorithm. We extend our method to the multi-output setting and the case of partially observed data, which cannot be readily handled by neural networks. Finally, we demonstrate the effectiveness of the approach using several regression tasks including some standard benchmarks and a challenging student grade prediction task."
to:NB  approximation  tensors  regression  computational_statistics  statistics 
june 2019 by cshalizi
Phys. Rev. E 99, 053311 (2019) - Randomization algorithms for large sparse networks
"In many domains it is necessary to generate surrogate networks, e.g., for hypothesis testing of different properties of a network. Generating surrogate networks typically requires that different properties of the network are preserved, e.g., edges may not be added or deleted and edge weights may be restricted to certain intervals. In this paper we present an efficient property-preserving Markov chain Monte Carlo method termed CycleSampler for generating surrogate networks in which (1) edge weights are constrained to intervals and vertex strengths are preserved exactly, and (2) edge and vertex strengths are both constrained to intervals. These two types of constraints cover a wide variety of practical use cases. The method is applicable to both undirected and directed graphs. We empirically demonstrate the efficiency of the CycleSampler method on real-world data sets. We provide an implementation of CycleSampler in R, with parts implemented in C."
to:NB  network_sampling  computational_statistics  network_data_analysis 
june 2019 by cshalizi
[1806.06802] Almost-Exact Matching with Replacement for Causal Inference
"We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. Matching methods are heavily used in the social sciences due to their interpretability, but most matching methods do not pass basic sanity checks: they fail when irrelevant variables are introduced, and tend to be either computationally slow or produce low-quality matches. The method proposed in this work aims to match units on a weighted Hamming distance, taking into account the relative importance of the covariates; the algorithm aims to match units on as many relevant variables as possible. To do this, the algorithm creates a hierarchy of covariate combinations on which to match (similar to downward closure), in the process solving an optimization problem for each unit in order to construct the optimal matches. The algorithm uses a single dynamic program to solve all of the optimization problems simultaneously. Notable advantages of our method over existing matching procedures are its high-quality matches, versatility in handling different data distributions that may have irrelevant variables, and ability to handle missing data by matching on as many available covariates as possible."
to:NB  causal_inference  matching  statistics  computational_statistics 
june 2019 by cshalizi
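The distance objective is easy to state concretely: match each treated unit, with replacement, to the control unit minimizing a weighted Hamming distance over categorical covariates. A brute-force NumPy sketch of that objective (the covariate hierarchy and dynamic program that make the paper's method scale are omitted; the covariates and weights below are made up):

```python
import numpy as np

def match_with_replacement(treated, control, weights):
    """For each treated unit, return the index of and distance to the control
    unit minimizing the weighted Hamming distance over categorical covariates."""
    # 1 wherever a treated/control pair disagrees on a covariate
    diff = (treated[:, None, :] != control[None, :, :]).astype(float)
    dist = diff @ weights                  # weighted Hamming distances
    return dist.argmin(axis=1), dist.min(axis=1)

rng = np.random.default_rng(0)
control = rng.integers(0, 3, size=(200, 4))           # 4 categorical covariates
treated = control[rng.choice(200, 30, replace=False)].copy()
treated[:, 3] = rng.integers(0, 3, 30)                # perturb one covariate
weights = np.array([1.0, 1.0, 1.0, 0.01])             # last covariate ~irrelevant
idx, d = match_with_replacement(treated, control, weights)
```

Because the perturbed covariate carries almost no weight, every treated unit still finds a near-exact match on the relevant covariates, illustrating the robustness to irrelevant variables the abstract emphasizes.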
Refining the Concept of Scientific Inference When Working with Big Data: Proceedings of a Workshop | The National Academies Press
"The concept of utilizing big data to enable scientific discovery has generated tremendous excitement and investment from both private and public sectors over the past decade, and expectations continue to grow. Using big data analytics to identify complex patterns hidden inside volumes of data that have never been combined could accelerate the rate of scientific discovery and lead to the development of beneficial technologies and products. However, producing actionable scientific knowledge from such large, complex data sets requires statistical models that produce reliable inferences (NRC, 2013). Without careful consideration of the suitability of both available data and the statistical models applied, analysis of big data may result in misleading correlations and false discoveries, which can potentially undermine confidence in scientific research if the results are not reproducible. In June 2016 the National Academies of Sciences, Engineering, and Medicine convened a workshop to examine critical challenges and opportunities in performing scientific inference reliably when working with big data. Participants explored new methodologic developments that hold significant promise and potential research program areas for the future. This publication summarizes the presentations and discussions from the workshop."

--- I am happy with my contributions.
to:NB  books:recommended  books:contributed-to  statistics  computational_statistics  machine_learning  data_mining 
may 2019 by cshalizi
[1902.00640] Particle Flow Bayes' Rule
"We present a particle flow realization of Bayes' rule, where an ODE-based neural operator is used to transport particles from a prior to its posterior after a new observation. We prove that such an ODE operator exists. Its neural parameterization can be trained in a meta-learning framework, allowing this operator to reason about the effect of an individual observation on the posterior, and thus generalize across different priors, observations and to sequential Bayesian inference. We demonstrated the generalization ability of our particle flow Bayes operator in several canonical and high dimensional examples."
to:NB  bayesianism  neural_networks  computational_statistics  statistics  re:fitness_sampling 
may 2019 by cshalizi
[1905.10845] On Coresets for Regularized Loss Minimization
"We design and mathematically analyze sampling-based algorithms for regularized loss minimization problems that are implementable in popular computational models for large data, in which the access to the data is restricted in some way. Our main result is that if the regularizer's effect does not become negligible as the norm of the hypothesis scales, and as the data scales, then a uniform sample of modest size is with high probability a coreset. In the case that the loss function is either logistic regression or soft-margin support vector machines, and the regularizer is one of the common recommended choices, this result implies that a uniform sample of size O(d√n) is with high probability a coreset of n points in ℝ^d. We contrast this upper bound with two lower bounds. The first lower bound shows that our analysis of uniform sampling is tight; that is, a smaller uniform sample will likely not be a coreset. The second lower bound shows that in some sense uniform sampling is close to optimal, as significantly smaller coresets do not generally exist."
to:NB  optimization  learning_theory  statistics  computational_statistics  randomized_algorithms 
may 2019 by cshalizi
[1805.10927] Scalable and Robust Community Detection with Randomized Sketching
"This paper explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This algorithm improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion."
to:NB  community_discovery  random_projections  statistics  computational_statistics 
may 2019 by cshalizi
Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery : Journal of Causal Inference
"Constraint-based causal discovery (CCD) algorithms require fast and accurate conditional independence (CI) testing. The Kernel Conditional Independence Test (KCIT) is currently one of the most popular CI tests in the non-parametric setting, but many investigators cannot use KCIT with large datasets because the test scales at least quadratically with sample size. We therefore devise two relaxations called the Randomized Conditional Independence Test (RCIT) and the Randomized conditional Correlation Test (RCoT) which both approximate KCIT by utilizing random Fourier features. In practice, both of the proposed tests scale linearly with sample size and return accurate p-values much faster than KCIT in the large sample size context. CCD algorithms run with RCIT or RCoT also return graphs at least as accurate as the same algorithms run with KCIT but with large reductions in run time."
to:NB  kernel_methods  dependence_measures  causal_discovery  statistics  computational_statistics 
may 2019 by cshalizi
Modern statistics modern biology | Statistics for life sciences, medicine and health | Cambridge University Press
"If you are a biologist and want to get the best out of the powerful methods of modern computational statistics, this is your book. You can visualize and analyze your own data, apply unsupervised and supervised learning, integrate datasets, apply hypothesis testing, and make publication-quality figures using the power of R/Bioconductor and ggplot2. This book will teach you 'cooking from scratch', from raw data to beautiful illuminating output, as you learn to write your own scripts in the R language and to use advanced statistics packages from CRAN and Bioconductor. It covers a broad range of basic and advanced topics important in the analysis of high-throughput biological data, including principal component analysis and multidimensional scaling, clustering, multiple testing, unsupervised and supervised learning, resampling, the pitfalls of experimental design, and power simulations using Monte Carlo, and it even reaches networks, trees, spatial statistics, image data, and microbial ecology. Using a minimum of mathematical notation, it builds understanding from well-chosen examples, simulation, visualization, and above all hands-on interaction with data and code."
to:NB  books:noted  statistics  computational_statistics  biology  genomics  R 
february 2019 by cshalizi
[1812.00681] Numerical computation of rare events via large deviation theory
"An overview of rare events algorithms based on large deviation theory (LDT) is presented. It covers a range of numerical schemes to compute the large deviation minimizer in various setups, and discusses best practices, common pitfalls, and implementation trade-offs. Generalizations, extensions, and improvements of the minimum action methods are proposed. These algorithms are tested on example problems which illustrate several common difficulties which arise e.g. when the forcing is degenerate or multiplicative, or the systems are infinite-dimensional. Generalizations to processes driven by non-Gaussian noises or random initial data and parameters are also discussed, along with the connection between the LDT-based approach reviewed here and other methods, such as stochastic field theory and optimal control. Finally, the integration of this approach in importance sampling methods using e.g. genealogical algorithms is explored."
to:NB  large_deviations  rare-event_simulation  simulation  computational_statistics  probability  via:rvenkat 
december 2018 by cshalizi
Solving Differential Equations in R: Package deSolve | Soetaert | Journal of Statistical Software
"In this paper we present the R package deSolve to solve initial value problems (IVP) written as ordinary differential equations (ODE), differential algebraic equations (DAE) of index 0 or 1 and partial differential equations (PDE), the latter solved using the method of lines approach. The differential equations can be represented in R code or as compiled code. In the latter case, R is used as a tool to trigger the integration and post-process the results, which facilitates model development and application, whilst the compiled code significantly increases simulation speed. The methods implemented are efficient, robust, and well documented public-domain Fortran routines. They include four integrators from the ODEPACK package (LSODE, LSODES, LSODA, LSODAR), DVODE and DASPK2.0. In addition, a suite of Runge-Kutta integrators and special-purpose solvers to efficiently integrate 1-, 2- and 3-dimensional partial differential equations are available. The routines solve both stiff and non-stiff systems, and include many options, e.g., to deal in an efficient way with the sparsity of the Jacobian matrix, or finding the root of equations. In this article, our objectives are threefold: (1) to demonstrate the potential of using R for dynamic modeling, (2) to highlight typical uses of the different methods implemented and (3) to compare the performance of models specified in R code and in compiled code for a number of test cases. These comparisons demonstrate that, if the use of loops is avoided, R code can efficiently integrate problems comprising several thousands of state variables. Nevertheless, the same problem may be solved from 2 to more than 50 times faster by using compiled code compared to an implementation using only R code. Still, amongst the benefits of R are a more flexible and interactive implementation, better readability of the code, and access to R’s high-level procedures. 
"deSolve is the successor of package odesolve which will be deprecated in the future; it is free software and distributed under the GNU General Public License, as part of the R software project."
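--- deSolve itself is R (and wraps adaptive ODEPACK/DVODE solvers), but the core of what an IVP solver does fits in a few lines. A fixed-step RK4 sketch in Python for the logistic equation, checked against the closed-form solution; this is my illustration, not a substitute for the package's adaptive, stiff-aware integrators:

```python
import numpy as np

def rk4(f, y0, ts):
    """Fixed-step 4th-order Runge-Kutta over the time grid ts."""
    ys = [np.asarray(y0, dtype=float)]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h, y = t1 - t0, ys[-1]
        k1 = f(t0, y)
        k2 = f(t0 + h / 2, y + h / 2 * k1)
        k3 = f(t0 + h / 2, y + h / 2 * k2)
        k4 = f(t1, y + h * k3)
        ys.append(y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(ys)

# logistic growth dy/dt = r*y*(1 - y/K), which has a known closed form
r, K, y0 = 1.0, 10.0, 0.5
ts = np.linspace(0.0, 5.0, 101)
num = rk4(lambda t, y: r * y * (1 - y / K), [y0], ts)[:, 0]
exact = K / (1 + (K / y0 - 1) * np.exp(-r * ts))
print(f"max abs error vs closed form: {np.abs(num - exact).max():.2e}")
```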
to:NB  dynamical_systems  computational_statistics  R  to_teach:data_over_space_and_time  have_read 
december 2018 by cshalizi
[1808.04739] Simulating Markov random fields with a conclique-based Gibbs sampler
"For spatial and network data, we consider models formed from a Markov random field (MRF) structure and the specification of a conditional distribution for each observation. At issue, fast simulation from such MRF models is often an important consideration, particularly when repeated generation of large numbers of data sets is required (e.g., for approximating sampling distributions). However, a standard Gibbs strategy for simulating from MRF models involves single-updates, performed with the conditional distribution of each observation in a sequential manner, whereby a Gibbs iteration may become computationally involved even for relatively small samples. As an alternative, we describe a general way to simulate from MRF models using Gibbs sampling with "concliques" (i.e., groups of non-neighboring observations). Compared to standard Gibbs sampling, this simulation scheme can be much faster by reducing Gibbs steps and by independently updating all observations per conclique at once. We detail the simulation method, establish its validity, and assess its computational performance through numerical studies, where speed advantages are shown for several spatial and network examples."

--- Slides: http://andeekaplan.com/phd-thesis/slides/public.pdf
--- There's an R package on Github but I couldn't get it to compile...
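--- The idea is easy to see on a lattice: under a 4-nearest-neighbour MRF, the two checkerboard colours are concliques, so each colour can be updated in one vectorized Gibbs step instead of site by site. My own numpy sketch for an Ising-type autologistic model (grid size, temperature, and sweep count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# 4-neighbour autologistic (Ising-type) model on an L x L torus;
# the two checkerboard colours are conditionally independent sets
L, beta, sweeps = 32, 0.3, 200
x = rng.integers(0, 2, size=(L, L)) * 2 - 1          # spins in {-1, +1}

ii, jj = np.indices((L, L))
color = (ii + jj) % 2                                 # checkerboard colouring

for _ in range(sweeps):
    for c in (0, 1):
        # sum of the four neighbours (periodic boundary) at every site
        nb = (np.roll(x, 1, 0) + np.roll(x, -1, 0)
              + np.roll(x, 1, 1) + np.roll(x, -1, 1))
        p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))  # P(x_s = +1 | rest)
        flip = rng.random((L, L)) < p_up
        # update the whole conclique at once
        x = np.where(color == c, np.where(flip, 1, -1), x)

print("mean spin after sampling:", x.mean())
```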
random_fields  simulation  stochastic_processes  spatial_statistics  network_data_analysis  markov_models  statistics  computational_statistics  to_teach:data_over_space_and_time  have_read  in_NB 
december 2018 by cshalizi
[1710.05013] A Case Study Competition Among Methods for Analyzing Large Spatial Data
"The Gaussian process is an indispensable tool for spatial data analysts. The onset of the "big data" era, however, has lead to the traditional Gaussian process being computationally infeasible for modern spatial data. As such, various alternatives to the full Gaussian process that are more amenable to handling big spatial data have been proposed. These modern methods often exploit low rank structures and/or multi-core and multi-threaded computing environments to facilitate computation. This study provides, first, an introductory overview of several methods for analyzing large spatial data. Second, this study describes the results of a predictive competition among the described methods as implemented by different groups with strong expertise in the methodology. Specifically, each research group was provided with two training datasets (one simulated and one observed) along with a set of prediction locations. Each group then wrote their own implementation of their method to produce predictions at the given location and each which was subsequently run on a common computing environment. The methods were then compared in terms of various predictive diagnostics. Supplementary materials regarding implementation details of the methods and code are available for this article online."
in_NB  spatial_statistics  prediction  computational_statistics  statistics  to_teach:data_over_space_and_time 
october 2018 by cshalizi
Object-oriented Computation of Sandwich Estimators | Zeileis | Journal of Statistical Software
"Sandwich covariance matrix estimators are a popular tool in applied regression modeling for performing inference that is robust to certain types of model misspecification. Suitable implementations are available in the R system for statistical computing for certain model fitting functions only (in particular lm()), but not for other standard regression functions, such as glm(), nls(), or survreg(). Therefore, conceptual tools and their translation to computational tools in the package sandwich are discussed, enabling the computation of sandwich estimators in general parametric models. Object orientation can be achieved by providing a few extractor functions' most importantly for the empirical estimating functions' from which various types of sandwich estimators can be computed."
to:NB  computational_statistics  R  estimation  regression  statistics  to_teach 
october 2018 by cshalizi
General Resampling Infrastructure • rsample
"rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:
"traditional resampling techniques for estimating the sampling distribution of a statistic and
"estimating model performance using a holdout set
"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."
to:NB  R  computational_statistics  to_teach:statcomp  to_teach:undergrad-ADA  via:? 
august 2018 by cshalizi
[1806.06850] Polynomial Regression As an Alternative to Neural Nets
"Despite the success of neural networks (NNs), there is still a concern among many over their "black box" nature. Why do they work? Here we present a simple analytic argument that NNs are in fact essentially polynomial regression models. This view will have various implications for NNs, e.g. providing an explanation for why convergence problems arise in NNs, and it gives rough guidance on avoiding overfitting. In addition, we use this phenomenon to predict and confirm a multicollinearity property of NNs not previously reported in the literature. Most importantly, given this loose correspondence, one may choose to routinely use polynomial models instead of NNs, thus avoiding some major problems of the latter, such as having to set many tuning parameters and dealing with convergence issues. We present a number of empirical results; in each case, the accuracy of the polynomial approach matches or exceeds that of NN approaches. A many-featured, open-source software package, polyreg, is available."

--- Matloff is the author of my favorite "R programming for n00bs" textbook...
--- ETA after reading: the argument that multi-layer neural networks "are essentially" polynomial regression is a bit weak. It would be true, exactly, if activation functions were exactly polynomial, which however they rarely are in practice. If non-polynomial activations happen to be implemented in computational practice by polynomials (e.g., Taylor approximations), well, either we get different hardware or we crank up the degree of approximation as much as we like. (Said a little differently, if you buy this line of argument, you should buy that _every_ smooth statistical model "is essentially" polynomial regression, which seems a bit much.) It is, also, an argument about the function-approximation properties of the model classes, and not the fitting processes, despite the explicit disclaimers.
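--- For what the polynomial alternative amounts to in practice, a tiny numpy sketch of my own (not the polyreg package): expand the inputs into a polynomial design matrix and run ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(5)

# noisy samples of a smooth nonlinear target
x = rng.uniform(-2, 2, size=400)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

# degree-5 polynomial design matrix + ordinary least squares
X = np.vander(x, N=6, increasing=True)    # columns 1, x, ..., x^5
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

grid = np.linspace(-2, 2, 101)
fit = np.vander(grid, N=6, increasing=True) @ coef
err = np.abs(fit - np.sin(grid)).max()
print(f"max deviation from sin(x) on [-2, 2]: {err:.3f}")
```

No tuning parameters beyond the degree, which is both the appeal and (in high dimensions, where the number of monomials explodes) the limitation.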
to:NB  your_favorite_deep_neural_network_sucks  regression  neural_networks  statistics  matloff.norman  approximation  computational_statistics  have_read 
july 2018 by cshalizi
now publishers - Community Detection and Stochastic Block Models
"The stochastic block model (SBM) is a random graph model with different group of vertices connecting differently. It is widely employed as a canonical model to study clustering and community detection, and provides a fertile ground to study the information-theoretic and computational tradeoffs that arise in combinatorial statistics and more generally data science. This monograph surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational tradeoffs, and for various recovery requirements such as exact, partial and weak recovery. The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal SNR-mutual information tradeoff for partial recovery, and the gap between information-theoretic and computational thresholds."
to:NB  community_discovery  network_data_analysis  information_theory  computational_statistics  statistics 
june 2018 by cshalizi
[1406.2661] Generative Adversarial Networks
"We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples."

--- Kind of astonishing that the phrases "actor-critic" or "co-evolution" do not appear anywhere in the paper. The bit about KL divergence is nice, though. (But even then, it doesn't involve multi-layer perceptrons at all, it's all at the level of probability distributions and could apply to any learning system.)
to:NB  have_read  neural_networks  machine_learning  computational_statistics  statistics 
june 2018 by cshalizi
Large-scale kernel methods for independence testing | SpringerLink
"Representations of probability measures in reproducing kernel Hilbert spaces provide a flexible framework for fully nonparametric hypothesis tests of independence, which can capture any type of departure from independence, including nonlinear associations and multivariate interactions. However, these approaches come with an at least quadratic computational cost in the number of observations, which can be prohibitive in many applications. Arguably, it is exactly in such large-scale datasets that capturing any type of dependence is of interest, so striking a favourable trade-off between computational efficiency and test performance for kernel independence tests would have a direct impact on their applicability in practice. In this contribution, we provide an extensive study of the use of large-scale kernel approximations in the context of independence testing, contrasting block-based, Nyström and random Fourier feature approaches. Through a variety of synthetic data experiments, it is demonstrated that our large-scale methods give comparable performance with existing methods while using significantly less computation time and memory."
to:NB  computational_statistics  dependence_measures  hypothesis_testing  statistics  kernel_methods  hilbert_space  gretton.arthur 
may 2018 by cshalizi
[1803.01422] DAGs with NO TEARS: Smooth Optimization for Structure Learning
"Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint and are not well-suited to general purpose optimization packages for their solution. In this paper, we introduce a fundamentally different strategy: We formulate the structure learning problem as a smooth, constrained optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting nonconvex, constrained program involves smooth functions whose gradients are easy to compute and only involve elementary matrix operations. By using existing black-box optimization routines, our method uses global search to find an optimal DAG and can be implemented in about 50 lines of Python and outperforms existing methods without imposing any structural constraints."

to:NB  to_read  graphical_models  optimization  computational_statistics  xing.eric  ravikumar.pradeep 
march 2018 by cshalizi
[1802.01396] To understand deep learning we need to understand kernel learning
"Generalization performance of classifiers in deep learning has recently become a subject of intense study. Heavily over-parametrized deep models tend to fit training data exactly. Despite overfitting, they perform well on test data, a phenomenon not yet fully understood.
"The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using real-world and synthetic datasets, we establish that kernel classifiers trained to have zero classification error (overfitting) or even zero regression error (interpolation) perform very well on test data.
"We proceed to prove lower bounds on the norm of overfitted solutions for smooth kernels, showing that they increase nearly exponentially with the data size. Since most generalization bounds depend polynomially on the norm of the solution, this result implies that they diverge as data increases. Furthermore, the existing bounds do not apply to interpolated classifiers.
"We also show experimentally that (non-smooth) Laplacian kernels easily fit random labels using a version of SGD, a finding that parallels results reported for ReLU neural networks. In contrast, fitting noisy data requires many more epochs for smooth Gaussian kernels. The observation that the performance of overfitted Laplacian and Gaussian classifiers on the test is quite similar, suggests that generalization is tied to the properties of the kernel function rather than the optimization process.
"We see that some key phenomena of deep learning are manifested similarly in kernel methods in the overfitted regime. We argue that progress on understanding deep learning will be difficult, until more analytically tractable "shallow" kernel methods are better understood. The experimental and theoretical results presented in this paper indicate a need for new theoretical ideas for understanding classical kernel methods."

--- Of course, this also makes me wonder whether there really are practical advantages to deep networks, over and above what we'd get by throwing resources at kernels...
to:NB  neural_networks  kernel_methods  classifiers  regression  statistics  computational_statistics  learning_theory  belkin.mikhail 
march 2018 by cshalizi
Perturbations, Optimization, and Statistics | The MIT Press
"In nearly all machine learning, decisions must be made given current knowledge. Surprisingly, making what is believed to be the best decision is not always the best strategy, even when learning in a supervised learning setting. An emerging body of work on learning under different rules applies perturbations to decision and learning procedures. These methods provide simple and highly efficient learning rules with improved theoretical guarantees. This book describes perturbation-based methods developed in machine learning to augment novel optimization methods with strong statistical guarantees, offering readers a state-of-the-art overview.
"Chapters address recent modeling ideas that have arisen within the perturbations framework, including Perturb & MAP, herding, and the use of neural networks to map generic noise to distribution over highly structured data. They describe new learning procedures for perturbation models, including an improved EM algorithm and a learning algorithm that aims to match moments of model samples to moments of data. They discuss understanding the relation of perturbation models to their traditional counterparts, with one chapter showing that the perturbations viewpoint can lead to new algorithms in the traditional setting. And they consider perturbation-based regularization in neural networks, offering a more complete understanding of dropout and studying perturbations in the context of deep neural networks."
to:NB  books:noted  computational_statistics  optimization  stochastic_approximation  learning_theory  statistics 
october 2017 by cshalizi
Deep Learning | The MIT Press
"Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning.
"The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models.
"Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors."

--- Bengio is worth listening to.
in_NB  books:noted  neural_networks  machine_learning  computational_statistics  bengio.yoshua  books:owned 
december 2016 by cshalizi
[1611.05923] "Influence Sketching": Finding Influential Samples In Large-Scale Regressions
"There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence General Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples, and find that influence sketching pointed us to new, previously unidentified pieces of malware."
to:NB  regression  linear_regression  computational_statistics  random_projections  via:vaguery 
december 2016 by cshalizi
[1311.4500] Time series prediction via aggregation : an oracle bound including numerical cost
"We address the problem of forecasting a time series meeting the Causal Bernoulli Shift model, using a parametric set of predictors. The aggregation technique provides a predictor with well established and quite satisfying theoretical properties expressed by an oracle inequality for the prediction risk. The numerical computation of the aggregated predictor usually relies on a Markov chain Monte Carlo method whose convergence should be evaluated. In particular, it is crucial to bound the number of simulations needed to achieve a numerical precision of the same order as the prediction risk. In this direction we present a fairly general result which can be seen as an oracle inequality including the numerical cost of the predictor computation. The numerical cost appears by letting the oracle inequality depend on the number of simulations required in the Monte Carlo approximation. Some numerical experiments are then carried out to support our findings."
to:NB  prediction  time_series  statistics  computational_statistics  monte_carlo  ensemble_methods 
december 2016 by cshalizi
[1206.7051] Stochastic Variational Inference
We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.
in_NB  statistics  computational_statistics  variational_inference  stochastic_approximation  blei.david 
november 2016 by cshalizi
Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
"Randomized neural networks are immortalized in this AI Koan: _In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. What are you doing?'' asked Minsky. I am training a randomly wired neural net to play tic-tac-toe,'' Sussman replied. Why is the net wired randomly?'' asked Minsky. Sussman replied, I do not want it to have any preconceptions of how to play.'' Minsky then shut his eyes. Why do you close your eyes?'' Sussman asked his teacher. So that the room will be empty,'' replied Minsky. At that moment, Sussman was enlightened._ We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities."
to:NB  have_read  random_projections  kernel_methods  prediction  computational_statistics  statistics  classifiers 
february 2016 by cshalizi
Fastfood - Computing Hilbert Space Expansions in loglinear time | ICML 2013 | JMLR W&CP
"Fast nonlinear function classes are crucial for nonparametric estimation, such as in kernel methods. This paper proposes an improvement to random kitchen sinks that offers significantly faster computation in log-linear time without sacrificing accuracy. Furthermore, we show how one may adjust the regularization properties of the kernel simply by changing the spectral distribution of the projection matrix. We provide experimental results which show that even for for moderately small problems we already achieve two orders of magnitude faster computation and three orders of magnitude lower memory footprint."
to:NB  regression  nonparametrics  computational_statistics  hilbert_space  kernel_methods  smola.alex  le.quoc  prediction  statistics  random_projections 
february 2016 by cshalizi
[1601.00670] Variational Inference: A Review for Statisticians
"One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms."
to:NB  variational_inference  statistics  computational_statistics  approximation  blei.david 
february 2016 by cshalizi
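--- The core recipe ("posit a family, find the member closest to the target in KL divergence") fits in a few lines of numpy. A toy sketch of my own, not from the paper: for a conjugate normal model the exact posterior is known in closed form, so we can check that minimizing the negative ELBO over a Gaussian variational family (here by brute-force grid search rather than coordinate ascent) recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # data; likelihood N(theta, 1)

# Prior N(0, 10^2); conjugacy gives the exact posterior for comparison.
prior_var, lik_var = 100.0, 1.0
post_var = 1.0 / (1.0 / prior_var + len(x) / lik_var)
post_mean = post_var * (x.sum() / lik_var)    # prior mean is 0

# Unnormalized log joint log p(theta) + log p(x | theta) on a grid
# (additive constants dropped; they do not affect the argmin).
grid = np.linspace(-5.0, 5.0, 2001)
dg = grid[1] - grid[0]
log_joint = (-grid**2 / (2 * prior_var)
             - ((x[:, None] - grid[None, :])**2).sum(axis=0) / (2 * lik_var))

def neg_elbo(m, s):
    """Negative ELBO for the variational family q = N(m, s^2):
    -(E_q[log joint] + entropy of q), with E_q by Riemann sum."""
    q = np.exp(-(grid - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return -((q * log_joint).sum() * dg + entropy)

ms = np.linspace(post_mean - 1.0, post_mean + 1.0, 81)
ss = np.linspace(0.05, 1.0, 96)
scores = np.array([[neg_elbo(m, s) for s in ss] for m in ms])
i, j = np.unravel_index(scores.argmin(), scores.shape)
m_hat, s_hat = ms[i], ss[j]   # should match (post_mean, sqrt(post_var))
```

Since the exact posterior lies inside the variational family here, VI is exact; the interesting cases in the paper are of course the ones where it is not.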
Sequential Testing for Large Scale Learning
"We argue that when faced with big data sets, learning and inference algorithms should compute updates using only subsets of data items. We introduce algorithms that use sequential hypothesis tests to adaptively select such a subset of data points. The statistical properties of this subsampling process can be used to control the efficiency and accuracy of learning or inference. In the context of learning by optimization, we test for the probability that the update direction is no more than 90 degrees in the wrong direction. In the context of posterior inference using Markov chain Monte Carlo, we test for the probability that our decision to accept or reject a sample is wrong. We experimentally evaluate our algorithms on a number of models and data sets."
to:NB  computational_statistics  hypothesis_testing  machine_learning  optimization  monte_carlo 
december 2015 by cshalizi
[1507.01059] Remarks on kernel Bayes' rule
"Kernel Bayes' rule has been proposed as a nonparametric kernel-based method to realize Bayesian inference in reproducing kernel Hilbert spaces. However, we demonstrate both theoretically and experimentally that the prediction result by kernel Bayes' rule is in some cases unnatural. We consider that this phenomenon is in part due to the fact that the assumptions in kernel Bayes' rule do not hold in general."

--- The point about the "kernel posterior" being insensitive to the prior seems the strongest to me; so strong that despite reading the proof I am not sure it's right, as opposed to an algebraic mistake I'm missing.
to:NB  bayesianism  kernel_methods  hilbert_space  computational_statistics  statistics  have_read 
august 2015 by cshalizi
[1507.05185] Fast Sparse Least-Squares Regression with Non-Asymptotic Guarantees
"In this paper, we study a fast approximation method for {\it large-scale high-dimensional} sparse least-squares regression problem by exploiting the Johnson-Lindenstrauss (JL) transforms, which embed a set of high-dimensional vectors into a low-dimensional space. In particular, we propose to apply the JL transforms to the data matrix and the target vector and then to solve a sparse least-squares problem on the compressed data with a {\it slightly larger regularization parameter}. Theoretically, we establish the optimization error bound of the learned model for two different sparsity-inducing regularizers, i.e., the elastic net and the ℓ1 norm. Compared with previous relevant work, our analysis is {\it non-asymptotic and exhibits more insights} on the bound, the sample complexity and the regularization. As an illustration, we also provide an error bound of the {\it Dantzig selector} under JL transforms."
to:NB  regression  random_projections  computational_statistics  statistics  sparsity 
august 2015 by cshalizi
[1507.00066] Fast Cross-Validation for Incremental Learning
"Cross-validation (CV) is one of the main tools for performance estimation and parameter tuning in machine learning. The general recipe for computing CV estimate is to run a learning algorithm separately for each CV fold, a computationally expensive process. In this paper, we propose a new approach to reduce the computational burden of CV-based performance estimation. As opposed to all previous attempts, which are specific to a particular learning model or problem domain, we propose a general method applicable to a large class of incremental learning algorithms, which are uniquely fitted to big data problems. In particular, our method applies to a wide range of supervised and unsupervised learning tasks with different performance criteria, as long as the base learning algorithm is incremental. We show that the running time of the algorithm scales logarithmically, rather than linearly, in the number of CV folds. Furthermore, the algorithm has favorable properties for parallel and distributed implementation. Experiments with state-of-the-art incremental learning algorithms confirm the practicality of the proposed method."
to:NB  cross-validation  computational_statistics  statistics 
august 2015 by cshalizi
[1507.04553] Approximate Maximum Likelihood Estimation
"In recent years, methods of approximate parameter estimation have attracted considerable interest in complex problems where exact likelihoods are hard to obtain. In their most basic form, Bayesian methods such as Approximate Bayesian Computation (ABC) involve sampling from the parameter space and keeping those parameters that produce data that fit sufficiently well to the actually observed data. Exploring the whole parameter space, however, makes this approach inefficient in high dimensional problems. This led to the proposal of more sophisticated iterative methods of inference such as particle filters.
"Here, we propose an alternative approach that is based on stochastic gradient methods and applicable both in a frequentist and a Bayesian setting. By moving along a simulated gradient, the algorithm produces a sequence of estimates that will eventually converge either to the maximum likelihood estimate or to the maximum of the posterior distribution, in each case under a set of observed summary statistics. To avoid reaching only a local maximum, we propose to run the algorithm from a set of random starting values.
"As good tuning of the algorithm is important, we explored several tuning strategies, and propose a set of guidelines that worked best in our simulations. We investigate the performance of our approach in simulation studies, and also apply the algorithm to two models with intractable likelihood functions. First, we present an application to inference in the context of queuing systems. We also re-analyze population genetic data and estimate parameters describing the demographic history of Sumatran and Bornean orang-utan populations."
in_NB  statistics  computational_statistics  stochastic_approximation  likelihood  estimation  primates 
august 2015 by cshalizi
[1507.06346] Evaluation of Spectral Learning for the Identification of Hidden Markov Models
"Hidden Markov models have successfully been applied as models of discrete time series in many fields. Often, when applied in practice, the parameters of these models have to be estimated. The currently predominating identification methods, such as maximum-likelihood estimation and especially expectation-maximization, are iterative and prone to have problems with local minima. A non-iterative method employing a spectral subspace-like approach has recently been proposed in the machine learning literature. This paper evaluates the performance of this algorithm, and compares it to the performance of the expectation-maximization algorithm, on a number of numerical examples. We find that the performance is mixed; it successfully identifies some systems with relatively few available observations, but fails completely for some systems even when a large amount of observations is available. An open question is how this discrepancy can be explained. We provide some indications that it could be related to how well-conditioned some system parameters are."
to:NB  spectral_methods  markov_models  state-space_models  statistics  time_series  computational_statistics 
august 2015 by cshalizi
Importance Sampling Over Sets: A New Probabilistic Inference Scheme
"Computing expectations in high-dimensional spaces is a key challenge in probabilistic infer- ence and machine learning. Monte Carlo sam- pling, and importance sampling in particular, is one of the leading approaches. We propose a generalized importance sampling scheme based on randomly selecting (exponentially large) sub- sets of states rather than individual ones. By col- lecting a small number of extreme states in the sampled sets, we obtain estimates of statistics of interest, such as the partition function of an undirected graphical model. We incorporate this idea into a novel maximum likelihood learning algorithm based on cutting planes. We demon- strate empirically that our scheme provides accu- rate answers and scales to problems with up to a million variables."

--- Applicable to the fitness-breeding scheme?
to:NB  computational_statistics  monte_carlo  statistics  approximation  re:fitness_sampling 
july 2015 by cshalizi
Stable Spectral Learning Based on Schur Decomposition
"Spectral methods are a powerful tool for inferring the parameters of certain classes of probability distributions by means of standard eigenvalue- eigenvector decompositions. Spectral algorithms can be orders of magnitude faster than log- likelihood based and related iterative methods, and, thanks to the uniqueness of the spectral de- composition, they enjoy global optimality guar- antees. In practice, however, the applicability of spectral methods is limited due to their sensitiv- ity to model misspecification, which can cause instability issues in the case of non-exact models. We present a new spectral approach that is based on the Schur triangularization of an observable matrix, and we carry out the corresponding theo- retical analysis. Our main result is a bound on the estimation error that is shown to depend linearly on the condition number of the ground-truth con- ditional probability matrix and inversely on the eigengap of an observable matrix. Numerical ex- periments show that the proposed method is more stable, and performs better in general, than the classical spectral approach using direct matrix di- agonalization."
to:NB  to_read  spectral_methods  mixture_models  statistics  computational_statistics  estimation  misspecification 
july 2015 by cshalizi
Random Features for Large-Scale Kernel Machines
"To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shift- invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning al- gorithms applied to these features outperform state-of-the-art large-scale kernel machines."
in_NB  to_read  machine_learning  computational_statistics  approximation  random_projections  fourier_analysis  kernel_methods  statistics  re:hyperbolic_networks 
july 2015 by cshalizi
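--- The construction is short enough to sketch in numpy (my illustration, not the authors' code): sample frequencies from the Gaussian kernel's spectral density plus a uniform phase, and inner products of the random cosine features approximate the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 2000          # input dimension, number of random features
gamma = 0.5             # RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)

# For this kernel the spectral density is N(0, 2*gamma I), so draw the
# frequencies W from it, plus a uniform phase b in [0, 2*pi).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def features(X):
    """z(x) = sqrt(2/D) * cos(W x + b); then z(x).z(y) ~ k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = rng.normal(size=(10, d))
Z = features(X)
K_exact = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K_approx = Z @ Z.T
max_err = np.abs(K_exact - K_approx).max()
```

The point of the paper is that a linear method run on `Z` now costs O(nD) instead of the O(n^2) (or worse) of working with the full kernel matrix.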
Training generative neural networks via maximum mean discrepancy optimization
"We consider training a deep neural network to generate samples from an unknown distribu- tion given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic—informally speaking, a good genera- tor network produces samples that cause a two- sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an un- biased estimate of the maximum mean discrep- ancy, which is the centerpiece of the nonpara- metric kernel two-sample test proposed by Gret- ton et al. [2]. We compare to the adversar- ial nets framework introduced by Goodfellow et al. [1], in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. From this perspective, the MMD statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical MMD."

--- On first glance, there's no obvious limitation to neural networks, and indeed it's rather suggestive of indirect inference (to me)
to:NB  simulation  stochastic_models  neural_networks  machine_learning  two-sample_tests  hypothesis_testing  nonparametrics  kernel_methods  statistics  computational_statistics  ghahramani.zoubin 
july 2015 by cshalizi
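--- The unbiased MMD^2 statistic itself is simple to implement; a numpy sketch of the Gretton et al. estimator (the generative-network training loop, which is the paper's actual contribution, is omitted):

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased estimate of squared maximum mean discrepancy between
    samples X and Y, with RBF kernel k(a, b) = exp(-gamma ||a - b||^2)."""
    def k(A, B):
        return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # Drop the diagonal self-similarity terms for unbiasedness.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(300, 2)), rng.normal(size=(300, 2)))
diff = mmd2_unbiased(rng.normal(size=(300, 2)),
                     rng.normal(loc=1.0, size=(300, 2)))
```

`same` hovers near zero (and can go slightly negative, being unbiased), while `diff` is clearly positive; a generator trained to minimize this statistic is being trained to make a two-sample test fail.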
Estimating the Partition Function by Discriminance Sampling
"Importance sampling (IS) and its variant, an- nealed IS (AIS) have been widely used for es- timating the partition function in graphical mod- els, such as Markov random fields and deep gen- erative models. However, IS tends to underesti- mate the partition function and is subject to high variance when the proposal distribution is more peaked than the target distribution. On the other hand, “reverse” versions of IS and AIS tend to overestimate the partition function, and degener- ate when the target distribution is more peaked than the proposal distribution. In this work, we present a simple, general method that gives much more reliable and robust estimates than either IS (AIS) or reverse IS (AIS). Our method works by converting the estimation problem into a simple classification problem that discriminates between the samples drawn from the target and the pro- posal. We give extensive theoretical and empir- ical justification; in particular, we show that an annealed version of our method significantly out- performs both AIS and reverse AIS as proposed by Burda et al. (2015), which has been the state- of-the-art for likelihood evaluation in deep gen- erative models."
in_NB  heard_the_talk  computational_statistics  exponential_families  statistics  monte_carlo  classifiers 
july 2015 by cshalizi
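--- For reference, the baseline the paper improves on, plain importance sampling of a partition function, looks like this on a toy discrete model (my sketch, not the authors' discriminance method; with a uniform proposal there is no peakedness problem, which is exactly why this easy case works):

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny discrete model: p(x) proportional to exp(-E(x)) on x in {0, ..., 9}.
E = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
Z_exact = np.exp(-E).sum()   # exact partition function, about 2.411

# Importance sampling with the uniform proposal q(x) = 1/10:
# Z = E_q[ exp(-E(X)) / q(X) ], estimated by a sample average.
n = 100_000
x = rng.integers(0, 10, size=n)
Z_hat = np.mean(np.exp(-E[x]) * 10)
```

Replace the uniform proposal by one much more peaked than the target and the same estimator becomes the unreliable, underestimating one the abstract describes.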
[1505.00869] On the Feasibility of Distributed Kernel Regression for Big Data
"In modern scientific research, massive datasets with huge numbers of observations are frequently encountered. To facilitate the computational process, a divide-and-conquer scheme is often used for the analysis of big data. In such a strategy, a full dataset is first split into several manageable segments; the final output is then averaged from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown that whether such a distributive strategy provides valid theoretical inferences to the original data. In this paper, we address this fundamental issue for the distributed kernel regression (DKR), where the algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed for bounding the generalization error over the individual outputs, which brings new and challenging issues in the big data setup. Under mild conditions, we show that, with a proper number of segments, DKR leads to an estimator that is generalization consistent to the unknown regression function. The obtained results justify the method of DKR and shed light on the feasibility of using other distributed algorithms for processing big data. The promising preference of the method is supported by both simulation and real data examples."
to:NB  computational_statistics  kernel_estimators  regression  statistics 
july 2015 by cshalizi
A Deterministic Partition Function Approximation for Exponential Random Graph Models
"Exponential Random Graphs Models (ERGM) are common, simple statistical models for social net- work and other network structures. Unfortunately, inference and learning with them is hard even for small networks because their partition functions are intractable for precise computation. In this paper, we introduce a new quadratic time deterministic approximation to these partition functions. Our main insight enabling this advance is that subgraph statistics is sufficient to derive a lower bound for partition functions given that the model is not dom- inated by a few graphs. The proposed method dif- fers from existing methods in its ways of exploiting asymptotic properties of subgraph statistics. Com- pared to the current Monte Carlo simulation based methods, the new method is scalable, stable, and precise enough for inference tasks."

--- To look at carefully. Doesn't do anything to address the lack of projectivity, but may suggest useful techniques for related models. (From a quick scan, it looks like it's some sort of Laplace approximation, thus presumably related to large deviations.)
in_NB  exponential_family_random_graphs  computational_statistics  statistics  network_data_analysis 
june 2015 by cshalizi
[1504.04941] Fast Moment-Based Estimation for Hierarchical Models
"Hierarchical models allow for heterogeneous behaviours in a population while simultaneously borrowing estimation strength across all subpopulations. Unfortunately, existing likelihood-based methods for fitting hierarchical models have high computational demands, and these demands have limited their adoption in large-scale prediction and inference problems. This paper proposes a moment-based procedure for estimating the parameters of a hierarchical model which has its roots in a method originally introduced by Cochran in 1937. The method trades statistical efficiency for computational efficiency. It gives consistent parameter estimates, competitive prediction error performance, and substantial computational improvements. When applied to a large-scale recommender system application and compared to a standard maximum likelihood procedure, the method delivers competitive prediction performance while reducing the sequential computation time from hours to minutes."
have_read  hierarchical_statistical_models  computational_statistics  statistics  perry.patrick_o.  in_NB 
may 2015 by cshalizi
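--- The flavor of the method: for a balanced one-way random-effects model, moment-based estimates need no likelihood iterations at all. A sketch of the general idea (mine, not Perry's actual estimator) for y_ij = mu + a_j + e_ij with a_j ~ N(0, tau^2) and e_ij ~ N(0, sigma^2):

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 500, 20                    # J groups, n observations per group
mu, tau, sigma = 1.0, 2.0, 1.5
a = rng.normal(0, tau, size=J)
y = mu + a[:, None] + rng.normal(0, sigma, size=(J, n))

# Method-of-moments estimates, each a single pass over the data:
group_means = y.mean(axis=1)
sigma2_hat = y.var(axis=1, ddof=1).mean()            # avg within-group var
tau2_hat = group_means.var(ddof=1) - sigma2_hat / n  # between minus noise
mu_hat = group_means.mean()
```

Maximum likelihood would iterate over the whole dataset repeatedly; this trades some statistical efficiency for closed-form, one-pass computation, which is the paper's bargain.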
On parallel implementation of sequential Monte Carlo methods: the island particle model - Springer
"The approximation of the Feynman-Kac semigroups by systems of interacting particles is a very active research field, with applications in many different areas. In this paper, we study the parallelization of such approximations. The total population of particles is divided into sub-populations, referred to as islands. The particles within each island follow the usual selection/mutation dynamics. We show that the evolution of each island is also driven by a Feynman-Kac semigroup, whose transition and potential can be explicitly related to ones of the original problem. Therefore, the same genetic type approximation of the Feynman-Kac semi-group may be used at the island level; each island might undergo selection/mutation algorithm. We investigate the impact of the population size within each island and the number of islands, and study different type of interactions. We find conditions under which introducing interactions between islands is beneficial. The theoretical results are supported by some Monte Carlo experiments."
to:NB  particle_filters  monte_carlo  computational_statistics  stochastic_processes  interacting_particle_systems  re:fitness_sampling 
february 2015 by cshalizi
[1501.06794] Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations
"We describe a method to perform functional operations on probability distributions of random variables. The method uses reproducing kernel Hilbert space representations of probability distributions, and it is applicable to all operations which can be applied to points drawn from the respective distributions. We refer to our approach as {\em kernel probabilistic programming}. We illustrate it on synthetic data, and show how it can be used for nonparametric structural equation models, with an application to causal inference."
to:NB  kernel_methods  hilbert_space  computational_statistics  causal_inference  statistics 
february 2015 by cshalizi
Crisan , Míguez : Particle-kernel estimation of the filter density in state-space models
"Sequential Monte Carlo (SMC) methods, also known as particle filters, are simulation-based recursive algorithms for the approximation of the a posteriori probability measures generated by state-space dynamical models. At any given time t, a SMC method produces a set of samples over the state space of the system of interest (often termed “particles”) that is used to build a discrete and random approximation of the posterior probability distribution of the state variables, conditional on a sequence of available observations. One potential application of the methodology is the estimation of the densities associated to the sequence of a posteriori distributions. While practitioners have rather freely applied such density approximations in the past, the issue has received less attention from a theoretical perspective. In this paper, we address the problem of constructing kernel-based estimates of the posterior probability density function and its derivatives, and obtain asymptotic convergence results for the estimation errors. In particular, we find convergence rates for the approximation errors that hold uniformly on the state space and guarantee that the error vanishes almost surely as the number of particles in the filter grows. Based on this uniform convergence result, we first show how to build continuous measures that converge almost surely (with known rate) toward the posterior measure and then address a few applications. The latter include maximum a posteriori estimation of the system state using the approximate derivatives of the posterior density and the approximation of functionals of it, for example, Shannon’s entropy."
in_NB  particle_filters  kernel_estimators  density_estimation  filtering  state_estimation  state-space_models  statistics  computational_statistics 
january 2015 by cshalizi
[1405.0058] Underestimating extreme events in power-law behavior due to machine-dependent cutoffs
"Power-law distributions are typical macroscopic features occurring in almost all complex systems observable in nature. As a result, researchers in quantitative analyses must often generate random synthetic variates obeying power-law distributions. The task is usually performed through standard methods that map uniform random variates into the desired probability space. Whereas all these algorithms are theoretically solid, in this paper we show that they are subject to severe machine-dependent limitations. As a result, two dramatic consequences arise: (i) the sampling in the tail of the distribution is not random but deterministic; (ii) the moments of the sample distribution, which are theoretically expected to diverge as functions of the sample sizes, converge instead to finite values. We provide quantitative indications for the range of distribution parameters that can be safely handled by standard libraries used in computational analyses. Whereas our findings indicate possible reinterpretations of numerical results obtained through flawed sampling methodologies, they also pave the way for the search for a concrete solution to this central issue shared by all quantitative sciences dealing with complexity."
to:NB  to_read  heavy_tails  approximation  computational_statistics  have_skimmed 
january 2015 by cshalizi
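--- Easy to see the effect directly. With standard inverse-transform sampling in double precision, the uniform variate can never exceed 1 - 2^-53, so the sampler has a hard machine-dependent ceiling no matter how many draws you take; a quick numpy check (mine):

```python
import numpy as np

# Inverse-transform sampling for a power law p(x) proportional to
# x^(-alpha) on x >= xmin:  x = xmin * (1 - u)^(-1 / (alpha - 1)).
alpha, xmin = 2.5, 1.0
rng = np.random.default_rng(0)
u = rng.random(10**6)             # doubles on a 2^-53 grid in [0, 1)
x = xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))

# Since u <= 1 - 2^-53, the sampler can never exceed this hard ceiling,
# even though the distribution itself has unbounded support.
eps = 2.0 ** -53
hard_max = xmin * eps ** (-1.0 / (alpha - 1.0))
```

So the extreme tail is deterministic rather than random, and sample moments that should diverge with sample size instead converge, which is the paper's point.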
[1406.5986] A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares
"We consider statistical aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. Prior work has typically adopted an \emph{algorithmic perspective}, in that it has made no statistical assumptions on the input X and Y, and instead it has assumed that the data (X,Y) are fixed and worst-case. In this paper, we adopt a \emph{statistical perspective}, and we consider the mean-squared error performance of randomized sketching algorithms, when data (X,Y) are generated according to a statistical linear model Y=Xβ+ϵ, where ϵ is a noise process. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical predicition efficiency (SPE) and the statistical residual efficiency (SRE) of the sketched LS estimator; and we use our framework to provide results for several types of random projection and random sampling sketching algorithms. Among other results, we show that the SRE can be bounded when p≲r≪n but that the SPE typically requires the sample size r to be substantially larger. Our theoretical results reveal that, depending on the specifics of the situation, leverage-based sampling methods can perform as well as or better than projection methods. Our empirical results reveal that when r is only slightly greater than p and much less than n, projection-based methods out-perform sampling-based methods, but as r grows, sampling methods start to out-perform projection methods."
to:NB  computational_statistics  regression  linear_regression  random_projections  statistics 
january 2015 by cshalizi
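--- A minimal numpy sketch of the setup (mine, not the authors' code): compress (X, y) with a Gaussian random projection and solve the small r-row least-squares problem; the sketched solution tracks the full one at a fraction of the rows, with the inflated variance the statistical perspective quantifies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 5000, 10, 500           # n rows compressed down to r
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Full least squares on all n rows.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian sketch: solve the compressed problem (S X, S y) instead.
S = rng.normal(size=(r, n)) / np.sqrt(r)
beta_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
```

(A dense Gaussian sketch is the simplest choice for illustration; the fast transforms and leverage-score sampling schemes the paper analyzes avoid the O(rn) cost of forming `S` explicitly.)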
[1409.3821] Computational Implications of Reducing Data to Sufficient Statistics
"Given a large dataset and an estimation task, it is common to pre-process the data by reducing them to a set of sufficient statistics. This step is often regarded as straightforward and advantageous (in that it simplifies statistical analysis). I show that -on the contrary- reducing data to sufficient statistics can change a computationally tractable estimation problem into an intractable one. I discuss connections with recent work in theoretical computer science, and implications for some techniques to estimate graphical models."

to:NB  to_read  sufficiency  computational_statistics  theoretical_computer_science  graphical_models  statistics 
january 2015 by cshalizi
[1409.2638] Magging: maximin aggregation for inhomogeneous large-scale data
"Large-scale data analysis poses both statistical and computational problems which need to be addressed simultaneously. A solution is often straightforward if the data are homogeneous: one can use classical ideas of subsampling and mean aggregation to get a computationally efficient solution with acceptable statistical accuracy, where the aggregation step simply averages the results obtained on distinct subsets of the data. However, if the data exhibit inhomogeneities (and typically they do), the same approach will be inadequate, as it will be unduly influenced by effects that are not persistent across all the data due to, for example, outliers or time-varying effects. We show that a tweak to the aggregation step can produce an estimator of effects which are common to all data, and hence interesting for interpretation and often leading to better prediction than pooled effects."
to:NB  ensemble_methods  computational_statistics  statistics  prediction  buhlmann.peter 
january 2015 by cshalizi
[1412.3779] Biips: Software for Bayesian Inference with Interacting Particle Systems
"Biips is a software platform for automatic Bayesian inference with interacting particle systems. Biips allows users to define their statistical model in the probabilistic programming BUGS language, as well as to add custom functions or samplers within this language. Then it runs sequential Monte Carlo based algorithms (particle filters, particle independent Metropolis-Hastings, particle marginal Metropolis-Hastings) in a black-box manner so that to approximate the posterior distribution of interest as well as the marginal likelihood. The software is developed in C++ with interfaces with the softwares R, Matlab and Octave."
to:NB  monte_carlo  particle_filters  computational_statistics  statistics  del_moral.pierre 
january 2015 by cshalizi
[1409.5827] Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones
"The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly-applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations."
in_NB  to_read  distributed_systems  computational_statistics  statistics  matloff.norm  to_teach:statcomp 
january 2015 by cshalizi
[1407.2724] On the Optimality of Averaging in Distributed Statistical Learning
"A common approach to statistical learning on big data is to randomly distribute it among m machines and calculate the parameter of interest by merging their m individual estimates. Two key questions related to this approach are: What is the optimal aggregation procedure, and what is the accuracy loss in comparison to centralized computation. We make several contributions to these questions, under the general framework of empirical risk minimization, a.k.a. M-estimation. As data is abundant, we assume the number of samples per machine, n, is large and study two asymptotic settings: one where n→∞ but the number of estimated parameters p is fixed, and a second high-dimensional case where both p,n→∞ with p/n→κ∈(0,1). Our main results include asymptotically exact expressions for the loss incurred by splitting the data, where only bounds were previously available. These are derived independently of the learning algorithm. Consequently, under suitable assumptions in the fixed-p setting, averaging is {\em first-order equivalent} to a centralized solution, and thus inherits statistical properties like efficiency and robustness. In the high-dimension setting, studied here for the first time in the context of parallelization, a qualitatively different behaviour appears. Parallelized computation generally incurs an accuracy loss, for which we derive a simple approximate formula. We conclude with several practical implications of our results."
to:NB  distributed_systems  computational_statistics  high-dimensional_statistics  statistics 
january 2015 by cshalizi
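--- The split-and-average scheme in a few lines (my sketch), for OLS, where the fixed-p first-order equivalence to the centralized solution is easy to see numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 20_000, 5, 10           # n samples, p parameters, m machines
X = rng.normal(size=(n, p))
beta = np.arange(1.0, p + 1.0)
y = X @ beta + rng.normal(size=n)

# Centralized OLS on all the data.
beta_central, *_ = np.linalg.lstsq(X, y, rcond=None)

# Distributed version: fit OLS on each of m chunks, then average.
chunks = zip(np.array_split(X, m), np.array_split(y, m))
beta_avg = np.mean([np.linalg.lstsq(Xi, yi, rcond=None)[0]
                    for Xi, yi in chunks], axis=0)
```

With n/m samples per machine and p fixed, the averaged estimator agrees with the centralized one to higher order; the paper's high-dimensional regime p/n -> kappa is where the accuracy loss appears.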
[1412.1735] Speeding up bootstrap computations: a vectorized implementation for statistics based on sample moments
"In this note we propose a vectorized implementation of the non-parametric bootstrap for statistics based on sample moments. Basically, we adopt the multinomial sampling formulation of the non-parametric bootstrap, and compute bootstrap replications of sample moment statistics by simply weighting the observed data according to multinomial counts, instead of evaluating the statistic on a re-sampled version of the observed data. Using this formulation we can generate a matrix of bootstrap weights and compute the entire vector of bootstrap replications with a few matrix multiplications. Vectorization is particularly important for matrix-oriented programming languages such as R, where matrix/vector calculations tend to be faster than scalar operations implemented in a loop. We illustrate the gain in computational speed achieved by the vectorized implementation in real and simulated data sets, when bootstrapping Pearson's sample correlation coefficient."

--- Interesting, but presumes that weighting is enough to make things work.
to:NB  bootstrap  computational_statistics  statistics 
january 2015 by cshalizi
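--- The multinomial-weighting trick is worth remembering: for a statistic like the sample mean, it reduces the whole bootstrap to one matrix-vector product (my sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=500)
B = 4000                          # number of bootstrap replications

# A bootstrap resample is equivalent to reweighting each observation by
# its multinomial count / n, so the matrix of bootstrap weights W (B x n)
# turns all B replications of the sample mean into a single product.
n = len(x)
W = rng.multinomial(n, np.full(n, 1.0 / n), size=B) / n
boot_means = W @ x                # B bootstrap replications of the mean
se_boot = boot_means.std(ddof=1)

# Compare to the plug-in standard error of the mean.
se_plugin = x.std(ddof=1) / np.sqrt(n)
```

The note's caveat applies: this only works for statistics that can be written as functions of weighted sample moments, which the sample mean trivially is.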
[1411.0306] Fast Randomized Kernel Methods With Statistical Guarantees
"One approach to improving the running time of kernel-based machine learning methods is to build a small sketch of the input and use it in lieu of the full kernel matrix in the machine learning task of interest. Here, we describe a version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance. By extending the notion of \emph{statistical leverage scores} to the setting of kernel ridge regression, our main statistical result is to identify an importance sampling distribution that reduces the size of the sketch (i.e., the required number of columns to be sampled) to the \emph{effective dimensionality} of the problem. This quantity is often much smaller than previous bounds that depend on the \emph{maximal degrees of freedom}. Our main algorithmic result is to present a fast algorithm to compute approximations to these scores. This algorithm runs in time that is linear in the number of samples---more precisely, the running time is O(np2), where the parameter p depends only on the trace of the kernel matrix and the regularization parameter---and it can be applied to the matrix of feature vectors, without having to form the full kernel matrix. This is obtained via a variant of length-squared sampling that we adapt to the kernel setting in a way that is of independent interest. Lastly, we provide empirical results illustrating our theory, and we discuss how this new notion of the statistical leverage of a data point captures in a fine way the difficulty of the original statistical learning problem."
to:NB  kernel_methods  low-rank_approximation  randomized_algorithms  computational_statistics  statistics 
january 2015 by cshalizi
[1501.01265] The ABC of Simulation Estimation with Auxiliary Statistics
"This paper provides a synthesis of the simulation based minimum distance estimators used in economics with the method of ABC (Approximate Bayesian Computation) used in other disciplines. While the two strands of work are seemingly related, the relation between them is not well understood beyond the fact that they all replace the likelihood by auxiliary statistics that are informative about the data. We connect the two methods by using a reverse sampler to engineer the ABC posterior distribution from a sequence of simulated minimum distance estimates. Focusing on the exactly identified case, we find that Bayesian and frequentist estimators have different bias properties for an arbitrary choice of prior. The difference can be traced to whether we match the sample auxiliary statistics to each or to the average of the simulated ones. In principle, the likelihood-free Bayesian estimators can completely eliminate the order 1/T bias with suitable choice of the prior. But under a specific relation between the structural parameters and auxiliary statistics, the frequentist simulated distance estimators are automatically second order unbiased. These differences are illustrated using an analytical example and a simulation study of the dynamic panel model."
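For concreteness, the basic likelihood-free move both literatures share — keep parameter draws whose simulated auxiliary statistic matches the observed one — looks like this. (A generic rejection-ABC sketch, not the paper's reverse sampler; the function names are mine.)

```python
import numpy as np

def abc_rejection(y_obs, prior_draw, simulate, stat, eps, n_prop, rng):
    """Accept-reject ABC with a scalar auxiliary statistic.

    Keeps prior draws theta whose simulated statistic lands within eps
    of the observed one -- the likelihood-free analogue of matching
    auxiliary statistics in simulated minimum distance estimation.
    """
    s_obs = stat(y_obs)
    accepted = []
    for _ in range(n_prop):
        theta = prior_draw(rng)
        s_sim = stat(simulate(theta, rng))
        if abs(s_sim - s_obs) < eps:
            accepted.append(theta)
    return np.array(accepted)
```

In the exactly identified case the auxiliary statistic pins down the parameter, and the accepted draws concentrate around the minimum-distance estimate as eps shrinks — which is exactly where the paper's bias comparison between the two camps takes place.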
in_NB  indirect_inference  approximate_bayesian_computation  statistics  computational_statistics  simulation  re:stacs  to_read 
january 2015 by cshalizi
[1501.03326] Unbiased Bayes for Big Data: Paths of Partial Posteriors
"Bayesian inference proceeds based on expectations of certain functions with respect to the posterior. Markov Chain Monte Carlo is a fundamental tool to compute these expectations. However, its feasibility is being challenged in the era of so called Big Data as all data needs to be processed in every iteration. Realising that such simulation is an unnecessarily hard problem if the goal is estimation, we construct a computationally scalable methodology that allows unbiased estimation of the required expectations -- without explicit simulation from the full posterior. The average computational complexity of our scheme is sub-linear in the size of the dataset and its variance is straightforward to control, leading to algorithms that are provably unbiased and naturally arrive at a desired error tolerance. We demonstrate the utility and generality of the methodology on a range of common statistical models applied to large scale benchmark datasets."
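The debiasing device behind such schemes is a random-truncation (telescoping) estimator: write the full-posterior expectation as a limit of partial-posterior ones, evaluate a randomly chosen increment, and reweight it by its selection probability. A toy version on a deterministic sequence — this shows the construction, not the paper's estimator, and the names are mine:

```python
import numpy as np

def single_term_estimate(phi, rng, p=0.5):
    """Unbiased estimate of lim_{t->inf} phi(t) from one random truncation.

    Telescoping: lim phi = phi(0) + sum_t [phi(t) - phi(t-1)].  Drawing
    T ~ Geometric(p) and importance-weighting the T-th increment by
    1 / P(T = t) makes the expectation equal the whole sum, without
    ever computing the limit (i.e., the full posterior).
    """
    t = int(rng.geometric(p))                   # T in {1, 2, 3, ...}
    prob_t = p * (1.0 - p) ** (t - 1)           # P(T = t)
    return phi(0) + (phi(t) - phi(t - 1)) / prob_t
```

The art, as in the paper, is choosing the truncation distribution so that the variance stays finite while the expected amount of data touched per estimate stays sub-linear.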
to:NB  monte_carlo  bayesianism  computational_statistics  statistics 
january 2015 by cshalizi
A general construction for parallelizing Metropolis–Hastings algorithms
"Markov chain Monte Carlo methods (MCMC) are essential tools for solving many modern-day statistical and computational problems; however, a major limitation is the inherently sequential nature of these algorithms. In this paper, we propose a natural generalization of the Metropolis–Hastings algorithm that allows for parallelizing a single chain using existing MCMC methods. We do so by proposing multiple points in parallel, then constructing and sampling from a finite-state Markov chain on the proposed points such that the overall procedure has the correct target density as its stationary distribution. Our approach is generally applicable and straightforward to implement. We demonstrate how this construction may be used to greatly increase the computational speed and statistical efficiency of a variety of existing MCMC methods, including Metropolis-Adjusted Langevin Algorithms and Adaptive MCMC. Furthermore, we show how it allows for a principled way of using every integration step within Hamiltonian Monte Carlo methods; our approach increases robustness to the choice of algorithmic parameters and results in increased accuracy of Monte Carlo estimates with little extra computational cost."
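The flavor of proposing many points in parallel and then selecting among them is easiest to see in the classical multiple-try Metropolis of Liu, Liang and Wong — a cousin of, not the same as, the finite-state construction in this paper. My sketch below assumes a symmetric Gaussian proposal and a 1-d target:

```python
import numpy as np

def _logsumexp(a):
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

def mtm_step(x, log_pi, step, k, rng):
    """One multiple-try Metropolis step (symmetric proposal, 1-d target).

    Propose k points in parallel, pick one with weight pi(y_j), then
    balance with k-1 reverse proposals so the target stays invariant.
    """
    ys = x + step * rng.standard_normal(k)       # k forward proposals
    ly = log_pi(ys)
    w = np.exp(ly - ly.max())
    y = ys[rng.choice(k, p=w / w.sum())]         # select one by weight
    # Reverse-proposal set from the selected point; the last entry is x.
    xs = np.append(y + step * rng.standard_normal(k - 1), x)
    lx = log_pi(xs)
    if np.log(rng.uniform()) < _logsumexp(ly) - _logsumexp(lx):
        return y                                 # accept the selected point
    return x
```

The k density evaluations in each batch are embarrassingly parallel, which is the whole point of such constructions; the paper's scheme additionally recycles *all* proposed points through the finite-state chain rather than keeping at most one.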
to:NB  to_read  monte_carlo  parallelism  computational_statistics  statistics 
december 2014 by cshalizi