cshalizi + approximation   55

[1711.10414] When are epsilon-nets small?
"In many interesting situations the size of epsilon-nets depends only on ϵ together with different complexity measures. The aim of this paper is to give a systematic treatment of such complexity measures arising in Discrete and Computational Geometry and Statistical Learning, and to bridge the gap between the results appearing in these two fields. As a byproduct, we obtain several new upper bounds on the sizes of epsilon-nets that generalize/improve the best known general guarantees. In particular, our results work with regimes when small epsilon-nets of size o(1ϵ) exist, which are not usually covered by standard upper bounds. Inspired by results in Statistical Learning we also give a short proof of the Haussler's upper bound on packing numbers."
to:NB  learning_theory  approximation
13 days ago by cshalizi
[1906.09282] Robustness of Dynamical Quantities of Interest via Goal-Oriented Information Theory
"Variational-principle-based methods that relate expectations of a quantity of interest to information-theoretic divergences have proven to be effective tools for obtaining distributional robustness bounds under both parametric and non-parametric model-form uncertainty. Here, we extend these ideas to utilize information divergences that are specifically targeted at the quantity-of-interest being investigated. Due to their goal-oriented nature, and when combined with the data processing inequality, the resulting robustness bounds are tighter and a wider class of problems can be treated. We find the method especially useful for problems involving unbounded time-horizons, a case where established information-theoretic methods typically result in trivial (infinite) bounds. Problem types that can be treated within this framework include robustness bounds on the expected value and distribution of a stopping time, time averages, and exponentially discounted quantities. We illustrate these results with several examples, including option pricing, stochastic control, semi-Markov queueing models, and expectations and distributions of hitting times."
to:NB  information_theory  probability  stochastic_processes  approximation
16 days ago by cshalizi
[1906.05746] Nonlinear System Identification via Tensor Completion
"Function approximation from input and output data pairs constitutes a fundamental problem in supervised learning. Deep neural networks are currently the most popular method for learning to mimic the input-output relationship of a generic nonlinear system, as they have proven to be very effective in approximating complex highly nonlinear functions. In this work, we propose low-rank tensor completion as an appealing alternative for modeling and learning complex nonlinear systems. We model the interactions between the N input variables and the scalar output of a system by a single N-way tensor, and setup a weighted low-rank tensor completion problem with smoothness regularization which we tackle using a block coordinate descent algorithm. We extend our method to the multi-output setting and the case of partially observed data, which cannot be readily handled by neural networks. Finally, we demonstrate the effectiveness of the approach using several regression tasks including some standard benchmarks and a challenging student grade prediction task."
to:NB  approximation  tensors  regression  computational_statistics  statistics
9 weeks ago by cshalizi
[1905.12122] Deep Learning Moment Closure Approximations using Dynamic Boltzmann Distributions
"The moments of spatial probabilistic systems are often given by an infinite hierarchy of coupled differential equations. Moment closure methods are used to approximate a subset of low order moments by terminating the hierarchy at some order and replacing higher order terms with functions of lower order ones. For a given system, it is not known beforehand which closure approximation is optimal, i.e. which higher order terms are relevant in the current regime. Further, the generalization of such approximations is typically poor, as higher order corrections may become relevant over long timescales. We have developed a method to learn moment closure approximations directly from data using dynamic Boltzmann distributions (DBDs). The dynamics of the distribution are parameterized using basis functions from finite element methods, such that the approach can be applied without knowing the true dynamics of the system under consideration. We use the hierarchical architecture of deep Boltzmann machines (DBMs) with multinomial latent variables to learn closure approximations for progressively higher order spatial correlations. The learning algorithm uses a centering transformation, allowing the dynamic DBM to be trained without the need for pre-training. We demonstrate the method for a Lotka-Volterra system on a lattice, a typical example in spatial chemical reaction networks. The approach can be applied broadly to learn deep generative models in applications where infinite systems of differential equations arise."
to:NB  approximation  stochastic_processes  mjolsness.eric  moment_closures
11 weeks ago by cshalizi
[1905.10409] Greedy Shallow Networks: A New Approach for Constructing and Training Neural Networks
"We present a novel greedy approach to obtain a single layer neural network approximation to a target function with the use of a ReLU activation function. In our approach we construct a shallow network by utilizing a greedy algorithm where the set of possible inner weights acts as a parametrization of the prescribed dictionary. To facilitate the greedy selection we employ an integral representation of the network, based on the ridgelet transform, that significantly reduces the cardinality of the dictionary and hence promotes feasibility of the proposed method. Our approach allows for the construction of efficient architectures which can be treated either as improved initializations to be used in place of random-based alternatives, or as fully-trained networks, thus potentially nullifying the need for training and/or calibrating based on backpropagation. Numerical experiments demonstrate the tenability of the proposed concept and its advantages compared to the classical techniques for training and constructing neural networks."

--- After reading, compare with the algorithms for shallow networks at the end of Anthony & Bartlett's book on neural network learning from the 1990s.
to:NB  learning_theory  approximation  neural_networks  your_favorite_deep_neural_network_sucks
12 weeks ago by cshalizi
[1904.00687] On the Power and Limitations of Random Features for Understanding Neural Networks
"Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks. We then argue that despite the impressive positive results, random feature approaches are also inherently limited in what they can explain. In particular, we rigorously show that random features cannot be used to learn even a single ReLU neuron with standard Gaussian inputs, unless the network size (or magnitude of the weights) is exponentially large. Since a single neuron is learnable with gradient-based methods, we conclude that we are still far from a satisfying general explanation for the empirical success of neural networks."
to:NB  neural_networks  random_projections  learning_theory  optimization  approximation  your_favorite_deep_neural_network_sucks
may 2019 by cshalizi
A Spline Theory of Deep Learning
"We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal with each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. As an application, we develop and validate a new distance metric for signals that quantifies the difference between their partition encodings."
to:NB  to_read  approximation  splines  neural_networks  machine_learning  your_favorite_deep_neural_network_sucks  via:csantos
april 2019 by cshalizi
[1904.03673] Every Local Minimum is a Global Minimum of an Induced Model
"For non-convex optimization in machine learning, this paper proves that every local minimum achieves the global optimality of the perturbable gradient basis model at any differentiable point. As a result, non-convex machine learning is theoretically as supported as convex machine learning with a hand-crafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the hand-crafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this paper improves or complements several state-of-the-art theoretical results in the literature with a simple and unified proof technique."
to:NB  optimization  approximation  neural_networks  kaelbling.leslie_pack
april 2019 by cshalizi
[1904.02505] A deterministic and computable Bernstein-von Mises theorem
"Bernstein-von Mises results (BvM) establish that the Laplace approximation is asymptotically correct in the large-data limit. However, these results are inappropriate for computational purposes since they only hold over most, and not all, datasets and involve hard-to-estimate constants. In this article, I present a new BvM theorem which bounds the Kullback-Leibler (KL) divergence between a fixed log-concave density f(θ) and its Laplace approximation. The bound goes to 0 as the higher-derivatives of f(θ) tend to 0 and f(θ) becomes increasingly Gaussian. The classical BvM theorem in the IID large-data asymptote is recovered as a corollary.
"Critically, this theorem further suggests a number of computable approximations of the KL divergence with the most promising being:
KL(gLAP,f)≈12Varθ∼g(θ)(log[f(θ)]−log[gLAP(θ)])
An empirical investigation of these bounds in the logistic classification model reveals that these approximations are great surrogates for the KL divergence. This result, and future results of a similar nature, could provide a path towards rigorously controlling the error due to the Laplace approximation and more modern approximation methods."
to;NB  statistics  approximation  laplace_approximation  probability
april 2019 by cshalizi
The Lasserre Hierarch in Approximation Algorithms
"The Lasserre hierarchy is a systematic procedure to strengthen a relaxation for
the last years this hierarchy moved into the focus of researchers in approximation algorithms as the obtain relaxations have provably nice properties. In particular on the t -th level, the relaxation can be solved in time n^O(t) and every constraint that one could derive from looking just at t variables is automatically satisfied. Additionally, it provides a vector embedding of events so that probabilities are expressable as inner products.
"The goal of these lecture notes is to give short but rigorous proofs of all key properties of the Lasserre hierarchy. In the second part we will demonstrate how the Lasserre SDP can be applied to (mostly NP-hard) optimization problems such as KNAPSACK, MATCHING, MAXCUT (in general and in dense graphs), 3-COLORING and SETCOVER."

--- I remember Cris trying to explain this to me once...
april 2019 by cshalizi
[1612.07545] A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search
"Approximate Nearest Neighbor Search (ANNS) is a fundamental problem in many areas of machine learning and data mining. During the past decade, numerous hashing algorithms are proposed to solve this problem. Every proposed algorithm claims outperform other state-of-the-art hashing methods. However, the evaluation of these hashing papers was not thorough enough, and those claims should be re-examined. The ultimate goal of an ANNS method is returning the most accurate answers (nearest neighbors) in the shortest time. If implemented correctly, almost all the hashing methods will have their performance improved as the code length increases. However, many existing hashing papers only report the performance with the code length shorter than 128. In this paper, we carefully revisit the problem of search with a hash index, and analyze the pros and cons of two popular hash index search procedures. Then we proposed a very simple but effective two level index structures and make a thorough comparison of eleven popular hashing algorithms. Surprisingly, the random-projection-based Locality Sensitive Hashing (LSH) is the best performed algorithm, which is in contradiction to the claims in all the other ten hashing papers. Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the codes used in the paper are released on GitHub, which can be used as a testing platform for a fair comparison between various hashing algorithms."
to:NB  data_mining  approximation  nearest_neighbors  locality-sensitive_hashing  hashing  have_read  via:vaguery  random_projections  k-means  databases
january 2019 by cshalizi
Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks
"The performance of neural networks on high-dimensional data distributions suggests that it may be possible to parameterize a representation of a given high-dimensional function with controllably small errors, potentially outperforming standard interpolation methods. We demonstrate, both theoretically and numerically, that this is indeed the case. We map the parameters of a neural network to a system of particles relaxing with an interaction potential determined by the loss function. We show that in the limit that the number of parameters n is large, the landscape of the mean-squared error becomes convex and the representation error in the function scales as O(n−1). In this limit, we prove a dynamical variant of the universal approximation theorem showing that the optimal representation can be attained by stochastic gradient descent, the algorithm ubiquitously used for parameter optimization in machine learning. In the asymptotic regime, we study the fluctuations around the optimal representation and show that they arise at a scale O(n−1). These fluctuations in the landscape identify the natural scale for the noise in stochastic gradient descent. Our results apply to both single and multi-layer neural networks, as well as standard kernel methods like radial basis functions."

!!!
to:NB  to_read  neural_networks  approximation  learning_theory  via:rvenkat
december 2018 by cshalizi
Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
"Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. What are you doing?'' asked Minsky. I am training a randomly wired neural net to play tic-tac-toe,'' Sussman replied. Why is the net wired randomly?'' asked Minsky. Sussman replied, I do not want it to have any preconceptions of how to play.'' Minsky then shut his eyes. Why do you close your eyes?'' Sussman asked his teacher. So that the room will be empty,'' replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities."

--- Have I never bookmarked this before?
in_NB  approximation  kernel_methods  random_projections  statistics  prediction  classifiers  rahimi.ali  recht.benjamin  machine_learning  have_read
september 2018 by cshalizi
[1806.06850] Polynomial Regression As an Alternative to Neural Nets
"Despite the success of neural networks (NNs), there is still a concern among many over their "black box" nature. Why do they work? Here we present a simple analytic argument that NNs are in fact essentially polynomial regression models. This view will have various implications for NNs, e.g. providing an explanation for why convergence problems arise in NNs, and it gives rough guidance on avoiding overfitting. In addition, we use this phenomenon to predict and confirm a multicollinearity property of NNs not previously reported in the literature. Most importantly, given this loose correspondence, one may choose to routinely use polynomial models instead of NNs, thus avoiding some major problems of the latter, such as having to set many tuning parameters and dealing with convergence issues. We present a number of empirical results; in each case, the accuracy of the polynomial approach matches or exceeds that of NN approaches. A many-featured, open-source software package, polyreg, is available."

--- Matloff is the author of my favorite "R programming for n00bs" textbook...
--- ETA after reading: the argument that multi-layer neural networks "are essentially" polynomial regression is a bit weak. It would be true, exactly, if activation functions were exactly polynomial, which however they rarely are in practice. If non-polynomial activations happen to be implemented in computational practice by polynomials (e.g., Taylor approximations), well, either we get different hardware or we crank up the degree of approximation as much as we like. (Said a little differently, if you buy this line of argument, you should buy that _every_ smooth statistical model "is essentially" polynomial regression, which seems a bit much.) It is, also, an argument about the function-approximation properties of the model classes, and not the fitting processes, despite the explicit disclaimers.
to:NB  your_favorite_deep_neural_network_sucks  regression  neural_networks  statistics  matloff.norman  approximation  computational_statistics  have_read
july 2018 by cshalizi
Multivariate approximation | Numerical analysis | Cambridge University Press
"This self-contained, systematic treatment of multivariate approximation begins with classical linear approximation, and moves on to contemporary nonlinear approximation. It covers substantial new developments in the linear approximation theory of classes with mixed smoothness, and shows how it is directly related to deep problems in other areas of mathematics. For example, numerical integration of these classes is closely related to discrepancy theory and to nonlinear approximation with respect to special redundant dictionaries, and estimates of the entropy numbers of classes with mixed smoothness are closely related to (in some cases equivalent to) the Small Ball Problem from probability theory. The useful background material included in the book makes it accessible to graduate students. Researchers will find that the many open problems in the theory outlined in the book provide helpful directions and guidance for their own research in this exciting and active area."
to:NB  approximation  mathematics  books:noted
june 2018 by cshalizi
One Parameter Is Always Enough
This is very cute, and deserves some attention.
Being me, I'll grump a bit. As he says at the end, this is just an elaboration of the point that we get a class of _infinite_ VC dimension by thresholding sin(kx), i.e., we can correctly classify any arbitrarily large set of points by tweaking k appropriately. Since this was known in the 1970s (proved by V&C themselves, if memory serves), it's been kind of insane that people continue to count parameters and pat themselves on the back about Occam. (See http://bactra.org/weblog/921.html et seq. for more.) Still, the elephant is nice.
model_selection  approximation  dynamical_systems  chaos  have_read  to:blog
may 2018 by cshalizi
As If — Kwame Anthony Appiah | Harvard University Press
"Idealization is a fundamental feature of human thought. We build simplified models in our scientific research and utopias in our political imaginations. Concepts like belief, desire, reason, and justice are bound up with idealizations and ideals. Life is a constant adjustment between the models we make and the realities we encounter. In idealizing, we proceed “as if” our representations were true, while knowing they are not. This is not a dangerous or distracting occupation, Kwame Anthony Appiah shows. Our best chance of understanding nature, society, and ourselves is to open our minds to a plurality of imperfect depictions that together allow us to manage and interpret our world.
"The philosopher Hans Vaihinger first delineated the “as if” impulse at the turn of the twentieth century, drawing on Kant, who argued that rational agency required us to act as if we were free. Appiah extends this strategy to examples across philosophy and the human and natural sciences. In a broad range of activities, we have some notion of the truth yet continue with theories that we recognize are, strictly speaking, false. From this vantage point, Appiah demonstrates that a picture one knows to be unreal can be a vehicle for accessing reality.
"As If explores how strategic untruth plays a critical role in far-flung areas of inquiry: decision theory, psychology, natural science, and political philosophy. A polymath who writes with mainstream clarity, Appiah defends the centrality of the imagination not just in the arts but in science, morality, and everyday life."
in_NB  books:noted  philosophy  epistemology  philosophy_of_science  approximation  modeling  idealization
september 2017 by cshalizi
A Novel Algorithm for Coarse-Graining of Cellular Automata - Springer
"The coarse-graining is an approximation procedure widely used for simplification of mathematical and numerical models of multiscale systems. It reduces superfluous – microscopic – degrees of freedom. Israeli and Goldenfeld demonstrated in [1,2] that the coarse-graining can be employed for elementary cellular automata (CA), producing interesting interdependences between them. However, extending their investigation on more complex CA rules appeared to be impossible due to the high computational complexity of the coarse-graining algorithm. We demonstrate here that this complexity can be substantially decreased. It allows for scrutinizing much broader class of cellular automata in terms of their coarse graining. By using our algorithm we found out that the ratio of the numbers of elementary CAs having coarse grained representation to “degenerate” – irreducible – cellular automata, strongly increases with increasing the “grain” size of the approximation procedure. This rises principal questions about the formal limits in modeling of realistic multiscale systems."
to:NB  macro_from_micro  approximation  cellular_automata  via:?
december 2016 by cshalizi
[1611.06168] On $p$-values
"Models are consistently treated as approximations and all procedures are consistent with this. They do not treat the model as being true. In this context p-values are one measure of approximation, a small p-value indicating a poor approximation. Approximation regions are defined and distinguished from confidence regions."
to:NB  statistics  misspecification  approximation  p-values  hypothesis_testing  via:vaguery
december 2016 by cshalizi
A Universal Approximation Theorem for Mixture-of-Experts Models | Neural Computation | MIT Press Journals
"The mixture-of-experts (MoE) model is a popular neural network architecture for nonlinear regression and classification. The class of MoE mean functions is known to be uniformly convergent to any unknown target function, assuming that the target function is from a Sobolev space that is sufficiently differentiable and that the domain of estimation is a compact unit hypercube. We provide an alternative result, which shows that the class of MoE mean functions is dense in the class of all continuous functions over arbitrary compact domains of estimation. Our result can be viewed as a universal approximation theorem for MoE models. The theorem we present allows MoE users to be confident in applying such models for estimation when data arise from nonlinear and nondifferentiable generative processes."
to:NB  ensemble_methods  statistics  graphical_models  approximation
november 2016 by cshalizi
[1508.05906] Chaining, Interpolation, and Convexity
"We show that classical chaining bounds on the suprema of random processes in terms of entropy numbers can be systematically improved when the underlying set is convex: the entropy numbers need not be computed for the entire set, but only for certain "thin" subsets. This phenomenon arises from the observation that real interpolation can be used as a natural chaining mechanism. Unlike the general form of Talagrand's generic chaining method, which is sharp but often difficult to use, the resulting bounds involve only entropy numbers but are nonetheless sharp in many situations in which classical entropy bounds are suboptimal. Such bounds are readily amenable to explicit computations in specific examples, and we discover some old and new geometric principles for the control of chaining functionals as special cases."
in_NB  empirical_processes  learning_theory  approximation  convexity  functional_analysis  van_handel.ramon
september 2016 by cshalizi
Approximation Methods in Probability Theory | Vydas Čekanavičius | Springer
"This book presents a wide range of well-known and less common methods used for estimating the accuracy of probabilistic approximations, including the Esseen type inversion formulas, the Stein method as well as the methods of convolutions and triangle function. Emphasising the correct usage of the methods presented, each step required for the proofs is examined in detail. As a result, this textbook provides valuable tools for proving approximation theorems.
"While Approximation Methods in Probability Theory will appeal to everyone interested in limit theorems of probability theory, the book is particularly aimed at graduate students who have completed a standard intermediate course in probability theory. Furthermore, experienced researchers wanting to enlarge their toolkit will also find this book useful."
to:NB  probability  approximation  mathematics  convergence_of_stochastic_processes
july 2016 by cshalizi
A Single Hidden Layer Feedforward Network with Only One Neuron in the Hidden Layer Can Approximate Any Univariate Function
"The possibility of approximating a continuous function on a compact subset of the real line by a feedforward single hidden layer neural network with a sigmoidal activation function has been studied in many papers. Such networks can approximate an arbitrary continuous function provided that an unlimited number of neurons in a hidden layer is permitted. In this note, we consider constructive approximation on any finite interval of $\mathbb{R}$ by neural networks with only one neuron in the hidden layer. We construct algorithmically a smooth, sigmoidal, almost monotone activation function $\sigma$ providing approximation to an arbitrary continuous function within any degree of accuracy. This algorithm is implemented in a computer program, which computes the value of $\sigma$ at any reasonable point of the real axis."
june 2016 by cshalizi
[1601.00670] Variational Inference: A Review for Statisticians
"One of the core problems of modern statistics is to approximate difficult-to-compute probability distributions. This problem is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation about the posterior. In this paper, we review variational inference (VI), a method from machine learning that approximates probability distributions through optimization. VI has been used in myriad applications and tends to be faster than classical methods, such as Markov chain Monte Carlo sampling. The idea behind VI is to first posit a family of distributions and then to find the member of that family which is close to the target. Closeness is measured by Kullback-Leibler divergence. We review the ideas behind mean-field variational inference, discuss the special case of VI applied to exponential family models, present a full example with a Bayesian mixture of Gaussians, and derive a variant that uses stochastic optimization to scale up to massive data. We discuss modern research in VI and highlight important open problems. VI is powerful, but it is not yet well understood. Our hope in writing this paper is to catalyze statistical research on this widely-used class of algorithms."
to:NB  variational_inference  statistics  computational_statistics  approximation  blei.david
february 2016 by cshalizi
Importance Sampling Over Sets: A New Probabilistic Inference Scheme
"Computing expectations in high-dimensional spaces is a key challenge in probabilistic infer- ence and machine learning. Monte Carlo sam- pling, and importance sampling in particular, is one of the leading approaches. We propose a generalized importance sampling scheme based on randomly selecting (exponentially large) sub- sets of states rather than individual ones. By col- lecting a small number of extreme states in the sampled sets, we obtain estimates of statistics of interest, such as the partition function of an undirected graphical model. We incorporate this idea into a novel maximum likelihood learning algorithm based on cutting planes. We demon- strate empirically that our scheme provides accu- rate answers and scales to problems with up to a million variables."

--- Applicable to the fitness-breeding scheme?
to:NB  computational_statistics  monte_carlo  statistics  approximation  re:fitness_sampling
july 2015 by cshalizi
Random Features for Large-Scale Kernel Machines
"To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shift- invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning al- gorithms applied to these features outperform state-of-the-art large-scale kernel machines."
in_NB  to_read  machine_learning  computational_statistics  approximation  random_projections  fourier_analysis  kernel_methods  statistics  re:hyperbolic_networks
july 2015 by cshalizi
[1404.1578] Models as Approximations: How Random Predictors and Model Violations Invalidate Classical Inference in Regression
"We review and interpret the early insights of Halbert White who over thirty years ago inaugurated a form of statistical inference for regression models that is asymptotically correct even under "model misspecification," that is, under the assumption that models are approximations rather than generative truths. This form of inference, which is pervasive in econometrics, relies on the "sandwich estimator" of standard error. Whereas linear models theory in statistics assumes models to be true and predictors to be fixed, White's theory permits models to be approximate and predictors to be random. Careful reading of his work shows that the deepest consequences for statistical inference arise from a synergy --- a "conspiracy" --- of nonlinearity and randomness of the predictors which invalidates the ancillarity argument that justifies conditioning on the predictors when they are random. Unlike the standard error of linear models theory, the sandwich estimator provides asymptotically correct inference in the presence of both nonlinearity and heteroskedasticity. An asymptotic comparison of the two types of standard error shows that discrepancies between them can be of arbitrary magnitude. If there exist discrepancies, standard errors from linear models theory are usually too liberal even though occasionally they can be too conservative as well. A valid alternative to the sandwich estimator is provided by the "pairs bootstrap"; in fact, the sandwich estimator can be shown to be a limiting case of the pairs bootstrap. We conclude by giving meaning to regression slopes when the linear model is an approximation rather than a truth. --- In this review we limit ourselves to linear least squares regression, but many qualitative insights hold for most forms of regression."

-- Very close to what I teach in my class, though I haven't really talked about sandwich variances.
in_NB  have_read  statistics  regression  linear_regression  bootstrap  misspecification  estimation  approximation
february 2015 by cshalizi
Dehling , Durieu , Tusche : Approximating class approach for empirical processes of dependent sequences indexed by functions
"We study weak convergence of empirical processes of dependent data (Xi)i≥0, indexed by classes of functions. Our results are especially suitable for data arising from dynamical systems and Markov chains, where the central limit theorem for partial sums of observables is commonly derived via the spectral gap technique. We are specifically interested in situations where the index class  is different from the class of functions f for which we have good properties of the observables (f(Xi))i≥0. We introduce a new bracketing number to measure the size of the index class  which fits this setting. Our results apply to the empirical process of data (Xi)i≥0 satisfying a multiple mixing condition. This includes dynamical systems and Markov chains, if the Perron–Frobenius operator or the Markov operator has a spectral gap, but also extends beyond this class, for example, to ergodic torus automorphisms."
in_NB  empirical_processes  approximation  stochastic_processes  markov_models  dynamical_systems  ergodic_theory  mixing
january 2015 by cshalizi
Fully Exponential Laplace Approximations to Expectations and Variances of Nonpositive Functions (Tierny, Kass and Kadane, 1989)
The un-numbered equation defining $A_K$, between (2.3) and (2.4), is wrong --- the O(1/n) terms are off by various powers of $\sigma^2$. However, both (2.3) and (2.4) are right...
in_NB  have_read  laplace_approximation  statistics  approximation  kith_and_kin  kass.robert
january 2015 by cshalizi
[1405.0058] Underestimating extreme events in power-law behavior due to machine-dependent cutoffs
"Power-law distributions are typical macroscopic features occurring in almost all complex systems observable in nature. As a result, researchers in quantitative analyses must often generate random synthetic variates obeying power-law distributions. The task is usually performed through standard methods that map uniform random variates into the desired probability space. Whereas all these algorithms are theoretically solid, in this paper we show that they are subject to severe machine-dependent limitations. As a result, two dramatic consequences arise: (i) the sampling in the tail of the distribution is not random but deterministic; (ii) the moments of the sample distribution, which are theoretically expected to diverge as functions of the sample sizes, converge instead to finite values. We provide quantitative indications for the range of distribution parameters that can be safely handled by standard libraries used in computational analyses. Whereas our findings indicate possible reinterpretations of numerical results obtained through flawed sampling methodologies, they also pave the way for the search for a concrete solution to this central issue shared by all quantitative sciences dealing with complexity."
to:NB  to_read  heavy_tails  approximation  computational_statistics  have_skimmed
january 2015 by cshalizi
A Primer on Regression Splines
"B-splines constitute an appealing method for the nonparametric estimation of a range of statis- tical objects of interest. In this primer we focus our attention on the estimation of a conditional mean, i.e. the ‘regression function’."
may 2014 by cshalizi
Distortion of genealogical properties when the sample is very large
"Study sample sizes in human genetics are growing rapidly, and in due course it will become routine to analyze samples with hundreds of thousands, if not millions, of individuals. In addition to posing computational challenges, such large sample sizes call for carefully reexamining the theoretical foundation underlying commonly used analytical tools. Here, we study the accuracy of the coalescent, a central model for studying the ancestry of a sample of individuals. The coalescent arises as a limit of a large class of random mating models, and it is an accurate approximation to the original model provided that the population size is sufficiently larger than the sample size. We develop a method for performing exact computation in the discrete-time Wright–Fisher (DTWF) model and compare several key genealogical quantities of interest with the coalescent predictions. For recently inferred demographic scenarios, we find that there are a significant number of multiple- and simultaneous-merger events under the DTWF model, which are absent in the coalescent by construction. Furthermore, for large sample sizes, there are noticeable differences in the expected number of rare variants between the coalescent and the DTWF model. To balance the trade-off between accuracy and computational efficiency, we propose a hybrid algorithm that uses the DTWF model for the recent past and the coalescent for the more distant past. Our results demonstrate that the hybrid method with only a handful of generations of the DTWF model leads to a frequency spectrum that is quite close to the prediction of the full DTWF model."
to:NB  stochastic_processes  approximation  historical_genetics  statistics
march 2014 by cshalizi
[1402.1694] Asymptotically Exact MCMC Algorithms via Local Approximations of Computationally Intensive Models
"We construct a new framework for accelerating MCMC algorithms for sampling from posterior distributions in the context of computationally intensive models. We proceed by constructing local surrogates of the forward model within the Metropolis-Hastings kernel, borrowing ideas from deterministic approximation theory, optimization, and experimental design. Our work departs from previous work in surrogate-based inference by exploiting useful convergence characteristics of local approximations. We prove the ergodicity of our approximate Markov chain and show that it samples asymptotically from the \emph{exact} posterior distribution of interest. We describe variations of the algorithm that construct either local polynomial approximations or Gaussian process regressors, thus spanning two important classes of surrogate models. Our theoretical results reinforce the key observation underlying this paper: when the likelihood has some \emph{local} regularity, the number of model evaluations per MCMC step can be greatly reduced, without incurring significant bias in the Monte Carlo average. Our numerical experiments demonstrate order-of-magnitude reductions in the number of forward model evaluations used in representative ODE or PDE inference problems, in both real and synthetic data examples."
to:NB  monte_carlo  statistics  approximation
february 2014 by cshalizi
[1311.7286] Approximate Bayesian Computation with composite score functions
"Both Approximate Bayesian Computation (ABC) and composite likelihood methods are useful for Bayesian and frequentist inference when the likelihood function is intractable. We show that composite likelihoods score functions can be fruitfully used as automatic informative summary statistics in ABC in order to obtain accurate approximations to the posterior distribution of the parameter of interest. This is formally motivated by the use of the score function of the full likelihood, and extended to general unbiased estimating functions in complex models. Examples illustrate that the proposed ABC procedure can significantly improve upon usual ABC methods based on ordinary data summaries."
to:NB  approximation  likelihood  statistics  estimation
december 2013 by cshalizi
[1310.5543] Universalities of Reproducing Kernels Revisited
"Kernel methods have been widely applied to machine learning and other questions of approximating an unknown function from its finite sample data. To ensure arbitrary accuracy of such approximation, various denseness conditions are imposed on the selected kernel. This note contributes to the study of universal, characteristic, and C0-universal kernels. We first give simple and direct description of the difference and relation among these three kinds of universalities of kernels. We then focus on translation-invariant and weighted polynomial kernels. A simple and shorter proof of the known characterization of characteristic translation-invariant kernels will be presented. The main purpose of the note is to give a delicate discussion on the universalities of weighted polynomial kernels."
to:NB  kernel_methods  hilbert_space  approximation  learning_theory
october 2013 by cshalizi
[1212.6906] Central Limit Theorems and Multiplier Bootstrap when p is much larger than n
"We derive a central limit theorem for the maximum of a sum of high dimensional random vectors. Specifically, we establish conditions under which the distribution of the maximum is approximated by that of the maximum of a sum of the Gaussian random vectors with the same covariance matrices as the original vectors. The key innovation of this result is that it applies even when the dimension of random vectors (p) is large compared to the sample size (n); in fact, p can be much larger than n. We also show that the distribution of the maximum of a sum of the random vectors with unknown covariance matrices can be consistently estimated by the distribution of the maximum of a sum of the conditional Gaussian random vectors obtained by multiplying the original vectors with i.i.d. Gaussian multipliers. This is the multiplier bootstrap procedure. Here too, p can be large or even much larger than n. These distributional approximations, either Gaussian or conditional Gaussian, yield a high-quality approximation to the distribution of the original maximum, often with approximation error decreasing polynomially in the sample size, and hence are of interest in many applications. We demonstrate how our central limit theorem and the multiplier bootstrap can be used for high dimensional estimation, multiple hypothesis testing, and adaptive specification testing. All these results contain non-asymptotic bounds on approximation errors. "

- For the reading group this week.
to:NB  empirical_processes  stochastic_processes  approximation  central_limit_theorem  bootstrap  statistics  probability
september 2013 by cshalizi
Bryant, J. and Sangwin, C.: How Round Is Your Circle? Where Engineering and Mathematics Meet.
"How do you draw a straight line? How do you determine if a circle is really round? These may sound like simple or even trivial mathematical problems, but to an engineer the answers can mean the difference between success and failure. How Round Is Your Circle? invites readers to explore many of the same fundamental questions that working engineers deal with every day--it's challenging, hands-on, and fun.
"John Bryant and Chris Sangwin illustrate how physical models are created from abstract mathematical ones. Using elementary geometry and trigonometry, they guide readers through paper-and-pencil reconstructions of mathematical problems and show them how to construct actual physical models themselves--directions included. It's an effective and entertaining way to explain how applied mathematics and engineering work together to solve problems, everything from keeping a piston aligned in its cylinder to ensuring that automotive driveshafts rotate smoothly. Intriguingly, checking the roundness of a manufactured object is trickier than one might think. When does the width of a saw blade affect an engineer's calculations--or, for that matter, the width of a physical line? When does a measurement need to be exact and when will an approximation suffice? Bryant and Sangwin tackle questions like these and enliven their discussions with many fascinating highlights from engineering history. Generously illustrated, How Round Is Your Circle? reveals some of the hidden complexities in everyday things."
books:noted  approximation  geometry  engineering  modeling  to_teach  books:owned
may 2013 by cshalizi
Simons Institute for the Theory of Computing - Programs | Fall 2013: Theoretical Foundations of Big Data Analysis
"We live in an era of "Big Data": science, engineering and technology are producing increasingly large data streams, with petabyte and exabyte scales becoming increasingly common. In scientific fields such data arise in part because tests of standard theories increasingly focus on extreme physical conditions (cf., particle physics) and in part because science has become increasingly exploratory (cf., astronomy and genomics). In commerce, massive data arise because so much of human activity is now online, and because business models aim to provide services that are increasingly personalized.
"The Big Data phenomenon presents opportunities and perils. On the optimistic side of the coin, massive data may amplify the inferential power of algorithms that have been shown to be successful on modest-sized data sets. The challenge is to develop the theoretical principles needed to scale inference and learning algorithms to massive, even arbitrary, scale. On the pessimistic side of the coin, massive data may amplify the error rates that are part and parcel of any inferential algorithm. The challenge is to control such errors even in the face of the heterogeneity and uncontrolled sampling processes underlying many massive data sets. Another major issue is that Big Data problems often come with time constraints, where a high-quality answer that is obtained slowly can be less useful than a medium-quality answer that is obtained quickly. Overall we have a problem in which the classical resources of the theory of computation—e.g., time, space and energy—trade off in complex ways with the data resource.
"Various aspects of this general problem are being faced in the theory of computation, statistics and related disciplines—where topics such as dimension reduction, distributed optimization, Monte Carlo sampling, compressed sampling, low-rank matrix factorization, streaming and hardness of approximation are of clear relevance—but the general problem remains untackled. This program will bring together experts from these areas with the aim of laying the theoretical foundations of the emerging field of Big Data."
machine_learning  statistics  computational_complexity  approximation  conferences
december 2012 by cshalizi
[1212.3850] Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
"The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series expansion, and a stochastic approximation via Monte Carlo estimates of the integral updates of the basis coefficients. We prove that the SOSMP iterates converge to a delta-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy delta and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation."
to:NB  machine_learning  graphical_models  belief_propagation  approximation  wainwright.martin_j.
december 2012 by cshalizi
Taylor & Francis Online :: Fast Inference for the Latent Space Network Model Using a Case-Control Approximate Likelihood - Journal of Computational and Graphical Statistics - Volume 21, Issue 4
"Network models are widely used in social sciences and genome sciences. The latent space model proposed by Hoff et al. (2002), and extended by Handcock et al. (2007) to incorporate clustering, provides a visually interpretable model-based spatial representation of relational data and takes account of several intrinsic network properties. Due to the structure of the likelihood function of the latent space model, the computational cost is of order O(N 2), where N is the number of nodes. This makes it infeasible for large networks. In this article, we propose an approximation of the log-likelihood function. We adapt the case-control idea from epidemiology and construct a case-control log-likelihood, which is an unbiased estimator of the log-full likelihood. Replacing the full likelihood by the case-control likelihood in the Markov chain Monte Carlo estimation of the latent space model reduces the computational time from O(N 2) to O(N), making it feasible for large networks. We evaluate its performance using simulated and real data. We fit the model to a large protein–protein interaction data using the case-control likelihood and use the model fitted link probabilities to identify false positive links. Supplemental materials are available online."
to:NB  statistics  network_data_analysis  approximation  likelihood  inference_to_latent_objects  hoff.peter  raftery.adrian
december 2012 by cshalizi
Complexity of Combinatorial Market Makers
"We analyze the computational complexity of market maker pricing algorithms for combinatorial prediction markets. We focus on Hanson’s popular logarithmic market scoring rule market maker (LMSR). Our goal is to implicitly main- tain correct LMSR prices across an exponentially large out- come space. We examine both permutation combinatorics, where outcomes are permutations of objects, and Boolean combinatorics, where outcomes are combinations of binary events. We look at three restrictive languages that limit what traders can bet on. Even with severely limited lan- guages, we find that LMSR pricing is #P-hard, even when the same language admits polynomial-time matching with- out the market maker. We then propose an approximation technique for pricing permutation markets based on an al- gorithm for online permutation learning. The connections we draw between LMSR pricing and the literature on online learning with expert advice may be of independent interest."
in_NB  to_read  computational_complexity  economics  via:nikete  online_learning  approximation  re:computational_lens
october 2012 by cshalizi
[1206.7051] Stochastic Variational Inference
"We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can handle the full data, and outperforms traditional variational inference on a subset. (Further, we show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to very large data sets."
to:NB  machine_learning  blei.david  approximation
july 2012 by cshalizi
[1201.5871] Null models for network data
"The analysis of datasets taking the form of simple, undirected graphs continues to gain in importance across a variety of disciplines. Two choices of null model, the logistic-linear model and the implicit log-linear model, have come into common use for analyzing such network data, in part because each accounts for the heterogeneity of network node degrees typically observed in practice. Here we show how these both may be viewed as instances of a broader class of null models, with the property that all members of this class give rise to essentially the same likelihood-based estimates of link probabilities in sparse graph regimes. This facilitates likelihood-based computation and inference, and enables practitioners to choose the most appropriate null model from this family based on application context. Comparative model fits for a variety of network datasets demonstrate the practical implications of our results."
april 2012 by cshalizi
[0802.4192] Maxisets for Model Selection
"We address the statistical issue of determining the maximal spaces (maxisets) where model selection procedures attain a given rate of convergence. By considering first general dictionaries, then orthonormal bases, we characterize these maxisets in terms of approximation spaces. These results are illustrated by classical choices of wavelet model collections. For each of them, the maxisets are described in terms of functional spaces. We take a special care of the issue of calculability and measure the induced loss of performance in terms of maxisets."
in_NB  statistics  model_selection  approximation
april 2012 by cshalizi
JSTOR: Philosophy of Science, Vol. 79, No. 2 (April 2012), pp. 207-232
"It is proposed that we use the term “approximation” for inexact description of a target system and “idealization” for another system whose properties also provide an inexact description of the target system. Since systems generated by a limiting process can often have quite unexpected—even inconsistent—properties, familiar limit processes used in statistical physics can fail to provide idealizations but merely provide approximations."
to:NB  modeling  philosophy_of_science  approximation  norton.john
march 2012 by cshalizi
[1111.5899] Sampling, Filtering and Sparse Approximations on Combinatorial Graphs
In this paper we address sampling and approximation of functions on combinatorial graphs. We develop filtering on graphs by using Schr"odinger's group of operators generated by combinatorial Laplace operator. Then we construct a sampling theory by proving Poincare and Plancherel-Polya-type inequalities for functions on graphs. These results lead to a theory of sparse approximations on graphs and have potential applications to filtering, denoising, data dimension reduction, image processing, image compression, computer graphics, visualization and learning theory.
to:NB  network_data_analysis  approximation  graph_theory
december 2011 by cshalizi
[1111.0483] Optimally approximating exponential families
"This article studies exponential families $mathcal{E}$ on finite sets such that the information divergence $D(P|mathcal{E})$ of an arbitrary probability distribution from $mathcal{E}$ is bounded by some constant $D>0$. A particular class of low-dimensional exponential families that have low values of $D$ can be obtained from partitions of the state space. The main results concern optimality properties of these partition exponential families. Exponential families where $D=log(2)$ are studied in detail. This case is special, because if $D<log(2)$, then $mathcal{E}$ contains all probability measures with full support."
in_NB  exponential_families  probability  information_theory  approximation
november 2011 by cshalizi
[1107.4353] Upper Bounds for Markov Approximations of Ergodic Processes
"Chains of infinite order are generalizations of Markov chains that constitute a flexible class of models in the general theory of stochastic processes. These processes can be naturally studied using approximating Markov chains. Here we derive new upper bounds for the $bar{d}$-distance and the K"ullback-Leibler divergence between chains of infinite order and their canonical $k$-step Markov approximations. In contrast to the bounds available in the literature our results apply to chains of infinite order compatible with general classes of probability kernels. In particular, we allow kernels with discontinuities and null transition probabilities.""  (Pedantry: Pretty sure Kullback did not spell his name with an umlaut!)
markov_models  stochastic_processes  re:AoS_project  to_read  in_NB  approximation  re:your_favorite_dsge_sucks
july 2011 by cshalizi
[math/0211142] Another universal differential equation
"I construct a new universal differential equation (B), in the sense of Rubel. That is, its solutions approximate to arbitrary accuracy any continuous function on any interval of the real line."
differential_equations  approximation
january 2008 by cshalizi

Copy this bookmark:

description:

tags: