cshalizi + to_teach:statcomp   136

[1909.11827] Convergence diagnostics for Markov chain Monte Carlo
"Markov chain Monte Carlo (MCMC) is one of the most useful approaches to scientific computing because of its flexible construction, ease of use and generality. Indeed, MCMC is indispensable for performing Bayesian analysis. In spite of its widespread use, two critical questions that MCMC practitioners need to address are where to start and when to stop the simulation. Although a great amount of research has gone into establishing convergence criteria and stopping rules with sound theoretical foundation, in practice, MCMC users often decide convergence by applying empirical diagnostic tools. This review article discusses the most widely used MCMC convergence diagnostic tools. Some recently proposed stopping rules with firm theoretical footing are also presented. The convergence diagnostics and stopping rules are illustrated using three detailed examples."
to:NB  monte_carlo  statistics  computational_statistics  simulation  to_teach:statcomp 
20 days ago by cshalizi
[1909.03813] INTEREST: INteractive Tool for Exploring REsults from Simulation sTudies
"Simulation studies allow us to explore the properties of statistical methods. They provide a powerful tool with a multiplicity of aims; among others: evaluating and comparing new or existing statistical methods, assessing violations of modelling assumptions, helping with the understanding of statistical concepts, and supporting the design of clinical trials. The increased availability of powerful computational tools and usable software has contributed to the rise of simulation studies in the current literature. However, simulation studies involve increasingly complex designs, making it difficult to provide all relevant results clearly. Dissemination of results plays a focal role in simulation studies: it can drive applied analysts to use methods that have been shown to perform well in their settings, guide researchers to develop new methods in a promising direction, and provide insights into less established methods. It is crucial that we can digest relevant results of simulation studies. Therefore, we developed INTEREST: an INteractive Tool for Exploring REsults from Simulation sTudies. The tool has been developed using the Shiny framework in R and is available as a web app or as a standalone package. It requires uploading a tidy format dataset with the results of a simulation study in R, Stata, SAS, SPSS, or comma-separated format. A variety of performance measures are estimated automatically along with Monte Carlo standard errors; results and performance summaries are displayed both in tabular and graphical fashion, with a wide variety of available plots. Consequently, the reader can focus on simulation parameters and estimands of most interest. In conclusion, INTEREST can facilitate the investigation of results from simulation studies and supplement the reporting of results, allowing researchers to share detailed results from their simulations and readers to explore them freely."
to:NB  simulation  R  to_teach:statcomp 
4 weeks ago by cshalizi
[1106.4929] Simulating rare events in dynamical processes
"Atypical, rare trajectories of dynamical systems are important: they are often the paths for chemical reactions, the haven of (relative) stability of planetary systems, the rogue waves that are detected in oil platforms, the structures that are responsible for intermittency in a turbulent liquid, the active regions that allow a supercooled liquid to flow... Simulating them in an efficient, accelerated way, is in fact quite simple.
"In this paper we review a computational technique to study such rare events in both stochastic and Hamiltonian systems. The method is based on the evolution of a family of copies of the system which are replicated or killed in such a way as to favor the realization of the atypical trajectories. We illustrate this with various examples."
to:NB  stochastic_processes  simulation  large_deviations  to_teach:data_over_space_and_time  to_teach:statcomp  re:fitness_sampling  re:do-institutions-evolve 
12 weeks ago by cshalizi
Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics: Vol 0, No 0
"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand. "
to:NB  computational_statistics  linear_regression  regression  databases  lumley.thomas  to_teach:statcomp 
june 2019 by cshalizi
Allesina, S. and Wilmes, M.: Computing Skills for Biologists: A Toolbox (Hardcover, Paperback and eBook) | Princeton University Press
"While biological data continues to grow exponentially in size and quality, many of today’s biologists are not trained adequately in the computing skills necessary for leveraging this information deluge. In Computing Skills for Biologists, Stefano Allesina and Madlen Wilmes present a valuable toolbox for the effective analysis of biological data.
"Based on the authors’ experiences teaching scientific computing at the University of Chicago, this textbook emphasizes the automation of repetitive tasks and the construction of pipelines for data organization, analysis, visualization, and publication. Stressing practice rather than theory, the book’s examples and exercises are drawn from actual biological data and solve cogent problems spanning the entire breadth of biological disciplines, including ecology, genetics, microbiology, and molecular biology. Beginners will benefit from the many examples explained step-by-step, while more seasoned researchers will learn how to combine tools to make biological data analysis robust and reproducible. The book uses free software and code that can be run on any platform.
"Computing Skills for Biologists is ideal for scientists wanting to improve their technical skills and instructors looking to teach the main computing tools essential for biology research in the twenty-first century."
to:NB  books:noted  scientific_computing  R  to_teach:statcomp 
january 2019 by cshalizi
General Resampling Infrastructure • rsample
"rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:
"traditional resampling techniques for estimating the sampling distribution of a statistic and
"estimating model performance using a holdout set
"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."
to:NB  R  computational_statistics  to_teach:statcomp  to_teach:undergrad-ADA  via:? 
august 2018 by cshalizi
[1511.01437] The sample size required in importance sampling
"The goal of importance sampling is to estimate the expected value of a given function with respect to a probability measure ν using a random sample of size n drawn from a different probability measure μ. If the two measures μ and ν are nearly singular with respect to each other, which is often the case in practice, the sample size required for accurate estimation is large. In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(ν||μ)) is necessary and sufficient for accurate estimation by importance sampling, where D(ν||μ) is the Kullback--Leibler divergence of μ from ν. In particular, the required sample size exhibits a kind of cut-off in the logarithmic scale. The theory is applied to obtain a fairly general formula for the sample size required in importance sampling for exponential families (Gibbs measures). We also show that the standard variance-based diagnostic for convergence of importance sampling is fundamentally problematic. An alternative diagnostic that provably works in certain situations is suggested."
to:NB  to_read  statistics  monte_carlo  probability  information_theory  to_teach:statcomp  chatterjee.sourav  diaconis.persi  via:ded-maxim  re:fitness_sampling 
october 2016 by cshalizi
[1609.00037] Good Enough Practices in Scientific Computing
"We present a set of computing tools and techniques that every researcher can and should adopt. These recommendations synthesize inspiration from our own work, from the experiences of the thousands of people who have taken part in Software Carpentry and Data Carpentry workshops over the past six years, and from a variety of other guides. Unlike some other guides, our recommendations are aimed specifically at people who are new to research computing."
to:NB  to_teach:statcomp  to_teach  scientific_computing  have_read  to:blog 
september 2016 by cshalizi
Interactive R On-Line
"IROL was developed by the team of Howard Seltman (email feedback), Rebecca Nugent, Sam Ventura, Ryan Tibshirani, and Chris Genovese at the Department of Statistics at Carnegie Mellon University."

--- I mark this as "to_teach:statcomp", but of course the point is to have people go through this _before_ that course, so the class can cover more interesting stuff.
R  kith_and_kin  seltman.howard  nugent.rebecca  genovese.christopher  ventura.samuel  tibshirani.ryan  to_teach:statcomp 
august 2016 by cshalizi
Quantifying Life: A Symbiosis of Computation, Mathematics, and Biology, Kondrashov
"Since the time of Isaac Newton, physicists have used mathematics to describe the behavior of matter of all sizes, from subatomic particles to galaxies. In the past three decades, as advances in molecular biology have produced an avalanche of data, computational and mathematical techniques have also become necessary tools in the arsenal of biologists. But while quantitative approaches are now providing fundamental insights into biological systems, the college curriculum for biologists has not caught up, and most biology majors are never exposed to the computational and probabilistic mathematical approaches that dominate in biological research.
"With Quantifying Life, Dmitry A. Kondrashov offers an accessible introduction to the breadth of mathematical modeling used in biology today. Assuming only a foundation in high school mathematics, Quantifying Life takes an innovative computational approach to developing mathematical skills and intuition. Through lessons illustrated with copious examples, mathematical and programming exercises, literature discussion questions, and computational projects of various degrees of difficulty, students build and analyze models based on current research papers and learn to implement them in the R programming language. This interplay of mathematical ideas, systematically developed programming skills, and a broad selection of biological research topics makes Quantifying Life an invaluable guide for seasoned life scientists and the next generation of biologists alike."

--- Mineable for examples?
books:noted  biology  programming  modeling  to_teach:statcomp  to_teach:complexity-and-inference 
january 2016 by cshalizi
CRAN - Package BatchJobs
"Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine. Multicore and SSH systems are also supported."
to_read  R  programming  to_teach:statcomp 
april 2015 by cshalizi
CRAN - Package markovchain
"Functions and S4 methods to create and manage discrete time Markov chains (DTMC) more easily. In addition functions to perform statistical (fitting and drawing random variates) and probabilistic (analysis of DTMC proprieties) analysis are provided."
markov_models  R  to_teach:statcomp 
march 2015 by cshalizi
Data Science at the Command Line - O'Reilly Media
"This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data."
books:noted  unix  to_teach:statcomp  in_wishlist 
february 2015 by cshalizi
A practical introduction to functional programming at Mary Rose Cook
Uses python, but these ideas are exactly the ones I try to teach in that part of my R course, only better expressed.
programming  functional_programming  have_read  via:tealtan  to_teach:statcomp 
january 2015 by cshalizi
[1409.3531] Object-Oriented Programming, Functional Programming and R
"This paper reviews some programming techniques in R that have proved useful, particularly for substantial projects. These include several versions of object-oriented programming, used in a large number of R packages. The review tries to clarify the origins and ideas behind the various versions, each of which is valuable in the appropriate context. R has also been strongly influenced by the ideas of functional programming and, in particular, by the desire to combine functional with object oriented programming. To clarify how this particular mix of ideas has turned out in the current R language and supporting software, the paper will first review the basic ideas behind object-oriented and functional programming, and then examine the evolution of R with these ideas providing context. Functional programming supports well-defined, defensible software giving reproducible results. Object-oriented programming is the mechanism par excellence for managing complexity while keeping things simple for the user. The two paradigms have been valuable in supporting major software for fitting models to data and numerous other statistical applications. The paradigms have been adopted, and adapted, distinctively in R. Functional programming motivates much of R but R does not enforce the paradigm. Object-oriented programming from a functional perspective differs from that used in non-functional languages, a distinction that needs to be emphasized to avoid confusion. R initially replicated the S language from Bell Labs, which in turn was strongly influenced by earlier program libraries. At each stage, new ideas have been added, but the previous software continues to show its influence in the design as well. Outlining the evolution will further clarify why we currently have this somewhat unusual combination of ideas."
to:NB  to_read  programming  R  chambers.john  to_teach:statcomp 
january 2015 by cshalizi
[1409.5827] Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones
"The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly-applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations."
in_NB  to_read  distributed_systems  computational_statistics  statistics  matloff.norm  to_teach:statcomp 
january 2015 by cshalizi
Optimization Models | Cambridge University Press
"Emphasizing practical understanding over the technicalities of specific algorithms, this elegant textbook is an accessible introduction to the field of optimization, focusing on powerful and reliable convex optimization techniques. Students and practitioners will learn how to recognize, simplify, model and solve optimization problems - and apply these principles to their own projects. A clear and self-contained introduction to linear algebra demonstrates core mathematical concepts in a way that is easy to follow, and helps students to understand their practical relevance. Requiring only a basic understanding of geometry, calculus, probability and statistics, and striking a careful balance between accessibility and rigor, it enables students to quickly understand the material, without being overwhelmed by complex mathematics. Accompanied by numerous end-of-chapter problems, an online solutions manual for instructors, and relevant examples from diverse fields including engineering, data science, economics, finance, and management, this is the perfect introduction to optimization for undergraduate and graduate students."
in_NB  optimization  convexity  books:noted  to_teach:statcomp  to_teach:freshman_seminar_on_optimization 
october 2014 by cshalizi
Quality and efficiency for kernel density estimates in large data
"Kernel density estimates are important for a broad variety of applications. Their construction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques."

--- Ungated version: http://www.cs.utah.edu/~lifeifei/papers/kernelsigmod13.pdf
to:NB  have_read  kernel_estimators  computational_statistics  statistics  density_estimation  to_teach:statcomp  to_teach:undergrad-ADA 
october 2014 by cshalizi
An Algebraic Process for Visual Representation Design
"We present a model of visualization design based on algebraic considerations of the visualization process. The model helps characterize visual encodings, guide their design, evaluate their effectiveness, and highlight their shortcomings. The model has three components: the underlying mathematical structure of the data or object being visualized, the concrete representation of the data in a computer, and (to the extent possible) a mathematical description of how humans perceive the visualization. Because we believe the value of our model lies in its practical application, we propose three general principles for good visualization design. We work through a collection of examples where our model helps explain the known properties of existing visualizations methods, both good and not-so-good, as well as suggesting some novel methods. We describe how to use the model alongside experimental user studies, since it can help frame experiment outcomes in an actionable manner. Exploring the implications and applications of our model and its design principles should provide many directions for future visualization research."
to:NB  to_read  visual_display_of_quantitative_information  representation  to_teach:statcomp  to_teach:data-mining  algebra 
september 2014 by cshalizi
Notifications from R | The stupidest thing...
Couldn't one just do a system call to mail(1), rather than GIVING AN R SCRIPT YOUR PASSWORD?
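--- The system call in question, assuming a configured local mailer; the address is a placeholder:

```r
# Notify by shelling out to mail(1) -- no credentials stored in the script
notify <- function(msg, to, subject = "R job finished") {
  system(sprintf("echo %s | mail -s %s %s",
                 shQuote(msg), shQuote(subject), shQuote(to)))
}
# notify("simulation done", "you@example.edu")
```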
R  to_teach:statcomp  via:phnk 
september 2014 by cshalizi
Visualizing Algorithms
Some of the sorting algorithm visualizations reminded me of my old days working with 1D cellular automata, and the spanning-tree ones are great. (I had never realized the connection between mazes and spanning trees.)
algorithms  programming  to_teach:statcomp  pretty_pictures  visual_display_of_quantitative_information  have_read  via:? 
july 2014 by cshalizi
Piketty in R markdown – we need some help from the crowd | Simply Statistics
The non-proportional spacing of points on the time axis bugged me too, but I think it's more a case of spreadsheet defaults than anything else.
piketty.thomas  economics  data_sets  to_teach:statcomp 
july 2014 by cshalizi
LIWC: Linguistic Inquiry and Word Count
Have they really just stuck words into various categories, and then counted up how often they appear in the document? It seems so, since "It was a beautiful funeral" scores as 20% positive, 0% negative. (If so: problem set for the kids in statistical computing?) Maybe this would get the emotional drift from a long piece of text, but from short snippets like Twitter or Facebook status updates, this has got to be super noisy.
Memo to self, look at whether CMU has a site license before shelling out $29.95.

ETA: The classic "I am in no way unhappy" scores as 1/6 negative.
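--- The whole dictionary-counting scheme, as a problem-set-sized sketch; these tiny word lists are stand-ins, not LIWC's actual categories:

```r
# Score a text as the fraction of its words falling in each category
positive <- c("beautiful", "happy", "good")
negative <- c("unhappy", "sad", "bad")
score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  c(positive = mean(words %in% positive),
    negative = mean(words %in% negative))
}
score("It was a beautiful funeral")  # 20% positive, 0% negative
score("I am in no way unhappy")      # 1/6 negative: negation is invisible
```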
text_mining  linguistics  psychology  to:blog  to_teach:statcomp 
july 2014 by cshalizi
A Primer on Regression Splines
"B-splines constitute an appealing method for the nonparametric estimation of a range of statis- tical objects of interest. In this primer we focus our attention on the estimation of a conditional mean, i.e. the ‘regression function’."
in_NB  splines  nonparametrics  regression  approximation  statistics  computational_statistics  racine.jeffrey_s.  to_teach:statcomp  to_teach:undergrad-ADA  have_read 
may 2014 by cshalizi
Scalable Strategies for Computing with Massive Data
"This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware."
have_read  R  computational_statistics  data_analysis  in_NB  to_teach:statcomp 
february 2014 by cshalizi
[1401.6389] Parallel Optimisation of Bootstrapping in R
"Bootstrapping is a popular and computationally demanding resampling method used for measuring the accuracy of sample estimates and assisting with statistical inference. R is a freely available language and environment for statistical computing popular with biostatisticians for genomic data analyses. A survey of such R users highlighted its implementation of bootstrapping as a prime candidate for parallelization to overcome computational bottlenecks. The Simple Parallel R Interface (SPRINT) is a package that allows R users to exploit high performance computing in multi-core desktops and supercomputers without expert knowledge of such systems. This paper describes the parallelization of bootstrapping for inclusion in the SPRINT R package. Depending on the complexity of the bootstrap statistic and the number of resamples, this implementation has close to optimal speed up on up to 16 nodes of a supercomputer and close to 100 on 512 nodes. This performance in a multi-node setting compares favourably with an existing parallelization option in the native R implementation of bootstrapping."
to:NB  bootstrap  parallel_computing  computational_statistics  R  to_teach:statcomp 
february 2014 by cshalizi
[1402.1894] R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
"Nolan and Temple Lang argue that "the ability to express statistical computations is an essential skill." A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation."
to:NB  R  teaching  statistics  to_teach:statcomp 
february 2014 by cshalizi
CRAN - Package alabama
"Augmented Lagrangian Adaptive Barrier Minimization Algorithm for optimizing smooth nonlinear objective functions with constraints. Linear or nonlinear equality and inequality constraints are allowed."

- Well, that solved my problem.
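--- For the record, the calling convention (the toy problem is mine): hin() returns quantities constrained to be >= 0, heq() quantities constrained to be 0.

```r
# Minimize sum(x^2) subject to sum(x) = 1 and x >= 0
library(alabama)
fn  <- function(x) sum(x^2)
heq <- function(x) sum(x) - 1   # equality constraint
hin <- function(x) x            # inequality constraints, componentwise >= 0
ans <- auglag(par = rep(0.5, 3), fn = fn, heq = heq, hin = hin)
ans$par                         # should be close to rep(1/3, 3)
```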
optimization  R  to_teach:statcomp 
december 2013 by cshalizi
Literate Testing in R | Data Analysis Visually Enforced
Nicely-named functions for testing some kinds of numerical properties.
R  programming  to_teach:statcomp 
november 2013 by cshalizi
[1310.2059] Distributed Coordinate Descent Method for Learning with Big Data
"In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix."
to:NB  optimization  high-dimensional_statistics  computational_statistics  statistics  lasso  to_teach:statcomp 
october 2013 by cshalizi
CRAN - Package hash
"This package implements a data structure similar to hashes in Perl and dictionaries in Python but with a purposefully R flavor. For objects of appreciable size, access using hashes outperforms native named lists and vectors."
R  programming  to_teach:statcomp  hashing 
october 2013 by cshalizi
10 Easy Steps to a Complete Understanding of SQL - Tech.Pro
And by "to_teach", I mean "to mention".

ETA: arthegall calls item #2 somewhere between incoherent and wrong, and he'd know better than I...
databases  programming  to_teach:statcomp  via:kjhealy 
september 2013 by cshalizi
Red State/Blue State Divisions in the 2012 Presidential Election
"The so-called “red/blue paradox” is that rich individuals are more likely to vote Republican but rich states are more likely to support the Democrats. Previ- ous research argued that this seeming paradox could be explained by comparing rich and poor voters within each state – the difference in the Republican vote share between rich and poor voters was much larger in low-income, con- servative, middle-American states like Mississippi than in high-income, liberal, coastal states like Connecticut. We use exit poll and other survey data to assess whether this was still the case for the 2012 Presidential election. Based on this preliminary analysis, we find that, while the red/ blue paradox is still strong, the explanation offered by Gel- man et al. no longer appears to hold. We explore several empirical patterns from this election and suggest possible avenues for resolving the questions posed by the new data."
to:NB  have_read  us_politics  statistics  to_teach:undergrad-ADA  to_teach:statcomp  kith_and_kin  gelman.andrew 
july 2013 by cshalizi
Convex Optimization in R
"Convex optimization now plays an essential role in many facets of statistics. We briefly survey some recent developments and describe some implementations of some methods in R."

- Really, there's that little support for semi-definite programming?
to:NB  optimization  convexity  statistics  to_teach:statcomp  have_read  re:small-area_estimation_by_smoothing 
july 2013 by cshalizi
The Economic Impacts of Tax Expenditures Evidence from Spatial Variation Across the U.S.
Looks nice, and sharing the data is great. But allow me to be geekier than thou for a moment: _Excel_ files, gentlemen? Do the words "Reinhart and Rogoff" mean nothing to you?
economics  inequality  class_struggles_in_america  spatial_statistics  data_sets  statistics  to_teach:undergrad-ADA  to_teach:statcomp  have_read  have_taught  to_teach:data_over_space_and_time 
july 2013 by cshalizi
Christopher Gandrud (간드루드 크리스토파): Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data
"I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.
"I've found the various R methods for doing this hard to remember and usually need to look at old blog posts. Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function."

--- I think this might make a good exercise for statistical computing.
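--- One version of the exercise, in base R with no packages (the country/GDP frame is a made-up example):

```r
# Lag a variable within groups: split, shift each group's series, unsplit
group_lag <- function(x, g, k = 1) {
  unsplit(lapply(split(x, g), function(v) c(rep(NA, k), head(v, -k))), g)
}
d <- data.frame(country = rep(c("A", "B"), each = 3), gdp = 1:6)
d$gdp_lag <- group_lag(d$gdp, d$country)
d   # each country's series is lagged separately, with NA at its start
```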
R  time_series  to_teach:statcomp 
july 2013 by cshalizi
Frontiers in Massive Data Analysis
"From Facebook to Google searches to bookmarking a webpage in our browsers, today's society has become one with an enormous amount of data. Some internet-based companies such as Yahoo! are even storing exabytes (10 to the 18 bytes) of data. Like these companies and the rest of the world, scientific communities are also generating large amounts of data-—mostly terabytes and in some cases near petabytes—from experiments, observations, and numerical simulation. However, the scientific community, along with defense enterprise, has been a leader in generating and using large data sets for many years. The issue that arises with this new type of large data is how to handle it—this includes sharing the data, enabling data security, working with different data formats and structures, dealing with the highly distributed data sources, and more.
"Frontiers in Massive Data Analysis presents the Committee on the Analysis of Massive Data's work to make sense of the current state of data analysis for mining of massive sets of data, to identify gaps in the current practice and to develop methods to fill these gaps. The committee thus examines the frontiers of research that is enabling the analysis of massive data which includes data representation and methods for including humans in the data-analysis loop. The report includes the committee's recommendations, details concerning types of data that build into massive data, and information on the seven computational giants of massive data analysis. "
to:NB  to_read  data_mining  data_analysis  computational_statistics  statistics  machine_learning  via:gelman  to_teach:statcomp  re:data_science_whitepaper  entableted 
july 2013 by cshalizi
My Stat Bytes talk, with slides and code | Nathan VanHoudnos
"I will present a grab bag of tricks to speed up your R code. Topics will include: installing an optimized BLAS, how to profile your R code to find which parts are slow, replacing slow code with inline C/C++, and running code in parallel on multiple cores. My running example will be fitting a 2PL IRT model with a hand coded MCMC sampler. The idea is to start with naive, pedagogically clear code and end up with fast, production quality code."
kith_and_kin  computational_statistics  R  vanhoudnos.nathan  to_teach:statcomp 
june 2013 by cshalizi
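--- The workflow in the talk (profile first, then replace the slow part with compiled code) is language-agnostic. A minimal Python sketch of the first two steps, on a made-up toy likelihood rather than the talk's 2PL IRT sampler: time a naive interpreted loop against a vectorized rewrite that pushes the work into optimized C/BLAS code.

```python
import time
import numpy as np

def log_lik_naive(theta, x):
    # slow: explicit Python-level loop over observations
    total = 0.0
    for xi in x:
        total += -0.5 * (xi - theta) ** 2
    return total

def log_lik_vectorized(theta, x):
    # fast: one call into optimized compiled code
    return -0.5 * np.sum((x - theta) ** 2)

x = np.random.default_rng(0).normal(size=200_000)

t0 = time.perf_counter(); slow = log_lik_naive(1.0, x); t1 = time.perf_counter()
t2 = time.perf_counter(); fast = log_lik_vectorized(1.0, x); t3 = time.perf_counter()

print(f"naive: {t1 - t0:.4f}s, vectorized: {t3 - t2:.4f}s")
```

In R the analogous moves are `Rprof()` to find the hot spot, then vectorization or inline C/C++ for the inner loop.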
[1306.3574] Early stopping and non-parametric regression: An optimal data-dependent stopping rule
"The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient-descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ or the $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator."
in_NB  optimization  kernel_estimators  hilbert_space  nonparametrics  regression  minimax  yu.bin  wainwright.martin_j.  to_teach:statcomp  have_read 
june 2013 by cshalizi
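--- The phenomenon being regularized is easy to see in simulation. A toy sketch (not the paper's data-dependent rule, which needs no held-out data): run gradient descent on the least-squares loss over an RBF-kernel function class and watch held-out error fall and then rise along the iteration path, so that stopping early beats running to convergence.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
x_train = np.sort(rng.uniform(0, 1, n))
y_train = np.sin(4 * np.pi * x_train) + rng.normal(0, 0.5, n)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(4 * np.pi * x_test)   # noiseless test targets

def rbf(a, b, h=0.05):
    return np.exp(-((a[:, None] - b[None, :]) / h) ** 2)

K = rbf(x_train, x_train)
K_test = rbf(x_test, x_train)

alpha = np.zeros(n)                    # f(x) = sum_i alpha_i k(x, x_i)
eta = 0.5                              # step size; stable since eta*lam_max(K)/n < 2
train_mse, test_mse = [], []
for t in range(3000):
    resid = y_train - K @ alpha
    alpha += eta * resid / n           # gradient step on the least-squares loss
    train_mse.append(np.mean(resid ** 2))
    test_mse.append(np.mean((y_test - K_test @ alpha) ** 2))

best = int(np.argmin(test_mse))
print(f"best stop at iteration {best}: test MSE {test_mse[best]:.3f} "
      f"vs final {test_mse[-1]:.3f}")
```

Training error only goes down; the held-out curve is what an (oracle) stopping time would minimize, and the paper's contribution is a rule that mimics it without the hold-out set.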
[1306.2119] Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
"We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of O(1/n^{1/2}). We consider and analyze two algorithms that achieve a rate of O(1/n) for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments on standard machine learning benchmarks showing that they often outperform existing approaches."
in_NB  optimization  learning_theory  statistics  estimation  stochastic_approximation  to_teach:statcomp 
june 2013 by cshalizi
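--- The least-squares half of the result is simple to sketch (toy simulation, not the paper's novel logistic-regression algorithm): run plain stochastic gradient descent with a *constant* step size and keep a running Polyak-Ruppert average of the iterates; after one pass over the data, the average is close to the true regression vector.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

gamma = 0.05                          # constant step size
w = np.zeros(d)
w_sum = np.zeros(d)
for i in range(n):                    # single pass over the data
    grad = (X[i] @ w - y[i]) * X[i]   # unbiased estimate of the gradient
    w -= gamma * grad
    w_sum += w
w_avg = w_sum / n                     # Polyak-Ruppert averaged iterate

print(f"error of last iterate:     {np.linalg.norm(w - w_true):.4f}")
print(f"error of averaged iterate: {np.linalg.norm(w_avg - w_true):.4f}")
```

The last iterate bounces around in a ball whose radius scales with the step size; averaging washes that noise out, which is what buys the O(1/n) rate.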
[1306.1840] Loss-Proportional Subsampling for Subsequent ERM
"We propose a sampling scheme suitable for reducing a data set prior to selecting a hypothesis with minimum empirical risk. The sampling only considers a subset of the ultimate (unknown) hypothesis set, but can nonetheless guarantee that the final excess risk will compare favorably with utilizing the entire original data set. We demonstrate the practical benefits of our approach on a large dataset which we subsample and subsequently fit with boosted trees."

- To_teach is speculative. The trick is to pick some easy-to-compute hypothesis which can be applied to the whole data set, preferentially sample the points with high loss under this pilot model, and then do importance weighting. The excess risk they get for a sub-sample of size m from n observations is O(n^{-0.5}) + O(m^{-0.75}), as opposed to O(m^{-0.5}) for just naively drawing a sub-sample. I don't think they ever compare this to e.g. stochastic gradient descent.
in_NB  estimation  optimization  computational_statistics  statistics  learning_theory  to_teach:statcomp  have_read 
june 2013 by cshalizi
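--- The trick described above can be sketched in a few lines (a hedged toy version using linear regression as both pilot and final model, not the paper's boosted trees): fit a cheap pilot on all n points, sample m points with probability tilted toward high pilot loss, and refit with importance weights so the subsample objective is unbiased for the full-data risk.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 5000, 3, 250
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# cheap pilot fit on the full data
beta_pilot, *_ = np.linalg.lstsq(X, y, rcond=None)
loss = (y - X @ beta_pilot) ** 2

# mix with the uniform distribution so importance weights stay bounded
p = 0.5 * loss / loss.sum() + 0.5 / n
idx = rng.choice(n, size=m, replace=True, p=p)
w = 1.0 / (n * p[idx])                # importance weights

# importance-weighted least squares on the subsample
sw = np.sqrt(w)
beta_sub, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)

print(f"full-data fit: {beta_pilot.round(3)}")
print(f"subsample fit: {beta_sub.round(3)}")
```

The uniform mixing component is what keeps the weights (and hence the variance of the weighted ERM) under control; sampling purely proportional to pilot loss would let near-zero-loss points get enormous weights when they do land in the sample.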
Troubling Trends in Scientific Software Use
"Software pervades every domain of science (1–3), perhaps nowhere more decisively than in modeling. In key scientific areas of great societal importance, models and the software that implement them define both how science is done and what science is done (4, 5). Across all science, this dependence has led to concerns around the need for open access to software (6, 7), centered on the reproducibility of research (1, 8–10). From fields such as high-performance computing, we learn key insights and best practices for how to develop, standardize, and implement software (11). Open and systematic approaches to the development of software are essential for all sciences. But for many scientists this is not sufficient. We describe problems with the adoption and use of scientific software."

--- Shorter: the situation isn't as bad as you'd fear; it's worse.
scientific_computing  programming  to_teach:statcomp 
june 2013 by cshalizi