**cshalizi + to_teach:statcomp**

[1909.11827] Convergence diagnostics for Markov chain Monte Carlo

20 days ago by cshalizi

"Markov chain Monte Carlo (MCMC) is one of the most useful approaches to scientific computing because of its flexible construction, ease of use and generality. Indeed, MCMC is indispensable for performing Bayesian analysis. In spite of its widespread use, two critical questions that MCMC practitioners need to address are where to start and when to stop the simulation. Although a great amount of research has gone into establishing convergence criteria and stopping rules with sound theoretical foundation, in practice, MCMC users often decide convergence by applying empirical diagnostic tools. This review article discusses the most widely used MCMC convergence diagnostic tools. Some recently proposed stopping rules with firm theoretical footing are also presented. The convergence diagnostics and stopping rules are illustrated using three detailed examples."
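
--- Among the most widely used of the empirical diagnostics such reviews cover is the Gelman-Rubin potential scale reduction factor, computed from several parallel chains. A minimal sketch (my own toy Python implementation, not code from the paper; the chain lengths and offsets below are made up for illustration):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for m chains of length n.

    chains: array of shape (m, n), one row per chain.
    Values near 1 suggest the chains have mixed; values well above 1
    indicate non-convergence.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)
    W = chain_vars.mean()                 # within-chain variance
    B = n * chain_means.var(ddof=1)       # between-chain variance
    var_hat = (n - 1) / n * W + B / n     # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))        # four chains, same target
stuck = mixed + np.arange(4)[:, None]     # chains stuck at different offsets
print(gelman_rubin(mixed))                # close to 1
print(gelman_rubin(stuck))                # well above 1
```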

to:NB
monte_carlo
statistics
computational_statistics
simulation
to_teach:statcomp
20 days ago by cshalizi

[1909.03813] INTEREST: INteractive Tool for Exploring REsults from Simulation sTudies

4 weeks ago by cshalizi

"Simulation studies allow us to explore the properties of statistical methods. They provide a powerful tool with a multiplicity of aims; among others: evaluating and comparing new or existing statistical methods, assessing violations of modelling assumptions, helping with the understanding of statistical concepts, and supporting the design of clinical trials. The increased availability of powerful computational tools and usable software has contributed to the rise of simulation studies in the current literature. However, simulation studies involve increasingly complex designs, making it difficult to provide all relevant results clearly. Dissemination of results plays a focal role in simulation studies: it can drive applied analysts to use methods that have been shown to perform well in their settings, guide researchers to develop new methods in a promising direction, and provide insights into less established methods. It is crucial that we can digest relevant results of simulation studies. Therefore, we developed INTEREST: an INteractive Tool for Exploring REsults from Simulation sTudies. The tool has been developed using the Shiny framework in R and is available as a web app or as a standalone package. It requires uploading a tidy format dataset with the results of a simulation study in R, Stata, SAS, SPSS, or comma-separated format. A variety of performance measures are estimated automatically along with Monte Carlo standard errors; results and performance summaries are displayed both in tabular and graphical fashion, with a wide variety of available plots. Consequently, the reader can focus on simulation parameters and estimands of most interest. In conclusion, INTEREST can facilitate the investigation of results from simulation studies and supplement the reporting of results, allowing researchers to share detailed results from their simulations and readers to explore them freely."

to:NB
simulation
R
to_teach:statcomp
4 weeks ago by cshalizi

[1106.4929] Simulating rare events in dynamical processes

12 weeks ago by cshalizi

"Atypical, rare trajectories of dynamical systems are important: they are often the paths for chemical reactions, the haven of (relative) stability of planetary systems, the rogue waves that are detected in oil platforms, the structures that are responsible for intermittency in a turbulent liquid, the active regions that allow a supercooled liquid to flow... Simulating them in an efficient, accelerated way, is in fact quite simple.

"In this paper we review a computational technique to study such rare events in both stochastic and Hamiltonian systems. The method is based on the evolution of a family of copies of the system which are replicated or killed in such a way as to favor the realization of the atypical trajectories. We illustrate this with various examples."
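
--- A toy version of the replicate-or-kill ("cloning") idea, for the simplest possible case of i.i.d. Gaussian increments, where the scaled cumulant generating function is known exactly to be s^2/2. The population size and number of steps are my own choices, and the resampling step is a no-op for i.i.d. increments (the comment notes where a real Markov chain's state would be carried); this is a sketch of the mechanism, not the paper's implementation:

```python
import numpy as np

def cloning_scgf(s, n_copies=2000, n_steps=200, rng=None):
    """Estimate lambda(s) = lim (1/t) log E[exp(s * sum of increments)]
    by evolving a population of copies, replicated or killed according
    to their weights. Increments are i.i.d. N(0, 1), so the exact
    answer is s**2 / 2."""
    if rng is None:
        rng = np.random.default_rng(0)
    log_growth = 0.0
    for _ in range(n_steps):
        steps = rng.normal(size=n_copies)
        weights = np.exp(s * steps)
        log_growth += np.log(weights.mean())   # population growth rate
        # replicate-or-kill: heavy copies are duplicated, light ones die
        idx = rng.choice(n_copies, size=n_copies, p=weights / weights.sum())
        # (with i.i.d. increments the copies carry no state; for a Markov
        #  chain one would carry each copy's configuration through idx)
    return log_growth / n_steps

est = cloning_scgf(0.5)
print(est)   # exact value is 0.5**2 / 2 = 0.125
```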

to:NB
stochastic_processes
simulation
large_deviations
to_teach:data_over_space_and_time
to_teach:statcomp
re:fitness_sampling
re:do-institutions-evolve
12 weeks ago by cshalizi

Fast Generalized Linear Models by Database Sampling and One-Step Polishing: Journal of Computational and Graphical Statistics

june 2019 by cshalizi

"In this article, I show how to fit a generalized linear model to N observations on p variables stored in a relational database, using one sampling query and one aggregation query, as long as N^{1/2+δ} observations can be stored in memory, for some δ>0. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car color in New Zealand. "
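
--- The two-query idea, sketched in Python rather than against a database: fit a pilot estimate on a random subsample (the sampling query), then take one Newton step using the full-data score and Fisher information (the sums Lumley computes with an aggregation query). The logistic-regression setting and simulated data are my own illustration, not the paper's examples:

```python
import numpy as np

def logistic_score_info(X, y, beta):
    """Score vector and Fisher information for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (y - p)
    info = X.T @ (X * (p * (1 - p))[:, None])
    return score, info

def newton_mle(X, y, n_iter=25):
    """Full maximum-likelihood fit by Newton's method, for comparison."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        score, info = logistic_score_info(X, y, beta)
        beta = beta + np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(1)
N = 100_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([-0.5, 1.0, -1.0])
y = rng.random(N) < 1 / (1 + np.exp(-X @ beta_true))

# step 1: pilot fit on a random subsample ("one sampling query")
idx = rng.choice(N, size=2_000, replace=False)
beta_pilot = newton_mle(X[idx], y[idx])

# step 2: one Newton polish using the full-data score and information
# (sums of this form are what "one aggregation query" returns)
score, info = logistic_score_info(X, y, beta_pilot)
beta_polish = beta_pilot + np.linalg.solve(info, score)

beta_full = newton_mle(X, y)
print(np.max(np.abs(beta_polish - beta_full)))   # polished fit ~ full MLE
```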

to:NB
computational_statistics
linear_regression
regression
databases
lumley.thomas
to_teach:statcomp
june 2019 by cshalizi

Allesina, S. and Wilmes, M.: Computing Skills for Biologists: A Toolbox | Princeton University Press

january 2019 by cshalizi

"While biological data continues to grow exponentially in size and quality, many of today’s biologists are not trained adequately in the computing skills necessary for leveraging this information deluge. In Computing Skills for Biologists, Stefano Allesina and Madlen Wilmes present a valuable toolbox for the effective analysis of biological data.

"Based on the authors’ experiences teaching scientific computing at the University of Chicago, this textbook emphasizes the automation of repetitive tasks and the construction of pipelines for data organization, analysis, visualization, and publication. Stressing practice rather than theory, the book’s examples and exercises are drawn from actual biological data and solve cogent problems spanning the entire breadth of biological disciplines, including ecology, genetics, microbiology, and molecular biology. Beginners will benefit from the many examples explained step-by-step, while more seasoned researchers will learn how to combine tools to make biological data analysis robust and reproducible. The book uses free software and code that can be run on any platform.

"Computing Skills for Biologists is ideal for scientists wanting to improve their technical skills and instructors looking to teach the main computing tools essential for biology research in the twenty-first century."

to:NB
books:noted
scientific_computing
R
to_teach:statcomp
january 2019 by cshalizi

General Resampling Infrastructure • rsample

august 2018 by cshalizi

"rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:

"traditional resampling techniques for estimating the sampling distribution of a statistic and

"estimating model performance using a holdout set

"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."

to:NB
R
computational_statistics
to_teach:statcomp
to_teach:undergrad-ADA
via:?
august 2018 by cshalizi

[1511.01437] The sample size required in importance sampling

october 2016 by cshalizi

"The goal of importance sampling is to estimate the expected value of a given function with respect to a probability measure ν using a random sample of size n drawn from a different probability measure μ. If the two measures μ and ν are nearly singular with respect to each other, which is often the case in practice, the sample size required for accurate estimation is large. In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(ν||μ)) is necessary and sufficient for accurate estimation by importance sampling, where D(ν||μ) is the Kullback--Leibler divergence of μ from ν. In particular, the required sample size exhibits a kind of cut-off in the logarithmic scale. The theory is applied to obtain a fairly general formula for the sample size required in importance sampling for exponential families (Gibbs measures). We also show that the standard variance-based diagnostic for convergence of importance sampling is fundamentally problematic. An alternative diagnostic that provably works in certain situations is suggested."
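
--- The cut-off is easy to feel numerically. A sketch (my own example, not the paper's): estimate the mean of nu = N(theta, 1) by sampling from mu = N(0, 1), where D(nu||mu) = theta^2/2. With a mild mismatch a moderate n suffices; make the measures nearly singular and exp(D) dwarfs any feasible n, so a few huge weights dominate and the estimate is badly off:

```python
import numpy as np

rng = np.random.default_rng(2)

def is_estimate(theta, n):
    """Estimate E_nu[X] for nu = N(theta, 1) by importance sampling
    from mu = N(0, 1); here D(nu||mu) = theta**2 / 2."""
    x = rng.normal(size=n)
    w = np.exp(theta * x - theta**2 / 2)   # density ratio d nu / d mu
    return np.mean(w * x)

good = is_estimate(1.0, 100_000)   # exp(D) ~ 1.6: moderate n is plenty
bad = is_estimate(5.0, 1_000)      # exp(D) ~ 2.7e5 >> n: unreliable
print(good)                        # close to 1.0
print(bad)                         # typically far from 5.0
```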

to:NB
to_read
statistics
monte_carlo
probability
information_theory
to_teach:statcomp
chatterjee.sourav
diaconis.persi
via:ded-maxim
re:fitness_sampling
october 2016 by cshalizi

[1609.00037] Good Enough Practices in Scientific Computing

september 2016 by cshalizi

"We present a set of computing tools and techniques that every researcher can and should adopt. These recommendations synthesize inspiration from our own work, from the experiences of the thousands of people who have taken part in Software Carpentry and Data Carpentry workshops over the past six years, and from a variety of other guides. Unlike some other guides, our recommendations are aimed specifically at people who are new to research computing."

to:NB
to_teach:statcomp
to_teach
scientific_computing
have_read
to:blog
september 2016 by cshalizi

Interactive R On-Line

august 2016 by cshalizi

"IROL was developed by the team of Howard Seltman (email feedback), Rebecca Nugent, Sam Ventura, Ryan Tibshirani, and Chris Genovese at the Department of Statistics at Carnegie Mellon University."

--- I mark this as "to_teach:statcomp", but of course the point is to have people go through this _before_ that course, so the class can cover more interesting stuff.

R
kith_and_kin
seltman.howard
nugent.rebecca
genovese.christopher
ventura.samuel
tibshirani.ryan
to_teach:statcomp
august 2016 by cshalizi

Draw the rest of the owl

march 2016 by cshalizi

This strikes me as really excellent pedagogy.

R
programming
to_teach:statcomp
problem-solving
bryan.jennifer
via:tslumley
march 2016 by cshalizi

Jenny Bryan on Twitter: "An Incomplete List of #rstats troubleshooting tips https://t.co/OKKoGkSYzq"

march 2016 by cshalizi

It misses

* Did you use attach()? Don't

but is otherwise pretty good.

R
to_teach:undergrad-ADA
to_teach:statcomp
via:tslumley
bryan.jennifer
march 2016 by cshalizi

Quantifying Life: A Symbiosis of Computation, Mathematics, and Biology, Kondrashov

january 2016 by cshalizi

"Since the time of Isaac Newton, physicists have used mathematics to describe the behavior of matter of all sizes, from subatomic particles to galaxies. In the past three decades, as advances in molecular biology have produced an avalanche of data, computational and mathematical techniques have also become necessary tools in the arsenal of biologists. But while quantitative approaches are now providing fundamental insights into biological systems, the college curriculum for biologists has not caught up, and most biology majors are never exposed to the computational and probabilistic mathematical approaches that dominate in biological research.

"With Quantifying Life, Dmitry A. Kondrashov offers an accessible introduction to the breadth of mathematical modeling used in biology today. Assuming only a foundation in high school mathematics, Quantifying Life takes an innovative computational approach to developing mathematical skills and intuition. Through lessons illustrated with copious examples, mathematical and programming exercises, literature discussion questions, and computational projects of various degrees of difficulty, students build and analyze models based on current research papers and learn to implement them in the R programming language. This interplay of mathematical ideas, systematically developed programming skills, and a broad selection of biological research topics makes Quantifying Life an invaluable guide for seasoned life scientists and the next generation of biologists alike."

--- Mineable for examples?

books:noted
biology
programming
modeling
to_teach:statcomp
to_teach:complexity-and-inference
january 2016 by cshalizi

That time I was nearly burned alive by a machine-learning model and didn’t even notice for 33 years | The Yorkshire Ranter

december 2015 by cshalizi

This is so rich in morals for what I do and teach I hardly know where to start. Beyond: holy shit.

nukes
cold_war
machine_learning
prediction
data_mining
ussr
bad_data_analysis
the_nightmare_from_which_we_are_trying_to_awake
or_perhaps_the_nightmare_into_which_we_are_slipping
the_robo-nuclear_apocalypse_in_our_past_light_cone
track_down_references
to_teach:data-mining
to_teach:statcomp
via:james-nicoll
to:blog
intelligence_(spying)
december 2015 by cshalizi

Commented Scripts to Build Maps with cartography

october 2015 by cshalizi

These look nice. Maybe for the spatial data examples in the book?

R
visual_display_of_quantitative_information
maps
to_teach:statcomp
re:ADAfaEPoV
via:phnk
october 2015 by cshalizi

CRAN - Package BatchJobs

april 2015 by cshalizi

"Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine. Multicore and SSH systems are also supported."

to_read
R
programming
to_teach:statcomp
april 2015 by cshalizi

CRAN - Package markovchain

march 2015 by cshalizi

"Functions and S4 methods to create and manage discrete time Markov chains (DTMC) more easily. In addition functions to perform statistical (fitting and drawing random variates) and probabilistic (analysis of DTMC properties) analysis are provided."
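
--- The two core operations (drawing random variates from a DTMC, and fitting one by maximum likelihood) are a nice small exercise in any language. A Python sketch with a made-up two-state chain, not code from the package:

```python
import numpy as np

def simulate_dtmc(P, n_steps, rng, start=0):
    """Simulate a discrete-time Markov chain with transition matrix P."""
    states = [start]
    for _ in range(n_steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return np.array(states)

def fit_dtmc(states, n_states):
    """Maximum-likelihood transition matrix: row-normalized transition counts."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
chain = simulate_dtmc(P, 20_000, rng)
P_hat = fit_dtmc(chain, 2)
print(np.round(P_hat, 2))   # close to P
```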

markov_models
R
to_teach:statcomp
march 2015 by cshalizi

Data Science at the Command Line - O'Reilly Media

february 2015 by cshalizi

"This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data."

books:noted
unix
to_teach:statcomp
in_wishlist
february 2015 by cshalizi

r - How to add elements to a plot using a knitr chunk without original markdown output? - Stack Overflow

february 2015 by cshalizi

Need to check whether this works when knitting a latex document as well. (Presumably.)

latex
R
knitr
to_teach:statcomp
re:ADAfaEPoV
february 2015 by cshalizi

A practical introduction to functional programming at Mary Rose Cook

january 2015 by cshalizi

Uses python, but these ideas are exactly the ones I try to teach in that part of my R course, only better expressed.

programming
functional_programming
have_read
via:tealtan
to_teach:statcomp
january 2015 by cshalizi

[1409.3531] Object-Oriented Programming, Functional Programming and R

january 2015 by cshalizi

"This paper reviews some programming techniques in R that have proved useful, particularly for substantial projects. These include several versions of object-oriented programming, used in a large number of R packages. The review tries to clarify the origins and ideas behind the various versions, each of which is valuable in the appropriate context. R has also been strongly influenced by the ideas of functional programming and, in particular, by the desire to combine functional with object oriented programming. To clarify how this particular mix of ideas has turned out in the current R language and supporting software, the paper will first review the basic ideas behind object-oriented and functional programming, and then examine the evolution of R with these ideas providing context. Functional programming supports well-defined, defensible software giving reproducible results. Object-oriented programming is the mechanism par excellence for managing complexity while keeping things simple for the user. The two paradigms have been valuable in supporting major software for fitting models to data and numerous other statistical applications. The paradigms have been adopted, and adapted, distinctively in R. Functional programming motivates much of R but R does not enforce the paradigm. Object-oriented programming from a functional perspective differs from that used in non-functional languages, a distinction that needs to be emphasized to avoid confusion. R initially replicated the S language from Bell Labs, which in turn was strongly influenced by earlier program libraries. At each stage, new ideas have been added, but the previous software continues to show its influence in the design as well. Outlining the evolution will further clarify why we currently have this somewhat unusual combination of ideas."

to:NB
to_read
programming
R
chambers.john
to_teach:statcomp
january 2015 by cshalizi

[1409.5827] Software Alchemy: Turning Complex Statistical Computations into Embarrassingly-Parallel Ones

january 2015 by cshalizi

"The growth in the use of computationally intensive statistical procedures, especially with Big Data, has necessitated the usage of parallel computation on diverse platforms such as multicore, GPU, clusters and clouds. However, slowdown due to interprocess communication costs typically limits such methods to "embarrassingly parallel" (EP) algorithms, especially on non-shared memory platforms. This paper develops a broadly-applicable method for converting many non-EP algorithms into statistically equivalent EP ones. The method is shown to yield excellent levels of speedup for a variety of statistical computations. It also overcomes certain problems of memory limitations."

in_NB
to_read
distributed_systems
computational_statistics
statistics
matloff.norm
to_teach:statcomp
january 2015 by cshalizi

Optimization Models | Cambridge University Press

october 2014 by cshalizi

"Emphasizing practical understanding over the technicalities of specific algorithms, this elegant textbook is an accessible introduction to the field of optimization, focusing on powerful and reliable convex optimization techniques. Students and practitioners will learn how to recognize, simplify, model and solve optimization problems - and apply these principles to their own projects. A clear and self-contained introduction to linear algebra demonstrates core mathematical concepts in a way that is easy to follow, and helps students to understand their practical relevance. Requiring only a basic understanding of geometry, calculus, probability and statistics, and striking a careful balance between accessibility and rigor, it enables students to quickly understand the material, without being overwhelmed by complex mathematics. Accompanied by numerous end-of-chapter problems, an online solutions manual for instructors, and relevant examples from diverse fields including engineering, data science, economics, finance, and management, this is the perfect introduction to optimization for undergraduate and graduate students."

in_NB
optimization
convexity
books:noted
to_teach:statcomp
to_teach:freshman_seminar_on_optimization
october 2014 by cshalizi

Quality and efficiency for kernel density estimates in large data

october 2014 by cshalizi

"Kernel density estimates are important for a broad variety of applications. Their construction has been well-studied, but existing techniques are expensive on massive datasets and/or only provide heuristic approximations without theoretical guarantees. We propose randomized and deterministic algorithms with quality guarantees which are orders of magnitude more efficient than previous algorithms. Our algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real datasets demonstrate the quality, efficiency, and scalability of our techniques."

--- Ungated version: http://www.cs.utah.edu/~lifeifei/papers/kernelsigmod13.pdf
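
--- Even the crudest randomized scheme, a uniform subsample, shows why this works: the subsample KDE tracks the full KDE uniformly at a fraction of the cost. (The paper's algorithms are cleverer and carry actual guarantees; this Python sketch, with made-up Gaussian data, is just the baseline intuition.)

```python
import numpy as np

def gauss_kde(grid, data, h):
    """Gaussian kernel density estimate evaluated at each grid point."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(4)
data = rng.normal(size=100_000)
grid = np.linspace(-3, 3, 61)
h = 0.2

full = gauss_kde(grid, data, h)                                 # O(N) per point
sub = gauss_kde(grid, rng.choice(data, 10_000, replace=False), h)  # 10x cheaper
print(np.max(np.abs(full - sub)))   # small uniform error
```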

to:NB
have_read
kernel_estimators
computational_statistics
statistics
density_estimation
to_teach:statcomp
to_teach:undergrad-ADA
october 2014 by cshalizi

An Algebraic Process for Visual Representation Design

september 2014 by cshalizi

"We present a model of visualization design based on algebraic considerations of the visualization process. The model helps characterize visual encodings, guide their design, evaluate their effectiveness, and highlight their shortcomings. The model has three components: the underlying mathematical structure of the data or object being visualized, the concrete representation of the data in a computer, and (to the extent possible) a mathematical description of how humans perceive the visualization. Because we believe the value of our model lies in its practical application, we propose three general principles for good visualization design. We work through a collection of examples where our model helps explain the known properties of existing visualizations methods, both good and not-so-good, as well as suggesting some novel methods. We describe how to use the model alongside experimental user studies, since it can help frame experiment outcomes in an actionable manner. Exploring the implications and applications of our model and its design principles should provide many directions for future visualization research."

to:NB
to_read
visual_display_of_quantitative_information
representation
to_teach:statcomp
to_teach:data-mining
algebra
september 2014 by cshalizi

Notifications from R | The stupidest thing...

september 2014 by cshalizi

Couldn't one just do a system call to mail(1), rather than GIVING AN R SCRIPT YOUR PASSWORD?

R
to_teach:statcomp
via:phnk
september 2014 by cshalizi

Visualizing Algorithms

july 2014 by cshalizi

Some of the sorting algorithm visualizations reminded me of my old days working with 1D cellular automata, and the spanning-tree ones are great. (I had never realized the connection between mazes and spanning trees.)

algorithms
programming
to_teach:statcomp
pretty_pictures
visual_display_of_quantitative_information
have_read
via:?
july 2014 by cshalizi

The Mature Optimization Handbook

july 2014 by cshalizi

Possible reference for statcomp?

programming
to_teach:statcomp
july 2014 by cshalizi

Piketty in R markdown – we need some help from the crowd | Simply Statistics

july 2014 by cshalizi

The non-proportional spacing of points on the time axis bugged me too, but I think it's more a case of spreadsheet defaults than anything else.

piketty.thomas
economics
data_sets
to_teach:statcomp
july 2014 by cshalizi

LIWC: Linguistic Inquiry and Word Count

july 2014 by cshalizi

Have they really just stuck words into various categories, and then counted up how often they appear in the document? It seems so, since "It was a beautiful funeral" scores as 20% positive, 0% negative. (If so: problem set for the kids in statistical computing?) Maybe this would get the emotional drift from a long piece of text, but from short snippets like Twitter or Facebook status updates, this has got to be super noisy.

Memo to self, look at whether CMU has a site license before shelling out $29.95.

ETA: The classic "I am in no way unhappy" scores as 1/6 negative.
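
--- The suspected mechanism is easy to reproduce (and would indeed make a decent problem set). A toy scorer with a made-up two-category dictionary, not LIWC's actual word lists, reproduces both of the scores above:

```python
# Hypothetical mini-dictionaries; LIWC's real categories are much larger.
POSITIVE = {"beautiful", "happy", "good", "love"}
NEGATIVE = {"unhappy", "sad", "bad", "hate"}

def word_count_score(text):
    """Toy LIWC-style scorer: the fraction of words falling in each
    category, ignoring all context -- which is exactly the problem."""
    words = text.lower().split()
    n = len(words)
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos / n, neg / n

print(word_count_score("It was a beautiful funeral"))   # (0.2, 0.0)
print(word_count_score("I am in no way unhappy"))       # (0.0, 1/6)
```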

text_mining
linguistics
psychology
to:blog
to_teach:statcomp
july 2014 by cshalizi

A Primer on Regression Splines

may 2014 by cshalizi

"B-splines constitute an appealing method for the nonparametric estimation of a range of statis- tical objects of interest. In this primer we focus our attention on the estimation of a conditional mean, i.e. the ‘regression function’."
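
--- The basic move, basis expansion followed by ordinary least squares, fits in a few lines. This Python sketch uses the truncated power basis as a simple stand-in for the numerically better-behaved B-spline basis the primer works with; the data and knot placement are made up:

```python
import numpy as np

def spline_basis(x, knots, degree=3):
    """Truncated power basis for a regression spline: a global polynomial
    plus one truncated power term per interior knot. (B-splines span the
    same space but are numerically preferable.)"""
    cols = [x**d for d in range(degree + 1)]
    cols += [np.clip(x - k, 0, None)**degree for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, size=400))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=400)

B = spline_basis(x, knots=np.linspace(0.1, 0.9, 8))
coef, *_ = np.linalg.lstsq(B, y, rcond=None)   # spline fit = OLS on the basis
fit = B @ coef
print(np.max(np.abs(fit - np.sin(2 * np.pi * x))))   # small
```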

in_NB
splines
nonparametrics
regression
approximation
statistics
computational_statistics
racine.jeffrey_s.
to_teach:statcomp
to_teach:undergrad-ADA
have_read
may 2014 by cshalizi

The Overview Project » Algorithms are not enough: lessons bringing computer science to journalism

april 2014 by cshalizi

(Makes loud and prolonged noises of approval)

(Looks guiltily at own practices)

data_analysis
text_mining
journalism
programming
design
data_mining
visual_display_of_quantitative_information
to_teach:data-mining
to_teach:statcomp
to:blog
april 2014 by cshalizi

Language Log » Literate programming and reproducible research

march 2014 by cshalizi

I should get around to reading Knuth on literate programming...

programming
movies
to_teach:statcomp
liberman.mark
march 2014 by cshalizi

Scalable Strategies for Computing with Massive Data

february 2014 by cshalizi

"This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware."

have_read
R
computational_statistics
data_analysis
in_NB
to_teach:statcomp
february 2014 by cshalizi

[1401.6389] Parallel Optimisation of Bootstrapping in R

february 2014 by cshalizi

"Bootstrapping is a popular and computationally demanding resampling method used for measuring the accuracy of sample estimates and assisting with statistical inference. R is a freely available language and environment for statistical computing popular with biostatisticians for genomic data analyses. A survey of such R users highlighted its implementation of bootstrapping as a prime candidate for parallelization to overcome computational bottlenecks. The Simple Parallel R Interface (SPRINT) is a package that allows R users to exploit high performance computing in multi-core desktops and supercomputers without expert knowledge of such systems. This paper describes the parallelization of bootstrapping for inclusion in the SPRINT R package. Depending on the complexity of the bootstrap statistic and the number of resamples, this implementation has close to optimal speed up on up to 16 nodes of a supercomputer and close to 100 on 512 nodes. This performance in a multi-node setting compares favourably with an existing parallelization option in the native R implementation of bootstrapping."

to:NB
bootstrap
parallel_computing
computational_statistics
R
to_teach:statcomp
february 2014 by cshalizi

ŷhat | 10 R packages I wish I knew about earlier

february 2014 by cshalizi

I know most but not all of these...

R
to_teach:statcomp
via:?
february 2014 by cshalizi

[1402.1894] R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics

february 2014 by cshalizi

"Nolan and Temple Lang argue that "the ability to express statistical computations is an essential skill." A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation."

to:NB
R
teaching
statistics
to_teach:statcomp
february 2014 by cshalizi

On undoing, fixing, or removing commits in git

february 2014 by cshalizi

May kill the to_teach tag after reading.

git
programming
to_teach:statcomp
via:?
february 2014 by cshalizi

CRAN - Package alabama

december 2013 by cshalizi

"Augmented Lagrangian Adaptive Barrier Minimization Algorithm for optimizing smooth nonlinear objective functions with constraints. Linear or nonlinear equality and inequality constraints are allowed."

- Well, that solved my problem.
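
--- For the record, a toy call (assuming the package is installed; auglag() is the workhorse function, and alabama's convention is that hin(x) >= 0 on the feasible set):

```r
## Minimize (x1 - 1)^2 + (x2 - 2)^2 subject to x1 + x2 <= 2, via alabama.
library(alabama)

fn  <- function(x) (x[1] - 1)^2 + (x[2] - 2)^2
hin <- function(x) 2 - x[1] - x[2]   # >= 0 exactly when x1 + x2 <= 2

fit <- auglag(par = c(0, 0), fn = fn, hin = hin,
              control.outer = list(trace = FALSE))
fit$par   # analytically, the optimum is (0.5, 1.5)
```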

optimization
R
to_teach:statcomp

december 2013 by cshalizi

Literate Testing in R | Data Analysis Visually Enforced

november 2013 by cshalizi

Nicely-named functions for testing some kinds of numerical properties.

R
programming
to_teach:statcomp
november 2013 by cshalizi

[1310.2059] Distributed Coordinate Descent Method for Learning with Big Data

october 2013 by cshalizi

"In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix."
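
--- The per-coordinate closed-form update is just lasso soft-thresholding; here is a single-machine, cyclic (not distributed or randomized) sketch of that inner step:

```r
## Cyclic coordinate descent for the lasso:
## minimize (1/2n) ||y - X b||^2 + lambda ||b||_1, one coordinate at a time.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

lasso_cd <- function(X, y, lambda, iters = 200) {
  beta <- rep(0, ncol(X))
  r <- y                              # current residual y - X %*% beta
  for (it in seq_len(iters)) {
    for (j in seq_len(ncol(X))) {
      xj <- X[, j]
      zj <- sum(xj * (r + xj * beta[j])) / nrow(X)
      bj <- soft(zj, lambda) / (sum(xj^2) / nrow(X))
      r  <- r + xj * (beta[j] - bj)   # cheap residual update
      beta[j] <- bj
    }
  }
  beta
}

set.seed(1)
X <- matrix(rnorm(600), 200, 3)
y <- drop(X %*% c(1, -2, 0)) + 0.1 * rnorm(200)
round(lasso_cd(X, y, lambda = 0.05), 2)   # near (1, -2, 0), shrunk toward 0
```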

to:NB
optimization
high-dimensional_statistics
computational_statistics
statistics
lasso
to_teach:statcomp
october 2013 by cshalizi

CRAN - Package hash

october 2013 by cshalizi

"This package implements a data structure similar to hashes in Perl and dictionaries in Python but with a purposefully R flavor. For objects of appreciable size, access using hashes outperforms native named lists and vectors."
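
--- Base R has had the same machinery all along: an environment is a hash table, which is what makes it faster than name lookup in a long named list.

```r
## Key-value lookup with an environment (base R's built-in hash table).
tab <- new.env(hash = TRUE)

assign("gene_BRCA1", 0.93, envir = tab)
assign("gene_TP53",  0.41, envir = tab)

get("gene_TP53", envir = tab)                       # 0.41
exists("gene_XYZ", envir = tab, inherits = FALSE)   # FALSE
```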

R
programming
to_teach:statcomp
hashing
october 2013 by cshalizi

10 Easy Steps to a Complete Understanding of SQL - Tech.Pro

september 2013 by cshalizi

And by "to_teach", I mean "to mention".

ETA: arthegall calls item #2 somewhere between incoherent and wrong, and he'd know better than I...

databases
programming
to_teach:statcomp
via:kjhealy

september 2013 by cshalizi

Red State/Blue State Divisions in the 2012 Presidential Election

july 2013 by cshalizi

"The so-called “red/blue paradox” is that rich individuals are more likely to vote Republican but rich states are more likely to support the Democrats. Previous research argued that this seeming paradox could be explained by comparing rich and poor voters within each state – the difference in the Republican vote share between rich and poor voters was much larger in low-income, conservative, middle-American states like Mississippi than in high-income, liberal, coastal states like Connecticut. We use exit poll and other survey data to assess whether this was still the case for the 2012 Presidential election. Based on this preliminary analysis, we find that, while the red/blue paradox is still strong, the explanation offered by Gelman et al. no longer appears to hold. We explore several empirical patterns from this election and suggest possible avenues for resolving the questions posed by the new data."

to:NB
have_read
us_politics
statistics
to_teach:undergrad-ADA
to_teach:statcomp
kith_and_kin
gelman.andrew
july 2013 by cshalizi

Convex Optimization in R

july 2013 by cshalizi

"Convex optimization now plays an essential role in many facets of statistics. We briefly survey some recent developments and describe some implementations of some methods in R."

- Really, there's that little support for semi-definite programming?

to:NB
optimization
convexity
statistics
to_teach:statcomp
have_read
re:small-area_estimation_by_smoothing

july 2013 by cshalizi

The Economic Impacts of Tax Expenditures Evidence from Spatial Variation Across the U.S.

july 2013 by cshalizi

Looks nice, and sharing the data is great. But allow me to be geekier than thou for a moment: _Excel_ files, gentlemen? Do the words "Reinhart and Rogoff" mean nothing to you?

economics
inequality
class_struggles_in_america
spatial_statistics
data_sets
statistics
to_teach:undergrad-ADA
to_teach:statcomp
have_read
have_taught
to_teach:data_over_space_and_time
july 2013 by cshalizi

Christopher Gandrud (간드루드 크리스토파): Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data

july 2013 by cshalizi

"I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.

"I've found the various R methods for doing this hard to remember and usually need to look at old blog posts. Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function."

--- I think this might make a good exercise for statistical computing.
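
--- A base-R solution to that exercise might look like this (the function name shift_by is mine):

```r
## Lag (k > 0) or lead (k < 0) a variable within groups, base R only.
shift_by <- function(x, group, k = 1) {
  shift_one <- function(v) {
    n <- length(v)
    if (k >= 0) c(rep(NA, min(k, n)), head(v, n - min(k, n)))
    else        c(tail(v, n - min(-k, n)), rep(NA, min(-k, n)))
  }
  # split by group, shift within each group, reassemble in original order
  unsplit(lapply(split(x, group), shift_one), group)
}

d <- data.frame(country = c("A", "A", "A", "B", "B"),
                gdp     = c(1, 2, 3, 10, 20))
d$gdp_lag1 <- shift_by(d$gdp, d$country)
d$gdp_lag1   # NA 1 2 NA 10
```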

R
time_series
to_teach:statcomp

july 2013 by cshalizi

Frontiers in Massive Data Analysis

july 2013 by cshalizi

"From Facebook to Google searches to bookmarking a webpage in our browsers, today's society has become one with an enormous amount of data. Some internet-based companies such as Yahoo! are even storing exabytes (10 to the 18 bytes) of data. Like these companies and the rest of the world, scientific communities are also generating large amounts of data—mostly terabytes and in some cases near petabytes—from experiments, observations, and numerical simulation. However, the scientific community, along with defense enterprise, has been a leader in generating and using large data sets for many years. The issue that arises with this new type of large data is how to handle it—this includes sharing the data, enabling data security, working with different data formats and structures, dealing with the highly distributed data sources, and more.

"Frontiers in Massive Data Analysis presents the Committee on the Analysis of Massive Data's work to make sense of the current state of data analysis for mining of massive sets of data, to identify gaps in the current practice and to develop methods to fill these gaps. The committee thus examines the frontiers of research that is enabling the analysis of massive data which includes data representation and methods for including humans in the data-analysis loop. The report includes the committee's recommendations, details concerning types of data that build into massive data, and information on the seven computational giants of massive data analysis. "

to:NB
to_read
data_mining
data_analysis
computational_statistics
statistics
machine_learning
via:gelman
to_teach:statcomp
re:data_science_whitepaper
entableted

july 2013 by cshalizi

My Stat Bytes talk, with slides and code | Nathan VanHoudnos

june 2013 by cshalizi

"I will present a grab bag of tricks to speed up your R code. Topics will include: installing an optimized BLAS, how to profile your R code to find which parts are slow, replacing slow code with inline C/C++, and running code in parallel on multiple cores. My running example will be fitting a 2PL IRT model with a hand coded MCMC sampler. The idea is to start with naive, pedagogically clear code and end up with fast, production quality code."
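
--- Two of those tricks need no extra software at all: profiling with Rprof() and vectorizing are pure base R. A toy before/after, with a check that the answers agree:

```r
## Vectorization, the cheapest speed-up: an explicit double loop vs. the
## built-in colMeans(), which runs in compiled C.
set.seed(2)
m <- matrix(rnorm(1e5), nrow = 100)

col_means_loop <- function(m) {
  out <- numeric(ncol(m))
  for (j in seq_len(ncol(m))) {
    s <- 0
    for (i in seq_len(nrow(m))) s <- s + m[i, j]
    out[j] <- s / nrow(m)
  }
  out
}

stopifnot(all.equal(col_means_loop(m), colMeans(m)))
system.time(col_means_loop(m))   # interpreted loops...
system.time(colMeans(m))         # ...vs. compiled code: much faster
```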

kith_and_kin
computational_statistics
R
vanhoudnos.nathan
to_teach:statcomp
june 2013 by cshalizi

[1306.3574] Early stopping and non-parametric regression: An optimal data-dependent stopping rule

june 2013 by cshalizi

"The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient-descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ and $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator."
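
--- Their rule's selling point is that it needs no hold-out set, but the phenomenon itself is easy to see with one. A toy sketch of gradient descent on least squares, stopped where hold-out risk bottoms out (not the paper's actual rule):

```r
## Early stopping as regularization: track hold-out risk along the
## gradient-descent path and stop at its minimum.
set.seed(3)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(rep(1, 5), rep(0, p - 5))) + rnorm(n)

tr <- 1:150; te <- 151:200
step <- 1 / (2 * max(eigen(crossprod(X[tr, ]) / length(tr))$values))
b <- rep(0, p)
risk <- numeric(200)
for (t in 1:200) {
  g <- crossprod(X[tr, ], X[tr, ] %*% b - y[tr]) / length(tr)
  b <- b - step * drop(g)
  risk[t] <- mean((y[te] - X[te, ] %*% b)^2)
}
which.min(risk)   # the data-chosen stopping time
```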

in_NB
optimization
kernel_estimators
hilbert_space
nonparametrics
regression
minimax
yu.bin
wainwright.martin_j.
to_teach:statcomp
have_read
june 2013 by cshalizi

[1306.2119] Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

june 2013 by cshalizi

"We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of O(1/n^{1/2}). We consider and analyze two algorithms that achieve a rate of O(1/n) for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments on standard machine learning benchmarks showing that they often outperform existing approaches."
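
--- The least-squares half of the result fits in a few lines: the individual iterates of constant-step SGD bounce around the optimum, and Polyak-Ruppert averaging of the iterates does the stabilizing (toy sketch; the step size here is just assumed small enough):

```r
## Constant-step-size SGD for least squares, with iterate averaging.
set.seed(4)
n <- 5000; p <- 5
X <- matrix(rnorm(n * p), n, p)
theta0 <- c(2, -1, 0.5, 0, 1)
y <- drop(X %*% theta0) + rnorm(n)

gamma <- 0.05                    # constant step size
theta <- rep(0, p); avg <- rep(0, p)
for (i in 1:n) {
  xi <- X[i, ]
  theta <- theta + gamma * (y[i] - sum(xi * theta)) * xi
  avg   <- avg + (theta - avg) / i   # running average of the iterates
}
max(abs(avg - theta0))   # small: averaging washes out the constant-step noise
```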

in_NB
optimization
learning_theory
statistics
estimation
stochastic_approximation
to_teach:statcomp
june 2013 by cshalizi

[1306.1840] Loss-Proportional Subsampling for Subsequent ERM

june 2013 by cshalizi

"We propose a sampling scheme suitable for reducing a data set prior to selecting a hypothesis with minimum empirical risk. The sampling only considers a subset of the ultimate (unknown) hypothesis set, but can nonetheless guarantee that the final excess risk will compare favorably with utilizing the entire original data set. We demonstrate the practical benefits of our approach on a large dataset which we subsample and subsequently fit with boosted trees."

- To_teach is speculative. The trick is to pick some easy-to-compute hypothesis which can be applied to the whole data set, preferentially sample the points with high loss under this pilot model, and then do importance weighting. The excess risk they get for a sub-sample of size m from n observations is O(n^{-0.5}) + O(m^{-0.75}), as opposed to O(m^{-0.5}) for just naively drawing a sub-sample. I don't think they ever compare this to e.g. stochastic gradient descent.
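
--- The trick in miniature, for weighted linear regression (my own toy version; the probability floor that keeps the importance weights bounded is a simplification of how the paper mixes in uniform sampling):

```r
## Pilot fit, sample with probability roughly proportional to pilot loss,
## refit on the subsample with inverse-probability weights.
set.seed(5)
n <- 10000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

pilot <- lm(y ~ x, data = data.frame(x, y)[1:500, ])   # cheap pilot model
loss  <- (y - predict(pilot, newdata = data.frame(x)))^2
prob  <- pmin(1, pmax(0.01, 0.04 * n * loss / sum(loss)))  # floored probs
keep  <- runif(n) < prob
w     <- 1 / prob                  # importance weights
fit   <- lm(y ~ x, weights = w, subset = keep)
coef(fit)   # close to the full-data answer, from a few hundred points
```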

in_NB
estimation
optimization
computational_statistics
statistics
learning_theory
to_teach:statcomp
have_read

june 2013 by cshalizi

Troubling Trends in Scientific Software Use

june 2013 by cshalizi

"Software pervades every domain of science (1–3), perhaps nowhere more decisively than in modeling. In key scientific areas of great societal importance, models and the software that implement them define both how science is done and what science is done (4, 5). Across all science, this dependence has led to concerns around the need for open access to software (6, 7), centered on the reproducibility of research (1, 8–10). From fields such as high-performance computing, we learn key insights and best practices for how to develop, standardize, and implement software (11). Open and systematic approaches to the development of software are essential for all sciences. But for many scientists this is not sufficient. We describe problems with the adoption and use of scientific software."

--- Shorter: the situation isn't as bad as you'd fear; it's worse.

scientific_computing
programming
to_teach:statcomp

june 2013 by cshalizi
