[1909.03813] INTEREST: INteractive Tool for Exploring REsults from Simulation sTudies

4 weeks ago by cshalizi

"Simulation studies allow us to explore the properties of statistical methods. They provide a powerful tool with a multiplicity of aims; among others: evaluating and comparing new or existing statistical methods, assessing violations of modelling assumptions, helping with the understanding of statistical concepts, and supporting the design of clinical trials. The increased availability of powerful computational tools and usable software has contributed to the rise of simulation studies in the current literature. However, simulation studies involve increasingly complex designs, making it difficult to provide all relevant results clearly. Dissemination of results plays a focal role in simulation studies: it can drive applied analysts to use methods that have been shown to perform well in their settings, guide researchers to develop new methods in a promising direction, and provide insights into less established methods. It is crucial that we can digest relevant results of simulation studies. Therefore, we developed INTEREST: an INteractive Tool for Exploring REsults from Simulation sTudies. The tool has been developed using the Shiny framework in R and is available as a web app or as a standalone package. It requires uploading a tidy format dataset with the results of a simulation study in R, Stata, SAS, SPSS, or comma-separated format. A variety of performance measures are estimated automatically along with Monte Carlo standard errors; results and performance summaries are displayed both in tabular and graphical fashion, with a wide variety of available plots. Consequently, the reader can focus on simulation parameters and estimands of most interest. In conclusion, INTEREST can facilitate the investigation of results from simulation studies and supplement the reporting of results, allowing researchers to share detailed results from their simulations and readers to explore them freely."

to:NB
simulation
R
to_teach:statcomp
4 weeks ago by cshalizi

[1908.06936] ExaGeoStatR: A Package for Large-Scale Geostatistics in R

8 weeks ago by cshalizi

"Parallel computing in Gaussian process calculation becomes a necessity for avoiding computational and memory restrictions associated with Geostatistics applications. The evaluation of the Gaussian log-likelihood function requires O(n^2) storage and O(n^3) operations where n is the number of geographical locations. In this paper, we present ExaGeoStatR, a package for large-scale Geostatistics in R that supports parallel computation of the maximum likelihood function on shared memory, GPU, and distributed systems. The parallelization depends on breaking down the numerical linear algebra operations into a set of tasks and rendering them for a task-based programming model. ExaGeoStatR supports several maximum likelihood computation variants such as exact, Diagonal Super Tile (DST), and Tile Low-Rank (TLR) approximation besides providing a tool to generate large-scale synthetic datasets which can be used to test and compare different approximation methods. The package can be used directly through the R environment without any C, CUDA, or MPI knowledge. Here, we demonstrate the ExaGeoStatR package by illustrating its implementation details, analyzing its performance on various parallel architectures, and assessing its accuracy using both synthetic datasets and a sea surface temperature dataset. The performance evaluation involves spatial datasets with up to 250K observations."

to:NB
spatial_statistics
prediction
computational_statistics
R
statistics
to_teach:data_over_space_and_time
8 weeks ago by cshalizi

[1904.02101] The Landscape of R Packages for Automated Exploratory Data Analysis

11 weeks ago by cshalizi

"The increasing availability of large but noisy data sets with a large number of heterogeneous variables leads to the increasing interest in the automation of common tasks for data analysis. The most time-consuming part of this process is the Exploratory Data Analysis, crucial for better domain understanding, data cleaning, data validation, and feature engineering.

"There is a growing number of libraries that attempt to automate some of the typical Exploratory Data Analysis tasks to make the search for new insights easier and faster. In this paper, we present a systematic review of existing tools for Automated Exploratory Data Analysis (autoEDA). We explore the features of twelve popular R packages to identify the parts of analysis that can be effectively automated with the current tools and to point out new directions for further autoEDA development."

to:NB
R
exploratory_data_analysis
data_analysis
statistics
to_teach:data-mining

11 weeks ago by cshalizi

Evolutionary Genetics - Hardcover - Glenn-Peter Saetre; Mark Ravinet - Oxford University Press

12 weeks ago by cshalizi

"With recent technological advances, vast quantities of genetic and genomic data are being generated at an ever-increasing pace. The explosion in access to data has transformed the field of evolutionary genetics. A thorough understanding of evolutionary principles is essential for making sense of this, but new skill sets are also needed to handle and analyze big data. This contemporary textbook covers all the major components of modern evolutionary genetics, carefully explaining fundamental processes such as mutation, natural selection, genetic drift, and speciation. It also draws on a rich literature of exciting and inspiring examples to demonstrate the diversity of evolutionary research, including an emphasis on how evolution and selection have shaped our own species.

"Practical experience is essential for developing an understanding of how to use genetic and genomic data to analyze and interpret results in meaningful ways. In addition to the main text, a series of online tutorials using the R language serves as an introduction to programming, statistics, and analysis. Indeed the R environment stands out as an ideal all-purpose open source platform to handle and analyze such data. The book and its online materials take full advantage of the authors' own experience in working in a post-genomic revolution world, and introduce readers to the plethora of molecular and analytical methods that have only recently become available.

"Evolutionary Genetics is an advanced but accessible textbook aimed principally at students of various levels (from undergraduate to postgraduate) but also for researchers looking for an updated introduction to modern evolutionary biology and genetics. "

to:NB
genetics
evolutionary_biology
statistics
R
books:noted

12 weeks ago by cshalizi

Scalable Visualization Methods for Modern Generalized Additive Models: Journal of Computational and Graphical Statistics: Vol 0, No 0

12 weeks ago by cshalizi

"In the last two decades, the growth of computational resources has made it possible to handle generalized additive models (GAMs) that formerly were too costly for serious applications. However, the growth in model complexity has not been matched by improved visualizations for model development and results presentation. Motivated by an industrial application in electricity load forecasting, we identify the areas where the lack of modern visualization tools for GAMs is particularly severe, and we address the shortcomings of existing methods by proposing a set of visual tools that (a) are fast enough for interactive use, (b) exploit the additive structure of GAMs, (c) scale to large data sets, and (d) can be used in conjunction with a wide range of response distributions. The new visual methods proposed here are implemented by the mgcViz R package, available on the Comprehensive R Archive Network. Supplementary materials for this article are available online."

to:NB
additive_models
visual_display_of_quantitative_information
computational_statistics
statistics
R
to_teach:undergrad-ADA
12 weeks ago by cshalizi

[1703.04467] spmoran: An R package for Moran's eigenvector-based spatial regression analysis

12 weeks ago by cshalizi

"This study illustrates how to use "spmoran," which is an R package for Moran's eigenvector-based spatial regression analysis for up to millions of observations. This package estimates fixed or random effects eigenvector spatial filtering models and their extensions including a spatially varying coefficient model, a spatial unconditional quantile regression model, and low rank spatial econometric models. These models are estimated computationally efficiently."

--- ETA after reading: The approach sounds interesting enough that I want to track down the references that actually explain it, rather than just the software.

in_NB
spatial_statistics
regression
statistics
to_teach:data_over_space_and_time
R
have_read

12 weeks ago by cshalizi

A Quick and Tidy Look at the 2018 GSS

march 2019 by cshalizi

Where by "to_teach" I mean "to work through myself".

R
visual_display_of_quantitative_information
to_teach
march 2019 by cshalizi

Modern Statistics for Modern Biology | Statistics for life sciences, medicine and health | Cambridge University Press

february 2019 by cshalizi

"If you are a biologist and want to get the best out of the powerful methods of modern computational statistics, this is your book. You can visualize and analyze your own data, apply unsupervised and supervised learning, integrate datasets, apply hypothesis testing, and make publication-quality figures using the power of R/Bioconductor and ggplot2. This book will teach you 'cooking from scratch', from raw data to beautiful illuminating output, as you learn to write your own scripts in the R language and to use advanced statistics packages from CRAN and Bioconductor. It covers a broad range of basic and advanced topics important in the analysis of high-throughput biological data, including principal component analysis and multidimensional scaling, clustering, multiple testing, unsupervised and supervised learning, resampling, the pitfalls of experimental design, and power simulations using Monte Carlo, and it even reaches networks, trees, spatial statistics, image data, and microbial ecology. Using a minimum of mathematical notation, it builds understanding from well-chosen examples, simulation, visualization, and above all hands-on interaction with data and code."

to:NB
books:noted
statistics
computational_statistics
biology
genomics
R
february 2019 by cshalizi

How to be a Quantitative Ecologist | Wiley Online Books

january 2019 by cshalizi

"Ecological research is becoming increasingly quantitative, yet students often opt out of courses in mathematics and statistics, unwittingly limiting their ability to carry out research in the future. This textbook provides a practical introduction to quantitative ecology for students and practitioners who have realised that they need this opportunity.

"The text is addressed to readers who haven't used mathematics since school, who were perhaps more confused than enlightened by their undergraduate lectures in statistics and who have never used a computer for much more than word processing and data entry. From this starting point, it slowly but surely instils an understanding of mathematics, statistics and programming, sufficient for initiating research in ecology. The book’s practical value is enhanced by extensive use of biological examples and the computer language R for graphics, programming and data analysis."

to:NB
books:noted
downloaded
ecology
statistics
R

january 2019 by cshalizi

An Introduction to the Advanced Theory and Practice of Nonparametric Econometrics by Jeffrey S. Racine

january 2019 by cshalizi

"Interest in nonparametric methodology has grown considerably over the past few decades, stemming in part from vast improvements in computer hardware and the availability of new software that allows practitioners to take full advantage of these numerically intensive methods. This book is written for advanced undergraduate students, intermediate graduate students, and faculty, and provides a complete teaching and learning course at a more accessible level of theoretical rigor than Racine's earlier book co-authored with Qi Li, Nonparametric Econometrics: Theory and Practice (2007). The open source R platform for statistical computing and graphics is used throughout in conjunction with the R package np. Recent developments in reproducible research are emphasized throughout, with appendices devoted to helping the reader get up to speed with R, R Markdown, TeX and Git."

--- Ooh.

to:NB
books:noted
coveted
econometrics
nonparametrics
statistics
racine.jeffrey
R

january 2019 by cshalizi

Simulation and Inference for Stochastic Processes with YUIMA - A Comprehensive R Framework for SDEs and Other Stochastic Processes | Stefano M. Iacus | Springer

to:NB books:noted stochastic_processes statistical_inference_for_stochastic_processes R to_teach:data_over_space_and_time

january 2019 by cshalizi

Allesina, S. and Wilmes, M.: Computing Skills for Biologists: A Toolbox (Hardcover, Paperback and eBook) | Princeton University Press

january 2019 by cshalizi

"While biological data continues to grow exponentially in size and quality, many of today’s biologists are not trained adequately in the computing skills necessary for leveraging this information deluge. In Computing Skills for Biologists, Stefano Allesina and Madlen Wilmes present a valuable toolbox for the effective analysis of biological data.

"Based on the authors’ experiences teaching scientific computing at the University of Chicago, this textbook emphasizes the automation of repetitive tasks and the construction of pipelines for data organization, analysis, visualization, and publication. Stressing practice rather than theory, the book’s examples and exercises are drawn from actual biological data and solve cogent problems spanning the entire breadth of biological disciplines, including ecology, genetics, microbiology, and molecular biology. Beginners will benefit from the many examples explained step-by-step, while more seasoned researchers will learn how to combine tools to make biological data analysis robust and reproducible. The book uses free software and code that can be run on any platform.

"Computing Skills for Biologists is ideal for scientists wanting to improve their technical skills and instructors looking to teach the main computing tools essential for biology research in the twenty-first century."

to:NB
books:noted
scientific_computing
R
to_teach:statcomp

january 2019 by cshalizi

Confidence intervals for GLMs

december 2018 by cshalizi

For the trick about finding the inverse link function.
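
The trick, as I understand it (a sketch of the idea, not code from the linked post, with mtcars standing in for real data): predict on the link scale with standard errors, build a normal-approximation interval there, and map back through the inverse link extracted by family().

```r
# Sketch, assuming a binomial GLM; the model and data are placeholders.
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
ilink <- family(fit)$linkinv                 # extract the inverse link function
pr <- predict(fit, type = "link", se.fit = TRUE)
# Build the interval on the link scale, then map back to the response scale:
lower <- ilink(pr$fit - 2 * pr$se.fit)
upper <- ilink(pr$fit + 2 * pr$se.fit)
```

Back-transforming the endpoints this way keeps the interval inside the response's natural range (here, probabilities in (0, 1)), which naively attaching +/- 2 SE to fitted probabilities does not.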

regression
R
to_teach:undergrad-ADA
via:kjhealy
december 2018 by cshalizi

Solving Differential Equations in R: Package deSolve | Soetaert | Journal of Statistical Software

december 2018 by cshalizi

"In this paper we present the R package deSolve to solve initial value problems (IVP) written as ordinary differential equations (ODE), differential algebraic equations (DAE) of index 0 or 1 and partial differential equations (PDE), the latter solved using the method of lines approach. The differential equations can be represented in R code or as compiled code. In the latter case, R is used as a tool to trigger the integration and post-process the results, which facilitates model development and application, whilst the compiled code significantly increases simulation speed. The methods implemented are efficient, robust, and well documented public-domain Fortran routines. They include four integrators from the ODEPACK package (LSODE, LSODES, LSODA, LSODAR), DVODE and DASPK2.0. In addition, a suite of Runge-Kutta integrators and special-purpose solvers to efficiently integrate 1-, 2- and 3-dimensional partial differential equations are available. The routines solve both stiff and non-stiff systems, and include many options, e.g., to deal in an efficient way with the sparsity of the Jacobian matrix, or finding the root of equations. In this article, our objectives are threefold: (1) to demonstrate the potential of using R for dynamic modeling, (2) to highlight typical uses of the different methods implemented and (3) to compare the performance of models specified in R code and in compiled code for a number of test cases. These comparisons demonstrate that, if the use of loops is avoided, R code can efficiently integrate problems comprising several thousands of state variables. Nevertheless, the same problem may be solved from 2 to more than 50 times faster by using compiled code compared to an implementation using only R code. Still, amongst the benefits of R are a more flexible and interactive implementation, better readability of the code, and access to R’s high-level procedures. 
deSolve is the successor of package odesolve which will be deprecated in the future; it is free software and distributed under the GNU General Public License, as part of the R software project."
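
A minimal sketch of the ode() interface described above, using logistic growth as the IVP; the model and parameter values are my choices, not from the paper.

```r
library(deSolve)
# Logistic growth dN/dt = r * N * (1 - N/K), written as a derivative function
# in the form deSolve expects: f(t, y, parms) returning a list of derivatives.
logistic <- function(t, y, parms) {
  with(as.list(c(y, parms)), list(r * N * (1 - N / K)))
}
out <- ode(y = c(N = 0.1), times = seq(0, 10, by = 0.1),
           func = logistic, parms = c(r = 1, K = 1))
head(out)  # a matrix with columns "time" and "N"
```

With these values the solution should approach the carrying capacity K = 1 by t = 10.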

to:NB
dynamical_systems
computational_statistics
R
to_teach:data_over_space_and_time
have_read
december 2018 by cshalizi

Object-oriented Computation of Sandwich Estimators | Zeileis | Journal of Statistical Software

october 2018 by cshalizi

"Sandwich covariance matrix estimators are a popular tool in applied regression modeling for performing inference that is robust to certain types of model misspecification. Suitable implementations are available in the R system for statistical computing for certain model fitting functions only (in particular lm()), but not for other standard regression functions, such as glm(), nls(), or survreg(). Therefore, conceptual tools and their translation to computational tools in the package sandwich are discussed, enabling the computation of sandwich estimators in general parametric models. Object orientation can be achieved by providing a few extractor functions (most importantly for the empirical estimating functions) from which various types of sandwich estimators can be computed."
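
A minimal sketch of the point of the object-oriented design: sandwich() works on a glm() fit with no model-specific code. The toy model is mine, not from the paper.

```r
library(sandwich)
# Any model class providing estfun() and bread() methods will do; glm() is one.
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
se_model  <- sqrt(diag(vcov(fit)))      # usual model-based standard errors
se_robust <- sqrt(diag(sandwich(fit)))  # robust sandwich standard errors
cbind(model = se_model, robust = se_robust)
```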

to:NB
computational_statistics
R
estimation
regression
statistics
to_teach
october 2018 by cshalizi

rWind package on CRAN

october 2018 by cshalizi

For access to wind velocity data sets. (Surprisingly slow access, but very glad somebody has written this so I don't have to!)

--- ETA: The server they're yanking the data from is very temperamental, and grabbing a long temporal stretch is almost sure to fail. But grabbing about 30 days of data at a time seems OK.

R
data_sets
to_teach:data_over_space_and_time

october 2018 by cshalizi

Analyze and Create Elegant Directed Acyclic Graphs • ggdag

august 2018 by cshalizi

"ggdag: An R Package for visualizing and analyzing directed acyclic graphs"
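
The basic workflow, as a sketch: specify the DAG with dagify() and plot it with ggdag(). The variables x, z, y here are made up for illustration (z confounding the x-to-y relationship).

```r
library(ggdag)
library(ggplot2)
# A made-up confounding triangle: z -> x, z -> y, x -> y.
dag <- dagify(y ~ x + z, x ~ z, exposure = "x", outcome = "y")
p <- ggdag(dag) + theme_dag()  # ggdag() returns an ordinary ggplot object
p
```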

R
graphical_models
visual_display_of_quantitative_information
via:arsyed
to_teach:undergrad-ADA
re:ADAfaEPoV
august 2018 by cshalizi

[1705.08105] FRK: An R Package for Spatial and Spatio-Temporal Prediction with Large Datasets

august 2018 by cshalizi

"FRK is an R software package for spatial/spatio-temporal modelling and prediction with large datasets. It facilitates optimal spatial prediction (kriging) on the most commonly used manifolds (in Euclidean space and on the surface of the sphere), for both spatial and spatio-temporal fields. It differs from many of the packages for spatial modelling and prediction by avoiding stationary and isotropic covariance and variogram models, instead constructing a spatial random effects (SRE) model on a fine-resolution discretised spatial domain. The discrete element is known as a basic areal unit (BAU), whose introduction in the software leads to several practical advantages. The software can be used to (i) integrate multiple observations with different supports with relative ease; (ii) obtain exact predictions at millions of prediction locations (without conditional simulation); and (iii) distinguish between measurement error and fine-scale variation at the resolution of the BAU, thereby allowing for reliable uncertainty quantification. The temporal component is included by adding another dimension. A key component of the SRE model is the specification of spatial or spatio-temporal basis functions; in the package, they can be generated automatically or by the user. The package also offers automatic BAU construction, an expectation-maximisation (EM) algorithm for parameter estimation, and functionality for prediction over any user-specified polygons or BAUs. Use of the package is illustrated on several spatial and spatio-temporal datasets, and its predictions and the model it implements are extensively compared to others commonly used for spatial prediction and modelling."

in_NB
to_read
R
heard_the_talk
prediction
spatial_statistics
spatio-temporal_statistics
to_teach:data_over_space_and_time
august 2018 by cshalizi

General Resampling Infrastructure • rsample

august 2018 by cshalizi

"rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:

"traditional resampling techniques for estimating the sampling distribution of a statistic and

"estimating model performance using a holdout set

"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."
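
A sketch of the first use case above, estimating the sampling distribution of a statistic from bootstrap resamples; mtcars and the choice of statistic are mine, not from the vignette.

```r
library(rsample)
# Bootstrap standard error for the mean of mpg.
set.seed(1)
boots <- bootstraps(mtcars, times = 100)
# analysis() extracts the resampled data frame from each split object:
means <- vapply(boots$splits, function(s) mean(analysis(s)$mpg), numeric(1))
sd(means)  # bootstrap estimate of the standard error of the mean
```

Note that, as the abstract says, rsample only builds and stores the resamples; the statistic computed on each one is entirely up to you.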

to:NB
R
computational_statistics
to_teach:statcomp
to_teach:undergrad-ADA
via:?

august 2018 by cshalizi

Quantitative Methods in Archaeology Using R | Archaeological theory and methods | Cambridge University Press

may 2018 by cshalizi

"Quantitative Methods in Archaeology Using R is the first hands-on guide to using the R statistical computing system written specifically for archaeologists. It shows how to use the system to analyze many types of archaeological data. Part I includes tutorials on R, with applications to real archaeological data showing how to compute descriptive statistics, create tables, and produce a wide variety of charts and graphs. Part II addresses the major multivariate approaches used by archaeologists, including multiple regression (and the generalized linear model); multiple analysis of variance and discriminant analysis; principal components analysis; correspondence analysis; distances and scaling; and cluster analysis. Part III covers specialized topics in archaeology, including intra-site spatial analysis, seriation, and assemblage diversity."

--- This looks like it might be an interesting source of teaching examples. (OTOH, I'm not sure how many of The Kids would get it...)

to:NB
books:noted
archaeology
statistics
R
books:owned

may 2018 by cshalizi

FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees

august 2017 by cshalizi

"Fast-and-frugal trees (FFTs) are simple algorithms that facilitate efficient and accurate decisions based on limited information. But despite their successful use in many applied domains, there is no widely available toolbox that allows anyone to easily create, visualize, and evaluate FFTs. We fill this gap by introducing the R package FFTrees. In this paper, we explain how FFTs work, introduce a new class of algorithms called fan for constructing FFTs, and provide a tutorial for using the FFTrees package. We then conduct a simulation across ten real-world datasets to test how well FFTs created by FFTrees can predict data. Simulation results show that FFTs created by FFTrees can predict data as well as popular classification algorithms such as regression and random forests, while remaining simple enough for anyone to understand and use."

--- I am skeptical about that "simple enough for anyone to understand and use"
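
For the record, the basic usage, fit on the heart-disease data bundled with the package; this is my recollection of the package's own example, so treat the argument and dataset names as assumptions.

```r
library(FFTrees)
# heart.train / heart.test ship with FFTrees; diagnosis is the binary outcome.
fft <- FFTrees(formula = diagnosis ~ ., data = heart.train,
               data.test = heart.test)
plot(fft, data = "test")  # the tree plus its test-set performance
```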

to:NB
have_read
decision_trees
heuristics
cognitive_science
R
to_teach:undergrad-ADA
re:ADAfaEPoV

august 2017 by cshalizi

Interactive R On-Line

august 2016 by cshalizi

"IROL was developed by the team of Howard Seltman (email feedback), Rebecca Nugent, Sam Ventura, Ryan Tibshirani, and Chris Genovese at the Department of Statistics at Carnegie Mellon University."

--- I mark this as "to_teach:statcomp", but of course the point is to have people go through this _before_ that course, so the class can cover more interesting stuff.

R
kith_and_kin
seltman.howard
nugent.rebecca
genovese.christopher
ventura.samuel
tibshirani.ryan
to_teach:statcomp

august 2016 by cshalizi

shinyTex

august 2016 by cshalizi

"ShinyTex is a system for authoring interactive World Wide Web applications (apps) which includes the full capabilities of the R statistical language, particularly in the context of Technology Enhanced Learning (TEL). It uses a modified version of the LaTeX syntax that is standard for document creation among mathematicians and statisticians. It is built on the Shiny platform, an extension of R designed by RStudio to produce web apps. The goal is to provide an easy to use TEL authoring environment with excellent mathematical and statistical support using only free software. ShinyTex authoring can be performed on Windows, OS X, and Linux. Users may view the app on any system with a standard web browser."

R
latex
kith_and_kin
seltman.howard
august 2016 by cshalizi

A User’s Guide to Network Analysis in R - Springer

august 2016 by cshalizi

Apparently covers both igraph and statnet.

to:NB
books:noted
R
network_data_analysis
to_teach:baby-nets
august 2016 by cshalizi

Draw the rest of the owl

march 2016 by cshalizi

This strikes me as really excellent pedagogy.

R
programming
to_teach:statcomp
problem-solving
bryan.jennifer
via:tslumley
march 2016 by cshalizi

Jenny Bryan on Twitter: "An Incomplete List of #rstats troubleshooting tips https://t.co/OKKoGkSYzq"

march 2016 by cshalizi

It misses

* Did you use attach()? Don't

but is otherwise pretty good.
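
A small demonstration (mine) of why attach() deserves its own entry on any troubleshooting list: the attached frame is a copy, so later edits to the data frame are silently invisible through the attached names.

```r
df <- data.frame(x = 1:3)
attach(df)
df$x <- df$x * 10  # modifies df, NOT the copy attached to the search path
x                  # still 1 2 3: the stale attached copy, silently wrong
detach(df)
# Safer alternatives: index explicitly (df$x), or scope with with():
with(df, mean(x))  # 20
```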

R
to_teach:undergrad-ADA
to_teach:statcomp
via:tslumley
bryan.jennifer

march 2016 by cshalizi

Statistical Modeling: A Fresh Approach

december 2015 by cshalizi

"Statistical Modeling: A Fresh Approach introduces and illuminates the statistical reasoning used in modern research throughout the natural and social sciences, medicine, government, and commerce. It emphasizes the use of models to untangle and quantify variation in observed data. By a deft and concise use of computing coupled with an innovative geometrical presentation of the relationship among variables, A Fresh Approach reveals the logic of statistical inference and empowers the reader to use and understand techniques such as analysis of covariance that appear widely in published research but are hardly ever found in introductory texts.

"Recognizing the essential role the computer plays in modern statistics, A Fresh Approach provides a complete and self-contained introduction to statistical computing using the powerful (and free) statistics package R."

in_NB
books:noted
statistics
regression
R
re:ADAfaEPoV

december 2015 by cshalizi

CRAN - Package ridge

october 2015 by cshalizi

"Linear and logistic ridge regression for small data sets and genome-wide SNP data"

R
regression
statistics
ridge_regression
to_teach:undergrad-ADA
to_teach:linear_models
october 2015 by cshalizi

Humanities Data in R - Exploring Networks, Geospatial Data, | Taylor Arnold | Springer

october 2015 by cshalizi

"This pioneering book teaches readers to use R within four core analytical areas applicable to the Humanities: networks, text, geospatial data, and images. This book is also designed to be a bridge: between quantitative and qualitative methods, individual and collaborative work, and the humanities and social sciences. Humanities Data with R does not presuppose background programming experience. Early chapters take readers from R set-up to exploratory data analysis (continuous and categorical data, multivariate analysis, and advanced graphics with emphasis on aesthetics and facility). Following this, networks, geospatial data, image data, natural language processing and text analysis each have a dedicated chapter. Each chapter is grounded in examples to move readers beyond the intimidation of adding new tools to their research. Everything is hands-on: networks are explained using U.S. Supreme Court opinions, and low-level NLP methods are applied to short stories by Sir Arthur Conan Doyle. After working through these examples with the provided data, code and book website, readers are prepared to apply new methods to their own work. The open source R programming language, with its myriad packages and popularity within the sciences and social sciences, is particularly well-suited to working with humanities data. R packages are also highlighted in an appendix. This book uses an expanded conception of the forms data may take and the information it represents. The methodology will have wide application in classrooms and self-study for the humanities, but also for use in linguistics, anthropology, and political science. Outside the classroom, this intersection of humanities and computing is particularly relevant for research and new modes of dissemination across archives, museums and libraries."

to:NB
books:noted
R
statistical_computing
network_data_analysis
text_mining
spatial_statistics
books:owned
electronic_copy
october 2015 by cshalizi

Commented Scripts to Build Maps with cartography

october 2015 by cshalizi

These look nice. Maybe for the spatial data examples in the book?

R
visual_display_of_quantitative_information
maps
to_teach:statcomp
re:ADAfaEPoV
via:phnk
october 2015 by cshalizi

Rcartogram

july 2015 by cshalizi

R interface to the Gastner-Newman cartogram code. I haven't tried it out yet.

R
visual_display_of_quantitative_information
cartograms
july 2015 by cshalizi

CRAN - Package AlgDesign

april 2015 by cshalizi

"Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D,A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact."

R
experimental_design
statistics
to_teach:undergrad-ADA
april 2015 by cshalizi

CRAN - Package BatchJobs

april 2015 by cshalizi

"Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine. Multicore and SSH systems are also supported."

to_read
R
programming
to_teach:statcomp
april 2015 by cshalizi

CRAN - Package markovchain

march 2015 by cshalizi

"Functions and S4 methods to create and manage discrete time Markov chains (DTMC) more easily. In addition functions to perform statistical (fitting and drawing random variates) and probabilistic (analysis of DTMC proprieties) analysis are provided."

markov_models
R
to_teach:statcomp
march 2015 by cshalizi

r - How to add elements to a plot using a knitr chunk without original markdown output? - Stack Overflow

february 2015 by cshalizi

Need to check whether this works when knitting a latex document as well. (Presumably.)

latex
R
knitr
to_teach:statcomp
re:ADAfaEPoV
february 2015 by cshalizi

r - knitr - How to align code and plot side by side - Stack Overflow

january 2015 by cshalizi

Could this be modified to put figure on top, then code, then caption?

R
knitr
re:ADAfaEPoV
january 2015 by cshalizi

[1409.3531] Object-Oriented Programming, Functional Programming and R

january 2015 by cshalizi

"This paper reviews some programming techniques in R that have proved useful, particularly for substantial projects. These include several versions of object-oriented programming, used in a large number of R packages. The review tries to clarify the origins and ideas behind the various versions, each of which is valuable in the appropriate context. R has also been strongly influenced by the ideas of functional programming and, in particular, by the desire to combine functional with object oriented programming. To clarify how this particular mix of ideas has turned out in the current R language and supporting software, the paper will first review the basic ideas behind object-oriented and functional programming, and then examine the evolution of R with these ideas providing context. Functional programming supports well-defined, defensible software giving reproducible results. Object-oriented programming is the mechanism par excellence for managing complexity while keeping things simple for the user. The two paradigms have been valuable in supporting major software for fitting models to data and numerous other statistical applications. The paradigms have been adopted, and adapted, distinctively in R. Functional programming motivates much of R but R does not enforce the paradigm. Object-oriented programming from a functional perspective differs from that used in non-functional languages, a distinction that needs to be emphasized to avoid confusion. R initially replicated the S language from Bell Labs, which in turn was strongly influenced by earlier program libraries. At each stage, new ideas have been added, but the previous software continues to show its influence in the design as well. Outlining the evolution will further clarify why we currently have this somewhat unusual combination of ideas."

to:NB
to_read
programming
R
chambers.john
to_teach:statcomp
january 2015 by cshalizi

sjewo/readstata13 · GitHub

january 2015 by cshalizi

Reading (and eventually writing, but I don't care about that) data files from Stata version 13+, because the "foreign" library in R can only handle Stata 12-, and guess what someone's replication files are in? The things I do for you kids.

ETA: only _some_ of their .dta files are 13+, about half of what I need are 12-, with of course no indication in the file names.

R
january 2015 by cshalizi

Notifications from R | The stupidest thing...

september 2014 by cshalizi

Couldn't one just do a system call to mail(1), rather than GIVING AN R SCRIPT YOUR PASSWORD?
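Something like this, say (a sketch; the address and subject are placeholders, and it assumes a working mail(1) on the system):

```r
# shell out to mail(1) instead of handing an R script your password
notify <- function(subject, body, to = "me@example.com") {
  # system2() runs mail(1) with the body on stdin; returns its exit status
  system2("mail", args = c("-s", shQuote(subject), to), input = body)
}
# notify("simulation done", "All 1000 replicates finished.")
```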

R
to_teach:statcomp
via:phnk
september 2014 by cshalizi

Text Analysis with R for Students of Literature

july 2014 by cshalizi

"Text Analysis with R for Students of Literature is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological tool kit to include quantitative and computational approaches to the study of text. Computation provides access to information in text that we simply cannot gather using traditional qualitative methods of close reading and human synthesis. Text Analysis with R for Students of Literature provides a practical introduction to computational text analysis using the open source programming language R. R is extremely popular throughout the sciences and because of its accessibility, R is now used increasingly in other research areas. Readers begin working with text right away and each chapter works through a new technique or process such that readers gain a broad exposure to core R procedures and a basic understanding of the possibilities of computational text analysis at both the micro and macro scale. Each chapter builds on the previous as readers move from small scale “microanalysis” of single texts to large scale “macroanalysis” of text corpora, and each chapter concludes with a set of practice exercises that reinforce and expand upon the chapter lessons. The book’s focus is on making the technical palatable and making the technical useful and immediately gratifying."

to:NB
books:noted
data_analysis
R
text_mining
humanities
electronic_copy
books:owned
july 2014 by cshalizi

Statistical Analysis of Network Data with R

june 2014 by cshalizi

Supposedly aligned with Kolaczyk's textbook, and using the igraph package.

books:noted
network_data_analysis
statistics
R
kolaczyk.eric
in_NB
books:owned
june 2014 by cshalizi

Scalable Strategies for Computing with Massive Data

february 2014 by cshalizi

"This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware."

have_read
R
computational_statistics
data_analysis
in_NB
to_teach:statcomp
february 2014 by cshalizi

[1401.6389] Parallel Optimisation of Bootstrapping in R

february 2014 by cshalizi

"Bootstrapping is a popular and computationally demanding resampling method used for measuring the accuracy of sample estimates and assisting with statistical inference. R is a freely available language and environment for statistical computing popular with biostatisticians for genomic data analyses. A survey of such R users highlighted its implementation of bootstrapping as a prime candidate for parallelization to overcome computational bottlenecks. The Simple Parallel R Interface (SPRINT) is a package that allows R users to exploit high performance computing in multi-core desktops and supercomputers without expert knowledge of such systems. This paper describes the parallelization of bootstrapping for inclusion in the SPRINT R package. Depending on the complexity of the bootstrap statistic and the number of resamples, this implementation has close to optimal speed up on up to 16 nodes of a supercomputer and close to 100 on 512 nodes. This performance in a multi-node setting compares favourably with an existing parallelization option in the native R implementation of bootstrapping."

to:NB
bootstrap
parallel_computing
computational_statistics
R
to_teach:statcomp
february 2014 by cshalizi

ŷhat | 10 R packages I wish I knew about earlier

february 2014 by cshalizi

I know most but not all of these...

R
to_teach:statcomp
via:?
february 2014 by cshalizi

[1402.1894] R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics

february 2014 by cshalizi

"Nolan and Temple Lang argue that "the ability to express statistical computations is an essential skill." A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation."

to:NB
R
teaching
statistics
to_teach:statcomp
february 2014 by cshalizi

algorithm - Generate a list of primes in R up to a certain number - Stack Overflow

january 2014 by cshalizi

Not actually _about_ hashing, but helpful for creating random hash functions.
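The connection: a universal-family hash h(x) = ((ax + b) mod p) mod m needs a prime p, and a sieve gets you one. A base-R sketch (the function names are mine):

```r
# Sieve of Eratosthenes: all primes up to n (n >= 4 assumed)
primes_up_to <- function(n) {
  is_prime <- rep(TRUE, n); is_prime[1] <- FALSE
  for (p in 2:floor(sqrt(n)))
    if (is_prime[p]) is_prime[seq(p^2, n, by = p)] <- FALSE
  which(is_prime)
}
primes_up_to(30)   # 2 3 5 7 11 13 17 19 23 29

# a random hash function from the universal family, into m buckets
p <- max(primes_up_to(10000))
make_hash <- function(m) {
  a <- sample(p - 1, 1); b <- sample(p, 1) - 1
  function(x) ((a * x + b) %% p) %% m + 1   # 1-based buckets
}
```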

R
primes
programming
hashing
january 2014 by cshalizi

CRAN - Package alabama

december 2013 by cshalizi

"Augmented Lagrangian Adaptive Barrier Minimization Algorithm for optimizing smooth nonlinear objective functions with constraints. Linear or nonlinear equality and inequality constraints are allowed."

- Well, that solved my problem.

optimization
R
to_teach:statcomp
december 2013 by cshalizi

Literate Testing in R | Data Analysis Visually Enforced

november 2013 by cshalizi

Nicely-named functions for testing some kinds of numerical properties.

R
programming
to_teach:statcomp
november 2013 by cshalizi

CRAN - Package hash

october 2013 by cshalizi

"This package implements a data structure similar to hashes in Perl and dictionaries in Python but with a purposefully R flavor. For objects of appreciable size, access using hashes outperforms native named lists and vectors."

R
programming
to_teach:statcomp
hashing
october 2013 by cshalizi

R: Create hash function digests for arbitrary R objects

october 2013 by cshalizi

"The digest function applies a cryptographical hash function to arbitrary R objects. By default, the objects are internally serialized, and either one of the currently implemented MD5 and SHA-1 hash functions algorithms can be used to compute a compact digest of the serialized object."

-- but I want very crude hashing...
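Very crude being, e.g., a few lines of base R (my own throwaway version, no package needed):

```r
# a deliberately crude string hash: sum the character codes, take modulo m
crude_hash <- function(s, m = 256) {
  sum(utf8ToInt(s)) %% m
}
crude_hash("foo")   # 68
```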

R
programming
hashing
october 2013 by cshalizi

Christopher Gandrud (간드루드 크리스토파): Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data

july 2013 by cshalizi

"I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.

"I've found the various R methods for doing this hard to remember and usually need to look at old blog posts. Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function."

--- I think this might make a good exercise for statistical computing.
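A minimal base-R version of the idea, as a starting point for the exercise (the function name and the toy data are mine, not the post's):

```r
# lag a variable within groups of a data frame
group_lag <- function(df, var, group, k = 1) {
  ave(df[[var]], df[[group]],
      FUN = function(v) c(rep(NA, k), head(v, -k)))
}

d <- data.frame(country = rep(c("FR", "DE"), each = 3),
                gdp = c(10, 11, 12, 20, 21, 22))
d$gdp_lag <- group_lag(d, "gdp", "country")
# gdp_lag is NA, 10, 11 for FR and NA, 20, 21 for DE
```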

R
time_series
to_teach:statcomp
"I've found the various R methods for doing this hard to remember and usually need to look at old blog posts. Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function."

--- I think this might make a good exercise for statistical computing.

july 2013 by cshalizi

My Stat Bytes talk, with slides and code | Nathan VanHoudnos

june 2013 by cshalizi

"I will present a grab bag of tricks to speed up your R code. Topics will include: installing an optimized BLAS, how to profile your R code to find which parts are slow, replacing slow code with inline C/C++, and running code in parallel on multiple cores. My running example will be fitting a 2PL IRT model with a hand coded MCMC sampler. The idea is to start with naive, pedagogically clear code and end up with fast, production quality code."

kith_and_kin
computational_statistics
R
vanhoudnos.nathan
to_teach:statcomp
june 2013 by cshalizi

Bayesian Networks in R - with Applications in Systems Biology

june 2013 by cshalizi

"introduces the reader to the essential concepts in Bayesian network modeling and inference in conjunction with examples in the open-source statistical environment R. The level of sophistication is gradually increased across the chapters with exercises and solutions for enhanced understanding and hands-on experimentation of key concepts. Applications focus on systems biology with emphasis on modeling pathways and signaling mechanisms from high throughput molecular data. Bayesian networks have proven to be especially useful abstractions in this regards as exemplified by their ability to discover new associations while validating known ones. It is also expected that the prevalence of publicly available high-throughput biological and healthcare data sets may encourage the audience to explore investigating novel paradigms using the approaches presented in the book."

to:NB
books:noted
graphical_models
statistics
R
june 2013 by cshalizi
