cshalizi + r   155

[1909.03813] INTEREST: INteractive Tool for Exploring REsults from Simulation sTudies
"Simulation studies allow us to explore the properties of statistical methods. They provide a powerful tool with a multiplicity of aims; among others: evaluating and comparing new or existing statistical methods, assessing violations of modelling assumptions, helping with the understanding of statistical concepts, and supporting the design of clinical trials. The increased availability of powerful computational tools and usable software has contributed to the rise of simulation studies in the current literature. However, simulation studies involve increasingly complex designs, making it difficult to provide all relevant results clearly. Dissemination of results plays a focal role in simulation studies: it can drive applied analysts to use methods that have been shown to perform well in their settings, guide researchers to develop new methods in a promising direction, and provide insights into less established methods. It is crucial that we can digest relevant results of simulation studies. Therefore, we developed INTEREST: an INteractive Tool for Exploring REsults from Simulation sTudies. The tool has been developed using the Shiny framework in R and is available as a web app or as a standalone package. It requires uploading a tidy format dataset with the results of a simulation study in R, Stata, SAS, SPSS, or comma-separated format. A variety of performance measures are estimated automatically along with Monte Carlo standard errors; results and performance summaries are displayed both in tabular and graphical fashion, with a wide variety of available plots. Consequently, the reader can focus on simulation parameters and estimands of most interest. In conclusion, INTEREST can facilitate the investigation of results from simulation studies and supplement the reporting of results, allowing researchers to share detailed results from their simulations and readers to explore them freely."
to:NB  simulation  R  to_teach:statcomp 
4 weeks ago by cshalizi
[1908.06936] ExaGeoStatR: A Package for Large-Scale Geostatistics in R
"Parallel computing in Gaussian process calculation becomes a necessity for avoiding computational and memory restrictions associated with Geostatistics applications. The evaluation of the Gaussian log-likelihood function requires O(n^2) storage and O(n^3) operations where n is the number of geographical locations. In this paper, we present ExaGeoStatR, a package for large-scale Geostatistics in R that supports parallel computation of the maximum likelihood function on shared memory, GPU, and distributed systems. The parallelization depends on breaking down the numerical linear algebra operations into a set of tasks and rendering them for a task-based programming model. ExaGeoStatR supports several maximum likelihood computation variants such as exact, Diagonal Super Tile (DST), and Tile Low-Rank (TLR) approximation besides providing a tool to generate large-scale synthetic datasets which can be used to test and compare different approximations methods. The package can be used directly through the R environment without any C, CUDA, or MPIknowledge. Here, we demonstrate the ExaGeoStatR package by illustrating its implementation details, analyzing its performance on various parallel architectures, and assessing its accuracy using both synthetic datasets and a sea surface temperature dataset. The performance evaluation involves spatial datasets with up to 250K observations."
to:NB  spatial_statistics  prediction  computational_statistics  R  statistics  to_teach:data_over_space_and_time 
8 weeks ago by cshalizi
[1904.02101] The Landscape of R Packages for Automated Exploratory Data Analysis
"The increasing availability of large but noisy data sets with a large number of heterogeneous variables leads to the increasing interest in the automation of common tasks for data analysis. The most time-consuming part of this process is the Exploratory Data Analysis, crucial for better domain understanding, data cleaning, data validation, and feature engineering. "
There is a growing number of libraries that attempt to automate some of the typical Exploratory Data Analysis tasks to make the search for new insights easier and faster. In this paper, we present a systematic review of existing tools for Automated Exploratory Data Analysis (autoEDA). We explore the features of twelve popular R packages to identify the parts of analysis that can be effectively automated with the current tools and to point out new directions for further autoEDA development.
to:NB  R  exploratory_data_analysis  data_analysis  statistics  to_teach:data-mining 
11 weeks ago by cshalizi
Evolutionary Genetics - Hardcover - Glenn-Peter Saetre; Mark Ravinet - Oxford University Press
"With recent technological advances, vast quantities of genetic and genomic data are being generated at an ever-increasing pace. The explosion in access to data has transformed the field of evolutionary genetics. A thorough understanding of evolutionary principles is essential for making sense of this, but new skill sets are also needed to handle and analyze big data. This contemporary textbook covers all the major components of modern evolutionary genetics, carefully explaining fundamental processes such as mutation, natural selection, genetic drift, and speciation. It also draws on a rich literature of exciting and inspiring examples to demonstrate the diversity of evolutionary research, including an emphasis on how evolution and selection has shaped our own species.
"Practical experience is essential for developing an understanding of how to use genetic and genomic data to analyze and interpret results in meaningful ways. In addition to the main text, a series of online tutorials using the R language serves as an introduction to programming, statistics, and analysis. Indeed the R environment stands out as an ideal all-purpose source platform to handle and analyze such data. The book and its online materials take full advantage of the authors' own experience in working in a post-genomic revolution world, and introduces readers to the plethora of molecular and analytical methods that have only recently become available.
"Evolutionary Genetics is an advanced but accessible textbook aimed principally at students of various levels (from undergraduate to postgraduate) but also for researchers looking for an updated introduction to modern evolutionary biology and genetics. "
to:NB  genetics  evolutionary_biology  statistics  R  books:noted 
12 weeks ago by cshalizi
Scalable Visualization Methods for Modern Generalized Additive Models: Journal of Computational and Graphical Statistics: Vol 0, No 0
"In the last two decades, the growth of computational resources has made it possible to handle generalized additive models (GAMs) that formerly were too costly for serious applications. However, the growth in model complexity has not been matched by improved visualizations for model development and results presentation. Motivated by an industrial application in electricity load forecasting, we identify the areas where the lack of modern visualization tools for GAMs is particularly severe, and we address the shortcomings of existing methods by proposing a set of visual tools that (a) are fast enough for interactive use, (b) exploit the additive structure of GAMs, (c) scale to large data sets, and (d) can be used in conjunction with a wide range of response distributions. The new visual methods proposed here are implemented by the mgcViz R package, available on the Comprehensive R Archive Network. Supplementary materials for this article are available online."
to:NB  additive_models  visual_display_of_quantitative_information  computational_statistics  statistics  R  to_teach:undergrad-ADA 
12 weeks ago by cshalizi
[1703.04467] spmoran: An R package for Moran's eigenvector-based spatial regression analysis
"This study illustrates how to use "spmoran," which is an R package for Moran's eigenvector-based spatial regression analysis for up to millions of observations. This package estimates fixed or random effects eigenvector spatial filtering models and their extensions including a spatially varying coefficient model, a spatial unconditional quantile regression model, and low rank spatial econometric models. These models are estimated computationally efficiently."

--- ETA after reading: The approach sounds interesting enough that I want to track down the references that actually explain it, rather than just the software.
in_NB  spatial_statistics  regression  statistics  to_teach:data_over_space_and_time  R  have_read 
12 weeks ago by cshalizi
Modern Statistics for Modern Biology | Statistics for life sciences, medicine and health | Cambridge University Press
"If you are a biologist and want to get the best out of the powerful methods of modern computational statistics, this is your book. You can visualize and analyze your own data, apply unsupervised and supervised learning, integrate datasets, apply hypothesis testing, and make publication-quality figures using the power of R/Bioconductor and ggplot2. This book will teach you 'cooking from scratch', from raw data to beautiful illuminating output, as you learn to write your own scripts in the R language and to use advanced statistics packages from CRAN and Bioconductor. It covers a broad range of basic and advanced topics important in the analysis of high-throughput biological data, including principal component analysis and multidimensional scaling, clustering, multiple testing, unsupervised and supervised learning, resampling, the pitfalls of experimental design, and power simulations using Monte Carlo, and it even reaches networks, trees, spatial statistics, image data, and microbial ecology. Using a minimum of mathematical notation, it builds understanding from well-chosen examples, simulation, visualization, and above all hands-on interaction with data and code."
to:NB  books:noted  statistics  computational_statistics  biology  genomics  R 
february 2019 by cshalizi
How to be a Quantitative Ecologist | Wiley Online Books
"Ecological research is becoming increasingly quantitative, yet students often opt out of courses in mathematics and statistics, unwittingly limiting their ability to carry out research in the future. This textbook provides a practical introduction to quantitative ecology for students and practitioners who have realised that they need this opportunity.
"The text is addressed to readers who haven't used mathematics since school, who were perhaps more confused than enlightened by their undergraduate lectures in statistics and who have never used a computer for much more than word processing and data entry. From this starting point, it slowly but surely instils an understanding of mathematics, statistics and programming, sufficient for initiating research in ecology. The book’s practical value is enhanced by extensive use of biological examples and the computer language R for graphics, programming and data analysis."
to:NB  books:noted  downloaded  ecology  statistics  R 
january 2019 by cshalizi
An Introduction to the Advanced Theory and Practice of Nonparametric Econometrics by Jeffrey S. Racine
"Interest in nonparametric methodology has grown considerably over the past few decades, stemming in part from vast improvements in computer hardware and the availability of new software that allows practitioners to take full advantage of these numerically intensive methods. This book is written for advanced undergraduate students, intermediate graduate students, and faculty, and provides a complete teaching and learning course at a more accessible level of theoretical rigor than Racine's earlier book co-authored with Qi Li, Nonparametric Econometrics: Theory and Practice (2007). The open source R platform for statistical computing and graphics is used throughout in conjunction with the R package np. Recent developments in reproducible research is emphasized throughout with appendices devoted to helping the reader get up to speed with R, R Markdown, TeX and Git."

--- Ooh.
to:NB  books:noted  coveted  econometrics  nonparametrics  statistics  racine.jeffrey  R 
january 2019 by cshalizi
Allesina, S. and Wilmes, M.: Computing Skills for Biologists: A Toolbox (Hardcover, Paperback and eBook) | Princeton University Press
"While biological data continues to grow exponentially in size and quality, many of today’s biologists are not trained adequately in the computing skills necessary for leveraging this information deluge. In Computing Skills for Biologists, Stefano Allesina and Madlen Wilmes present a valuable toolbox for the effective analysis of biological data.
"Based on the authors’ experiences teaching scientific computing at the University of Chicago, this textbook emphasizes the automation of repetitive tasks and the construction of pipelines for data organization, analysis, visualization, and publication. Stressing practice rather than theory, the book’s examples and exercises are drawn from actual biological data and solve cogent problems spanning the entire breadth of biological disciplines, including ecology, genetics, microbiology, and molecular biology. Beginners will benefit from the many examples explained step-by-step, while more seasoned researchers will learn how to combine tools to make biological data analysis robust and reproducible. The book uses free software and code that can be run on any platform.
"Computing Skills for Biologists is ideal for scientists wanting to improve their technical skills and instructors looking to teach the main computing tools essential for biology research in the twenty-first century."
to:NB  books:noted  scientific_computing  R  to_teach:statcomp 
january 2019 by cshalizi
Confidence intervals for GLMs
For the trick about finding the inverse link function.
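--- A minimal base-R sketch of the trick (the model and data here are mine, not the linked page's): form Wald intervals on the link scale, where the normal approximation is more trustworthy, then map them back through the inverse link that every fitted GLM carries in its family object.

```r
# Logistic regression on a base-R dataset; any glm() family works the same way.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predictions and standard errors on the link (log-odds) scale:
pr <- predict(fit, newdata = data.frame(wt = c(2, 3, 4)), se.fit = TRUE)

# The trick: pull the inverse link from the fitted family, instead of
# hard-coding plogis()/exp() for each family by hand.
linkinv <- family(fit)$linkinv

# Approximate 95% Wald interval, mapped back to the probability scale:
point <- linkinv(pr$fit)
lower <- linkinv(pr$fit - 1.96 * pr$se.fit)
upper <- linkinv(pr$fit + 1.96 * pr$se.fit)
cbind(lower, point, upper)  # intervals respect [0, 1] by construction
```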
regression  R  to_teach:undergrad-ADA  via:kjhealy 
december 2018 by cshalizi
Solving Differential Equations in R: Package deSolve | Soetaert | Journal of Statistical Software
"In this paper we present the R package deSolve to solve initial value problems (IVP) written as ordinary differential equations (ODE), differential algebraic equations (DAE) of index 0 or 1 and partial differential equations (PDE), the latter solved using the method of lines approach. The differential equations can be represented in R code or as compiled code. In the latter case, R is used as a tool to trigger the integration and post-process the results, which facilitates model development and application, whilst the compiled code significantly increases simulation speed. The methods implemented are efficient, robust, and well documented public-domain Fortran routines. They include four integrators from the ODEPACK package (LSODE, LSODES, LSODA, LSODAR), DVODE and DASPK2.0. In addition, a suite of Runge-Kutta integrators and special-purpose solvers to efficiently integrate 1-, 2- and 3-dimensional partial differential equations are available. The routines solve both stiff and non-stiff systems, and include many options, e.g., to deal in an efficient way with the sparsity of the Jacobian matrix, or finding the root of equations. In this article, our objectives are threefold: (1) to demonstrate the potential of using R for dynamic modeling, (2) to highlight typical uses of the different methods implemented and (3) to compare the performance of models specified in R code and in compiled code for a number of test cases. These comparisons demonstrate that, if the use of loops is avoided, R code can efficiently integrate problems comprising several thousands of state variables. Nevertheless, the same problem may be solved from 2 to more than 50 times faster by using compiled code compared to an implementation using only R code. Still, amongst the benefits of R are a more flexible and interactive implementation, better readability of the code, and access to R’s high-level procedures. 
"deSolve is the successor of package odesolve, which will be deprecated in the future; it is free software and distributed under the GNU General Public License, as part of the R software project."
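--- A minimal sketch of the IVP workflow the paper describes, assuming deSolve is installed; the logistic-growth ODE and its parameter values are my own illustration, not from the paper:

```r
library(deSolve)

# dN/dt = r * N * (1 - N/K); deSolve expects func(t, state, parms) to
# return a list whose first element is the vector of derivatives.
logistic <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    list(r * N * (1 - N / K))
  })
}

out <- ode(y = c(N = 1), times = seq(0, 10, by = 0.1),
           func = logistic, parms = c(r = 1, K = 100))
tail(out, 3)  # a matrix with columns "time" and "N"; N approaches K
```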
to:NB  dynamical_systems  computational_statistics  R  to_teach:data_over_space_and_time  have_read 
december 2018 by cshalizi
Object-oriented Computation of Sandwich Estimators | Zeileis | Journal of Statistical Software
"Sandwich covariance matrix estimators are a popular tool in applied regression modeling for performing inference that is robust to certain types of model misspecification. Suitable implementations are available in the R system for statistical computing for certain model fitting functions only (in particular lm()), but not for other standard regression functions, such as glm(), nls(), or survreg(). Therefore, conceptual tools and their translation to computational tools in the package sandwich are discussed, enabling the computation of sandwich estimators in general parametric models. Object orientation can be achieved by providing a few extractor functions' most importantly for the empirical estimating functions' from which various types of sandwich estimators can be computed."
to:NB  computational_statistics  R  estimation  regression  statistics  to_teach 
october 2018 by cshalizi
rWind package on CRAN
For access to wind velocity data sets. (Surprisingly slow access, but very glad somebody has written this so I don't have to!)

--- ETA: The server they're yanking the data from is very temperamental, and grabbing a long temporal stretch is almost sure to fail. But grabbing about 30 days of data at a time seems OK.
R  data_sets  to_teach:data_over_space_and_time 
october 2018 by cshalizi
[1705.08105] FRK: An R Package for Spatial and Spatio-Temporal Prediction with Large Datasets
"FRK is an R software package for spatial/spatio-temporal modelling and prediction with large datasets. It facilitates optimal spatial prediction (kriging) on the most commonly used manifolds (in Euclidean space and on the surface of the sphere), for both spatial and spatio-temporal fields. It differs from many of the packages for spatial modelling and prediction by avoiding stationary and isotropic covariance and variogram models, instead constructing a spatial random effects (SRE) model on a fine-resolution discretised spatial domain. The discrete element is known as a basic areal unit (BAU), whose introduction in the software leads to several practical advantages. The software can be used to (i) integrate multiple observations with different supports with relative ease; (ii) obtain exact predictions at millions of prediction locations (without conditional simulation); and (iii) distinguish between measurement error and fine-scale variation at the resolution of the BAU, thereby allowing for reliable uncertainty quantification. The temporal component is included by adding another dimension. A key component of the SRE model is the specification of spatial or spatio-temporal basis functions; in the package, they can be generated automatically or by the user. The package also offers automatic BAU construction, an expectation-maximisation (EM) algorithm for parameter estimation, and functionality for prediction over any user-specified polygons or BAUs. Use of the package is illustrated on several spatial and spatio-temporal datasets, and its predictions and the model it implements are extensively compared to others commonly used for spatial prediction and modelling."
in_NB  to_read  R  heard_the_talk  prediction  spatial_statistics  spatio-temporal_statistics  to_teach:data_over_space_and_time 
august 2018 by cshalizi
General Resampling Infrastructure • rsample
"rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:
"traditional resampling techniques for estimating the sampling distribution of a statistic and
"estimating model performance using a holdout set
"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."
to:NB  R  computational_statistics  to_teach:statcomp  to_teach:undergrad-ADA  via:? 
august 2018 by cshalizi
blogdown: Creating Websites with R Markdown
Since I am, as a loyal reader informs me, the last human being still using Blosxom...
blogging  R  to_read 
june 2018 by cshalizi
Quantitative methods archaeology using R | Archaeological theory and methods | Cambridge University Press
"Quantitative Methods in Archaeology Using R is the first hands-on guide to using the R statistical computing system written specifically for archaeologists. It shows how to use the system to analyze many types of archaeological data. Part I includes tutorials on R, with applications to real archaeological data showing how to compute descriptive statistics, create tables, and produce a wide variety of charts and graphs. Part II addresses the major multivariate approaches used by archaeologists, including multiple regression (and the generalized linear model); multiple analysis of variance and discriminant analysis; principal components analysis; correspondence analysis; distances and scaling; and cluster analysis. Part III covers specialized topics in archaeology, including intra-site spatial analysis, seriation, and assemblage diversity."

--- This looks like it might be an interesting source of teaching examples. (OTOH, I'm not sure how many of The Kids would get it...)
to:NB  books:noted  archaeology  statistics  R  books:owned 
may 2018 by cshalizi
FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees
"Fast-and-frugal trees (FFTs) are simple algorithms that facilitate efficient and accurate decisions based on limited information. But despite their successful use in many applied domains, there is no widely available toolbox that allows anyone to easily create, visualize, and evaluate FFTs. We fill this gap by introducing the R package FFTrees. In this paper, we explain how FFTs work, introduce a new class of algorithms called fan for constructing FFTs, and provide a tutorial for using the FFTrees package. We then conduct a simulation across ten real-world datasets to test how well FFTs created by FFTrees can predict data. Simulation results show that FFTs created by FFTrees can predict data as well as popular classification algorithms such as regression and random forests, while remaining simple enough for anyone to understand and use."

--- I am skeptical about that "simple enough for anyone to understand and use"
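--- For reference, the package's entry point, assuming FFTrees is installed; heart.train and heart.test ship with it:

```r
library(FFTrees)

# Grow fast-and-frugal trees predicting heart disease; the fan algorithms
# from the paper are the default construction method.
fft <- FFTrees(formula = diagnosis ~ .,
               data = heart.train,     # training half of the bundled data
               data.test = heart.test)

plot(fft)     # tree diagram plus classification statistics
summary(fft)
```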
to:NB  have_read  decision_trees  heuristics  cognitive_science  R  to_teach:undergrad-ADA  re:ADAfaEPoV 
august 2017 by cshalizi
Interactive R On-Line
"IROL was developed by the team of Howard Seltman (email feedback), Rebecca Nugent, Sam Ventura, Ryan Tibshirani, and Chris Genovese at the Department of Statistics at Carnegie Mellon University."

--- I mark this as "to_teach:statcomp", but of course the point is to have people go through this _before_ that course, so the class can cover more interesting stuff.
R  kith_and_kin  seltman.howard  nugent.rebecca  genovese.christopher  ventura.samuel  tibshirani.ryan  to_teach:statcomp 
august 2016 by cshalizi
"ShinyTex is a system for authoring interactive World Wide Web applications (apps) which includes the full capabilities of the R statistical language, particularly in the context of Technology Enhanced Learning (TEL). It uses a modified version of the LaTeX syntax that is standard for document creation among mathematicians and statisticians. It is built on the Shiny platform, an extension of R designed by RStudio to produce web apps. The goal is to provide an easy to use TEL authoring environment with excellent mathematical and statistical support using only free software. ShinyTex authoring can be performed on Windows, OS X, and Linux. Users may view the app on any system with a standard web browser."
R  latex  kith_and_kin  seltman.howard 
august 2016 by cshalizi
Statistical Modeling: A Fresh Approach
"Statistical Modeling: A Fresh Approach introduces and illuminates the statistical reasoning used in modern research throughout the natural and social sciences, medicine, government, and commerce. It emphasizes the use of models to untangle and quantify variation in observed data. By a deft and concise use of computing coupled with an innovative geometrical presentation of the relationship among variables, A Fresh Approach reveals the logic of statistical inference and empowers the reader to use and understand techniques such as analysis of covariance that appear widely in published research but are hardly ever found in introductory texts.
"Recognizing the essential role the computer plays in modern statistics, A Fresh Approach provides a complete and self-contained introduction to statistical computing using the powerful (and free) statistics package R."
in_NB  books:noted  statistics  regression  R  re:ADAfaEPoV 
december 2015 by cshalizi
CRAN - Package ridge
"Linear and logistic ridge regression for small data sets and genome-wide SNP data"
R  regression  statistics  ridge_regression  to_teach:undergrad-ADA  to_teach:linear_models 
october 2015 by cshalizi
Humanities Data in R - Exploring Networks, Geospatial Data, | Taylor Arnold | Springer
"This pioneering book teaches readers to use R within four core analytical areas applicable to the Humanities: networks, text, geospatial data, and images. This book is also designed to be a bridge: between quantitative and qualitative methods, individual and collaborative work, and the humanities and social sciences. Humanities Data with R does not presuppose background programming experience. Early chapters take readers from R set-up to exploratory data analysis (continuous and categorical data, multivariate analysis, and advanced graphics with emphasis on aesthetics and facility). Following this, networks, geospatial data, image data, natural language processing and text analysis each have a dedicated chapter. Each chapter is grounded in examples to move readers beyond the intimidation of adding new tools to their research. Everything is hands-on: networks are explained using U.S. Supreme Court opinions, and low-level NLP methods are applied to short stories by Sir Arthur Conan Doyle. After working through these examples with the provided data, code and book website, readers are prepared to apply new methods to their own work. The open source R programming language, with its myriad packages and popularity within the sciences and social sciences, is particularly well-suited to working with humanities data. R packages are also highlighted in an appendix. This book uses an expanded conception of the forms data may take and the information it represents. The methodology will have wide application in classrooms and self-study for the humanities, but also for use in linguistics, anthropology, and political science. Outside the classroom, this intersection of humanities and computing is particularly relevant for research and new modes of dissemination across archives, museums and libraries."
to:NB  books:noted  R  statistical_computing  network_data_analysis  text_mining  spatial_statistics  books:owned  electronic_copy 
october 2015 by cshalizi
R interface to the Gastner-Newman cartogram code. I haven't tried it out yet.
R  visual_display_of_quantitative_information  cartograms 
july 2015 by cshalizi
CRAN - Package AlgDesign
"Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D,A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact."
R  experimental_design  statistics  to_teach:undergrad-ADA 
april 2015 by cshalizi
CRAN - Package BatchJobs
"Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine. Multicore and SSH systems are also supported."
to_read  R  programming  to_teach:statcomp 
april 2015 by cshalizi
CRAN - Package markovchain
"Functions and S4 methods to create and manage discrete time Markov chains (DTMC) more easily. In addition functions to perform statistical (fitting and drawing random variates) and probabilistic (analysis of DTMC proprieties) analysis are provided."
markov_models  R  to_teach:statcomp 
march 2015 by cshalizi
r - knitr - How to align code and plot side by side - Stack Overflow
Could this be modified to put figure on top, then code, then caption?
R  knitr  re:ADAfaEPoV 
january 2015 by cshalizi
[1409.3531] Object-Oriented Programming, Functional Programming and R
"This paper reviews some programming techniques in R that have proved useful, particularly for substantial projects. These include several versions of object-oriented programming, used in a large number of R packages. The review tries to clarify the origins and ideas behind the various versions, each of which is valuable in the appropriate context. R has also been strongly influenced by the ideas of functional programming and, in particular, by the desire to combine functional with object oriented programming. To clarify how this particular mix of ideas has turned out in the current R language and supporting software, the paper will first review the basic ideas behind object-oriented and functional programming, and then examine the evolution of R with these ideas providing context. Functional programming supports well-defined, defensible software giving reproducible results. Object-oriented programming is the mechanism par excellence for managing complexity while keeping things simple for the user. The two paradigms have been valuable in supporting major software for fitting models to data and numerous other statistical applications. The paradigms have been adopted, and adapted, distinctively in R. Functional programming motivates much of R but R does not enforce the paradigm. Object-oriented programming from a functional perspective differs from that used in non-functional languages, a distinction that needs to be emphasized to avoid confusion. R initially replicated the S language from Bell Labs, which in turn was strongly influenced by earlier program libraries. At each stage, new ideas have been added, but the previous software continues to show its influence in the design as well. Outlining the evolution will further clarify why we currently have this somewhat unusual combination of ideas."
to:NB  to_read  programming  R  chambers.john  to_teach:statcomp 
january 2015 by cshalizi
sjewo/readstata13 · GitHub
Reading (and eventually writing, but I don't care about that) data files from Stata version 13+, because the "foreign" library in R can only handle Stata 12-, and guess what someone's replication files are in? The things I do for you kids.

ETA: only _some_ of their .dta files are 13+, about half of what I need are 12-, with of course no indication in the file names.
january 2015 by cshalizi
Notifications from R | The stupidest thing...
Couldn't one just do a system call to mail(1), rather than GIVING AN R SCRIPT YOUR PASSWORD?
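--- Something like this, say (untested sketch; the recipient address is a placeholder, and a working local mail(1)/MTA is assumed):

```r
# Pipe a message into mail(1) via the shell, so no credentials ever live
# in the R script.
notify <- function(subject, body = "", to = "me@example.com") {
  system2("mail", args = c("-s", shQuote(subject), to), input = body)
}

# At the end of a long job:
# notify("simulation finished", sprintf("elapsed: %.1f s", elapsed))
```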
R  to_teach:statcomp  via:phnk 
september 2014 by cshalizi
Text Analysis with R for Students of Literature
"Text Analysis with R for Students of Literature is written with students and scholars of literature in mind but will be applicable to other humanists and social scientists wishing to extend their methodological tool kit to include quantitative and computational approaches to the study of text. Computation provides access to information in text that we simply cannot gather using traditional qualitative methods of close reading and human synthesis. Text Analysis with R for Students of Literature provides a practical introduction to computational text analysis using the open source programming language R. R is extremely popular throughout the sciences and because of its accessibility, R is now used increasingly in other research areas. Readers begin working with text right away and each chapter works through a new technique or process such that readers gain a broad exposure to core R procedures and a basic understanding of the possibilities of computational text analysis at both the micro and macro scale. Each chapter builds on the previous as readers move from small scale “microanalysis” of single texts to large scale “macroanalysis” of text corpora, and each chapter concludes with a set of practice exercises that reinforce and expand upon the chapter lessons. The book’s focus is on making the technical palatable and making the technical useful and immediately gratifying."
to:NB  books:noted  data_analysis  R  text_mining  humanities  electronic_copy  books:owned 
july 2014 by cshalizi
Statistical Analysis of Network Data with R
Supposedly aligned with Kolaczyk's textbook, and using the igraph package.
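(The flavor of the igraph workflow, in a three-line sketch:)

```r
library(igraph)

g <- sample_gnp(50, 0.1)  # Erdos-Renyi random graph on 50 nodes
degree(g)                 # degree sequence
transitivity(g)           # global clustering coefficient
```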
books:noted  network_data_analysis  statistics  R  kolaczyk.eric  in_NB  books:owned 
june 2014 by cshalizi
Scalable Strategies for Computing with Massive Data
"This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware."
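(The foreach idiom the paper describes, as a minimal sketch — same loop body runs sequentially or in parallel depending on what backend is registered:)

```r
library(foreach)
library(doParallel)  # one of several possible parallel backends

cl <- makeCluster(2)      # two worker processes
registerDoParallel(cl)
res <- foreach(i = 1:4, .combine = c) %dopar% i^2
stopCluster(cl)
res  # 1 4 9 16
```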
have_read  R  computational_statistics  data_analysis  in_NB  to_teach:statcomp 
february 2014 by cshalizi
[1401.6389] Parallel Optimisation of Bootstrapping in R
"Bootstrapping is a popular and computationally demanding resampling method used for measuring the accuracy of sample estimates and assisting with statistical inference. R is a freely available language and environment for statistical computing popular with biostatisticians for genomic data analyses. A survey of such R users highlighted its implementation of bootstrapping as a prime candidate for parallelization to overcome computational bottlenecks. The Simple Parallel R Interface (SPRINT) is a package that allows R users to exploit high performance computing in multi-core desktops and supercomputers without expert knowledge of such systems. This paper describes the parallelization of bootstrapping for inclusion in the SPRINT R package. Depending on the complexity of the bootstrap statistic and the number of resamples, this implementation has close to optimal speed up on up to 16 nodes of a supercomputer and close to 100 on 512 nodes. This performance in a multi-node setting compares favourably with an existing parallelization option in the native R implementation of bootstrapping."
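(The "existing parallelization option in the native R implementation" the abstract compares against is the `parallel` argument to `boot::boot()`; a sketch — `"multicore"` forking is not available on Windows:)

```r
library(boot)

mean_stat <- function(d, idx) mean(d[idx])  # statistic over a resample
x <- rnorm(100)
b <- boot(x, mean_stat, R = 2000, parallel = "multicore", ncpus = 2)
b$t0  # the statistic on the original sample
```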
to:NB  bootstrap  parallel_computing  computational_statistics  R  to_teach:statcomp 
february 2014 by cshalizi
[1402.1894] R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
"Nolan and Temple Lang argue that "the ability to express statistical computations is an essential skill." A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation."
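(What the students actually write — a minimal .Rmd sketch; knitting it re-runs the analysis and regenerates the report:)

````markdown
---
title: "Minimal reproducible report"
output: html_document
---

The mean of ten standard normals:

```{r}
mean(rnorm(10))
```
````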
to:NB  R  teaching  statistics  to_teach:statcomp 
february 2014 by cshalizi
algorithm - Generate a list of primes in R up to a certain number - Stack Overflow
Not actually _about_ hashing, but helpful for creating random hash functions.
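(A base-R sieve of Eratosthenes of the sort the answers there give — a sketch, not the accepted answer verbatim:)

```r
# Sieve of Eratosthenes: all primes up to n
primes_up_to <- function(n) {
  if (n < 2) return(integer(0))
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE
  i <- 2
  while (i * i <= n) {
    if (is_prime[i]) is_prime[seq(i * i, n, by = i)] <- FALSE
    i <- i + 1
  }
  which(is_prime)
}

primes_up_to(30)  # 2 3 5 7 11 13 17 19 23 29
```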
R  primes  programming  hashing 
january 2014 by cshalizi
CRAN - Package alabama
"Augmented Lagrangian Adaptive Barrier Minimization Algorithm for optimizing smooth nonlinear objective functions with constraints. Linear or nonlinear equality and inequality constraints are allowed."

- Well, that solved my problem.
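(The problem it solved for me looked roughly like this — `auglag()` with an inequality constraint, where `hin` must return nonnegative values at feasible points:)

```r
library(alabama)

fn  <- function(x) (x[1] - 1)^2 + (x[2] - 2)^2  # smooth objective
hin <- function(x) 2 - x[1] - x[2]              # constraint: x1 + x2 <= 2

res <- auglag(par = c(0, 0), fn = fn, hin = hin)
res$par  # should approach c(0.5, 1.5), the constrained minimum
```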
optimization  R  to_teach:statcomp 
december 2013 by cshalizi
Literate Testing in R | Data Analysis Visually Enforced
Nicely-named functions for testing some kinds of numerical properties.
R  programming  to_teach:statcomp 
november 2013 by cshalizi
CRAN - Package hash
"This package implements a data structure similar to hashes in Perl and dictionaries in Python but with a purposefully R flavor. For objects of appreciable size, access using hashes outperforms native named lists and vectors."
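(Basic usage — `hash()`, `[[`-assignment, `has.key()`:)

```r
library(hash)

h <- hash()
h[["alpha"]] <- 1
h[["beta"]]  <- c(2, 3)

h[["alpha"]]        # 1
has.key("beta", h)  # TRUE
keys(h)             # the stored keys
```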
R  programming  to_teach:statcomp  hashing 
october 2013 by cshalizi
R: Create hash function digests for arbitrary R objects
"The digest function applies a cryptographical hash function to arbitrary R objects. By default, the objects are internally serialized, and either one of the currently implemented MD5 and SHA-1 hash functions algorithms can be used to compute a compact digest of the serialized object."

-- but I want very crude hashing...
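(The package call, plus the sort of crude hand-rolled thing I actually want — serialize the object, sum the bytes, take a modulus; the bucket count is arbitrary:)

```r
library(digest)

# Cryptographic digest of an arbitrary (serialized) R object
digest(mtcars, algo = "md5")

# Much cruder: serialize, sum the bytes, reduce mod a small prime
crude_hash <- function(obj, buckets = 97L) {
  bytes <- as.integer(serialize(obj, connection = NULL))
  sum(bytes) %% buckets
}
crude_hash(mtcars)
```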
R  programming  hashing 
october 2013 by cshalizi
Christopher Gandrud (간드루드 크리스토파): Slide: one function for lag/lead variables in data frames, including time-series cross-sectional data
"I often want to quickly create a lag or lead variable in an R data frame. Sometimes I also want to create the lag or lead variable for different groups in a data frame, for example, if I want to lag GDP for each country in a data frame.
"I've found the various R methods for doing this hard to remember and usually need to look at old blog posts. Any time we find ourselves using the same series of codes over and over, it's probably time to put them into a function."

--- I think this might make a good exercise for statistical computing.
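(One base-R solution a student might produce, using `ave()` to lag within groups; column names here are made up:)

```r
# Lag a variable within groups; NA pads the first k observations
# of each group.
group_lag <- function(x, group, k = 1) {
  ave(x, group, FUN = function(v) c(rep(NA, k), head(v, -k)))
}

df <- data.frame(country = rep(c("A", "B"), each = 3),
                 gdp     = 1:6)
df$gdp_lag <- group_lag(df$gdp, df$country)
df  # gdp_lag: NA 1 2 within A, NA 4 5 within B
```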
R  time_series  to_teach:statcomp 
july 2013 by cshalizi
My Stat Bytes talk, with slides and code | Nathan VanHoudnos
"I will present a grab bag of tricks to speed up your R code. Topics will include: installing an optimized BLAS, how to profile your R code to find which parts are slow, replacing slow code with inline C/C++, and running code in parallel on multiple cores. My running example will be fitting a 2PL IRT model with a hand coded MCMC sampler. The idea is to start with naive, pedagogically clear code and end up with fast, production quality code."
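(The profiling step, at least, needs nothing beyond base R — `Rprof()` is the built-in sampling profiler; the workload here is a throwaway stand-in:)

```r
# Profile a chunk of code to find where the time goes
Rprof("profile.out")
x <- replicate(500, sum(sort(runif(1e5))))  # stand-in workload
Rprof(NULL)
summaryRprof("profile.out")$by.self  # functions ranked by own time
```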
kith_and_kin  computational_statistics  R  vanhoudnos.nathan  to_teach:statcomp 
june 2013 by cshalizi
Bayesian Networks in R - with Applications in Systems Biology
"introduces the reader to the essential concepts in Bayesian network modeling and inference in conjunction with examples in the open-source statistical environment R. The level of sophistication is gradually increased across the chapters with exercises and solutions for enhanced understanding and hands-on experimentation of key concepts. Applications focus on systems biology with emphasis on modeling pathways and signaling mechanisms from high throughput molecular data. Bayesian networks have proven to be especially useful abstractions in this regards as exemplified by their ability to discover new associations while validating known ones. It is also expected that the prevalence of publicly available high-throughput biological and healthcare data sets may encourage the audience to explore investigating novel paradigms using the approaches presented in the book."
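(The book works in the bnlearn package; a minimal structure-learning sketch on one of its bundled datasets:)

```r
library(bnlearn)

data(learning.test)           # small discrete dataset shipped with bnlearn
dag <- hc(learning.test)      # hill-climbing structure learning
fit <- bn.fit(dag, learning.test)  # then fit the local distributions
```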
to:NB  books:noted  graphical_models  statistics  R 
june 2013 by cshalizi