**cshalizi + to_teach:undergrad-ada**
437

[1909.06539] Not again! Data Leakage in Digital Pathology

10 weeks ago by cshalizi

"Bioinformatics of high throughput omics data (e.g. microarrays and proteomics) has been plagued by uncountable issues with reproducibility at the start of the century. Concerns have motivated international initiatives such as the FDA's led MAQC Consortium, addressing reproducibility of predictive biomarkers by means of appropriate Data Analysis Plans (DAPs). For instance, repreated cross-validation is a standard procedure meant at mitigating the risk that information from held-out validation data may be used during model selection. We prove here that, many years later, Data Leakage can still be a non-negligible overfitting source in deep learning models for digital pathology. In particular, we evaluate the impact of (i) the presence of multiple images for each subject in histology collections; (ii) the systematic adoption of training over collection of subregions (i.e. "tiles" or "patches") extracted for the same subject. We verify that accuracy scores may be inflated up to 41%, even if a well-designed 10x5 iterated cross-validation DAP is applied, unless all images from the same subject are kept together either in the internal training or validation splits. Results are replicated for 4 classification tasks in digital pathology on 3 datasets, for a total of 373 subjects, and 543 total slides (around 27, 000 tiles). Impact of applying transfer learning strategies with models pre-trained on general-purpose or digital pathology datasets is also discussed."

to:NB
cross-validation
statistics
bad_data_analysis
to_teach:undergrad-ADA
to_teach:data-mining
10 weeks ago by cshalizi

[1901.01241] Nonparametric Instrumental Variables Estimation Under Misspecification

august 2019 by cshalizi

"We show that nonparametric instrumental variables estimators are highly sensitive to misspecification: an arbitrarily small deviation from instrumental validity can lead to large asymptotic bias for a broad class of estimators. The problem is mitigated if strong restrictions on the structural function are imposed in estimation. However, if the true function does not obey the restrictions, then imposing them imparts bias. Therefore, there is a trade-off between the sensitivity to invalid instruments and bias from imposing excessive restrictions. We propose a partial identification approach that allows a researcher to explicitly and transparently examine this trade-off and make inferences about the structural function that are valid under a small failure of instrumental validity. We construct a simple, consistent estimator of the identified set. We apply our methods to the empirical setting of Blundell et al. (2007) and Horowitz (2011) to estimate shape-invariant Engel curves."

to:NB
instrumental_variables
causal_inference
nonparametrics
statistics
to_teach:undergrad-ADA
august 2019 by cshalizi

Bootstrap Methods in Econometrics | Annual Review of Economics

august 2019 by cshalizi

"The bootstrap is a method for estimating the distribution of an estimator or test statistic by resampling one's data or a model estimated from the data. Under conditions that hold in a wide variety of econometric applications, the bootstrap provides approximations to distributions of statistics, coverage probabilities of confidence intervals, and rejection probabilities of hypothesis tests that are more accurate than the approximations of first-order asymptotic distribution theory. The reductions in the differences between true and nominal coverage or rejection probabilities can be very large. In addition, the bootstrap provides a way to carry out inference in certain settings where obtaining analytic distributional approximations is difficult or impossible. This article explains the usefulness and limitations of the bootstrap in contexts of interest in econometrics. The presentation is informal and expository. It provides an intuitive understanding of how the bootstrap works. Mathematical details are available in the references that are cited."

to:NB
bootstrap
statistics
economics
to_teach:undergrad-ADA
horowitz.joel
august 2019 by cshalizi

Evaluating Probabilistic Forecasts with scoringRules | Jordan | Journal of Statistical Software

august 2019 by cshalizi

"Probabilistic forecasts in the form of probability distributions over future events have become popular in several fields including meteorology, hydrology, economics, and demography. In typical applications, many alternative statistical models and data sources can be used to produce probabilistic forecasts. Hence, evaluating and selecting among competing methods is an important task. The scoringRules package for R provides functionality for comparative evaluation of probabilistic models based on proper scoring rules, covering a wide range of situations in applied work. This paper discusses implementation and usage details, presents case studies from meteorology and economics, and points to the relevant background literature."

to:NB
prediction
statistics
to_teach:undergrad-ADA
to_teach:data-mining
august 2019 by cshalizi

Free trade and opioid overdose death in the United States - ScienceDirect

july 2019 by cshalizi

"Opioid overdose deaths in the U.S. rose dramatically after 1999, but also exhibited substantial geographic variation. This has largely been explained by differential availability of prescription and non-prescription opioids, including heroin and fentanyl. Recent studies explore the underlying role of socioeconomic factors, but overlook the influence of job loss due to international trade, an economic phenomenon that disproportionately harms the same regions and demographic groups at the heart of the opioid epidemic. We used OLS regression and county-year level data from the Centers for Disease Controls and the Department of Labor to test the association between trade-related job loss and opioid-related overdose death between 1999 and 2015. We find that the loss of 1000 trade-related jobs was associated with a 2.7 percent increase in opioid-related deaths. When fentanyl was present in the heroin supply, the same number of job losses was associated with a 11.3 percent increase in opioid-related deaths."

--- I'm very skeptical about OLS here. Something like nearest neighbors would be better here, but I'm not sure how to handle spatial correlation.

to:NB
to_read
drugs
whats_gone_wrong_with_america
class_struggles_in_america
econometrics
statistics
globalization
to_teach:data_over_space_and_time
to_teach:undergrad-ADA
causal_inference
july 2019 by cshalizi

Scalable Visualization Methods for Modern Generalized Additive Models: Journal of Computational and Graphical Statistics: Vol 0, No 0

july 2019 by cshalizi

"In the last two decades, the growth of computational resources has made it possible to handle generalized additive models (GAMs) that formerly were too costly for serious applications. However, the growth in model complexity has not been matched by improved visualizations for model development and results presentation. Motivated by an industrial application in electricity load forecasting, we identify the areas where the lack of modern visualization tools for GAMs is particularly severe, and we address the shortcomings of existing methods by proposing a set of visual tools that (a) are fast enough for interactive use, (b) exploit the additive structure of GAMs, (c) scale to large data sets, and (d) can be used in conjunction with a wide range of response distributions. The new visual methods proposed here are implemented by the mgcViz R package, available on the Comprehensive R Archive Network. Supplementary materials for this article are available online."

to:NB
additive_models
visual_display_of_quantitative_information
computational_statistics
statistics
R
to_teach:undergrad-ADA
july 2019 by cshalizi

Life after Lead: Effects of Early Interventions for Children Exposed to Lead

june 2019 by cshalizi

"Lead pollution is consistently linked to cognitive and behavioral impairments, yet little is known about the benefits of public health interventions for children exposed to lead. This paper estimates the long-term impacts of early-life interventions (e.g. lead remediation, nutritional assessment, medical evaluation, developmental surveillance, and public assistance referrals) recommended for lead-poisoned children. Using linked administrative data from Charlotte, NC, we compare outcomes for children who are similar across observable characteristics but differ in eligibility for intervention due to blood lead test results. We find that the negative outcomes previously associated with early-life exposure can largely be reversed by intervention."

--- The last tag, as usual, is conditional on liking the paper after reading it, and on replication data being available.

to:NB
to_read
lead
cognitive_development
sociology
causal_inference
to_teach:undergrad-ADA
june 2019 by cshalizi

Interpreting and Understanding Logits, Probits, and Other Nonlinear Probability Models | Annual Review of Sociology

may 2019 by cshalizi

"Methods textbooks in sociology and other social sciences routinely recommend the use of the logit or probit model when an outcome variable is binary, an ordered logit or ordered probit when it is ordinal, and a multinomial logit when it has more than two categories. But these methodological guidelines take little or no account of a body of work that, over the past 30 years, has pointed to problematic aspects of these nonlinear probability models and, particularly, to difficulties in interpreting their parameters. In this review, we draw on that literature to explain the problems, show how they manifest themselves in research, discuss the strengths and weaknesses of alternatives that have been suggested, and point to lines of further analysis."

to:NB
statistics
classifiers
bad_data_analysis
to_teach:undergrad-ADA
may 2019 by cshalizi

estimation - Variance of a sample covariance for normal variables - Cross Validated

may 2019 by cshalizi

To make into an exercise. (As one of the answers points out, there is nothing here which turns on using a Gaussian distribution.)
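
A possible simulation check for such an exercise, sketched under stated assumptions: for n i.i.d. bivariate Gaussian pairs with unit variances and correlation rho, the variance of the sample covariance (with the n-1 divisor) is (1 + rho^2)/(n - 1). The Gaussian case is used here only because it gives a simple closed form to check against; the function names and parameter values are hypothetical.

```python
import math
import random
import statistics

def sample_cov(xs, ys):
    """Unbiased sample covariance (divisor n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def simulate_var_of_cov(n=10, rho=0.5, reps=20000, seed=0):
    """Monte Carlo estimate of Var(sample covariance) for correlated Gaussians."""
    rng = random.Random(seed)
    covs = []
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            x = rng.gauss(0, 1)
            e = rng.gauss(0, 1)
            xs.append(x)
            # y has unit variance and correlation rho with x.
            ys.append(rho * x + math.sqrt(1 - rho ** 2) * e)
        covs.append(sample_cov(xs, ys))
    return statistics.variance(covs)

n, rho = 10, 0.5
simulated = simulate_var_of_cov(n, rho)
theory = (1 + rho ** 2) / (n - 1)
print(simulated, theory)
```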

probability
statistics
to_teach:undergrad-ADA
to_teach:linear_models
to_teach:data_over_space_and_time
may 2019 by cshalizi

[1904.02438] Cross-Validation for Correlated Data

april 2019 by cshalizi

"K-fold cross-validation (CV) with squared error loss is widely used for evaluating predictive models, especially when strong distributional data assumptions cannot be taken. However, CV with squared error loss is not free from distributional assumptions, in particular in cases involving non-i.i.d data. This paper analyzes CV for correlated data. We present a criterion for suitability of CV, and introduce a bias corrected cross-validation prediction error estimator, CVc, which is suitable in many settings involving correlated data, where CV is invalid. Our theoretical results are also demonstrated numerically."

to:NB
statistics
cross-validation
time_series
rosset.saharon
to_teach:undergrad-ADA
to_teach:data_over_space_and_time
april 2019 by cshalizi

The Bias Is Built In: How Administrative Records Mask Racially Biased Policing by Dean Knox, Will Lowe, Jonathan Mummolo :: SSRN

february 2019 by cshalizi

"Researchers often lack the necessary data to credibly estimate racial bias in policing. In particular, police administrative records lack information on civilians that police observe but do not investigate. In this paper, we show that if police racially discriminate when choosing whom to investigate, using administrative records to estimate racial bias in police behavior amounts to post-treatment conditioning, and renders many quantities of interest unidentified---even among investigated individuals---absent strong and untestable assumptions. In most cases, no set of controls can eliminate this statistical bias, the exact form of which we derive through principal stratification in a causal mediation framework. We develop a bias-correction procedure and nonparametric sharp bounds for race effects, replicate published findings, and show traditional estimation techniques can severely underestimate levels of racially biased policing or even mask discrimination entirely. We conclude by outlining a general and feasible design for future studies that is robust to this inferential snare."

to:NB
to_read
causal_inference
police
discrimination
statistics
to_teach:undergrad-ADA
via:henry_farrell
february 2019 by cshalizi

Recognising when you don’t know - Biased and Inefficient

february 2019 by cshalizi

(Some nice shade is thrown on the difference between machine learning and statistics --- excuse me, "data science".)

classifiers
mushrooms
statistics
to_teach:data-mining
to_teach:undergrad-ADA
lumley.thomas
february 2019 by cshalizi

The Taxing Deed of Globalization

january 2019 by cshalizi

"This paper examines the effects of globalization on the distribution of worker-specific labor taxes using a unique set of tax calculators. We find a differential effect of higher trade and factor mobility on relative tax burdens in 1980–1993 versus 1994–2007 in the OECD. Prior to 1994, greater openness meant that higher income earners were taxed progressively more. However, after 1994, we document a globalization-induced rise in the labor income tax burden of the middle class, while the top 1 percent of workers and employees faced a reduction in their tax burden of 0.59–1.45 percentage points."

to:NB
globalization
economics
to_teach:undergrad-ADA
january 2019 by cshalizi

The association between adolescent well-being and digital technology use | Nature Human Behaviour

january 2019 by cshalizi

"The widespread use of digital technologies by young people has spurred speculation that their regular use negatively impacts psychological well-being. Current empirical evidence supporting this idea is largely based on secondary analyses of large-scale social datasets. Though these datasets provide a valuable resource for highly powered investigations, their many variables and observations are often explored with an analytical flexibility that marks small effects as statistically significant, thereby leading to potential false positives and conflicting results. Here we address these methodological challenges by applying specification curve analysis (SCA) across three large-scale social datasets (total n = 355,358) to rigorously examine correlational evidence for the effects of digital technology on adolescents. The association we find between digital technology use and adolescent well-being is negative but small, explaining at most 0.4% of the variation in well-being. Taking the broader context of the data into account suggests that these effects are too small to warrant policy change."

--- This sounds awesome, but will need to be read carefully.

to:NB
to_read
networked_life
sociology
statistics
model_checking
to_teach:undergrad-ADA
to_be_shot_after_a_fair_trial
re:actually-dr-internet-is-the-name-of-the-monsters-creator
january 2019 by cshalizi

Youth-Parent Socialization Panel Study, 1965-1997: Four Waves Combined

january 2019 by cshalizi

"The Youth-Parent Socialization Panel Study is a series of surveys designed to assess political continuity and change across time for biologically-related generations and to gauge the impact of life-stage events and historical trends on the behaviors and attitudes of respondents. A national sample of high school seniors and their parents was first surveyed in 1965. Subsequent surveys of the same individuals were conducted in 1973, 1982, and 1997. This data collection combines all four waves of youth data for the study. The general objective of the data collection was to study the dynamics of political attitudes and behaviors by obtaining data on the same individuals as they aged from approximately 18 years of age in 1965 to 50 years of age in 1997. Especially when combined with other elements of the study as released in other ICPSR collections in the Youth Studies Series, this data collection facilitates the analysis of generational, life cycle, and historical effects and political influences on relationships within the family. This data collection also has several distinctive properties. First, it is a longitudinal study of a particular cohort, a national sample from the graduating high school class of 1965. Second, it captures the respondents at key points in their life stages -- at ages 18, 26, 35, and 50. Third, the dataset contains many replicated measures over time as well as some measures unique to each data point. Fourth, there is detailed information about the respondents' life histories. Background variables include age, sex, religious orientation, level of religious participation, marital status, ethnicity, educational status and background, place of residence, family income, and employment status."

--- Used in Rochon's book about value change, in a way which would make it a good case study for propensity-score matching (which Rochon did _not_ do, confounding his inferences). Query, can I get access via CMU, or are we not part of the consortium?

data_sets
us_politics
public_opinion
to_teach:undergrad-ADA
january 2019 by cshalizi

Robots at Work | The Review of Economics and Statistics | MIT Press Journals

january 2019 by cshalizi

"We analyze for the first time the economic contributions of modern industrial robots, which are flexible, versatile, and autonomous machines. We use novel panel data on robot adoption within industries in seventeen countries from 1993 to 2007 and new instrumental variables that rely on robots’ comparative advantage in specific tasks. Our findings suggest that increased robot use contributed approximately 0.36 percentage points to annual labor productivity growth, while at the same time raising total factor productivity and lowering output prices. Our estimates also suggest that robots did not significantly reduce total employment, although they did reduce low-skilled workers’ employment share."

--- Last tag for the instrumental variables (if they look sensible and perhaps especially if they do not)

to:NB
economics
instrumental_variables
robots_and_robotics
to_teach:undergrad-ADA
january 2019 by cshalizi

Confidence intervals for GLMs

december 2018 by cshalizi

For the trick about finding the inverse link function.
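
A minimal sketch of what the trick is presumably for: build the interval on the link (linear predictor) scale, then push both endpoints through the inverse link, so the interval respects the range of the response. In R the inverse link is stored on the model's family object as `family(fit)$linkinv`; the Python below hard-codes the logistic inverse link, and the fitted value and standard error are made-up numbers.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1 / (1 + math.exp(-eta))

def glm_ci(eta_hat, se_eta, linkinv=inv_logit, z=1.96):
    """Wald interval on the link scale, mapped back to the response scale."""
    ends = sorted([linkinv(eta_hat - z * se_eta), linkinv(eta_hat + z * se_eta)])
    return linkinv(eta_hat), ends[0], ends[1]

# Hypothetical fitted linear predictor and its standard error.
p_hat, p_lo, p_hi = glm_ci(eta_hat=2.0, se_eta=0.8)

# For contrast: a delta-method interval built directly on the response scale.
naive_hi = p_hat + 1.96 * p_hat * (1 - p_hat) * 0.8
print(p_lo, p_hat, p_hi, naive_hi)
```

With these numbers the response-scale upper limit exceeds 1, while the back-transformed interval stays inside (0, 1).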

regression
R
to_teach:undergrad-ADA
via:kjhealy
december 2018 by cshalizi

The Effect of Media Coverage on Mass Shootings | IZA - Institute of Labor Economics

december 2018 by cshalizi

"Can media coverage of shooters encourage future mass shootings? We explore the link between the day-to-day prime time television news coverage of shootings on ABC World News Tonight and subsequent mass shootings in the US from January 1, 2013 to June 23, 2016. To circumvent latent endogeneity concerns, we employ an instrumental variable strategy: worldwide disaster deaths provide an exogenous variation that systematically crowds out shooting-related coverage. Our findings consistently suggest a positive and statistically significant effect of coverage on the number of subsequent shootings, lasting for 4-10 days. At its mean, news coverage is suggested to cause approximately three mass shootings in the following week, which would explain 55 percent of all mass shootings in our sample. Results are qualitatively consistent when using (i) additional keywords to capture shooting-related news coverage, (ii) alternative definitions of mass shootings, (iii) the number of injured or killed people as the dependent variable, and (iv) an alternative, longer data source for mass shootings from 2006-2016."

to:NB
to_read
contagion
causal_inference
to_teach:undergrad-ADA
to_be_shot_after_a_fair_trial
previous_tag_was_in_poor_taste
december 2018 by cshalizi

How to forecast an American’s vote - All politics is identity politics

november 2018 by cshalizi

This looks like a nice case-study for when I teach logistic regression in the spring, provided there's replication data. It'd be even better if there was a follow-up on how well this actually predicted!

track_down_references
logistic_regression
us_politics
to_teach:undergrad-ADA
november 2018 by cshalizi

Analyze and Create Elegant Directed Acyclic Graphs • ggdag

august 2018 by cshalizi

"ggdag: An R Package for visualizing and analyzing directed acyclic graphs"

R
graphical_models
visual_display_of_quantitative_information
via:arsyed
to_teach:undergrad-ADA
re:ADAfaEPoV
august 2018 by cshalizi

General Resampling Infrastructure • rsample

august 2018 by cshalizi

"rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:

"traditional resampling techniques for estimating the sampling distribution of a statistic and

"estimating model performance using a holdout set

"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."

to:NB
R
computational_statistics
to_teach:statcomp
to_teach:undergrad-ADA
via:?
"traditional resampling techniques for estimating the sampling distribution of a statistic and

"estimating model performance using a holdout set

"The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set but does not include code for modeling or calculating statistics. The “Working with Resample Sets” vignette gives demonstrations of how rsample tools can be used."

august 2018 by cshalizi

[1706.08576] Invariant Causal Prediction for Nonlinear Models

may 2018 by cshalizi

"An important problem in many domains is to predict how a system will respond to interventions. This task is inherently linked to estimating the system's underlying causal structure. To this end, 'invariant causal prediction' (ICP) (Peters et al., 2016) has been proposed which learns a causal model exploiting the invariance of causal relations using data from different environments. When considering linear models, the implementation of ICP is relatively straight-forward. However, the nonlinear case is more challenging due to the difficulty of performing nonparametric tests for conditional independence. In this work, we present and evaluate an array of methods for nonlinear and nonparametric versions of ICP for learning the causal parents of given target variables. We find that an approach which first fits a nonlinear model with data pooled over all environments and then tests for differences between the residual distributions across environments is quite robust across a large variety of simulation settings. We call this procedure "Invariant residual distribution test". In general, we observe that the performance of all approaches is critically dependent on the true (unknown) causal structure and it becomes challenging to achieve high power if the parental set includes more than two variables. As a real-world example, we consider fertility rate modelling which is central to world population projections. We explore predicting the effect of hypothetical interventions using the accepted models from nonlinear ICP. The results reaffirm the previously observed central causal role of child mortality rates."

to:NB
causal_inference
causal_discovery
statistics
regression
prediction
peters.jonas
meinshausen.nicolai
to_read
heard_the_talk
to_teach:undergrad-ADA
re:ADAfaEPoV
may 2018 by cshalizi

[1501.01332] Causal inference using invariant prediction: identification and confidence intervals

may 2018 by cshalizi

"What is the difference of a prediction that is made with a causal model and a non-causal model? Suppose we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (for example various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments."

to:NB
to_read
causal_inference
causal_discovery
statistics
prediction
regression
buhlmann.peter
meinshausen.nicolai
peters.jonas
heard_the_talk
re:ADAfaEPoV
to_teach:undergrad-ADA
may 2018 by cshalizi

Family Ruptures, Stress, and the Mental Health of the Next Generation

march 2018 by cshalizi

"This paper studies how in utero exposure to maternal stress from family ruptures affects later mental health. We find that prenatal exposure to the death of a maternal relative increases take-up of ADHD medications during childhood and anti-anxiety and depression medications in adulthood. Further, family ruptures during pregnancy depress birth outcomes and raise the risk of perinatal complications necessitating hospitalization. Our results suggest large welfare gains from preventing fetal stress from family ruptures and possibly from economically induced stressors such as unemployment. They further suggest that greater stress exposure among the poor may partially explain the intergenerational persistence of poverty."

See also an important comment (http://dx.doi.org/10.1257/aer.20161124) and reply (http://dx.doi.org/10.1257/aer.20161605) --- potentially the makings of a very good problem set, if data &c. check out.

to:NB
causal_inference
inequality
economics
to_teach:undergrad-ADA
march 2018 by cshalizi

How Do Hours Worked Vary with Income? Cross-Country Evidence and Implications

january 2018 by cshalizi

"This paper builds a new internationally comparable database of hours worked to measure how hours vary with income across and within countries. We document that average hours worked per adult are substantially higher in low-income countries than in high-income countries. The pattern of decreasing hours with aggregate income holds for both men and women, for adults of all ages and education levels, and along both the extensive and intensive margin. Within countries, hours worked per worker are also decreasing in the individual wage for most countries, though in the richest countries, hours worked are flat or increasing in the wage. One implication of our findings is that aggregate productivity and welfare differences across countries are larger than currently thought."

--- Last tag depends on availability of replication data.

to:NB
economics
labor
to_teach:undergrad-ADA
january 2018 by cshalizi

[1711.07137] Nonparametric Double Robustness

january 2018 by cshalizi

"Use of nonparametric techniques (e.g., machine learning, kernel smoothing, stacking) are increasingly appealing because they do not require precise knowledge of the true underlying models that generated the data under study. Indeed, numerous authors have advocated for their use with standard methods (e.g., regression, inverse probability weighting) in epidemiology. However, when used in the context of such singly robust approaches, nonparametric methods can lead to suboptimal statistical properties, including inefficiency and no valid confidence intervals. Using extensive Monte Carlo simulations, we show how doubly robust methods offer improvements over singly robust approaches when implemented via nonparametric methods. We use 10,000 simulated samples and 50, 100, 200, 600, and 1200 observations to investigate the bias and mean squared error of singly robust (g Computation, inverse probability weighting) and doubly robust (augmented inverse probability weighting, targeted maximum likelihood estimation) estimators under four scenarios: correct and incorrect model specification; and parametric and nonparametric estimation. As expected, results show best performance with g computation under correctly specified parametric models. However, even when based on complex transformed covariates, double robust estimation performs better than singly robust estimators when nonparametric methods are used. Our results suggest that nonparametric methods should be used with doubly instead of singly robust estimation techniques."

to:NB
statistics
causal_inference
estimation
nonparametrics
to_teach:undergrad-ADA
kith_and_kin
january 2018 by cshalizi

Quantitative historical analysis uncovers a single dimension of complexity that structures global variation in human social organization

january 2018 by cshalizi

"Do human societies from around the world exhibit similarities in the way that they are structured, and show commonalities in the ways that they have evolved? These are long-standing questions that have proven difficult to answer. To test between competing hypotheses, we constructed a massive repository of historical and archaeological information known as “Seshat: Global History Databank.” We systematically coded data on 414 societies from 30 regions around the world spanning the last 10,000 years. We were able to capture information on 51 variables reflecting nine characteristics of human societies, such as social scale, economy, features of governance, and information systems. Our analyses revealed that these different characteristics show strong relationships with each other and that a single principal component captures around three-quarters of the observed variation. Furthermore, we found that different characteristics of social complexity are highly predictable across different world regions. These results suggest that key aspects of social organization are functionally related and do indeed coevolve in predictable ways. Our findings highlight the power of the sciences and humanities working together to rigorously test hypotheses about general rules that may have shaped human history."

--- Contributed, so the last tag applies very forcefully.

to:NB
to_read
comparative_history
complexity_measures
principal_components
to_teach:undergrad-ADA
to_be_shot_after_a_fair_trial
january 2018 by cshalizi

Capturing the Dynamical Repertoire of Single Neurons with Generalized Linear Models | Neural Computation | MIT Press Journals

december 2017 by cshalizi

"A key problem in computational neuroscience is to find simple, tractable models that are nevertheless flexible enough to capture the response properties of real neurons. Here we examine the capabilities of recurrent point process models known as Poisson generalized linear models (GLMs). These models are defined by a set of linear filters and a point nonlinearity and are conditionally Poisson spiking. They have desirable statistical properties for fitting and have been widely used to analyze spike trains from electrophysiological recordings. However, the dynamical repertoire of GLMs has not been systematically compared to that of real neurons. Here we show that GLMs can reproduce a comprehensive suite of canonical neural response behaviors, including tonic and phasic spiking, bursting, spike rate adaptation, type I and type II excitation, and two forms of bistability. GLMs can also capture stimulus-dependent changes in spike timing precision and reliability that mimic those observed in real neurons, and can exhibit varying degrees of stochasticity, from virtually deterministic responses to greater-than-Poisson variability. These results show that Poisson GLMs can exhibit a wide range of dynamic spiking behaviors found in real neurons, making them well suited for qualitative dynamical as well as quantitative statistical studies of single-neuron and population response properties."

to:NB
neural_data_analysis
statistics
to_teach:undergrad-ADA
pillow.jonathan
december 2017 by cshalizi

Consistency without Inference: Instrumental Variables in Practical Application

november 2017 by cshalizi

"I use the bootstrap to study a comprehensive sample of 1400 instrumental variables regressions in 32 papers published in the journals of the American Economic Association. IV estimates are more often found to be falsely significant and more sensitive to outliers than OLS, while having a higher mean squared error around the IV population moment. There is little evidence that OLS estimates are substantively biased, while IV instruments often appear to be irrelevant. In addition, I find that established weak instrument pre-tests are largely uninformative and weak instrument robust methods generally perform no better or substantially worse than 2SLS."

to:NB
have_read
re:ADAfaEPoV
to_teach:undergrad-ADA
instrumental_variables
causal_inference
regression
statistics
econometrics
via:kjhealy
november 2017 by cshalizi

[1706.09141] Causal Structure Learning

november 2017 by cshalizi

"Graphical models can represent a multivariate distribution in a convenient and accessible form as a graph. Causal models can be viewed as a special class of graphical models that not only represent the distribution of the observed system but also the distributions under external interventions. They hence enable predictions under hypothetical interventions, which is important for decision making. The challenging task of learning causal models from data always relies on some underlying assumptions. We discuss several recently proposed structure learning algorithms and their assumptions, and compare their empirical performance under various scenarios."

to:NB
to_read
maathuis.marloes
causal_discovery
statistics
to_teach:undergrad-ADA
november 2017 by cshalizi

Community and the Crime Decline: The Causal Effect of Local Nonprofits on Violent Crime - American Sociological Review - Patrick Sharkey, Gerard Torrats-Espinosa, Delaram Takyar, 2017

november 2017 by cshalizi

"Largely overlooked in the theoretical and empirical literature on the crime decline is a long tradition of research in criminology and urban sociology that considers how violence is regulated through informal sources of social control arising from residents and organizations internal to communities. In this article, we incorporate the “systemic” model of community life into debates on the U.S. crime drop, and we focus on the role that local nonprofit organizations played in the national decline of violence from the 1990s to the 2010s. Using longitudinal data and a strategy to account for the endogeneity of nonprofit formation, we estimate the causal effect on violent crime of nonprofits focused on reducing violence and building stronger communities. Drawing on a panel of 264 cities spanning more than 20 years, we estimate that every 10 additional organizations focusing on crime and community life in a city with 100,000 residents leads to a 9 percent reduction in the murder rate, a 6 percent reduction in the violent crime rate, and a 4 percent reduction in the property crime rate."

- Last tag conditional on replication data.

to:NB
causal_inference
crime
institutions
via:rvenkat
to_teach:undergrad-ADA
november 2017 by cshalizi

Empirical prediction intervals improve energy forecasting

august 2017 by cshalizi

"Hundreds of organizations and analysts use energy projections, such as those contained in the US Energy Information Administration (EIA)’s Annual Energy Outlook (AEO), for investment and policy decisions. Retrospective analyses of past AEO projections have shown that observed values can differ from the projection by several hundred percent, and thus a thorough treatment of uncertainty is essential. We evaluate the out-of-sample forecasting performance of several empirical density forecasting methods, using the continuous ranked probability score (CRPS). The analysis confirms that a Gaussian density, estimated on past forecasting errors, gives comparatively accurate uncertainty estimates over a variety of energy quantities in the AEO, in particular outperforming scenario projections provided in the AEO. We report probabilistic uncertainties for 18 core quantities of the AEO 2016 projections. Our work frames how to produce, evaluate, and rank probabilistic forecasts in this setting. We propose a log transformation of forecast errors for price projections and a modified nonparametric empirical density forecasting method. Our findings give guidance on how to evaluate and communicate uncertainty in future energy outlooks."

--- It's probably presumptuous of me, but I am a bit proud, because the first author learned a lot of these methods from my class...

to:NB
to_read
heard_the_talk
energy
prediction
statistics
to_teach:undergrad-ADA
august 2017 by cshalizi
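
The core recipe in the abstract above (a Gaussian density estimated on past forecasting errors) can be sketched in a few lines. This is only an illustration, not the paper's code: the function name `gaussian_interval`, the sign convention (error = actual minus forecast), and the numbers are all assumptions made up for the example.

```python
from statistics import NormalDist, mean, stdev


def gaussian_interval(point_forecast, past_errors, level=0.95):
    """Prediction interval from a Gaussian fitted to past forecast errors.

    Assumes each past error is (actual - forecast), so the mean error
    is added to the new point forecast as a bias correction.
    """
    mu = mean(past_errors)
    sigma = stdev(past_errors)  # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = point_forecast + mu
    return center - z * sigma, center + z * sigma


# Made-up past errors, purely for illustration.
past_errors = [-1.0, 0.0, 1.0]
lo, hi = gaussian_interval(10.0, past_errors)
print(lo, hi)
```

The abstract also proposes a log transformation of errors for price projections; under that variant the same function would be applied to log errors and the endpoints exponentiated, which this sketch does not do.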

FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees

august 2017 by cshalizi

"Fast-and-frugal trees (FFTs) are simple algorithms that facilitate efficient and accurate decisions based on limited information. But despite their successful use in many applied domains, there is no widely available toolbox that allows anyone to easily create, visualize, and evaluate FFTs. We fill this gap by introducing the R package FFTrees. In this paper, we explain how FFTs work, introduce a new class of algorithms called fan for constructing FFTs, and provide a tutorial for using the FFTrees package. We then conduct a simulation across ten real-world datasets to test how well FFTs created by FFTrees can predict data. Simulation results show that FFTs created by FFTrees can predict data as well as popular classification algorithms such as regression and random forests, while remaining simple enough for anyone to understand and use."

--- I am skeptical about that "simple enough for anyone to understand and use"

to:NB
have_read
decision_trees
heuristics
cognitive_science
R
to_teach:undergrad-ADA
re:ADAfaEPoV
august 2017 by cshalizi

Evaluations | The Abdul Latif Jameel Poverty Action Lab

june 2017 by cshalizi

"Search our database of 841 randomized evaluations conducted by our affiliates in 80 countries. To browse summaries of key policy recommendations from a subset of these evaluations, visit the Policy Publications tab."

to:NB
causal_inference
experimental_economics
experimental_sociology
statistics
re:ADAfaEPoV
to_teach:undergrad-ADA
economics
june 2017 by cshalizi

Probabilistic model predicts dynamics of vegetation biomass in a desert ecosystem in NW China

june 2017 by cshalizi

"The temporal dynamics of vegetation biomass are of key importance for evaluating the sustainability of arid and semiarid ecosystems. In these ecosystems, biomass and soil moisture are coupled stochastic variables externally driven, mainly, by the rainfall dynamics. Based on long-term field observations in northwestern (NW) China, we test a recently developed analytical scheme for the description of the leaf biomass dynamics undergoing seasonal cycles with different rainfall characteristics. The probabilistic characterization of such dynamics agrees remarkably well with the field measurements, providing a tool to forecast the changes to be expected in biomass for arid and semiarid ecosystems under climate change conditions. These changes will depend—for each season—on the forecasted rate of rainy days, mean depth of rain in a rainy day, and duration of the season. For the site in NW China, the current scenario of an increase of 10% in rate of rainy days, 10% in mean rain depth in a rainy day, and no change in the season duration leads to forecasted increases in mean leaf biomass near 25% in both seasons."

--- Possible teaching example if data is available?

to:NB
ecology
to_teach:undergrad-ADA
june 2017 by cshalizi

Janzing , Balduzzi , Grosse-Wentrup , Schölkopf : Quantifying causal influences

december 2016 by cshalizi

"Many methods for causal inference generate directed acyclic graphs (DAGs) that formalize causal relations between n variables. Given the joint distribution on all these variables, the DAG contains all information about how intervening on one variable changes the distribution of the other n−1 variables. However, quantifying the causal influence of one variable on another one remains a nontrivial question.

"Here we propose a set of natural, intuitive postulates that a measure of causal strength should satisfy. We then introduce a communication scenario, where edges in a DAG play the role of channels that can be locally corrupted by interventions. Causal strength is then the relative entropy distance between the old and the new distribution.

"Many other measures of causal strength have been proposed, including average causal effect, transfer entropy, directed information, and information flow. We explain how they fail to satisfy the postulates on simple DAGs of ≤3 nodes. Finally, we investigate the behavior of our measure on time-series, supporting our claims with experiments on simulated data."

to:NB
graphical_models
time_series
causality
statistics
information_theory
to_read
re:ADAfaEPoV
to_teach:undergrad-ADA
december 2016 by cshalizi

[1311.5828] The Splice Bootstrap

december 2016 by cshalizi

"This paper proposes a new bootstrap method to compute predictive intervals for nonlinear autoregressive time series model forecast. This method we call the splice bootstrap as it involves splicing the last p values of a given series to a suitably simulated series. This ensures that each simulated series will have the same set of p time series values in common, a necessary requirement for computing conditional predictive intervals. Using simulation studies we show the method gives 90% intervals that are similar to those expected from theory for simple linear and SETAR model driven by normal and non-normal noise. Furthermore, we apply the method to some economic data and demonstrate the intervals compare favourably with cross-validation based intervals."

to:NB
bootstrap
time_series
statistics
prediction
to_teach:undergrad-ADA
re:ADAfaEPoV
to_read
december 2016 by cshalizi

Illness as indicator | The Economist

november 2016 by cshalizi

"Polling data suggests that on the whole, Mr Trump’s supporters are not particularly down on their luck: within any given level of educational attainment, higher-income respondents are more likely to vote Republican. But what the geographic numbers do show is that the specific subset of Mr Trump’s voters that won him the election—those in counties where he outperformed Mr Romney by large margins—live in communities that are literally dying."

--- Replication files available?

track_down_references
us_politics
trump.donald
whats_gone_wrong_with_america
to_teach:undergrad-ADA
november 2016 by cshalizi

The Great Minds Journal Club discusses Westfall & Yarkoni (2016) – [citation needed]

june 2016 by cshalizi

In which Tal Yarkoni pulls off writing a dialogue on his own paper. (I'd never dare.)

statistics
measurement
yarkoni.tal
to_teach:undergrad-ADA
to_teach:linear_models
social_measurement
causal_inference
june 2016 by cshalizi

PLOS ONE: Trickle-Down Preferences: Preferential Conformity to High Status Peers in Fashion Choices

may 2016 by cshalizi

On first skim, they don't really seem to consider that women who move from low to high status locations are probably _already different_ from those who don't...

I can't believe I'm writing this, but this might really be a job for propensity-score matching.

to:NB
to_be_shot_after_a_fair_trial
social_influence
economics
shoes
re:homophily_and_confounding
to_teach:undergrad-ADA
may 2016 by cshalizi

Quartz/bad-data-guide: An exhaustive reference to problems seen in real-world data along with suggestions on how to resolve them.

may 2016 by cshalizi

This is pretty good (and not limited to "data journalism").

data_analysis
to_teach:undergrad-ADA
to_teach:undergrad-research
have_read
via:?
may 2016 by cshalizi

Surfeit and surface | Big Data & Society

april 2016 by cshalizi

This is awesome. (But it's also completely compatible with causal inference!) Also, the cultural references will probably require footnotes in just 10 years.

social_science_methodology
sociology
data_mining
levi.john_martin
have_read
via:phnk
to_teach:undergrad-ADA
to_teach:data-mining
re:any_p-value_distinguishable_from_zero_is_insufficiently_informative
to:blog
april 2016 by cshalizi

Hardle , Marron : Bootstrap Simultaneous Error Bars for Nonparametric Regression

april 2016 by cshalizi

"Simultaneous error bars are constructed for nonparametric kernel estimates of regression functions. The method is based on the bootstrap, where resampling is done from a suitably estimated residual distribution. The error bars are seen to give asymptotically correct coverage probabilities uniformly over any number of gridpoints. Applications to an economic problem are given and comparison to both pointwise and Bonferroni-type bars is presented through a simulation study."

to:NB
to_read
bootstrap
confidence_sets
regression
nonparametrics
statistics
to_teach:undergrad-ADA
re:ADAfaEPoV
april 2016 by cshalizi

Statistically controlling for confounding constructs is harder than you think

march 2016 by cshalizi

"Social scientists often seek to demonstrate that a construct has incremental validity over and above other related constructs. However, these claims are typically supported by measurement-level models that fail to consider the effects of measurement (un)reliability. We use intuitive examples, Monte Carlo simulations, and a novel analytical framework to demonstrate that common strategies for establishing incremental construct validity using multiple regression analysis exhibit extremely high Type I error rates under parameter regimes common in many psychological domains. Counterintuitively, we find that error rates are highest—in some cases approaching 100%—when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious. We present a web application (http://jakewestfall.org/ivy/) that readers can use to explore the statistical properties of these and other incremental validity arguments. We conclude by reviewing SEM-based statistical approaches that appropriately control the Type I error rate when attempting to establish incremental validity."

to:NB
have_read
measurement
social_measurement
social_science_methodology
psychometrics
econometrics
graphical_models
statistics
to_teach:undergrad-ADA
re:ADAfaEPoV
yarkoni.tal
to:blog
march 2016 by cshalizi
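
The Monte Carlo argument in the abstract above can be imitated with a toy simulation. This is a sketch under assumed settings, not the authors' code: two latent constructs correlated at 0.5, an outcome that depends only on the first, and noisy measurements of both (reliability 0.5). All function names and parameter values are hypothetical. Regressing the outcome on the two noisy measures then "finds" an effect of the second far more often than the nominal 5%.

```python
import random
from math import sqrt


def t_stat_second_predictor(x1, x2, y):
    """t statistic for the coefficient on x2 in OLS of y on (1, x1, x2)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    syy = sum((c - my) ** 2 for c in y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = syy - b1 * s1y - b2 * s2y        # residual sum of squares
    sigma2 = rss / (n - 3)                 # three fitted coefficients
    return b2 / sqrt(sigma2 * s11 / det)


def false_positive_rate(n=1000, sims=100, rho=0.5, noise_sd=1.0, seed=1):
    """Share of simulations where x2 looks 'significant' at roughly the 5% level."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        x1, x2, y = [], [], []
        for _ in range(n):
            t1 = rng.gauss(0, 1)                                  # construct that matters
            t2 = rho * t1 + sqrt(1 - rho ** 2) * rng.gauss(0, 1)  # correlated, but inert
            x1.append(t1 + rng.gauss(0, noise_sd))                # noisy measure of t1
            x2.append(t2 + rng.gauss(0, noise_sd))                # noisy measure of t2
            y.append(t1 + rng.gauss(0, 1))                        # y depends on t1 only
        if abs(t_stat_second_predictor(x1, x2, y)) > 1.96:
            hits += 1
    return hits / sims


rate = false_positive_rate()
print("share of runs 'finding' an effect of x2:", rate)
```

Setting `noise_sd=0.0` removes the measurement error, and the rejection rate should fall back to roughly the nominal level, which is the contrast the paper is about.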

Jenny Bryan on Twitter: "An Incomplete List of #rstats troubleshooting tips https://t.co/OKKoGkSYzq"

march 2016 by cshalizi

It misses

* Did you use attach()? Don't

but is otherwise pretty good.

R
to_teach:undergrad-ADA
to_teach:statcomp
via:tslumley
bryan.jennifer
march 2016 by cshalizi

School Finance Reform and the Distribution of Student Achievement

february 2016 by cshalizi

"We study the impact of post-1990 school finance reforms, during the so-called "adequacy" era, on gaps in spending and achievement between high-income and low-income school districts. Using an event study design, we find that reform events--court orders and legislative reforms--lead to sharp, immediate, and sustained increases in absolute and relative spending in low-income school districts. Using representative samples from the National Assessment of Educational Progress, we also find that reforms cause gradual increases in the relative achievement of students in low-income school districts, consistent with the goal of improving educational opportunity for these students. The implied effect of school resources on educational achievement is large."

- Last tag depends on replication data, which might not be available.

to:NB
education
inequality
us_politics
causal_inference
to_teach:undergrad-ADA
via:jbdelong
february 2016 by cshalizi

Information, Inequality, and Mass Polarization: Ideology in Advanced Democracies

february 2016 by cshalizi

"Growing polarization in the American Congress is closely related to rising income inequality. Yet there has been no corresponding polarization of the U.S. electorate, and across advanced democracies, mass polarization is negatively related to income inequality. To explain this puzzle, we propose a comparative political economy model of mass polarization in which the same institutional factors that generate income inequality also undermine political information. We explain why more voters then place themselves in the ideological center, hence generating a negative correlation between mass polarization and inequality. We confirm these conjectures on individual-level data for 20 democracies, and we then show that democracies cluster into two types: one with high inequality, low mass polarization, and polarized and right-shifted elites (e.g., the United States); and the other with low inequality and high mass polarization with left-shifted elites (e.g., Sweden). This division reflects long-standing differences in educational systems, the role of unions, and social networks."

--- Replication data available?

political_economy
political_science
social_networks
unions
political_parties
inequality
class_struggles_in_america
whats_gone_wrong_with_america
re:democratic_cognition
democracy
via:henry_farrell
to_read
to_teach:undergrad-ADA
february 2016 by cshalizi

Homicide in Eighteenth-Century Scotland: Numbers and Theories - Edinburgh University Press

february 2016 by cshalizi

"The purpose of this article is to address the lacuna in our knowledge of the extent of interpersonal violence in eighteenth-century Scotland, with particular reference to homicide, and in doing so use these findings to examine the theoretical and empirical issues that have dominated historical discourse regarding this phenomenon over the last few decades. Essentially, it seeks to challenge widely held explanations for the alleged long-term decline in homicide, arguing that incidences of murder in the eighteenth century were affected more by political tensions and socio-economic dislocation than by cultural changes in taste and manners. It also criticises the methodological weaknesses evident in longitudinal studies of homicide and tries to resolve them in two ways: firstly, by adjusting the homicide rate to take account of the rises and falls in population in the period 1700–1799; and, secondly, by providing national data rather than relying on extrapolating national trends from local or regional studies. Finally, it is argued that the main assumptions of historians working in the field of homicide studies are in the light of evidence for Scotland in need of revision as data from there provide little support for a linear fall in the level of homicides, or a link with shifts in sentiment and/or taste as put forward by those influenced by the civilising theories of Norbert Elias."

--- Smoothing over time, with a generalized additive model (though only one predictor variable, so really a spline + a fancy link function). Perhaps usable as an example.

to:NB
to_read
violence
statistics
early_modern_european_history
the_civilizing_process
scotland
to_teach:undergrad-ADA
february 2016 by cshalizi

Large Sample Properties of Matching Estimators for Average Treatment Effects - Abadie - 2005 - Econometrica - Wiley Online Library

february 2016 by cshalizi

"Matching estimators for average treatment effects are widely used in evaluation research despite the fact that their large sample properties have not been established in many cases. The absence of formal results in this area may be partly due to the fact that standard asymptotic expansions do not apply to matching estimators with a fixed number of matches because such estimators are highly nonsmooth functionals of the data. In this article we develop new methods for analyzing the large sample properties of matching estimators and establish a number of new results. We focus on matching with replacement with a fixed number of matches. First, we show that matching estimators are not N1/2-consistent in general and describe conditions under which matching estimators do attain N1/2-consistency. Second, we show that even in settings where matching estimators are N1/2-consistent, simple matching estimators with a fixed number of matches do not attain the semiparametric efficiency bound. Third, we provide a consistent estimator for the large sample variance that does not require consistent nonparametric estimation of unknown functions. Software for implementing these methods is available in Matlab, Stata, and R."

--- An unkind version of this would be "matching is what happens when you do nearest-neighbor regression, and you forget that the bias-variance tradeoff is a _tradeoff_."

(Ungated version: http://www.ksg.harvard.edu/fs/aabadie/smep.pdf)

(ADA note: reference in the causal-estimation chapter, re connection between matching and nearest neighbors)

to:NB
statistics
estimation
causal_inference
regression
to_teach:undergrad-ADA
have_read
matching
february 2016 by cshalizi
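
To make the nearest-neighbor connection in the note above concrete, here is a toy single-match estimator with replacement, for the effect on the treated. It is a sketch on made-up data, not the estimator analyzed in the paper (which also covers more than one match and bias correction); the function name and the numbers are assumptions. Each treated unit is compared with its closest control, which amounts to 1-nearest-neighbor regression for the untreated outcome.

```python
def match_att(treated, controls):
    """Average effect on the treated via one nearest-neighbor match.

    treated, controls: lists of (x, y) pairs with a scalar covariate x.
    Matching is with replacement: a control may be reused.
    """
    diffs = []
    for x, y in treated:
        # 1-NN "regression": predict the untreated outcome at x
        # by the outcome of the closest control unit.
        _, y0_hat = min(controls, key=lambda c: abs(c[0] - x))
        diffs.append(y - y0_hat)
    return sum(diffs) / len(diffs)


# Made-up data: untreated outcome is 2*x, and treatment adds exactly 1.
controls = [(x, 2 * x) for x in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]]
treated = [(x, 2 * x + 1.0) for x in [0.05, 0.25, 0.45]]

att = match_att(treated, controls)
print(att)  # about 1.1 rather than 1.0
```

Every treated unit here sits 0.05 above its match, so the estimate is off by 2 * 0.05 = 0.1 even with no noise at all. That gap is the bias side of the tradeoff: with a fixed number of matches it shrinks only as the controls get denser.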

AEAweb: AER (95,3) p. 546 - The Rise of Europe: Atlantic Trade, Institutional Change, and Economic Growth

january 2016 by cshalizi

"The rise of Western Europe after 1500 is due largely to growth in countries with access to the Atlantic Ocean and with substantial trade with the New World, Africa, and Asia via the Atlantic. This trade and the associated colonialism affected Europe not only directly, but also indirectly by inducing institutional change. Where "initial" political institutions (those established before 1500) placed significant checks on the monarchy, the growth of Atlantic trade strengthened merchant groups by constraining the power of the monarchy, and helped merchants obtain changes in institutions to protect property rights. These changes were central to subsequent economic growth."

to:NB
economics
economic_history
institutions
economic_growth
to_teach:undergrad-ADA
via:jbdelong
have_read
january 2016 by cshalizi

Does data splitting improve prediction? - Springer

january 2016 by cshalizi

"Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation but we use this part for estimating the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that a split data analysis is preferred to a full data analysis for prediction with some exceptions."

--- Ungated: http://arxiv.org/abs/1301.2983

statistics
regression
prediction
model_selection
faraway.j.j.
re:ADAfaEPoV
to_teach:undergrad-ADA
have_read
to_teach:linear_models
in_NB
january 2016 by cshalizi

Evidence based policy or policy based evidence? Supply and demand for data in a donor dominant world | People, Spaces, Deliberation

statistics economics development_economics organizations political_economy science_as_a_social_process social_measurement to_teach:undergrad-ADA

january 2016 by cshalizi

CRAN - Package ridge

october 2015 by cshalizi

"Linear and logistic ridge regression for small data sets and genome-wide SNP data"

R
regression
statistics
ridge_regression
to_teach:undergrad-ADA
to_teach:linear_models
october 2015 by cshalizi

Andrey Nikolayevich Tikhonov - Wikipedia, the free encyclopedia

october 2015 by cshalizi

Of course the bit I am interested in is only in the short next-to-last paragraph.

tikhonov.a.n.
lives_of_the_scientists
mathematics
optimization
to_teach:undergrad-ADA
to_teach:linear_models
october 2015 by cshalizi

Untangling the sources of racial wealth inequality in the United States - Equitable Growth

october 2015 by cshalizi

Suppose --- work with me here --- that one of the things which makes it easier to buy a home is _having parents with wealth_, who can _pass along some of that wealth while they are alive_. (Hey, it's a hypothesis.) What, exactly, do you learn from the coefficient on "inheritance" in a regression which controls for "homeownership"? Similarly, suppose one of the things wealthier parents buy for their children is _access to education_, leading to job opportunities, and _direct access to job opportunities_. Again, what do you learn in a regression which "controls for" (I can't help the scare quotes) income?

economics
inequality
have_read
track_down_references
racism
the_american_dilemma
transmission_of_inequality
to_teach:undergrad-ADA
to_teach:linear_models
october 2015 by cshalizi

globalinequality: Did socialism keep capitalism equal?

august 2015 by cshalizi

Some econometric evidence for one of my pet-crank notions. The regression specifications look dubious, however.

cold_war
political_economy
socialism
economics
inequality
to:blog
track_down_references
to_teach:linear_models
to_teach:undergrad-ADA
august 2015 by cshalizi

Science Isn’t Broken | FiveThirtyEight

august 2015 by cshalizi

I like the idea of having researchers compete to throw all sorts of different modeling choices at the same data, and the initial example is cool.

science
science_as_a_social_process
have_read
to_teach:undergrad-ADA
august 2015 by cshalizi

Heat Wave: A Social Autopsy of Disaster in Chicago, Klinenberg

june 2015 by cshalizi

"Heat waves in the United States kill more people during a typical year than all other natural disasters combined. Until now, no one could explain either the overwhelming number or the heartbreaking manner of the deaths resulting from the 1995 Chicago heat wave. Meteorologists and medical scientists have been unable to account for the scale of the trauma, and political officials have puzzled over the sources of the city's vulnerability. In Heat Wave, Eric Klinenberg takes us inside the anatomy of the metropolis to conduct what he calls a "social autopsy," examining the social, political, and institutional organs of the city that made this urban disaster so much worse than it ought to have been."

to:NB
books:noted
chicago
disasters
sociology
to_teach:undergrad-ADA
books:owned
june 2015 by cshalizi

AEAweb: AER (105,6) p. 1738 - Trafficking Networks and the Mexican Drug War

june 2015 by cshalizi

"Drug trade-related violence has escalated dramatically in Mexico since 2007, and recent years have also witnessed large-scale efforts to combat trafficking, spearheaded by Mexico's conservative PAN party. This study examines the direct and spillover effects of Mexican policy toward the drug trade. Regression discontinuity estimates show that drug-related violence increases substantially after close elections of PAN mayors. Empirical evidence suggests that the violence reflects rival traffickers' attempts to usurp territories after crackdowns have weakened incumbent criminals. Moreover, the study uses a network model of trafficking routes to show that PAN victories divert drug traffic, increasing violence along alternative drug routes."

--- Look at data set & c. to see if this could become a problem set.

to:NB
drugs
crime
causal_inference
mexico
economics
to_teach:undergrad-ADA
june 2015 by cshalizi

Welcome to NB

may 2015 by cshalizi

Last tag is tentative but this seems like a very interesting tool.

teaching
social_media
to_teach:undergrad-ADA
may 2015 by cshalizi

[1505.02452] Design and interpretation of studies: relevant concepts from the past and some extensions

may 2015 by cshalizi

"Principles for the planning and analysis of observational studies, as suggested by W.G. Cochran in 1972, are discussed and compared to additional methodological developments since then."

to:NB
statistics
cox.d.r
wemuth.nanny
to_teach:undergrad-ADA
to_read
may 2015 by cshalizi

CRAN - Package AlgDesign

april 2015 by cshalizi

"Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact."

R
experimental_design
statistics
to_teach:undergrad-ADA
april 2015 by cshalizi

Welcome to the CRCNS data sharing website — CRCNS.org

march 2015 by cshalizi

Sharing neural data; some of the data sets require an (anonymous) login.

--- See about using one of the movement data sets for a multivariate-analysis problem set (or exam?).

neuroscience
data_sets
to_teach:undergrad-ADA
march 2015 by cshalizi

Power from the People -- Finance & Development, March 2015

march 2015 by cshalizi

Well, the final project _will_ be assigned on May Day --- see if data's available...

inequality
economics
political_economy
unions
track_down_references
via:jbdelong
to_teach:undergrad-ADA
march 2015 by cshalizi

Dead and Alive: Beliefs in Contradictory Conspiracy Theories

february 2015 by cshalizi

"Conspiracy theories can form a monological belief system: A self-sustaining worldview comprised of a network of mutually supportive beliefs. The present research shows that even mutually incompatible conspiracy theories are positively correlated in endorsement. In Study 1 (n = 137), the more participants believed that Princess Diana faked her own death, the more they believed that she was murdered. In Study 2 (n = 102), the more participants believed that Osama Bin Laden was already dead when U.S. special forces raided his compound in Pakistan, the more they believed he is still alive. Hierarchical regression models showed that mutually incompatible conspiracy theories are positively associated because both are associated with the view that the authorities are engaged in a cover-up (Study 2). The monological nature of conspiracy belief appears to be driven not by conspiracy theories directly supporting one another but by broader beliefs supporting conspiracy theories in general."

--- I'd want to look very carefully at the numerical data to make sure this isn't being driven by a few people who are crazy (even once you allow for their being into conspiracy theories). In fact, this sounds like a situation where you'd really want to look carefully at protocols collected from the interviewees... Last tag conditional on the authors responding positively to my query about access to the data.

to:NB
have_skimmed
surveys
hierarchical_statistical_models
conspiracy_theories
sociology
to_teach:undergrad-ADA
psychology
natural_history_of_truthiness
february 2015 by cshalizi

Overdispersion Diagnostics for Generalized Linear Models on JSTOR

february 2015 by cshalizi

"Generalized linear models (GLM's) are simple, convenient models for count data, but they assume that the variance is a specified function of the mean. Although overdispersed GLM's allow more flexible mean-variance relationships, they are often not as simple to interpret nor as easy to fit as standard GLM's. This article introduces a convexity plot, or C plot for short, that detects overdispersion and relative variance curves and relative variance tests that help to understand the nature of the overdispersion. Convexity plots sometimes detect overdispersion better than score tests, and relative variance curves and tests sometimes distinguish the source of the overdispersion better than score tests."

in_NB
statistics
regression
model_checking
kith_and_kin
roeder.kathryn
to_teach:undergrad-ADA
have_read
february 2015 by cshalizi

What Do Data on Millions of U.S. Workers Reveal about Life-Cycle Earnings Risk?

february 2015 by cshalizi

"We study the evolution of individual labor earnings over the life cycle, using a large panel data set of earnings histories drawn from U.S. administrative records. Using fully nonparametric methods, our analysis reaches two broad conclusions. First, earnings shocks display substantial deviations from lognormality—the standard assumption in the literature on incomplete markets. In particular, earnings shocks display strong negative skewness and extremely high kurtosis—as high as 30 compared with 3 for a Gaussian distribution. The high kurtosis implies that, in a given year, most individuals experience very small earnings shocks, and a small but non-negligible number experience very large shocks. Second, these statistical properties vary significantly both over the life cycle and with the earnings level of individuals. We also estimate impulse response functions of earnings shocks and find important asymmetries: Positive shocks to high-income individuals are quite transitory, whereas negative shocks are very persistent; the opposite is true for low-income individuals. Finally, we use these rich sets of moments to estimate econometric processes with increasing generality to capture these salient features of earnings dynamics."

--- Last tag conditional on what exactly is in the "data appendix" at https://fguvenendotcom.files.wordpress.com/2014/04/moments_for_publication.xls

to:NB
to_read
economics
inequality
heavy_tails
to_teach:undergrad-ADA
statistics
great_risk_shift
february 2015 by cshalizi

On the Interpretation of Instrumental Variables in the Presence of Specification Errors

february 2015 by cshalizi

"The method of instrumental variables (IV) and the generalized method of moments (GMM), and their applications to the estimation of errors-in-variables and simultaneous equations models in econometrics, require data on a sufficient number of instrumental variables that are both exogenous and relevant. We argue that, in general, such instruments (weak or strong) cannot exist."

--- I think they are too quick to dismiss non-parametric IV; if what one wants is consistent estimates of the partial derivatives at a given point, you _can_ get that by (e.g.) splines or locally linear regression. Need to think through this in terms of Pearl's graphical definition of IVs.

in_NB
instrumental_variables
misspecification
regression
linear_regression
causal_inference
statistics
econometrics
via:jbdelong
have_read
to_teach:undergrad-ADA
re:ADAfaEPoV
february 2015 by cshalizi

[1312.7851] Effective Degrees of Freedom: A Flawed Metaphor

january 2015 by cshalizi

"To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, contrary to folk intuition, model complexity and degrees of freedom are not synonymous and may correspond very poorly. We exhibit and theoretically explore various examples of fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the response space. Even in very simple settings, the degrees of freedom can exceed the dimension of the ambient space by an arbitrarily large amount. We show the degrees of freedom for any non-convex projection method can be unbounded."

--- I have never really liked "degrees of freedom"...

--- ETA after reading: to be clear, no one is arguing about "effective degrees of freedom", in the sense of Efron (1986), telling us about over-fitting. The demonstrations here are that the geometric metaphor behind "degrees of freedom", while holding for linear models (without model selection), becomes very misleading in other contexts. Now, since I prefer to think of model selection in terms of capacity to over-fit, rather than the number of adjustable knobs...

to:NB
model_selection
regression
statistics
hastie.trevor
to_teach:undergrad-ADA
have_read
convexity
january 2015 by cshalizi
