statistics   210312

Has baseball analytics killed the art of hitting? | Sport | The Guardian
As is often the case in baseball, the stats tell the story: the major league batting average has dropped below .250 for the first time since 1972 and strikeouts have topped hits for the first time ever.
baseball  statistics 
17 hours ago by campion1581
How Much Should We Trust Estimates from Multiplicative Interaction Models? Simple Tools to Improve Empirical Practice | Political Analysis | Cambridge Core
Multiplicative interaction models are widely used in social science to examine whether the relationship between an outcome and an independent variable changes with a moderating variable. Current empirical practice tends to overlook two important problems. First, these models assume a linear interaction effect that changes at a constant rate with the moderator. Second, estimates of the conditional effects of the independent variable can be misleading if there is a lack of common support of the moderator. Replicating 46 interaction effects from 22 recent publications in five top political science journals, we find that these core assumptions often fail in practice, suggesting that a large portion of findings across all political science subfields based on interaction models are fragile and model dependent. We propose a checklist of simple diagnostics to assess the validity of these assumptions and offer flexible estimation strategies that allow for nonlinear interaction effects and safeguard against excessive extrapolation. These statistical routines are available in both R and STATA.
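
As a rough illustration of the linearity problem described above, here is a minimal sketch on simulated data. It is not the authors' R or STATA routines, and the binned comparison is a simplified stand-in for their diagnostics. The data-generating process, variable names, and tercile bins are all assumptions made for the example.

```python
import random

def ols_line(xs, ys):
    """Slope and intercept of a simple least-squares line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def mean(vals):
    return sum(vals) / len(vals)

rng = random.Random(0)
# Simulated data (an assumption): the true effect of D on Y is X**2,
# which does not change at a constant rate with the moderator X.
rows = []
for _ in range(3000):
    x = rng.uniform(-1, 1)
    d = rng.randint(0, 1)
    y = 1 + 0.5 * x + d * x ** 2 + rng.gauss(0, 0.5)
    rows.append((x, d, y))

# With a binary D, the linear interaction model is equivalent to
# fitting one straight line per treatment arm.
s1, i1 = ols_line([x for x, d, _ in rows if d == 1], [y for _, d, y in rows if d == 1])
s0, i0 = ols_line([x for x, d, _ in rows if d == 0], [y for _, d, y in rows if d == 0])

def linear_effect(x):
    """Conditional effect of D at moderator value x under the linear model."""
    return (i1 - i0) + (s1 - s0) * x

# Binned check: difference in mean Y between arms inside each tercile of X.
cuts = sorted(x for x, _, _ in rows)
lo_cut, hi_cut = cuts[len(cuts) // 3], cuts[2 * len(cuts) // 3]
bins = {"low": [], "mid": [], "high": []}
for x, d, y in rows:
    key = "low" if x < lo_cut else "mid" if x < hi_cut else "high"
    bins[key].append((x, d, y))

binned_effect = {}
for key, members in bins.items():
    treated = [y for _, d, y in members if d == 1]
    control = [y for _, d, y in members if d == 0]
    binned_effect[key] = mean(treated) - mean(control)
    bin_center = mean([x for x, _, _ in members])
    print(key, round(binned_effect[key], 2),
          "vs linear model:", round(linear_effect(bin_center), 2))
```

In this simulation the linear model reports a nearly flat effect across the moderator, while the binned differences are visibly larger in the outer terciles than in the middle one.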

--Not clear if this transfers to log-linear interaction models...
regression  statistics  estimation  meta-analysis  for_friends 
17 hours ago by rvenkat
Why you should care about the Nate Silver vs. Nassim Taleb Twitter war • Towards Data Science
Isaac Faber:
If a prediction does not obey some fundamental characteristics, it should not be marketed as a probability. More importantly, a prediction should be judged from the time it is given to the public and not just the moment before the event. A forecaster should be held responsible for both aleatory and epistemic uncertainty.

When viewed this way, it is clear that FiveThirtyEight reports too much noise leading up to an event and not enough signal. This is great for driving users to read long series of related articles on the same topic but not so rigorous to bet your fortune on. Taleb's and Silver's take on how FiveThirtyEight should be judged can be visualized like this.

[Figure: Taleb vs. Silver's different take on how FiveThirtyEight should be judged in 2016]

Because there is so much uncertainty around non-linear events, like an election, it could reasonably be considered frivolous to report early stage forecasts. The only conceivable reason to do so is to capture (and monetize?) the interest of a public which is hungry to know the future. I will not go into the technical arguments; Taleb has written and published a paper on the key issues with a solution.

"Too much noise, not enough signal" - but elections mostly are noise, and figuring out what is signal can only be done afterwards. (And everyone can argue it differently.)
culture  statistics  maths  elections  probability 
17 hours ago by charlesarthur
Introduction to Conditional Random Fields
current math level: this mostly makes sense to me, like 90%
machinelearning  nlproc  statistics  language 
18 hours ago by aparrish
To Reduce Privacy Risks, the Census Plans to Report Less Accurate Data - The New York Times
When the Census Bureau gathered data in 2010, it made two promises. The form would be “quick and easy,” it said. And “your answers are protected by law.”

But mathematical breakthroughs, easy access to more powerful computing, and widespread availability of large and varied public data sets have made the bureau reconsider whether the protection it offers Americans is strong enough. To preserve confidentiality, the bureau’s directors have determined they need to adopt a “formal privacy” approach, one that adds uncertainty to census data before it is published and achieves privacy assurances that are provable mathematically.

The census has always added some uncertainty to its data, but a key innovation of this new framework, known as “differential privacy,” is a numerical value describing how much privacy loss a person will experience. It determines the amount of randomness — “noise” — that needs to be added to a data set before it is released, and sets up a balancing act between accuracy and privacy. Too much noise would mean the data would not be accurate enough to be useful — in redistricting, in enforcing the Voting Rights Act or in conducting academic research. But too little, and someone’s personal data could be revealed....
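
The trade-off between noise and accuracy can be sketched with the textbook Laplace mechanism. This is an assumption for illustration only: the excerpt does not say which mechanism the bureau uses. The block count, the `epsilon` values, and the function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with noise scaled to sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(true_count, epsilon, rng, trials=2000):
    """Average absolute gap between the released and the true count."""
    errors = [abs(noisy_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials

rng = random.Random(42)
block_population = 120  # hypothetical block count
for epsilon in (0.1, 1.0, 10.0):
    # Smaller epsilon means less privacy loss but a noisier published count.
    print(epsilon, round(mean_abs_error(block_population, epsilon, rng), 2))
```

Here `epsilon` plays the role of the numerical privacy-loss value mentioned in the excerpt: lowering it widens the noise, which protects individuals but degrades the published count.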

In November 2016, the bureau staged something of an attack on itself. Using only the summary tables with their eight billion numbers, Mr. Abowd formed a small team to try to generate a record for every American that would show the block where he or she lived, as well as his or her sex, age, race and ethnicity — a “reconstruction” of the person-level data.

Each statistic in a summary table leaks a little information, offering clues about, or rather constraints on, what respondents’ answers to the census could look like. Combining statistics from different aggregate tables at different levels of geography, we start to get a picture of the demographics of who is living where....
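
How published statistics act as constraints can be seen in a toy version of the idea. The sketch below enumerates every age combination for a made-up three-person block and discards those that contradict each published figure. All numbers are hypothetical, and this brute-force search is only an illustration, not the bureau's reconstruction method.

```python
from itertools import combinations_with_replacement

MAX_AGE = 115  # assumed upper bound on age

# Hypothetical published statistics for one three-person block.
published = {"count": 3, "mean_age": 30, "median_age": 24, "under_18": 1}

# Every possible sorted combination of ages before any table is consulted.
all_ages = list(combinations_with_replacement(range(MAX_AGE + 1), published["count"]))

# Constraint 1: the ages must add up to mean * count.
after_mean = [a for a in all_ages
              if sum(a) == published["mean_age"] * published["count"]]

# Constraint 2: the middle of the sorted triple must equal the median.
after_median = [a for a in after_mean if a[1] == published["median_age"]]

# Constraint 3: the number of residents under 18 must match.
after_minors = [a for a in after_median
                if sum(age < 18 for age in a) == published["under_18"]]

for label, remaining in [("no constraints", all_ages),
                         ("mean age", after_mean),
                         ("median age", after_median),
                         ("count under 18", after_minors)]:
    print(label, len(remaining))
```

Each added table shrinks the set of records that could have produced the published numbers; with enough overlapping tables, very few candidates remain.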

By this summer, Mr. Abowd and his team had completed their reconstruction for nearly every part of the country. When they matched their reconstructed data to the actual, confidential records — again comparing just block, sex, age, race and ethnicity — they found about 50 percent of people matched exactly. And for over 90 percent there was at most one mistake, typically a person’s age being missed by one or two years. (At smaller levels of geography, the census reports age in five-year buckets.)

This level of accuracy was alarming. Mr. Abowd and his peers say that their reconstruction, while still preliminary, is not a violation of Title 13. Instead it is seen as a red flag that their current disclosure limitation system is out of date....
census  statistics  mapping  privacy 
yesterday by shannon_mattern
Causal Inference Animated Plots
When you're learning econometrics, we tend to toss a bunch of methods at you. Here's multivariate OLS. Here's difference-in-difference. Here's instrumental variables. We show you how to perform them, and we tell you the assumptions necessary for them to work, but how often do we show you what they actually do?

On this page, I take several popular methods for getting causal effects out of non-experimental data and provide animated plots that show you what these methods actually do to the data and how you can understand what the effects you're estimating actually ARE.

You may find it useful to know that whenever I say some variation in A is 'explained by' B, I'm talking about taking the mean of A among observations with different values of B. So if Alice's height is 64 inches, Bob's height is 72 inches, the average woman is 66 inches, and the average man is 69 inches, then 66 of Alice's inches and 69 of Bob's inches are 'explained by' gender, and (64-66) = -2 of Alice's inches and (72-69) = 3 of Bob's inches are 'not explained by' gender.
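
The 'explained by' arithmetic above can be written out as a short sketch. Alice and Bob come from the text. Carol and Dave are hypothetical additions, chosen so that the sample averages match the 66 and 69 inches quoted there.

```python
from collections import defaultdict

def split_explained(values, groups):
    """Return per-observation group means (explained) and residuals (unexplained)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    means = {g: totals[g] / counts[g] for g in totals}
    explained = [means[g] for g in groups]
    unexplained = [v - means[g] for v, g in zip(values, groups)]
    return explained, unexplained

names = ["Alice", "Carol", "Bob", "Dave"]  # Carol and Dave are hypothetical
heights = [64, 68, 72, 66]
gender = ["woman", "woman", "man", "man"]

explained, unexplained = split_explained(heights, gender)
for name, e, u in zip(names, explained, unexplained):
    print(name, "explained:", e, "not explained:", u)
```

For Alice this gives 66 explained and -2 not explained, and for Bob 69 and 3, matching the figures in the text.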

A couple brief notes:

These graphs are intended to give an intuitive understanding of how these methods make use of and interpret data, not to show the best practices for actually performing the methods. Take these as illustrations, not a set of instructions. This is clearest on the graphs for matching (which shows only one matching method of many), and regression discontinuity (which shows a pretty weak way of performing RDD).
I completely ignore the kind of stuff you're likely to get in your class anyway: what assumptions are necessary, what kinds of tests to run, what regression functional forms to use, whether you've identified an ATE, ATT, LATE, etc.
On that note, you may notice that I actually ignore regression entirely in making these graphs (I do sneak in a correlation or two in some of the explanatory text, but not in the pictures)! You can get the idea for all of these across by just taking means. That's simple, and simple is good.
If you view the animations on Twitter, you can click them to pause/play.
R code for these graphs is available on GitHub.
How did I draw those causal diagrams? Why,!
Jump to: Controlling for a Variable, Matching on a Variable, Instrumental Variables, Fixed Effects, Difference-in-Difference, Regression Discontinuity, Collider Variables, or Post-Treatment Controls.
econometrics  statistics  visualisation 
yesterday by gimber