sampling-bias   20

Frontiers | Can We Validate the Results of Twin Studies? A Census-Based Study on the Heritability of Educational Achievement | Genetics
As for most phenotypes, the amount of variance in educational achievement explained by SNPs is lower than the amount of additive genetic variance estimated in twin studies. Twin-based estimates may, however, be biased because of self-selection and differences in cognitive ability between twins and the rest of the population. Here we compare twin-registry-based estimates with a census-based heritability estimate, sampling from the same Dutch birth cohort population and using the same standardized measure for educational achievement. Including important covariates (i.e., sex, migration status, school denomination, SES, and group size), we analyzed 893,127 scores from primary school children from the years 2008–2014. For genetic inference, we used pedigree information to construct an additive genetic relationship matrix. Corrected for the covariates, this resulted in a heritability estimate of 85%, which is even higher than estimates based on twin studies using the same cohort and the same measure. We therefore conclude that the genetic variance not tagged by SNPs is not an artifact of the twin method itself.
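To make the genetic-inference step concrete, here is a minimal sketch, assuming nothing from the paper beyond the idea of an additive relationship matrix: Haseman-Elston regression against a simulated marker-based relationship matrix stands in for the authors' pedigree-based mixed-model approach, and all names and parameters below are illustrative.
```python
import numpy as np

# Hypothetical sketch (not the authors' pedigree-based REML mixed model):
# Haseman-Elston regression against a simulated marker-based additive
# relationship matrix. For standardized phenotypes, E[y_i * y_j] = h2 * A_ij
# for i != j, so regressing phenotype cross-products on relatedness
# recovers the heritability h2.
rng = np.random.default_rng(1)
n, m, h2_true = 1000, 2000, 0.85            # individuals, markers, true h2

geno = rng.binomial(2, 0.5, size=(n, m)).astype(float)
geno = (geno - geno.mean(0)) / geno.std(0)  # standardize each marker
A = geno @ geno.T / m                       # additive relationship matrix

g = geno @ rng.normal(0.0, np.sqrt(h2_true / m), m)  # additive genetic values
y = g + rng.normal(0.0, np.sqrt(1.0 - h2_true), n)   # phenotype, variance ~ 1
y = (y - y.mean()) / y.std()

i, j = np.triu_indices(n, k=1)              # off-diagonal pairs only
a = A[i, j]
h2_hat = a @ (y[i] * y[j]) / (a @ a)        # regression slope through origin
print(f"estimated h2 = {h2_hat:.2f}")       # roughly recovers 0.85
```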
study  biodet  behavioral-gen  iq  psychometrics  psychology  cog-psych  twin-study  methodology  variance-components  state-of-art  🌞  developmental  age-generation  missing-heritability  biases  measurement  sampling-bias  sib-study 
december 2017 by nhaliday
[1106.2832] Active Learning to Overcome Sample Selection Bias: Application to Photometric Variable Star Classification
Despite the great promise of machine-learning algorithms to classify and predict astrophysical parameters for the vast numbers of astrophysical sources and transients observed in large-scale surveys, the peculiarities of the training data often manifest as strongly biased predictions on the data of interest. Typically, training sets are derived from historical surveys of brighter, more nearby objects than those from more extensive, deeper surveys (testing data). This sample selection bias can cause catastrophic errors in predictions on the testing data because a) standard assumptions for machine-learned model selection procedures break down and b) dense regions of testing space might be completely devoid of training data. We explore possible remedies to sample selection bias, including importance weighting (IW), co-training (CT), and active learning (AL). We argue that AL---where the data whose inclusion in the training set would most improve predictions on the testing set are queried for manual follow-up---is an effective approach and is appropriate for many astronomical applications. For a variable star classification problem on a well-studied set of stars from Hipparcos and OGLE, AL is the optimal method in terms of error rate on the testing data, beating the off-the-shelf classifier by 3.4% and the other proposed methods by at least 3.0%. To aid with manual labeling of variable stars, we developed a web interface which allows for easy light curve visualization and querying of external databases. Finally, we apply active learning to classify variable stars in the ASAS survey, finding dramatic improvement in our agreement with the ACVS catalog, from 65.5% to 79.5%, and a significant increase in the classifier's average confidence for the testing set, from 14.6% to 42.9%, after a few AL iterations.
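As a rough illustration of the AL loop the abstract describes (query the most informative unlabeled points for manual follow-up, retrain, repeat), here is a hedged sketch in which plain margin-based uncertainty sampling stands in for the paper's query criterion; the dataset, classifier, and all parameters are made up for illustration.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Sketch of pool-based active learning: repeatedly query the unlabeled
# points the current model is least certain about, label them, retrain.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
labeled = list(range(50))      # small, biased initial training set
pool = list(range(50, 2000))   # unlabeled pool from the test distribution

clf = RandomForestClassifier(n_estimators=200, random_state=0)
for _ in range(10):                                  # a few AL iterations
    clf.fit(X[labeled], y[labeled])
    top2 = np.sort(clf.predict_proba(X[pool]), axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]                 # small margin = uncertain
    query = [pool[k] for k in np.argsort(margin)[:5]]
    labeled += query                                 # "manual follow-up" labels
    pool = [k for k in pool if k not in query]

clf.fit(X[labeled], y[labeled])
print(f"{len(labeled)} labeled; pool accuracy {clf.score(X[pool], y[pool]):.3f}")
```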
papers  active-learning  sampling-bias 
july 2017 by arsyed
Reporting bias inflates the reputation of medical treatments: A comparison of outcomes in clinical trials and online product reviews
- Why do people overestimate how much a medical treatment will benefit them?
- We compare average outcomes in clinical trials and online product reviews.
- Average outcomes in online reviews are much more positive.
- People who have good outcomes are more likely to write online reviews.
- Beliefs based on (electronic) word-of-mouth will be positively distorted (a toy simulation of this mechanism follows below).
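A minimal sketch of that selection mechanism, with made-up numbers: simulate an outcome for everyone, let the probability of writing a review rise with the outcome, and compare the population mean to the mean among reviewers.
```python
import numpy as np

# Toy simulation: everyone has a treatment outcome, but people with better
# outcomes are more likely to post a review, so the average posted review
# exceeds the trial-style population average.
rng = np.random.default_rng(0)
outcome = rng.normal(0.2, 1.0, 100_000)        # true benefit, mean 0.2
p_review = 1 / (1 + np.exp(-2 * outcome))      # better outcome -> more reviews
reviewed = rng.random(100_000) < p_review
print(f"population mean outcome: {outcome.mean():+.2f}")
print(f"mean among reviewers:    {outcome[reviewed].mean():+.2f}")  # inflated
```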
study  psychology  social-psych  medicine  replication  biases  epistemic  info-foraging  stylized-facts  sampling-bias  info-dynamics 
february 2017 by nhaliday
teaching - Intuitive explanation for dividing by $n-1$ when calculating standard deviation? - Cross Validated
The standard deviation calculated with a divisor of n-1 is a standard deviation calculated from the sample as an estimate of the standard deviation of the population from which the sample was drawn. Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population. Using n-1 instead of n as the divisor corrects for that by making the result a little bit bigger.

Note that the correction has a larger proportional effect when n is small than when it is large, which is what we want because when n is larger the sample mean is likely to be a good estimator of the population mean.

...

A common one is that the definition of variance (of a distribution) is the second moment recentered around a known, definite mean, whereas the estimator uses an estimated mean. This loss of a degree of freedom (given the mean, you can reconstitute the dataset with knowledge of just n−1 of the data values) requires the use of n−1 rather than n to "adjust" the result.
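A quick simulation of the point, assuming only NumPy: with small n, dividing squared deviations from the sample mean by n underestimates the population variance, and dividing by n−1 removes the bias.
```python
import numpy as np

# Repeatedly draw small samples from a distribution with variance 4 and
# compare the n-divisor and (n-1)-divisor variance estimates.
rng = np.random.default_rng(0)
n, trials = 5, 200_000
samples = rng.normal(0.0, 2.0, size=(trials, n))     # true variance = 4
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
print(ss.mean() / n)        # ~ 3.2, biased low by a factor (n-1)/n
print(ss.mean() / (n - 1))  # ~ 4.0, unbiased
```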
q-n-a  overflow  stats  acm  intuition  explanation  bias-variance  methodology  moments  nibble  degrees-of-freedom  sampling-bias  generalization  dimensionality  ground-up  intricacy 
january 2017 by nhaliday
political analysis | West Hunter
Just to make things clear, most political reporters are morons, nearly as bad as sports reporters. Mostly ugly cheerleaders for their side, rather than analysts. Uninteresting.

how to analyze polls:

Whoever is ahead in the polls at the time of the election is extremely likely to win. Talk about how Candidate X would have a ‘difficult path to 270 electoral votes’ when he’s up 2 points (for example) is pretty much horseshit. There are second-order considerations: you get more oomph per voter when the voter is in a small state, and you also want your votes distributed fairly evenly, so that you win states giving you a majority of electoral votes by a little rather than winning states giving you a minority of electoral votes by huge margins. Not that a candidate can do much about this, of course.

When you hear someone say that it’s really 50 state contests [more if you think about Maine and Nebraska], so you should pay attention to the state polls, not the national polls: also horseshit. In some sense, it is true – but when your national polls go up, so do your state polls – almost all of them, in practice. On election day, or just before, you want to consider national polls rather than state polls, because they are almost always more recent, therefore more accurate.

When should you trust an outlier poll, rather than the average: when you want to be wrong.

Money doesn’t help much. Political consultants will tell you that it does, but then they get 15% of ad buys.

A decent political reporter would actually go out and talk to people that aren’t exactly like him. Apparently this no longer happens.

All of these rules have exceptions – but if you understand those [rare] exceptions and can apply them, you’re paying too much attention to politics.
thinking  politics  media  data  street-fighting  poll  contrarianism  len:short  west-hunter  objektbuch  metameta  checklists  sampling-bias  outliers  descriptive  social-choice  gilens-page  elections  scitariat  money  null-result  polisci  incentives  stylized-facts  metabuch  chart  top-n  hi-order-bits  track-record  wonkish  data-science  tetlock  meta:prediction  info-foraging  civic  info-dynamics  interests 
september 2016 by nhaliday
Are You Living in a Computer Simulation?
Bostrom's anthropic arguments

https://www.jetpress.org/volume7/simulation.htm
In sum, if your descendants might make simulations of lives like yours, then you might be living in a simulation. And while you probably cannot learn much detail about the specific reasons for and nature of the simulation you live in, you can draw general conclusions by making analogies to the types and reasons of simulations today. If you might be living in a simulation then all else equal it seems that you should care less about others, live more for today, make your world look likely to become eventually rich, expect to and try to participate in pivotal events, be entertaining and praiseworthy, and keep the famous people around you happy and interested in you.

Theological Implications of the Simulation Argument: https://www.tandfonline.com/doi/pdf/10.1080/15665399.2010.10820012
Nick Bostrom’s Simulation Argument (SA) has many intriguing theological implications. We work out some of them here. We show how the SA can be used to develop novel versions of the Cosmological and Design Arguments. We then develop some of the affinities between Bostrom’s naturalistic theogony and more traditional theological topics. We look at the resurrection of the body and at theodicy. We conclude with some reflections on the relations between the SA and Neoplatonism (friendly) and between the SA and theism (less friendly).

https://www.gwern.net/Simulation-inferences
lesswrong  philosophy  weird  idk  thinking  insight  links  summary  rationality  ratty  bostrom  sampling-bias  anthropic  theos  simulation  hanson  decision-making  advice  mystic  time-preference  futurism  letters  entertainment  multi  morality  humility  hypocrisy  wealth  malthus  power  drama  gedanken  pdf  article  essay  religion  christianity  the-classics  big-peeps  iteration-recursion  aesthetics  nietzschean  axioms  gwern  analysis  realness  von-neumann  space  expansionism  duplication  spreading  sequential  cs  computation  outcome-risk  measurement  empirical  questions  bits  information-theory  efficiency  algorithms  physics  relativity  ems  neuro  data  scale  magnitude  complexity  risk  existence  threat-modeling  civilization  forms-instances 
september 2016 by nhaliday
[1405.0058] Underestimating extreme events in power-law behavior due to machine-dependent cutoffs
Power-law distributions are typical macroscopic features occurring in almost all complex systems observable in nature. As a result, researchers in quantitative analyses must often generate random synthetic variates obeying power-law distributions. The task is usually performed through standard methods that map uniform random variates into the desired probability space. Whereas all these algorithms are theoretically solid, in this paper we show that they are subject to severe machine-dependent limitations. As a result, two dramatic consequences arise: (i) the sampling in the tail of the distribution is not random but deterministic; (ii) the moments of the sample distribution, which are theoretically expected to diverge as functions of the sample sizes, converge instead to finite values. We provide quantitative indications for the range of distribution parameters that can be safely handled by standard libraries used in computational analyses. Whereas our findings indicate possible reinterpretations of numerical results obtained through flawed sampling methodologies, they also pave the way for the search for a concrete solution to this central issue shared by all quantitative sciences dealing with complexity.
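A hedged sketch of the effect for the standard inverse-transform method applied to a Pareto tail; all parameter choices are illustrative. Because the uniform variate is a double with finite resolution, the mapped power-law variates have a hard, machine-dependent maximum, and sample moments that should diverge stay finite.
```python
import numpy as np

# Inverse-transform sampling of a Pareto tail P(X > x) = (x/xmin)**(1-alpha):
# x = xmin * u**(-1/(alpha-1)) with u ~ Uniform(0, 1]. A 53-bit uniform
# cannot get closer to 0 than 2**-53, so the tail is deterministically
# truncated, and E[X^2] (which should diverge for alpha < 3) comes out finite.
alpha, xmin = 2.5, 1.0
rng = np.random.default_rng(0)
u = 1.0 - rng.random(10**6)                 # uniforms in (0, 1], never 0
x = xmin * u ** (-1.0 / (alpha - 1.0))      # inverse CDF of the Pareto tail

x_cutoff = xmin * (2.0**-53) ** (-1.0 / (alpha - 1.0))
print(f"largest sample:      {x.max():.3e}")
print(f"hard machine cutoff: {x_cutoff:.3e}")       # no draw can exceed this
print(f"sample E[X^2]:       {(x**2).mean():.3e}")  # finite despite alpha < 3
```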
power-laws  computer-science  approximation  sampling-bias  rather-interesting  machine-precision  nudge-targets  consider:stress-testing  consider:can-precision-errors-be-selected-for? 
november 2014 by Vaguery
Big data: are we making a big mistake? - FT.com
"Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.
Unfortunately, these four articles of faith are at best optimistic oversimplifications. At worst, according to David Spiegelhalter, Winton Professor of the Public Understanding of Risk at Cambridge university, they can be “complete bollocks. Absolute nonsense.”
"found data contain systematic biases and it takes careful thought to spot and correct for those biases."
bigdata  surveillance  statistics  sampling-bias  statistical-significance 
april 2014 by jschneider