nhaliday + linear-models   23

Accurate Genomic Prediction Of Human Height | bioRxiv
Stephen Hsu's compressed sensing application paper

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction.


I'm in Mountain View to give a talk at 23andMe. Their latest funding round was $250M on a (reported) valuation of $1.5B. If I just add up the Crunchbase numbers it looks like almost half a billion invested at this point...

Slides: Genomic Prediction of Complex Traits

Here's how people + robots handle your spit sample to produce a SNP genotype:

study  bio  preprint  GWAS  state-of-art  embodied  genetics  genomics  compressed-sensing  high-dimension  machine-learning  missing-heritability  hsu  scitariat  education  🌞  frontier  britain  regression  data  visualization  correlation  phase-transition  multi  commentary  summary  pdf  slides  brands  skunkworks  hard-tech  presentation  talks  methodology  intricacy  bioinformatics  scaling-up  stat-power  sparsity  norms  nibble  speedometer  stats  linear-models  2017  biodet 
september 2017 by nhaliday
Information Processing: Epistasis vs additivity
On epistasis: why it is unimportant in polygenic directional selection: http://rstb.royalsocietypublishing.org/content/365/1544/1241.short
- James F. Crow

The Evolution of Multilocus Systems Under Weak Selection: http://www.genetics.org/content/genetics/134/2/627.full.pdf
- Thomas Nagylaki

Data and Theory Point to Mainly Additive Genetic Variance for Complex Traits: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000008
The relative proportion of additive and non-additive variation for complex traits is important in evolutionary biology, medicine, and agriculture. We address a long-standing controversy and paradox about the contribution of non-additive genetic variation, namely that knowledge about biological pathways and gene networks imply that epistasis is important. Yet empirical data across a range of traits and species imply that most genetic variance is additive. We evaluate the evidence from empirical studies of genetic variance components and find that additive variance typically accounts for over half, and often close to 100%, of the total genetic variance. We present new theoretical results, based upon the distribution of allele frequencies under neutral and other population genetic models, that show why this is the case even if there are non-additive effects at the level of gene action. We conclude that interactions at the level of genes are not likely to generate much interaction at the level of variance.
hsu  scitariat  commentary  links  study  list  evolution  population-genetics  genetics  methodology  linearity  nonlinearity  comparison  scaling-up  nibble  lens  bounded-cognition  ideas  bio  occam  parsimony  🌞  summary  quotes  multi  org:nat  QTL  stylized-facts  article  explanans  sapiens  biodet  selection  variance-components  metabuch  thinking  models  data  deep-materialism  chart  behavioral-gen  evidence-based  empirical  mutation  spearhead  model-organism  bioinformatics  linear-models  math  magnitude  limits  physics  interdisciplinary  stat-mech 
february 2017 by nhaliday
The infinitesimal model | bioRxiv
Our focus here is on the infinitesimal model. In this model, one or several quantitative traits are described as the sum of a genetic and a non-genetic component, the first being distributed as a normal random variable centred at the average of the parental genetic components, and with a variance independent of the parental traits. We first review the long history of the infinitesimal model in quantitative genetics. Then we provide a definition of the model at the phenotypic level in terms of individual trait values and relationships between individuals, but including different evolutionary processes: genetic drift, recombination, selection, mutation, population structure, ... We give a range of examples of its application to evolutionary questions related to stabilising selection, assortative mating, effective population size and response to selection, habitat preference and speciation. We provide a mathematical justification of the model as the limit as the number M of underlying loci tends to infinity of a model with Mendelian inheritance, mutation and environmental noise, when the genetic component of the trait is purely additive. We also show how the model generalises to include epistatic effects. In each case, by conditioning on the pedigree relating individuals in the population, we incorporate arbitrary selection and population structure. We suppose that we can observe the pedigree up to the present generation, together with all the ancestral traits, and we show, in particular, that the genetic components of the individual trait values in the current generation are indeed normally distributed with a variance independent of ancestral traits, up to an error of order M^{-1/2}. Simulations suggest that in particular cases the convergence may be as fast as 1/M.

published version:
The infinitesimal model: Definition, derivation, and implications: https://sci-hub.tw/10.1016/j.tpb.2017.06.001

Commentary: Fisher’s infinitesimal model: A story for the ages: http://www.sciencedirect.com/science/article/pii/S0040580917301508?via%3Dihub
This commentary distinguishes three nested approximations, referred to as “infinitesimal genetics,” “Gaussian descendants” and “Gaussian population,” each plausibly called “the infinitesimal model.” The first and most basic is Fisher’s “infinitesimal” approximation of the underlying genetics – namely, many loci, each making a small contribution to the total variance. As Barton et al. (2017) show, in the limit as the number of loci increases (with enough additivity), the distribution of genotypic values for descendants approaches a multivariate Gaussian, whose variance–covariance structure depends only on the relatedness, not the phenotypes, of the parents (or whether their population experiences selection or other processes such as mutation and migration). Barton et al. (2017) call this rigorously defensible “Gaussian descendants” approximation “the infinitesimal model.” However, it is widely assumed that Fisher’s genetic assumptions yield another Gaussian approximation, in which the distribution of breeding values in a population follows a Gaussian — even if the population is subject to non-Gaussian selection. This third “Gaussian population” approximation, is also described as the “infinitesimal model.” Unlike the “Gaussian descendants” approximation, this third approximation cannot be rigorously justified, except in a weak-selection limit, even for a purely additive model. Nevertheless, it underlies the two most widely used descriptions of selection-induced changes in trait means and genetic variances, the “breeder’s equation” and the “Bulmer effect.” Future generations may understand why the “infinitesimal model” provides such useful approximations in the face of epistasis, linkage, linkage disequilibrium and strong selection.
study  exposition  bio  evolution  population-genetics  genetics  methodology  QTL  preprint  models  unit  len:long  nibble  linearity  nonlinearity  concentration-of-measure  limits  applications  🌞  biodet  oscillation  fisher  perturbation  stylized-facts  chart  ideas  article  pop-structure  multi  pdf  piracy  intricacy  map-territory  kinship  distribution  simulation  ground-up  linear-models  applicability-prereqs  bioinformatics 
january 2017 by nhaliday
Information Processing: Search results for compressed sensing
Added: Here are comments from "Donoho-Student":
Donoho-Student says:
September 14, 2017 at 8:27 pm GMT • 100 Words

The Donoho-Tanner transition describes the noise-free (h2=1) case, which has a direct analog in the geometry of polytopes.

The n = 30s result from Hsu et al. (specifically the value of the coefficient, 30, when p is the appropriate number of SNPs on an array and h2 = 0.5) is obtained via simulation using actual genome matrices, and is original to them. (There is no simple formula that gives this number.) The D-T transition had only been established in the past for certain classes of matrices, like random matrices with specific distributions. Those results cannot be immediately applied to genomes.

The estimate that s is (order of magnitude) 10k is also a key input.

I think Hsu refers to n = 1 million instead of 30 * 10k = 300k because the effective SNP heritability of IQ might be less than h2 = 0.5 — there is noise in the phenotype measurement, etc.

Donoho-Student says:
September 15, 2017 at 11:27 am GMT • 200 Words

Lasso is a common statistical method but most people who use it are not familiar with the mathematical theorems from compressed sensing. These results give performance guarantees and describe phase transition behavior, but because they are rigorous theorems they only apply to specific classes of sensor matrices, such as simple random matrices. Genomes have correlation structure, so the theorems do not directly apply to the real world case of interest, as is often true.

What the Hsu paper shows is that the exact D-T phase transition appears in the noiseless (h2 = 1) problem using genome matrices, and a smoothed version appears in the problem with realistic h2. These are new results, as is the prediction for how much data is required to cross the boundary. I don’t think most gwas people are familiar with these results. If they did understand the results they would fund/design adequately powered studies capable of solving lots of complex phenotypes, medical conditions as well as IQ, that have significant h2.

Most people who use lasso, as opposed to people who prove theorems, are not even aware of the D-T transition. Even most people who prove theorems have followed the Candes-Tao line of attack (restricted isometry property) and don’t think much about D-T. Although D eventually proved some things about the phase transition using high dimensional geometry, it was initially discovered via simulation using simple random matrices.
hsu  list  stream  genomics  genetics  concept  stats  methodology  scaling-up  scitariat  sparsity  regression  biodet  bioinformatics  norms  nibble  compressed-sensing  applications  search  ideas  multi  albion  behavioral-gen  iq  state-of-art  commentary  explanation  phase-transition  measurement  volo-avolo  regularization  levers  novelty  the-trenches  liner-notes  clarity  random-matrices  innovation  high-dimension  linear-models  grokkability-clarity 
november 2016 by nhaliday

bundles : acm

related tags

accretion  accuracy  acm  acmtariat  advanced  ai  albion  applicability-prereqs  applications  article  atoms  bayesian  behavioral-gen  best-practices  better-explained  bias-variance  bio  biodet  bioinformatics  boltzmann  books  bounded-cognition  brands  britain  c(pp)  chart  cheatsheet  checking  checklists  clarity  classification  columbia  commentary  comparison  compressed-sensing  computer-vision  concentration-of-measure  concept  concurrency  confluence  confusion  convexity-curvature  core-rats  correlation  course  critique  crypto  dan-luu  data  data-science  dbs  deep-learning  deep-materialism  distribution  documentation  dumb-ML  econometrics  economics  ecosystem  education  embodied  empirical  ensembles  evidence-based  evolution  explanans  explanation  exploratory  exposition  facebook  features  fisher  frontier  GCTA  generalization  generative  genetics  genomics  graphical-models  graphics  graphs  grokkability-clarity  ground-up  GWAS  hard-tech  hi-order-bits  high-dimension  homo-hetero  howto  hsu  human-capital  hypothesis-testing  ideas  iidness  init  innovation  interdisciplinary  interface  interface-compatibility  intricacy  iq  kernels  kinship  labor  language  learning-theory  lecture-notes  left-wing  len:long  len:short  lens  levers  libraries  limits  linear-algebra  linear-models  linearity  liner-notes  links  list  machine-learning  magnitude  map-territory  markov  martingale  math  matrix-factorization  measurement  metabuch  methodology  missing-heritability  mit  ML-MAP-E  model-class  model-organism  models  moments  monte-carlo  multi  multiplicative  mutation  networking  nibble  nitty-gritty  nlp  no-go  nonlinearity  nonparametric  norms  novelty  numerics  occam  optimization  org:bleg  org:lite  org:nat  oscillation  oss  outliers  overflow  p:*  p:someday  p:whenever  parsimony  pdf  perturbation  phase-transition  physics  piracy  pls  pop-structure  population-genetics  positivity  ppl  preprint  presentation  programming  protocol-metadata  q-n-a  QTL  quixotic  quotes  rand-approx  random  random-matrices  ratty  reading  recommendations  reference  regression  regularization  rhetoric  roadmap  robust  s:*  sapiens  scaling-up  sci-comp  science  scitariat  search  selection  signum  simulation  skunkworks  slides  sparsity  spearhead  speedometer  stanford  stat-mech  stat-power  state-of-art  stats  stream  street-fighting  structure  study  stylized-facts  summary  symmetry  synthesis  systems  talks  teaching  techtariat  the-trenches  thinking  time-series  top-n  topics  tutorial  unaffiliated  unit  variance-components  visualization  volo-avolo  wiki  🌞 

Copy this bookmark: