nhaliday + high-dimension   30

Accurate Genomic Prediction Of Human Height | bioRxiv
Stephen Hsu's compressed sensing application paper

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction.
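
A quick consistency check on the quoted numbers (not from the paper; the function names are illustrative): for a linear predictor, the fraction of variance captured equals the squared correlation between predicted and actual values, so a correlation of ~0.65 for height lines up with roughly 40 percent of variance.

```python
import math

def variance_explained(correlation):
    """Fraction of variance captured by a linear predictor with this correlation."""
    return correlation ** 2

def correlation_from_variance(fraction):
    """Correlation implied by a given fraction of variance explained."""
    return math.sqrt(fraction)

# Height: the abstract quotes a correlation of ~0.65.
print("height r^2:", round(variance_explained(0.65), 2))

# Implied correlations for the three quoted variance fractions.
for trait, fraction in [("height", 0.40),
                        ("heel bone density", 0.20),
                        ("educational attainment", 0.09)]:
    print(trait, round(correlation_from_variance(fraction), 2))
```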


I'm in Mountain View to give a talk at 23andMe. Their latest funding round was $250M on a (reported) valuation of $1.5B. If I just add up the Crunchbase numbers it looks like almost half a billion invested at this point...

Slides: Genomic Prediction of Complex Traits

Here's how people + robots handle your spit sample to produce a SNP genotype:

study  bio  preprint  GWAS  state-of-art  embodied  genetics  genomics  compressed-sensing  high-dimension  machine-learning  missing-heritability  hsu  scitariat  education  🌞  frontier  britain  regression  data  visualization  correlation  phase-transition  multi  commentary  summary  pdf  slides  brands  skunkworks  hard-tech  presentation  talks  methodology  intricacy  bioinformatics  scaling-up  stat-power  sparsity  norms  nibble  speedometer  stats  linear-models  2017  biodet 
september 2017 by nhaliday
Overcoming Bias : High Dimensional Societes?
I’ve seen many “spatial” models in social science. Such as models where voters and politicians sit at points in a space of policies. Or where customers and firms sit at points in a space of products. But I’ve never seen a discussion of how one should expect such models to change in high dimensions, such as when there are more dimensions than points.

In small dimensional spaces, the distances between points vary greatly; neighboring points are much closer to each other than are distant points. However, in high dimensional spaces, distances between points vary much less; all points are about the same distance from all other points. When points are distributed randomly, however, these distances do vary somewhat, allowing us to define the few points closest to each point as that point’s “neighbors”. “Hubs” are closest neighbors to many more points than average, while “anti-hubs” are closest neighbors to many fewer points than average. It turns out that in higher dimensions a larger fraction of points are hubs and anti-hubs (Zimek et al. 2012).
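
A small simulation sketch of the two effects described above, assuming points drawn uniformly from the unit cube (the post does not specify a distribution; the function names and sample sizes are illustrative). It reports how much pairwise distances vary relative to their mean, and how unevenly points appear in each other's k-nearest-neighbour lists.

```python
import math
import random

def sample_points(n_points, dim, rng):
    """Draw points uniformly from the unit cube [0, 1]^dim."""
    return [[rng.random() for _ in range(dim)] for _ in range(n_points)]

def distance_matrix(points):
    """Symmetric matrix of Euclidean distances between all pairs."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist

def relative_spread(dist):
    """(max - min) / mean over all distinct pairs; small means 'all about equally far'."""
    n = len(dist)
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    return (max(pairs) - min(pairs)) / (sum(pairs) / len(pairs))

def in_degrees(dist, k):
    """How often each point appears among the k nearest neighbours of another point."""
    n = len(dist)
    counts = [0] * n
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            counts[j] += 1
    return counts

def summarize(n_points, dim, k=5, seed=0):
    rng = random.Random(seed)
    dist = distance_matrix(sample_points(n_points, dim, rng))
    counts = in_degrees(dist, k)
    return {
        "spread": relative_spread(dist),
        "max_in_degree": max(counts),               # the biggest "hub"
        "anti_hubs": sum(c == 0 for c in counts),   # points nobody lists as a neighbour
    }

for dim in (2, 20, 200):
    print(dim, summarize(100, dim))
```

The spread figure is the robust one; the hub and anti-hub counts are noisier at this sample size and depend on the choice of k.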

If we think of people or organizations as such points, is being a hub or anti-hub associated with any distinct social behavior? Does it contribute substantially to being popular or unpopular? Or does the fact that real people and organizations are in fact distributed in real space overwhelm such things, which only happen in a truly high dimensional social world?
ratty  hanson  speculation  ideas  thinking  spatial  dimensionality  high-dimension  homo-hetero  analogy  models  network-structure  degrees-of-freedom 
july 2017 by nhaliday
Genomic analysis of family data reveals additional genetic effects on intelligence and personality | bioRxiv
Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003520
Pedigree- and SNP-Associated Genetics and Recent Environment are the Major Contributors to Anthropometric and Cardiometabolic Trait Variation: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1005804

Missing Heritability – found?: https://westhunt.wordpress.com/2017/02/09/missing-heritability-found/
There is an interesting new paper out on genetics and IQ. The claim is that they have found the missing heritability – in rare variants, generally different in each family.

Some of the variants, the ones we find with GWAS, are fairly common and fitness-neutral: the variant that slightly increases IQ confers the same fitness (or very close to the same) as the one that slightly decreases IQ – presumably because of other effects it has. If this weren’t the case, it would be impossible for both of the variants to remain common.

The rare variants that affect IQ will generally decrease IQ – and since pleiotropy is the norm, usually they’ll be deleterious in other ways as well. Genetic load.

Happy families are all alike; every unhappy family is unhappy in its own way.: https://westhunt.wordpress.com/2017/06/06/happy-families-are-all-alike-every-unhappy-family-is-unhappy-in-its-own-way/
It now looks as if the majority of the genetic variance in IQ is the product of mutational load, and the same may be true for many psychological traits. To the extent this is the case, a lot of human psychological variation must be non-adaptive. Maybe some personality variation fulfills an evolutionary function, but a lot does not. Being a dumb asshole may be a bug, rather than a feature. More generally, this kind of analysis could show us whether particular low-fitness syndromes, like autism, were ever strategies – I suspect not.

It’s bad news for medicine and psychiatry, though. It would suggest that what we call a given type of mental illness, like schizophrenia, is really a grab-bag of many different syndromes. The ultimate causes are extremely varied: at best, there may be shared intermediate causal factors. Not good news for drug development: individualized medicine is a threat, not a promise.

see also comment at: https://pinboard.in/u:nhaliday/b:a6ab4034b0d0

So the big implication here is that it's better than I had dared hope - like Yang/Visscher/Hsu have argued, the old GCTA estimate of ~0.3 is indeed a rather loose lower bound on additive genetic variants, and the rest of the missing heritability is just the relatively uncommon additive variants (ie <1% frequency), and so, like Yang demonstrated with height, using much more comprehensive imputation of SNP scores or using whole-genomes will be able to explain almost all of the genetic contribution. In other words, with better imputation panels, we can go back and squeeze out better polygenic scores from old GWASes, new GWASes will be able to reach and break the 0.3 upper bound, and eventually we can feasibly predict 0.5-0.8. Between the expanding sample sizes from biobanks, the still-falling price of whole genomes, the gradual development of better regression methods (informative priors, biological annotation information, networks, genetic correlations), and better imputation, the future of GWAS polygenic scores is bright. Which obviously will be extremely helpful for embryo selection/genome synthesis.

The argument that this supports mutation-selection balance is weaker but plausible. I hope that it's true, because if that's why there is so much genetic variation in intelligence, then that strongly encourages genetic engineering - there is no good reason or Chesterton fence for intelligence variants being non-fixed, it's just that evolution is too slow to purge the constantly-accumulating bad variants. And we can do better.

The surprising implications of familial association in disease risk: https://arxiv.org/abs/1707.00014
As Greg Cochran has pointed out, this probably isn’t going to work. There are a few genes like BRCA1 (which makes you more likely to get breast and ovarian cancer) that we can detect and might affect treatment, but an awful lot of disease turns out to be just the result of random chance and deleterious mutation. This means that you can’t easily tailor disease treatment to people’s genes, because everybody is fucked up in their own special way. If Johnny is schizophrenic because of 100 random errors in the genes that code for his neurons, and Jack is schizophrenic because of 100 other random errors, there’s very little way to test a drug to work for either of them- they’re the only one in the world, most likely, with that specific pattern of errors. This is, presumably why the incidence of schizophrenia and autism rises in populations when dads get older- more random errors in sperm formation mean more random errors in the baby’s genes, and more things that go wrong down the line.

The looming crisis in human genetics: http://www.economist.com/node/14742737
Some awkward news ahead
- Geoffrey Miller

Human geneticists have reached a private crisis of conscience, and it will become public knowledge in 2010. The crisis has depressing health implications and alarming political ones. In a nutshell: the new genetics will reveal much less than hoped about how to cure disease, and much more than feared about human evolution and inequality, including genetic differences between classes, ethnicities and races.

study  preprint  bio  biodet  behavioral-gen  GWAS  missing-heritability  QTL  🌞  scaling-up  replication  iq  education  spearhead  sib-study  multi  west-hunter  scitariat  genetic-load  mutation  medicine  meta:medicine  stylized-facts  ratty  unaffiliated  commentary  rhetoric  wonkish  genetics  genomics  race  pop-structure  poast  population-genetics  psychiatry  aphorism  homo-hetero  generalization  scale  state-of-art  ssc  reddit  social  summary  gwern  methodology  personality  britain  anglo  enhancement  roots  s:*  2017  data  visualization  database  let-me-see  bioinformatics  news  org:rec  org:anglo  org:biz  track-record  prediction  identity-politics  pop-diff  recent-selection  westminster  inequality  egalitarianism-hierarchy  high-dimension  applications  dimensionality  ideas  no-go  volo-avolo  magnitude  variance-components  GCTA  tradeoffs  counter-revolution  org:mat  dysgenics  paternal-age  distribution  chart  abortion-contraception-embryo 
june 2017 by nhaliday
Dvoretzky's theorem - Wikipedia
In mathematics, Dvoretzky's theorem is an important structural theorem about normed vector spaces proved by Aryeh Dvoretzky in the early 1960s, answering a question of Alexander Grothendieck. In essence, it says that every sufficiently high-dimensional normed vector space will have low-dimensional subspaces that are approximately Euclidean. Equivalently, every high-dimensional bounded symmetric convex set has low-dimensional sections that are approximately ellipsoids.

math  math.FA  inner-product  levers  characterization  geometry  math.MG  concentration-of-measure  multi  q-n-a  overflow  intuition  examples  proofs  dimensionality  gowers  mathtariat  tcstariat  quantum  quantum-info  norms  nibble  high-dimension  wiki  reference  curvature  convexity-curvature  tcs 
january 2017 by nhaliday
Science Policy | West Hunter
If my 23andme profile revealed that I was the last of the Plantagenets (as some suspect), and therefore rightfully King of the United States and Defender of Mexico, and I asked you for a general view of the right approach to science and technology – where the most promise is, what should be done, etc – what would you say?

genetically personalized medicine: https://westhunt.wordpress.com/2016/12/08/science-policy/#comment-85698
I have no idea how personalized medicine is supposed to work. Suppose that we sequence your entire genome, and then we intend to tailor a therapeutic approach to your genome.

How do we test it? By trying it on a bunch of genetically similar people? The more genetic details we take into account, the smaller that class is. It could easily become so small that it would be difficult to recruit enough people for a reasonable statistical trial. Second, the more details we take into account, the smaller the class that benefits from the whole testing process – which as far as I can see, is just as expensive as conventional Phase I/II etc. trials.

What am I missing?

Now if you are a forethoughtful trillionaire, sure: you manufacture lots of clones just to test therapies you might someday need, and cost is no object.

I think I can see ways you could make it work tho [edit: what did I mean by this?...damnit]
west-hunter  discussion  politics  government  policy  science  technology  the-world-is-just-atoms  🔬  scitariat  meta:science  proposal  genetics  genomics  medicine  meta:medicine  multi  ideas  counter-revolution  poast  homo-hetero  generalization  scale  antidemos  alt-inst  applications  dimensionality  high-dimension  bioinformatics  no-go  volo-avolo  magnitude  trump  2016-election  questions 
december 2016 by nhaliday
gt.geometric topology - Intuitive crutches for higher dimensional thinking - MathOverflow
Terry Tao:
I can't help you much with high-dimensional topology - it's not my field, and I've not picked up the various tricks topologists use to get a grip on the subject - but when dealing with the geometry of high-dimensional (or infinite-dimensional) vector spaces such as R^n, there are plenty of ways to conceptualise these spaces that do not require visualising more than three dimensions directly.

For instance, one can view a high-dimensional vector space as a state space for a system with many degrees of freedom. A megapixel image, for instance, is a point in a million-dimensional vector space; by varying the image, one can explore the space, and various subsets of this space correspond to various classes of images.

One can similarly interpret sound waves, a box of gases, an ecosystem, a voting population, a stream of digital data, trials of random variables, the results of a statistical survey, a probabilistic strategy in a two-player game, and many other concrete objects as states in a high-dimensional vector space, and various basic concepts such as convexity, distance, linearity, change of variables, orthogonality, or inner product can have very natural meanings in some of these models (though not in all).

It can take a bit of both theory and practice to merge one's intuition for these things with one's spatial intuition for vectors and vector spaces, but it can be done eventually (much as after one has enough exposure to measure theory, one can start merging one's intuition regarding cardinality, mass, length, volume, probability, cost, charge, and any number of other "real-life" measures).

For instance, the fact that most of the mass of a unit ball in high dimensions lurks near the boundary of the ball can be interpreted as a manifestation of the law of large numbers, using the interpretation of a high-dimensional vector space as the state space for a large number of trials of a random variable.
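
A minimal sketch of that claim, with illustrative function names. Volume in n dimensions scales as r^n, so the fraction of the unit ball within eps of its boundary is 1 - (1 - eps)^n. The second function illustrates the law-of-large-numbers reading on a related object, a random point of the cube [-1, 1]^n (an assumption made here for simplicity, not Tao's example): the average squared coordinate settles near 1/3, so such points sit close to a sphere of radius sqrt(n/3).

```python
import random

def shell_fraction(dim, eps):
    """Fraction of the unit ball's volume lying within eps of the boundary."""
    return 1.0 - (1.0 - eps) ** dim

def mean_square_coordinate(dim, seed=0):
    """Average of x_i^2 for one random point with coordinates uniform on [-1, 1]."""
    rng = random.Random(seed)
    return sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(dim)) / dim

for dim in (3, 100, 1000):
    print(dim, shell_fraction(dim, 0.01), mean_square_coordinate(dim))
```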

More generally, many facts about low-dimensional projections or slices of high-dimensional objects can be viewed from a probabilistic, statistical, or signal processing perspective.

Scott Aaronson:
Here are some of the crutches I've relied on. (Admittedly, my crutches are probably much more useful for theoretical computer science, combinatorics, and probability than they are for geometry, topology, or physics. On a related note, I personally have a much easier time thinking about R^n than about, say, R^4 or R^5!)

1. If you're trying to visualize some 4D phenomenon P, first think of a related 3D phenomenon P', and then imagine yourself as a 2D being who's trying to visualize P'. The advantage is that, unlike with the 4D vs. 3D case, you yourself can easily switch between the 3D and 2D perspectives, and can therefore get a sense of exactly what information is being lost when you drop a dimension. (You could call this the "Flatland trick," after the most famous literary work to rely on it.)
2. As someone else mentioned, discretize! Instead of thinking about R^n, think about the Boolean hypercube {0,1}^n, which is finite and usually easier to get intuition about. (When working on problems, I often find myself drawing {0,1}^4 on a sheet of paper by drawing two copies of {0,1}^3 and then connecting the corresponding vertices.)
3. Instead of thinking about a subset S⊆R^n, think about its characteristic function f:R^n→{0,1}. I don't know why that trivial perspective switch makes such a big difference, but it does ... maybe because it shifts your attention to the process of computing f, and makes you forget about the hopeless task of visualizing S!
4. One of the central facts about R^n is that, while it has "room" for only n orthogonal vectors, it has room for exp(n) almost-orthogonal vectors. Internalize that one fact, and so many other properties of R^n (for example, that the n-sphere resembles a "ball with spikes sticking out," as someone mentioned before) will suddenly seem non-mysterious. In turn, one way to internalize the fact that R^n has so many almost-orthogonal vectors is to internalize Shannon's theorem that there exist good error-correcting codes.
5. To get a feel for some high-dimensional object, ask questions about the behavior of a process that takes place on that object. For example: if I drop a ball here, which local minimum will it settle into? How long does this random walk on {0,1}^n take to mix?
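
Two of the crutches above (items 2 and 4) can be sketched in a few lines; the names and sizes below are illustrative, not from the answer. `hypercube` builds {0,1}^n the way it is drawn on paper: two copies of {0,1}^(n-1) with corresponding vertices joined. `max_abs_cosine` draws more random ±1 vectors than there are dimensions and reports the largest |cosine| between any pair; each vector is stored as an integer bit pattern, so the inner product is dim minus twice the Hamming distance, which is also the link to error-correcting codes.

```python
import random
from itertools import combinations

def hypercube(n):
    """Vertices and edges of {0,1}^n, built from two copies of {0,1}^(n-1)."""
    if n == 0:
        return [()], []
    verts, edges = hypercube(n - 1)
    lower = [(0,) + v for v in verts]
    upper = [(1,) + v for v in verts]
    new_edges = [((0,) + a, (0,) + b) for a, b in edges]   # first copy
    new_edges += [((1,) + a, (1,) + b) for a, b in edges]  # second copy
    new_edges += list(zip(lower, upper))                   # join corresponding vertices
    return lower + upper, new_edges

def max_abs_cosine(n_vectors, dim, seed=0):
    """Largest |cosine| between any two of n_vectors random +/-1 vectors in R^dim."""
    rng = random.Random(seed)
    # Each vector is a dim-bit integer: a set bit stands for -1, a clear bit for +1.
    vecs = [rng.getrandbits(dim) for _ in range(n_vectors)]
    worst = 0
    for a, b in combinations(vecs, 2):
        hamming = bin(a ^ b).count("1")   # coordinates where the signs differ
        dot = dim - 2 * hamming           # agreements minus disagreements
        worst = max(worst, abs(dot))
    return worst / dim                    # both vectors have norm sqrt(dim)

verts, edges = hypercube(4)
print(len(verts), "vertices,", len(edges), "edges")
print("600 vectors in R^512, max |cos|:", max_abs_cosine(600, 512))
```

Only 512 exactly orthogonal vectors fit in R^512, yet 600 random sign vectors all come out far from parallel; the number that can be packed this way grows exponentially with the dimension.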

Gil Kalai:
This is a slightly different point, but Vitali Milman, who works in high-dimensional convexity, likes to draw high-dimensional convex bodies in a non-convex way. This is to convey the point that if you take the convex hull of a few points on the unit sphere of R^n, then for large n very little of the measure of the convex body is anywhere near the corners, so in a certain sense the body is a bit like a small sphere with long thin "spikes".
q-n-a  intuition  math  visual-understanding  list  discussion  thurston  tidbits  aaronson  tcs  geometry  problem-solving  yoga  👳  big-list  metabuch  tcstariat  gowers  mathtariat  acm  overflow  soft-question  levers  dimensionality  hi-order-bits  insight  synthesis  thinking  models  cartoons  coding-theory  information-theory  probability  concentration-of-measure  magnitude  linear-algebra  boolean-analysis  analogy  arrows  lifts-projections  measure  markov  sampling  shannon  conceptual-vocab  nibble  degrees-of-freedom  worrydream  neurons  retrofit  oscillation  paradox  novelty  tricki  concrete  high-dimension  s:***  manifolds  direction  curvature  convexity-curvature 
december 2016 by nhaliday
Information Processing: Search results for compressed sensing
Added: Here are comments from "Donoho-Student":
Donoho-Student says:
September 14, 2017 at 8:27 pm GMT • 100 Words

The Donoho-Tanner transition describes the noise-free (h2=1) case, which has a direct analog in the geometry of polytopes.

The n = 30s result from Hsu et al. (specifically the value of the coefficient, 30, when p is the appropriate number of SNPs on an array and h2 = 0.5) is obtained via simulation using actual genome matrices, and is original to them. (There is no simple formula that gives this number.) The D-T transition had only been established in the past for certain classes of matrices, like random matrices with specific distributions. Those results cannot be immediately applied to genomes.

The estimate that s is (order of magnitude) 10k is also a key input.

I think Hsu refers to n = 1 million instead of 30 * 10k = 300k because the effective SNP heritability of IQ might be less than h2 = 0.5 — there is noise in the phenotype measurement, etc.
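
The arithmetic in this comment, spelled out. The function names are illustrative; the coefficient 30 is the value the comment quotes for h2 = 0.5, and the ratio computed for n = 1 million is only what that figure would imply, not a number stated in the comment.

```python
def required_samples(sparsity, coefficient=30):
    """Sample size n = coefficient * s, using the quoted coefficient for h2 = 0.5."""
    return coefficient * sparsity

def implied_coefficient(n_samples, sparsity):
    """Ratio n / s that a given sample size corresponds to."""
    return n_samples / sparsity

print(required_samples(10_000))                 # 300000
print(implied_coefficient(1_000_000, 10_000))   # 100.0
```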

Donoho-Student says:
September 15, 2017 at 11:27 am GMT • 200 Words

Lasso is a common statistical method but most people who use it are not familiar with the mathematical theorems from compressed sensing. These results give performance guarantees and describe phase transition behavior, but because they are rigorous theorems they only apply to specific classes of sensor matrices, such as simple random matrices. Genomes have correlation structure, so the theorems do not directly apply to the real world case of interest, as is often true.

What the Hsu paper shows is that the exact D-T phase transition appears in the noiseless (h2 = 1) problem using genome matrices, and a smoothed version appears in the problem with realistic h2. These are new results, as is the prediction for how much data is required to cross the boundary. I don’t think most gwas people are familiar with these results. If they did understand the results they would fund/design adequately powered studies capable of solving lots of complex phenotypes, medical conditions as well as IQ, that have significant h2.

Most people who use lasso, as opposed to people who prove theorems, are not even aware of the D-T transition. Even most people who prove theorems have followed the Candes-Tao line of attack (restricted isometry property) and don’t think much about D-T. Although D eventually proved some things about the phase transition using high dimensional geometry, it was initially discovered via simulation using simple random matrices.
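
A toy sketch of the kind of simulation described here, under loudly stated assumptions: a simple Gaussian random matrix (not a genome matrix), no noise (the h2 = 1 case), and small made-up sizes (p = 60 unknowns, s = 3 nonzero effects). It is not the method or code of the Hsu paper. Lasso is solved by plain coordinate descent along a short warm-start path of decreasing penalties, and the relative recovery error is printed as the number of measurements n grows.

```python
import math
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def coordinate_descent(cols, y, lam, x, n_sweeps):
    """Minimise (1/2n)||y - Ax||^2 + lam*||x||_1 starting from x (updated in place)."""
    n = len(y)
    col_sq = [sum(a * a for a in col) / n for col in cols]
    residual = list(y)
    for j, col in enumerate(cols):        # residual = y - A x for the starting x
        if x[j] != 0.0:
            for i in range(n):
                residual[i] -= x[j] * col[i]
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            old = x[j]
            rho = sum(a * r for a, r in zip(col, residual)) / n + col_sq[j] * old
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != old:
                for i in range(n):
                    residual[i] -= (new - old) * col[i]
                x[j] = new
    return x

def recovery_error(n, p=60, s=3, lam_final=0.01, seed=0):
    """Relative L2 error of lasso recovery from n noiseless Gaussian measurements."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]  # columns of A
    x_true = [0.0] * p
    for j in rng.sample(range(p), s):
        x_true[j] = rng.choice((-1.0, 1.0))
    y = [sum(cols[j][i] * x_true[j] for j in range(p)) for i in range(n)]
    x = [0.0] * p
    lam = lam_final * 2 ** 7              # warm-start path: 8 stages, halving lam
    for _ in range(8):
        coordinate_descent(cols, y, lam, x, n_sweeps=50)
        lam /= 2
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_true)))
    return err / math.sqrt(s)             # ||x_true|| = sqrt(s) for +/-1 entries

for n in (6, 12, 24, 40):
    print(n, round(recovery_error(n), 3))
```

With too few measurements the error stays large; with enough it drops to roughly the size of the penalty. Locating the actual transition boundary would need many random trials per (n, s) pair, which this sketch does not attempt.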
hsu  list  stream  genomics  genetics  concept  stats  methodology  scaling-up  scitariat  sparsity  regression  biodet  bioinformatics  norms  nibble  compressed-sensing  applications  search  ideas  multi  albion  behavioral-gen  iq  state-of-art  commentary  explanation  phase-transition  measurement  volo-avolo  regularization  levers  novelty  the-trenches  liner-notes  clarity  random-matrices  innovation  high-dimension  linear-models 
november 2016 by nhaliday
Talagrand’s concentration inequality | What's new
Proposition 1 follows easily from the following statement, that asserts that if a convex set $A \subset {\bf R}^n$ occupies a non-trivial fraction of the cube $\{-1,+1\}^n$, then the neighbourhood $A_t := \{ x \in {\bf R}^n: \hbox{dist}(x,A) \leq t \}$ will occupy almost all of the cube for $t \gg 1$:
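
A rough numerical illustration of this kind of statement, for one hand-picked convex set only (an assumption made here, not an example from the post): the half-space A = {x : x_1 + ... + x_n <= 0}, which contains at least half of the cube's vertices. The distance from a vertex x to A is max(0, x_1 + ... + x_n) / sqrt(n), so the fraction of vertices inside A_t can be estimated by sampling. The inequality itself covers arbitrary convex sets, which this sketch does not test.

```python
import math
import random

def fraction_within(n, t, n_samples=20000, seed=0):
    """Estimated fraction of vertices of {-1,+1}^n within distance t of the half-space A."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        # A random vertex: each set bit stands for a coordinate equal to -1.
        minus_ones = bin(rng.getrandbits(n)).count("1")
        coordinate_sum = n - 2 * minus_ones
        distance = max(0, coordinate_sum) / math.sqrt(n)
        if distance <= t:
            hits += 1
    return hits / n_samples

for n in (100, 400, 1600):
    print(n, [fraction_within(n, t) for t in (0.0, 1.0, 2.0, 3.0)])
```

The cube has diameter 2*sqrt(n), yet a neighbourhood of fixed width t around A already covers nearly all vertices, and the width needed does not grow with n.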
exposition  math.CA  math  gowers  concentration-of-measure  mathtariat  random-matrices  levers  estimate  probability  math.MG  geometry  boolean-analysis  nibble  org:bleg  high-dimension  p:whenever  dimensionality  curvature  convexity-curvature 
may 2016 by nhaliday

