**nhaliday + acm + explanation**
52

An Outsider's Tour of Reinforcement Learning – arg min blog

acmtariat ben-recht org:bleg nibble exposition explanation expert-experience tutorial guide yoga reinforcement optimization linear-algebra model-class atoms concept signal-noise iteration-recursion volo-avolo benchmarks deep-learning unsupervised thinking descriptive values gradient-descent acm decision-theory decision-making math.DS sequential random search realness hi-order-bits synthesis coarse-fine bare-hands openai replication linearity nonlinearity research

april 2018 by nhaliday


Prisoner's dilemma - Wikipedia

march 2018 by nhaliday

caveat to result below:

An extension of the IPD is an evolutionary stochastic IPD, in which the relative abundance of particular strategies is allowed to change, with more successful strategies relatively increasing. This process may be accomplished by having less successful players imitate the more successful strategies, or by eliminating less successful players from the game, while multiplying the more successful ones. It has been shown that unfair ZD strategies are not evolutionarily stable. The key intuition is that an evolutionarily stable strategy must not only be able to invade another population (which extortionary ZD strategies can do) but must also perform well against other players of the same type (which extortionary ZD players do poorly, because they reduce each other's surplus).[14]

Theory and simulations confirm that beyond a critical population size, ZD extortion loses out in evolutionary competition against more cooperative strategies, and as a result, the average payoff in the population increases when the population is bigger. In addition, there are some cases in which extortioners may even catalyze cooperation by helping to break out of a face-off between uniform defectors and win–stay, lose–switch agents.[8]
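A rough numerical sketch of the two claims above (an extortionate ZD player out-earns an ordinary opponent, but two extortioners drag each other down to mutual defection). Assumptions not taken from the excerpt: the conventional payoffs (R, S, T, P) = (3, 0, 5, 1), the Press and Dyson parametrisation with extortion factor chi = 3 and phi = 1/26, a made-up memory-one opponent `q`, and the helper names themselves. Long-run payoffs are approximated by iterating the four-state Markov chain over last-round outcomes.

```python
# Conventional prisoner's dilemma payoffs (assumed, not stated in the excerpt).
R, S, T, P = 3, 0, 5, 1

def extortion_strategy(chi, phi):
    """Cooperation probabilities for player X after outcomes (CC, CD, DC, DD),
    each outcome written as (X's move, Y's move).
    Built so that (p1-1, p2-1, p3, p4) = phi * ((S_X - P) - chi * (S_Y - P))."""
    payoff_x = (R, S, T, P)
    payoff_y = (R, T, S, P)
    tilde = [phi * ((a - P) - chi * (b - P)) for a, b in zip(payoff_x, payoff_y)]
    return (1 + tilde[0], 1 + tilde[1], tilde[2], tilde[3])

def long_run_payoffs(p, q, rounds=5000):
    """Approximate average payoffs of two memory-one strategies by pushing a
    distribution over (CC, CD, DC, DD) through the Markov chain repeatedly."""
    swap = (0, 2, 1, 3)  # Y sees X's CD as DC and vice versa
    v = [0.25] * 4
    for _ in range(rounds):
        nxt = [0.0] * 4
        for state, mass in enumerate(v):
            px, qy = p[state], q[swap[state]]
            nxt[0] += mass * px * qy              # both cooperate
            nxt[1] += mass * px * (1 - qy)        # X cooperates, Y defects
            nxt[2] += mass * (1 - px) * qy        # X defects, Y cooperates
            nxt[3] += mass * (1 - px) * (1 - qy)  # both defect
        v = nxt
    s_x = sum(m * pay for m, pay in zip(v, (R, S, T, P)))
    s_y = sum(m * pay for m, pay in zip(v, (R, T, S, P)))
    return s_x, s_y

if __name__ == "__main__":
    extort = extortion_strategy(chi=3, phi=1 / 26)
    opponent = (0.9, 0.5, 0.3, 0.2)  # hypothetical ordinary opponent
    print("extortioner vs opponent:", long_run_payoffs(extort, opponent))
    print("extortioner vs extortioner:", long_run_payoffs(extort, extort))
```

Against the ordinary opponent the surplus over P should split 3 to 1 in the extortioner's favour; in self-play the chain is absorbed into mutual defection, which is the "reduce each other's surplus" point.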

https://alfanl.com/2018/04/12/defection/

Nature boils down to a few simple concepts.

Haters will point out that I oversimplify. The haters are wrong. I am good at saying a lot with few words. Nature indeed boils down to a few simple concepts.

In life, you can either cooperate or defect.

Used to be that defection was the dominant strategy, say in the time when the Roman empire started to crumble. Everybody complained about everybody and in the end nothing got done. Then came Jesus, who told people to be loving and cooperative, and boom: 1800 years later we get the industrial revolution.

Because of Jesus we now find ourselves in a situation where cooperation is the dominant strategy. A normie engages in a ton of cooperation: with the tax collector who wants more and more of his money, with schools who want more and more of his kid’s time, with media who wants him to repeat more and more party lines, with the Zeitgeist of the Collective Spirit of the People’s Progress Towards a New Utopia. Essentially, our normie is cooperating himself into a crumbling Western empire.

Turns out that if everyone blindly cooperates, parasites sprout up like weeds until defection once again becomes the standard.

The point of a post-Christian religion is to once again create conditions for the kind of cooperation that led to the industrial revolution. This necessitates throwing out undead Christianity: you do not blindly cooperate. You cooperate with people that cooperate with you, you defect on people that defect on you. Christianity mixed with Darwinism. God and Gnon meet.

This also means we re-establish spiritual hierarchy, which, like regular hierarchy, is a prerequisite for cooperation. It is this hierarchical cooperation that turns a household into a force to be reckoned with, that allows a group of men to unite as a front against their enemies, that allows a tribe to conquer the world. Remember: Scientology bullied the Cathedral’s tax department into submission.

With a functioning hierarchy, men still gossip, lie and scheme, but they will do so in whispers behind closed doors. In your face they cooperate and contribute to the group’s wellbeing because incentives are thus that contributing to group wellbeing heightens status.

Without a functioning hierarchy, men gossip, lie and scheme, but they do so in your face, and they tell you that you are positively deluded for accusing them of gossiping, lying and scheming. Seeds will not sprout in such ground.

Spiritual dominance is established in the same way any sort of dominance is established: fought for, taken. But the fight is ritualistic. You can’t force spiritual dominance if no one listens, or if you are silenced the ritual is not allowed to happen.

If one of our priests is forbidden from establishing spiritual dominance, that is a sure sign an enemy priest is in better control and has vested interest in preventing you from establishing spiritual dominance.

They defect on you, you defect on them. Let them suffer the consequences of enemy priesthood, among others characterized by the annoying tendency that very little is said with very many words.

https://contingentnotarbitrary.com/2018/04/14/rederiving-christianity/

To recap, we started with a secular definition of Logos and noted that its telos is existence. Given human nature, game theory and the power of cooperation, the highest expression of that telos is freely chosen universal love, tempered by constant vigilance against defection while maintaining compassion for the defectors and forgiving those who repent. In addition, we must know the telos in order to fulfill it.

In Christian terms, looks like we got over half of the Ten Commandments (know Logos for the First, don’t defect or tempt yourself to defect for the rest), the importance of free will, the indestructibility of evil (group cooperation vs individual defection), loving the sinner and hating the sin (with defection as the sin), forgiveness (with conditions), and love and compassion toward all, assuming only secular knowledge and that it’s good to exist.

Iterated Prisoner's Dilemma is an Ultimatum Game: http://infoproc.blogspot.com/2012/07/iterated-prisoners-dilemma-is-ultimatum.html

The history of IPD shows that bounded cognition prevented the dominant strategies from being discovered for over 60 years, despite significant attention from game theorists, computer scientists, economists, evolutionary biologists, etc. Press and Dyson have shown that IPD is effectively an ultimatum game, which is very different from the Tit for Tat stories told by generations of people who worked on IPD (Axelrod, Dawkins, etc., etc.).

...

For evolutionary biologists: Dyson clearly thinks this result has implications for multilevel (group vs individual selection):

... Cooperation loses and defection wins. The ZD strategies confirm this conclusion and make it sharper. ... The system evolved to give cooperative tribes an advantage over non-cooperative tribes, using punishment to give cooperation an evolutionary advantage within the tribe. This double selection of tribes and individuals goes way beyond the Prisoners' Dilemma model.

implications for fractionalized Europe vis-a-vis unified China?

and more broadly does this just imply we're doomed in the long run RE: cooperation, morality, the "good society", so on...? war and group-selection is the only way to get a non-crab bucket civilization?

Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent:

http://www.pnas.org/content/109/26/10409.full

http://www.pnas.org/content/109/26/10409.full.pdf

https://www.edge.org/conversation/william_h_press-freeman_dyson-on-iterated-prisoners-dilemma-contains-strategies-that

https://en.wikipedia.org/wiki/Ultimatum_game

analogy for ultimatum game: the state gives the demos a bargain take-it-or-leave-it, and...if the demos refuses...violence?

The nature of human altruism: http://sci-hub.tw/https://www.nature.com/articles/nature02043

- Ernst Fehr & Urs Fischbacher

Some of the most fundamental questions concerning our evolutionary origins, our social relations, and the organization of society are centred around issues of altruism and selfishness. Experimental evidence indicates that human altruism is a powerful force and is unique in the animal world. However, there is much individual heterogeneity and the interaction between altruists and selfish individuals is vital to human cooperation. Depending on the environment, a minority of altruists can force a majority of selfish individuals to cooperate or, conversely, a few egoists can induce a large number of altruists to defect. Current gene-based evolutionary theories cannot explain important patterns of human altruism, pointing towards the importance of both theories of cultural evolution as well as gene–culture co-evolution.

...

Why are humans so unusual among animals in this respect? We propose that quantitatively, and probably even qualitatively, unique patterns of human altruism provide the answer to this question. Human altruism goes far beyond that which has been observed in the animal world. Among animals, fitness-reducing acts that confer fitness benefits on other individuals are largely restricted to kin groups; despite several decades of research, evidence for reciprocal altruism in pair-wise repeated encounters [4,5] remains scarce [6–8]. Likewise, there is little evidence so far that individual reputation building affects cooperation in animals, which contrasts strongly with what we find in humans. If we randomly pick two human strangers from a modern society and give them the chance to engage in repeated anonymous exchanges in a laboratory experiment, there is a high probability that reciprocally altruistic behaviour will emerge spontaneously [9,10].

However, human altruism extends far beyond reciprocal altruism and reputation-based cooperation, taking the form of strong reciprocity [11,12]. Strong reciprocity is a combination of altruistic rewarding, which is a predisposition to reward others for cooperative, norm-abiding behaviours, and altruistic punishment, which is a propensity to impose sanctions on others for norm violations. Strong reciprocators bear the cost of rewarding or punishing even if they gain no individual economic benefit whatsoever from their acts. In contrast, reciprocal altruists, as they have been defined in the biological literature [4,5], reward and punish only if this is in their long-term self-interest. Strong reciprocity thus constitutes a powerful incentive for cooperation even in non-repeated interactions and when reputation gains are absent, because strong reciprocators will reward those who cooperate and punish those who defect.

...

We will show that the interaction between selfish and strongly reciprocal … [more]

concept
conceptual-vocab
wiki
reference
article
models
GT-101
game-theory
anthropology
cultural-dynamics
trust
cooperate-defect
coordination
iteration-recursion
sequential
axelrod
discrete
smoothness
evolution
evopsych
EGT
economics
behavioral-econ
sociology
new-religion
deep-materialism
volo-avolo
characterization
hsu
scitariat
altruism
justice
group-selection
decision-making
tribalism
organizing
hari-seldon
theory-practice
applicability-prereqs
bio
finiteness
multi
history
science
social-science
decision-theory
commentary
study
summary
giants
the-trenches
zero-positive-sum
🔬
bounded-cognition
info-dynamics
org:edge
explanation
exposition
org:nat
eden
retention
long-short-run
darwinian
markov
equilibrium
linear-algebra
nitty-gritty
competition
war
explanans
n-factor
europe
the-great-west-whale
occident
china
asia
sinosphere
orient
decentralized
markets
market-failure
cohesion
metabuch
stylized-facts
interdisciplinary
physics
pdf
pessimism
time
insight
the-basilisk
noblesse-oblige
the-watchers
ideas
l

Analytic approaches to twin data using structural equation models

pdf study article explanation methodology variance-components biodet behavioral-gen twin-study genetics population-genetics models model-class graphs graphical-models latent-variables ML-MAP-E stats hypothesis-testing nibble 🌞 correlation bioinformatics acm GxE assortative-mating stat-power confidence

november 2017 by nhaliday


Karl Pearson and the Chi-squared Test

october 2017 by nhaliday

Pearson's paper of 1900 introduced what subsequently became known as the chi-squared test of goodness of fit. The terminology and allusions of 80 years ago create a barrier for the modern reader, who finds that the interpretation of Pearson's test procedure and the assessment of what he achieved are less than straightforward, notwithstanding the technical advances made since then. An attempt is made here to surmount these difficulties by exploring Pearson's relevant activities during the first decade of his statistical career, and by describing the work by his contemporaries and predecessors which seem to have influenced his approach to the problem. Not all the questions are answered, and others remain for further study.

original paper: http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf

How did Karl Pearson come up with the chi-squared statistic?: https://stats.stackexchange.com/questions/97604/how-did-karl-pearson-come-up-with-the-chi-squared-statistic

He proceeds by working with the multivariate normal, and the chi-square arises as a sum of squared standardized normal variates.

You can see from the discussion on p160-161 he's clearly discussing applying the test to multinomial distributed data (I don't think he uses that term anywhere). He apparently understands the approximate multivariate normality of the multinomial (certainly he knows the margins are approximately normal - that's a very old result - and knows the means, variances and covariances, since they're stated in the paper); my guess is that most of that stuff is already old hat by 1900. (Note that the chi-squared distribution itself dates back to work by Helmert in the mid-1870s.)

Then by the bottom of p163 he derives a chi-square statistic as "a measure of goodness of fit" (the statistic itself appears in the exponent of the multivariate normal approximation).

He then goes on to discuss how to evaluate the p-value*, and then he correctly gives the upper tail area of a $\chi^2_{12}$ beyond 43.87 as 0.000016. [You should keep in mind, however, that he didn't correctly understand how to adjust degrees of freedom for parameter estimation at that stage, so some of the examples in his papers use too high a d.f.]
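The quoted tail area can be checked by hand. A minimal sketch, assuming the standard closed form for a chi-squared variable with an even number of degrees of freedom k = 2m, whose upper tail is exp(-x/2) times the sum of (x/2)^j / j! for j below m; the function name is made up for illustration and this is not Pearson's own computation.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper tail P(X > x) for a chi-squared variable with even df = 2m,
    using the finite sum exp(-x/2) * sum_{j<m} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive even df")
    half = x / 2.0
    m = df // 2
    partial = sum(half ** j / math.factorial(j) for j in range(m))
    return math.exp(-half) * partial

if __name__ == "__main__":
    # The figures from the text: 12 degrees of freedom, statistic 43.87.
    print(chi2_upper_tail_even_df(43.87, 12))
```

Odd degrees of freedom need the incomplete gamma function instead, which this sketch deliberately leaves out.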

nibble
papers
acm
stats
hypothesis-testing
methodology
history
mostly-modern
pre-ww2
old-anglo
giants
science
the-trenches
stories
multi
q-n-a
overflow
explanation
summary
innovation
discovery
distribution
degrees-of-freedom
limits

Rank aggregation basics: Local Kemeny optimisation | David R. MacIver

september 2017 by nhaliday

This turns our problem from a global search to a local one: Basically we can start from any point in the search space and search locally by swapping adjacent pairs until we hit a minimum. This turns out to be quite easy to do. _We basically run insertion sort_: At step n we have the first n items in a locally Kemeny optimal order. Swap the n+1th item backwards until the majority think its predecessor is < it. This ensures all adjacent pairs are in the majority order, so swapping them would result in a greater than or equal K. This is of course an O(n^2) algorithm. In fact, the problem of merely finding a locally Kemeny optimal solution can be done in O(n log(n)) (for much the same reason as you can sort better than insertion sort). You just take the directed graph of majority votes and find a Hamiltonian Path. The nice thing about the above version of the algorithm is that it gives you a lot of control over where you start your search.
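The insertion-sort idea in the excerpt can be sketched as follows. Assumptions of this sketch, not of the post: every vote is a full ranking of the same items (best first), ties in the pairwise majority leave the pair where it is, and the function name `local_kemeny` is made up.

```python
def local_kemeny(start, votes):
    """Return a locally Kemeny optimal order, starting the search from `start`.

    `votes` is a list of rankings (each a list of the same items, best first).
    Each new item is swapped backwards while a strict majority of voters
    prefers it to its current predecessor, as in insertion sort.
    """
    positions = [{item: i for i, item in enumerate(vote)} for vote in votes]

    def majority_prefers(a, b):
        # True if strictly more than half of the voters rank a ahead of b.
        ahead = sum(1 for pos in positions if pos[a] < pos[b])
        return 2 * ahead > len(votes)

    order = []
    for item in start:
        order.append(item)
        i = len(order) - 1
        while i > 0 and majority_prefers(order[i], order[i - 1]):
            order[i], order[i - 1] = order[i - 1], order[i]
            i -= 1
    return order

if __name__ == "__main__":
    votes = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
    print(local_kemeny(["c", "b", "a"], votes))
```

Because the result depends on `start`, different starting orders can land in different local optima when the majority preferences contain a cycle, which is the "control over where you start your search" the excerpt mentions.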

techtariat
liner-notes
papers
tcs
algorithms
machine-learning
acm
optimization
approximation
local-global
orders
graphs
graph-theory
explanation
iteration-recursion
time-complexity
nibble

Unsupervised learning, one notion or many? – Off the convex path

june 2017 by nhaliday

(Task A) Learning a distribution from samples. (Examples: gaussian mixtures, topic models, variational autoencoders, ...)

(Task B) Understanding latent structure in the data. This is not the same as Task A; for example principal component analysis, clustering, manifold learning etc. identify latent structure but don’t learn a distribution per se.

(Task C) Feature Learning. Learn a mapping from datapoint → feature vector such that classification tasks are easier to carry out on feature vectors rather than datapoints. For example, unsupervised feature learning could help lower the amount of labeled samples needed for learning a classifier, or be useful for domain adaptation.

Task B is often a subcase of Task C, as the intended users of “structure found in data” are humans (scientists) who pore over the representation of data to gain some intuition about its properties, and these “properties” can often be phrased as a classification task.

This post explains the relationship between Tasks A and C, and why they get mixed up in students’ minds. We hope there is also some food for thought here for experts, namely, our discussion about the fragility of the usual “perplexity” definition of unsupervised learning. It explains why Task A doesn’t in practice lead to a good enough solution for Task C. For example, it has been believed for many years that for deep learning, unsupervised pretraining should help supervised training, but this has been hard to show in practice.

acmtariat
org:bleg
nibble
machine-learning
acm
thinking
clarity
unsupervised
conceptual-vocab
concept
explanation
features
bayesian
off-convex
deep-learning
latent-variables
generative
intricacy
distribution
sampling

9 Multivariate linear models for GWAS

pdf nibble article lecture-notes exposition bio biodet genetics genomics bioinformatics GWAS methodology explanation regression regularization machine-learning acm stats stanford 🌞 spearhead GCTA sparsity compressed-sensing linear-models concept levers ideas population-genetics

may 2017 by nhaliday


Pearson correlation coefficient - Wikipedia

may 2017 by nhaliday

https://en.wikipedia.org/wiki/Coefficient_of_determination

what does this mean?: https://twitter.com/GarettJones/status/863546692724858880

deleted but it was about the Pearson correlation distance: 1-r

I guess it's a metric

https://en.wikipedia.org/wiki/Explained_variation

http://infoproc.blogspot.com/2014/02/correlation-and-variance.html

A less misleading way to think about the correlation R is as follows: given X,Y from a standardized bivariate distribution with correlation R, an increase in X leads to an expected increase in Y: dY = R dX. In other words, students with +1 SD SAT score have, on average, roughly +0.4 SD college GPAs. Similarly, students with +1 SD college GPAs have on average +0.4 SAT.
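A quick simulation sketch of the slope reading above. It assumes a standardized bivariate normal with R = 0.4 (the value used in the SAT/GPA illustration); the function names and the sample size are made up for this note. The least-squares slope comes out close to R in both directions, Y on X and X on Y.

```python
import random
import statistics


def simulate_standardized_pair(r, n, seed=0):
    """Draw n (x, y) pairs from a standardized bivariate normal with correlation r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # y has unit variance and correlation r with x by construction.
        y = r * x + (1.0 - r * r) ** 0.5 * noise
        pairs.append((x, y))
    return pairs


def regression_slope(pairs):
    """Least-squares slope of the second coordinate on the first."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


pairs = simulate_standardized_pair(r=0.4, n=100_000)
slope_y_on_x = regression_slope(pairs)                         # expected change in Y per +1 SD of X
slope_x_on_y = regression_slope([(y, x) for x, y in pairs])    # and the reverse direction
print(round(slope_y_on_x, 3), round(slope_x_on_y, 3))
```

Both slopes being roughly R (rather than one being 1/R) is the regression-to-the-mean point: each variable predicts the other only partially.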

this reminds me of the breeder's equation (but it uses r instead of h^2, so it can't actually be the same)

https://www.reddit.com/r/slatestarcodex/comments/631haf/on_the_commentariat_here_and_why_i_dont_think_i/dfx4e2s/

stats science hypothesis-testing correlation metrics plots regression wiki reference nibble methodology multi twitter social discussion best-practices econotariat garett-jones concept conceptual-vocab accuracy causation acm matrix-factorization todo explanation yoga hsu street-fighting levers 🌞 2014 scitariat variance-components meta:prediction biodet s:** mental-math reddit commentary ssc poast gwern data-science metric-space similarity measure dependence-independence

Why Momentum Really Works

acmtariat techtariat org:bleg nibble machine-learning acm optimization gradient-descent exposition explanation yoga dynamic visualization visual-understanding better-explained linear-algebra iterative-methods iteration-recursion polynomials dynamical metabuch let-me-see ground-up oscillation fourier curvature convexity-curvature analysis concept atoms

april 2017 by nhaliday

How do these "neural network style transfer" tools work? - Julia Evans

february 2017 by nhaliday

When we put an image into the network, it starts out as a vector of numbers (the red/green/blue values for each pixel). At each layer of the network we get another intermediate vector of numbers. There’s no inherent meaning to any of these vectors.

But! If we want to, we could pick one of those vectors arbitrarily and declare “You know, I think that vector represents the content” of the image.

The basic idea is that the further down you get in the network (and the closer towards classifying objects in the network as a “cat” or “house” or whatever), the more the vector represents the image’s “content”.

In this paper, they designate the “conv4_2” layer as the “content” layer. This seems to be pretty arbitrary – it’s just a layer that’s pretty far down the network.

Defining “style” is a bit more complicated. If I understand correctly, the definition of “style” is actually the major innovation of this paper – they don’t just pick a layer and say “this is the style layer”. Instead, they take all the “feature maps” at a layer (basically there are actually a whole bunch of vectors at the layer, one for each “feature”), and define the “Gram matrix” of all the pairwise inner products between those vectors. This Gram matrix is the style.
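A minimal sketch of the Gram-matrix construction described above, in plain Python. The three tiny "feature maps" are made-up numbers, not activations from any real network, and `gram_matrix` is a name chosen for this note. Entry (i, j) is the inner product of flattened feature maps i and j.

```python
def gram_matrix(feature_maps):
    """Return G where G[i][j] is the inner product of flattened maps i and j."""
    n = len(feature_maps)
    return [
        [sum(a * b for a, b in zip(feature_maps[i], feature_maps[j])) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical layer: 3 feature maps, each a 2x2 activation grid flattened to length 4.
features = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0, 1.0],
]
G = gram_matrix(features)

# Move the spatial positions around, the same way in every feature map.
perm = [2, 0, 3, 1]
shuffled = [[fm[k] for k in perm] for fm in features]
G_shuffled = gram_matrix(shuffled)

print(G)
print(G == G_shuffled)
```

The Gram matrix is unchanged when positions are shuffled consistently across maps, so it records which features co-occur but not where they occur. That is one way to see why it can stand in for "style" rather than layout.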

techtariat bangbang deep-learning model-class explanation art visuo machine-learning acm SIGGRAPH init inner-product nibble

What is the relationship between information theory and Coding theory? - Quora

february 2017 by nhaliday

basically:

- finite vs. asymptotic

- combinatorial vs. probabilistic (lotsa overlap there)

- worst-case (Hamming) vs. distributional (Shannon)

Information and coding theory most often appear together in the subject of error correction over noisy channels. Historically, they were born at almost exactly the same time - both Richard Hamming and Claude Shannon were working at Bell Labs when this happened. Information theory tends to heavily use tools from probability theory (together with an "asymptotic" way of thinking about the world), while traditional "algebraic" coding theory tends to employ mathematics that is much more finite sequence length/combinatorial in nature, including linear algebra over Galois fields. The emergence of codes over graphs in the late 1990s and the 2000s blurred this distinction though, as code classes such as low-density parity-check codes employ both asymptotic analysis and random code selection techniques which have counterparts in information theory.

They do not subsume each other. Information theory touches on many other aspects that coding theory does not, and vice-versa. Information theory also touches on compression (lossy & lossless), statistics (e.g. large deviations), modeling (e.g. Minimum Description Length). Coding theory pays a lot of attention to sphere packing and coverings for finite length sequences - information theory addresses these problems (channel & lossy source coding) only in an asymptotic/approximate sense.

q-n-a qra math acm tcs information-theory coding-theory big-picture comparison confusion explanation linear-algebra polynomials limits finiteness math.CO hi-order-bits synthesis probability bits hamming shannon intricacy nibble s:null signal-noise

machine learning - What are the differences between sparse coding and autoencoder? - Cross Validated

q-n-a overflow acm machine-learning concept comparison confusion deep-learning model-class features sparsity explanation unsupervised nibble exploratory definition atoms bits

january 2017 by nhaliday

teaching - Intuitive explanation for dividing by $n-1$ when calculating standard deviation? - Cross Validated

january 2017 by nhaliday

The standard deviation calculated with a divisor of n-1 is a standard deviation calculated from the sample as an estimate of the standard deviation of the population from which the sample was drawn. Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population. Using n-1 instead of n as the divisor corrects for that by making the result a little bit bigger.

Note that the correction has a larger proportional effect when n is small than when it is large, which is what we want because when n is larger the sample mean is likely to be a good estimator of the population mean.

...

A common one is that the definition of variance (of a distribution) is the second moment recentered around a known, definite mean, whereas the estimator uses an estimated mean. This loss of a degree of freedom (given the mean, you can reconstitute the dataset with knowledge of just n−1 of the data values) requires the use of n−1 rather than n to "adjust" the result.
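A small simulation sketch of the correction, under the assumption of standard normal data (true variance 1) and a deliberately small sample size n = 5. The function names and the trial count are made up for this note. It checks the variance rather than the standard deviation, since the n−1 divisor makes the variance estimate unbiased (its square root is still slightly biased).

```python
import random


def variance_estimates(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates around the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    squared_deviations = sum((x - mean) ** 2 for x in sample)
    return squared_deviations / n, squared_deviations / (n - 1)


def average_estimates(n, trials, seed=0):
    """Average both estimators over many samples of size n drawn from N(0, 1)."""
    rng = random.Random(seed)
    total_n = total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        by_n, by_n_minus_1 = variance_estimates(sample)
        total_n += by_n
        total_n_minus_1 += by_n_minus_1
    return total_n / trials, total_n_minus_1 / trials


biased, unbiased = average_estimates(n=5, trials=50_000)
print(round(biased, 3), round(unbiased, 3))
```

With n = 5 the divide-by-n average lands near (n−1)/n = 0.8 of the true variance, while the divide-by-(n−1) average lands near 1, matching the remark that the correction matters most when n is small.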

q-n-a overflow stats acm intuition explanation bias-variance methodology moments nibble degrees-of-freedom sampling-bias generalization dimensionality ground-up intricacy

ds.algorithms - How does the Multiplicative Weights Update method maximize entropy? - Theoretical Computer Science Stack Exchange

q-n-a overflow tcs acm algorithms optimization online-learning yoga characterization ground-up explanation proofs entropy-like nibble identity properties amortization-potential

january 2017 by nhaliday

Breeding the breeder's equation - Gene Expression

december 2016 by nhaliday

- interesting fact about normal distribution: when thresholding Gaussian r.v. X ~ N(0, σ^2) at X > t, the new mean μ_s satisfies μ_s = pdf(X,t)/(1-cdf(X,t)) σ^2

- follows from direct calculation (any deeper reason?)

- note (using Taylor/asymptotic expansion of complementary error function) that this is Θ(t) as t -> ∞, and tends to the constant σ sqrt(2/π) as t -> 0

- for X ~ N(0, 1), can calculate 0 = cdf(X, t)μ_<t + (1-cdf(X, t))μ_>t => μ_<t = -pdf(X, t)/cdf(X, t)

- this declines quickly w/ t (like e^{-t^2/2}). as t -> 0, it goes like -sqrt(2/pi) + higher-order terms ~ -0.8.
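A sketch for sanity-checking the tail-mean identities in the bullets above, using `statistics.NormalDist` from the Python standard library. Here t is the threshold; the function names, the example values t = 1 and σ = 2, and the sample size are assumptions made for this note.

```python
import math
import random
from statistics import NormalDist


def upper_tail_mean(t, sigma=1.0):
    """E[X | X > t] for X ~ N(0, sigma^2): sigma^2 * pdf(t) / (1 - cdf(t))."""
    dist = NormalDist(0.0, sigma)
    return sigma ** 2 * dist.pdf(t) / (1.0 - dist.cdf(t))


def lower_tail_mean(t):
    """E[X | X < t] for X ~ N(0, 1): -pdf(t) / cdf(t)."""
    dist = NormalDist()
    return -dist.pdf(t) / dist.cdf(t)


def simulated_upper_tail_mean(t, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of E[X | X > t] by discarding draws at or below t."""
    rng = random.Random(seed)
    kept = [x for x in (rng.gauss(0.0, sigma) for _ in range(n)) if x > t]
    return sum(kept) / len(kept)


formula = upper_tail_mean(1.0, sigma=2.0)
simulated = simulated_upper_tail_mean(1.0, sigma=2.0)
at_zero = lower_tail_mean(0.0)  # should equal -sqrt(2/pi)
print(round(formula, 3), round(simulated, 3), round(at_zero, 3))
```

The closed form and the simulation should agree to a couple of decimal places, and the lower-tail mean at t = 0 reproduces the -sqrt(2/pi) ≈ -0.8 value quoted above.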

Average of a tail of a normal distribution: https://stats.stackexchange.com/questions/26805/average-of-a-tail-of-a-normal-distribution

Truncated normal distribution: https://en.wikipedia.org/wiki/Truncated_normal_distribution

gnxp explanation concept bio genetics population-genetics agri-mindset analysis scitariat org:sci nibble methodology distribution tidbits probability stats acm AMT limits magnitude identity integral street-fighting symmetry s:* tails multi q-n-a overflow wiki reference objektbuch proofs

Information Processing: Assortative mating, regression and all that: offspring IQ vs parental midpoint

november 2016 by nhaliday

Assuming parental midpoint of n SD above the population average, the kids' IQ will be normally distributed about a mean which is around +.6n with residual SD of about 12 points. (The .6 could actually be anywhere in the range (.5, .7), but the SD doesn't vary much from choice of empirical inputs.)

possible to calculate the residual variance from first principles?

Some data on regression: http://infoproc.blogspot.com/2010/10/some-data-on-regression.html

hsu parenting iq regression-to-mean street-fighting explanation methodology assortative-mating scitariat variance-components biodet nibble behavioral-gen multi data stories education acm

Information Processing: Looking back at the credit crisis

november 2016 by nhaliday

interesting take from Garett Jones (very pro-capital): https://twitter.com/GarettJones/status/893550933065388032

https://archive.is/vHyMR

http://www.felixsalmon.com/2008/01/predatory-borrowers/

Central limit theorem and securitization: how to build a CDO: http://infoproc.blogspot.com/2008/11/central-limit-theorem-and_16.html

Merton on the financial crisis: http://infoproc.blogspot.com/2009/04/merton-on-financial-crisis.html

http://infoproc.blogspot.com/search/label/credit%20crisis

The Meridian: MBS, CDO's and CDS's In Layman's Terms......: http://themeridian.blogspot.com/2008/09/mbs-cdos-and-cdss-in-laymans-terms.html

hsu history usa finance economics macro slides explanation postmortem discussion error scitariat cycles complex-systems market-failure multi twitter social econotariat garett-jones spearhead rhetoric contrarianism hmm regularizer chart journos-pundits cracker-econ marginal-rev backup concentration-of-measure markets ORFE probability street-fighting applications stats data-science risk outcome-risk moments regulation bounded-cognition video presentation reflection stochastic-processes physics interdisciplinary list stream jargon concept housing debt events methodology acm

Probably Overthinking It: There is still only one test

june 2016 by nhaliday

all hypothesis tests are based on the same framework
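One way to read the "same framework" claim is as a recipe: pick a test statistic, pick a model of the null hypothesis, simulate the statistic under that model, and report how often the simulated value is at least as extreme as the observed one. The sketch below is one possible instance (a permutation test on a difference in means), not code from the post; the function names and both data sets are invented for illustration.

```python
import random
from statistics import fmean


def mean_difference(group_a, group_b):
    """Test statistic: absolute difference between the two group means."""
    return abs(fmean(group_a) - fmean(group_b))


def permutation_p_value(group_a, group_b, iterations=10_000, seed=0):
    """Estimate a p-value by shuffling group labels (the null model: labels don't matter)."""
    rng = random.Random(seed)
    observed = mean_difference(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        if mean_difference(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / iterations


# Made-up measurements for two hypothetical groups.
control = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
treated = [12.0, 11.8, 12.1, 12.3, 11.9, 12.2]
p_separated = permutation_p_value(control, treated)
p_identical = permutation_p_value(control, control)
print(p_separated, p_identical)
```

Swapping in a different statistic or a different null model changes the test but not the structure of the computation.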

stats explanation init rhetoric synthesis acm insight concept methodology hi-order-bits big-picture hypothesis-testing 🔬