lol, gwern:
> What sort of person thinks “oh yeah, my beliefs about these coefficients correspond to a Gaussian with variance 2.5”? And what if I do cross-validation, like I always do, and find that variance 200 works better for the problem? Was the other person wrong? But how could they have known?
> ...Even ignoring the mode vs. mean issue, I have never met anyone who could tell whether their beliefs were normally distributed vs. Laplace distributed. Have you?
I must have spent too much time in Bayesland because both those strike me as very easy and I often think them! My beliefs usually are Laplace distributed when it comes to things like genetics (it makes me very sad to see GWASes with flat priors), and my Gaussian coefficients are actually a variance of 0.70 (assuming standardized variables w.l.o.g.) as is consistent with field-wide meta-analyses indicating that d>1 is pretty rare.
The variance of among-group variance is substantial and does not depend on the number of loci contributing to variance in the character. It is just as large for polygenic characters as for single loci with the same additive variance. This implies that one polygenic character contains exactly as much information about population relationships as one single-locus marker.

same is true of expectation apparently (so drift has same impact on polygenic and single-locus traits)
probability - Variance of maximum of Gaussian random variables - Cross Validated
In full generality it is rather hard to find the right order of magnitude of the variance of a Gaussian supremum since the tools from concentration theory are always suboptimal for the maximum function.

order ~ 1/log n
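A quick Monte Carlo check of the ~1/log n scaling, for the special case of the maximum of n i.i.d. standard normals. This is only a sketch: the function name `var_of_max`, the seed, and the trial count are my own choices. If the scaling holds, Var(max) * log n should stay roughly constant across n (the classical Gumbel-limit constant would be π²/12 ≈ 0.82, approached slowly).

```python
import math
import random

def var_of_max(n, trials, rng):
    """Monte Carlo estimate of Var(max of n i.i.d. N(0,1) draws)."""
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)]
    mean = sum(maxima) / trials
    return sum((m - mean) ** 2 for m in maxima) / (trials - 1)

rng = random.Random(0)  # fixed seed, arbitrary
variances = {}
scaled = {}
for n in (10, 100, 1000):
    variances[n] = var_of_max(n, trials=2000, rng=rng)
    # If Var(max) ~ c / log n, this product should be roughly constant in n.
    scaled[n] = variances[n] * math.log(n)
    print(n, round(variances[n], 3), round(scaled[n], 3))
```

The variance itself shrinks as n grows, but only logarithmically, which is why the maximum stays noisy even for large samples.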
bounds - What is the variance of the maximum of a sample? - Cross Validated
- sum of variances is always a bound
- can't do better even for iid Bernoulli
- looks like nice argument from well-known probabilist (using E[(X-Y)^2] = 2Var X), but not clear to me how he gets to sum_i instead of sum_{i,j} in the union bound?
edit: argument is that, for j = argmax_k Y_k, we have r < X_i - Y_j <= X_i - Y_i for all i, including i = argmax_k X_k
- different proof here (later pages):
Var(X_n:n) <= sum Var(X_k:n) + 2 sum_{i < j} Cov(X_i:n, X_j:n) = Var(sum X_k:n) = Var(sum X_k) = nσ^2
why are the covariances nonnegative? (are they?). intuitively seems true.
- for that, see
- note that this proof shows more generally that sum Var(X_k:n) <= sum Var(X_k)
- apparently that holds for dependent X_k too?
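The "can't do better even for iid Bernoulli" point can be checked exactly, since the maximum of n Bernoulli(p) variables is itself Bernoulli. A minimal sketch (function names and the choice of n and p are mine, purely for illustration): the ratio Var(max) / sum of variances gets close to 1 when n*p is small.

```python
def var_max_bernoulli(n, p):
    """Exact Var(max of n i.i.d. Bernoulli(p)); the max is Bernoulli(1 - (1-p)^n)."""
    p_all_zero = (1.0 - p) ** n  # P(max = 0)
    return p_all_zero * (1.0 - p_all_zero)

def sum_of_variances(n, p):
    """The upper bound from the notes: sum_i Var(X_i) = n * p * (1 - p)."""
    return n * p * (1.0 - p)

n = 1000
for p in (1e-2, 1e-4, 1e-6):
    ratio = var_max_bernoulli(n, p) / sum_of_variances(n, p)
    print(p, ratio)  # the ratio approaches 1 as n * p -> 0
```

So the bound is far from tight when some X_i is likely to be 1 (large n*p), and essentially attained in the rare-event regime.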
Count–min sketch - Wikipedia
- estimates frequency vector (f_i)
- idea:
d = O(log 1/δ) hash functions h_j: [n] -> [w] (w = O(1/ε))
d*w counters a[r, c]
for each event i, increment counters a[1, h_1(i)], a[2, h_2(i)], ..., a[d, h_d(i)]
estimate for f_i is min_j a[j, h_j(i)]
- never underestimates but upward-biased
- pf: Markov to get constant probability of success, then exponential decrease with repetition
lecture notes:
- note this can work w/ negative updates. just use median instead of min. pf still uses markov on the absolute value of error.
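The update/estimate scheme above can be sketched as follows. Assumptions that are mine rather than from the notes: integer keys, the common constants w = ceil(e/ε) and d = ceil(ln(1/δ)), and hash functions of the form ((a*x + b) mod p) mod w with random a, b. Only the non-negative-update (min) variant is shown.

```python
import math
import random

class CountMinSketch:
    PRIME = (1 << 61) - 1  # Mersenne prime, assumed larger than any key

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)         # counters per row
        self.d = math.ceil(math.log(1 / delta))  # number of hash functions / rows
        rng = random.Random(seed)
        # One (a, b) pair per row defines h_j(x) = ((a*x + b) mod p) mod w.
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _col(self, row, key):
        a, b = self.coeffs[row]
        return ((a * key + b) % self.PRIME) % self.w

    def update(self, key, count=1):
        # Increment one counter in every row.
        for r in range(self.d):
            self.table[r][self._col(r, key)] += count

    def estimate(self, key):
        # Collisions only add to a counter, so the minimum is the least-inflated one.
        return min(self.table[r][self._col(r, key)] for r in range(self.d))

# Usage on a synthetic stream (keys and stream length are arbitrary).
rng = random.Random(1)
stream = [rng.randrange(1000) for _ in range(20000)]
cms = CountMinSketch(eps=0.01, delta=0.01)
true_counts = {}
for key in stream:
    true_counts[key] = true_counts.get(key, 0) + 1
    cms.update(key)
worst = max(cms.estimate(k) - c for k, c in true_counts.items())
print("largest overestimate:", worst, "eps * N:", 0.01 * len(stream))
```

For the negative-update variant mentioned in the notes, `estimate` would take the median over rows instead of the minimum, since counters can then err in both directions.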
teaching - Intuitive explanation for dividing by $n-1$ when calculating standard deviation? - Cross Validated
The standard deviation calculated with a divisor of n-1 is a standard deviation calculated from the sample as an estimate of the standard deviation of the population from which the sample was drawn. Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population. Using n-1 instead of n as the divisor corrects for that by making the result a little bit bigger.

Note that the correction has a larger proportional effect when n is small than when it is large, which is what we want because when n is larger the sample mean is likely to be a good estimator of the population mean.


A common one is that the definition of variance (of a distribution) is the second moment recentered around a known, definite mean, whereas the estimator uses an estimated mean. This loss of a degree of freedom (given the mean, you can reconstitute the dataset with knowledge of just n−1 of the data values) requires the use of n−1 rather than n to "adjust" the result.
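A small simulation makes the underestimate visible. This is a sketch under my own assumptions (standard normal data, n = 5, arbitrary seed), and it works with the variance rather than the standard deviation, because the n−1 divisor is exactly unbiased only for the variance; its square root is still slightly biased as an estimate of the standard deviation.

```python
import random

def variance_estimates(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates around the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / n, ss / (n - 1)

rng = random.Random(0)
n, trials = 5, 20000
total_by_n = total_by_n1 = 0.0
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true variance is 1
    by_n, by_n1 = variance_estimates(sample)
    total_by_n += by_n
    total_by_n1 += by_n1

mean_by_n = total_by_n / trials    # expectation is (n-1)/n = 0.8
mean_by_n1 = total_by_n1 / trials  # expectation is 1.0
print(mean_by_n, mean_by_n1)
```

With n = 5 the divide-by-n estimate is low by 20% on average; rerunning with a larger n shows the gap shrinking like 1/n, matching the remark about small samples.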
Nuts and Bolts of Applying Deep Learning
"1. When the available data is not enough, hand-craft work (like feature design) is really important.
2. Andrew also mentioned the reason why End-2-End learning is less likely to plateau, according to my interpretation(NOT VERY SURE), end2end system neglects the human-design intermediate structures (eg. phonemes, which may be the bottleneck for performance improvement), so as more data comes, the true mechanism will be better learned, with better performance achieved."
Understanding the Pseudo-Truth as an Optimal Approximation
For a mis-specified model class m1 and a true model m2 (i.e., m2 generated the data), the "pseudo-truth" is the member of m1 that is closest to m2.

Bias (and variance) as an "asymptotic property of a model class" vs. "a finite-sample property of an estimator". Paul Mineiro describes the "asymptotic property" view, and White adds, "But Paul Mineiro is not, I think, interested in these finite-sample properties of estimators. I believe he’s concerned about the intrinsic error introduced by approximating one function with another. And that’s a very important topic that I haven’t seen discussed as often as I’d like."
pr.probability - Google question: In a country in which people only want boys - MathOverflow
- limits to 1/2 w/ number of families -> ∞
- proportion of girls in one family is biased estimator of proportion in general population (larger families w/ more girls count more)
- interesting comment on Douglas Zare's answer (whether process has stopped or not)
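A simulation of the first two bullets, under the usual puzzle assumptions (mine, not spelled out in the notes): each birth is independently a boy with probability 1/2, and every family stops at its first boy. It estimates the expected fraction of girls in a country of k families; for k = 1 the exact value is 1 − ln 2 ≈ 0.307.

```python
import math
import random

def girl_fraction(families, rng):
    """Fraction of girls in one simulated country with `families` families."""
    girls = 0
    for _ in range(families):
        while rng.random() < 0.5:  # a girl is born, so the family keeps going
            girls += 1
        # leaving the loop means the first boy arrived and the family stops
    return girls / (girls + families)  # exactly one boy per family

def mean_girl_fraction(families, trials, rng):
    """Average of girl_fraction over many simulated countries."""
    return sum(girl_fraction(families, rng) for _ in range(trials)) / trials

rng = random.Random(0)  # fixed seed, arbitrary
est = {
    1: mean_girl_fraction(1, 20000, rng),
    10: mean_girl_fraction(10, 5000, rng),
    1000: mean_girl_fraction(1000, 200, rng),
}
print(est, "single-family exact value:", 1 - math.log(2))
```

The single-family average weights every family equally, so the many small boy-only families pull it well below 1/2; pooling children across more families weights each child equally and the fraction climbs toward 1/2.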
