nhaliday + acm + overflow   70

multivariate analysis - Is it possible to have a pair of Gaussian random variables for which the joint distribution is not Gaussian? - Cross Validated
The bivariate normal distribution is the exception, not the rule!

It is important to recognize that "almost all" joint distributions with normal marginals are not the bivariate normal distribution. That is, the common viewpoint that joint distributions with normal marginals that are not the bivariate normal are somehow "pathological", is a bit misguided.

Certainly, the multivariate normal is extremely important due to its stability under linear transformations, and so receives the bulk of attention in applications.

note: there is a multivariate central limit theorem, so those such applications have no problem
nibble  q-n-a  overflow  stats  math  acm  probability  distribution  gotchas  intricacy  characterization  structure  composition-decomposition  counterexample  limits  concentration-of-measure
october 2017 by nhaliday
Karl Pearson and the Chi-squared Test
Pearson's paper of 1900 introduced what subsequently became known as the chi-squared test of goodness of fit. The terminology and allusions of 80 years ago create a barrier for the modern reader, who finds that the interpretation of Pearson's test procedure and the assessment of what he achieved are less than straightforward, notwithstanding the technical advances made since then. An attempt is made here to surmount these difficulties by exploring Pearson's relevant activities during the first decade of his statistical career, and by describing the work by his contemporaries and predecessors which seem to have influenced his approach to the problem. Not all the questions are answered, and others remain for further study.

original paper: http://www.economics.soton.ac.uk/staff/aldrich/1900.pdf

How did Karl Pearson come up with the chi-squared statistic?: https://stats.stackexchange.com/questions/97604/how-did-karl-pearson-come-up-with-the-chi-squared-statistic
He proceeds by working with the multivariate normal, and the chi-square arises as a sum of squared standardized normal variates.

You can see from the discussion on p160-161 he's clearly discussing applying the test to multinomial distributed data (I don't think he uses that term anywhere). He apparently understands the approximate multivariate normality of the multinomial (certainly he knows the margins are approximately normal - that's a very old result - and knows the means, variances and covariances, since they're stated in the paper); my guess is that most of that stuff is already old hat by 1900. (Note that the chi-squared distribution itself dates back to work by Helmert in the mid-1870s.)

Then by the bottom of p163 he derives a chi-square statistic as "a measure of goodness of fit" (the statistic itself appears in the exponent of the multivariate normal approximation).

He then goes on to discuss how to evaluate the p-value*, and then he correctly gives the upper tail area of a χ212χ122 beyond 43.87 as 0.000016. [You should keep in mind, however, that he didn't correctly understand how to adjust degrees of freedom for parameter estimation at that stage, so some of the examples in his papers use too high a d.f.]
nibble  papers  acm  stats  hypothesis-testing  methodology  history  mostly-modern  pre-ww2  old-anglo  giants  science  the-trenches  stories  multi  q-n-a  overflow  explanation  summary  innovation  discovery  distribution  degrees-of-freedom  limits
october 2017 by nhaliday
inequalities - Is the Jaccard distance a distance? - MathOverflow
Steinhaus Transform
the referenced survey: http://kenclarkson.org/nn_survey/p.pdf

It's known that this transformation produces a metric from a metric. Now if you take as the base metric D the symmetric difference between two sets, what you end up with is the Jaccard distance (which actually is known by many other names as well).
q-n-a  overflow  nibble  math  acm  sublinear  metrics  metric-space  proofs  math.CO  tcstariat  arrows  reduction  measure  math.MG  similarity  multi  papers  survey  computational-geometry  cs  algorithms  pdf  positivity  msr  tidbits  intersection  curvature  convexity-curvature  intersection-connectedness  signum
february 2017 by nhaliday
Difference between off-policy and on-policy learning - Cross Validated
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the current policy's action a″. It estimates the return for state-action pairs assuming the current policy continues to be followed.

The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.
q-n-a  overflow  machine-learning  acm  reinforcement  confusion  jargon  generalization  nibble  definition  greedy  comparison
february 2017 by nhaliday
Simultaneous confidence intervals for multinomial parameters, for small samples, many classes? - Cross Validated
- "Bonferroni approach" is just union bound
- so Pr(|hat p_i - p_i| > ε for any i) <= 2k e^{-ε^2 n} = δ
- ε = sqrt(ln(2k/δ)/n)
- Bonferroni approach should work for case of any dependent Bernoulli r.v.s
q-n-a  overflow  stats  moments  distribution  acm  hypothesis-testing  nibble  confidence  concentration-of-measure  bonferroni  parametric  synchrony
february 2017 by nhaliday
probability - Variance of maximum of Gaussian random variables - Cross Validated
In full generality it is rather hard to find the right order of magnitude of the variance of a Gaussien supremum since the tools from concentration theory are always suboptimal for the maximum function.

order ~ 1/log n
q-n-a  overflow  stats  probability  acm  orders  tails  bias-variance  moments  concentration-of-measure  magnitude  tidbits  distribution  yoga  structure  extrema  nibble
february 2017 by nhaliday
bounds - What is the variance of the maximum of a sample? - Cross Validated
- sum of variances is always a bound
- can't do better even for iid Bernoulli
- looks like nice argument from well-known probabilist (using E[(X-Y)^2] = 2Var X), but not clear to me how he gets to sum_i instead of sum_{i,j} in the union bound?
edit: argument is that, for j = argmax_k Y_k, we have r < X_i - Y_j <= X_i - Y_i for all i, including i = argmax_k X_k
- different proof here (later pages): http://www.ism.ac.jp/editsec/aism/pdf/047_1_0185.pdf
Var(X_n:n) <= sum Var(X_k:n) + 2 sum_{i < j} Cov(X_i:n, X_j:n) = Var(sum X_k:n) = Var(sum X_k) = nσ^2
why are the covariances nonnegative? (are they?). intuitively seems true.
- for that, see https://pinboard.in/u:nhaliday/b:ed4466204bb1
- note that this proof shows more generally that sum Var(X_k:n) <= sum Var(X_k)
- apparently that holds for dependent X_k too? http://mathoverflow.net/a/96943/20644
q-n-a  overflow  stats  acm  distribution  tails  bias-variance  moments  estimate  magnitude  probability  iidness  tidbits  concentration-of-measure  multi  orders  levers  extrema  nibble  bonferroni  coarse-fine  expert  symmetry  s:*  expert-experience  proofs
february 2017 by nhaliday
teaching - Intuitive explanation for dividing by \$n-1\$ when calculating standard deviation? - Cross Validated
The standard deviation calculated with a divisor of n-1 is a standard deviation calculated from the sample as an estimate of the standard deviation of the population from which the sample was drawn. Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population. Using n-1 instead of n as the divisor corrects for that by making the result a little bit bigger.

Note that the correction has a larger proportional effect when n is small than when it is large, which is what we want because when n is larger the sample mean is likely to be a good estimator of the population mean.

...

A common one is that the definition of variance (of a distribution) is the second moment recentered around a known, definite mean, whereas the estimator uses an estimated mean. This loss of a degree of freedom (given the mean, you can reconstitute the dataset with knowledge of just n−1 of the data values) requires the use of n−1 rather than nn to "adjust" the result.
q-n-a  overflow  stats  acm  intuition  explanation  bias-variance  methodology  moments  nibble  degrees-of-freedom  sampling-bias  generalization  dimensionality  ground-up  intricacy
january 2017 by nhaliday
Existence of the moment generating function and variance - Cross Validated
This question provides a nice opportunity to collect some facts on moment-generating functions (mgf).

In the answer below, we do the following:
1. Show that if the mgf is finite for at least one (strictly) positive value and one negative value, then all positive moments of X are finite (including nonintegral moments).
2. Prove that the condition in the first item above is equivalent to the distribution of X having exponentially bounded tails. In other words, the tails of X fall off at least as fast as those of an exponential random variable Z (up to a constant).
3. Provide a quick note on the characterization of the distribution by its mgf provided it satisfies the condition in item 1.
4. Explore some examples and counterexamples to aid our intuition and, particularly, to show that we should not read undue importance into the lack of finiteness of the mgf.
q-n-a  overflow  math  stats  acm  probability  characterization  concept  moments  distribution  examples  counterexample  tails  rigidity  nibble  existence  s:null  convergence  series
january 2017 by nhaliday
pr.probability - When are probability distributions completely determined by their moments? - MathOverflow
Roughly speaking, if the sequence of moments doesn't grow too quickly, then the distribution is determined by its moments. One sufficient condition is that if the moment generating function of a random variable has positive radius of convergence, then that random variable is determined by its moments.
q-n-a  overflow  math  acm  probability  characterization  tidbits  moments  rigidity  nibble  existence  convergence  series
january 2017 by nhaliday
Breeding the breeder's equation - Gene Expression
- interesting fact about normal distribution: when thresholding Gaussian r.v. X ~ N(0, σ^2) at X > 0, the new mean μ_s satisfies μ_s = pdf(X,t)/(1-cdf(X,t)) σ^2
- follows from direct calculation (any deeper reason?)
- note (using Taylor/asymptotic expansion of complementary error function) that this is Θ(t) as t -> 0 or ∞ (w/ different constants)
- for X ~ N(0, 1), can calculate 0 = cdf(X, t)μ_<t + (1-cdf(X, t))μ_>t => μ_<t = -pdf(X, t)/cdf(X, t)
- this declines quickly w/ t (like e^{-t^2/2}). as t -> 0, it goes like -sqrt(2/pi) + higher-order terms ~ -0.8.

Average of a tail of a normal distribution: https://stats.stackexchange.com/questions/26805/average-of-a-tail-of-a-normal-distribution

Truncated normal distribution: https://en.wikipedia.org/wiki/Truncated_normal_distribution
gnxp  explanation  concept  bio  genetics  population-genetics  agri-mindset  analysis  scitariat  org:sci  nibble  methodology  distribution  tidbits  probability  stats  acm  AMT  limits  magnitude  identity  integral  street-fighting  symmetry  s:*  tails  multi  q-n-a  overflow  wiki  reference  objektbuch  proofs
december 2016 by nhaliday
gt.geometric topology - Intuitive crutches for higher dimensional thinking - MathOverflow
Terry Tao:
I can't help you much with high-dimensional topology - it's not my field, and I've not picked up the various tricks topologists use to get a grip on the subject - but when dealing with the geometry of high-dimensional (or infinite-dimensional) vector spaces such as R^n, there are plenty of ways to conceptualise these spaces that do not require visualising more than three dimensions directly.

For instance, one can view a high-dimensional vector space as a state space for a system with many degrees of freedom. A megapixel image, for instance, is a point in a million-dimensional vector space; by varying the image, one can explore the space, and various subsets of this space correspond to various classes of images.

One can similarly interpret sound waves, a box of gases, an ecosystem, a voting population, a stream of digital data, trials of random variables, the results of a statistical survey, a probabilistic strategy in a two-player game, and many other concrete objects as states in a high-dimensional vector space, and various basic concepts such as convexity, distance, linearity, change of variables, orthogonality, or inner product can have very natural meanings in some of these models (though not in all).

It can take a bit of both theory and practice to merge one's intuition for these things with one's spatial intuition for vectors and vector spaces, but it can be done eventually (much as after one has enough exposure to measure theory, one can start merging one's intuition regarding cardinality, mass, length, volume, probability, cost, charge, and any number of other "real-life" measures).

For instance, the fact that most of the mass of a unit ball in high dimensions lurks near the boundary of the ball can be interpreted as a manifestation of the law of large numbers, using the interpretation of a high-dimensional vector space as the state space for a large number of trials of a random variable.

More generally, many facts about low-dimensional projections or slices of high-dimensional objects can be viewed from a probabilistic, statistical, or signal processing perspective.

Scott Aaronson:
Here are some of the crutches I've relied on. (Admittedly, my crutches are probably much more useful for theoretical computer science, combinatorics, and probability than they are for geometry, topology, or physics. On a related note, I personally have a much easier time thinking about R^n than about, say, R^4 or R^5!)

1. If you're trying to visualize some 4D phenomenon P, first think of a related 3D phenomenon P', and then imagine yourself as a 2D being who's trying to visualize P'. The advantage is that, unlike with the 4D vs. 3D case, you yourself can easily switch between the 3D and 2D perspectives, and can therefore get a sense of exactly what information is being lost when you drop a dimension. (You could call this the "Flatland trick," after the most famous literary work to rely on it.)
2. As someone else mentioned, discretize! Instead of thinking about R^n, think about the Boolean hypercube {0,1}^n, which is finite and usually easier to get intuition about. (When working on problems, I often find myself drawing {0,1}^4 on a sheet of paper by drawing two copies of {0,1}^3 and then connecting the corresponding vertices.)
3. Instead of thinking about a subset S⊆R^n, think about its characteristic function f:R^n→{0,1}. I don't know why that trivial perspective switch makes such a big difference, but it does ... maybe because it shifts your attention to the process of computing f, and makes you forget about the hopeless task of visualizing S!
4. One of the central facts about R^n is that, while it has "room" for only n orthogonal vectors, it has room for exp⁡(n) almost-orthogonal vectors. Internalize that one fact, and so many other properties of R^n (for example, that the n-sphere resembles a "ball with spikes sticking out," as someone mentioned before) will suddenly seem non-mysterious. In turn, one way to internalize the fact that R^n has so many almost-orthogonal vectors is to internalize Shannon's theorem that there exist good error-correcting codes.
5. To get a feel for some high-dimensional object, ask questions about the behavior of a process that takes place on that object. For example: if I drop a ball here, which local minimum will it settle into? How long does this random walk on {0,1}^n take to mix?

Gil Kalai:
This is a slightly different point, but Vitali Milman, who works in high-dimensional convexity, likes to draw high-dimensional convex bodies in a non-convex way. This is to convey the point that if you take the convex hull of a few points on the unit sphere of R^n, then for large n very little of the measure of the convex body is anywhere near the corners, so in a certain sense the body is a bit like a small sphere with long thin "spikes".
q-n-a  intuition  math  visual-understanding  list  discussion  thurston  tidbits  aaronson  tcs  geometry  problem-solving  yoga  👳  big-list  metabuch  tcstariat  gowers  mathtariat  acm  overflow  soft-question  levers  dimensionality  hi-order-bits  insight  synthesis  thinking  models  cartoons  coding-theory  information-theory  probability  concentration-of-measure  magnitude  linear-algebra  boolean-analysis  analogy  arrows  lifts-projections  measure  markov  sampling  shannon  conceptual-vocab  nibble  degrees-of-freedom  worrydream  neurons  retrofit  oscillation  paradox  novelty  tricki  concrete  high-dimension  s:***  manifolds  direction  curvature  convexity-curvature  elegance  guessing
december 2016 by nhaliday
real analysis - Proof of "every convex function is continuous" - Mathematics Stack Exchange
bound above by secant and below by tangent, so graph of function is constrained to a couple triangles w/ common vertex at (x, f(x))
tidbits  math  math.CA  q-n-a  visual-understanding  acm  overflow  proofs  smoothness  nibble  curvature  convexity-curvature
november 2016 by nhaliday
predictive models - Is this the state of art regression methodology? - Cross Validated
I've been following Kaggle competitions for a long time and I come to realize that many winning strategies involve using at least one of the "big threes": bagging, boosting and stacking.

For regressions, rather than focusing on building one best possible regression model, building multiple regression models such as (Generalized) linear regression, random forest, KNN, NN, and SVM regression models and blending the results into one in a reasonable way seems to out-perform each individual method a lot of times.
q-n-a  state-of-art  machine-learning  acm  data-science  atoms  overflow  soft-question  regression  ensembles  nibble  oly
november 2016 by nhaliday