nhaliday + sampling   31

Is there a common method for detecting the convergence of the Gibbs sampler and the expectation-maximization algorithm? - Quora
In practice and theory it is much easier to diagnose convergence in EM (vanilla or variational) than in any MCMC algorithm (including Gibbs sampling).

https://www.quora.com/How-can-you-determine-if-your-Gibbs-sampler-has-converged
There is a special case when you can actually obtain the stationary distribution, and be sure that you did! If your markov chain consists of a discrete state space, then take the first time that a state repeats in your chain: if you randomly sample an element between the repeating states (but only including one of the endpoints) you will have a sample from your true distribution.

One can achieve this 'exact MCMC sampling' more generally by using the coupling from the past algorithm (Coupling from the past).

Otherwise, there is no rigorous statistical test for convergence. It may be possible to obtain a theoretical bound for the convergence rates: but these are quite difficult to obtain, and quite often too large to be of practical use. For example, even for the simple case of using the Metropolis algorithm for sampling from a two-dimensional uniform distribution, the best convergence rate upper bound achieved, by Persi Diaconis, was something with an astronomical constant factor like 10^300.

In fact, it is fair to say that for most high dimensional problems, we have really no idea whether Gibbs sampling ever comes close to converging, but the best we can do is use some simple diagnostics to detect the most obvious failures.
nibble  q-n-a  qra  acm  stats  probability  limits  convergence  distribution  sampling  markov  monte-carlo  ML-MAP-E  checking  equilibrium  stylized-facts  gelman  levers  mixing  empirical  plots  manifolds  multi  fixed-point  iteration-recursion  heuristic  expert-experience  theory-practice  project
october 2019 by nhaliday
Shuffling - Wikipedia
The Gilbert–Shannon–Reeds model provides a mathematical model of the random outcomes of riffling, that has been shown experimentally to be a good fit to human shuffling[2] and that forms the basis for a recommendation that card decks be riffled seven times in order to randomize them thoroughly.[3] Later, mathematicians Lloyd M. Trefethen and Lloyd N. Trefethen authored a paper using a tweaked version of the Gilbert-Shannon-Reeds model showing that the minimum number of riffles for total randomization could also be 5, if the method of defining randomness is changed.[4][5]
nibble  tidbits  trivia  cocktail  wiki  reference  games  howto  random  models  math  applications  probability  math.CO  mixing  markov  sampling  best-practices  acm
august 2019 by nhaliday
Stat 260/CS 294: Bayesian Modeling and Inference
Topics
- Priors (conjugate, noninformative, reference)
- Hierarchical models, spatial models, longitudinal models, dynamic models, survival models
- Testing
- Model choice
- Inference (importance sampling, MCMC, sequential Monte Carlo)
- Nonparametric models (Dirichlet processes, Gaussian processes, neutral-to-the-right processes, completely random measures)
- Decision theory and frequentist perspectives (complete class theorems, consistency, empirical Bayes)
- Experimental design
unit  course  berkeley  expert  michael-jordan  machine-learning  acm  bayesian  probability  stats  lecture-notes  priors-posteriors  markov  monte-carlo  frequentist  latent-variables  decision-theory  expert-experience  confidence  sampling
july 2017 by nhaliday
Unsupervised learning, one notion or many? – Off the convex path
(Task A) Learning a distribution from samples. (Examples: gaussian mixtures, topic models, variational autoencoders,..)

(Task B) Understanding latent structure in the data. This is not the same as (a); for example principal component analysis, clustering, manifold learning etc. identify latent structure but don’t learn a distribution per se.

(Task C) Feature Learning. Learn a mapping from datapoint → feature vector such that classification tasks are easier to carry out on feature vectors rather than datapoints. For example, unsupervised feature learning could help lower the amount of labeled samples needed for learning a classifier, or be useful for domain adaptation.

Task B is often a subcase of Task C, as the intended user of “structure found in data” are humans (scientists) who pour over the representation of data to gain some intuition about its properties, and these “properties” can be often phrased as a classification task.

This post explains the relationship between Tasks A and C, and why they get mixed up in students’ mind. We hope there is also some food for thought here for experts, namely, our discussion about the fragility of the usual “perplexity” definition of unsupervised learning. It explains why Task A doesn’t in practice lead to good enough solution for Task C. For example, it has been believed for many years that for deep learning, unsupervised pretraining should help supervised training, but this has been hard to show in practice.
acmtariat  org:bleg  nibble  machine-learning  acm  thinking  clarity  unsupervised  conceptual-vocab  concept  explanation  features  bayesian  off-convex  deep-learning  latent-variables  generative  intricacy  distribution  sampling  grokkability-clarity  org:popup
june 2017 by nhaliday
Econometric Modeling as Junk Science
The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con out of Econometrics: https://www.aeaweb.org/articles?id=10.1257/jep.24.2.3

On data, experiments, incentives and highly unconvincing research – papers and hot beverages: https://papersandhotbeverages.wordpress.com/2015/10/31/on-data-experiments-incentives-and-highly-unconvincing-research/
In my view, it has just to do with the fact that academia is a peer monitored organization. In the case of (bad) data collection papers, issues related to measurement are typically boring. They are relegated to appendices, no one really has an incentive to monitor it seriously. The problem is similar in formal theory: no one really goes through the algebra in detail, but it is in principle feasible to do it, and, actually, sometimes these errors are detected. If discussing the algebra of a proof is almost unthinkable in a seminar, going into the details of data collection, measurement and aggregation is not only hard to imagine, but probably intrinsically infeasible.

Something different happens for the experimentalist people. As I was saying, I feel we have come to a point in which many papers are evaluated based on the cleverness and originality of the research design (“Using the World Cup qualifiers as an instrument for patriotism!? Woaw! how cool/crazy is that! I wish I had had that idea”). The sexiness of the identification strategy has too often become a goal in itself. When your peers monitor you paying more attention to the originality of the identification strategy than to the research question, you probably have an incentive to mine reality for ever crazier discontinuities. It is true methodologists have been criticized in the past for analogous reasons, such as being guided by the desire to increase mathematical complexity without a clear benefit. But, if you work with pure formal theory or statistical theory, your work is not meant to immediately answer question about the real world, but instead to serve other researchers in their quest. This is something that can, in general, not be said of applied CI work.

This post should have been entitled “Zombies who only think of their next cool IV fix”
massive lust for quasi-natural experiments, regression discontinuities
barely matters if the effects are not all that big
I suppose even the best of things must reach their decadent phase; methodological innov. to manias……

Following this "collapse of small-N social psych results" business, where do I predict econ will collapse? I see two main contenders.
One is lab studies. I dallied with these a few years ago in a Kenya lab. We ran several pilots of N=200 to figure out the best way to treat
and to measure the outcome. Every pilot gave us a different stat sig result. I could have written six papers concluding different things.
I gave up more skeptical of these lab studies than ever before. The second contender is the long run impacts literature in economic history
We should be very suspicious since we never see a paper showing that a historical event had no effect on modern day institutions or dvpt.
On the one hand I find these studies fun, fascinating, and probably true in a broad sense. They usually reinforce a widely believed history
argument with interesting data and a cute empirical strategy. But I don't think anyone believes the standard errors. There's probably a HUGE
problem of nonsignificant results staying in the file drawer. Also, there are probably data problems that don't get revealed, as we see with
the recent Piketty paper (http://marginalrevolution.com/marginalrevolution/2017/10/pikettys-data-reliable.html). So I take that literature with a vat of salt, even if I enjoy and admire the works
I used to think field experiments would show little consistency in results across place. That external validity concerns would be fatal.
In fact the results across different samples and places have proven surprisingly similar across places, and added a lot to general theory
Last, I've come to believe there is no such thing as a useful instrumental variable. The ones that actually meet the exclusion restriction
are so weird & particular that the local treatment effect is likely far different from the average treatment effect in non-transparent ways.
Most of the other IVs don't plausibly meet the e clue ion restriction. I mean, we should be concerned when the IV estimate is always 10x
larger than the OLS coefficient. This I find myself much more persuaded by simple natural experiments that use OLS, diff in diff, or
discontinuities, alongside randomized trials.

What do others think are the cliffs in economics?
PS All of these apply to political science too. Though I have a special extra target in poli sci: survey experiments! A few are good. I like
Dan Corstange's work. But it feels like 60% of dissertations these days are experiments buried in a survey instrument that measure small
changes in response. These at least have large N. But these are just uncontrolled labs, with negligible external validity in my mind.
The good ones are good. This method has its uses. But it's being way over-applied. More people have to make big and risky investments in big
natural and field experiments. Time to raise expectations and ambitions. This expectation bar, not technical ability, is the big advantage
economists have over political scientists when they compete in the same space.
(Ok. So are there any friends and colleagues I haven't insulted this morning? Let me know and I'll try my best to fix it with a screed)

HOW MUCH SHOULD WE TRUST DIFFERENCES-IN-DIFFERENCES ESTIMATES?∗: https://economics.mit.edu/files/750
Most papers that employ Differences-in-Differences estimation (DD) use many years of data and focus on serially correlated outcomes but ignore that the resulting standard errors are inconsistent. To illustrate the severity of this issue, we randomly generate placebo laws in state-level data on female wages from the Current Population Survey. For each law, we use OLS to compute the DD estimate of its “effect” as well as the standard error of this estimate. These conventional DD standard errors severely understate the standard deviation of the estimators: we find an “effect” significant at the 5 percent level for up to 45 percent of the placebo interventions. We use Monte Carlo simulations to investigate how well existing methods help solve this problem. Econometric corrections that place a specific parametric form on the time-series process do not perform well. Bootstrap (taking into account the auto-correlation of the data) works well when the number of states is large enough. Two corrections based on asymptotic approximation of the variance-covariance matrix work well for moderate numbers of states and one correction that collapses the time series information into a “pre” and “post” period and explicitly takes into account the effective sample size works well even for small numbers of states.

‘METRICS MONDAY: 2SLS–CHRONICLE OF A DEATH FORETOLD: http://marcfbellemare.com/wordpress/12733
As it turns out, Young finds that
1. Conventional tests tend to overreject the null hypothesis that the 2SLS coefficient is equal to zero.
2. 2SLS estimates are falsely declared significant one third to one half of the time, depending on the method used for bootstrapping.
3. The 99-percent confidence intervals (CIs) of those 2SLS estimates include the OLS point estimate over 90 of the time. They include the full OLS 99-percent CI over 75 percent of the time.
4. 2SLS estimates are extremely sensitive to outliers. Removing simply one outlying cluster or observation, almost half of 2SLS results become insignificant. Things get worse when removing two outlying clusters or observations, as over 60 percent of 2SLS results then become insignificant.
5. Using a Durbin-Wu-Hausman test, less than 15 percent of regressions can reject the null that OLS estimates are unbiased at the 1-percent level.
6. 2SLS has considerably higher mean squared error than OLS.
7. In one third to one half of published results, the null that the IVs are totally irrelevant cannot be rejected, and so the correlation between the endogenous variable(s) and the IVs is due to finite sample correlation between them.
8. Finally, fewer than 10 percent of 2SLS estimates reject instrument irrelevance and the absence of OLS bias at the 1-percent level using a Durbin-Wu-Hausman test. It gets much worse–fewer than 5 percent–if you add in the requirement that the 2SLS CI that excludes the OLS estimate.

Methods Matter: P-Hacking and Causal Inference in Economics*: http://ftp.iza.org/dp11796.pdf
Applying multiple methods to 13,440 hypothesis tests reported in 25 top economics journals in 2015, we show that selective publication and p-hacking is a substantial problem in research employing DID and (in particular) IV. RCT and RDD are much less problematic. Almost 25% of claims of marginally significant results in IV papers are misleading.

Ever since I learned social science is completely fake, I've had a lot more time to do stuff that matters, like deadlifting and reading about Mediterranean haplogroups
--
Wait, so, from fakest to realest IV>DD>RCT>RDD? That totally matches my impression.

https://archive.is/EZu0h
Great (not completely new but still good to have it in one place) discussion of RCTs and inference in economics by Deaton, my favorite sentences (more general than just about RCT) below
Randomization in the tropics revisited: a theme and eleven variations: https://scholar.princeton.edu/sites/default/files/deaton/files/deaton_randomization_revisited_v3_2019.pdf
org:junk  org:edu  economics  econometrics  methodology  realness  truth  science  social-science  accuracy  generalization  essay  article  hmm  multi  study  🎩  empirical  causation  error  critique  sociology  criminology  hypothesis-testing  econotariat  broad-econ  cliometrics  endo-exo  replication  incentives  academia  measurement  wire-guided  intricacy  twitter  social  discussion  pseudoE  effect-size  reflection  field-study  stat-power  piketty  marginal-rev  commentary  data-science  expert-experience  regression  gotchas  rant  map-territory  pdf  simulation  moments  confidence  bias-variance  stats  endogenous-exogenous  control  meta:science  meta-analysis  outliers  summary  sampling  ensembles  monte-carlo  theory-practice  applicability-prereqs  chart  comparison  shift  ratty  unaffiliated  garett-jones
june 2017 by nhaliday
Equivalence between counting and sampling
also: every counting problem either has FPTRAS or no approx. w/i polynomial factor
pdf  exposition  lecture-notes  berkeley  nibble  tcs  counting  sampling  characterization  complexity  approximation  rand-approx  proofs
february 2017 by nhaliday
gt.geometric topology - Intuitive crutches for higher dimensional thinking - MathOverflow
Terry Tao:
I can't help you much with high-dimensional topology - it's not my field, and I've not picked up the various tricks topologists use to get a grip on the subject - but when dealing with the geometry of high-dimensional (or infinite-dimensional) vector spaces such as R^n, there are plenty of ways to conceptualise these spaces that do not require visualising more than three dimensions directly.

For instance, one can view a high-dimensional vector space as a state space for a system with many degrees of freedom. A megapixel image, for instance, is a point in a million-dimensional vector space; by varying the image, one can explore the space, and various subsets of this space correspond to various classes of images.

One can similarly interpret sound waves, a box of gases, an ecosystem, a voting population, a stream of digital data, trials of random variables, the results of a statistical survey, a probabilistic strategy in a two-player game, and many other concrete objects as states in a high-dimensional vector space, and various basic concepts such as convexity, distance, linearity, change of variables, orthogonality, or inner product can have very natural meanings in some of these models (though not in all).

It can take a bit of both theory and practice to merge one's intuition for these things with one's spatial intuition for vectors and vector spaces, but it can be done eventually (much as after one has enough exposure to measure theory, one can start merging one's intuition regarding cardinality, mass, length, volume, probability, cost, charge, and any number of other "real-life" measures).

For instance, the fact that most of the mass of a unit ball in high dimensions lurks near the boundary of the ball can be interpreted as a manifestation of the law of large numbers, using the interpretation of a high-dimensional vector space as the state space for a large number of trials of a random variable.

More generally, many facts about low-dimensional projections or slices of high-dimensional objects can be viewed from a probabilistic, statistical, or signal processing perspective.

Scott Aaronson:
Here are some of the crutches I've relied on. (Admittedly, my crutches are probably much more useful for theoretical computer science, combinatorics, and probability than they are for geometry, topology, or physics. On a related note, I personally have a much easier time thinking about R^n than about, say, R^4 or R^5!)

1. If you're trying to visualize some 4D phenomenon P, first think of a related 3D phenomenon P', and then imagine yourself as a 2D being who's trying to visualize P'. The advantage is that, unlike with the 4D vs. 3D case, you yourself can easily switch between the 3D and 2D perspectives, and can therefore get a sense of exactly what information is being lost when you drop a dimension. (You could call this the "Flatland trick," after the most famous literary work to rely on it.)
2. As someone else mentioned, discretize! Instead of thinking about R^n, think about the Boolean hypercube {0,1}^n, which is finite and usually easier to get intuition about. (When working on problems, I often find myself drawing {0,1}^4 on a sheet of paper by drawing two copies of {0,1}^3 and then connecting the corresponding vertices.)
3. Instead of thinking about a subset S⊆R^n, think about its characteristic function f:R^n→{0,1}. I don't know why that trivial perspective switch makes such a big difference, but it does ... maybe because it shifts your attention to the process of computing f, and makes you forget about the hopeless task of visualizing S!
4. One of the central facts about R^n is that, while it has "room" for only n orthogonal vectors, it has room for exp⁡(n) almost-orthogonal vectors. Internalize that one fact, and so many other properties of R^n (for example, that the n-sphere resembles a "ball with spikes sticking out," as someone mentioned before) will suddenly seem non-mysterious. In turn, one way to internalize the fact that R^n has so many almost-orthogonal vectors is to internalize Shannon's theorem that there exist good error-correcting codes.
5. To get a feel for some high-dimensional object, ask questions about the behavior of a process that takes place on that object. For example: if I drop a ball here, which local minimum will it settle into? How long does this random walk on {0,1}^n take to mix?

Gil Kalai:
This is a slightly different point, but Vitali Milman, who works in high-dimensional convexity, likes to draw high-dimensional convex bodies in a non-convex way. This is to convey the point that if you take the convex hull of a few points on the unit sphere of R^n, then for large n very little of the measure of the convex body is anywhere near the corners, so in a certain sense the body is a bit like a small sphere with long thin "spikes".
q-n-a  intuition  math  visual-understanding  list  discussion  thurston  tidbits  aaronson  tcs  geometry  problem-solving  yoga  👳  big-list  metabuch  tcstariat  gowers  mathtariat  acm  overflow  soft-question  levers  dimensionality  hi-order-bits  insight  synthesis  thinking  models  cartoons  coding-theory  information-theory  probability  concentration-of-measure  magnitude  linear-algebra  boolean-analysis  analogy  arrows  lifts-projections  measure  markov  sampling  shannon  conceptual-vocab  nibble  degrees-of-freedom  worrydream  neurons  retrofit  oscillation  paradox  novelty  tricki  concrete  high-dimension  s:***  manifolds  direction  curvature  convexity-curvature  elegance  guessing
december 2016 by nhaliday

bundles : abstractacmmathtcs

Copy this bookmark:

description:

tags: