Experiments at Airbnb – Airbnb Engineering & Data Science – Medium

december 2017 by foodbaby

Why did we know to not stop when the p-value hit 0.05? It turns out that this pattern of hitting “significance” early and then converging back to a neutral result is actually quite common in our system. There are various reasons for this. Users often take a long time to book, so the early converters have a disproportionately large influence in the beginning of the experiment. Also, even small sample sizes in online experiments are massive in the scale of classical statistics in which these methods were developed. Since the statistical test is a function of the sample- and effect sizes, if an early effect size is large through natural variation it is likely for the p-value to be below 0.05 early. But the most important reason is that you are performing a statistical test every time you compute a p-value and the more you do it, the more likely you are to find an effect.

AB
testing
stopping
december 2017 by foodbaby

Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO

november 2017 by foodbaby

4.1.2 Hash and partition

Unlike the pseudorandom approach, this method is completely stateless. Each user is assigned a unique identifier, which is maintained either through a database or a cookie. This identifier is appended onto the name or id of the experiment. A hash function is applied to this combined identifier to obtain an integer which is uniformly distributed on a range of values. The range is then partitioned, with each variant represented by a partition.

This method is very sensitive to the choice of hash function. If the hash function has any funnels (instances where adjacent keys map to the same hash code) then the first property (uniform distribution) will be violated. And if the hash function has characteristics (instances where a perturbation of the key produces a predictable perturbation of the hash code), then correlations may occur between experiments. Few hash functions are sound enough to be used in this technique.

We tested this technique using several popular hash functions and a methodology similar to the one we used on the pseudorandom number generators. While any hash function will satisfy the second requirement (by definition), satisfying the first and third is more difficult. We found that only the cryptographic hash function MD5 generated no correlations between experiments. SHA256 (another cryptographic hash) came close, requiring a five-way interaction to produce a correlation. The .NET string hashing function failed to pass even a two-way interaction test.

AB
testing
hash
Unlike the pseudorandom approach, this method is completely stateless. Each user is assigned a unique identifier, which is maintained either through a database or a cookie. This identifier is appended onto the name or id of the experiment. A hash function is applied to this combined identifier to obtain an integer which is uniformly distributed on a range of values. The range is then partitioned, with each variant represented by a partition.

This method is very sensitive to the choice of hash function. If the hash function has any funnels (instances where adjacent keys map to the same hash code) then the first property (uniform distribution) will be violated. And if the hash function has characteristics (instances where a perturbation of the key produces a predictable perturbation of the hash code), then correlations may occur between experiments. Few hash functions are sound enough to be used in this technique.

We tested this technique using several popular hash functions and a methodology similar to the one we used on the pseudorandom number generators. While any hash function will satisfy the second requirement (by definition), satisfying the first and third is more difficult. We found that only the cryptographic hash function MD5 generated no correlations between experiments. SHA256 (another cryptographic hash) came close, requiring a five-way interaction to produce a correlation. The .NET string hashing function failed to pass even a two-way interaction test.

november 2017 by foodbaby

Selection Bias in Online Experimentation – Airbnb Engineering & Data Science – Medium

november 2017 by foodbaby

Measurement plays a crucial role in data informed decision making. When online experiments are costly and have to be performed efficiently, we inevitably carry out measurements on the same data used for both inference and model selection. There has been a long ongoing discussion in both academia and industry around “p-hacking” and similar ideas. An extensive literature exists trying to tackle this problem in various applications in econometrics or genome-wide association studies. Our approach, although with simplified assumptions about the selection rule, is a quick and effective way to account for the selection bias without many additional assumptions or prior knowledge, especially in large scale online experimentation platforms.

AB
testing
bias
november 2017 by foodbaby

Measurement and analysis of predictive feed ranking models on Instagram – @Scale

november 2017 by foodbaby

Thomas uses the launch of Instagram’s feed ranking as a working example to talk through issues in quantifying network effects, while exploring unusual A/B testing techniques such as country-level tests, testing on balanced graph partitions, and author-side experiments.

AB
testing
video
@scale
november 2017 by foodbaby

**related tags**

Copy this bookmark: