Applying the Delta Method in Metric Analytics
During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly without extra model assumptions and solve big data problems via closed-form formulas using distributed algorithms at a fraction of the cost of simulation-based procedures like bootstrap. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics by highlighting both its classic and novel applications.
ab-testing  delta-method 
2 days ago
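As a rough illustration of the paper's theme, the Delta-method closed form for the variance of a ratio metric (mean of per-user numerators over mean of per-user denominators) can be sketched in a few lines. This is a minimal sketch, not code from the paper; the function name and the constant-denominator sanity check are illustrative.

```python
import numpy as np

def delta_ratio_var(y, x):
    """Approximate Var(mean(y) / mean(x)) via a first-order Delta method.

    y, x: per-user numerator and denominator totals (e.g. clicks, pageviews).
    """
    n = len(x)
    mx, my = np.mean(x), np.mean(y)
    vy, vx = np.var(y, ddof=1), np.var(x, ddof=1)
    cxy = np.cov(x, y, ddof=1)[0, 1]
    # First-order Taylor expansion of f(mx, my) = my / mx around the means
    return (vy / mx**2 - 2 * my * cxy / mx**3 + my**2 * vx / mx**4) / n
```

When the denominator is constant, the formula collapses to the familiar Var(mean(y)) / c², which makes a convenient sanity check.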
Controlled experiments on the web: survey and practical guide
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person’s Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
ab-testing  hash  md5 
2 days ago
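The survey's point that hashing-based randomization is subtler than it looks can be illustrated with a minimal salted-bucketing sketch. The function and parameters below are hypothetical (MD5 chosen to match the bookmark's tags); salting with the experiment name is the standard guard against correlated assignments across concurrent experiments.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, n_buckets: int = 1000) -> str:
    """Deterministically bucket a user for an experiment.

    Hashing "experiment:user" rather than the bare user id keeps
    assignments independent across experiments; hashing only the user id
    would put the same users in treatment everywhere.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % n_buckets
    return "treatment" if bucket < n_buckets // 2 else "control"
```

Determinism (the same user always lands in the same variant) and approximate 50/50 balance over many users are the two properties worth unit-testing in any real implementation.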
[1409.3174] Designing and Deploying Online Field Experiments
Online experiments are widely used to compare specific design alternatives, but they can also be used to produce generalizable knowledge and inform strategic decision making. Doing so often requires sophisticated experimental designs, iterative refinement, and careful logging and analysis. Few tools exist that support these needs. We thus introduce a language for online field experiments called PlanOut. PlanOut separates experimental design from application code, allowing the experimenter to concisely describe experimental designs, whether common "A/B tests" and factorial designs, or more complex designs involving conditional logic or multiple experimental units. These latter designs are often useful for understanding causal mechanisms involved in user behaviors. We demonstrate how experiments from the literature can be implemented in PlanOut, and describe two large field experiments conducted on Facebook with PlanOut. For common scenarios in which experiments are run iteratively and in parallel, we introduce a namespaced management system that encourages sound experimental practice.
ab-testing  facebook  hash 
2 days ago
Trustworthy Analysis of Online A/B Tests: Pitfalls, challenges and solutions
A/B tests (or randomized controlled experiments) play an integral role in the research and development cycles of technology companies. As in classic randomized experiments (e.g., clinical trials), the underlying statistical analysis of A/B tests is based on assuming the randomization unit is independent and identically distributed (i.i.d.). However, the randomization mechanisms utilized in online A/B tests can be quite complex and may render this assumption invalid. Analysis that unjustifiably relies on this assumption can yield untrustworthy results and lead to incorrect conclusions. Motivated by challenging problems arising from actual online experiments, we propose a new method of variance estimation that relies only on practically plausible assumptions, is directly applicable to a wide range of randomization mechanisms, and can be implemented easily. We examine its performance and illustrate its advantages over two commonly used methods of variance estimation on both simulated and empirical datasets. Our results lead to a deeper understanding of the conditions under which the randomization unit can be treated as i.i.d. In particular, we show that for purposes of variance estimation, the randomization unit can be approximated as i.i.d. when the individual treatment effect variation is small; however, this approximation can lead to variance under-estimation when the individual treatment effect variation is large.
delta-method  ab-testing 
2 days ago
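The variance under-estimation the abstract warns about shows up whenever page-level observations are treated as i.i.d. while users (the randomization unit) each contribute several pages. A minimal sketch, with illustrative function names and the standard cluster-robust estimator rather than the paper's specific method:

```python
import numpy as np

def naive_var_of_mean(values):
    """Treats every observation as i.i.d. -- biased under clustering."""
    values = np.asarray(values, dtype=float)
    return np.var(values, ddof=1) / len(values)

def clustered_var_of_mean(values, cluster_ids):
    """Cluster-robust variance of the overall mean when observations
    share a randomization unit (e.g. pages within a user)."""
    values = np.asarray(values, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    n = len(values)
    clusters = np.unique(cluster_ids)
    g = len(clusters)
    resid = values - values.mean()
    # Residuals summed within each cluster capture within-user correlation
    s = np.array([resid[cluster_ids == c].sum() for c in clusters])
    return (s**2).sum() / n**2 * g / (g - 1)
```

With one observation per cluster the two estimators coincide; with positively correlated observations inside a cluster, the naive estimator is too small.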
AdvancedTopic_Triggering.docx - Microsoft Word Online
Triggering is an important concept that provides experimenters with a way to improve sensitivity (statistical power) by filtering out noise created by users who could not have been impacted by the experiment. A user should be triggered into the analysis of an experiment if there is (potentially) some difference in the system or user behavior between the variant they are in and any other variant (counterfactual). Used right, triggering is a highly valuable tool in your arsenal; but there are also several pitfalls that can lead to incorrect results.
ab-testing 
5 days ago
Toward Predicting the Outcome of an A/B Experiment for Search Relevance - Microsoft Research
A standard approach to estimating online click-based metrics of a ranking function is to run it in a controlled experiment on live users. While reliable and popular in practice, configuring and running an online experiment is cumbersome and time-intensive. In this work, inspired by recent successes of offline evaluation techniques for recommender systems, we study an alternative that uses historical search log to reliably predict online click-based metrics of a new ranking function, without actually running it on live users.
To tackle novel challenges encountered in Web search, variations of the basic techniques are proposed. The first is to take advantage of diversified behavior of a search engine over a long period of time to simulate randomized data collection, so that our approach can be used at very low cost. The second is to replace exact matching (of recommended items in previous work) by fuzzy matching (of search result pages) to increase data efficiency, via a better trade-off of bias and variance. Extensive experimental results based on large-scale real search data from a major commercial search engine in the US market demonstrate our approach is promising and has potential for wide use in Web search.
offline  evaluation  IR  contextual-bandit  causal-inference 
17 days ago
Memory Barriers: a Hardware View for Software Hackers
So what possessed CPU designers to cause them to inflict memory barriers on poor unsuspecting SMP software designers? In short, because reordering memory references allows much better performance, and so memory barriers are needed to force ordering in things like synchronization primitives whose correct operation depends on ordered memory references. Getting a more detailed answer to this question requires a good understanding of how CPU caches work, and especially what is required to make caches really work well.
memory  cache-coherence  mechanical-sympathy 
17 days ago
Dev vs. Ops, DevOps, and SRE – Medium
The Dev vs. Ops split, DevOps, and Site Reliability Engineering are examples of three different approaches used by companies to structure work: Specialisation, convergence, and split incentives.
SRE 
24 days ago