foodbaby + testing   102

DeepTest: automated testing of deep-neural-network-driven autonomous cars | the morning paper
In this paper, we design, implement and evaluate DeepTest, a systematic testing tool for automatically detecting erroneous behaviors of DNN-driven vehicles that can potentially lead to fatal crashes. First, our tool is designed to automatically generate test cases leveraging real-world changes in driving conditions like rain, fog, lighting conditions, etc. DeepTest systematically explores different parts of the DNN logic by generating test inputs that maximize the number of activated neurons. DeepTest found thousands of erroneous behaviors under different realistic driving conditions (e.g., blurring, rain, fog, etc.), many of which lead to potentially fatal crashes in three top performing DNNs in the Udacity self-driving car challenge.
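To make "maximize the number of activated neurons" concrete, here is a minimal sketch of a neuron-coverage measurement. It is not DeepTest's implementation: the two-layer ReLU network `TOY_LAYERS`, the activation `threshold`, and the `brighten` transformation are all made-up assumptions chosen to keep the example self-contained.

```python
def relu_layer(weights, biases, inputs):
    """Apply one fully connected ReLU layer to a list of input values."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def activated_neurons(layers, inputs, threshold=0.0):
    """Return the set of (layer, index) pairs whose output exceeds threshold."""
    active = set()
    values = inputs
    for layer_idx, (weights, biases) in enumerate(layers):
        values = relu_layer(weights, biases, values)
        active.update((layer_idx, i) for i, v in enumerate(values) if v > threshold)
    return active


def neuron_coverage(layers, test_inputs, threshold=0.0):
    """Fraction of neurons activated by at least one input in the test set."""
    total = sum(len(biases) for _, biases in layers)
    covered = set()
    for inputs in test_inputs:
        covered |= activated_neurons(layers, inputs, threshold)
    return len(covered) / total


def brighten(pixels, delta):
    """Hypothetical image transformation: shift every pixel, clamped to [0, 1]."""
    return [min(1.0, max(0.0, p + delta)) for p in pixels]


# Toy network (assumption): 2 inputs -> 3 hidden neurons -> 2 output neurons.
TOY_LAYERS = [
    ([[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.5]),
    ([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]),
]

seed_image = [0.6, 0.2]
before = neuron_coverage(TOY_LAYERS, [seed_image])
after = neuron_coverage(TOY_LAYERS, [seed_image, brighten(seed_image, 0.4)])
print(before, after)
```

On this toy network the brightened copy of the seed input activates neurons the original never reaches, so coverage of the test set rises from 0.4 to 0.8. A coverage-guided generator would keep transformations like this one and discard those that add nothing.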
DNN  testing 
3 days ago by foodbaby
zaproxy/zaproxy: The OWASP ZAP core project
The OWASP Zed Attack Proxy (ZAP) is one of the world’s most popular free security tools and is actively maintained by hundreds of international volunteers. It can help you automatically find security vulnerabilities in your web applications while you are developing and testing your applications. It's also a great tool for experienced pentesters to use for manual security testing.
security  testing 
january 2018 by foodbaby
Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnosis and the reproduction of the production failures. We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have led to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.
testing  papers  error  handling 
january 2018 by foodbaby
Developer Testing in the IDE
Software testing is one of the key activities to achieve software quality in practice. Despite its importance, however, we have a remarkable lack of knowledge on how developers test in real-world projects. In this paper, we report on the surprising results of a large-scale field study with 2,443 software engineers whose development activities we closely monitored over the course of 2.5 years in four Integrated Development Environments (IDEs). Our findings question several commonly shared assumptions and beliefs about developer testing: half of the developers in our study do not test; developers rarely run their tests in the IDE; only once they start testing, do they do it heftily; most programming sessions end without any test execution; only a quarter of test cases are responsible for three quarters of all test failures; 12% of tests show flaky behavior; Test-Driven Development (TDD) is not widely practiced; and software developers only spend a quarter of their time engineering tests, whereas they think they test half of their time. We observed only minor differences in the testing practices among developers in different IDEs, Java, and C#. We summarize these practices of loosely guiding one's development efforts with the help of testing as Test-Guided Development (TGD).
software  engineering  research  testing 
december 2017 by foodbaby
Experiments at Airbnb – Airbnb Engineering & Data Science – Medium
Why did we know to not stop when the p-value hit 0.05? It turns out that this pattern of hitting “significance” early and then converging back to a neutral result is actually quite common in our system. There are various reasons for this. Users often take a long time to book, so the early converters have a disproportionately large influence in the beginning of the experiment. Also, even small sample sizes in online experiments are massive in the scale of classical statistics in which these methods were developed. Since the statistical test is a function of the sample- and effect sizes, if an early effect size is large through natural variation it is likely for the p-value to be below 0.05 early. But the most important reason is that you are performing a statistical test every time you compute a p-value and the more you do it, the more likely you are to find an effect.
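The last point, that every p-value computation is another test, can be checked with a small simulation. The sketch below is not Airbnb's system: it runs A/A experiments (both arms share the same true conversion rate) and compares how often p drops below 0.05 at any peek with how often it does so at the final look only. The conversion rate, sample sizes, number of peeks, and the pooled two-proportion z-test are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed yet, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_experiments=400, n_peeks=20, users_per_peek=100,
             rate=0.1, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n = 0
        crossed = False
        p = 1.0
        for _ in range(n_peeks):
            # Both arms draw from the same rate: any "effect" is pure noise.
            conv_a += sum(rng.random() < rate for _ in range(users_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(users_per_peek))
            n += users_per_peek
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                crossed = True  # an experimenter who peeks would stop here
        peeking_hits += crossed
        final_hits += p < alpha
    return peeking_hits / n_experiments, final_hits / n_experiments


with_peeking, at_end_only = simulate()
print(f"significant at some peek: {with_peeking:.2f}")
print(f"significant at the end:   {at_end_only:.2f}")
```

Since there is no true difference, every "significant" result here is a false positive. Stopping at the first peek below 0.05 can never produce fewer of them than a single test at the planned end, and with many peeks it typically produces several times more.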
AB  testing  stopping 
december 2017 by foodbaby
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO
4.1.2 Hash and partition

Unlike the pseudorandom approach, this method is completely stateless. Each user is assigned a unique identifier, which is maintained either through a database or a cookie. This identifier is appended onto the name or id of the experiment. A hash function is applied to this combined identifier to obtain an integer which is uniformly distributed on a range of values. The range is then partitioned, with each variant represented by a partition.

This method is very sensitive to the choice of hash function. If the hash function has any funnels (instances where adjacent keys map to the same hash code) then the first property (uniform distribution) will be violated. And if the hash function has characteristics (instances where a perturbation of the key produces a predictable perturbation of the hash code), then correlations may occur between experiments. Few hash functions are sound enough to be used in this technique.

We tested this technique using several popular hash functions and a methodology similar to the one we used on the pseudorandom number generators. While any hash function will satisfy the second requirement (by definition), satisfying the first and third is more difficult. We found that only the cryptographic hash function MD5 generated no correlations between experiments. SHA256 (another cryptographic hash) came close, requiring a five-way interaction to produce a correlation. The .NET string hashing function failed to pass even a two-way interaction test.
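The hash-and-partition scheme described above might be sketched as follows. This is an illustrative reading of the excerpt, not the paper's code: the function name `assign_variant`, the `:` separator, the use of the first 8 digest bytes, and the equal-sized partitions are assumptions. MD5 is used because the excerpt reports it produced no correlations between experiments; it serves only as a well-mixed hash, not as a security measure.

```python
import hashlib
from typing import Sequence


def assign_variant(user_id: str, experiment_id: str, variants: Sequence[str]) -> str:
    """Statelessly map a user to one variant of one experiment."""
    # Combine the user identifier with the experiment identifier so the same
    # user can land in different buckets in different experiments.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Interpret the first 8 bytes as an integer on [0, 2**64) and split that
    # range into equal partitions, one per variant. The modulo bias is
    # negligible because 2**64 is vastly larger than the number of variants.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variants = ["control", "treatment"]
print(assign_variant("user-42", "new-checkout", variants))
print(assign_variant("user-42", "search-ranking", variants))
```

No assignment table is stored anywhere: calling the function again with the same identifiers always returns the same variant, which is what makes the method stateless. Unequal traffic splits would need the range partitioned by cumulative weights rather than by a plain modulo.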
AB  testing  hash 
november 2017 by foodbaby
Selection Bias in Online Experimentation – Airbnb Engineering & Data Science – Medium
Measurement plays a crucial role in data informed decision making. When online experiments are costly and have to be performed efficiently, we inevitably carry out measurements on the same data used for both inference and model selection. There has been a long ongoing discussion in both academia and industry around “p-hacking” and similar ideas. An extensive literature exists trying to tackle this problem in various applications in econometrics or genome-wide association studies. Our approach, although with simplified assumptions about the selection rule, is a quick and effective way to account for the selection bias without many additional assumptions or prior knowledge, especially in large scale online experimentation platforms.
AB  testing  bias 
november 2017 by foodbaby
Measurement and analysis of predictive feed ranking models on Instagram – @Scale
Thomas uses the launch of Instagram’s feed ranking as a working example to talk through issues in quantifying network effects, while exploring unusual A/B testing techniques such as country-level tests, testing on balanced graph partitions, and author-side experiments.
AB  testing  video  @scale 
november 2017 by foodbaby

