How are effects of online A/B tests distributed? How often are they not significant? Does achieving significance guarantee meaningful business impact?

We answer these questions in our new paper, “False Discovery in A/B Testing”, recently out in Management Science >>
The paper is co-authored with Christophe Van den Bulte and analyzes over 2,700 online A/B tests that were run on the @Optimizely platform by more than 1,300 experimenters.

Link to paper: pubsonline.informs.org/doi/10.1287/mn…
Non-paywalled: ron-berman.com/papers/fdr.pdf
>>
A big draw of the paper is that @Optimizely has graciously allowed us to publish the data we used in the analysis. We hope it will be valuable to other researchers as well.
>>
First, we analyze the effects of all the A/B tests in our data. They are quite small: the median (and average) webpage variation has roughly zero effect on webpage Engagement.

But the distribution is quite long-tailed, with some variations showing big effects.
>> [image]
We then classify effects into “null” (zero) and “non-null” (positive or negative), to understand how many experiments, on average, have an underlying zero effect.

The answer is about 70%.

That is, about 70% of variations have no true impact on Engagement compared to the baseline.
>>
The concept of a true null is somewhat subtle (as we explain in the paper), since we often assume that true nulls don’t exist.

However, as long as we test a null hypothesis of exactly zero effect using significance testing, there is no reason to assume the null cannot be true.
>>
70% true nulls just means that if one picks a random webpage variation in our data, it has a high chance of yielding no (or very small) impact.

But what about the statistically significant effects? Wouldn’t they guard us from true nulls by having a low false positive rate?
>>
A statistically significant result generated by a true null effect is called a false discovery. Although we often think of the significance threshold we set for hypothesis testing (e.g., alpha=0.05) as the rate of false discoveries, this is not actually what we get.
>>
alpha is the false positive rate (FPR):
Pr(effect is significant | effect is null).

What we care about is the reverse conditional,
Pr(effect is null | effect is significant),
which is called the false discovery rate (FDR).
>> [image]
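To see the difference concretely, here is a minimal simulation sketch in Python. It is not from the paper: the 70% null share matches the estimate above, but the effect size and per-arm sample size are made-up illustrative values.

import numpy as np

rng = np.random.default_rng(0)
n_tests = 100_000
pi0 = 0.70             # share of true nulls, per the estimate above
effect = 0.2           # assumed standardized lift for non-null variations (illustrative)
n_per_arm = 200        # assumed visitors per arm (illustrative)

is_null = rng.random(n_tests) < pi0
# two-sample z-statistic: centered at 0 for nulls, at effect*sqrt(n/2) otherwise
shift = np.where(is_null, 0.0, effect * np.sqrt(n_per_arm / 2))
z = rng.normal(loc=shift)
significant = np.abs(z) > 1.96           # two-sided test at alpha = 0.05

fpr = significant[is_null].mean()        # Pr(significant | null): close to 0.05
fdr = is_null[significant].mean()        # Pr(null | significant): well above 0.05
print(f"FPR = {fpr:.3f}, FDR = {fdr:.3f}")

The false positive rate lands at the nominal 5%, while the share of nulls among the significant results comes out several times higher.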
Our analysis shows that in our data, hypothesis tests conducted with alpha=0.05 yield an FDR of 18%-25%.

Much higher than 5%.

That is, about 20% of significant effects chosen for implementation will not generate the business impact that was observed in the experiment.
>> [image]
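A back-of-the-envelope way to see how a number in this range can arise, via Bayes' rule. This is only an illustration, not the estimation approach used in the paper, and "power" here is an assumed average power against the non-null effects.

# FDR = Pr(null | significant)
#     = pi0*alpha / (pi0*alpha + (1 - pi0)*power)
pi0, alpha = 0.70, 0.05
for power in (0.5, 0.65, 0.8):           # assumed average power levels
    fdr = pi0 * alpha / (pi0 * alpha + (1 - pi0) * power)
    print(f"power = {power:.2f} -> FDR = {fdr:.1%}")
# prints 18.9%, 15.2%, 12.7%

With ~70% true nulls, tests run at alpha = 0.05 sit far above a 5% false discovery rate for any plausible power level.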
We use multiple methods to estimate the rate of true nulls and the FDR, but one was particularly fun to learn about, as it was developed to estimate false discoveries in genomewide studies. You can read about it here: pnas.org/content/100/16….
>>
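For intuition, the truncated PNAS link appears to point to Storey and Tibshirani's q-value paper. The key observation: p-values from true nulls are uniform on [0, 1], while p-values from real effects pile up near zero, so the p-values above some cutoff come almost entirely from nulls. A minimal sketch of the resulting estimator of the null share (the function name and the toy data are mine):

import numpy as np

def storey_pi0(p_values, lam=0.5):
    # Fraction of p-values above `lam`, rescaled by the width of (lam, 1];
    # under uniform null p-values this estimates the share of true nulls.
    p = np.asarray(p_values)
    return np.mean(p > lam) / (1.0 - lam)

# toy check: 70% uniform "null" p-values, 30% p-values concentrated near 0
rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=7_000), rng.beta(0.5, 10, size=3_000)])
print(storey_pi0(p))   # roughly 0.7

Storey and Tibshirani smooth this estimate over a range of cutoffs rather than fixing one; the single-cutoff version above just shows the core idea.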
Our paper estimates the business costs of these false discoveries, and discusses and tests possible solutions that firms can implement. The details are a bit beyond the scope of this thread, but we hope that the paper with the accompanying data and code will prove useful.
>>
Generally, firms should try to test more radical variations in order to have a larger impact on consumer behavior (we call this “swing for the fences”). Most variations will probably not be very impactful, but once in a while an innovation will prove to be very lucrative.
(Fin)
P.S. This thread doesn't go into all the details, minutiae, caveats and disclaimers that are in the paper. There is much more, and I would love for people to read it.

Follow-up (from a later thread by @marketsensei, 2 Jan):
@LauraALibby raises an important point and question (thanks for the kind words, Laura).

Maybe the high FDR we find is because the experiments have sample sizes that are too small, and therefore low statistical power?
>>
The same question was also asked by a reviewer. This is where peer review improves a paper, IMO.

So we did two types of analyses (in section 5.3):
1. We estimated what the power is in these experiments (spoiler: not so low).
2. We asked what the FDR would be with 100% power.
>>
For the effective power of the experiments, the table below shows that it is 50-80%, depending on the significance level used.

50% sounds low, but the following analysis shows that you can't improve the FDR much even with more power.
>> [image: table of estimated power by significance level]
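As a rough illustration of why (reusing the Bayes'-rule back-of-the-envelope from above, not the paper's Section 5.3 analysis):

pi0, alpha = 0.70, 0.05
for power in (0.5, 0.8, 1.0):
    fdr = pi0 * alpha / (pi0 * alpha + (1 - pi0) * power)
    print(f"power = {power:.1f} -> FDR = {fdr:.1%}")
# prints 18.9%, 12.7%, 10.4%

Even perfect power only brings the FDR down to about 10%, still double the nominal 5%, because roughly 70% of the tested effects are truly null.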
(The follow-up thread continues; 6 tweets in total.)
