The same question was also asked by a reviewer. This is where peer review improves a paper, IMO.
So we did two types of analyses (in Section 5.3): 1. We estimated the power of these experiments (spoiler: not so low). 2. We asked what the FDR would be with 100% power.
>>
For the effective power in the experiments, the table below shows that it is 50-80%, depending on the significance level used.
50% sounds low, but the following analysis shows that increasing power can't improve the FDR by much.
>>
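(Aside, not from the paper: for anyone who wants to see the mechanics, here is a minimal sketch of an effective-power calculation for a two-sided two-proportion z-test, using made-up baseline rate, lift, and sample size. The paper estimates power from the experiments themselves, so treat this purely as an illustration of the calculation, not as its method.)

```python
from scipy.stats import norm

def power_two_prop(p_control, rel_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two conversion rates."""
    p_treat = p_control * (1 + rel_lift)
    se = ((p_control * (1 - p_control) + p_treat * (1 - p_treat)) / n_per_arm) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)        # two-sided critical value
    shift = abs(p_treat - p_control) / se   # standardized true effect
    # P(|Z| > z_crit) when the test statistic is centered at `shift`
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

# Hypothetical numbers: 5% baseline conversion, 10% relative lift, 20k users per arm.
# This lands at roughly 0.6, i.e. in the same ballpark as the 50-80% quoted above.
print(round(power_two_prop(0.05, 0.10, 20_000), 2))
```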
In the analysis we plugged 100% power into the FDR formula.
This is a theoretical, unachievable best-case scenario.
Why unachievable? Because achieving 100% power requires rejecting every hypothesis regardless of the result, which would clearly inflate false positives.
>>
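(To make the formula explicit: under the standard two-group mixture model, where a share pi0 of tested hypotheses are true nulls, FDR ≈ pi0*alpha / (pi0*alpha + (1 - pi0)*power), and minFDR is the power = 1 limit. A minimal sketch assuming that standard form and a hypothetical pi0; the paper's exact specification may differ.)

```python
def fdr(alpha, power, pi0):
    """Expected FDR under a two-group mixture: a share pi0 of tested hypotheses
    are true nulls (no real effect); the rest are detected with the given power."""
    false_discoveries = pi0 * alpha          # true nulls that cross the threshold
    true_discoveries = (1 - pi0) * power     # real effects that cross the threshold
    return false_discoveries / (false_discoveries + true_discoveries)

# minFDR is just the power = 1 limit of the same formula.
# With a hypothetical pi0 = 0.7 and alpha = 0.05:
print(fdr(0.05, 0.6, 0.7))   # ~0.16 at 60% power
print(fdr(0.05, 1.0, 0.7))   # ~0.10 even with 100% power (minFDR)
```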
The chart below (Figure 4 in the paper) shows the estimated FDR in the data (green line) vs. the best-case scenario with 100% power (the red line, labeled minFDR).
We can see that increasing power even to 100% helps somewhat, but not dramatically.
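(Again, not the paper's data: a purely illustrative plot of the same mixture-model formula across significance levels, with a hypothetical pi0 = 0.7 and 60% power, just to show the kind of green-vs-red comparison Figure 4 makes. It is not a reproduction of the figure or of the paper's estimates.)

```python
import numpy as np
import matplotlib.pyplot as plt

pi0 = 0.7                                   # hypothetical share of true nulls
alphas = np.linspace(0.01, 0.10, 100)       # range of significance levels

def fdr(alpha, power):
    # same two-group mixture formula as above
    return pi0 * alpha / (pi0 * alpha + (1 - pi0) * power)

plt.plot(alphas, fdr(alphas, 0.6), "g-", label="FDR at 60% power")
plt.plot(alphas, fdr(alphas, 1.0), "r-", label="minFDR (100% power)")
plt.xlabel("significance level used")
plt.ylabel("false discovery rate")
plt.legend()
plt.show()
```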
(Fin)
@lizzieredford hope this thread also answers a few of the points you raised.
How are effects of online A/B tests distributed? How often are they not significant? Does achieving significance guarantee meaningful business impact?
We answer these questions in our new paper, “False Discovery in A/B Testing”, recently out in Management Science >>
The paper is co-authored with Christophe Van den Bulte and analyzes over 2,700 online A/B tests that were run on the @Optimizely platform by more than 1,300 experimenters.
A big draw of the paper is that @Optimizely has graciously allowed us to publish the data we used in the analysis. We hope this will be valuable to other researchers as well.
>>