You have been testing two versions of an element through Ad A and Ad B. Both ads received a significant amount of traffic and conversions over the last two weeks. At first glance, the results might seem distinctive and trustworthy. But will the winning ad really repeat its A/B test performance when you publish it to your whole audience? You could answer that either by trusting your gut feeling or by calculating it. In this article, I prefer to lean on the power of statistics and explain how to measure the reliability of your results with a statistical approach.
What is a statistically significant A/B test?
A/B testing results are considered “significant” when they are very unlikely to have occurred by random chance. If your results fall short of significance, your “winner ad” might not be a real winner.
Measuring the results
Let’s continue with the scenario of Ad A and Ad B and assume that each ad received 10,000 visits in two weeks. Their conversion counts are:
Ad A (control group): 270 conversions, CR = 2.70%
Ad B (alternative group): 325 conversions, CR = 3.25%
Since most A/B test statistics are calculated from the conversion rates of the test elements, CR uplift becomes an important variable. It is the relative change in conversion rate between two ads, and it can be either positive or negative. Usually, you build your hypothesis on this CR uplift value. For instance, Ad B's conversion rate is 20.37% higher than that of the control Ad A: (3.25% - 2.70%) / 2.70% = 20.37%. This 20.37% is called the conversion rate uplift.
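The uplift arithmetic can be sketched in a few lines of Python. The function name `cr_uplift` and its parameters are made up for this illustration; only the counts come from the scenario above.

```python
def cr_uplift(conv_control, n_control, conv_variant, n_variant):
    """Relative change in conversion rate of the variant versus the control."""
    cr_control = conv_control / n_control
    cr_variant = conv_variant / n_variant
    return (cr_variant - cr_control) / cr_control


# Ad A (control): 270 conversions, Ad B (alternative): 325 conversions,
# each out of 10,000 visits.
uplift = cr_uplift(270, 10_000, 325, 10_000)
print(f"CR uplift: {uplift:.2%}")  # prints "CR uplift: 20.37%"
```

A negative return value would mean the alternative ad converts worse than the control.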
Consequently, the hypotheses should be set up as below:
H0: CR(B) ≤ CR(A)
HA: CR(B) > CR(A)
Interpreting the results
To compare test results automatically, you can use one of the A/B statistics tools available on the internet. This is the one I used for this article. The values you need to enter in the dashboard are the traffic and conversion counts for each ad variation and the confidence level. I set the confidence level to 95%.
If the confidence level is 95% (α), the significance level is 5% (1-α). Note that many textbooks label these the other way around and use α for the significance level.
The dashboard results show the conversion rates, uplift, statistical power, p-value, Z-score and standard errors. The Z-score and standard errors are only needed to calculate the p-value. Since the p-value is also shown, you do not need these parameters unless you want to validate the tool's accuracy by calculating the p-value manually. As a result, the p-value and statistical power are the most important parameters to check.
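For readers who do want to check the p-value by hand, one common formulation is the two-proportion z-test with unpooled standard errors. This is an assumption about the method, since the tool's exact calculation is not documented here. The symbols n_A and n_B (traffic per ad) and Φ (the standard normal cumulative distribution function) are introduced only for this sketch.

```latex
\begin{aligned}
SE_A &= \sqrt{\frac{CR_A\,(1 - CR_A)}{n_A}}, \qquad
SE_B = \sqrt{\frac{CR_B\,(1 - CR_B)}{n_B}} \\
z &= \frac{CR_B - CR_A}{\sqrt{SE_A^2 + SE_B^2}} \\
p &= 1 - \Phi(z) \quad \text{(one-sided, for } H_A: CR_B > CR_A\text{)}
\end{aligned}
```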
Statistical power is the probability that your test detects a real effect when one exists. The closer it is to 1, the more reliable your test is. In practice, we aim for a statistical power of at least 80%, which corresponds to a 20% chance of a false negative. If your test fails to reach this minimum, you need either a larger sample size or a longer test duration.
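As a rough illustration, power can be approximated with a normal approximation for a one-sided two-proportion test. This sketch assumes the observed conversion rates are the true rates and uses a 5% significance level. The function `approx_power` is hypothetical, and a given tool may compute power differently.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(conv_a, n_a, conv_b, n_b, significance=0.05):
    """Approximate power of a one-sided two-proportion z-test.

    Assumption: the observed rates are treated as the true rates.
    """
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference in conversion rates.
    se = sqrt(cr_a * (1 - cr_a) / n_a + cr_b * (1 - cr_b) / n_b)
    # Critical z-value for the chosen one-sided significance level.
    z_crit = NormalDist().inv_cdf(1 - significance)
    # Probability that the test statistic exceeds the critical value.
    return NormalDist().cdf((cr_b - cr_a) / se - z_crit)


power = approx_power(270, 10_000, 325, 10_000)
print(f"Approximate power: {power:.1%}")
print("Meets 80% target" if power >= 0.80 else "Below 80% target")
```

Raising `n_a` and `n_b` in the call shows how a larger sample pushes the power toward 1.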
The smaller the p-value, the stronger the evidence against the null hypothesis. If it is lower than the significance level (1-α, 5% here), your results are significant, so the null hypothesis is rejected in favor of the alternative hypothesis at confidence level α. Conversely, a p-value higher than the significance level means you fail to reject the null hypothesis. In that scenario, you may want to increase your sample size and measure the p-value again before making a final decision. A p-value reported as exactly zero is not a meaningful result. It usually points to rounding or to a failing calculation, so check why the statistical functions are failing.
In this case, the p-value is 0.00386, which is lower than the significance level of 5%. Therefore, the null hypothesis is rejected, and the test supports the conclusion that Ad B performs better than Ad A.
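If you want to check the decision yourself, here is a minimal sketch of a one-sided two-proportion z-test with unpooled standard errors. The method is an assumption, and `one_sided_p_value` is a hypothetical helper. With these counts it returns a p-value of roughly 0.011, which differs from the dashboard figure above but leads to the same decision at the 5% level.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Return (z, p) for the alternative hypothesis CR(B) > CR(A)."""
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference in conversion rates.
    se = sqrt(cr_a * (1 - cr_a) / n_a + cr_b * (1 - cr_b) / n_b)
    z = (cr_b - cr_a) / se
    # Upper-tail probability of the standard normal distribution.
    p = 1 - NormalDist().cdf(z)
    return z, p


significance = 0.05  # 5% significance level, i.e. 95% confidence
z, p = one_sided_p_value(270, 10_000, 325, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}")
print("Reject H0" if p < significance else "Fail to reject H0")
```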
- Reaching statistical significance for the first time shouldn’t be the stopping point of your A/B test. Even a statistically significant A/B test can still be a false positive, because most A/B tests fluctuate between significant and insignificant levels throughout the experiment, even after the desired significance level has been reached. Therefore, give your experiment more time, even if the winning ad stays the same from day to day.
- Your p-value might differ from one web tool to another. Most of the time this is caused by different calculation methods. A one-sided hypothesis is used when you only test for a positive CR uplift. If a negative CR uplift is also possible and of interest, the p-value is computed for a two-sided hypothesis.
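The one-sided versus two-sided difference can be sketched as below. The helper `p_values_from_z` and the z-score of 2.0 are hypothetical and chosen only to show how the same test statistic gives two different p-values.

```python
from statistics import NormalDist


def p_values_from_z(z):
    """Return (one_sided, two_sided) p-values for a given z-score."""
    normal = NormalDist()
    # One-sided: only a positive uplift, CR(B) > CR(A), counts as evidence.
    one_sided = 1 - normal.cdf(z)
    # Two-sided: a deviation in either direction counts as evidence.
    two_sided = 2 * (1 - normal.cdf(abs(z)))
    return one_sided, two_sided


one_sided, two_sided = p_values_from_z(2.0)  # hypothetical z-score
print(f"one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")
```

For a positive z-score the two-sided p-value is twice the one-sided one, so a result can be significant in one tool and not in another at the same threshold.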
A/B testing will not give you 100% certainty, but it does lower your risk. The purpose of statistical significance is to give you a clear picture of your test's success and to help you achieve it in the long run. Nonetheless, an A/B test requires an honest, holistic approach rather than sticking only to a p-value threshold of 0.05. Calculating the significance level is one part of the A/B test, and the A/B test is one part of improving the data and the business.