# Understanding A/B testing statistics to get REAL Lift in Conversions

If you want to increase your chances of getting a real lift through A/B tests then you need to understand the statistics behind them.

If you don’t like learning statistics then I am afraid A/B testing is not for you.

Running an A/B test is actually quite easy. As long as you know what an A/B test is and which software lets you run one, you can go ahead and set up such a test on your website.

This is the easy bit. The hard bit is to actually get any real lift in conversion volume through your tests.

You can run A/B tests 24 hours a day, 7 days a week, 365 days a year and still not see any improvement in conversions if you don’t understand the statistics behind such tests.

The very first thing that you need to understand is what an A/B test can and can’t do for you. This is important in order to manage your expectations of A/B testing.

## A/B test is not a Swiss Army Knife

The purpose of an A/B test is to evaluate landing page designs in order to improve conversions.

It can’t fix data collection issues, attribution issues, data integration issues, data interpretation issues or underlying problems with your marketing campaigns, product pricing, business model, business operations, measurement framework etc.

In short, you can’t just A/B test your way to the top. You need to do a lot more than A/B testing to improve conversions.

A/B testing is just another tool for conversion optimization. It is not a complete solution on its own. It can’t solve all of your conversion problems. It doesn’t deserve the level of attention it gets in conversion optimization conferences.

*A/B tests are not the be-all and end-all of conversion optimization.*

These tests cannot on their own take your business to new heights. These tests cannot on their own produce significant improvement in conversion volume.

If running A/B tests and getting a real lift were that easy, every webmaster running A/B tests would be a millionaire by now.

**Following are the underlying issues with A/B tests which you need to acknowledge:**

## #1 A/B tests are difficult to design and execute and usually fail.

Many marketers can’t run A/B tests correctly because they lack knowledge of statistics.

Consequently, their A/B tests are prone to statistical errors and test design issues from the very start, and they often don’t see any real lift in sales and/or conversion rate even after repeated testing.

## #2 A/B tests take a long time to show results, at least 3 to 4 weeks.

But even after waiting for a month and getting a statistically significant test result there is no guarantee that the winning variation will actually bring any real lift in sales and/or conversion rate.

## #3 In A/B test you are basically testing your own assumptions

This is one of the biggest drawbacks of A/B tests. You may argue that this is not the case if the hypothesis is based on quantitative and qualitative data. But it is still the case.

Even if your hypothesis is based on quantitative and qualitative data, at the end of the day it is your hypothesis, it is your assumption. It is what you think may solve your customers’ problem if tested.

## #4 A/B test results are heavily dependent on sample size

You need the right sample size in order to finish the test and draw conclusions from the test results. This sample size requirement usually means you need to run an A/B test for several weeks.

That also means you need a high traffic website.

## #5 A/B tests measure users’ preference and not behaviour

This is another major drawback of A/B tests and the main reason that most A/B tests fail to generate a real lift in conversions.

**What you are basically testing in an A/B test is whether version ‘B’ is better than version ‘A’.**

**You are not testing how good version ‘B’ is in a range of contexts.**

Maybe your users would have preferred version ‘C’ or version ‘D’ had they got the chance to look at it.

That is why, even after conducting several A/B tests and getting statistically significant results each time with the right sample size, there is no guarantee that your winning variation will actually result in any real lift in conversion volume or conversion rate.

## #6 In A/B tests it is quite common to get imaginary lift in conversions

This happens when confounding variables (more about them later) are not identified before the test and controlled during the test. Such imaginary lifts soon die out when the confounding variables cease to exist.

**The probability of your A/B test producing a real lift is directly proportional to your understanding of your client’s business and your knowledge of statistics.**

The two factors that actually power your A/B tests are:

#1 Great understanding of the client business

#2 Good understanding of the statistics behind A/B tests.

## Great understanding of the client business

If you don’t have great understanding of your client’s business, you are most likely to create and test a hypothesis which won’t solve your customers’ problems either wholly or in parts.

And if something doesn’t solve your customers’ problems then it won’t impact the business bottomline. It is as simple as that.

You need to be sure that what you are testing actually matters to your target audience.

So how confident are you, on a scale of 1 to 10, that what you are going to test actually matters to your target audience?

You need such a confidence level to power your hypothesis. On the basis of this confidence level, I can categorize all hypotheses into two categories:

#1 Underpowered hypothesis

#2 Powerful hypothesis

## Underpowered hypothesis

An underpowered hypothesis is one which is based on personal opinion or whatever your client/boss has to say.

An underpowered hypothesis can also be based on inadequate/flawed analysis or on data which has collection issues.

If your hypothesis is underpowered, your A/B test is doomed to fail from the very start.

## Powerful hypothesis

A powerful hypothesis is one which is based on customers’ objections.

If you are not already collecting customers’ objections via surveys, feedback, usability testing, quantitative data etc. then your chances of creating a powerful hypothesis are close to zero.

Your chances of getting any real lift from an A/B test are also close to zero.

The power level of your hypothesis is directly proportional to your understanding of the client’s business.

The more confident you are that what you are testing is something that really matters to your customers, the more powerful your hypothesis becomes.

You get this confidence by developing great understanding of the client’s business. You develop this great understanding by **asking questions**.

Ask questions which solve your customer’s problems either wholly or in parts.

This is the fastest way to find and fix conversion issues.

Of course you can dive deep into GA reports too.

But in order to develop a truly great understanding of your client’s business you need to ask a lot of questions of the people who actually run the business and also of their target audience.

Don’t try to figure out everything on your own. Any such attempt is not only a waste of time but also futile.

Many marketers make assumptions about the problems their customers are facing. They then create hypotheses around such assumptions, test them and fail spectacularly.

Once you have created a powerful hypothesis you have won half the battle. The other half can be won by using the knowledge of statistics to design and run your tests.

## Good understanding of the statistics behind A/B tests

Once you have developed a great understanding of your client’s business then the only thing standing in your way of getting a real lift from A/B tests is the ‘understanding of the statistics behind A/B tests’.

**Statistics fuel your A/B test design, control your test environment and help in interpreting test results.**

You don’t need to be a full-blown statistician to run A/B tests. You just need to know and do a few things right:

- Understand Statistical Significance
- Realize the power of Effect size
- Decide your sample size in advance
- Understand Statistical power
- Avoid running Underpowered A/B test
- Avoid running Overpowered A/B test
- Understand Minimum Detectable effect (MDE)
- Select high quality sample for your A/B test
- Keep your A/B test results free from outliers
- Keep an eye on Confidence interval
- Identify Confounding Variables and minimize their adverse effects
- Avoid running A/B/C/D… tests
- Break down a complex test into several smallest possible tests
- Integrate your A/B testing tool with Google Analytics

## #1 Understand Statistical Significance

Once you understand what statistical significance is and what statistical significance is not, **you have learned 50% of the statistics behind A/B testing**.

Statistical Significance means statistically meaningful or statistically important.

This is the simplest definition of statistical significance.

When someone says to you “this is not statistically significant”, they mean it is not statistically meaningful. It is not statistically important.

Now how do statisticians define what is statistically significant and what is not? They define it through a metric known as **Significance level (or confidence level)**.

## #1.1 Significance level (or confidence level)

Significance level is the threshold a test result must meet to be called statistically significant.

Expressed as a confidence level, it is the level of confidence in the A/B test result that the difference between control and variation is not by chance.

There are two commonly accepted confidence levels:

- 95%
- 99%

It can also be expressed the other way round, as the probability of seeing such a difference by chance alone when control and variation actually perform the same. This is what statisticians strictly call the significance level (denoted by α).

In that case the two commonly accepted values are:

- 5%
- 1%

Data scientists rarely use percentages here. So a level of 95% is usually denoted as 0.95 (or 0.05 if it has been expressed in terms of getting results by chance).

Similarly, a level of 99% is usually denoted as 0.99 (or 0.01 if it has been expressed in terms of getting results by chance).

In this article, as in most A/B testing tools, a ‘significance level of 95%’ refers to the confidence level. The p-value reported by a test is compared against the matching threshold: a p-value of 0.05 or below corresponds to a significance level of 95% or above.

For a test result to be statistically important (or statistically significant) the significance level should be 95% or above.

If the significance level is below 95% then a test result is not statistically important.
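To make this concrete, here is a minimal sketch of how such a number can be computed for a conversion-rate test. It assumes a two-sided, two-proportion z-test with a pooled standard error, which is one common choice and not necessarily what your A/B testing tool does internally. The function names and the visitor counts are made up for illustration.

```javascript
// Standard normal CDF via the Abramowitz-Stegun approximation of erf
// (absolute error around 1e-7, which is plenty for this purpose).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided, two-proportion z-test (assumed test, pooled standard error).
function twoProportionZTest(convA, visitorsA, convB, visitorsB) {
  const rateA = convA / visitorsA;
  const rateB = convB / visitorsB;
  const pooled = (convA + convB) / (visitorsA + visitorsB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / visitorsA + 1 / visitorsB));
  const z = (rateB - rateA) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { rateA, rateB, z, pValue };
}

// Hypothetical data: control converts 200 of 2000 visitors, variation 250 of 2000.
const result = twoProportionZTest(200, 2000, 250, 2000);
const threshold = 0.05; // the chance-based form of the 95% level
console.log("z:", result.z.toFixed(2));
console.log("p-value:", result.pValue.toFixed(4));
console.log("significant at 95%:", result.pValue < threshold);
```

A p-value below 0.05 clears the 95% bar, and a p-value below 0.01 clears the 99% bar.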

**There are two things which you need to remember about significance level:**

#1 **Significance level changes throughout the duration of an A/B test**.

So you should never trust the significance level until the test is over.

For example, in the first week of running a test, the significance level could be 98%.

By the time the second week is over, the significance level could drop to 88%.

By the time the third week is over, the significance level could be 95%. By the time the fourth week is over, the significance level could be 60%.

Until your test is over, you can’t trust the significance level. Many marketers stop the test as soon as they see significance level of 95% or above. This is a big mistake which I will explain later in this article.

#2 Don’t use significance level to decide whether a test should stop or continue – significance level of 95% or more means nothing if there is little to no impact on conversion volume.

## #1.2 What statistical significance can tell you?

Statistical significance only tells you whether or not there is a difference between variation and control.

So when your significance level is 95% or above, you can conclude that there is a difference between control and variation. That’s it.

## #1.3 What statistical significance can’t tell you?

#1 Statistical significance can’t tell you whether variation is better than control. Many marketers wrongly conclude that because their test results are statistically significant, their variation must be better than control. Remember, statistical significance only tells you whether or not there is a difference between variation and control.

#2 Statistical significance can’t tell you how big or small the difference is between variation and control.

#3 Statistical significance can’t tell you whether or not the difference between control and variation is important or helpful in decision making.

#4 Statistical significance can’t tell you anything about the magnitude of your test result.

#5 Statistical significance can’t tell you whether or not to continue the A/B test.

95% statistical significance does not automatically translate to a 95% chance of beating the original.

This is one of the biggest lies ever told by A/B testing software.

## #2 Realize the power of Effect size

Effect size or size of the effect is the magnitude of your A/B test result.

Effect size is also the magnitude/size of the difference between control and variation. The difference between control and variation is important only when the difference is big.

Formula for calculating the effect size

**Effect size = (mean of experimental group – mean of control group) / standard deviation**

If effect size is:

- less than 0.1 => trivial difference between control and variation
- 0.1 – 0.3 => small difference between control and variation
- 0.3 – 0.5 => moderate difference between control and variation
- greater than 0.5 => large difference between control and variation

Aim for an effect size of 0.5 or more, as it indicates a moderate to large difference between control and variation.
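The formula and the thresholds above can be sketched in a few lines of code. This is only an illustration: the formula does not say which standard deviation to use, so the sketch assumes the pooled standard deviation of both groups, and the revenue figures are invented.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Sample variance (divides by n - 1).
function sampleVariance(values) {
  const m = mean(values);
  return values.reduce((sum, v) => sum + (v - m) ** 2, 0) / (values.length - 1);
}

// Effect size = (mean of experimental group - mean of control group) / standard deviation.
// Assumption: "standard deviation" is the pooled standard deviation of both groups.
function effectSize(control, variation) {
  const nC = control.length;
  const nV = variation.length;
  const pooledVariance =
    ((nC - 1) * sampleVariance(control) + (nV - 1) * sampleVariance(variation)) /
    (nC + nV - 2);
  return (mean(variation) - mean(control)) / Math.sqrt(pooledVariance);
}

// Labels follow the thresholds listed in this article.
function describeEffect(d) {
  const size = Math.abs(d);
  if (size < 0.1) return "trivial";
  if (size < 0.3) return "small";
  if (size <= 0.5) return "moderate";
  return "large";
}

// Hypothetical revenue per visitor for a handful of visitors in each group.
const controlRevenue = [10, 12, 9, 11, 13];
const variationRevenue = [12, 14, 11, 13, 15];
const d = effectSize(controlRevenue, variationRevenue);
console.log(d.toFixed(2), describeEffect(d));
```

In a real test each group would contain thousands of visitors, not five, but the calculation is the same.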

You need a large effect size to increase your chances of getting a winning variation which can actually result in a real lift in conversion rate/volume.

Statistical significance of 95% or higher doesn’t mean anything if the effect size (the impact on conversion volume) is little to none.

So if you run an ecommerce website then you should track ‘revenue’ as a goal for your A/B test.

By tracking revenue as a goal, you would be able to measure following metrics in your A/B test results:

- Revenue per variation
- Revenue per product per variation
- Average revenue per visitor per variation

Revenue (or sales) is an excellent measure of effect size. It is an excellent measure of the magnitude of A/B test result.

Similarly, if you run a website which generates leads then you should track the number of leads generated as a goal for your A/B test.

Marketers often set and track trivial goals like CTR, email signups and other micro conversions for their A/B tests. This is a complete waste of time and resources, as these goals are poor measures of effect size.

You have a better chance of getting a real lift in conversions if you track macro conversions as the goal for your A/B test.

## #3 Decide your sample size in advance

If you keep running the A/B test and choose the sample size as you go, you will at some point get a statistically significant result even if the control and variation are exactly the same.

This happens because of the repeated significance testing error: every time you check for significance, your test gets another chance of producing a false positive result.

A false positive result is a positive test result which is actually false.

For example, your A/B test finds a difference between control and variation when the difference does not actually exist.
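You can see this effect for yourself with a small simulation. The sketch below runs many A/A tests (control and variation share the same true conversion rate) and compares two habits: checking significance only once at the end, versus checking at regular intervals and counting the test as a ‘win’ the first time it crosses 95%. Everything here is an assumption made for illustration: the z-test, the 10% conversion rate, the number of looks and the seeded random number generator.

```javascript
// Small seeded pseudo-random generator (mulberry32) so the run is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const Z_CRITICAL = 1.96; // two-sided cut-off for the 95% level

// Two-proportion z-test with pooled standard error (assumed test).
function isSignificant(convA, nA, convB, nB) {
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  if (se === 0) return false; // nothing to compare yet
  return Math.abs(convB / nB - convA / nA) / se > Z_CRITICAL;
}

function simulateAATests({ runs, visitorsPerArm, looks, rate, seed }) {
  const random = mulberry32(seed);
  const step = visitorsPerArm / looks;
  let everSignificant = 0;  // tests that crossed 95% at any look
  let significantAtEnd = 0; // tests that are significant at the planned end
  for (let run = 0; run < runs; run++) {
    let convA = 0;
    let convB = 0;
    let crossed = false;
    for (let look = 1; look <= looks; look++) {
      for (let i = 0; i < step; i++) {
        if (random() < rate) convA++;
        if (random() < rate) convB++;
      }
      const significant = isSignificant(convA, look * step, convB, look * step);
      if (significant) crossed = true;
      if (look === looks && significant) significantAtEnd++;
    }
    if (crossed) everSignificant++;
  }
  return {
    peekingFalsePositiveRate: everSignificant / runs,
    fixedFalsePositiveRate: significantAtEnd / runs,
  };
}

const outcome = simulateAATests({
  runs: 500, visitorsPerArm: 2000, looks: 20, rate: 0.1, seed: 42,
});
console.log(outcome);
```

Since both versions are identical, every ‘significant’ result in this simulation is a false positive, and the peeking habit produces noticeably more of them than the fixed sample size does.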

So what you need to do is decide your sample size in advance, before you start the test. There are a lot of sample size calculators available out there.

Pick one and calculate the sample size you need for your A/B test in advance. To avoid getting false positive test results, stop your test once you have reached your predetermined sample size, and not before.
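If you are curious what such a calculator does under the hood, here is a rough sketch. It assumes the standard normal-approximation formula for comparing two conversion rates with a two-sided test; real calculators may use slightly different formulas and give slightly different numbers. The function name and the supported settings are my own choices for the example.

```javascript
// z-scores for the usual settings, hard-coded so no inverse normal CDF is needed.
const Z_FOR_ALPHA = { 0.05: 1.959964, 0.01: 2.575829 }; // two-sided
const Z_FOR_POWER = { 0.8: 0.841621, 0.9: 1.281552 };

// Visitors needed per variation to detect a relative lift of `relativeMde`
// over `baselineRate` (normal approximation, equal traffic split assumed).
function sampleSizePerVariation(baselineRate, relativeMde, alpha = 0.05, power = 0.8) {
  const zAlpha = Z_FOR_ALPHA[alpha];
  const zPower = Z_FOR_POWER[power];
  if (zAlpha === undefined || zPower === undefined) {
    throw new Error("This sketch only supports alpha 0.05/0.01 and power 0.8/0.9");
  }
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zPower) ** 2 * variance) / (p2 - p1) ** 2);
}

// Hypothetical test: 10% baseline conversion rate, looking for a 10% relative lift.
const needed = sampleSizePerVariation(0.10, 0.10);
console.log("visitors needed per variation:", needed);
```

For these inputs the answer is in the region of 15,000 visitors per variation, which shows why low traffic websites struggle to finish a test.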

## #4 Understand Statistical power (or power of the A/B test)

Statistical power is the probability of getting statistically significant results.

More precisely, statistical power is the probability that your test will find a statistically significant difference between the control and variation when such a difference actually exists.

It is widely accepted that statistical power should be 80% (0.8) or greater. If the statistical power is less than 0.8 then you need to increase your sample size.

Why is statistical power important? Because marketers who don’t understand statistical power generally end up running underpowered A/B tests.

## #5 Avoid running Underpowered A/B test

An underpowered test is one which has an inadequate sample size.

Underpowered A/B tests greatly increase the likelihood of getting false positive or false negative results.

A false negative result is a negative test result which is actually false.

For example, your A/B test does not find a difference between control and variation when the difference does actually exist.

Statistical power is related to sample size and minimum detectable effect.

Statistical power increases with sample size, as a larger sample means you have collected more information.

If you take a very small sample size for your A/B test then the statistical power of the test will be very small. In other words, the probability that your A/B test will accurately find a statistically significant difference between the control and variation is going to be very small.

If you take a big sample size for your A/B test then the statistical power of the test will be big. In other words, the probability that your A/B test will accurately find a statistically significant difference between the control and variation is going to be high.
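The relationship between sample size and power can be put into numbers. The sketch below is an approximation under stated assumptions: a two-sided test at the 95% level, a normal approximation for the conversion rates, and an equal number of visitors in control and variation. The rates and sample sizes are hypothetical.

```javascript
// Standard normal CDF via the Abramowitz-Stegun approximation of erf.
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Approximate power of a two-sided test to detect the gap between two true
// conversion rates. The tiny chance of a significant result in the wrong
// direction is ignored.
function statisticalPower(controlRate, variationRate, visitorsPerArm, zAlpha = 1.959964) {
  const se = Math.sqrt(
    (controlRate * (1 - controlRate) + variationRate * (1 - variationRate)) / visitorsPerArm
  );
  const shift = Math.abs(variationRate - controlRate) / se;
  return normalCdf(shift - zAlpha);
}

// Hypothetical scenario: the true rates are 10% (control) and 11% (variation).
for (const visitors of [2000, 5000, 15000, 30000]) {
  const power = statisticalPower(0.10, 0.11, visitors);
  console.log(visitors, "visitors per arm => power", power.toFixed(2));
}
```

With only a couple of thousand visitors per arm, this test would miss a real one-point improvement most of the time. It takes roughly 15,000 visitors per arm to reach the 80% mark.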

## #6 Avoid running Overpowered A/B test

An overpowered A/B test is one which has a much larger sample size than it needs.

When the statistical power of your A/B test is 80%, there is a 20% probability of making a type 2 error (or false negative error).

When you run an overpowered A/B test, the statistical power becomes greater than 80%, which decreases the probability of a type 2 error. The probability of a type 1 error (or false positive error) is set by your significance level, so it does not grow with sample size. What does grow is the sensitivity of the test: with a very large sample, even a trivially small difference between control and variation comes out as statistically significant, and you can end up declaring a winner that brings no real lift.

By convention, a type 1 error is treated as 4 times more serious than a type 2 error (a 5% significance level against a 20% type 2 error rate), as finding something that is not there is considered a more serious error than failing to find something that is there.

That’s why 80% (or 0.8) is the usual target for the statistical power of your A/B test.

Overpowered A/B tests not only flag differences that are too small to matter but also waste your time and resources by collecting more test data than needed.

## #7 Understand Minimum Detectable effect (MDE)

Minimum Detectable effect (MDE) is the smallest amount of change that you want to detect from the baseline/control.

For example:

1% MDE => detect changes in conversion rate of 1% or more. You won’t be able to reliably detect changes in conversion rate which are less than 1%.

10% MDE => detect changes in conversion rate of 10% or more. You won’t be able to reliably detect changes in conversion rate which are less than 10%.

40% MDE => detect changes in conversion rate of 40% or more. You won’t be able to reliably detect changes in conversion rate which are less than 40%.

There is a strong inverse relationship between minimum detectable effect and sample size.

The smaller your MDE, the larger the sample size you will need per variation. Conversely, the bigger your MDE, the smaller the sample size you will need per variation.

This is because you need less traffic to detect big changes and more traffic to detect small changes. That’s why it is prudent to make and test big changes.
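To get a feel for how steep this trade-off is, the sketch below estimates the traffic needed for the three MDE values used in the examples above. It rests on several assumptions of mine: the MDE is read as a relative change, the baseline conversion rate is 5%, and the test is two-sided at the 95% level with 80% power, using the usual normal-approximation formula.

```javascript
const Z_ALPHA_95 = 1.959964; // two-sided 95% level
const Z_POWER_80 = 0.841621; // 80% statistical power

// Visitors needed per variation for a given baseline rate and relative MDE.
function visitorsNeeded(baselineRate, relativeMde) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((Z_ALPHA_95 + Z_POWER_80) ** 2 * variance) / (p2 - p1) ** 2);
}

const baseline = 0.05; // hypothetical 5% baseline conversion rate
const mdeTable = [0.01, 0.10, 0.40].map((mde) => ({
  mde: `${Math.round(mde * 100)}%`,
  visitorsPerVariation: visitorsNeeded(baseline, mde),
}));
console.table(mdeTable);
```

Under these assumptions, a 1% MDE calls for millions of visitors per variation, while a 40% MDE needs only a few thousand.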

## #8 Select high quality sample for your A/B test

A high quality sample is one which is random. In other words, it is free from selection bias.

Selection bias is a statistical error which occurs when you select a sample which is not representative of all of the website traffic.

For example, when you select only returning visitors or only the visitors from organic search for A/B testing, the traffic sample is not representative of all of the website traffic, as returning visitors or organic visitors may behave differently from the average visitor to your website.

So if you run an A/B test and the traffic sample is not representative of the average visitors to your website, you are not going to get an accurate insight into how your website visitors respond to different landing page variations (unless, of course, you are running your test only for a particular traffic segment).

In that case launching a winning variation may not result in any real uplift in sales/conversion rate. The launch of winning variation may in fact lower your conversion rate.

## #9 Keep your A/B test results free from outliers

If you are tracking any goal which is an average metric (like average revenue per visitor) then the presence of outliers (extremely large values), like a few abnormally large orders, can easily skew the test results.

The solution to this problem is to stop any abnormally large value from being passed to your A/B test results in the first place.

So if you are tracking revenue as a goal in your A/B testing tool, you should set up code which filters out abnormally large orders from your test results.

For example, if your website’s average order value in the last 3 months has been $150 then any order which is above $200 can be considered an outlier.

You can then write code which doesn’t pass any purchase order greater than $200 to your A/B testing tool.

For example, in the case of Optimizely, the code to exclude abnormally large orders would look something like the one below:

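```javascript
// Assumes priceInCents already holds the order total in cents.
// Orders above $200 (20000 cents) are treated as outliers and never sent.
if (priceInCents <= 20000) {
  window.optimizely = window.optimizely || [];
  window.optimizely.push(['trackEvent', 'orderComplete', { revenue: priceInCents }]);
}
```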

## #10 Keep an eye on Confidence interval

A confidence interval is the range of values within which the true conversion rate is likely to lie. It is a measure of the reliability of an estimate.

It can be expressed like: 20.0% ± 2.0%.

Confidence interval is made up of conversion rate and margin of error.

For example:

*Confidence interval for control: 15% ± 2%* **=>** it is likely that 13 to 17% of the visitors to the control version of the web page will convert. Here 15% is the conversion rate of the control version of the web page and **± 2%** is the margin of error.

*Confidence interval for variation: 30% ± 2%* **=>** it is likely that 28 to 32% of the visitors to the variation page will convert. Here 30% is the conversion rate of the variation page.

Conversion rate is the percentage of unique visitors who saw the control/variation and triggered the goal = conversions / unique visitors who saw the control/ variation

Improvement is the relative difference between conversion rate of variation and conversion rate of control.

For example:

If 30% is the conversion rate of the variation page and 15% is the conversion rate of the control version of the web page then

Improvement = 30% – 15% = 15 percentage points or 100% ([30-15] / 15 * 100)

So there is 100% increase in conversion rate for the variation page.

The confidence intervals of control and variation should not overlap. An overlap indicates that you need a bigger sample size and should continue the test.
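Here is a small sketch that ties these pieces together: conversion rate, margin of error, the overlap check and the relative improvement. It assumes the simple normal-approximation interval at the 95% level, which may differ from the method your testing tool uses, and the visitor numbers are invented.

```javascript
const Z_95 = 1.959964; // z-score for a 95% confidence interval

// Conversion rate with its margin of error (normal approximation, assumed).
function conversionInterval(conversions, visitors, z = Z_95) {
  const rate = conversions / visitors;
  const margin = z * Math.sqrt((rate * (1 - rate)) / visitors);
  return { rate, margin, low: rate - margin, high: rate + margin };
}

function intervalsOverlap(a, b) {
  return a.low <= b.high && b.low <= a.high;
}

// Relative improvement of variation over control, in percent.
function relativeImprovement(controlRate, variationRate) {
  return ((variationRate - controlRate) / controlRate) * 100;
}

// Hypothetical data: 150 of 1000 visitors convert on control, 300 of 1000 on variation.
const control = conversionInterval(150, 1000);
const variation = conversionInterval(300, 1000);

const asPercent = (x) => (x * 100).toFixed(1) + "%";
console.log("control:", asPercent(control.rate), "±", asPercent(control.margin));
console.log("variation:", asPercent(variation.rate), "±", asPercent(variation.margin));
console.log("intervals overlap:", intervalsOverlap(control, variation));
console.log("improvement:", relativeImprovement(control.rate, variation.rate).toFixed(0) + "%");
```

If the two intervals did overlap, the honest reading would be that the test cannot yet tell the two versions apart.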

## #11 Identify Confounding Variables and minimize their adverse effects

Confounding variables are those variables which a tester failed to identify, control, eliminate or measure while conducting a statistical test.

Confounding variables can adversely affect the relationship between dependent and independent variables, thus leading to false positive results.

**Note:** Confounding variables are also known as third variables or confounding factors.

Presence of confounding variables is a sign of weakness in the experiment design.

You must identify as many confounding variables as possible before starting the test and then eliminate or minimize their adverse effects on your test.

The following confounding factors, if they occur in the middle of a test, can considerably impact your website traffic and hence skew the test results:

- New marketing campaigns launched.
- Certain marketing campaigns turned off.
- Occurrence of special events like Christmas, New Year or any public holiday.
- Major positive or negative news/announcement about your website/business like:
  - New product launch
  - New business division launch
  - Closure of a business division
  - Departure/appointment of a key employee/executive
  - Media mention etc.
- Major update to search engine algorithm.
- Complete redesign of the website.
- Redesign of the control and/or variation pages.
- Website hit with a new search engine penalty or got rid of an existing penalty.
- Prolonged website outage or some other server side issue.
- Major website crawling and/or indexing issues (like unwanted robots.txt exclusion which negatively impacts the organic search traffic and direct traffic).
- Change in experiment settings.
- Change in test goals.

Do not change experiment settings in the middle of the test.

For example, if you change the amount of traffic allocated to the original and each variation in the middle of the test, it can easily skew your test results, as one variation could end up getting a lot more returning visitors than the others. Returning visitors have a higher probability of making a purchase, which can skew the test results.

However, if you think it is absolutely necessary to change the traffic allocation settings in the middle of the test then by all means do it. But then reset the test and restart it.

Similarly, do not change your test goals in the middle of the test as it can skew your test results. However, if you think it is absolutely necessary to change the goals then do it. But then reset the test and restart it.

Make notes of the confounding factors that affect your test by creating annotations on the test results’ chart.

The majority of A/B tests fail simply because of the presence of confounding variables which skew the test results.

## #12 Avoid running A/B/C/D… tests

The more test variations you create and compare with control, the higher the probability of getting false positive results. This issue is commonly known as **‘The Multiple Comparisons Problem’**.

The other disadvantage of testing multiple variations is that the more variations you have in your test, the more traffic you need to get statistically significant test results and the longer it takes to finish the test.

So keep your test variants to a minimum. That means avoid A/B/C tests, A/B/C/D tests, A/B/C/D/E… tests and so on.
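A bit of arithmetic shows how quickly the problem grows. The sketch below assumes that each variation is compared with control at the 5% level and that the comparisons are independent, which is a simplification. It also shows the Bonferroni correction, a common remedy that this article does not otherwise cover, purely to illustrate the extra strictness (and therefore traffic) that more variations demand.

```javascript
// Chance of at least one false positive across several comparisons,
// assuming the comparisons are independent (a simplification).
function familyWiseErrorRate(alpha, comparisons) {
  return 1 - Math.pow(1 - alpha, comparisons);
}

// Bonferroni correction: the stricter per-comparison threshold needed
// to keep the overall false positive risk at roughly `alpha`.
function bonferroniAlpha(alpha, comparisons) {
  return alpha / comparisons;
}

const alpha = 0.05;
for (const variations of [1, 2, 3, 4]) {
  const risk = familyWiseErrorRate(alpha, variations);
  console.log(
    variations + " variation(s) vs control:",
    (risk * 100).toFixed(1) + "% chance of at least one false positive,",
    "corrected threshold " + bonferroniAlpha(alpha, variations)
  );
}
```

With one variation the risk stays at 5%, but with four variations it climbs to about 18.5% under these assumptions.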

## #13 Break down a complex test into several smallest possible tests

**Multivariate and multi-page tests** are complex tests. This is because the number of variables/factors involved in such tests makes them harder to analyse and harder to draw conclusions from.

Not only are such tests difficult to set up, harder to manage and slow to finish, but they are also much more prone to test design and statistical errors than simple A/B tests.

So avoid running multivariate and multi-page tests and stick to simple A/B tests.

## #14 Integrate your A/B testing tool with Google Analytics

Before you start your test, always make sure that your A/B testing tool is ready to send the test data to Google Analytics as soon as the test starts.

By integrating your A/B testing tool with GA, you can correlate A/B test results with website usage metrics like: sessions, goal completions, Goal conversion rate, bounce rate, revenue, average time on page etc.

This is very important in order to do deep analysis of your A/B test results.
