A/B testing, or split testing, allows eCommerce sellers to measure the effectiveness of various aspects of their stores and marketing, both paid and organic, by showing two or more variants of the same web page or app to separate test groups. By isolating and changing one variable at a time, sellers can identify what is working well and what needs to change, and then use the resulting data to improve key performance indicators (KPIs) like conversion rate (CR), average order value (AOV), revenue per visitor (RPV), dwell time, and click-through rate (CTR). This article explores the fundamentals of A/B testing in three parts:
A/B Testing: Examining Independent Variables
The Four Key Principles of A/B Testing
How to A/B Test Your eCommerce Store
A/B Testing: Examining Independent Variables
A/B testing involves showing users, at random, different versions of the same webpage (typically with a 50/50 split when there are two versions); one version acts as a control, and the other features a change to one independent variable (like a photograph or product description). Researchers then measure how that change affects a designated KPI (like conversion rate). In the context of eCommerce, this practice can be applied to a range of independent variables on a wide variety of platforms. Common variables of interest include:
Photography and graphics
Product offerings and/or bundles
SKUs per page
Number of sold out SKUs
Text on page
Name of product
Original vs. marked-down pricing
Font color of pricing
Reviews on page
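The random 50/50 split described above can be sketched in a few lines. This is a minimal illustration, not how any particular testing tool works: the function name, the experiment label, and the visitor IDs are all hypothetical. It hashes a visitor ID so that the same visitor always lands in the same version, while visitors overall are divided roughly evenly.

```python
import hashlib

def assign_variant(visitor_id, experiment, variants=("control", "variation")):
    """Map a visitor to one variant, deterministically and roughly evenly.

    visitor_id and experiment are hypothetical identifiers (assumptions);
    hashing them together keeps assignments independent across experiments.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)  # 0 or 1 for a two-way split
    return variants[bucket]

# The same visitor always sees the same version of the page.
print(assign_variant("visitor-123", "product-page-copy"))
print(assign_variant("visitor-123", "product-page-copy"))
```

Hashing rather than drawing a fresh random number on each visit is one possible design choice; it avoids showing a returning visitor both versions, provided the visitor ID itself stays stable.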
The Four Key Principles of A/B Testing
To conduct valid A/B tests, sellers should follow these four key principles:
Hold things constant and isolate change.
Regulate the length of your testing period.
Minimize external validity threats to your data.
Traffic levels must be high enough to achieve a chosen sample size.
Without these principles in place, it is difficult to judge the validity of any statistical analysis of the test data. Each principle is examined below:
1. Hold things constant and isolate change.
Valid A/B testing depends on the researcher’s ability to change one chosen variable while holding everything else constant. If the researcher changes more than one variable, the effects become confounded, and it is impossible to tell which variable influenced the data; this is a type of internal validity threat (IVT).
For Example: A seller would like to improve conversion rate. They decide to run an A/B test with two versions of a product page that have different copy. At the same time, they introduce a free shipping discount on orders over $35. Changing both the copy and the shipping discount confounds the test results, making the data and any conclusions drawn from it unreliable.
2. Regulate the length of your testing period.
Your testing period is the length of time over which you collect data from your variants. Rather than collecting data indefinitely, set a specific testing period in advance to ensure accuracy, maintain consistency, and avoid skewing the data. In most cases, a test should run for at least two weeks or until the results reach a 95% confidence level. This minimizes skew due to fluctuations caused by factors such as:
Indecisive buyers (“I need to think about it” buyers)
Various traffic sources (Facebook, email newsletters, organic search, etc.)
Day of the week and the time of day (running tests from Monday to Monday, not Monday to Friday)
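As a rough planning aid, the guidance above can be turned into a small calculation. The sketch below assumes you already know how many visitors each variation needs and how many visitors your store receives per day; both inputs, and the function name, are hypothetical. It applies the two-week minimum and rounds up to whole weeks so that every day of the week is covered equally.

```python
import math

def planned_duration_days(visitors_per_variation, variations, daily_visitors, min_days=14):
    """Estimate a testing period in days, rounded up to full weeks.

    All inputs are assumptions supplied by the seller; min_days reflects
    the two-week minimum suggested in the text.
    """
    total_needed = visitors_per_variation * variations
    days_for_sample = math.ceil(total_needed / daily_visitors)
    days = max(days_for_sample, min_days)
    return math.ceil(days / 7) * 7  # whole weeks, e.g. Monday to Monday

# Hypothetical store: 20,000 visitors needed per variation, 2 variations,
# 2,500 visitors per day -> 16 days of traffic, rounded up to 21 days.
print(planned_duration_days(20_000, 2, 2_500))  # 21
```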
3. Minimize external validity threats to your data.
To run a successful A/B test, it is critical to avoid data pollution and external validity threats (EVTs). Data pollution happens most commonly when a user in your test is counted twice, which can occur when visitors delete their cookies or return on a different device (mobile, desktop, etc.). EVTs are outside events and conditions that similarly endanger the legitimacy of your test results, such as:
Black Friday/Cyber Monday (BFCM) sales
A positive or negative press mention
A major paid campaign launch
The day of the week
The changing seasons
EVTs cannot be eliminated, but they can be mitigated. One way to reduce their effect on your data is to run tests for an adequate length of time; observing the two-week minimum for test duration, for example, helps reduce the impact of outlying data points.
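The double-counting problem described in this section can be partly addressed when cleaning the data. The sketch below is one possible approach under a strong assumption: that each exposure record carries a stable user identifier (for example, a logged-in account ID) rather than only a cookie. The function name and the sample records are hypothetical. It counts each user once and sets aside users who were exposed to more than one version.

```python
from collections import defaultdict

def clean_exposures(exposures):
    """Count each user once and drop users who saw more than one variant.

    exposures: iterable of (user_id, variant) pairs. user_id is assumed to
    be stable across devices and cookie resets, which is often not the case.
    """
    seen = defaultdict(set)
    for user_id, variant in exposures:
        seen[user_id].add(variant)
    return {
        user_id: next(iter(variants))
        for user_id, variants in seen.items()
        if len(variants) == 1  # exclude users polluted by both versions
    }

# Hypothetical exposure log: u1 is recorded twice, u3 saw both versions.
log = [("u1", "A"), ("u1", "A"), ("u2", "B"), ("u3", "A"), ("u3", "B")]
print(clean_exposures(log))
```

Excluding users who saw both versions is a judgment call; it keeps the comparison clean at the cost of a slightly smaller sample.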
4. Traffic levels must be high enough to achieve a chosen sample size.
To be confident in their A/B test results, sellers need enough store traffic to conduct a valid test; this is essential. Before beginning a test, you must define your sample size, which depends on the minimum detectable effect (MDE) you want to measure.
For Example: You would like to conduct an A/B test for conversion rate. Your current conversion rate is 2%, and you want to detect a relative minimum detectable effect of 10% (a change in conversion rate to either 1.8% or 2.2%). To run a valid split test, your store would need 78,039 visitors for each variation within the testing window.
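For readers who want to see where a figure like this comes from, here is a minimal sketch using one common normal-approximation formula for comparing two proportions. The article does not state which significance level or statistical power it used, so the defaults below (a two-sided 5% significance level and 80% power) are assumptions, as are the function and parameter names. With those assumed settings, the result for a 2% baseline and a 10% relative effect comes out at roughly 78,000 visitors per variation, in line with the example above.

```python
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a two-sided test.

    alpha and power are assumed defaults, not values given in the article.
    """
    delta = baseline_rate * relative_mde            # absolute effect, e.g. 0.002
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    p1 = baseline_rate
    p2 = baseline_rate + delta
    sd_null = (2 * p1 * (1 - p1)) ** 0.5            # spread if there is no effect
    sd_alt = (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5 # spread if the effect is real
    n = ((z_alpha * sd_null + z_beta * sd_alt) / delta) ** 2
    return round(n)

print(sample_size_per_variation(0.02, 0.10))
```

Note how sensitive the result is to the effect size: because the effect appears squared in the denominator, halving the minimum detectable effect roughly quadruples the traffic required.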
Try It: Use a sample size calculator to determine the level of traffic your store will need to conduct a valid split test.
If the visitor threshold you determined is not met during the test, the results are likely to be unreliable, and you could end up making changes to your store based on false assumptions, which may hurt rather than help. NOTE: If your store does not currently achieve the necessary levels of traffic to conduct a split test, there are a number of alternatives.
How to A/B Test Your eCommerce Store
Once you have settled upon your variables of interest, set the testing period, controlled for validity threats, and verified that your store has adequate levels of traffic, take the following steps to set up your A/B test:
1. Develop a hypothesis.
Ensure that your hypothesis is a statement that can be tested and measured. It should state how you expect a change to an independent variable (like color scheme, product description, or buttons) to affect a specific dependent variable. Typical dependent variables for eCommerce stores include KPIs like:
Average Order Value
Revenue Per Visitor
Your hypothesis should be based on any previous data or knowledge you have about improving the desired metric (the dependent variable).
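To ground a hypothesis in previous data, it helps to compute baseline values for the KPIs you intend to move. The sketch below uses the usual definitions of conversion rate (orders divided by visitors), average order value (revenue divided by orders), and revenue per visitor (revenue divided by visitors). The function name and the sample figures are hypothetical.

```python
def baseline_kpis(visitors, orders, revenue):
    """Compute baseline KPIs from historical totals for one time period."""
    return {
        "conversion_rate": orders / visitors if visitors else 0.0,
        "average_order_value": revenue / orders if orders else 0.0,
        "revenue_per_visitor": revenue / visitors if visitors else 0.0,
    }

# Hypothetical month: 10,000 visitors, 200 orders, $9,000 in revenue.
for name, value in baseline_kpis(10_000, 200, 9_000.0).items():
    print(f"{name}: {value:.4f}")
```

A measurable hypothesis can then be phrased against these baselines, for example as an expected change in revenue per visitor of a stated size.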
2. Select a testing tool.
Next, choose a testing tool to facilitate your split test. Google Optimize was a free, user-friendly tool that provided the functionality needed to run several tests on eCommerce stores, but Google discontinued it in September 2023, so sellers now need to choose a currently supported alternative. As you configure your testing tool, ensure that you are controlling (as much as possible) for threats to your data, and establish an adequate testing period to avoid skewed results.
3. Conduct the test.
After ensuring your hypothesis is measurable and setting up your testing tool, you are ready to conduct your test. For the duration of your testing period, monitor the data for any drastic changes, and confirm, as best you can, that the test has not been materially affected by data pollution, EVTs, or IVTs. After 2 to 4 weeks (or once you have achieved a 95% confidence level in your data), stop your test and analyze the final results.
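Most testing tools report confidence for you, and they do not all use the same method. As an illustration only, the sketch below applies a standard two-proportion z-test and treats "95% confidence" as a two-sided p-value below 0.05; that interpretation, the function name, and the sample counts are all assumptions.

```python
from statistics import NormalDist

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Return (z, p_value) for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Pooled rate under the assumption that both versions perform the same.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Hypothetical results: control 1,600 of 80,000; variation 1,780 of 80,000.
z, p = two_proportion_z_test(1_600, 80_000, 1_780, 80_000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 95%: {p < 0.05}")
```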
4. Analyze the results.
An A/B test is considered successful if the data supports the original hypothesis. If your hypothesis was supported, then you should make the change that you tested, and begin another experiment with a new hypothesis to test a different independent variable. If the hypothesis is not supported, then you should not make the change, and you should create a new hypothesis based on information gleaned from your experiment. If a test was not successful, however, that does not always mean that it failed: some tests may not achieve the desired results overall, but they yield new information about market segments that you would not have known otherwise.
For Example: You ran a test on a landing page, but your hypothesis failed. While the change you made did not achieve your desired click-through rate in general, you learned that Android visitors, as compared to other visitor segments like iOS visitors and desktop visitors, had a better experience overall. In many cases, test results cannot be viewed simply as a success or failure and must be analyzed closely, especially by segment.
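A segment-level breakdown like the one in this example can be sketched as follows. The record layout, segment names, and function name are hypothetical; the code simply computes click-through rate for each combination of segment and variant so that differences between segments become visible.

```python
from collections import defaultdict

def ctr_by_segment(rows):
    """Compute click-through rate per (segment, variant).

    rows: iterable of (segment, variant, clicked) records, where clicked
    is True or False. The layout is an assumption for illustration.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for segment, variant, clicked in rows:
        shown[(segment, variant)] += 1
        clicks[(segment, variant)] += int(clicked)
    return {key: clicks[key] / shown[key] for key in shown}

# Hypothetical records for three visitor segments.
rows = [
    ("android", "control", False), ("android", "variation", True),
    ("ios", "control", True), ("ios", "variation", False),
    ("desktop", "control", False), ("desktop", "variation", False),
]
for (segment, variant), ctr in sorted(ctr_by_segment(rows).items()):
    print(f"{segment:8s} {variant:10s} CTR = {ctr:.2f}")
```

Because each segment contains only a fraction of the total traffic, segment-level differences are best treated as leads for a new hypothesis rather than as conclusions in their own right.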
For eCommerce stores with higher levels of traffic, split testing should be a regular day-to-day activity, and it can be especially useful in preparation for peak retail season. A basis point here and there across multiple KPIs can make a substantial difference over the course of a year.