A/B testing is an essential part of business, vital to almost every aspect of decision making - from webpage layout to the wording of email subject lines and PPC ads. It sounds easy. Take, for example, a website. You design two versions of the site with different layouts, divide the traffic between the two, and opt for the one that leads to the most conversions. However, it is far more complicated than it appears.
A number of factors will impact the results of the test. Firstly, before you start out, most people will try to determine the length of the test. This sounds like common sense, but it is hard to decide how long to run the test for, because the right duration is often exactly what the test itself reveals. Vineet Abhishek, Data Scientist at Walmart Labs, recommends checking daily, stopping only if the numbers reveal something truly significant, and continuing as usual if not. You should not call tests before you've reached 95% statistical significance or higher, to ensure the results are accurate.
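To make "95% statistical significance" concrete, here is a minimal sketch of the standard two-proportion z-test for comparing conversion rates, using only the Python standard library. The function name and the example counts are hypothetical, not from the article.

```python
from statistics import NormalDist

def ab_significance(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for conversion rates.
    Returns (z statistic, two-sided p-value); the result clears the
    95% significance bar when p < 0.05."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pool the two samples to estimate the standard error under
    # the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: variant B converted 120/2000 visitors vs A's 90/2000.
z, p = ab_significance(90, 2000, 120, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value below 0.05 corresponds to the 95% threshold mentioned above; note that this check is only valid once, at the end of the test, which is exactly why peeking (discussed next) is dangerous.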
Another issue is that people will look at the data while the A/B test is still running and make decisions based on an incomplete set of results. This can have extremely detrimental consequences, because the results are likely to change as more data arrives. It is hard to prevent - people will invariably dip in - but it is important to avoid acting on those early insights. Set the sample size from the beginning and put proper governance in place so that no one adjusts it or takes results before you reach the size you need, even if you're bleeding money in the meantime. You don't want to draw conclusions from a small sample: insights based on just a few sales will not be accurate. You should be looking at a sample set at least in the hundreds, if not more. One workaround is sequential testing, which uses pre-defined stopping rules so that checking the results as they come in does not inflate your error rate.
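Setting the sample size "from the beginning" means running a power calculation before the test starts. The sketch below uses the standard two-proportion sample-size formula; the function name and the baseline/lift figures are illustrative assumptions, not numbers from the article.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base, mde_rel, alpha=0.05, power=0.8):
    """Approximate visitors needed per arm to detect a relative lift of
    mde_rel over a baseline conversion rate p_base, with a two-sided
    test at significance alpha and the given statistical power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power=0.8
    p1, p2 = p_base, p_base * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# Hypothetical scenario: 3% baseline conversion, hoping to detect a 20% lift.
print(sample_size_per_variant(0.03, 0.20))
```

Even for a modest lift on a typical e-commerce conversion rate, the answer lands in the tens of thousands of visitors per variant - which is why a handful of sales is nowhere near enough to call a test.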
Determining your primary success metric is vital too. When you are evaluating a web page, you are often measuring several metrics to define how successful it is, and your A/B test may look good against some metrics and inconclusive on others. Agree on a primary success metric beforehand so you know which one takes precedence - not after you have seen the results, because by then you will shape the choice to fit your prejudice. If you can, combine multiple metrics into a single success metric - again, beforehand - although be aware that a composite can obscure which individual metric actually moved.
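One way to combine metrics is a weighted score over the relative change in each, with the weights agreed before the test starts. This is a sketch under assumptions: the metric names and weights here are hypothetical, and a real composite would be chosen to reflect your business priorities.

```python
# Hypothetical pre-agreed weights; negative weight on bounce rate
# because an increase in bounces is bad. Agree on these beforehand.
WEIGHTS = {"conversion_rate": 0.6, "revenue_per_visitor": 0.3, "bounce_rate": -0.1}

def composite_score(variant_metrics, baseline_metrics, weights=WEIGHTS):
    """Single success metric: weighted sum of each metric's relative
    change versus the baseline. A positive score means the variant
    improved overall under the pre-agreed weighting."""
    score = 0.0
    for name, weight in weights.items():
        rel_change = ((variant_metrics[name] - baseline_metrics[name])
                      / baseline_metrics[name])
        score += weight * rel_change
    return score

baseline = {"conversion_rate": 0.050, "revenue_per_visitor": 2.00, "bounce_rate": 0.40}
variant = {"conversion_rate": 0.055, "revenue_per_visitor": 2.10, "bounce_rate": 0.38}
print(round(composite_score(variant, baseline), 3))
```

Because the weights are fixed in advance, nobody can retrofit the definition of "success" to whichever metric happened to improve.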
Set your hypotheses in advance as well, and change only one thing at a time so you can pinpoint exactly what made the difference. It's a natural human impulse to get attached to your hypothesis or design treatment, and if your best hypothesis turns out not to be significantly different, you may be tempted to game the test. Ultimately, testing random ideas comes at a huge expense: if you're doing it wrong, you could be wasting valuable time and traffic.