
PM 101: Pitfalls of A/B Testing

A/B testing is a powerful tool — if used correctly

  1. Not having a real hypothesis
  2. Looking at too many metrics
  3. Using feature level metrics
  4. Not having enough sample size
  5. Peeking before reaching sample size
  6. Changing allocation during the test
  7. Not learning from failed tests
  8. Using A/B testing as the only validation method

No real hypothesis

The worst way to run an A/B test is to implement a change, roll it out as an A/B test, and then just “see what happens”. This is bad for several reasons; most notably, by chance alone, some candidate success metrics are likely to show statistically significant improvements. Even more concerning, a test run this way teaches you nothing beyond the very limited scope of the product change itself. A real hypothesis, by contrast, spells out three things:

  • The change you are making
  • The hypothesized impact on user behavior
  • The way to measure this impact (the key metric that is predicted to improve)

Too many metrics

One point that is already implied by the elements above but is worth calling out separately: an A/B test should have a single metric on which you base its success. Looking at multiple metrics makes everything more complex. Firstly, the more metrics you check, the higher the probability that at least one of them shows a false positive. Secondly, if the metrics move in different directions, you face trade-offs that are unclear and that A/B testing alone cannot resolve.
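
To make the first point concrete, here is a minimal sketch (the metric counts and the 5% significance level are illustrative assumptions, not numbers from this article) of how quickly the chance of at least one spurious “win” grows with the number of independent metrics you check:

```python
# Chance of at least one false positive when checking k independent metrics,
# each tested at significance level alpha, when there is no real effect.
alpha = 0.05

for k in (1, 3, 5, 10, 20):
    # Each metric independently "passes" with probability alpha under the null,
    # so the family-wise error rate is 1 - (1 - alpha)^k.
    family_wise_error = 1 - (1 - alpha) ** k
    print(f"{k:>2} metrics -> {family_wise_error:.0%} chance of a false positive")
```

With ten metrics, the chance that at least one of them shows a statistically significant “improvement” purely by chance is already around 40%.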


Using feature level metrics

Even if your hypothesis contains all the elements above, and you are only using a single success metric, it might still not be a good hypothesis. A pitfall that many inexperienced PMs fall into is using feature level metrics in A/B tests, i.e. metrics that measure the usage of the feature being tested. For example, in a communication app, the following hypothesis relies on a feature level metric: “If we include a floating button to compose a new message on the home screen, more new users will compose a message”. This hypothesis sounds reasonable, but it has a big problem: it is practically guaranteed that some users in the test group will tap the new button, and if even a tiny fraction of them ends up sending a message, the test experience will win the A/B test. The hypothesis above is a glorified version of the very bad (and almost always true) hypothesis “If we create a new button, some people will click it” (or the equally bad “If we make the button bigger, more people will click it”). A stronger hypothesis would use a product level metric, such as overall messages sent per new user, which also captures whether the new button merely cannibalizes existing ways of composing a message.


Not enough sample size

When you have set up a proper hypothesis, you can start thinking about the required sample size, i.e. the number of users that have to be exposed to each experience before a statistically significant improvement of the expected size can be detected. The required sample size depends on the baseline value of your metric, the minimum improvement you want to be able to detect, and the significance level and statistical power you choose.
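
As a rough sketch (the 10% baseline conversion rate, the 11% target, and the conventional 5% significance level and 80% power below are assumed values for illustration, not numbers from this article), the required sample size can be estimated with a standard power calculation, for example using statsmodels:

```python
# Sample size estimate for a two-proportion A/B test, assuming a 10% baseline
# conversion rate and a minimum detectable improvement to 11%.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed conversion rate of the control experience
target_rate = 0.11     # smallest improvement worth detecting
alpha = 0.05           # significance level (accepted false positive rate)
power = 0.80           # probability of detecting the effect if it is real

# Cohen's h effect size for the difference between the two proportions.
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Users needed per group (control and test) to reach the desired power.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0
)
print(f"~{n_per_group:,.0f} users per group")  # on the order of 15,000 per group
```

Detecting a one percentage point improvement on a 10% baseline already requires on the order of 15,000 users per group; smaller expected improvements drive the requirement up quickly.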


Peeking

Once you have determined the required sample size, you need to wait until it is reached before analyzing the results and declaring statistical significance. If you “peek” earlier and call the experiment as soon as a statistically significant result shows up, you severely increase the risk of false positives: the target metric fluctuates over time, and an early “significant” difference is often noise rather than signal.
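
A small simulation sketch illustrates this (the traffic numbers, the conversion rate, and the use of an A/A setup with no real difference between groups are my assumptions for illustration): checking for significance after every batch of users and stopping at the first “win” produces far more false positives than a single test at the planned sample size.

```python
# Simulate an A/A test (both groups convert at the same 10% rate) and compare
# two policies: a single test at the planned sample size vs. peeking after
# every batch and declaring a winner at the first significant result.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p = 0.10          # true conversion rate in both groups (no real effect)
batch = 1_000     # users added per group between peeks
n_batches = 20    # planned sample size: 20,000 users per group
n_sims = 2_000

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * norm.sf(abs(z))

peeking_fp = fixed_fp = 0
for _ in range(n_sims):
    conv_a = conv_b = n_a = n_b = 0
    stopped_early = False
    for _ in range(n_batches):
        conv_a += rng.binomial(batch, p); n_a += batch
        conv_b += rng.binomial(batch, p); n_b += batch
        if not stopped_early and p_value(conv_a, n_a, conv_b, n_b) < 0.05:
            stopped_early = True              # the peeker calls a false winner
    peeking_fp += stopped_early
    fixed_fp += p_value(conv_a, n_a, conv_b, n_b) < 0.05  # single final test

print(f"False positive rate when peeking every batch: {peeking_fp / n_sims:.1%}")
print(f"False positive rate with a single final test: {fixed_fp / n_sims:.1%}")
```

Both policies see exactly the same data; the only difference is how often significance is checked, yet the peeking policy declares a nonexistent effect several times more often than the nominal 5%.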


Changing allocation during the test

An A/B test in which 50% of the population is allocated to the control group and 50% to the test group is fine. An A/B test in which 90% of the population is allocated to the control group and 10% to the test group is also fine. What is not fine is, for example, starting a test at 90/10 (say, to rule out any extremely negative impact) and later changing to 50/50. Users who enter the test before and after the change can differ systematically (different days of the week, a different mix of new and returning users), and pooling periods with different splits can bias the comparison, a version of Simpson’s paradox.
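
Here is a small numerical sketch of how this can play out (all numbers are invented for illustration): the test variant converts better in each individual week, yet looks worse in the pooled result, because the allocation change means most of its users arrived during the weaker second week.

```python
# Sketch of how changing the traffic split mid-test can reverse the result.
# Week 1: 90/10 split, higher-converting traffic.
# Week 2: 50/50 split, lower-converting traffic.
weeks = [
    # (control_users, control_rate, test_users, test_rate)
    (90_000, 0.10, 10_000, 0.11),  # test variant wins week 1 (11% vs 10%)
    (50_000, 0.05, 50_000, 0.06),  # test variant wins week 2 (6% vs 5%)
]

control_users = sum(c for c, _, _, _ in weeks)
control_conversions = sum(c * r for c, r, _, _ in weeks)
test_users = sum(t for _, _, t, _ in weeks)
test_conversions = sum(t * r for _, _, _, r in weeks)

print(f"Pooled control conversion: {control_conversions / control_users:.2%}")  # ~8.2%
print(f"Pooled test conversion:    {test_conversions / test_users:.2%}")        # ~6.8%
# The test variant wins in every week but loses in the pooled numbers, because
# the allocation change skews its traffic toward the weaker week.
```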


No learning in case of failure

Product management guru Marty Cagan says that there are two “inconvenient truths about product”: first, at least half of our ideas are simply not going to work; second, even the ideas that do have potential typically take several iterations before they deliver the expected business value. If most of your tests are going to fail, the real return on a failed test is what it teaches you about your users, and you can only extract that learning if you started from a real hypothesis and take the time to understand why it turned out to be wrong.


A/B testing as the only validation method

As discussed above, most of our improvement ideas end up not working, or need multiple iterations before they work out. If A/B testing is our only validation method, we always have to build the full solution before we learn whether the idea has any merit. We should therefore employ cheaper, qualitative ways of testing before the A/B test, for example showing users a prototype of the solution and watching them interact with it.

Previous articles in my “PM 101” series:
