If you're a business running low-powered experiments, you risk missing out on insights worth thousands of dollars
Statistical power is one of the most important, but also most commonly misunderstood, concepts in the world of experiments and research. Running experiments without sufficient statistical power means risking false negatives. For example, if you’re a clinical psychologist testing a new therapy, you risk failing to detect a new, life-altering treatment. If you’re a business running low-powered experiments, you risk missing out on insights worth thousands or even millions of dollars.
So what exactly is statistical power? And how do you make sure you're using it properly?
Sampling and Random Variation
To truly understand statistical power, it is essential to step back and remember why we run experiments in the first place. When we run experiments, we take samples from two potentially different populations: the treatment group and the control group. We use samples because observing an entire population takes too much time and too many resources. The data observed from those two samples are then used to infer something about how the populations likely behave: If we show people B instead of A, they will do X.
Implicit here is the assumption that these samples are representative of the populations we are drawing from. But here is the catch: we can never really be sure how representative these samples are of the general population. What’s more, we know that random variation outside our control can drive large deviations between these samples and the population.
This means that, purely by random chance, we could observe a difference within our samples that is drastically different from the true difference in the population. Even worse, we might observe a negative impact on our metric when a positive effect actually exists.
If this is the case, how could we ever begin to learn something about the whole population if our sample might be a biased representation of that population? The answer is…we can’t! But statistics can help us quantify our certainty (or lack thereof). More specifically, statistical power answers the question, “If we ran this experiment an infinite number of times (as opposed to just once), in what share of those runs would we find a significant difference between our two samples, given that the difference truly exists?”
I know infinity is a lot to wrap your head around. At this point, statistical power is probably only making slightly more sense than it did after your chat with the grumpy statistician. Let me help further bridge the gap with an example.
Understanding Power Through the Viewpoint of a Wizard
Let’s imagine you are an omniscient and omnipotent wizard. Instead of using this power to take over the world, you’re employed at a tech company due to a series of poor choices at Hogwarts School of Witchcraft and Wizardry. It is a great gig, but you don’t exactly have the title “Ruler of the Universe” that you could have had.
Your boss asks you to estimate the impact of a new homepage design on sign-up rates. Because you’re omniscient, you know that every user who could ever possibly touch the homepage across all of space and time will convert at a rate of 40% on the old homepage and at a rate of 46% on the new homepage. So, you know that the actual impact of the new design is huge: a relative lift of +15% (46% is 15% higher than 40%).
But because you know that making your boss aware of your omniscient essence will increase the asks coming your way (you’re also an extremely lazy wizard), you pretend that you don’t already know the impact. And because you’re omniscient and omnipotent, and have some time to kill, you decide it’ll be fun to analyze not just one experiment but many. For each experiment, you want to understand if a statistically significant, positive result was observed.
So, to begin, you look into your population of users across all of time and space and draw a random sample of 100 users for each homepage. Within these samples, you find that the old homepage has a conversion rate of 40% and the new homepage has a conversion rate of 54%, for a statistically significant difference of +35%. Well, that is interesting. We have a statistically significant result, but it’s quite a bit different from what we’d expect given that we know the population converts at 46%, not 54%, on the new design.
So, we take another sample of 100 users per homepage and observe a conversion rate of 52% for the old homepage and a conversion rate of 45% for the new homepage. This implies a negative difference of -13%, though not a statistically significant one. Now, this is really strange and much different from the +15% effect we know exists within the population.
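To make the two results above concrete, here is a minimal sketch of how the relative lift and significance could be checked. It assumes a pooled two-sided two-proportion z-test and 100 users per homepage; the article does not say which test was used, and the function name `two_proportion_ztest` is purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return (relative lift of B over A, two-sided p-value) from a pooled z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_lift = rate_b / rate_a - 1
    return relative_lift, p_value


# First sample: 40 of 100 convert on the old page, 54 of 100 on the new page.
lift_1, p_1 = two_proportion_ztest(40, 100, 54, 100)
# Second sample: 52 of 100 convert on the old page, 45 of 100 on the new page.
lift_2, p_2 = two_proportion_ztest(52, 100, 45, 100)

print(f"sample 1: lift={lift_1:+.0%}, p={p_1:.3f}")
print(f"sample 2: lift={lift_2:+.0%}, p={p_2:.3f}")
```

Under these assumptions, the first sample lands just under the conventional 0.05 threshold while the second is nowhere near it, which matches the story: one "significant" +35% and one non-significant -13%, both drawn from the same +15% reality.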
At this point, you’re a little confused. How could these first two samples be so different from what you know is the truth? You think to yourself, “These first two results were very unexpected, but they were just two samples. I have more samples I can draw conclusions from.”
So, off you go. After sampling 100 users from the old homepage population and the new homepage population a large number of times (20k, to be exact), you find that you observed a statistically significant result in only 15% of those experiments. This means you failed to detect a significant effect 85% of the time, despite knowing that a real, sizable +15% effect exists in the broader population. How could this be?
This comes down to sampling variability, and the law of large numbers tells us how it shrinks: as the sample size increases, the sample mean gets closer and closer to the mean of the whole population. At only 100 users per sample, it is unlikely that both the old homepage sample and the new homepage sample closely match their respective populations. This means that purely by random chance, we are sometimes going to get estimates for one or both of our samples that vary substantially from the actual population value.
Okay, so what if we increased our sample to 500 users and ran 20k experiments again? When doing this, we detect a significant result about 80% of the time. And what if we increased our sample to 10,000 users? We detect a significant result nearly 100% of the time.
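The wizard's repeated-sampling exercise can be sketched as a small simulation. This is a rough sketch under stated assumptions, not the analysis behind the figures above: it assumes a pooled two-sided z-test at a 0.05 significance level, counts only significant results in the positive direction, and uses 2,000 simulated experiments rather than 20k to keep it fast. The function name `simulate_power` and its parameters are made up for illustration. Because the article does not specify the test or significance level, the rates it prints will not necessarily match the percentages quoted.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_power(n_per_group, p_control=0.40, p_treatment=0.46,
                   n_experiments=2000, alpha=0.05, seed=0):
    """Share of simulated experiments that show a significant positive lift."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_experiments):
        # Draw one sample per homepage from the "true" populations.
        conv_c = sum(rng.random() < p_control for _ in range(n_per_group))
        conv_t = sum(rng.random() < p_treatment for _ in range(n_per_group))
        pooled = (conv_c + conv_t) / (2 * n_per_group)
        std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if std_err == 0:
            continue  # degenerate sample: no variation, nothing to test
        z = (conv_t - conv_c) / n_per_group / std_err
        if z > z_crit:
            hits += 1
    return hits / n_experiments


for n in (100, 500):
    print(f"N={n}: significant in {simulate_power(n):.0%} of experiments")
```

Larger values of `n_per_group` work the same way but take longer in pure Python; the point is only that the share of significant experiments climbs as the sample grows.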
To illustrate this a little more clearly, it helps to look at the distribution of the conversion rates observed for every sample at each size of N. Although these distributions are always centered on the population means (40% and 46%), the spread around each mean decreases significantly as N gets larger (remember the law of large numbers?).
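How quickly that spread shrinks can be put into numbers. The sketch below uses the standard formula for the standard error of a sample proportion, sqrt(p(1 - p) / N), with the 40% and 46% population rates from the example; the helper name `proportion_std_error` is just an illustrative choice.

```python
from math import sqrt


def proportion_std_error(p, n):
    """Standard deviation of a sample conversion rate from n users with true rate p."""
    return sqrt(p * (1 - p) / n)


for n in (100, 500, 10_000):
    se_old = proportion_std_error(0.40, n)
    se_new = proportion_std_error(0.46, n)
    print(f"N={n:>6}: old 40% +/- {se_old:.1%}, new 46% +/- {se_new:.1%}")
```

At N = 100, one standard error is close to five percentage points, nearly as large as the six-point gap between the two homepages, which is why individual samples can land almost anywhere. The spread falls with the square root of N, so it takes four times as many users to cut it in half.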
As sample size increases, so does the chance of observing an estimated delta similar to that of the population. And as the chance of observing a delta close to the actual value of +15% increases, so does our probability of observing a statistically significant result. We become less and less likely to observe the confusing +35% increases or -13% decreases we saw in our two samples earlier.
And this is the core concept behind statistical power! As sample size increases, we can become more and more certain that the observed results accurately represent differences that exist in the population. Although we might not always get the exact estimate of these differences right, we become more and more sure that something positive is there. When we do that, we become less and less likely to miss a significant result when it exists.
Planning for Well-Powered Experiments
So, what can we do to ensure we are running experiments with adequate statistical power? It all starts with some planning prior to running your experiment.
To begin planning an experiment with adequate statistical power, you should start by defining how small an effect you expect, where the effect is the impact of whatever changes you’re making. Maybe for our homepage design experiment, we’re expecting an effect of +4% from the new designs. You should also define the statistical power you hope to achieve in your experiment. The industry standard is 80%, but some companies are more risk-averse than others. If you want to be more certain that you didn’t miss a positive effect, simply increase your desired statistical power.
Given these two inputs, along with the baseline conversion rate and your significance level (significance levels are a topic for another chat), you can estimate the sample size required to achieve adequate power for your hypothetical effect. You will then run your experiment until you reach the sample size required.
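As a rough sketch of that calculation, the function below uses the common normal-approximation formula for comparing two proportions. Several things here are assumptions rather than anything the article specifies: the test is two-sided, the significance level defaults to 0.05, the effect is treated as a relative lift over the baseline (so +4% on a 40% baseline means 41.6%), and the name `required_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(baseline, relative_lift, power=0.80, alpha=0.05):
    """Approximate users needed per group for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)  # assumes the lift is relative, not absolute
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Homepage example: 40% baseline, hoping to detect a +4% relative lift.
print(required_sample_size(0.40, 0.04))
# Raising the desired power raises the required sample size.
print(required_sample_size(0.40, 0.04, power=0.90))
```

Dedicated power-analysis tools may use slightly different formulas and give somewhat different numbers, but the pattern holds: smaller effects and higher desired power both demand more users.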
By running your experiment until it reaches an adequate sample size, you can be pretty sure that if an effect of X percent does exist, you would’ve seen it in your experiment. This means you can rest easy knowing that you didn’t miss some insight that would’ve generated millions of dollars in revenue for your business (let’s hope).