A holistic comparison of statistical methods for online experimentation
Many different statistical regimes are used in hypothesis testing, and it can be easy to get lost in the array of choices. Although it’s easy to find zealots arguing that one approach is universally better (also known as the “statistics wars”), every statistical method has unique strengths and weaknesses. The wide range of nuance in data analysis means that any of them may be most suited for your particular use case.
While there are many resources comparing statistical regimes, they tend to focus solely on statistical power. In this post, we argue that a more holistic comparison is beneficial, and we analyze five common statistical techniques along multiple dimensions (statistical power, intuitiveness, flexibility, robustness, and ease of implementation) to help you pick the methodology that best fits your use case.
In order to graphically compare these different statistical methods, we’ll evaluate them on a few key criteria:
Statistical power quantifies how sensitive a method is to picking up differences in outcomes if they exist. Under the formal definition, it is difficult to avoid apples-to-oranges comparisons across methods. Instead, when we refer to “statistical power,” we mean it more broadly: how much data must be collected to reach a particular level of precision? An alternative way to think about power: if you run an experiment for X amount of time, how narrow are the confidence intervals (or credible intervals) at its conclusion?
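To make the interval-width view of power concrete, here is a minimal sketch. It assumes a two-group comparison of means with equal group sizes, a known common standard deviation, and a normal approximation; the function name `ci_half_width` and the example inputs are illustrative and not taken from any particular tool.

```python
import math
from statistics import NormalDist


def ci_half_width(sigma: float, n_per_group: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval for the
    difference in means between two equal-sized groups.

    Assumes (for illustration only) a known, common standard deviation `sigma`.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the difference between two independent group means.
    standard_error = sigma * math.sqrt(2 / n_per_group)
    return z * standard_error


# Quadrupling the sample size halves the interval width.
for n in (1_000, 4_000, 16_000):
    print(f"n per group = {n:>6}: +/- {ci_half_width(1.0, n):.4f}")
```

Under these assumptions the interval narrows with the square root of the sample size, which is why small effects are so much more expensive to measure precisely than large ones.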
For a deeper dive into comparing statistical power, Mårten Schultzberg and Sebastian Ankargren have written an excellent article.
Flexibility in this case refers to how well a method is able to adapt on the fly as more data becomes available. For example, if you realize the effect size might be bigger or smaller than you initially had anticipated, can you adjust the experiment's runtime, or do you have to start from scratch?
Intuitiveness could also be described as the comprehensibility of the statistical method and its usage to a general audience. Proponents of Bayesian analysis, for instance, often prefer it for its intuitive definitions and explanations. Intuitiveness covers both the ease (or difficulty) of setting up and running an experiment confidently (i.e., do you need an intimate understanding of statistics to use this method well?) and the ease of explaining the outcomes and results of your experiment to less-technical stakeholders.
Ease of implementation refers to the technical effort required to run and analyze experiments using the method. Is it possible to quickly implement a particular statistical test? Are there standard packages available in popular programming languages? Does the approach scale well when analyzing many metrics and experiments?
Finally, we look at robustness. Every statistical methodology is built upon assumptions that ensure validity. However, in practice, it is not always possible to guarantee that these assumptions hold. Robust methods can handle (small) violations in assumptions without completely changing the results. Some examples of assumptions that are often made when analyzing experiments:
Hypothesis testing has a long and storied history, dating back to the classical t-test often taught in introductory statistics courses. This foundational method is the result of the combined work of Fisher, Neyman, and Pearson. In the classical t-test, you decide how much data to gather ahead of time (e.g., using a sample size calculator), then wait for all the results to come in before looking at the outcome to make a decision.
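As a rough illustration of what a sample size calculator does, the sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison of means. The function name `required_sample_size` and the example inputs are assumptions made for illustration; real calculators may use different approximations.

```python
import math
from statistics import NormalDist


def required_sample_size(min_effect: float, sigma: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of units needed per group to detect a difference
    in means of `min_effect` with the given significance level and power.

    Assumes equal group sizes and a known, common standard deviation `sigma`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * ((z_alpha + z_power) * sigma / min_effect) ** 2
    return math.ceil(n)


# Detecting an effect of 0.1 standard deviations at alpha = 0.05 and 80% power.
print(required_sample_size(min_effect=0.1, sigma=1.0))  # -> 1570
```

Because the required sample size scales with the inverse square of the effect, halving the minimum detectable effect roughly quadruples the amount of data to collect before looking at the result.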
In the 1950s, sequential tests began to gain popularity, particularly in the medical field. One reason is that they allow for adaptability to the effect size being studied. For example, if there is uncertainty about how effective a drug is, a cure rate of 80% could be easy to detect quickly, but we wouldn't want to discard smaller effects. Even if a drug is only effective for 20% of cases, that may still be very important for an otherwise incurable disease, but it would take much longer to detect. It would be unethical to continue providing patients with a placebo when we know a drug is effective. However, we also do not want to conduct the experiment with too short a duration, as this would result in reduced statistical power and make it difficult to detect smaller yet clinically relevant effects.
Group sequential testing plans for experimenters to make a number of interim analyses, or “peeks,” which are fixed in advance in both quantity and frequency. Group sequential tests have become particularly popular in the medical field and have also recently seen a surge of interest in industry, particularly at Spotify and Booking.com.
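One common way to plan the interim analyses is an alpha spending function, which decides how much of the total false positive budget may be used by each peek. The sketch below shows the Lan-DeMets spending functions that approximate the O'Brien-Fleming and Pocock designs. This technique is brought in here for illustration and is not necessarily what any particular company uses; the function names are hypothetical. It computes only the cumulative budget spent, since turning that budget into actual stopping boundaries requires numerical integration that is usually left to specialized software.

```python
import math
from statistics import NormalDist


def obf_type_spending(t: float, alpha: float = 0.05) -> float:
    """Cumulative alpha spent at information fraction t (0 < t <= 1) under
    the Lan-DeMets O'Brien-Fleming-type spending function."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))


def pocock_type_spending(t: float, alpha: float = 0.05) -> float:
    """Cumulative alpha spent at information fraction t (0 < t <= 1) under
    the Lan-DeMets Pocock-type spending function."""
    return alpha * math.log(1 + (math.e - 1) * t)


# Five equally spaced peeks, fixed in advance.
num_peeks = 5
for k in range(1, num_peeks + 1):
    t = k / num_peeks
    print(f"peek {k}: OBF-type {obf_type_spending(t):.5f}, "
          f"Pocock-type {pocock_type_spending(t):.5f}")
```

The O'Brien-Fleming-type function spends almost nothing at early peeks and saves most of the budget for the end, while the Pocock-type function spends it more evenly, making early stopping easier at the cost of power in the final analysis.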
Fully sequential testing, on the other hand, allows the experimenter to stop at any point – no need to pre-determine your cadence and number of peeks. This method has gained more popularity in tech circles in recent years due to the data revolution, as we can now collect data continuously.
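To give a feel for how a fully sequential test can be checked after every observation, here is a minimal sketch of one well-known construction, the mixture sequential probability ratio test (mSPRT), which yields an always-valid p-value. It is one example of a fully sequential method and is not claimed to be the one any particular platform uses. The sketch assumes a stream of observations (for example, per-unit treatment-minus-control differences) that are normally distributed with known standard deviation `sigma`, a null hypothesis of zero mean, and a normal mixing distribution with standard deviation `tau`. All names and parameter values are illustrative.

```python
import math
import random


def always_valid_p_values(observations, sigma: float = 1.0, tau: float = 1.0):
    """Always-valid p-values from a mixture SPRT for H0: mean == 0.

    `tau` is the standard deviation of the normal mixing distribution over
    the unknown mean; it is the hyperparameter that decides where the test
    concentrates its power.
    """
    p_values = []
    running_sum = 0.0
    p = 1.0
    for n, x in enumerate(observations, start=1):
        running_sum += x
        denom = sigma**2 + n * tau**2
        # Log of the mixture likelihood ratio after n observations.
        log_lr = (0.5 * math.log(sigma**2 / denom)
                  + (tau**2 * running_sum**2) / (2 * sigma**2 * denom))
        # The p-value is the running minimum of 1 / likelihood ratio.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values


# Simulated data with a true effect of 0.5 standard deviations.
rng = random.Random(42)
data = [rng.gauss(0.5, 1.0) for _ in range(500)]
p_values = always_valid_p_values(data)

# The experimenter may stop the first time the p-value drops below alpha.
stop_at = next((i for i, p in enumerate(p_values, start=1) if p < 0.05), None)
print("first n with p < 0.05:", stop_at)
```

Because the p-value is a running minimum, it can only stay flat or decrease as data arrives, and the false positive guarantee holds no matter when the experimenter decides to stop.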
Fully sequential methods are a good default when scaling up experimentation across teams less experienced with statistical analysis. There are no prerequisites to designing each incremental experiment, and the linear progress of the p-value makes it nearly impossible to misinterpret. It is also good at making fast ship/no-ship decisions with good statistical guarantees, that is, in situations where making a quick decision is more important than understanding the precise impact. In this sense, it is the opposite of the classical t-test.
Because sequential tests generate a sequence of tests (or confidence intervals), rather than the single one produced by a t-test, there is no single optimal “boundary” along the sequence. Instead, the experimenter has to choose where to concentrate statistical power. Particular choices for group sequential tests carry names, such as the O’Brien-Fleming test and Pocock test, while fully sequential methods usually have hyper-parameters one can set. Further customization is also possible. For example, it is possible to combine a fully sequential test during the experiment with a t-test at the end of the experiment period, using $\alpha/2$ as the significance level for each. This combines the best of both worlds, while a union bound shows that statistical guarantees are still met. We will call this the hybrid sequential method.
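The union-bound argument can be spelled out explicitly. Writing $R_{\text{seq}}$ for the event that the sequential test rejects at any point during the experiment and $R_{t}$ for the event that the final t-test rejects (notation introduced here for illustration), the probability of a false positive under the null hypothesis $H_0$ satisfies:

$$
\Pr_{H_0}\left(R_{\text{seq}} \cup R_{t}\right) \le \Pr_{H_0}\left(R_{\text{seq}}\right) + \Pr_{H_0}\left(R_{t}\right) \le \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha
$$

The bound holds regardless of how the two tests are correlated, which is why no further correction is needed.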
The hybrid sequential method is a good fit when you want to combine the benefits of the fixed t-test approach (namely, power at the end of an experiment) with those of the sequential approach (namely, early stopping).
Finally, there is the Bayesian standpoint on experimentation. While all the above methods are frequentist in nature, Bayesian hypothesis testing involves forming a prior belief and then updating it with data to create a posterior. This approach can be difficult to compare with frequentist methods, as the underlying philosophy is quite different. Whether you like or dislike Bayesian methodology is mostly a matter of taste (debates on frequentist vs. Bayesian epistemology and intuitiveness are a common manifestation of this).
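As a small illustration of the prior-to-posterior update, the sketch below uses a Beta prior on a conversion rate, which updates in closed form when combined with binomial data. The uniform Beta(1, 1) prior, the conversion counts, and the function names are all assumptions made for this example; they are not a recommendation or a description of any particular tool.

```python
import random


def beta_posterior(prior_a: float, prior_b: float,
                   conversions: int, visitors: int) -> tuple:
    """Update a Beta(prior_a, prior_b) prior on a conversion rate with
    observed data, returning the parameters of the Beta posterior."""
    return prior_a + conversions, prior_b + (visitors - conversions)


def prob_b_beats_a(post_a: tuple, post_b: tuple,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the posterior probability that variant B's
    conversion rate is higher than variant A's."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws


# Hypothetical data: control converts 120 of 1000, treatment 150 of 1000.
control = beta_posterior(1, 1, conversions=120, visitors=1000)
treatment = beta_posterior(1, 1, conversions=150, visitors=1000)
print("posterior for control:", control)
print("P(treatment > control):", prob_b_beats_a(control, treatment))
```

The output is a direct probability statement about the variants, which is the kind of result many people find easier to explain than a p-value.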
Whether to use a frequentist or Bayesian approach often boils down to preference, but here are some other subjective reasons:
As is often the case in both statistics and life more generally, there is no silver bullet when it comes to selecting frequentist vs. Bayesian vs. sequential statistical methodology for analyzing your experiment results. Which method is most appropriate depends partly on the specific situation and partly on preference.
Interestingly, the choice to utilize any one of the methods on this list relies not just on the technical context of what is being tested and how your experiment is implemented, but also on the experimentation culture of the team running and monitoring experiments. The experience level of the experimenters and other stakeholders, familiarity with the definitions of key terms, and even basic preferences may make any of these methods more or less appropriate.
For experimentation to drive stronger decision-making, tests must be both run and communicated well. These are challenges we’ve thought a lot about when designing the UI of Eppo’s experiment results and reports for each statistical method. If non-statisticians struggle to look at and understand the results, it will be tricky to grow an experimentation culture – regardless of how optimal your method is.