Holdouts help measure the cumulative, long-term impact of an experimentation program. But getting value out of holdouts requires scale and maturity that not all organizations are ready for.
As an experimentation program develops and matures, every data person has had some version of the following experience. They look at the total reported impact of a group of experiments, and something doesn’t add up. The sum of impacts looks implausibly large, defying common sense or contradicting the observed trend (which somehow never grows as much as promised).
Savvy teams will know not to trust this “bottom-up” method for evaluating total impact. Instead, they will use a holdout: one cumulative experiment that incorporates all the winning variants, comparing users assigned all the winning variants to users “held back” on the original experience (all control variants, no changes). Lo and behold, this more accurate method shows that cumulative impact is lower – sometimes far lower – than the sum of the individually reported impacts.
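The assignment mechanics are the same as any A/B test. Here is a minimal sketch of deterministic holdout bucketing; the 5% share, the salt, and the function name are illustrative assumptions, not a prescribed implementation:

```python
import hashlib

HOLDOUT_PCT = 5  # illustrative: percent of users held back on the all-control experience

def holdout_assignment(user_id: str, salt: str = "holdout-2024") -> str:
    """Deterministically bucket a user into the holdout or the treatment group.

    Holdout users keep the original product (all control variants);
    everyone else gets every shipped winning variant.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "holdout" if bucket < HOLDOUT_PCT else "all-winners"
```

Hashing on a stable user identifier (rather than re-randomizing per session) is what keeps the two experiences cleanly separated over time.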
In this article, I’ll explain the prerequisites and considerations in using holdouts to obtain these accurate estimates of cumulative impact. First, let’s explore why simply adding up the sum of winning experiment impacts so often leads us to incorrect conclusions.
Many forces push towards a divergence between the cumulative impact and the bottom-up approach of summing individual effects (or, more accurately, multiplying together 1 + x, where x is the lift; this distinction is negligible when x is small). True effects change over time. Novelty effects mean that short-run experiment results often overestimate the long-run impact. When different experiments interact (rare, but not impossible), they tend to partially cancel out each other’s effects instead of enhancing them.
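To make the additive-versus-multiplicative distinction concrete, here is a toy calculation with made-up lifts:

```python
lifts = [0.02, 0.03, 0.015]  # hypothetical per-experiment relative lifts

additive = sum(lifts)  # naive "bottom-up" sum: 6.5%

compounded = 1.0
for x in lifts:
    compounded *= 1.0 + x  # multiply together (1 + x) for each experiment
compounded -= 1.0          # combined relative lift: ~6.64%
```

As the article notes, the gap between the two is negligible when the individual lifts are small.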
Winner’s curse can be the strongest of these forces and also the least intuitive. A/B tests give a valid estimate of the relative performance between A and B. Technically, the estimate is unbiased. But when restricted to winning variants – those that were actually shipped – this nice statistical property no longer holds. Some variants were actually good. Some just got lucky. We don’t know which is which, and so when we look across all winning variants, we tend to overestimate the impact.
Underpowered experiments are especially susceptible to the winner’s curse. When a test is run without the statistical power required to detect the true effect size, positive outcomes are disproportionately more likely to be due to luck. The only way to get a “statistically significant” result from an underpowered experiment is to get lucky, which biases results upwards (or whatever direction is good). The more underpowered it is, the more inflated the estimate will be.
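The inflation from shipping only “significant” results is easy to see in a Monte Carlo sketch. All parameters below are illustrative: each simulated experiment observes the true lift plus sampling noise, and we average only the estimates that clear the significance bar:

```python
import random
import statistics

def shipped_lift_estimate(true_lift=0.5, se=1.0, z_crit=1.96,
                          n_trials=20000, seed=0):
    """Monte Carlo sketch of winner's curse: each trial is one experiment
    whose observed lift is the true lift plus noise (the standard error
    reflects how underpowered the test is, in the same units as the lift).
    We 'ship' only trials whose z-statistic clears z_crit, then average
    the shipped estimates."""
    rng = random.Random(seed)
    shipped = [obs for obs in
               (true_lift + rng.gauss(0.0, se) for _ in range(n_trials))
               if obs / se > z_crit]
    return statistics.mean(shipped)
```

With a true lift of only half a standard error (a badly underpowered test), the average shipped estimate comes out several times larger than the truth: the significance filter keeps only the lucky draws.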
Beyond winner’s curse – an unavoidable outcome of re-using the same data to make a launch decision and to evaluate it – there is the ever-present specter of p-hacking. Global impact is neutral? Find a subpopulation where it’s positive. Experiment didn’t show significance when the pre-specified sample size was reached? Run it longer. Outliers look like they’re adding too much noise? Change the winsorization threshold ex post, or just throw the outliers out manually. All of these are no-nos; that doesn’t mean they aren’t done in practice. We need an approach to cumulative evaluation that’s robust and hard to game.
Holdouts offer a straightforward way out of all of these “curses”. They are easy to explain: no fancy statistics, just a regular old A/B test with specially constructed treatment and control groups. Whatever the source of the discrepancy between the cumulative impact and the sum of individual impacts, a well-run holdout measures the right thing directly.
The first use case for holdouts is to measure the aggregate impact of entire teams or organizations over time. Simply adding up the results of separate tests is a poor guide to understanding cumulative impact, even when experimentation best practices are followed. Holdouts give an unbiased – if noisy – measure of the true cumulative impact of a team’s work.
A/B tests help make tactical decisions: what should we ship? Holdouts guide organizational strategy: which teams are driving the largest gains and should be given more funding?
A second use case for holdouts is to understand the long-term effects of innovation. Product experimentation often dictates rapid decision-making over small, incremental changes. These short-term experiments can give inflated results due to novelty effects. In cases where this is a major concern, the prudent course of action is to run individual experiments for longer. But often novelty isn’t a major concern: the long-term effect may be a dampened version of the short-term effect, but it rarely changes direction, and making faster decisions is more important. In those cases, holdouts can be an appropriate tool to keep tabs on long-term impact, separate from the high-octane pace of regular experiments.
Finally, a commonly-cited use case for holdouts is to account for interactions between experiments. Holdouts do address this concern, but the fear of interactions alone is typically not a good reason to run a holdout. Two different experiments, each delivering a 2% gain to some metric, may once combined deliver a gain larger than 4% (a positive interaction) or smaller than 4% (a negative interaction). It truly can be the case that 2 + 2 = 5 (or 3…)! However, worries about experiment interactions are usually overblown. When interactions do exist, they are usually catastrophic (two mutually-incompatible features that break the product when combined) and known about in advance; mutual exclusion solves this. Evidence from Microsoft, among others, shows that unexpected experiment interactions are rare and tiny. Running a holdout can help you rest easy that any such interactions are accounted for. But when the gap between the cumulative effect and the sum of individual effects is large, interactions are rarely the culprit.
Not all experimentation programs have the scale and maturity to run holdouts. Even when feasible, running holdouts does not come for free: the potential for learning should always be compared against the additional overhead incurred.
The first hard requirement is sample size. If you’re struggling to get the necessary power to run regular A/B tests with a 50%/50% traffic allocation, then you’re in no place to run holdouts. It’s true that holdouts can tolerate higher minimum detectable effects – hopefully, the combined effect of many experiments is larger than the typical effect size from a single experiment. But you also don’t want to hold back too many users from the latest product improvements, nor do you want to slow down your experimentation speed because so much of your sample is reserved for the holdout. You need to run a power analysis to determine whether, say, holding back 5% of users will give you enough sample size. (A recommendation you may read elsewhere is to run 1% holdouts. Unless you have Facebook’s scale, such small holdouts are unlikely to be powered for the cumulative effects you hope to detect. You’re better off skipping holdouts entirely.)
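The power analysis itself is standard, with one twist: the unequal split. A rough sketch using the normal approximation for a conversion metric (all numbers below are illustrative, not recommendations):

```python
from statistics import NormalDist

def required_total_n(base_rate, mde_rel, holdout_share=0.05,
                     alpha=0.05, power=0.8):
    """Rough total-traffic requirement for a holdout of `holdout_share`
    to detect a relative lift of `mde_rel` on a conversion metric
    (two-sided z-test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = base_rate * mde_rel              # absolute detectable lift
    variance = base_rate * (1 - base_rate)   # Bernoulli variance
    s = holdout_share
    # Var(diff) = variance * (1/n_holdout + 1/n_rest), with n_holdout = s * N
    n_total = (z_alpha + z_power) ** 2 * variance * (1/s + 1/(1 - s)) / delta**2
    return int(n_total) + 1
```

Under these assumptions, detecting a 2% relative lift on a 5% conversion rate with a 5% holdout takes roughly 7.8 million users in total, versus about 1.5 million with a 50/50 split – the price of holding back only a sliver of traffic.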
Second, you need to be able to keep the product experience stable for the control group. This means keeping alive feature flags that would otherwise be fully rolled out and cleaned up. Although temporary, this tech debt creates developer friction and can become a source of additional dependencies and incompatibilities. Beyond code, there is the overhead of maintaining old features, keeping separate documentation, and training support staff on both experiences. All of this needs to be balanced against what can realistically be learned from holdouts.
Holdouts are not always applicable; even when they are, they are not a panacea for all experimentation ills.
Experiments that randomize users based on non-persistent identifiers are poor candidates for holdouts. For guests or logged-out users, where variant assignment relies on session IDs or cookies, it is nearly impossible to guarantee that a user is truly held back from all product changes. Cookies expire; users change devices. When this occurs, users will be exposed to a different mix of product changes. Far from a clean and stable comparison, a holdout analysis will have a tainted control group and give unreliable results. Holdouts should be reserved for experiments where targeting is stable over time.
Another way holdouts can give unreliable results is when the product experience itself cannot be kept stable. A common case is a machine learning model central to the experience whose performance degrades as its training data becomes stale. Think of a recommendation engine that is being continuously improved by a dedicated machine learning team: engineering new features, changing the model architecture, and so on. At the same time, the model is frequently retrained on the latest data. A naive approach to holdouts is to use a frozen version of the model as the control variant, while the treatment group gets the latest model – incorporating all of the team’s winning variants and trained on the latest data. This comparison is biased in favor of the treatment because the control model is artificially handicapped by its stale data. For a holdout to be valid, the training data has to be comparable between treatment and control.
Even when the holdout setup is itself reliable from the point of view of the experiment design and the product experience, it should not be the go-to solution for all concerns about A/B tests.
Holdouts will give surprising and unexpected results from time to time. (If they never did, why bother running them?) More often than not, they will show that the bottom-up approach overestimates cumulative impact. Occasionally, they will reveal that a whole series of supposedly successful experiments amounted to nothing, or even harmed key metrics.
Holdouts can serve as a useful diagnostic for the health of an experimentation program. When the cumulative impact measured through a holdout is 80% of the sum of individual impacts, it should be cause for celebration. (Don’t expect 100%!) Even following experimentation best practices, the “curses” discussed above will push towards overestimating impacts from the bottom-up approach. And don’t forget that holdouts are noisy like all experiments are: comparisons should be made across confidence intervals, not point estimates (which will never perfectly line up).
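Comparing across confidence intervals rather than point estimates amounts to a standard two-proportion interval; the traffic and conversion numbers below are made up for illustration:

```python
from math import sqrt
from statistics import NormalDist

def lift_ci(conv_treat, n_treat, conv_hold, n_hold, alpha=0.05):
    """Confidence interval for the absolute conversion-rate difference
    between the all-winners group and the holdout (normal approximation)."""
    p_t, p_h = conv_treat / n_treat, conv_hold / n_hold
    se = sqrt(p_t * (1 - p_t) / n_treat + p_h * (1 - p_h) / n_hold)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_t - p_h
    return diff - z * se, diff + z * se
```

Before declaring a discrepancy, check whether the bottom-up sum of lifts falls inside this interval; with small holdouts the interval is often wide enough to cover both explanations.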
When instead the holdout impact is 20% of the sum of impacts, something is likely amiss. Holdouts are not designed to answer “Which of these experiments didn’t deliver?”: by design, they entangle different interventions in order to measure their aggregate impact. Instead of looking for a culprit among the experiments included in the holdout, a more productive approach is to identify what parts of the experimentation process itself might be faulty. Are tests run without a power analysis? Are tests run only for a short time despite evidence of novelty effects (the lifts are stronger in the first few days before they “burn in”)? What fraction of tests are successful? (When the fraction is low, it may be evidence that the false discovery rate is high.) Are interventions restricted ex-post to segments where they performed well? All of these factors may be contributing to the gap between the bottom-up and holdout results.
The true value of holdouts comes out in the long run. They help build trust that experimentation best practices are being followed. They help identify which teams are driving the most value in a way that’s more robust to gamesmanship. Double down on those teams, and those best practices, and your experimentation program will flourish.