If you’ve been looking around at the state of the art in online experimentation, you’ve probably come across a technique called “CUPED”, whether advertised as a feature by an A/B testing tool or described in research published by companies like Microsoft, Netflix, or DoorDash. It’s a deservedly popular topic, given its promise: to reduce experiment runtime, in some cases letting experiments conclude up to 65% faster.

One of the most common complaints in experimentation programs is how long it takes to run experiments. The bulk of that time isn’t active work from data teams planning or analyzing experiments; it’s simply waiting for a sufficient sample size to be collected. Unless you work at a company with FAANG-scale user traffic, an experiment with standard statistical parameters likely takes over a month to collect sufficient data. Several months, sometimes.


The necessary sample size for an experiment (at a given power and significance level) really boils down to two variables: the minimum effect size you care about detecting, and the variance in your measured outcome. Before Eppo became the first commercial experimentation platform to offer CUPED, if you wanted to run experiments faster, your options were pretty much limited to increasing that Minimum Detectable Effect, which raises your false-negative rate for any real effect smaller than that threshold.
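To make that relationship concrete, here is a minimal sketch using the textbook sample-size approximation for a two-sided, two-sample z-test with equal group sizes. The function name and the example numbers are illustrative assumptions, not values from any real experiment or from Eppo's platform.

```python
from statistics import NormalDist


def sample_size_per_group(mde, variance, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided, two-sample z-test.

    mde      -- minimum detectable effect, in the metric's own (absolute) units
    variance -- per-user variance of the metric, assumed equal in both groups
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.8
    return 2 * (z_alpha + z_power) ** 2 * variance / mde ** 2


# Hypothetical metric: per-user variance of 100, and we care about a 0.5 lift.
baseline = sample_size_per_group(mde=0.5, variance=100.0)
bigger_mde = sample_size_per_group(mde=1.0, variance=100.0)    # double the MDE
less_variance = sample_size_per_group(mde=0.5, variance=50.0)  # halve the variance

print(round(baseline), round(bigger_mde), round(less_variance))
```

Doubling the MDE cuts the required sample to a quarter, while halving the variance cuts it in half without changing which effects you can detect.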

But what if you had a magic wand that would make experiments run faster? With no tradeoffs, at all? That’s what CUPED promises, by reducing that other variable - variance.

If you’re looking into utilizing CUPED, it’s important to understand what it is, exactly, and where it can successfully be applied (hint: not everywhere). In this article, we’ll walk through what exactly CUPED is, why it can be so challenging to implement (both for in-house teams and commercial vendors), and what’s so special about Eppo’s “CUPED++” implementation.

What is CUPED?

In 2013, a team at Microsoft led by Alex Deng published a paper introducing “Controlled-experiment Using Pre-Experiment Data”, or CUPED for short: a new method that could speed up experiments with no tradeoff required. With this method, Microsoft could bend time, making experiments that typically took 8 weeks take only 5-6 weeks. Since that paper, the method has gone mainstream.

CUPED, at its core, is a variance reduction technique. It leverages historical data about your users to reduce noise in the observations made in your experiment. In other words, if we know some pre-experiment data about user behavior for a certain metric, we can use that to decrease our uncertainty about the estimated means of said metric in each experiment variation.
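In symbols, the single-covariate form of this adjustment is usually written as follows. This is a sketch of the standard formulation, where Y is the metric measured during the experiment, X is the same metric measured before the experiment, and ρ is the correlation between them (the notation is ours, not taken from any particular implementation):

```latex
\hat{Y}_{\text{cuped}} = \bar{Y} - \theta\,(\bar{X} - \mathbb{E}[X]),
\qquad
\theta = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)},
\qquad
\operatorname{var}\big(\hat{Y}_{\text{cuped}}\big) = \operatorname{var}(\bar{Y})\,(1 - \rho^{2})
```

Because assignment is random, the expected value of X is the same in every variation, so the adjustment cancels out of the treatment-versus-control difference on average and leaves the effect estimate unbiased. If we assume runtime scales in proportion to variance at constant traffic, shortening an 8-week experiment to 5-6 weeks corresponds to a correlation of roughly 0.5 to 0.6 between pre-experiment and in-experiment values.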

Suppose that you are McDonald’s and want to run an experiment to see if you can increase the number of Happy Meals sold by including a menu in Spanish.

  • A standard approach would be to divide locations into two groups and assign the Spanish menu to one group. The final analysis would compare which group of stores sells more Happy Meals. This is a normal experiment; it will measure impact accurately but takes a long time.
  • A slightly more sophisticated experiment approach would use the same locations but could correct for pre-experiment differences in Happy Meal sales across treatment and control stores. Mathematically, this takes less time by controlling for natural variation between stores. This is what an approach like CUPED accomplishes.
  • A full CUPED approach doesn’t just use the previous year’s sales as a control; it uses all known pre-experiment information. For example, controlling for the following factors might speed up the experiment even more: (1) the number of families (who are more likely to order Happy Meals) that go to each store; (2) the time of day when customers go to each location, as morning diners are more likely to order breakfast than a Happy Meal; (3) whether a store was part of a limited launch of the McRib, which might draw attention away from Happy Meals.

Each successive method reduces time-to-significance by reducing noise in the experiment data. Just like noise-canceling headphones, CUPED can take out ambient effects to help the experimenter detect impact more clearly.
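The second approach above can be sketched on synthetic data. Everything here is made up for illustration: the store counts, the sales distribution, the true lift of 2.0, and the helper names `simulate_store` and `cuped` are all assumptions, and the code only shows the basic single-covariate adjustment.

```python
import random
from statistics import fmean, variance

random.seed(7)


def simulate_store(treated):
    """One hypothetical store: pre-period sales predict in-experiment sales."""
    pre = random.gauss(100, 20)                   # pre-experiment weekly sales
    lift = 2.0 if treated else 0.0                # true treatment effect (made up)
    post = 0.9 * pre + lift + random.gauss(0, 8)  # in-experiment weekly sales
    return pre, post


control = [simulate_store(False) for _ in range(2000)]
treatment = [simulate_store(True) for _ in range(2000)]

# Estimate theta = cov(X, Y) / var(X) on the pooled data, so both groups
# receive the same adjustment.
pooled = control + treatment
xs = [pre for pre, _ in pooled]
ys = [post for _, post in pooled]
x_bar, y_bar = fmean(xs), fmean(ys)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in pooled) / (len(pooled) - 1)
theta = cov_xy / variance(xs)


def cuped(group):
    """Subtract the part of each outcome explained by pre-experiment sales."""
    return [post - theta * (pre - x_bar) for pre, post in group]


raw_effect = fmean([p for _, p in treatment]) - fmean([p for _, p in control])
cuped_effect = fmean(cuped(treatment)) - fmean(cuped(control))

raw_var = variance([post for _, post in control])
cuped_var = variance(cuped(control))

print(f"raw effect   {raw_effect:.2f}, per-store variance {raw_var:.1f}")
print(f"CUPED effect {cuped_effect:.2f}, per-store variance {cuped_var:.1f}")
```

Both estimates target the same true lift, but the adjusted outcomes vary far less from store to store, which is what lets the experiment reach significance with fewer observations.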

In visual terms, suppose that the treatment-versus-control experiment data looked something like this before applying CUPED:

image

After applying CUPED, the uncertainty around the averages decreases, like this:

image

The net effect of sharpening these measurements is that it takes less time to figure out whether an experiment is having a positive impact or not. CUPED is bending time with math – alleviating the largest pain point in experimentation today.


Why does experiment speed matter?

In business terms, a long-running experiment is the same as a long-delayed decision. Organizations that can learn from experiments and react quickly enjoy a tactical advantage over their competitors.

Besides being a decisional dead weight, long-running experiments have other technical and cultural ramifications: think of the technical debt incurred by keeping the two code paths running for several weeks or months. Think about whether anyone at the company will be excited to run an experiment if the result is slower than a container ship crossing the Pacific Ocean. Think of all the small wins and little ideas that will go unmeasured and unimplemented because experiments are just too slow.

The scarcest resource in experimentation today isn't tooling or even technical talent. The scarcest resource is time.

Experimentation speed is about creating a feedback loop so that good ideas lead to even better ideas and misguided ideas lead to rapid learning. The faster experiments finish, the tighter that feedback loop gets, creating a compound interest effect on good ideas.

Why can it be challenging to implement CUPED?

CUPED can be challenging to implement both because it has a limited scope of potential applications, and because it’s computationally expensive.

When it comes to determining if you have a potential use case in the first place, remember that the key to CUPED lies in that “pre-experiment” piece. If you are experimenting on users (or other experimental units) that don’t already interact with you, or don’t interact with you in a way that might predict future behavior, there isn’t going to be the requisite pre-experiment data. This means that traditional CUPED implementations are no help when testing things like onboarding flows or new users. (This is the gap that Eppo’s CUPED++ approach addresses; more on that in the next section.)

That data also needs to be accessible to your experiment tool, which is why historically commercial software vendors were unable to offer CUPED as a feature. Eppo inherently solves for this as the only experimentation platform that’s 100% data-warehouse native, which is why we were the first platform to offer CUPED as well. But if you use a tool that requires you to send events to it, the likelihood of it having reliable pre-experiment data stored and accessible is low.

CUPED is also computationally expensive. More data needs to be ingested - once a user is assigned to an experiment, a pipeline must fetch that user’s historical data from some reasonable recent window of time. But more importantly, the CUPED-adjusted means themselves involve linear regressions that take longer to execute.

We ran into this ourselves in our CUPED beta at Eppo: as customers tried to apply CUPED to large data sets, we started hitting dreaded OOMKilled (out of memory) errors. To build a scalable solution, we developed a new approach to the computation in pure SQL, described in depth by Eppo Statistics Engineer Evan Miller in a 2023 QCon conference talk.
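We won't reproduce Eppo's SQL approach here, but one general reason an aggregate-only computation can work is that the single-covariate CUPED coefficient depends only on a handful of running sums. The sketch below is an assumption-laden illustration of that idea, not Eppo's implementation; the `RunningSums` class and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningSums:
    """Aggregates a SQL query could produce with COUNT(*) and SUM(...)."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0

    def add(self, x, y):
        self.n += 1
        self.sum_x += x
        self.sum_y += y
        self.sum_xx += x * x
        self.sum_xy += x * y

    def theta(self):
        """cov(X, Y) / var(X), computed from the sums alone."""
        cov_xy = self.sum_xy - self.sum_x * self.sum_y / self.n
        var_x = self.sum_xx - self.sum_x ** 2 / self.n
        return cov_xy / var_x


# Hypothetical (pre-experiment value, in-experiment value) pairs.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

sums = RunningSums()
for x, y in rows:  # one pass; memory use does not grow with the row count
    sums.add(x, y)

theta = sums.theta()
print(f"theta = {theta:.3f}")
```

Since no per-user rows need to be held in memory at once, this style of computation avoids the kind of out-of-memory failure described above, although sums of squares can lose floating-point precision on very large values.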

CUPED vs. CUPED++

If you lack the specific pre-experiment data required to leverage CUPED, what else might be available to you? Inspired by a deep dive on one of the mathematical foundations of CUPED (dating all the way back to 1933), the Eppo statistics engineering team noticed a missed opportunity for many teams. In most implementations, CUPED means reducing the variance of a metric using pre-experiment data on only that same metric, based on the covariance between the two, which is equivalent to running a simple regression. However, it’s also possible (as the original paper discusses) to include a full vector of other experiment metrics, or all treatment assignments (i.e., all experiments a user has been bucketed into). Doing so can reduce variance even further, and can still reduce variance in cases where pre-experiment data on the metric itself is lacking.

Here’s what Eppo’s CUPED++ makes possible:

  • For some metrics, such as conversion or retention metrics, there is no clear pre-experiment equivalent. In our implementation, we can still leverage historical data from the other experiment metrics to improve estimates of these conversion and retention metrics relative to a standard CUPED approach.
  • The standard CUPED approach does not help for experiments where no pre-experiment data exists (e.g. experiments on new users, such as onboarding flows). Because we also use assignment properties as covariates in the regression adjustment model, we can reduce variance, and therefore shrink confidence intervals, for these experiments as well.
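The general idea of adjusting with several covariates at once can be sketched as a small regression adjustment. This is not Eppo's CUPED++ implementation: the two covariates (a different metric's history and a property known at assignment time), the coefficients, and the helper names are all hypothetical, and the 2x2 system is solved by hand to keep the example dependency-free.

```python
import random
from statistics import fmean, variance

random.seed(11)


def simulate_user():
    """A hypothetical user with two covariates but no history of the outcome."""
    other_metric = random.gauss(5, 2)          # history of a different metric
    on_mobile = float(random.random() < 0.5)   # property known at assignment
    outcome = 1.5 * other_metric + 3.0 * on_mobile + random.gauss(0, 2)
    return other_metric, on_mobile, outcome


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    a_bar, b_bar = fmean(a), fmean(b)
    return sum((u - a_bar) * (v - b_bar) for u, v in zip(a, b)) / (len(a) - 1)


users = [simulate_user() for _ in range(5000)]
x1 = [u[0] for u in users]
x2 = [u[1] for u in users]
y = [u[2] for u in users]

# Solve the 2x2 system S * theta = c by Cramer's rule, where S is the
# covariance matrix of the covariates and c holds their covariances with y.
s11, s12, s22 = cov(x1, x1), cov(x1, x2), cov(x2, x2)
c1, c2 = cov(x1, y), cov(x2, y)
det = s11 * s22 - s12 * s12
theta1 = (c1 * s22 - c2 * s12) / det
theta2 = (c2 * s11 - c1 * s12) / det

x1_bar, x2_bar = fmean(x1), fmean(x2)

# Adjustment with one covariate only, for comparison.
theta_single = c1 / s11
adjusted_single = [yi - theta_single * (a - x1_bar) for a, yi in zip(x1, y)]

# Adjustment with both covariates.
adjusted_both = [
    yi - theta1 * (a - x1_bar) - theta2 * (b - x2_bar)
    for a, b, yi in zip(x1, x2, y)
]

var_raw = variance(y)
var_single = variance(adjusted_single)
var_both = variance(adjusted_both)
print(f"raw {var_raw:.2f}, one covariate {var_single:.2f}, both {var_both:.2f}")
```

In this toy setup each added covariate removes another slice of unexplained variance, even though the outcome metric itself has no pre-experiment history.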

You can read more about it in a white paper on Eppo’s statistics engine from MIT’s Conference on Digital Experimentation.

---

Although it’s already a decade old, CUPED certainly represents one of the most exciting innovations in how we statistically analyze digital experiments. For most of that decade, it was an approach available only to giant tech companies with large experimentation platform teams, given the difficulty of implementation and the roadblocks preventing legacy commercial tools from offering it. With Eppo’s first-in-class warehouse-native experimentation platform, and the development of CUPED++, we’ve made variance reduction available to more companies, and more use cases, than ever before.
