Statistics

June 17, 2021

Reducing Experiment Durations

The solution that improves everything

Che Sharma

Eppo's Founder and CEO, former early data scientist who built experimentation tools and cultures at Airbnb and Webflow

TL;DR:

Early-stage companies have the most pressing need to reduce experiment durations but often lack the advanced tooling available to companies like Facebook or Microsoft.
Implementing techniques like CUPED (which Eppo supports out of the box) can reduce experiment runtime by up to 20-30%.
Other methods, such as stratified sampling and interleaving, can also help lower experiment durations.
Choosing tractable metrics and reducing the incidence of bugs can further shorten experiment time and deliver ROI.

I think the most neglected topic in experimentation discourse is experiment duration and the levers you can pull to lower it. When I talk to companies that are at sub-Facebook volumes of traffic, so many of their problems are rooted in how long business metrics take to converge.

To illustrate, each of these challenges in experimentation is caused or exacerbated by long experiment durations:

My team's experiments keep getting interrupted by bugs. Long experiment durations mean more time when bugs can be introduced
My team's experiments keep getting affected by other teams' experiments. Long experiment durations mean more experiments overlapping at the same time
My team isn't building and learning at an ROI-generating pace. Longer durations mean fewer experiments, and thus fewer chances to build, validate, and learn

Wait, you can reduce experiment duration?

It's true! The most versatile method is to implement a technique called CUPED. CUPED is similar to the concept of lead scoring that you see in marketing, but applied to A/B experimentation. When we implemented CUPED at Airbnb, we were able to decrease experiment runtime by up to 20-30%.

It works like this: for each customer in the experiment, you make a guess about how likely they are to make a purchase. It turns out that experiments run faster if you measure Purchases - f(Guess) instead of Purchases, because subtracting the predictable part of each customer's behavior removes variance that has nothing to do with the treatment. An illustrative example is included as a footnote for those interested (1).
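To make the idea concrete, here is a minimal sketch of the adjustment on simulated data. It assumes the "guess" is simply each customer's pre-experiment purchase count and that f is linear, f(Guess) = theta * (Guess - mean(Guess)), with theta estimated from the data. The function name `cuped_adjust` and the simulated numbers are hypothetical, chosen only for illustration. In a real experiment theta would typically be estimated on both arms pooled together.

```python
import random
import statistics


def cuped_adjust(outcomes, guesses):
    """Return outcomes minus theta * (guess - mean guess).

    theta = cov(guess, outcome) / var(guess) is the slope that removes
    as much variance as a linear adjustment can.
    """
    mean_g = statistics.fmean(guesses)
    mean_y = statistics.fmean(outcomes)
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(guesses, outcomes))
    cov /= len(outcomes) - 1
    theta = cov / statistics.variance(guesses)
    return [y - theta * (g - mean_g) for g, y in zip(guesses, outcomes)]


# Hypothetical data: purchases before the experiment (the "guess")
# and purchases during the experiment, which depend on past behavior.
rng = random.Random(0)
pre_purchases = [rng.gauss(5, 2) for _ in range(5000)]
purchases = [0.8 * p + rng.gauss(1, 1) for p in pre_purchases]

adjusted = cuped_adjust(purchases, pre_purchases)
variance_ratio = statistics.variance(adjusted) / statistics.variance(purchases)
print(f"variance after adjustment / variance before: {variance_ratio:.2f}")
```

Centering the guess keeps the mean of the metric unchanged, so only the noise shrinks. Because the required sample size scales with the variance of the metric, a smaller ratio translates directly into a shorter experiment.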

The problem with CUPED is that only mature companies have the resources to implement it. CUPED has the same technical barriers as machine learning, complete with point-in-time data pipelines, offline simulation, and model calibration. The result is that the biggest and most valuable companies in the world receive an extra advantage of shorter experiment durations, while startups that desperately need every advantage they can get struggle to run experiments on low traffic.

Besides CUPED (and its cousin, quantile regression), there are other methods that help lower runtime:

You can change your random sampling or aggregations to explicitly balance customer types (stratified sampling)
Your choice of random seed can lead to lower durations
For search ranking, interleaving results reduces duration
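As one illustration of the stratified sampling idea in the list above, the sketch below shuffles customers within each customer type and alternates assignment, so both arms end up with nearly the same mix of types. The function name `stratified_assign`, the customer records, and the "new"/"returning" strata are all hypothetical. It also assumes the full customer list is known up front, which is often not true for live traffic.

```python
import random
from collections import Counter, defaultdict


def stratified_assign(customers, stratum_of, seed=0):
    """Assign customers to arms so each stratum is split as evenly as possible."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for customer in customers:
        by_stratum[stratum_of(customer)].append(customer)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)  # random order within the stratum
        for i, customer in enumerate(members):
            assignment[customer] = "treatment" if i % 2 == 0 else "control"
    return assignment


# Hypothetical customers: (id, customer type)
customers = [(i, "returning" if i % 3 == 0 else "new") for i in range(100)]
assignment = stratified_assign(customers, stratum_of=lambda c: c[1])

arm_counts = Counter((c[1], arm) for c, arm in assignment.items())
for key in sorted(arm_counts):
    print(key, arm_counts[key])
```

With purely random assignment, one arm can end up with more returning customers by chance, and that imbalance shows up as extra noise in the comparison. Balancing the arms by construction removes that source of variance.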

At Eppo, we believe that the value of experimentation at scale shouldn't be limited to the companies that can afford PhD Data Scientists and 20-person experimentation platform teams. Eppo provides CUPED out of the box to all our customers, along with a variety of other variance reduction techniques to shorten runtime.

Pick tractable metrics

Besides these advanced techniques, there's an easy way to lower experiment runtimes: your choice of metric. There are two ways to pick metrics that converge faster.

The first way is to reframe your core metrics to be yes/no instead of counts. Instead of counting "sums", count "uniques". For example, # subscription upgrades (where a customer might make 1, 2, 3, ... 100+ purchases) will make experiments run much longer than # customers who upgrade (where a customer either made a purchase or didn't).
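A rough way to see the difference is to compare the sample size each framing needs to detect the same relative lift, using the standard two-sample approximation n = 2 * (z_alpha/2 + z_power)^2 * variance / delta^2. The sketch below uses simulated, heavy-tailed upgrade counts. The data, the 5% lift, and the function name `required_n_per_arm` are assumptions for illustration only, and the comparison assumes the treatment moves both metrics by the same relative amount.

```python
import random
import statistics


def required_n_per_arm(values, relative_lift, alpha=0.05, power=0.8):
    """Approximate customers per arm needed to detect a relative lift in the mean."""
    normal = statistics.NormalDist()
    z_total = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    delta = relative_lift * statistics.fmean(values)  # absolute effect size
    return 2 * z_total ** 2 * statistics.variance(values) / delta ** 2


# Hypothetical data: about 10% of customers upgrade, and a few upgrade many times.
rng = random.Random(42)
upgrade_counts = [
    int(rng.paretovariate(2.0)) if rng.random() < 0.1 else 0
    for _ in range(20000)
]
upgraded = [1 if count > 0 else 0 for count in upgrade_counts]

n_counts = required_n_per_arm(upgrade_counts, relative_lift=0.05)
n_binary = required_n_per_arm(upgraded, relative_lift=0.05)
print(f"count metric:  ~{n_counts:,.0f} customers per arm")
print(f"yes/no metric: ~{n_binary:,.0f} customers per arm")
```

The handful of customers with very large counts inflate the variance of the count metric relative to its mean, which is what drives the larger sample size.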

The second way is to pick a different metric, one that is on the path to the outcome you want. The most famous example is Facebook's 7 friends in 10 days metric, which converges experiments more quickly than long-term retention. For companies whose north stars are too delayed to be statistically massaged into a reasonable timeframe, these metric "indicators" become a necessity.

Unfortunately, finding indicators again requires a specialized skill set. The process is written up in the Quora post, but it involves a.) creating a dataset with a bunch of candidate indicators, b.) running a kitchen-sink regression with every candidate, and c.) seeing which ones are most predictive. This process is tricky to execute, as it's easy to find some spurious pattern that doesn't hold up if you're not careful. But when you succeed, you have a metric that can shorten experiment time dramatically while still delivering ROI.
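The sketch below is a simplified stand-in for that process, not the kitchen-sink regression itself: it ranks each candidate by its correlation with the long-term outcome on one half of the data, then reports the same correlation on a held-out half as a guard against spurious patterns. The candidate names, the simulated data, and the function `rank_indicators` are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def rank_indicators(candidates, outcome, holdout_fraction=0.5):
    """Rank candidates by |correlation| on a training split, with a holdout check."""
    split = int(len(outcome) * (1 - holdout_fraction))
    rows = []
    for name, values in candidates.items():
        train_r = pearson(values[:split], outcome[:split])
        holdout_r = pearson(values[split:], outcome[split:])
        rows.append((name, train_r, holdout_r))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical data: retention depends on friends added, not on profile views.
rng = random.Random(1)
n = 4000
friends_added = [rng.randint(0, 15) for _ in range(n)]
profile_views = [rng.randint(0, 50) for _ in range(n)]
retained = [1 if rng.random() < 0.1 + 0.05 * f else 0 for f in friends_added]

ranking = rank_indicators(
    {"friends_added": friends_added, "profile_views": profile_views},
    retained,
)
for name, train_r, holdout_r in ranking:
    print(f"{name}: train r = {train_r:.2f}, holdout r = {holdout_r:.2f}")
```

A candidate whose correlation is strong on the training half but collapses on the holdout half is the kind of spurious pattern to discard. Correlation alone does not show that moving the indicator moves the outcome, so a surviving candidate still needs validation against past experiments.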

Both approaches have drawbacks. In an ideal world, you'd use the metric that best matches business goals, and a yes/no metric or an indicator is only a proxy for it. Indicators also require research time to run a bunch of regressions. But they present a path forward for the low-volume startup to adopt an experimentation strategy.

Make fewer mistakes

There's one last technique for lowering experiment runtime, which is not to have any bugs or mistakes that necessitate restarting the experiment. It's unfortunately all too commonplace for experiment assignment infra to have issues, or for crucial data to not be tracked, or for a bug to break the test on a specific browser. For all of the time poured into experiment execution, it still remains an incredibly brittle process.

Today's commercial experimentation tools do us no favors here. They lack the diagnostic and investigative capabilities to even notice if something has gone awry and just assume that some PM will constantly refresh experiment results to catch any mistakes.

While advanced statistics and metric choices are helpful, it's always good to remember that the shortest experiment is the one that executes cleanly.

Tooling for startup-speed

At Eppo, we recognize that purpose-built technology and powerful statistics shouldn't just belong to Facebook. It's actually the companies who are early in their experimentation maturity who most need this support.

Reducing experiment duration simultaneously alleviates a host of other problems. Whether through advanced statistical techniques, guided metric choices, or more robust experiment design processes, we want to help companies quickly get ROI from their experimentation practices.

Interested in hearing more about what Eppo's A/B testing tool can do for your practice? Email me at che@geteppo.com. We'd love to chat!

Footnotes

1. For example, if you're McDonald's, you can probably make some smart guesses on whether each customer will buy a Happy Meal. People with kids are more likely than teenagers. People arriving for breakfast will probably get an Egg McMuffin instead. Or even more simply, people who previously purchased Happy Meals will purchase more Happy Meals. With these guesses in hand, you can then calculate (# Happy Meals) - (X * guessed # Happy Meals).
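For readers who want the arithmetic behind the choice of X, here is a short derivation under standard assumptions. The symbols are introduced here for illustration: Y is the number of Happy Meals a customer buys, G is the guessed number, and ρ is the correlation between them.

```latex
\[
\tilde{Y} = Y - X \cdot G, \qquad
\operatorname{Var}(\tilde{Y}) = \operatorname{Var}(Y) - 2X\operatorname{Cov}(Y, G) + X^{2}\operatorname{Var}(G)
\]
\[
X^{*} = \frac{\operatorname{Cov}(Y, G)}{\operatorname{Var}(G)}, \qquad
\operatorname{Var}(\tilde{Y})\Big|_{X = X^{*}} = \operatorname{Var}(Y)\,\bigl(1 - \rho^{2}\bigr)
\]
```

The better the guess correlates with actual purchases, the more variance is removed. Since the guess is made from pre-experiment information, it has the same expected value in both arms, so subtracting it does not change the expected difference between treatment and control.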