Generation-defining companies run thousands of experiments a month, thanks to well-defined processes and dedicated tooling. By democratizing experimentation, not only do these companies understand the value every project is generating, they also ensure that good ideas are actually evaluated. Given that only about a third of ideas lead to positive outcomes, the best way to maximize the number of game-changing releases is to increase the number of ideas tested.

But how do you get from 10 experiments to 1,000? There are a lot of things that can go wrong. This post shares some of my learnings from working with companies to level up their experiment velocity, along with practical tips to help you do the same. I'll cover not just process changes and opportunities for dedicated tooling, but also how they connect back to building a broad experimentation culture.

Specifically, I’ll go through three high-level stages of the experimentation lifecycle:

  1. Running a healthy experiment
  2. Acting on results quickly
  3. Ensuring that learnings are cumulative

Running a healthy experiment

Nothing kills experimentation culture faster than wasting time on a non-productive experiment. It’s painful to notice an issue days or weeks after launch, but it's even worse if a decision was already made and months pass before discovering the error. The most common obstacles that prevent companies from running healthy experiments are:

  • A bug in one variant leads to latency or tracking issues, causing sample ratio mismatch (SRM) and biased results
  • Experiments are underpowered and will take an unfeasibly long time to reach a decision

Tip 1: Perform proactive experiment diagnostics and alert when issues are detected

If you’re running hundreds or thousands of experiments, there will inevitably be situations where a bug goes live and causes an issue with one variant. The key here is to catch things early and fix them before too much time is wasted. Some teams attempt to solve this by building diagnostic dashboards, but I’ve found that proactive notifications - for instance in the team or project’s Slack channel - are much more effective (as much as everyone loves their morning dashboard routine).
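
To make "proactive diagnostics" concrete: the core of an SRM check is just a chi-square goodness-of-fit test on assignment counts, which can be wired directly into an alert. The sketch below assumes a 50/50 split and a hypothetical Slack webhook URL; real pipelines will differ in the plumbing.

```python
# Minimal SRM check with a Slack alert. The webhook URL and the 50/50 expected
# split are placeholders for illustration.
import requests
from scipy.stats import chisquare

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # hypothetical

def check_srm(control_count: int, treatment_count: int, alpha: float = 0.001) -> bool:
    """Alert and return True if assignment counts deviate from a 50/50 split."""
    total = control_count + treatment_count
    _, p_value = chisquare([control_count, treatment_count], f_exp=[total / 2, total / 2])
    if p_value < alpha:
        requests.post(SLACK_WEBHOOK_URL, json={
            "text": f"SRM detected: {control_count} vs {treatment_count} (p = {p_value:.2g})"
        })
        return True
    return False

# A ~1.6% imbalance on 100,000 users is enough to fire the alert.
check_srm(control_count=50_800, treatment_count=49_200)
```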

Tip 2: Democratize experiment planning, not just analysis

If you want everyone in the company to be able to run an experiment, you’ll need everyone to understand how long an experiment will take and a realistic effect size to target. Experiment planning is a nuanced problem that requires both statistical expertise and deep knowledge of your own data. Accordingly, planning an experiment typically requires involving a data scientist, which inevitably slows experiment velocity. A better option is to enable PMs and other experimentation practitioners to self-serve power analysis with a dedicated tool.
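
As a rough sketch of what a self-serve power calculation does under the hood, the snippet below uses the standard two-proportion formula for a conversion metric. The baseline rate, target lift, and daily traffic are placeholder numbers.

```python
# Back-of-the-envelope power analysis for a conversion-rate metric.
# Baseline rate, relative MDE, and traffic numbers are placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def required_sample_size(baseline_rate: float, mde_relative: float,
                         alpha: float = 0.05, power: float = 0.8) -> float:
    """Users needed per variant to detect a relative lift of `mde_relative`."""
    effect = proportion_effectsize(baseline_rate, baseline_rate * (1 + mde_relative))
    return NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                        power=power, alternative="two-sided")

# 5% baseline conversion, targeting a 3% relative lift,
# with roughly 20,000 eligible users entering the experiment per day.
n = required_sample_size(baseline_rate=0.05, mde_relative=0.03)
print(f"~{n:,.0f} users per variant, ~{2 * n / 20_000:.0f} days of traffic")
```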

Acting on results quickly

You can push out a ton of healthy tests, but it means nothing if you don’t have a process in place to ensure that results are actionable. Time wasted between data collection and taking action can cause substantial delays in an experimentation calendar. As the number of experiments increases, it becomes more and more important to establish a process for decision-making.

Tip 3: Define experiment end criteria during planning

No experiment should be launched without clear end criteria defined. If you’re using a sequential statistical framework, this could be as simple as saying “we will run the experiment until we either see a statistically significant impact or we have collected enough data to rule out the specified minimum detectable effect.” By referring back to these criteria - or better yet, providing guidance to experimentation practitioners via automated progress reporting - you can ensure every experiment has clear exit criteria.
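
Encoded as a check that runs alongside automated progress reporting, the end criteria might look like the sketch below. It assumes a sequential framework that reports an always-valid confidence interval on relative lift; the field names and decision wording are illustrative.

```python
# Illustrative end-criteria check for a sequentially monitored experiment.
# Inputs: an always-valid confidence interval on relative lift, plus the MDE
# agreed on during planning. Names and thresholds are assumptions.
def experiment_decision(ci_lower: float, ci_upper: float, mde: float) -> str:
    if ci_lower > 0:
        return "ship: statistically significant positive impact"
    if ci_upper < 0:
        return "stop: statistically significant negative impact"
    if ci_upper - ci_lower < mde:
        return "stop: interval is tighter than the MDE, any remaining effect is too small to chase"
    return "keep running: not enough data yet"

# Interval still spans zero and is wider than the 3% MDE -> keep running.
print(experiment_decision(ci_lower=-0.01, ci_upper=0.03, mde=0.03))
```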

Tip 4: Choose a statistics methodology that allows for quick decision making

Real-world datasets are often prone to outliers and high variance, leading to long time horizons to reach statistical significance. Given this, it's important for experimentation tools to apply variance reduction methods. The two most common approaches are controlling for pre-period differences across users (commonly referred to as “CUPED”) and careful treatment of outliers via methods such as winsorization.
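
Both techniques are only a few lines of code once the data is in hand: CUPED subtracts the portion of the metric explained by each user's pre-experiment value, and winsorization clips extreme values at a chosen percentile. The percentile below is a placeholder.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: remove the variance explained by the pre-experiment covariate x_pre."""
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

def winsorize(y: np.ndarray, upper_pct: float = 99.5) -> np.ndarray:
    """Clip extreme values at an upper percentile (placeholder threshold)."""
    return np.minimum(y, np.percentile(y, upper_pct))
```

The adjusted metric keeps the same interpretation of the treatment effect but has lower variance, so the treatment-vs-control comparison reaches significance sooner.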

In addition, using modern statistical methods like sequential testing can lower the time to decision. With classical statistics (e.g., t-tests), a decision cannot be made until a pre-determined sample size is reached. This pre-determined sample size is typically computed from a minimum detectable effect (MDE) and can lead to rather conservative experiment run times. Modern sequential methods allow practitioners to make decisions at any point, meaning that an experiment originally scheduled for several weeks can be called much earlier if the results are positive.
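
There are several flavors of sequential testing; one common construction is a mixture sequential probability ratio test (mSPRT) monitored over a stream of per-unit treatment-minus-control differences. The sketch below follows the textbook normal-mixture form and assumes the variance and mixture parameter are supplied; production implementations differ in the details.

```python
import numpy as np

def msprt_rejects(diffs: np.ndarray, sigma2: float, tau2: float = 1.0,
                  alpha: float = 0.05) -> bool:
    """Mixture SPRT for H0: mean difference = 0, checked after every observation.

    diffs: per-unit treatment-minus-control differences (an assumption about how
    the data is organized); sigma2: their variance; tau2: the mixture variance.
    """
    n = np.arange(1, len(diffs) + 1)
    s = np.cumsum(diffs)
    # Log of the always-valid likelihood ratio under a N(0, tau2) mixture on the effect.
    log_lr = 0.5 * np.log(sigma2 / (sigma2 + n * tau2)) \
             + (s ** 2) * tau2 / (2 * sigma2 * (sigma2 + n * tau2))
    # Stopping the first time the ratio exceeds 1/alpha keeps the false positive
    # rate below alpha no matter how often you peek.
    return bool(np.any(log_lr >= np.log(1 / alpha)))
```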

Tip 5: Enable experiment practitioners to self-serve explorations

Perhaps the largest bottleneck to rapid decision-making is slow analytics cycles. Experimentation practitioners need to be able to define their metrics of interest and perform their own exploratory analyses. Every company that runs thousands of experiments has solved two problems: drag-and-drop metric selection per experiment (with an emphasis on business metrics) and self-serve slice-and-dice analysis to understand the impact across key customer dimensions.
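
The slice-and-dice half of that is, at its core, a grouped comparison over a flat table of exposures joined to a metric. The pandas sketch below uses placeholder column names standing in for whatever your warehouse exposes.

```python
import pandas as pd

def lift_by_dimension(exposures: pd.DataFrame, dimension: str,
                      metric: str = "converted") -> pd.DataFrame:
    """Relative lift of treatment over control within each segment of `dimension`.

    Assumed columns: `variant` ("control"/"treatment"), the chosen dimension
    (e.g. "platform" or "country"), and a numeric metric column.
    """
    means = exposures.pivot_table(index=dimension, columns="variant",
                                  values=metric, aggfunc="mean")
    means["relative_lift"] = means["treatment"] / means["control"] - 1
    return means

# e.g. lift_by_dimension(exposures, dimension="platform")
```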

Tip 6: Ensure guardrails are in place

One final friction point for taking timely action is the concern of unforeseen negative effects. The common example is an experiment that increases top-of-funnel metrics but accomplishes this by getting more low-intent users into the funnel. The impact on bottom-of-funnel metrics is then much less impressive (or even negative).

To account for this, top experimentation teams implement a system of guardrails to ensure that, for all experiments, any potential impact on down-funnel or adjacent business metrics is explicitly checked for. By establishing an org-wide agreement on what these guardrails are, companies are able to de-risk rollout decisions and speed up time-to-action.
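
In code form, a guardrail check can be as small as flagging any agreed-upon metric whose confidence interval admits a regression worse than the tolerated amount. The metric names and thresholds below are purely illustrative.

```python
# Illustrative guardrail check; metric names and tolerated regressions stand in
# for whatever your organization agrees on.
GUARDRAILS = {
    "checkout_conversion": -0.01,   # tolerate at most a 1% relative drop
    "seven_day_retention": -0.005,  # tolerate at most a 0.5% relative drop
}

def failing_guardrails(results: dict[str, tuple[float, float]]) -> list[str]:
    """results maps metric name -> (ci_lower, ci_upper) on relative lift."""
    return [
        metric for metric, (ci_lower, _) in results.items()
        if metric in GUARDRAILS and ci_lower < GUARDRAILS[metric]
    ]

# Any non-empty result blocks (or at least escalates) the rollout decision.
```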

Ensuring that learnings are cumulative

Once teams are enabled to run healthy experiments and act on results, the final problem to solve is how to organize all of the learnings. In a company with 20 teams running experiments, there is an immense opportunity for cross-pollination of ideas. Unfortunately, most of the time these learnings live in a silo, or are at best briefly presented at an all-hands meeting. The final piece of the puzzle is thus ensuring that the learnings from one team’s experiments are readily disseminated around the company.

Tip 7: Centralize experiment analysis

Implementing an experiment varies a lot by use case: product teams may prefer to put new features behind flags or control parameters with remote config, marketers may prefer no-code visual editors to run experiments, and machine learning teams may use a custom Python job to implement traffic splits. All of these experiments, however, likely tie back to a relatively small set of business metrics, presumably tracked in a data warehouse.

Building a centralized reporting framework to measure the impact on these business metrics naturally gets all of product, marketing, machine learning, and any other team running experiments to not only speak the same language, but also understand the ideas their peers are testing and what learnings those tests are producing.
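
One way to picture this: as long as every implementation path logs assignments to the same place, a single analysis job can join them to the shared metric definitions. The table and column names below are placeholders for whatever lives in your warehouse.

```python
import pandas as pd

def experiment_scorecard(assignments: pd.DataFrame, metrics: pd.DataFrame) -> pd.DataFrame:
    """Mean of each shared business metric per experiment and variant.

    assignments: one row per (experiment, user_id, variant), regardless of whether
    the split came from a feature flag, a visual editor, or a custom ML job.
    metrics: one row per (user_id, metric, value), defined once for the company.
    """
    joined = assignments.merge(metrics, on="user_id", how="left")
    return (joined
            .groupby(["experiment", "variant", "metric"])["value"]
            .mean()
            .unstack("metric"))
```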

Tip 8: Use reporting to tell the full story, not just the metrics

A centralized reporting framework also serves as an amazing opportunity for context sharing. By adding the hypothesis, supporting analysis, explorations, takeaways, and screenshots, low-context peers can quickly understand not just what an experiment was testing, but why the experiment was run. This experiment repository naturally lends itself to tags, search, and meta-analysis, and enables teams to go back in time and get the full context of an experiment, beyond just the impact on specific metrics.
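
As a sketch of the kind of record such a repository might store (field names invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """Illustrative shape of a repository entry; every field here is an assumption."""
    name: str
    hypothesis: str
    decision: str                      # e.g. "shipped", "rolled back", "iterating"
    takeaways: str
    tags: list[str] = field(default_factory=list)
    screenshots: list[str] = field(default_factory=list)   # URLs or file paths
    metric_lifts: dict[str, float] = field(default_factory=dict)

def by_tag(repo: list[ExperimentRecord], tag: str) -> list[ExperimentRecord]:
    """Pull every past experiment carrying a tag, e.g. for a meta-analysis."""
    return [record for record in repo if tag in record.tags]
```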

Done correctly, a repository of past experiments solves the final part of growing experimentation: creating new ideas to test. Once every team member has the ability to share in the learning from all of the experiments your company is running, the number of potentially game-changing ideas to test will only increase.
