I think the most neglected topic in experimentation discourse is experiment duration, and the levers you can pull to lower it. When I talk to companies that are at sub-Facebook volumes of traffic, so many of their problems are rooted in how long it takes business metrics to converge.
To illustrate, each of these challenges in experimentation is caused or exacerbated by long experiment durations:
- My team's experiments keep getting interrupted by bugs. Long experiment durations mean more time in which bugs can be introduced.
- My team's experiments keep getting affected by other teams' experiments. Long experiment durations mean more experiments overlapping at the same time.
- My team isn't building and learning at an ROI-generating pace. Longer durations mean fewer experiments, and thus fewer chances to build, validate, and learn.
Wait, you can reduce experiment duration?
It's true! The most versatile method is to implement a technique called CUPED. CUPED is similar to the concept of lead scoring that you see in marketing, but applied to AB experimentation. When we implemented CUPED at Airbnb, we were able to decrease experiment runtime by up to 20-30%.
It works like this: for each customer in the experiment, you make a guess at how likely they are to make a purchase. It turns out that experiments run faster if you measure Purchases - f(Guess) instead of Purchases, because subtracting the predictable part of each customer's behavior leaves a less noisy metric. An illustrative example is included as footnote 1 for those who are interested.
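To make "Purchases - f(Guess)" concrete, here is a minimal sketch in Python. It assumes the common linear form of CUPED, where f(Guess) is theta times the centered guess and theta is the regression slope of purchases on the guess. The function name `cuped_adjust` and the synthetic data are assumptions for illustration, not Airbnb's or Eppo's implementation.

```python
import random
from statistics import mean, pvariance


def cuped_adjust(outcomes, guesses):
    """Return outcomes with the part explained by the guess subtracted out."""
    y_bar, x_bar = mean(outcomes), mean(guesses)
    # theta is the least-squares slope of outcome on guess: cov(x, y) / var(x).
    cov_xy = mean([(x - x_bar) * (y - y_bar) for x, y in zip(guesses, outcomes)])
    theta = cov_xy / pvariance(guesses)
    # Centering the guess keeps the average of the metric unchanged.
    return [y - theta * (x - x_bar) for x, y in zip(guesses, outcomes)]


# Synthetic data (an assumption for illustration): purchases track the guess plus noise.
rng = random.Random(0)
guesses = [rng.gauss(2.0, 1.0) for _ in range(5000)]
purchases = [g + rng.gauss(0.0, 0.5) for g in guesses]

adjusted = cuped_adjust(purchases, guesses)
raw_variance = pvariance(purchases)
adjusted_variance = pvariance(adjusted)
print(f"variance before: {raw_variance:.3f}, after: {adjusted_variance:.3f}")
```

For a fixed effect size, the sample size an experiment needs scales with the metric's variance, so the ratio of the two variances is roughly the fraction of the original runtime you would still need. The better the guess, the bigger the saving.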
The problem with CUPED is that only mature companies have the resources to implement it. CUPED has the same technical barriers as machine learning, complete with point-in-time data pipelines, offline simulation, and model calibration. The result is that the biggest and most valuable companies in the world receive an extra advantage of shorter experiment durations, while startups, which desperately need every advantage they can get, struggle to run experiments on low traffic.
Besides CUPED (and its cousin, quantile regression), there are other methods that help lower runtime:
- You can change your random sampling or aggregations to explicitly balance customer types (stratified sampling)
- Your choice of random seed can lead to lower durations
- For search ranking, interleaving results reduces duration
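As a rough illustration of the first bullet, the sketch below balances customer types across arms by shuffling within each type and alternating assignments. The names `stratified_assign` and `customer_type`, and the "heavy"/"light" split, are hypothetical. A production system would more likely hash user IDs than shuffle a list in memory.

```python
import random
from collections import defaultdict


def stratified_assign(customers, stratum_of, seed=0):
    """Assign customers to arms so that every stratum is split evenly."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for customer in customers:
        by_stratum[stratum_of(customer)].append(customer)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)
        # Alternate arms down the shuffled list so each stratum splits ~50/50.
        for i, customer in enumerate(members):
            assignment[customer] = "treatment" if i % 2 == 0 else "control"
    return assignment


customers = [f"user_{i}" for i in range(1000)]
# Hypothetical customer type: every 10th user is a "heavy" buyer.
customer_type = {c: ("heavy" if i % 10 == 0 else "light") for i, c in enumerate(customers)}

assignment = stratified_assign(customers, customer_type.get)
heavy_in_treatment = sum(
    1 for c in customers if customer_type[c] == "heavy" and assignment[c] == "treatment"
)
heavy_in_control = sum(
    1 for c in customers if customer_type[c] == "heavy" and assignment[c] == "control"
)
print(heavy_in_treatment, heavy_in_control)
```

With purely random assignment, one arm can end up with more heavy buyers by chance, and that imbalance shows up as noise in the result. Forcing the split within each type removes that source of noise.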
At Eppo, we believe that the value of experimentation at scale shouldn't be limited to the companies that can afford PhD data scientists and 20-person experimentation platform teams. We will be providing CUPED out of the box to all our customers, along with a variety of other variance reduction techniques to shorten runtime.
Pick tractable metrics
Besides these advanced techniques, there's an easy way to lower experiment runtimes: your choice of metric. There are two ways to pick metrics that shorten experiment durations.
The first way is to reframe your core metrics to be yes/no instead of counts. Instead of counting "sums", count "uniques". For example, the number of subscription upgrades (where a single customer might account for 1, 2, 3, ... 100+ upgrades) will make experiments run much longer than the number of customers who upgrade (where each customer either upgraded or didn't), because a handful of heavy users dominate the variance of the count.
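To see the size of the gap, here is a small sketch that compares the approximate sample size needed to detect the same relative lift on a count metric and on its yes/no version. It uses the textbook two-sample z-test approximation. The simulated upgrade distribution, the 5% lift, and the function name `sample_size_per_arm` are all assumptions for illustration.

```python
import random
from statistics import NormalDist, mean, pvariance


def sample_size_per_arm(values, relative_lift, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sample z-test on the mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    delta = relative_lift * mean(values)  # absolute effect to detect
    return 2 * z ** 2 * pvariance(values) / delta ** 2


rng = random.Random(1)


def simulated_upgrades():
    """Hypothetical customer: 90% never upgrade, a few upgrade many times."""
    if rng.random() > 0.10:
        return 0
    if rng.random() < 0.05:
        return rng.randint(20, 100)  # rare heavy upgrader
    return rng.randint(1, 3)


upgrade_counts = [simulated_upgrades() for _ in range(20000)]
upgraded = [1 if count > 0 else 0 for count in upgrade_counts]

n_count = sample_size_per_arm(upgrade_counts, relative_lift=0.05)
n_binary = sample_size_per_arm(upgraded, relative_lift=0.05)
print(f"count metric: {n_count:,.0f} per arm, yes/no metric: {n_binary:,.0f} per arm")
```

The comparison assumes the treatment moves both metrics by the same relative amount, which will not always hold. If a change mostly affects how often existing upgraders upgrade, the yes/no metric can miss it.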
The second way is to pick a different metric, one that is on the path to the outcome you want. The most famous example of this is Facebook's 7 friends in 10 days metric, which converges experiments more quickly than long term retention. For companies whose north stars are too delayed to be statistically massaged into a reasonable timeframe, these metric "indicators" become a necessity.
Unfortunately, finding indicators again requires a specialized skillset. The process is written up in the Quora post, but it involves (a) creating a dataset with a bunch of candidate indicators, (b) running a kitchen-sink regression with every candidate, and (c) seeing which ones are most predictive. This process is tricky to execute, as it's easy, if you're not careful, to find some spurious pattern that doesn't hold up. But when you succeed, you have a metric that can shorten experiment time dramatically while still delivering ROI.
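The sketch below shows the shape of that process on synthetic data. To stay dependency-free it swaps the kitchen-sink regression for a simpler univariate screen: it ranks each candidate by its correlation with retention on one half of the data, then checks that the winner still holds up on the other half. The candidate names and the data-generating process are invented for illustration.

```python
import random
from statistics import mean


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Synthetic users (an assumption): only friends_added truly drives retention.
rng = random.Random(2)
rows = []
for _ in range(4000):
    friends_added = rng.randint(0, 15)
    retained = 1 if rng.random() < 0.1 + 0.05 * friends_added else 0
    rows.append({
        "friends_added": friends_added,
        "profile_photo": rng.randint(0, 1),  # unrelated to retention
        "sessions": rng.randint(0, 20),      # unrelated to retention
        "retained": retained,
    })

candidates = ["friends_added", "profile_photo", "sessions"]
train, holdout = rows[:2000], rows[2000:]


def score(split):
    """Correlation of every candidate with retention within one split."""
    outcome = [r["retained"] for r in split]
    return {c: correlation([r[c] for r in split], outcome) for c in candidates}


train_scores, holdout_scores = score(train), score(holdout)
best = max(candidates, key=lambda c: abs(train_scores[c]))
print(best, round(train_scores[best], 2), round(holdout_scores[best], 2))
```

The holdout check is the guard against spurious patterns: a candidate that only looks predictive on the data used to pick it is not a safe indicator. Correlation also does not show that moving the indicator moves the outcome, so a shortlisted indicator still deserves validation against past experiments.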
Both approaches have drawbacks. In an ideal world you'd use the metric that best matches business goals. Indicators also require research and time to run a bunch of regressions. But they present a path forward for the low-volume startup to adopt an experimentation strategy.
Make fewer mistakes
There's one last technique for lowering experiment runtime, which is to not have any bugs or mistakes that necessitate restarting the experiment. It's unfortunately all too commonplace for experiment assignment infra to have issues, or for crucial data to not be tracked, or for a bug to cripple the test on a specific browser. For all of the time poured into experiment execution, it still remains an incredibly brittle process.
Today's commercial experimentation tools do us no favors here. They lack the diagnostic and investigative capabilities to even notice if something has gone awry, and just assume that some PM will constantly refresh experiment results to catch any mistakes.
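One example of an automated diagnostic is a sample ratio mismatch check, which flags an experiment when the split between arms drifts from what the assignment was supposed to produce. The sketch below is an assumption about what such a check could look like, not a description of any particular tool. It runs a chi-squared test with one degree of freedom using only the standard library.

```python
import math


def sample_ratio_mismatch_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """P-value for the observed split under the intended assignment ratio."""
    total = control_n + treatment_n
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: a small imbalance is normal, a large one suggests a bug.
healthy_p = sample_ratio_mismatch_p_value(10_000, 10_100)
broken_p = sample_ratio_mismatch_p_value(10_000, 10_700)
print(f"healthy split p={healthy_p:.3f}, suspicious split p={broken_p:.2e}")
```

A very small p-value does not say what went wrong, only that the split is unlikely to be chance. Running a check like this daily means a broken assignment surfaces in the first days of a test instead of when someone finally reads the results.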
While advanced statistics and metric choices are helpful, it's always good to remember that the shortest experiment is the one that executes cleanly.
Tooling for startup-speed
At Eppo, we recognize that purpose-built technology and powerful statistics shouldn't just belong to Facebook. It's actually the companies that are early in their experimentation maturity that most need this support.
Reducing experiment duration simultaneously alleviates a host of other problems. Whether through advanced statistical techniques, guided metric choices, or more robust experiment design processes, we want to help companies quickly get ROI from their experimentation practices.
Interested in hearing more about what Eppo can do for your practice? Email me at email@example.com. We'd love to chat!
1. For example, if you're McDonald's, you can probably make some smart guesses on whether each customer will buy a Happy Meal. People with kids are more likely to than teenagers. People arriving for breakfast will probably get an Egg McMuffin instead. Or even more simply, people who previously purchased Happy Meals will purchase more Happy Meals. With these guesses in hand, you can then calculate (# Happy Meals) - (X * guessed # Happy Meals), where X is a scaling constant.