Machine learning models cannot be evaluated without A|B experiments. The same is true of machine learning teams.
Recent developments in machine learning have been jaw-dropping. Large language models can write conversational text that is plausibly human and code that reliably works, all from limited data via transfer learning (or even from a few records with one-shot GPT-3 prompting!). And the large cohort of new ML infrastructure startups all but guarantees that machine learning workflows are going to get much, much easier.
We’re excited to see the space progress, because history has taught us that successful machine learning teams end up driving the adoption of A|B experimentation.
The trend is not an accident: machine learning teams existentially need to run A|B experiments. Without them, CFOs cannot see the team’s ROI and won’t add investment (headcount, dollars). This need stems from a set of constraints unique to machine learning.
ML in particular needs A|B experiments because a new model often ships with no new UI, no obvious change, no perceptible change at all, and yet it can still have big aggregate effects that are impossible to measure without running it in production behind an experiment.
To illustrate: from 2013 to 2015, Airbnb’s search ranking team powered some of Airbnb’s biggest wins. But these wins were invisible to end users (the same is true of recommendation engines at Netflix and StitchFix, pricing at Uber, …), and they would have been invisible to management if the ranking team didn’t have the receipts: their A|B experiment results.
When I joined Airbnb, the original “search ranking model” was a column in the listing database table, populated by a hand-wavy heuristic. Every new listing got a “boost” to its score, and each booking added to that number. The ranking team would sometimes bump the score of a listing that “just felt like it needed to be higher,” such as the famous Mushroom Dome in Aptos (a favorite of management for how it typified Airbnb’s brand).
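To make the setup concrete, here is a minimal sketch of what such a heuristic might have looked like. The constants, field names, and the manual-adjustment mechanism are all invented for illustration; the story above only gives us the broad shape, not Airbnb’s actual values.

```python
# Hypothetical reconstruction of the pre-ML "model": a running tally
# stored in a database column. All values here are illustrative.

NEW_LISTING_BOOST = 1.0    # assumed: every new listing starts boosted
BOOKING_INCREMENT = 0.1    # assumed: each booking bumps the score

def ranking_score(listing: dict) -> float:
    """Compute the heuristic score for one listing."""
    score = NEW_LISTING_BOOST
    score += BOOKING_INCREMENT * listing["bookings"]
    # Occasional manual nudges for listings that "just felt like they
    # needed to be higher" (e.g., the Mushroom Dome).
    score += listing.get("manual_adjustment", 0.0)
    return score

listings = [
    {"name": "Mushroom Dome", "bookings": 300, "manual_adjustment": 5.0},
    {"name": "New loft", "bookings": 0},
]
# Search results were effectively just sorted by this column.
listings.sort(key=ranking_score, reverse=True)
```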
Eventually, we spun up a machine learning team, which replaced that heuristic with a learned ranking model. Even though this model demoted the Mushroom Dome somewhat, the company was excited to have a sophisticated approach to its core workflow: finding listings.
As time went on, the search ranking team delivered more and more wins.
But as the machine learning model evolved to be more sophisticated and customer-centric, the UX looked the same. You still searched for “Airbnbs in New York City from 10/12-10/14” and saw a set of listings pop up. The evolved model that understood location context, host preferences, and host quality looked no different from the original hand-wavy heuristic. For a company that prided itself on world-class design, these wins had no visual design to speak of.
But the good news is that the ranking team ran A|B experiments. They had proof that these models were driving larger booking increases than nearly every other product initiative. When the time came to make further investments, the search ranking team’s scope increased, its key members were promoted, and the company happily invested in machine learning engineers and infrastructure.
Management felt confident investing in machine learning because, while the product work was invisible, the effect on metrics was crystal clear from the A|B experiments.
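What does that crystal-clear readout boil down to? In its simplest form, a comparison of booking conversion between users served the new model and users served the old one. Here is a minimal sketch with invented counts, using the two-proportion z-test from statsmodels; real experiment analysis (and Eppo’s) involves much more, but this is the core shape of the evidence:

```python
# Hedged sketch of an A|B readout: new ranking model (treatment) vs.
# old heuristic (control). The counts below are invented.
from statsmodels.stats.proportion import proportions_ztest

bookings = [5_450, 5_200]      # conversions: treatment, control
visitors = [100_000, 100_000]  # users exposed to each variant

z_stat, p_value = proportions_ztest(count=bookings, nobs=visitors)
lift = bookings[0] / visitors[0] - bookings[1] / visitors[1]
print(f"absolute booking lift: {lift:.2%} (p = {p_value:.4f})")
```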
If you have ever been part of a machine learning team that does both “offline” evaluation (via cross-validation on historical datasets) and “online” evaluation (via A|B experiments), you know that models don’t win in production nearly as often as they do in development. The reasons are many, and the evidence of the gap is consistent.
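To ground the “offline” half, here is a minimal sketch of the kind of cross-validated evaluation a team might run before shipping anything, assuming scikit-learn and synthetic data standing in for historical booking logs:

```python
# Minimal sketch of "offline" evaluation: cross-validated AUC on
# historical data. Synthetic features stand in for real booking logs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(10_000, 20))                        # listing/query features
y = (X[:, 0] + rng.normal(size=10_000) > 1).astype(int)  # "booked" label

offline_auc = cross_val_score(
    LogisticRegression(max_iter=1_000), X, y, cv=5, scoring="roc_auc"
).mean()
print(f"offline AUC: {offline_auc:.3f}")

# However good this number looks, the online booking lift is a different
# quantity entirely; only a production A|B experiment can measure it.
```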
The ranking team at Airbnb compared offline vs. online analysis of 20 model iterations and found only a loose correlation, with the majority of models showing great offline performance and no effect in production.
A recent blog post by DoorDash shows how sophisticated companies can get with offline evaluation. And yet, the authors themselves acknowledge that simulations don’t work in most scenarios. And even when the simulation works, DoorDash still tests all new models with A|B experiments.
If all offline evaluation were to be believed, each machine learning team would be the highest performing group in the organization and all of their companies would be unicorns.
Machine learning teams use incredibly sophisticated algorithms. Company CFOs have a much simpler equation: will this investment yield a good enough financial return?
If you have ever been part of a budgetary planning process, you know that each team can be understood as “If I put XX money into this org, I expect YY money/outcomes/brand to come out”. For machine learning teams, the output is always a boost in metrics. Airbnb built a search ranking team to drive more bookings than a heuristic. Machine learning teams follow metric strategies, not shipping strategies.
Even as the models grow in complexity and capability (and the LLMs are truly amazing), organizations still need proof that investment in ML technology leads to business success. That proof is going to be an A|B experiment with GPT-3 as the treatment and the hand-wavy heuristic as the control. And with the way these models are trending, we at Eppo are looking forward to celebrating more experiment wins with our machine learning customers.