
Recent developments in machine learning have been jaw-dropping. Large language models can write conversational text that is plausibly human and code that reliably works, all on limited data via transfer learning (or even a few records with one-shot GPT-3 prompting!). And the large cohort of new ML infrastructure startups essentially guarantees that machine learning workflows are going to get much, much easier.

We’re excited to see the space progress, because history has taught us that successful machine learning teams end up driving A|B experimentation adoption.

  • Airbnb’s first A|B experiments were run by the search ranking team, to see if new ranking models outperformed old ones.
  • Netflix has run A|B experiments on its recommendation engine since its earliest days of mailing DVDs, measuring each model iteration’s effect on customer retention.
  • Facebook’s A|B experimentation program began with automated advertisement auctions, newsfeed ranking, and growth.

The trend is not an accident: machine learning teams existentially need to run A|B experiments. Without them, CFOs cannot see the team’s ROI and won’t add investment (in headcount or dollars), due to a unique set of constraints on machine learning work.

Machine learning ROI is invisible without A|B experiments

Hundreds of search ranking model changes all looked exactly like this: just reordered listings.

ML in particular needs A|B experiments because there is often no new UI and no perceptible change, and yet there can still be big aggregate effects that are impossible to measure without running the new model in production behind an experiment.

To illustrate: from 2013 to 2015, Airbnb’s search ranking team powered some of Airbnb’s biggest wins. But these wins were invisible to end users (the same is true of recommendation engines at Netflix and Stitch Fix, pricing at Uber, and so on). And they would have been invisible to management if the ranking team hadn’t had the receipts: their A|B experiment results.

When I joined Airbnb, the original “search ranking model” was a column in the Airbnb listings table populated by a hand-wavy heuristic. All new listings got a “boost” in the score, and getting booked added to that number. The ranking team would sometimes bump the score of a listing that “just felt like it needed to be higher,” such as the famous Mushroom Dome in Aptos (a management favorite for how it typified Airbnb’s brand).

Eventually, we spun up a machine learning team, which replaced that heuristic with a machine-learning-driven ranking model. Even though this model demoted the Mushroom Dome somewhat, the company was excited to have a sophisticated approach to the core workflow of finding listings.

As time went on, the search ranking team delivered more and more wins:

  • A location context model changed “location desirability” from the center pixel of an Airbnb map to specific desirable neighborhoods. This meant that searchers for New York City were guided to Manhattan and Brooklyn instead of Astoria and Jackson Heights.
  • A host scoring model started penalizing hosts who chronically rejected guests, drastically lowering the horrible experience of searching, booking, and finding out you have to search again after being denied.
  • A host preference model noticed when hosts seemed to want to avoid gaps in their calendars, or to prefer long (or short) stays, and surfaced hosts whose preferences matched the searcher’s trip dates. This again lowered the rate of being rejected from your stay.

This map shows up if you search “New York, NY” on Airbnb. The center of the map, in Queens, is not what Airbnb first recommends.

But as the machine learning model evolved to be more sophisticated and customer-centric, the UX looked the same. You still searched for “Airbnbs in New York City from 10/12-10/14” and saw a set of listings pop up. The evolved model that knew about location context, host preferences, and host quality looked identical to the original hand-wavy heuristic. For a company that prided itself on world-class design, these wins had no visual design to speak of.

But the good news is that the ranking team ran A|B experiments. They had proof that these models were driving larger booking increases than nearly every other product initiative. When the time came to make further investments, the search ranking team’s scope increased, its key members were promoted, and the company happily invested in machine learning engineers and infrastructure.

Management felt confident to invest in machine learning because while the product work was invisible, the effect on metrics was crystal clear from the A|B experiments.
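Reading an experiment like this usually comes down to a significance check on a conversion metric. Below is a minimal, hypothetical sketch (made-up visitor and booking counts, not Airbnb’s data or Eppo’s actual methodology) of a two-sided two-proportion z-test on booking rate:

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (relative lift, two-sided p-value) for treatment B vs control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via erf
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b / p_a - 1, p_value

# Control: old heuristic ranking; treatment: new ML model (illustrative counts).
lift, p = two_proportion_z(conv_a=4_100, n_a=100_000, conv_b=4_400, n_b=100_000)
print(f"relative lift: {lift:+.1%}, p-value: {p:.4f}")
```

With these invented numbers, a ~7% relative booking lift on 100k visitors per arm comes out clearly significant, which is the kind of “receipt” a ranking team can bring to a planning meeting.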

Offline evaluation is not enough

No matter how extensive a team’s offline simulation and evaluation of machine learning models, A|B experiments are the only way to measure real-world impact.

If you have ever been part of a machine learning team that does both “offline” evaluation (via cross-validation on historical datasets) and “online” evaluation (via A|B experiments), you know that models don’t win in production as often as they do in development. Here is a sample of reasons why (far from a comprehensive list):

  • The model is optimized against proxy ML metrics instead of business metrics
  • The training dataset is missing a segment of users (e.g. mobile users)
  • The data pipelines serving production scoring differ from the pipelines that built the training dataset
  • The features used to predict are not point-in-time correct, and unknowingly leak information from after the moment of prediction
  • The model-training process overfit the dataset and doesn’t generalize
  • The machine learning model used a loss function that doesn’t match core business metrics
  • Users change their behavior in response to the model (“adversarial machine learning”), as with a pricing change or a fraud detection algorithm
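To make one of these failure modes concrete, here is a minimal, hypothetical sketch (pure Python, invented data) of the information-leakage bullet above: a feature that accidentally encodes the outcome makes offline evaluation look perfect, but that feature doesn’t exist yet at serving time, so the same “model” collapses in production.

```python
import random

random.seed(0)

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical outcomes: did each of 10,000 searchers book? (~30% book)
labels = [random.random() < 0.3 for _ in range(10_000)]

# Offline: the training table accidentally includes a post-booking signal
# (a "leak"), so a trivial model that reads it scores perfectly.
leaked_feature = list(labels)
offline_preds = leaked_feature
print(f"offline accuracy: {accuracy(offline_preds, labels):.2f}")  # 1.00

# Online: the leaked signal hasn't happened yet at prediction time, so the
# model falls back to a default and degrades to majority-class guessing.
online_preds = [False] * len(labels)
print(f"online accuracy:  {accuracy(online_preds, labels):.2f}")
```

Offline cross-validation happily reports a perfect model; only running it live, behind an A|B experiment, reveals that the “signal” never existed at decision time.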

The ranking team at Airbnb compared offline vs. online analysis of 20 model iterations and found only a loose correlation, with the majority of models showing great offline performance and no effect in production.

A recent blog post by DoorDash shows how sophisticated companies can get with offline evaluation. And yet, the authors themselves acknowledge that simulations don’t work in most scenarios. And even when the simulation works, DoorDash still tests all new models with A|B experiments.

If all offline evaluations were to be believed, every machine learning team would be the highest-performing group in its company, and every one of those companies would be a unicorn.

CFOs need to see business impact to invest

Machine learning teams use incredibly sophisticated algorithms. Company CFOs have a much simpler equation: will this investment yield a good enough financial return?

If you have ever been part of a budgetary planning process, you know that each team can be understood as “If I put XX money into this org, I expect YY money/outcomes/brand to come out”. For machine learning teams, the output is always a boost in metrics. Airbnb built a search ranking team to drive more bookings than a heuristic. Machine learning teams follow metric strategies, not shipping strategies.

Even as the models grow in complexity and capability (and the LLMs are truly amazing), organizations still need proof that investment in ML technology leads to business success. It’s going to be an A|B experiment with GPT-3 as the treatment and the hand-wavy heuristic as the control. And with the way these models are trending, we at Eppo are looking forward to celebrating more experiment wins with our machine learning customers.
