Optimizing search ranking algorithms can drive millions of dollars in revenue. Here's how to A/B test them.
Aaron Maurer, Senior Engineering Manager at Slack, has devoted his career to building machine learning systems that make products more delightful and generate business impact. At Slack and Airbnb, his focus has been search-ranking experiments, which ensure that search functionality provides users with the most relevant results.
For marketplaces, e-commerce platforms, and collaboration tools, optimizing the search-ranking algorithm can drive millions of dollars in revenue. And according to Maurer, the only way to properly evaluate your model’s success is through A/B testing.
Search ranking is determined by a complex algorithm that takes various factors into account, including the relevance, authority, and popularity of results. At both Airbnb and Slack, Maurer worked on internal search for the core products.
At Airbnb, users try to find a listing by entering a geographic area and some dates, and the system picks out a set of results to display. Maurer’s search ranking team also supported Airbnb’s algorithmic recommendation system, and other elements adjacent to that experience.
Search-ranking teams are responsible for optimizing and improving a specific search experience within a product. The most lucrative optimizations tend to involve the search experience around shopping. “Amazon's a huge one,” Maurer says. “Any time where you're connecting users to inventory, you can very directly measure the impact of ranking in dollars, and it tends to get a lot of investment.” Reddit, Twitter, and Dropbox all have a search component, and all three have search-ranking teams.
According to Maurer, teams can fiddle with search results without getting into machine learning. “But if you’re investing significant time in your ranking, you start doing machine learning almost instantly. The main meat of ranking is optimizing the machine learning model.”
For a machine learning team, experimentation happens in the final stage of a rollout. The team has built a model, and evaluated it on historical data—but at the end of the day, the only way to know whether it works or not is to run an experiment.
According to Maurer, “In Machine Learning, no one person can evaluate a model. You can run a couple examples, but you can't capture anywhere near the full realm of the experience people will get from this model. Because if you have a million users, a million users will get slightly different results in different contexts. So evaluating it any way but running an experiment is insane. There's no way you can really have faith that you’ve even proved a thing until you run an experiment.”
Maurer reflects on a cautionary example of when his team didn't run an experiment on a change.
“There was a confluence of different product changes that the PM wanted to make at once. With machine learning, you'll have a bunch of metrics from training, and those metrics will give you optimism that this should be a good change. And the PM decided that given our optimism over the metrics, and given other experiments they wanted to run, they decided to just launch it. It turned into a huge boondoggle, and we launched a terrible thing. It took us a couple months to figure out how badly we screwed up.”
Maurer says that there’s a lesson here for all machine learning teams. “ML teams, if they have any sense, know they need to run experiments. I think you'll be hard pressed to find engineers more excited about experimentation than ML engineers.”
As Ronny Kohavi once put it, “During my tenure at Airbnb, we ran some 250 experiments in search relevance. Those small improvements added up to an overall 6% improvement in revenue.” You need to run many experiments for small gains to compound, and when Maurer worked at Airbnb, most of his work involved iterating on their experimentation process to make it move faster.
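The arithmetic behind that compounding is worth spelling out. The sketch below uses illustrative numbers only, and assumes the lifts compound multiplicatively; it shows how small the average per-experiment gain is when 250 experiments add up to a 6% improvement:

```python
# Illustrative back-of-the-envelope math: if 250 experiments compound
# multiplicatively to a 6% overall lift, the average per-experiment
# gain is a fraction of a tenth of a percent.
n_experiments = 250
total_lift = 1.06  # 6% overall improvement

avg_lift_per_experiment = total_lift ** (1 / n_experiments) - 1
print(f"average lift per experiment: {avg_lift_per_experiment:.4%}")  # ~0.0233%

# Sanity check: compounding the per-experiment lift recovers the total.
compounded = (1 + avg_lift_per_experiment) ** n_experiments - 1
print(f"compounded total lift: {compounded:.1%}")  # ~6.0%
```

At that scale, each individual win is far too small to eyeball, which is exactly why teams need well-powered experiments, and lots of them.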
It turned out that the team’s main limiting factor was how many experiments they could run at once. “There were lots of plausible things we could be testing,” Maurer says, “and we were always discussing what were the best bets to immediately put forward. Each incremental week of concluding experiments faster meant we could make more changes and improve the product.”
With search-ranking, Product and Engineering teams often have hundreds of different ideas for changes to the model, and all of the ideas seem important.
According to Maurer, “usually with search ranking, there's one model at the heart of it, and at best, each person gets one variation of the model. So you have this bottleneck where you can't try out a bunch of different things at once. You only have one model to work with. And so the quicker you can iterate through and conclude experiments on it, the more you can do.”
When you visit the Airbnb app and ask it to find a place to stay in Atlanta for a given set of dates, three phases occur:

1. Retrieval: the system filters the full inventory down to the candidate listings that match the query.
2. Ranking: a model scores those candidates and returns them in ranked order.
3. Rendering: the product presents the ranked results to the user.
The teams involved in search break down along those same lines. The retrieval phase usually maps to a search infrastructure team that runs a big search cluster that can do the filtering quickly.
Then you have a ranking team, generally some combination of machine learning, data science, and ML infra. This team owns the algorithm that turns a thousand listings into a ranked list and returns the top results. The ranking team might also re-order results after scoring, for example deciding that Airbnb shouldn't surface too many listings from the same host.
And then, the rendering and in-product presentation is usually owned by a standard front-end product engineering team.
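The three phases can be sketched as a minimal pipeline. Everything here is a toy stand-in; the `Listing` fields, the scoring function, and the `top_k` cutoff are assumptions for illustration, not Airbnb's actual system:

```python
from dataclasses import dataclass

@dataclass
class Listing:
    listing_id: int
    city: str
    score: float = 0.0  # filled in by the ranking model

def retrieve(inventory, city):
    """Phase 1 (search infra): filter the full inventory down to candidates."""
    return [listing for listing in inventory if listing.city == city]

def rank(candidates, model, top_k=20):
    """Phase 2 (ranking team): score each candidate and return the top results."""
    for listing in candidates:
        listing.score = model(listing)
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]

def render(ranked):
    """Phase 3 (product team): shape ranked results for display."""
    return [{"id": listing.listing_id, "score": round(listing.score, 3)}
            for listing in ranked]

# Toy scoring function standing in for the ML model the ranking team iterates on.
def toy_model(listing):
    return 1.0 / (1 + listing.listing_id)

inventory = [Listing(i, "Atlanta" if i % 2 == 0 else "Austin") for i in range(10)]
results = render(rank(retrieve(inventory, "Atlanta"), toy_model))
```

A ranking experiment swaps out only the `model` argument to `rank`; retrieval and rendering stay fixed.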
The search infrastructure team usually doesn't run experiments. The search product team often experiments on presentation, but that process is distinct from the ranking team's.
And then the ranking team iterates on the ranking algorithm. According to Maurer, “when a search ranking team runs an experiment, what they're saying is, we have this black box that takes in 10,000 results and returns the top 20 in ranked order. We're going to substitute the model currently in production for a couple other alternatives, and then we'll measure, are these alternatives better?”
In most product experimentation, a team will experiment on one component, then experiment on a second component. But a search ranking team is repeatedly running experiments on the same algorithm.
According to Maurer, “Airbnb’s team has probably run a hundred different models, where it’s literally just substituting in one little chunk over and over again. Each search-ranking experiment is basically a repeat in most ways of the last 10 search ranking experiments. The only difference is, the branches (or treatments) they’re doing are new models.”
A search-ranking team experiments on the same piece of the product over and over again, but they vary the treatments. “If I could travel into the future and know the next 100 models my team’s gonna experiment with,” Maurer says, “I could just run a 100-way experiment and it would get me basically the same result.”
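In practice, running several candidate models at once comes down to deterministic user bucketing: hash each user into one arm of the experiment, then serve that arm's model. The sketch below is a generic illustration; the experiment name, model registry, and scoring functions are all hypothetical, not any particular platform's API:

```python
import hashlib

def assign_variant(user_id, experiment_name, variants):
    """Deterministically bucket a user into one arm of the experiment."""
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Hypothetical model registry: the production model plus two challengers.
models = {
    "control": lambda listing: listing["base_score"],
    "model_v2": lambda listing: listing["base_score"] * 1.1,
    "model_v3": lambda listing: listing["base_score"] + listing.get("host_quality", 0.0),
}

def ranked_results(user_id, candidates):
    """Serve the candidates ranked by whichever model this user is bucketed into."""
    variant = assign_variant(user_id, "ranking-experiment", sorted(models))
    return variant, sorted(candidates, key=models[variant], reverse=True)
```

Because each user lands in exactly one arm, every user sees one consistent model, and bookings per user can then be compared across arms.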
Different members of the ranking team work on different projects that could improve the ranking model, so at any given time there are usually a few candidate models that might be ready for an experiment.
“When you’re running a ranking experiment,” Maurer says, “you have to make a judgment call about the best candidate models. Obviously, if you throw in absolutely everything, you're stuck waiting for that experiment to end, and you're bandwidth constrained, because you only have this one thing you're experimenting on over and over.
“So you need to make a tradeoff: maybe you experiment only on the best models, so that you’re not running too much at the same time. Or maybe you combine together a couple of the changes if you think they're really likely to work well, and risk not learning if they don't pay off.”
At Airbnb, the search-ranking teams paid attention to a single metric: bookings per user. Whatever model increased bookings the most was considered the winning option. “The second you measure a clear improvement in bookings from models,” Maurer says, “you should launch that. And if you have multiple models that are doing that, you pick the highest-performing one, and then go back and layer on the changes that seem promising.”
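A simplified version of that decision rule, using a two-sample z-test on bookings per user, might look like the sketch below. It relies on strong assumptions (a normal approximation, a single fixed look at the data, and no correction for comparing multiple treatments), so it is an illustration rather than production-grade tooling:

```python
from math import sqrt

def bookings_per_user(arm):
    """Each entry is one user's booking count over the experiment window."""
    return sum(arm) / len(arm)

def z_score(treatment, control):
    """Two-sample z-statistic for the difference in bookings per user."""
    def mean_and_var(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
        return mean, var
    mt, vt = mean_and_var(treatment)
    mc, vc = mean_and_var(control)
    return (mt - mc) / sqrt(vt / len(treatment) + vc / len(control))

def pick_winner(control, treatments, z_crit=1.96):
    """Launch the best-performing model only if it clearly beats control."""
    best_name, best_arm = max(treatments.items(),
                              key=lambda kv: bookings_per_user(kv[1]))
    return best_name if z_score(best_arm, control) > z_crit else "control"
```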
However, the decision-making process can always get more complicated.
Search-ranking teams are often bandwidth-constrained. They need to be able to quickly abandon experiments that don't seem positive, and quickly conclude positive-seeming experiments without putting their thumbs on the scale. “I worked on a lot of statistical tooling for this problem,” Maurer says. “In search ranking at a place like Airbnb, it was really important for team velocity.”
Sometimes, Maurer says, his teams used experimentation as a way to remove complexity. “Maybe we had made the model overly complex, and it had some features that we didn't think were doing anything. What we’d want to do is remove the complex features and run an experiment to make sure we weren't hurting performance.”
In that case, instead of trying to get a clear signal of improvement, the team just wanted to be able to put a lower bound on how much the model’s complexity was costing them. “We wanted to be able to say, ‘we ran this experiment long enough that the worst case scenario is taking a 0.25% haircut to bookings, and we're okay with that.’”
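That kind of "bound the downside" decision is essentially a non-inferiority test: rather than asking whether the simpler model wins, you ask whether the lower confidence bound on its lift stays above the haircut you're willing to accept. Here is a sketch under a standard normal approximation; the 0.25% margin comes from the example above, and everything else is an illustrative assumption:

```python
from math import sqrt

def lift_lower_bound(treatment, control, z=1.96):
    """One-sided lower confidence bound on relative lift in bookings per user."""
    def mean_and_var(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
        return mean, var
    mt, vt = mean_and_var(treatment)
    mc, vc = mean_and_var(control)
    se = sqrt(vt / len(treatment) + vc / len(control))
    return ((mt - mc) - z * se) / mc  # lift relative to the control mean

def safe_to_simplify(treatment, control, max_haircut=0.0025):
    """Ship the simpler model if the worst plausible case is within the haircut."""
    return lift_lower_bound(treatment, control) >= -max_haircut
```

The experiment can stop as soon as the bound clears the margin: the question is not "did we win?" but "how much could we plausibly be losing?"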
In conclusion, search-ranking experiments can be a complicated undertaking, but they're critical for validating algorithm updates and driving impact. There are myriad treatments to test, and small wins compound into big improvements in revenue and user experience, so experimentation velocity is key. If you’d like to learn more about how Eppo can help you run experiments on search-ranking algorithms or other ML models, contact us.