A/B Testing

August 24, 2023

Experimentation Metrics: Deciding What To Measure

Metrics are the vehicle that drives change in data-driven organizations.

Anne Kim

You can’t run experiments without having something to measure, but choosing the right metrics to assess the success of your experiment can be daunting. There’s also the added question of what to do when faced with competing metrics moving in different directions. In this blog post, we’ll walk you through the different types of metrics and some considerations when choosing the right measurement for your experiments.

Why Care About Metrics

Too often, companies focus on shipping features rather than improving metrics. However, shipping a feature doesn’t tell you whether the feature was used, whether the user experience improved, or whether revenue increased. By shifting the conversation from features shipped to metrics moved, teams are incentivized to drive business goals rather than just ship more features.

Metrics are used not only to improve how an organization works, but also to enable faster and better product development.

The Metrics Framework

Metrics are a fundamental building block of experimentation. We run experiments to identify a causal relationship between the thing we’re changing and the impact we want to have. Metrics are the vehicle that drives change in data-driven organizations.

While a company will state its vision and mission in qualitative terms, quantitative goals are how companies measure progress against the mission. These quantitative goals are aptly named Goal Metrics.

Goal Metrics are supported by additional faster-moving and more sensitive metrics, sometimes called Indirect or Driver Metrics. Apart from goal and indirect metrics, we also have Guardrail Metrics, which help protect us against unexpected consequences.

Consider Airbnb as an example to illustrate this framework and explore the need for each metric type.

Let’s start with Airbnb’s mission, which is to “create a world where anyone can belong anywhere.” While this mission is ambitious, it is difficult to measure, necessitating more quantitative metrics.

Airbnb has goals that are set at the executive level. These reflect how the company will measure its progress against its mission. In fact, we don’t have to go far to find these. Airbnb’s Q3 2022 Shareholder Letter outlines these metrics on page two.

Goal Metrics

Given that Airbnb is a publicly traded company, it’s no surprise that metrics such as Revenue, Net Income, and EBITDA are highlighted. Nights and Experiences Booked is a metric that aligns with the broader mission of creating a world where anyone can belong.

These are great examples of Goal Metrics. They are reported every quarter, leadership and executives keep a watchful eye on them, and they represent the company's fundamental objectives. Teams may also have their own version of Goal Metrics, depending on what the team owns.

In a perfect world, every experiment would use these metrics to determine success, and life would be easy. In the real world, however, these metrics are slow to change and hard to move through most initiatives. Metrics like Net Income and EBITDA might only be calculated once a month, given their complexity. Even metrics like Revenue, Nights Booked, Booking Value, and Average Daily Rate can be difficult to change in the short term.

To book on Airbnb, many things need to happen. First, you need to get to the Airbnb website or app. Maybe you opened the app directly on your phone, or on a rainy day in New York City you searched for ‘Sunniest Places to Work Remotely’.

Next, you’re presented with the Airbnb home page. You still need to search for a place, find a location that has availability, look at the price, not get offended at the cleaning fee, click Reserve, log in, consider if you want travel insurance, enter your credit card details, possibly request to book, hope the host accepts you despite a review that says you forgot to sweep at your last stay, and finally get a booking.

While our goal may be the final booking, there are many improvements we can potentially make to the entire funnel that drives the desired behavior.
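
As a rough illustration of why the funnel matters, the sketch below computes step-to-step conversion rates from per-step user counts. The step names and counts are hypothetical, invented for this example rather than taken from Airbnb; the point is only that each step has its own rate that a team could try to improve.

```python
# Hypothetical funnel: (step name, number of users who reached that step).
funnel = [
    ("visited_site", 10_000),
    ("searched", 6_000),
    ("viewed_listing", 3_000),
    ("clicked_reserve", 600),
    ("booked", 300),
]


def step_conversion(steps):
    """Return (label, rate) pairs for each consecutive pair of funnel steps."""
    rates = []
    for (prev_name, prev_count), (name, count) in zip(steps, steps[1:]):
        rates.append((f"{prev_name} -> {name}", count / prev_count))
    return rates


rates = step_conversion(funnel)
# Share of visitors who end up booking, across the whole funnel.
overall = funnel[-1][1] / funnel[0][1]

for label, rate in rates:
    print(f"{label}: {rate:.1%}")
print(f"overall: {overall:.1%}")
```

With these made-up counts, the largest drop-off sits between viewing a listing and clicking Reserve, which is the kind of observation that suggests where a more sensitive metric might live.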

Driver Metrics

Given the difficulty of measuring the impact of day-to-day deliverables on the goal metrics, we form hypotheses about what drives our goals. We know that to book a stay, people first need to visit our website, so it’s natural to think about a causal model that links the factors affecting top-of-funnel visits to our underlying goal of increasing bookings.

For example, we may use metrics such as the time it takes a page to load or the total volume of traffic to our site as indirect measures of our success. The benefit of these metrics is that they are more sensitive to changes, so they give us faster feedback and stronger statistical evidence about whether our experiments are successful.

For a more middle-of-the-funnel example, a team that is responsible for which properties show up by default might use measures such as whether a listing was shared, time spent on the listing page, or whether a property was added as a favorite, since those measures might be more timely and sensitive than bookings or revenue.

Another consideration is how feasible it is to measure a metric in an experiment. You may ultimately care about improving revenue, but revenue per user typically has high variance, which makes it hard to detect changes with statistical confidence. Instead, you could look at the number of users who have purchased something. This is, statistically speaking, a lot easier to measure, and there is a plausible link between increasing the number of users who purchase and improving revenue.
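
To make the variance point concrete, here is a minimal sketch comparing the approximate sample size needed to detect the same relative lift in a purchase-conversion metric and in revenue per user. It uses the standard normal-approximation formula for a two-sided, two-sample test with equal variance in both arms, and it models revenue per user as a purchase indicator times an independent order value. The purchase rate, the order-value coefficient of variation, and the target lift are all hypothetical numbers chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(cv_squared, relative_lift, alpha=0.05, power=0.8):
    """Approximate users per arm to detect a relative lift in a mean.

    cv_squared is the metric's variance divided by its squared mean.
    Normal approximation, two-sided test, equal variance in both arms.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * cv_squared / relative_lift ** 2)


# Hypothetical inputs, not measured values.
purchase_rate = 0.05   # share of users who purchase
order_value_cv = 2.0   # std dev / mean of order value among purchasers
target_lift = 0.05     # relative lift we want to be able to detect

# Binary "did the user purchase" metric: variance p(1 - p), mean p.
conversion_cv_sq = (1 - purchase_rate) / purchase_rate
# Revenue per user = purchase indicator * order value (assumed independent),
# which adds the order-value spread on top of the conversion spread.
revenue_cv_sq = conversion_cv_sq + order_value_cv ** 2 / purchase_rate

n_conversion = users_per_arm(conversion_cv_sq, target_lift)
n_revenue = users_per_arm(revenue_cv_sq, target_lift)

print(f"users per arm, conversion: {n_conversion:,}")
print(f"users per arm, revenue:    {n_revenue:,}")
print(f"ratio: {n_revenue / n_conversion:.1f}x")
```

Under these assumptions, revenue per user always needs at least as many users as conversion, because it inherits the conversion variance and adds the spread of order values on top.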

Guardrail Metrics

Finally, guardrail metrics are used to protect against unwanted side effects. Returning to the middle-of-funnel example, we might use page-load times as a simple guardrail against an overly complex recommendation engine. In an extreme example, a model could take several seconds to identify highly-personalized results, but in doing so, increase page load times substantially. Other guardrails might include the number of support tickets or the count of errors in a client application.

Choosing Metrics

Now that we’ve explained the three different types of metrics, how do you choose which ones to use?

Goal metrics will come from leadership. They should be simple, stable, and not change frequently. Using the Airbnb example, Nights Booked is a simple metric to understand. Though the metric might have been a complex choice initially, it has been a staple of Airbnb for years, and is easily understood. It connects directly with the mission of the company.

Driver metrics should be aligned with goal metrics, actionable, relevant, sensitive, and resistant to gaming. This often requires finding a balance, and different teams might focus on separate driver metrics that make sense for their respective surface area. And while a single improvement in one driver metric may not translate to an improvement in a goal metric, the combined impact over many experiments should.

Back to our example, page load time is aligned with our goal of increasing bookings; perhaps we’ve found research or proven experimentally that lower page load times increase bookings.

It is actionable in that teams are empowered to reduce the time it takes for a page to load through various optimizations. It is likely to be sensitive in that every user will have experienced page load and the variance is not nearly as high as, for example, revenue. It’s also hard to imagine how one would game page load time, at least not without someone noticing large portions of the page are now missing.

Guardrail metrics are used to protect the business from unintended consequences and perverse incentives. These might be tickets to customer support, latency requirements, or other measures of quality. In general, we do not expect these metrics to move, but we want to be alerted when performance on them degrades.
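
One simple way to turn "alert on degradation" into a check is sketched below: compute a confidence interval for the difference in a guardrail metric between treatment and control, and flag the experiment when the whole interval lies on the harmful side. This is an illustrative sketch, not a description of any particular platform’s method. The function name `guardrail_degraded` and the page-load samples are hypothetical, and the normal approximation is rough for samples this small.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def guardrail_degraded(control, treatment, higher_is_worse=True, alpha=0.05):
    """Return True if treatment is significantly worse than control.

    Builds a normal-approximation confidence interval for
    mean(treatment) - mean(control) and checks whether it lies
    entirely on the harmful side of zero.
    """
    diff = mean(treatment) - mean(control)
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = diff - z * se, diff + z * se
    return lower > 0 if higher_is_worse else upper < 0


# Hypothetical page-load times in seconds.
control_load = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]
slow_variant = [2.0, 2.2, 1.9, 2.1, 2.4, 1.8, 2.0, 2.3]

alert_slow = guardrail_degraded(control_load, slow_variant)
alert_same = guardrail_degraded(control_load, list(control_load))

print("slow variant triggers alert:", alert_slow)
print("identical variant triggers alert:", alert_same)
```

A stricter design would alert whenever a meaningful degradation cannot be ruled out, rather than only when degradation is statistically significant; which rule fits depends on how costly a missed regression is.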

Evaluating Metrics

It’s important to note that metric selection is never finished. You should always evaluate your usage of metrics for experiments over time. Business goals may change, you might identify perverse incentives and gaming of metrics, and new guardrails or objectives might be worth considering.

Regular review of metrics and their impact on your business is essential. Useful inputs include customer survey results, user experience research, customer calls, and observational analysis.

Wrapping Up

Metrics form a key part of your experimentation decision-making process. While there is a lot to consider, remember that experimentation is itself an experiment. There is no perfect technique, and it’s better to measure something than to measure nothing. So long as the focus remains on output rather than activity, your experimentation journey will continue to lead you to better results and a more refined process.
