
If you have 10,000+ daily active users, A/B experiments are a table-stakes process for building competitive products. Your customer base is too large and diverse to simply talk to every customer, meaning you're flying blind whenever you launch without a randomized experiment.

For B2C and PLG companies, which often have 10,000+ DAUs, the question is not whether you should run experiments, but how. And the biggest question is build vs. buy: build a bespoke platform from the ground up, or buy a commercial option off the shelf (like Eppo's experimentation platform)?

Our team at Eppo has some unique insight into the "build vs. buy" debate: 44% of us helped build internal experimentation platforms at a diverse array of companies, including Airbnb, Stitch Fix, Groupon, LinkedIn, and Angi. We've seen these internal builds succeed, and we've seen them struggle to get off the ground.

In this post, I'll share several questions you can ask (and answer) to help identify the common factors behind the success or failure of an internal experimentation platform and determine the right approach for your company. Broadly speaking, they fall into two categories:

1) How consequential is it to get A/B testing right?
2) Does your organization have the right expertise to build?


How consequential is it to get A/B testing right?

There are three ways for experimentation programs to fail, each one potentially staining a team's reputation:

  1. [Adoption] No one runs experiments
  2. [Rigor] The experiments lead to bad decisions and wrong learnings
  3. [Speed] Experiments run so slowly that the team has no impact and looks like a poorly run operation

For product growth and AI/ML teams, the risks of failing are existential: growth loops and AI model iterations have no visibility inside the company except the metrics they lift. On the other hand, a marketing team doing a few copy tests will probably find impact in other workstreams. Any team - product, AI, or marketing - will suffer from slow speed, though.

The Eppo team has collectively built ~10 in-house experimentation platforms. In our experience, an internal, undocumented SDK paired with static Tableau reports can feel "automated," but actually reaching a decision turns into weeks of engineering investigation and analytics cycles.

If you buy, the software is already built, documented, and likely reviewed online for adoption, rigor, and speed. Online materials and free trials can validate integrations and usability. Enterprise-oriented SaaS products like Eppo also come with a built-in support model to augment your team's capabilities.

Of course, an internal build can achieve adoption, rigor, and speed. Just look at Airbnb and Netflix, whose systems power thousands of experiments across diverse teams. But their success was certainly built on a foundation of committed headcount.

Does your organization have the right expertise to build?

For teams building in-house, the equivalent of asking vendors for SaaS pricing is estimating the required headcount for an internal build. Look at the staffing behind any in-house platform that achieves adoption, rigor, and speed: you'll find a headcount of 5+, and often 25+ if there are multiple business units, all specialized talent like engineers and data scientists.

Airbnb commits 50 engineers and data scientists to experimentation because adoption, rigor, and speed each demand dedicated expertise:

Rigor: making hygienic decisions and avoiding pitfalls

Running successful experiments requires avoiding statistical theater - results that look rigorous but are misleading, swayed by random noise. The right team will understand each of experimentation's failure patterns: improper randomization, broken statistical assumptions, low statistical power, data pipeline failures, wrong metric definitions, and more.

An in-house platform designed for only the "happy path" will, at best, bog the team down with manual diagnostic checks. At worst, it can lead to completely wrong decisions.
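
To make this concrete, here is a minimal sketch of one such diagnostic: a sample ratio mismatch (SRM) check, which flags improper randomization when observed traffic deviates from the planned split. The counts and threshold below are illustrative, not a prescription.

```python
# Minimal sketch of a sample ratio mismatch (SRM) check - one of the
# diagnostics a rigorous platform runs automatically on every experiment.
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """Flag experiments whose traffic split deviates from the planned ratios."""
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    _, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    # A tiny p-value means this split is very unlikely under correct randomization.
    return p_value < alpha, p_value

# A 50/50 test that actually delivered 50,000 vs. 48,500 users:
flagged, p = srm_check([50_000, 48_500], [0.5, 0.5])
print(f"SRM detected: {flagged} (p = {p:.2e})")  # flags the broken split
```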

Speed: building for scale and computational efficiency

Even when experiments avoid pitfalls and provide accurate results, experiment analysis quickly becomes one of the thornier data engineering tasks.

This is because experiment analysis involves a combinatoric explosion of results to calculate. All metric events need to be windowed to the experiment time period, a Cartesian-style join between assignments and events that incurs significant cost for every experiment. On top of that, experiment teams routinely want to look at many metrics and slice the results by many segmentations. The resulting (# assignments) × (# metric events) × (# segmentations) grid will tax your analytics infrastructure: even 20 experiments × 30 metrics × 10 segments means 6,000 result cells to refresh, each touching the raw events table.
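
The shape of that computation is easier to see in code. Below is a minimal sketch using toy pandas DataFrames; real pipelines run the same query shape inside a warehouse at far larger scale, and all table and column names here are illustrative.

```python
# Minimal sketch of the core experiment-analysis join on toy pandas frames.
import pandas as pd

assignments = pd.DataFrame({   # one row per (user, experiment) exposure
    "user_id": [1, 2, 3],
    "variant": ["control", "treatment", "treatment"],
    "assigned_at": pd.to_datetime(["2024-05-01", "2024-05-01", "2024-05-02"]),
    "segment": ["ios", "android", "ios"],
})
events = pd.DataFrame({        # one row per metric event
    "user_id": [1, 1, 2, 3],
    "metric": ["purchase", "purchase", "purchase", "signup"],
    "value": [20.0, 15.0, 30.0, 1.0],
    "occurred_at": pd.to_datetime(
        ["2024-04-30", "2024-05-03", "2024-05-02", "2024-05-03"]),
})

# Join every metric event to the user's assignment, then window: only events
# that occurred after exposure count toward the experiment.
joined = assignments.merge(events, on="user_id")
windowed = joined[joined["occurred_at"] >= joined["assigned_at"]]

# The result grid is variants x metrics x segments - the multiplicative
# blow-up described above, recomputed on every refresh.
results = (windowed
           .groupby(["variant", "metric", "segment"])["value"]
           .agg(["sum", "count"]))
print(results)
```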

Experimentation is also a dynamic analytics process. New metrics are added all the time, driven by the experimenting team's curiosity as they try to find the root cause of unexpected results, as they launch new product lines, or as entirely new product teams emerge. An experimentation platform needs to both compute the combinatoric explosion and quickly add and calculate novel metrics every week.
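
One common way to keep novel metrics cheap to add is to treat metric definitions as data rather than code, so a new metric is a config entry instead of a new pipeline. Here is a minimal sketch of that idea, reusing the windowed frame from the previous snippet; the metric names and schema are hypothetical.

```python
# Minimal sketch of declarative metric definitions: adding a metric is a
# config change, not a new pipeline. All names here are illustrative.
import pandas as pd

METRICS = {
    "revenue_per_user": {"event": "purchase", "agg": "sum"},
    "signups":          {"event": "signup",   "agg": "count"},
}

def compute_metric(windowed: pd.DataFrame, name: str) -> pd.Series:
    """Aggregate one declared metric per variant from a windowed events frame."""
    spec = METRICS[name]
    rows = windowed[windowed["metric"] == spec["event"]]
    return rows.groupby("variant")["value"].agg(spec["agg"])

# Next week's brand-new metric is one more dictionary entry:
METRICS["support_tickets"] = {"event": "ticket_opened", "agg": "count"}
```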

Adoption: can engineers set up experiments trivially easily? Can product managers report on experiments without statisticians?

It's extremely rare for in-house experiment infrastructure to secure design resources and front-end engineers. And like any product without design resources, there's a constant risk that no one will be able to operate even simple workflows on their own.
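
For a sense of the bar "trivially easy" sets for engineers, here is a minimal sketch of the deterministic, hash-based bucketing that assignment SDKs typically perform under the hood; the experiment name and variants are hypothetical, and a real SDK would layer targeting, logging, and rollout controls on top.

```python
# Minimal sketch of deterministic hash-based assignment - the one-line call
# an SDK must reduce experiment setup to for engineers.
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Stable bucketing: the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Usage at the decision point - no extra bookkeeping for the engineer:
variant = assign_variant("user-123", "new-onboarding-flow")
```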

In the case of experimentation, we're asking engineers and PMs to put their necks on the line. Engineers are being asked to prove that they aren't tanking metrics by affecting the production environment. PMs have to evangelize wins and findings in front of their peers, and are often blamed for operational hiccups that aren't their fault.

The lack of design resources and the degree of consequences in experimentation lead many in-house approaches to lean heavily on data teams to provide trustworthy results. This becomes problematic when data team headcount is scarce and statistical inference expertise is even rarer.

Are you committed to maintaining this expertise on staff at all times?

A common failure pattern with in-house tools comes when their architects and maintainers leave the company. This is especially pertinent in today's macroeconomic environment, where high interest rates and scarce capital have led to constant staffing reshuffles (whether by companies slimming down or employees voting with their feet).

Exacerbating this issue, in-house experiment tools are often built when a surplus of specialized talent happens to be available. Companies that choose the in-house build route must also commit to backfilling that talent when the original builders inevitably move on.



So, there you have a few straightforward questions to frame your "build vs. buy" decision. For most readers, carefully considering the answers will make clear why building an in-house experimentation platform may not be the best option. Whichever route you take, a thorough gut check like this is an essential test of your confidence. The good news is that commercial platforms like Eppo now offer the world-class capabilities that were once exclusive to the best in-house tools.
