This is the story of how Airflow got Eppo from zero to one, and why we recently ditched Airflow for an in-house solution built with NodeJs, BullMQ, and Postgres.
Eppo is a warehouse-native experimentation platform, which means we run a lot of queries against customer warehouses. We compute experiment results nightly to make sure customers have meaningful feedback on their most important business decisions in a timely manner. For our largest customers, this can mean thousands of running experiments equating to thousands of warehouse queries, all launched from scheduled jobs jostling to be run at once.
The gold standard for scheduled jobs in the field is Airflow, and it was the obvious choice for Eppo when we were getting off the ground. At the start of our journey, we chose Cloud Composer as a hosted option that spun up Airflow environments with ease. From there, we began adding DAGs for our various customer workflows.
There are some core jobs that Eppo needs to run on a regular basis when a customer successfully configures experiments in our system:
Originally, each of these jobs lived in its own scheduled DAG, with a couple of unscheduled versions that could be invoked via Airflow’s API when we needed to launch jobs from the Eppo UI:
That setup had two main issues: all customer jobs ran at the same time, and the DAGs got extremely messy with multiple tasks per customer. After a while we moved to customer-specific DAGs, which let us schedule each customer at a different time and gave us a single view of a customer’s nightly experiment pipeline run:
The source of record for experiment and company configurations was our NodeJs application; Airflow only held the DAG structure for jobs.
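For reference, kicking off one of those on-demand runs from our NodeJs app was just a call to Airflow’s REST API, along the lines of the sketch below (this uses Airflow 2’s stable API with a made-up DAG ID and basic auth; a Cloud Composer environment may authenticate differently, so treat the details as illustrative):

```typescript
// Illustrative only: trigger an on-demand DAG run through Airflow's stable
// REST API (POST /api/v1/dags/{dagId}/dagRuns). Uses Node 18+ global fetch.
async function triggerManualRefresh(companyId: number, experimentId: number): Promise<void> {
  const baseUrl = process.env.AIRFLOW_URL; // Airflow webserver URL
  const auth = Buffer.from(
    `${process.env.AIRFLOW_USER}:${process.env.AIRFLOW_PASSWORD}`,
  ).toString('base64');

  const res = await fetch(`${baseUrl}/api/v1/dags/manual_experiment_refresh/dagRuns`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Basic ${auth}`,
    },
    // `conf` is handed to the DAG run so its tasks know what to compute.
    body: JSON.stringify({ conf: { companyId, experimentId } }),
  });

  if (!res.ok) {
    throw new Error(`Failed to trigger DAG run: ${res.status} ${await res.text()}`);
  }
}
```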
For almost two years, Airflow on Cloud Composer served us well. We were able to onboard new companies—some with several hundred experiments running every night. There were several points of friction and annoyance, though, among them that every task was launched via the KubernetesPodOperator and had to wait for Kubernetes to spin up new pods.

Throughout this time, it kept feeling like Airflow wasn’t the “right tool for the job.” We kept needing to build stability hacks as customers grew in size and number: file-based caching of API responses, sharding company experiment DAGs to keep the number of tasks per DAG low, and so on. We were spending too much time keeping a basic orchestration system running instead of building meaningful features for our customers. The breaking point came when we had enough concurrent experiments being calculated overnight that the Google Composer environment started crashing on a regular basis.
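For the curious, file-based caching of API responses is about as simple as hacks get; a sketch of the general idea (the cache location, TTL, and helper name are invented, not our production code):

```typescript
import { promises as fs } from 'fs';
import path from 'path';

const CACHE_DIR = '/tmp/api-cache';        // illustrative location
const TTL_MS = 5 * 60 * 1000;              // made-up TTL: reuse responses for 5 minutes

// Fetch a JSON API response, reusing a recent on-disk copy if one exists.
// The point of a hack like this is to stop hundreds of concurrent tasks from
// hammering the same configuration endpoints at once. Uses Node 18+ global fetch.
async function cachedGet(url: string): Promise<unknown> {
  const cacheFile = path.join(CACHE_DIR, encodeURIComponent(url) + '.json');

  try {
    const stat = await fs.stat(cacheFile);
    if (Date.now() - stat.mtimeMs < TTL_MS) {
      return JSON.parse(await fs.readFile(cacheFile, 'utf8'));
    }
  } catch {
    // Cache miss: the file does not exist yet.
  }

  const response = await fetch(url);
  const body = await response.json();

  await fs.mkdir(CACHE_DIR, { recursive: true });
  await fs.writeFile(cacheFile, JSON.stringify(body));
  return body;
}
```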
It turns out that the Composer 2 docs estimate the limit on concurrent tasks to be 400 (to our surprise… we started out on Composer 1, which had no such estimate at the time).
Working with support and an external Airflow consultant did not rectify the issues we were seeing, so in desperation we actually sharded Google Composer and created an entire Airflow setup dedicated to our largest customers.
This was not going to be a long-term solution, so we started seeking alternatives. With the primary goal of being able to run 50,000 concurrent experiments, our team split up and looked into these candidates:
We built out simulated experiment pipeline DAGs in each system to get a feel for how they performed in terms of scalability, DAG organization, and user experience.
Early results with the Airflow-like options were not great; the platforms all seemed to struggle with our simulated workloads for different reasons. For example, Dagster was able to scale to handle massive bursts of jobs only if the underlying Kubernetes cluster already had capacity; failures began popping up as the cluster took time to scale, which would require messy retry logic to get it operating smoothly in production. Argo Workflows handled bursts of DAG runs well, but the UI became laggy (something we also saw with Airflow). Overall, it seemed like the Airflow alternatives would perform similarly to (if not worse than) Airflow.
This led us to believe that, among the off-the-shelf options, sticking with a self-hosted version of Airflow would be preferable to ramping up a whole new system. That would still come with its own major challenges, since we would have to manage all of the infrastructure ourselves. All of these options shared a similar weakness for our use case: they launched tasks by spinning up containers in Kubernetes, incurring all of the CPU-, memory-, and, most importantly, time-intensive overhead associated with booting a new instance of our app.
In contrast, our NodeJs + BullMQ scaling test was massively successful; a single process churned through ~10k jobs in 10 seconds. The reason for this is obvious: all jobs run in the same process with minimal overhead. This test revealed more than just predictable performance and scaling benefits. It also demonstrated that BullMQ is a mature, easy-to-use, and seemingly reliable queue option for NodeJs. However, we were still reluctant to choose this option, since all it gave us out of the box was a simple Redis-based queue. We would need to build a lot of the features offered by those orchestration systems ourselves: advanced retry logic, execution history, a user-friendly UI, “zombie” task logic, trigger rules, DAG scheduling logic, cron jobs, etc. The classic build-vs-buy dilemma.
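For a sense of how little code that test required, a sketch along these lines captures the setup (the queue name, job payload, and Redis connection are illustrative, not our actual benchmark):

```typescript
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // local Redis for the test

// Queue and worker live in the same NodeJs process, so there is no
// per-job container or pod to boot.
const queue = new Queue('scaling-test', { connection });

const worker = new Worker(
  'scaling-test',
  async (job) => {
    // Stand-in for a lightweight task (our real jobs mostly wait on warehouse SQL).
    return job.data.experimentId;
  },
  { connection, concurrency: 100 },
);

async function run() {
  const start = Date.now();

  // Enqueue ~10k jobs in bulk.
  await queue.addBulk(
    Array.from({ length: 10_000 }, (_, i) => ({
      name: 'compute',
      data: { experimentId: i },
    })),
  );

  // Poll until the queue is drained, then report elapsed time.
  const interval = setInterval(async () => {
    const counts = await queue.getJobCounts('waiting', 'active');
    if (counts.waiting === 0 && counts.active === 0) {
      clearInterval(interval);
      console.log(`Processed 10k jobs in ${(Date.now() - start) / 1000}s`);
      await worker.close();
      await queue.close();
    }
  }, 500);
}

run();
```

Everything here runs inside a single long-lived NodeJs process, which is exactly why the per-job overhead all but disappears.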
This time, we decided to build. It felt like the overall effort of learning, building, and maintaining a system on any of the other platforms was roughly equivalent to the effort to build those additional features on top of BullMQ. More importantly, the cost savings alone would justify the effort. Since most of our DAG tasks are simply executing SQL against customer warehouses, they require very little CPU and memory. By using workers that run continuously and cut out most of the overhead, we would be able to more quickly process jobs using fewer resources.
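As one example of what “building on top of BullMQ” can look like, its FlowProducer already provides parent jobs that wait on their children, the basic dependency primitive of a DAG. The sketch below uses invented queue and job names and is not necessarily how Paddle models its pipelines:

```typescript
import { FlowProducer } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const flows = new FlowProducer({ connection });

// A parent job only runs once all of its children have completed, which is
// the basic dependency primitive of a DAG: here, an "aggregate-results" job
// waits on the per-experiment SQL jobs beneath it.
async function enqueueNightlyPipeline(companyId: number, experimentIds: number[]) {
  await flows.add({
    name: 'aggregate-results',
    queueName: 'pipeline',
    data: { companyId },
    children: experimentIds.map((experimentId) => ({
      name: 'compute-experiment',
      queueName: 'warehouse-sql',
      data: { companyId, experimentId },
    })),
  });
}
```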
The system we built was named “Paddle” because that is the name for a group of platypuses, our informal mascot at the time:
If you want to learn more about Paddle, the story continues in these blog posts about its implementation and the results of this effort.