This is the story of how Airflow got Eppo from zero to one, and why we recently ditched Airflow for an in-house solution built with NodeJs, BullMQ, and Postgres.
Eppo is a warehouse-native experimentation platform, which means we run a lot of queries against customer warehouses. We compute experiment results nightly to make sure customers have meaningful feedback on their most important business decisions in a timely manner. For our largest customers, this can mean thousands of running experiments equating to thousands of warehouse queries, all launched from scheduled jobs jostling to be run at once.
The gold standard for scheduled jobs in the field is Airflow, and it was the obvious choice for Eppo when we were getting off the ground. At the start of our journey, we chose Cloud Composer as a hosted option that spun up Airflow environments with ease. From there, we began adding DAGs for our various customer workflows.
There are some core jobs that Eppo needs to run on a regular basis when a customer successfully configures experiments in our system:
Originally, each of these jobs lived in its own scheduled DAG, with a couple of unscheduled versions that could be invoked via Airflow’s API when we needed to launch jobs from the Eppo UI:
That setup had two main issues: all customer jobs ran at the same time, and the DAGs got extremely messy with multiple tasks per customer. After a while we moved to customer-specific DAGs, which let us schedule each customer at a different time and gave us a single view of a customer’s nightly experiment pipeline run:
The source of record for experiment and company configurations was our NodeJs application; Airflow only held the DAG structure for jobs.
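For reference, kicking off one of those on-demand runs from our NodeJs app was just a call to Airflow’s REST API, along the lines of the sketch below (this uses Airflow 2’s stable API with a made-up DAG ID and basic auth; a Cloud Composer environment may authenticate differently, so treat the details as illustrative):

```typescript
// Illustrative only: trigger an on-demand DAG run through Airflow's stable
// REST API (POST /api/v1/dags/{dagId}/dagRuns). Uses Node 18+ global fetch.
async function triggerManualRefresh(companyId: number, experimentId: number): Promise<void> {
  const baseUrl = process.env.AIRFLOW_URL; // Airflow webserver URL
  const auth = Buffer.from(
    `${process.env.AIRFLOW_USER}:${process.env.AIRFLOW_PASSWORD}`,
  ).toString('base64');

  const res = await fetch(`${baseUrl}/api/v1/dags/manual_experiment_refresh/dagRuns`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Basic ${auth}`,
    },
    // `conf` is handed to the DAG run so its tasks know what to compute.
    body: JSON.stringify({ conf: { companyId, experimentId } }),
  });

  if (!res.ok) {
    throw new Error(`Failed to trigger DAG run: ${res.status} ${await res.text()}`);
  }
}
```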
For almost two years, Airflow on Cloud Composer served us well. We were able to onboard new companies—some with several hundred experiments running every night. There were several points of friction and annoyance, though, among them that every task was launched via the KubernetesPodOperator and had to wait for Kubernetes to spin up new pods.

Throughout this time, it kept feeling like Airflow wasn’t the “right tool for the job.” We kept needing to build stability hacks as customers grew in size and number: file-based caching of API responses, sharding company experiment DAGs to keep the number of tasks per DAG low, and so on. We were spending too much time keeping a basic orchestration system running instead of building meaningful features for our customers. The breaking point came when we had enough concurrent experiments being calculated overnight that the Google Composer environment started crashing on a regular basis.
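For the curious, file-based caching of API responses is about as simple as hacks get; a sketch of the general idea (the cache location, TTL, and helper name are invented, not our production code):

```typescript
import { promises as fs } from 'fs';
import path from 'path';

const CACHE_DIR = '/tmp/api-cache';        // illustrative location
const TTL_MS = 5 * 60 * 1000;              // made-up TTL: reuse responses for 5 minutes

// Fetch a JSON API response, reusing a recent on-disk copy if one exists.
// The point of a hack like this is to stop hundreds of concurrent tasks from
// hammering the same configuration endpoints at once. Uses Node 18+ global fetch.
async function cachedGet(url: string): Promise<unknown> {
  const cacheFile = path.join(CACHE_DIR, encodeURIComponent(url) + '.json');

  try {
    const stat = await fs.stat(cacheFile);
    if (Date.now() - stat.mtimeMs < TTL_MS) {
      return JSON.parse(await fs.readFile(cacheFile, 'utf8'));
    }
  } catch {
    // Cache miss: the file does not exist yet.
  }

  const response = await fetch(url);
  const body = await response.json();

  await fs.mkdir(CACHE_DIR, { recursive: true });
  await fs.writeFile(cacheFile, JSON.stringify(body));
  return body;
}
```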
It turns out that the Composer 2 docs estimate the limit on concurrent tasks to be 400 (to our surprise… we started out on Composer 1, which had no such estimate at the time).
Working with support and an external Airflow consultant did not rectify the issues we were seeing, so in desperation we actually sharded Google Composer and created an entire Airflow setup dedicated to our largest customers.
This was not going to be a long-term solution, so we started seeking alternatives. With the primary goal of being able to run 50,000 concurrent experiments, our team split up and looked into these candidates:
We built out simulated experiment pipeline DAGs in each system to get a feel for how they performed in terms of scalability, DAG organization, and user experience.
Early results with the Airflow-like options were not great; the platforms all seemed to struggle with our simulated workloads for different reasons. For example, Dagster was able to scale to handle massive bursts of jobs only if the underlying Kubernetes cluster already had capacity; failures began popping up as the cluster took time to scale, which would require messy retry logic to get it operating smoothly in production. Argo Workflows handled bursts of DAG runs well, but the UI became laggy (something we also saw with Airflow). Overall, it seemed like the Airflow alternatives would perform similarly to (if not worse than) Airflow.
This led us to believe that, among the off-the-shelf options, sticking with a self-hosted version of Airflow would be preferable to ramping up a whole new system. That would still come with its own major challenges, since we would have to manage all of the infrastructure ourselves. All of these options shared a similar weakness for our use case: they launched tasks by spinning up containers in Kubernetes, incurring all of the CPU-, memory-, and, most importantly, time-intensive overhead associated with booting a new instance of our app.
In contrast, our NodeJs + BullMQ scaling test was massively successful; a single process churned through ~10k jobs in 10 seconds. The reason for this is obvious: all jobs run in the same process with minimal overhead. This test revealed more than just predictable performance and scaling benefits. It also demonstrated that BullMQ is a mature, easy-to-use, and seemingly reliable queue option for NodeJs. However, we were still reluctant to choose this option, since all it gave us out of the box was a simple Redis-based queue. We would need to build a lot of the features offered by those orchestration systems ourselves: advanced retry logic, execution history, a user-friendly UI, “zombie” task logic, trigger rules, DAG scheduling logic, cron jobs, etc. The classic build-vs-buy dilemma.
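For a sense of how little code that test required, a sketch along these lines captures the setup (the queue name, job payload, and Redis connection are illustrative, not our actual benchmark):

```typescript
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // local Redis for the test

// Queue and worker live in the same NodeJs process, so there is no
// per-job container or pod to boot.
const queue = new Queue('scaling-test', { connection });

const worker = new Worker(
  'scaling-test',
  async (job) => {
    // Stand-in for a lightweight task (our real jobs mostly wait on warehouse SQL).
    return job.data.experimentId;
  },
  { connection, concurrency: 100 },
);

async function run() {
  const start = Date.now();

  // Enqueue ~10k jobs in bulk.
  await queue.addBulk(
    Array.from({ length: 10_000 }, (_, i) => ({
      name: 'compute',
      data: { experimentId: i },
    })),
  );

  // Poll until the queue is drained, then report elapsed time.
  const interval = setInterval(async () => {
    const counts = await queue.getJobCounts('waiting', 'active');
    if (counts.waiting === 0 && counts.active === 0) {
      clearInterval(interval);
      console.log(`Processed 10k jobs in ${(Date.now() - start) / 1000}s`);
      await worker.close();
      await queue.close();
    }
  }, 500);
}

run();
```

Everything here runs inside a single long-lived NodeJs process, which is exactly why the per-job overhead all but disappears.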
This time, we decided to build. It felt like the overall effort of learning, building, and maintaining a system on any of the other platforms was roughly equivalent to the effort to build those additional features on top of BullMQ. More importantly, the cost savings alone would justify the effort. Since most of our DAG tasks are simply executing SQL against customer warehouses, they require very little CPU and memory. By using workers that run continuously and cut out most of the overhead, we would be able to more quickly process jobs using fewer resources.
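As one example of what “building on top of BullMQ” can look like, its FlowProducer already provides parent jobs that wait on their children, the basic dependency primitive of a DAG. The sketch below uses invented queue and job names and is not necessarily how Paddle models its pipelines:

```typescript
import { FlowProducer } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const flows = new FlowProducer({ connection });

// A parent job only runs once all of its children have completed, which is
// the basic dependency primitive of a DAG: here, an "aggregate-results" job
// waits on the per-experiment SQL jobs beneath it.
async function enqueueNightlyPipeline(companyId: number, experimentIds: number[]) {
  await flows.add({
    name: 'aggregate-results',
    queueName: 'pipeline',
    data: { companyId },
    children: experimentIds.map((experimentId) => ({
      name: 'compute-experiment',
      queueName: 'warehouse-sql',
      data: { companyId, experimentId },
    })),
  });
}
```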
The system we built was named “Paddle” because that is the name for a group of platypuses, our informal mascot at the time:
If you want to learn more about Paddle, the story continues in these blog posts about its implementation and the results of this effort.