Identifying the right opportunity, running an experiment and making an informed decision can be worth millions. Building an experiment-driven company culture is a large undertaking: it involves rewarding people for constantly searching for new opportunities, and not being afraid to shut down good ideas if they turn out not to provide ROI.
Once you have the organizational buy-in, the greatest challenge of running experiments is ensuring that your data is reliable. You need reliable data to properly identify opportunities for experiments. For example, you would need correct data about the signup funnel in order to identify that your drop-off rate is higher than the industry average. Reliable data is also paramount to trusting the outcome of experiments so that your leadership team is comfortable with implementing business-critical changes.
In this post we look into five things you can do to ensure data quality when running experiments.
One of the most common challenges when running experiments is misalignment between the engineering teams implementing the feature flag and data scientists analyzing the outcome of the experiments. This leads to unfortunate outcomes such as:
To mitigate these types of problems, you should include the data scientist from the beginning of the process; they should work alongside the engineering team when planning out the experiment. Companies with the most sophisticated experimentation frameworks, such as Airbnb, define and certify core metrics ahead of any experiment process, and not reactively for each experiment.
Experimentation often involves new products that require novel instrumentation. It's helpful to consider: “Suppose this experiment doesn't work. What questions will I ask and what data do I need?”
Thinking a few steps ahead helps reduce the stress on the data team. If an important feature has just been shipped, it’s not uncommon for senior stakeholders to take an interest in the outcome. It can be helpful for the data scientist to work with the Product Manager on preemptively communicating expectations for the experiment by sharing a brief snippet with stakeholders.
Example:
“We’re rolling out an experiment to 20% of users to reduce contact rate. We expect the result to reach statistical significance within 14 days of being shipped. We’ll share a preemptive update on the initial results after the first 7 days. Keep track of health metrics in this dashboard: https://example.cloud.looker.com/dashboards/1”
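A duration estimate like the 14 days in the example can be sanity-checked with a standard two-proportion sample size approximation. The sketch below is only an illustration: the function names, the baseline and target contact rates, and the daily traffic figure are all hypothetical, and it assumes a two-sided test with the smaller arm of the experiment as the bottleneck, which is a conservative simplification.

```python
import math
from statistics import NormalDist


def required_sample_size(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_control vs p_treatment.

    Uses the common normal approximation for comparing two proportions
    (two-sided test). This is a planning estimate, not an exact answer.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def days_to_reach(n_per_group, daily_users, treatment_share):
    """Days until the smaller arm has collected n_per_group users."""
    smaller_share = min(treatment_share, 1 - treatment_share)
    return math.ceil(n_per_group / (daily_users * smaller_share))


# Hypothetical inputs: contact rate drops from 10% to 9%,
# 10,000 eligible users per day, 20% of them in the experiment.
n = required_sample_size(p_control=0.10, p_treatment=0.09)
days = days_to_reach(n, daily_users=10_000, treatment_share=0.20)
print(f"{n} users per group, about {days} days")
```

With these made-up inputs the estimate comes out at roughly a week. Smaller expected effects or lower traffic push the number up quickly, which is worth knowing before a timeline is promised to stakeholders.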
Be clear up front about how long you need to monitor your metrics to be able to confidently make a decision based on your experiment. For example, if you make a change to customer support to encourage more customers to use the in-app chat instead of calling in, you may want to measure the long-term impact on NPS and customer satisfaction.
You may evaluate the success of an experiment based on the overall reduction in support tickets while monitoring phone calls during the same period. But you may decide to measure NPS and customer satisfaction over a longer time period to account for implications such as the impact on happiness for new customers or delay in survey responses.
While this may look simple on the surface, it can mean putting guardrails in place to guarantee the reliability of these metrics over the entire time period. If the methodology for asking users about NPS changes during this period, or you neglect to monitor the data for quality issues, it becomes harder to assess the medium- to long-term effect of the experiment.
If you’re working at scale, you likely have hundreds of thousands of tables in your data warehouse. While not all are critical to your experiments, you’ll often be surprised by just how interconnected tables are and how an issue upstream can propagate downstream.
As a last resort, you learn about issues from stakeholders or end-users, but if you’re taking the quality of your data seriously, you’re likely running manual or automated data tests to catch issues proactively.
Manual data tests should be the backbone of your error detection and are available out-of-the-box in tools such as dbt. These are curated based on your business circumstances, and should at a minimum cover your most important data models and sources. Well-built tests help you catch issues before stakeholders and simplify the debugging process by highlighting or ruling out where issues occurred.
Synq has written an in-depth guide with ten practical steps to level up your tests in dbt, with concrete recommendations for how to achieve state-of-the-art monitoring with dbt tests.
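To make the idea of a manual test concrete, here is a minimal Python sketch of three checks that loosely mirror dbt's built-in `not_null`, `unique` and `accepted_values` tests. It is not dbt code: in dbt these are declared in YAML and run as SQL. The `assignments` rows and their column names are hypothetical.

```python
from collections import Counter


def not_null(rows, column):
    """Return the rows where the column is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Return the values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return [value for value, count in counts.items() if count > 1]


def accepted_values(rows, column, allowed):
    """Return the rows whose non-null value is outside the allowed set."""
    return [
        r for r in rows
        if r.get(column) is not None and r[column] not in allowed
    ]


# Hypothetical experiment assignment table with three planted problems.
assignments = [
    {"user_id": 1, "variant": "control"},
    {"user_id": 2, "variant": "treatment"},
    {"user_id": 2, "variant": "control"},    # user assigned twice
    {"user_id": 3, "variant": None},         # missing variant
    {"user_id": 4, "variant": "treatmnt"},   # typo in variant name
]

failures = {
    "variant is not null": not_null(assignments, "variant"),
    "user_id is unique": unique(assignments, "user_id"),
    "variant is accepted": accepted_values(
        assignments, "variant", {"control", "treatment"}
    ),
}
for name, failing in failures.items():
    print(name, "FAIL" if failing else "PASS", failing)
```

Each check returns the offending rows or values rather than a bare pass/fail, which is what makes debugging easier: a failure already points at where the issue is.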
Your first resort should be manual tests, as these can help cover gaps and tightly couple your business knowledge to expectations from the data. Adding checks to automatically detect anomalies on your data can be helpful to learn about issues that your manual controls may not capture.
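As a rough illustration of an automated anomaly check, the sketch below flags a day whose row count deviates strongly from recent history using a simple z-score. Real observability tools use more robust methods (seasonality, trends, learned thresholds); the function name, the threshold of 3 and the row counts here are assumptions made for the example.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it is far from the recent average.

    history: recent daily row counts (at least two values).
    threshold: number of standard deviations treated as anomalous
               (3.0 is an arbitrary starting point, not a recommendation).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in history: any change at all is unexpected.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily row counts for an experiment events table.
recent_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_volume_anomaly(recent_counts, today=1003))  # ordinary day
print(is_volume_anomaly(recent_counts, today=400))   # sudden drop in volume
```

A check like this knows nothing about your business, which is why it complements manual tests rather than replacing them.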
Anomaly detection controls and data observability platforms can help you detect issues across data quality, freshness, volume and schema.
If you’re running business-critical experiments, watch them like a hawk for the first few days and make sure you’ve shared responsibility between the data scientist, the business team and the product & engineering team.
Your experiments might succeed or fail, but your brand requires demonstrating reliable execution. That means systematic early detection and mitigation of bugs and setup issues.
You may only be able to say anything conclusive about the outcome of an experiment after the experiment's duration, but having dashboards and checks in place early on can help you catch unexpected issues. In order to catch issues early on, you might consider segmenting key metrics by factors that are key to your experiment – such as operating system or user type.
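Segmenting a key metric can be as simple as grouping by variant and segment before computing the rate. The sketch below assumes a hypothetical list of per-user records with `variant`, `os` and `converted` fields; in practice this would more likely be a SQL query or a dashboard filter.

```python
from collections import defaultdict


def conversion_by_segment(records, segment_key):
    """Return the conversion rate for each (variant, segment) pair."""
    totals = defaultdict(lambda: [0, 0])  # [users, conversions]
    for record in records:
        key = (record["variant"], record[segment_key])
        totals[key][0] += 1
        totals[key][1] += int(record["converted"])
    return {key: conv / users for key, (users, conv) in totals.items()}


# Hypothetical records: the treatment converts on iOS but never on Android,
# which could hint at a platform-specific bug rather than a real effect.
records = [
    {"variant": "control", "os": "ios", "converted": True},
    {"variant": "control", "os": "ios", "converted": False},
    {"variant": "control", "os": "android", "converted": True},
    {"variant": "control", "os": "android", "converted": False},
    {"variant": "treatment", "os": "ios", "converted": True},
    {"variant": "treatment", "os": "ios", "converted": True},
    {"variant": "treatment", "os": "android", "converted": False},
    {"variant": "treatment", "os": "android", "converted": False},
]

for key, rate in sorted(conversion_by_segment(records, "os").items()):
    print(key, f"{rate:.0%}")
```

An overall metric can look healthy while one segment is broken, so a per-segment view in the first few days is a cheap way to spot setup issues.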
It can help to have a shared Slack channel to discuss the experiment, as well as a dashboard that’s shared between business, data and product & engineering teams. This enables you to get input from as many places as possible, and people working on the business side can often bring in a unique set of operational insights from working with customers day-to-day.
For companies running many experiments, it’s not uncommon to leave behind a lot of remnant data and outdated dashboards. This can be costly, and it contributes to the overall messiness of the data warehouse and dashboards.
A new experiment typically requires spinning up at least one new data model. It may require that you build a dashboard, and in some cases you want to update the data model in (nearly) real time when the experiment is live. In some cases, we’ve seen teams update their experimentation data model every five minutes, only to forget to disable it after the experiment has concluded. This meant that one single data model for an archived experiment was costing upward of $10,000 annually.
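To see how a five-minute schedule adds up, the arithmetic can be sketched as follows. The only figures taken from the example above are the five-minute interval and the roughly $10,000 annual cost; the per-run cost is derived from them under the assumption that cost scales linearly with the number of runs, which will not hold exactly for every warehouse pricing model.

```python
def annual_runs(interval_minutes):
    """Number of scheduled runs per year for a fixed refresh interval."""
    runs_per_day = 24 * 60 // interval_minutes
    return runs_per_day * 365


def annual_cost(interval_minutes, cost_per_run):
    """Yearly cost, assuming every run costs the same amount."""
    return annual_runs(interval_minutes) * cost_per_run


# Implied per-run cost if a five-minute schedule costs about $10,000 a year.
implied_cost_per_run = 10_000 / annual_runs(5)

print(annual_runs(5))                        # runs per year at 5 minutes
print(round(implied_cost_per_run, 3))        # dollars per run (derived)
print(round(annual_cost(24 * 60, implied_cost_per_run), 2))  # daily refresh
```

Under these assumptions a five-minute schedule means 105,120 runs a year at just under ten cents each, while a daily refresh of the same model would cost only a few tens of dollars a year. Each run is cheap, which is exactly why a forgotten schedule is easy to overlook.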
It can be a good idea to have a documented approach to how you archive experiments. This could include:
In this article we looked at five ways you can ensure the data quality when running experiments.