Are mutually exclusive experiments necessary, or dangerous? Here's how to run them in Eppo.
How and when should you run experiments mutually exclusive of each other? It’s a common question in experimentation, but a tricky one to address. Teams that are newer to experimentation sometimes over-rely on mutually exclusive experiments out of a fear of interaction effects, or from a lack of statistical understanding. Experienced teams with validated use cases may face technical challenges in executing them. In this post, we’ll talk through all of those topics: what mutually exclusive experiments are, why you might want to use them, and how to execute them.
Experiments are mutually exclusive when you run two or more of them at the same time, but no user is included in more than one; essentially, each user sees one experiment or another, but never both.
Let’s use Eppo’s homepage to demonstrate. Our Product Marketing team wants to test out new messaging, and the Design team wants to rearrange the page layout, both with the hope of increasing clicks on the ‘Contact Us’ button. If they run both tests at the same time, they will essentially be running an A/B/C/D test on our homepage. Some users will see no changes, some will see just the new copy, some will see just the new layout, and others will see both the new copy and the new layout.
One question we hear often from new participants in experimentation goes along these lines: “If we’re running two experiments at once in the same place, on the same metric, and we see movement on that metric, how will we know which experiment caused the impact?” This is where the importance of randomization in experimentation comes in. Because each experiment independently randomizes users between the control and the variation, users have an equal likelihood of seeing each permutation of changes. Seeing the control in our messaging experiment does not make a user any more likely to see the control in our layout experiment. In this way, we ensure the two experiments are independent and each experiment’s results are uninfluenced by the other.
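To make the independence argument concrete, here is a minimal sketch of salted, hash-based assignment. It is an illustration only and not Eppo's SDK or assignment algorithm: the `assign` function, the experiment keys, and the user IDs are all hypothetical names chosen for this example. Because each experiment hashes the user ID with its own key, a user's bucket in one experiment tells you nothing about their bucket in the other, and the four combinations end up roughly equally populated.

```python
import hashlib
from collections import Counter

VARIANTS = ("control", "treatment")

def assign(user_id: str, experiment_key: str) -> str:
    """Deterministically bucket a user, salted by a (hypothetical) experiment key."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

def joint_counts(n_users: int) -> Counter:
    """Count how many users land in each (messaging, layout) combination."""
    counts = Counter()
    for i in range(n_users):
        user_id = f"user-{i}"  # made-up IDs for the illustration
        combo = (
            assign(user_id, "homepage-messaging"),
            assign(user_id, "homepage-layout"),
        )
        counts[combo] += 1
    return counts

# Print the size of each of the four cells of the implicit A/B/C/D test.
for combo, n in sorted(joint_counts(20_000).items()):
    print(combo, n)
```

Each cell should hold close to a quarter of the users, which is what lets each experiment be read out on its own.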
But suppose that some changes in these two experiments specifically conflict with each other. Maybe one of the layout changes in the Design team’s experiment entirely removes a section whose copy the Product Marketing team is testing. In this case, we want tighter control over which permutations users can see.
There are a couple of experiment design solutions these teams can consider:
Running an A/B/n test with every valid permutation included is one clear way to observe the interaction of both changes at the same time. This experiment design works well with a few changes, but with 4 or 5 changes the full set of permutations grows to 16 or 32 variations respectively. Each variation then receives a much smaller share of traffic, so the test takes much longer to reach statistical significance. For that reason, this solution does not scale well.
On the other hand, if you opt to make your experiments mutually exclusive, you face a similar tradeoff: each experiment gets a smaller sample size and therefore less statistical power. You can use Eppo’s Sample Size Calculator to evaluate these tradeoffs and design your experiments.
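The tradeoff between these designs can be put into rough numbers. The sketch below is not Eppo's Sample Size Calculator; it uses the standard normal-approximation formula for comparing two proportions, and the conversion rates, weekly traffic, and function names are assumptions made up for the example. It also simplifies by treating every design as a comparison of one arm against one control, with each experiment split 50/50.

```python
from math import ceil
from statistics import NormalDist

def users_per_arm(p_control: float, p_treatment: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)

def arm_traffic_share(design: str, n_changes: int) -> float:
    """Fraction of all traffic that reaches a single arm under each design."""
    if design == "overlapping":         # every user is in every experiment
        return 1 / 2
    if design == "mutually_exclusive":  # traffic is first split across experiments
        return 1 / (2 * n_changes)
    if design == "full_factorial":      # one arm per on/off combination
        return 1 / (2 ** n_changes)
    raise ValueError(f"unknown design: {design}")

def weeks_to_power(design: str, n_changes: int, weekly_users: int,
                   p_control: float, p_treatment: float) -> int:
    """Weeks until a single arm has collected the required sample."""
    needed = users_per_arm(p_control, p_treatment)
    return ceil(needed / (weekly_users * arm_traffic_share(design, n_changes)))

# Hypothetical inputs: 4 changes, 20,000 users per week, 10% -> 12% conversion.
for design in ("overlapping", "mutually_exclusive", "full_factorial"):
    print(design, weeks_to_power(design, 4, 20_000, 0.10, 0.12))
```

With the same traffic and target effect, the overlapping design needs the least calendar time, and the full-factorial design falls behind quickly as the number of changes grows.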
Research conducted at Microsoft showed that only one product group across 4 products showed interaction effects, observed as “a tiny number of abnormally small p-values, corresponding to 0.002%, or 1 in 50,000 A/B test pair metrics” [source]. Given the virtually nil risk of observing a true interaction effect, we recommend only making experiments mutually exclusive when the variant in one experiment fully invalidates the other experiment.
The example that the Microsoft article uses is a pair of experiments on ad copy: one tests a gray vs. red background, and the other tests black vs. red text. Users who receive both the red background and the red text will not be able to read the ad copy. This is clearly a bad user experience, and the experiment design introduces an interaction effect that will disrupt the results.
When interaction effects are not an issue, it’s worth considering that running mutually exclusive experiments can work against your best interest. Note what observations you are missing out on by not exploring all of the combinations of your variations: you may roll out a combination of changes that was never tested together, and it may perform worse than the data from each siloed test suggests.
The opposite may also be true: you may miss out on positive effects, say from a more cohesive site experience, by failing to test certain combinations of treatments. Imagine we are updating our checkout flow and have two tests we'd like to run: one on our cart page and another on our checkout page. Say we run our checkout page test first, but fail to observe a positive result and thus abandon the treatment. Our user allocation would look like this:
Since we ran the checkout page test first, we never got to observe the impact of the cart page variation in combination with the checkout page variation. If, on the other hand, the experiments were run at the same time and were not siloed, the user assignments would look like this:
In this example, users are spread across every permutation of the variations, so each combination is actually observed.
To illustrate what the effects of rolling out untested experiences could be, we will reference a simulation that the team at Analytics Toolkit ran.
Their example simulated random interactions between two variants of two tests (4 combinations in total) and then checked whether analyzing the tests blind to the interactions would yield outcomes different from the correct ones. From their analysis:
“Running 1000 simulated experiments, 323 (32.3%) resulted in picking the wrong variation for one of the two tests. 0 (0%) cases resulted in picking the wrong variation on both tests. In all cases only one of the winners was the wrong one, but this doesn’t make the error less bad, since sometimes the chosen combination will have the worst or second worst performance of the four…it is theoretically entirely possible for interference between the two tests to yield results different than the correct ones.”
In short, it’s possible that you’re leaving value on the table by not examining all of the permutations of your experiment variations. Given this, and the relatively low risk of interaction effects in most well-designed experiments, it is worth it in most cases to run tests concurrently and examine all of the intersections of those test variations.
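One simple way to examine those intersections is a difference-in-differences check on the four cells of two overlapping tests. The sketch below is an assumption-laden illustration and not how Eppo, Microsoft, or Analytics Toolkit compute interactions: the function name is hypothetical, it uses a plain normal approximation for conversion rates, and the counts passed to it are invented purely to exercise the code.

```python
from math import sqrt
from statistics import NormalDist

def interaction_estimate(cells):
    """Estimate the interaction between two overlapping A/B tests.

    `cells` maps (in_treatment_a, in_treatment_b) -> (conversions, users).
    Returns (interaction, z_score, two_sided_p_value).
    """
    rate = {key: conv / users for key, (conv, users) in cells.items()}
    # Effect of treatment A among users who did NOT get treatment B ...
    effect_a_without_b = rate[(True, False)] - rate[(False, False)]
    # ... and among users who DID get treatment B.
    effect_a_with_b = rate[(True, True)] - rate[(False, True)]
    interaction = effect_a_with_b - effect_a_without_b
    # Normal-approximation standard error: sum of the four cell variances.
    se = sqrt(sum(rate[k] * (1 - rate[k]) / cells[k][1] for k in cells))
    z = interaction / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return interaction, z, p_value

# Invented counts for illustration only: the combined cell converts far worse
# than either change alone, as in a "red text on red background" conflict.
example_cells = {
    (False, False): (1000, 10_000),
    (True, False): (1200, 10_000),
    (False, True): (1100, 10_000),
    (True, True): (200, 10_000),
}
print(interaction_estimate(example_cells))
```

An interaction near zero means the effect of one change does not depend on the other, which is the case where reading each experiment independently is safe.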
OK - so you’re comfortable with the tradeoffs (both strategic and statistical), and think there would be a true interaction effect between two experiments planned to be run simultaneously. How do you configure mutually exclusive experiments in Eppo?
We use nested feature flags with targeting rules to create a top-level exclusion group. You can read the full set of instructions in our docs here.
Once you've configured nested flags, it will be easy to create each experiment allocation as a mutually exclusive group:
Overall, just remember to think carefully about the context and situation at hand when deciding whether or not to run mutually exclusive experiments. Experiment responsibly!