Statistics
April 29, 2025

The Illusion of Safety: How Non-Informative Priors Lead to False Confidence

Tyler Buffington
Before Eppo, Tyler built the in-house experimentation platform at Big Fish Games. He holds a PhD from the University of Texas at Austin.

If you work in A/B testing, you have probably heard about using Bayesian statistics to analyze results. One of the most commonly cited benefits is the ability to incorporate prior knowledge into the analysis. However, few Bayesian A/B testing tools provide the ability to specify priors, instead relying on non-informative priors as a default.

These priors provide the illusion of safety; there is an understandable instinct to eliminate the influence of human judgment in experimental results, which makes the non-informative prior’s apparent objectivity quite appealing. However, a closer inspection reveals that the non-informative prior is actually quite a strong assumption in A/B testing that harms decision-making.

The A/B testing setting: weak information about direction but strong information about magnitude

In Bayesian statistics, non-informative priors are a popular default because they skirt around the potential subjectivity of choosing an informed prior. Also, in some applications, the prior information is quite weak relative to the information in the measured data, which means that any reasonable prior would effectively lead to the same results as a non-informative prior. For example, consider measuring the temperature of something that you are cooking. Before taking the measurement, you may know the approximate range of reasonable temperatures, perhaps between 50 and 75 °C. A good thermometer would tell you the temperature within about 1 °C. In this example, it’s clear that the measured data is much more precise than the prior beliefs about the temperature. However, if the measured data is imprecise relative to the prior knowledge, using non-informative priors is dangerous.

In Andrew Gelman’s post, Hidden dangers of noninformative priors, he accurately described the setting in which non-informative priors are problematic.

Any setting where the prior information really is strong, so that if you assume a flat prior, you can get silly estimates simply from noise variation.

Unfortunately, A/B testing falls squarely into this setting. This may sound surprising at first — after all, A/B test outcomes are notoriously difficult to predict, so how can it be true that the prior information is strong? To understand this, we must separate prior knowledge about the direction of the effect from prior knowledge about its magnitude.

In A/B testing, it’s reasonable to use a weak or non-informative prior about the direction of the treatment effect, indicating complete uncertainty about whether a test’s treatment will have a positive or negative effect on key business metrics. However, the prior information is clearly strong regarding the magnitude of the treatment effect, and this is where non-informative priors fall short.

It is rare for a well-powered A/B test to show more than a single-digit lift. It is common practice to call Twyman’s law on these results because extremely large lifts are quite surprising and should be viewed with skepticism. Despite this strong prior knowledge about the magnitude of A/B testing treatment effects, the non-informative prior assumes that any treatment effect is equally likely, meaning your idea is just as likely to triple your product’s revenue as it is to have a more realistic 1% lift. We also know that with small sample sizes (underpowered tests), it’s common to see Twyman’s law-inducing results or “silly estimates simply from noise variation,” so it is clear that A/B testing is subject to the dangers of non-informative priors that Gelman warns about.

The directionally agnostic and magnitude-informed prior is also consistent with typical results from meta-analyses of A/B tests, which often show a tight distribution of treatment effects centered near zero. See here for examples.

In many cases, the standard frequentist NHST approach used in A/B testing does a better job of accounting for the strong prior knowledge about the magnitude of treatment effects than most Bayesian A/B testing tools. This is because the standard frequentist A/B testing approach accounts for this prior knowledge through test planning — it’s a common convention to design tests with a minimum detectable effect of 5% or smaller, leading to target sample sizes in the hundreds of thousands for common business metrics such as conversion rate or revenue per user.
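As a rough illustration of where those sample sizes come from, here is a sketch of a standard two-proportion power calculation (the 5% baseline conversion rate, 5% relative minimum detectable effect, and 80% power are values assumed purely for illustration):

```python
from scipy.stats import norm

# Illustrative sample size for a two-sample test of proportions.
baseline = 0.05          # baseline conversion rate (assumed)
mde_rel = 0.05           # 5% relative minimum detectable effect (assumed)
alpha, power = 0.05, 0.80

delta = baseline * mde_rel                        # absolute lift to detect
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)     # ~2.80 for these settings
n_per_arm = 2 * (z / delta) ** 2 * baseline * (1 - baseline)

print(f"~{n_per_arm:,.0f} users per arm")         # roughly 120,000 per arm, ~240,000 total
```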

Some Bayesian testing tools claim to mitigate large sample size requirements by enabling one to reach confidence faster. Given that Bayesian results with a non-informative prior mirror those from a frequentist analysis, the accelerated confidence is purely due to using less stringent thresholds. For example, a p-value of 0.25 looks "insignificant," but it seems like a "good bet" when misinterpreted as a 75% chance to beat. As we will see in the next section, when one uses a realistic prior, the Bayesian approach is actually slower to make claims with confidence, and that’s a feature rather than a bug. When we use non-informative Bayesian approaches that enable us to reach “confidence” faster, we really end up shipping more harmful changes to the product. There is no free lunch.
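To make the relabeling explicit: under a flat prior and a normal approximation, the reported posterior probability that the treatment beats control is simply one minus the one-sided p-value,

$$\Pr(\text{true lift} > 0 \mid \text{data}) = 1 - p_{\text{one-sided}},$$

so a one-sided p-value of 0.25 gets reported as a “75% chance to beat control” without any additional evidence.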

First pitfall: overconfident results

At first glance, this is counterintuitive — how does capturing less information in the prior make us more confident in the experiment’s results?

Gelman and Tuerlinckx (2000) and a later post by Gelman shed light on the topic by investigating the conditions in which Bayesian methods make “claims with confidence,” which they define as a result in which the 95% credible interval excludes zero. The conclusion is that the tendency for a Bayesian analysis to make claims depends on a variance ratio, $\tau / \sigma$, where, in the context of A/B testing, $\sigma$ is the standard error of the estimated lift and $\tau$ is the standard deviation of the prior distribution on the true lift. The standard error of the lift, $\sigma$, is a function of the sample size, as larger sample sizes provide more precise estimates.
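To make this concrete in the simplest conjugate setting (a sketch assuming a normal likelihood for the estimated lift $\hat{\Delta}$ and a $N(0, \tau^2)$ prior on the true lift $\Delta$; not necessarily the exact model used by any particular tool), the posterior is

$$\Delta \mid \hat{\Delta} \;\sim\; N\!\left(\frac{\tau^2}{\tau^2 + \sigma^2}\,\hat{\Delta},\;\; \frac{\tau^2 \sigma^2}{\tau^2 + \sigma^2}\right),$$

and its 95% credible interval excludes zero only when

$$|\hat{\Delta}| \;>\; 1.96\,\sigma\sqrt{1 + \frac{\sigma^2}{\tau^2}}.$$

When $\tau / \sigma$ is small, this threshold is far stricter than the classical $1.96\,\sigma$; as $\tau / \sigma \rightarrow \infty$, it collapses back to the classical rule.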

When the variance ratio is small (as is the case when sample sizes are small), the Bayesian approach has nearly a 0% chance of making a claim with confidence. In the same setting, the classical frequentist approach makes claims with confidence 5% of the time. As the variance ratio increases, the Bayesian and classical frequentist approaches converge and eventually yield claims with confidence at the same rate.

With this lens, the intuition starts to become clear — when the estimates are noisy relative to the expected magnitude of true effects, we should avoid confident conclusions. The non-informative prior is equivalent to setting $\tau \rightarrow \infty$ and therefore $\tau / \sigma \rightarrow \infty$, which means that the Bayesian and classical frequentist approaches yield claims with confidence at the same rate *at any sample size,* and that rate will be at least 5%. In other words, the non-informative prior enables undue confidence based on weak evidence.

To illustrate this overconfidence in a realistic setting, we present a simulation with the following setup:

  • We simulate A/B tests at various sample sizes ranging from 2,000 to 200,000.
  • For each sample size, we simulate 500,000 A/B tests.
  • Users are assigned to either the control or treatment group, each with 50% probability.
  • The metric of interest is a conversion rate metric with a baseline (control) value of 5%.
  • We simulate treatment effects by drawing a random relative lift from a distribution of true effects modeled as $N(0, \tau^2)$, with $\tau$ set to 2.5% of the baseline conversion rate. This corresponds to a fairly realistic setting in which 95% of the true lifts are between -5% and 5%.
  • For each test, we analyze the results using a Bayesian approach, both with a non-informative prior and an informed prior that matches the distribution of true effects.
  • For each sample size, we calculate the proportion of tests whose 95% credible interval excludes zero, corresponding to a “claim with confidence.”
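A compressed sketch of this simulation (using normal-approximation posteriors and fewer simulated tests than described above, purely for illustration) could look like the following:

```python
import numpy as np

rng = np.random.default_rng(0)
baseline = 0.05      # baseline conversion rate
tau = 0.025          # sd of the distribution of true relative lifts
n_tests = 100_000    # fewer than the 500,000 used above, for speed

for n_total in [2_000, 20_000, 50_000, 200_000]:
    n = n_total // 2                                          # users per arm
    true_lift = rng.normal(0.0, tau, n_tests)                 # true relative lifts
    conv_c = rng.binomial(n, baseline, n_tests) / n
    conv_t = rng.binomial(n, baseline * (1 + true_lift), n_tests) / n

    est = conv_t / conv_c - 1                                 # estimated relative lift
    sigma = np.sqrt(2 * (1 - baseline) / (n * baseline))      # approximate SE of the relative lift

    # Non-informative prior: the posterior is just the likelihood, N(est, sigma^2).
    flat_claims = np.abs(est) > 1.96 * sigma

    # Informed prior N(0, tau^2): conjugate normal posterior with shrinkage factor k.
    k = tau**2 / (tau**2 + sigma**2)
    informed_claims = np.abs(k * est) > 1.96 * np.sqrt(k) * sigma

    print(f"n={n_total:>7,}: claims with confidence -- "
          f"flat {flat_claims.mean():.2%}, informed {informed_claims.mean():.2%}")
```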

The results are shown below:

Consistent with the results of Gelman and Tuerlinckx, our simulation shows that the informed prior avoids making claims with confidence at small sample sizes. Interestingly, claims with confidence are extremely rare when the sample size is below 50,000 with the informed prior, but they are substantially more common with the non-informative prior. These results are in direct contrast to the widespread notion that using non-informative priors is “conservative” or safe. It is actually the informed prior that is conservative in the context of making claims with confidence. At first, this may seem to indicate an advantage of using non-informative priors, as they enable experimenters to reach confidence more quickly. However, as we will show in the next section, this confidence is actually false confidence that lures one into shipping harmful product changes.

Second pitfall: tests are stopped early with false confidence

One of the main selling points of the Bayesian approach in A/B testing is that it enables faster decision-making and avoids the peeking problem. Building on the simulation described in the previous section, we now explore what happens when a test is stopped as soon as the Bayesian analysis confidently proclaims a winner. Specifically, we will focus on tests whose 95% posterior credible interval lies entirely above zero. This is equivalent to stopping the test if it reports more than a 97.5% probability to beat control.

The simulation in this example matches the setup of the one in the previous section, except that we now enroll 20,000 users per week and calculate the posterior distributions at the end of each week.
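A sketch of this weekly stopping rule, again using a normal-approximation posterior (the function name and the number of simulated tests are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
baseline, tau = 0.05, 0.025
weeks, per_week = 10, 10_000          # 20,000 users enrolled per week, split 50/50

def run_one_test(informed: bool):
    """Simulate one test; stop the first week the 95% credible interval is entirely above zero."""
    true_lift = rng.normal(0.0, tau)
    p_treatment = baseline * (1 + true_lift)
    conv_c = conv_t = n = 0
    for week in range(1, weeks + 1):
        n += per_week
        conv_c += rng.binomial(per_week, baseline)
        conv_t += rng.binomial(per_week, p_treatment)
        est = conv_t / conv_c - 1
        sigma = np.sqrt(2 * (1 - baseline) / (n * baseline))
        k = tau**2 / (tau**2 + sigma**2) if informed else 1.0
        # Stop if the reported "chance to beat control" exceeds 97.5%,
        # i.e. the 95% credible interval lies entirely above zero.
        if k * est - 1.96 * np.sqrt(k) * sigma > 0:
            return week, true_lift > 0
    return None, true_lift > 0

results = [run_one_test(informed=False) for _ in range(20_000)]
stopped = [(week, winner) for week, winner in results if week is not None]
print(f"{len(stopped) / len(results):.1%} of tests stopped early; "
      f"{np.mean([winner for _, winner in stopped]):.1%} of those are true winners")
```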

Based on the findings of the previous section, we should expect the non-informative prior to report more winners with confidence earlier in the test, but how often are the reported winners actually winners, meaning that the true treatment effect, $\Delta$, is positive? The results are shown below.

With the non-informative prior, we see that more than 3% of tests are stopped after the first week because they have a greater than 97.5% “chance to beat control.” Despite this reported confidence, far fewer than 97.5% of these tests are winners! The additional claims with confidence made by the non-informative Bayesian approach lead to shipping a surprisingly large number of ideas that harm conversion rates.

Let’s compare the results to the case with the informed prior:

These results tell a very different story. For one, the informed prior declares almost no tests winners with confidence until around week 3, when the total sample size is at least 60,000. Additionally, the proportion of tests with true positive treatment effects among those stopped early is above 97.5%, regardless of when they are stopped!

It’s worth noting that there are arguably better Bayesian stopping rules than “end the test when the posterior probability to beat control exceeds X,” but we use it here due to its widespread usage in Bayesian A/B testing calculators.

Third pitfall: estimated treatment effects are exaggerated

One of the main advantages of the proper application of Bayesian inference is that it corrects for exaggerated effects caused by the winner’s curse by shrinking estimates of the treatment effect. Amazon has recently described these benefits in their publication Overcoming the winner’s curse: Leveraging Bayesian inference to improve estimates of the impact of features launched via A/B tests.

However, these advantages are completely lost when we use non-informative priors.

We can show this using the same simulation we presented before. This time, we compare the estimated lift to the true lift among winning tests at different sample sizes. The results are shown below.

The results show that the estimated lifts of tests declared winners by the non-informative Bayesian approach are quite exaggerated. Conversely, the informed Bayesian approach provides shrunk estimates that are unbiased on average. Note that we only show results for sample sizes that have more than 100 tests with a reported chance to beat above 97.5%, which is why the informed Bayesian approach does not have any results for sample sizes below 40,000.
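For reference, here is a self-contained sketch of this comparison at a single, illustrative sample size, using the same normal-approximation posteriors as in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(2)
baseline, tau, n = 0.05, 0.025, 25_000             # n = users per arm (illustrative)
true_lift = rng.normal(0.0, tau, 500_000)
conv_c = rng.binomial(n, baseline, true_lift.size) / n
conv_t = rng.binomial(n, baseline * (1 + true_lift), true_lift.size) / n
est = conv_t / conv_c - 1
sigma = np.sqrt(2 * (1 - baseline) / (n * baseline))
k = tau**2 / (tau**2 + sigma**2)

flat_win = est - 1.96 * sigma > 0                  # flat-prior "winners"
inf_win = k * est - 1.96 * np.sqrt(k) * sigma > 0  # informed-prior "winners"

print(f"Flat prior:     mean estimate {est[flat_win].mean():+.2%}, "
      f"mean true lift {true_lift[flat_win].mean():+.2%}")
print(f"Informed prior: mean estimate {(k * est)[inf_win].mean():+.2%}, "
      f"mean true lift {true_lift[inf_win].mean():+.2%}")
```

Under this setup, the flat-prior estimates among declared winners overshoot the true lifts by a wide margin, while the shrunk posterior means from the informed prior track them closely on average.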

Final remarks and conclusions

Although non-informative priors may appear safe at first glance, it is clear that they lead to poor, overconfident decisions. The core issue is that they make a surprisingly strong assumption that any treatment effect is possible, which is not realistic in the context of A/B testing. Although it may be reasonable to specify a prior that is non-informative about direction, it is certainly not reasonable to specify a prior that is non-informative about magnitude. This is not just a pedantic point, as specifying a realistic prior can significantly change the results, especially in the early stages of a test when only a small sample has been collected.

It’s worth noting that our simulations have the luxury of specifying the “correct” prior, which is generally difficult to do in practice. However, we should be thoughtful and choose one that is sensible, based on available information, rather than one that is clearly wrong. Avoid the temptation to skirt around a difficult question by choosing a very wrong, but seemingly objective answer.

Lastly, this is not an endorsement of Bayesian inference over frequentist inference. Both approaches have their merits, and both can sensibly leverage relevant prior information, whether through a prior or experimental design. If one opts for a Bayesian approach, one should accept the responsibility of choosing a thoughtful prior.


Thank you to Sven Schmit, Lukas Goetz-Weiss, and Ryan Lucht for valuable comments and discussions that shaped this post, as well as Ryan Cala and Katie Petriella for editing and artwork.
