A/B Testing

February 22, 2024

Surviving the "Peeking Problem": A Self-Help Guide for Product Managers

How to avoid the pitfalls of peeking at A/B tests without flying blind

Eric Metelka

Before joining Eppo as Head of Product, Eric led experimentation programs at companies like Cameo and Cars.com

If you’re like me, you’ve sat down in front of your A/B experimentation tool of choice and pressed F5 to refresh the page. And just like me, you didn’t just refresh once. You refreshed every ten minutes. Or five minutes. Or you refreshed after every page load because, why not, you can.

Unbeknownst to me at the time, there were downsides to the refresh addiction I had developed, which is better known in the A/B testing realm as “peeking”. Those downsides could easily have included totally incorrect decisions. There were alternative tools I could have used that would have soothed my anxiety while giving me confidence that my experiments were moving in the right direction.

PMs and Experimentation: Tricky Terrain

As a product manager, running A/B tests is anxiety-inducing. As a PM, you are measured on impact, which means shipping features as often as possible to move your goal metrics. You spend time upfront in discovery and design to ensure you’re building the right solution to solve a user problem. Once you’ve released the feature, experimentation is the core tool you have to show leadership you’ve achieved your goals. That means all this effort and time leads you to experimentation software that shows you a different number for your performance every time you load it.

Every time I refreshed the A/B testing tool, I was trying to get that hit of dopamine. It’s like doomscrolling on social media and hoping the algorithm will serve me that divine piece of content that triggers the right part of my brain. But “doom-refreshing” an experiment comes with even greater consequences. I was hoping that on every next refresh, the number would go up and to the right. Because if it did, I wasn't just seeing a good piece of content, I was potentially meeting my quarterly goals, making the company more money, and putting myself in place for a bonus or promotion.

What Is Peeking and Why Is It a Problem?

The problem is that looking at interim results too often can undermine the test through a phenomenon known as the "peeking problem". Peeking refers to checking the interim results of an A/B test before it has been completed. Most companies use a “fixed sample” test setup (a simple t-test being the most common), which assumes no one will peek at results during the experiment runtime.

The pitfalls of “peeking” at interim calculations in a fixed-sample test can be as dramatic as making the opposite decision from the one the full data would support. Statistical significance and p-value calculations make no allowance for the fact that your sample is still being collected, so an interim result may show a “stat-sig” win for a change that turns out to be harmful once enough data is in, or vice versa. This is exacerbated by sophisticated tools offering updates in near-real-time, tempting us to refresh our dashboards in the hope of catching a positive trend.
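To make the cost of peeking concrete, here is a minimal simulation sketch. It is not taken from any particular tool: the function name `false_positive_rate`, the sample sizes, and the use of a simple z-test with known variance are all assumptions chosen for illustration. It runs A/A experiments (no true difference between arms) and counts how often a PM who stops at the first “stat-sig” peek would declare a winner.

```python
import math
import random

def false_positive_rate(n_experiments, n_per_arm, n_peeks, seed=0):
    """Share of A/A experiments declared 'significant' at any peek."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% threshold, valid for a single final look
    # Evenly spaced checkpoints; the last one is the planned end of the test.
    checkpoints = {round(n_per_arm * k / n_peeks) for k in range(1, n_peeks + 1)}
    false_positives = 0
    for _ in range(n_experiments):
        sum_a = sum_b = 0.0
        for n in range(1, n_per_arm + 1):
            # Both arms draw from the same distribution: there is no true effect.
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
            if n in checkpoints:
                # z-statistic for a difference in means with known unit variance
                z = (sum_b / n - sum_a / n) / math.sqrt(2.0 / n)
                if abs(z) > z_crit:
                    false_positives += 1
                    break  # the test gets "called" at the first stat-sig peek
    return false_positives / n_experiments

single_look = false_positive_rate(1000, 500, n_peeks=1)
ten_looks = false_positive_rate(1000, 500, n_peeks=10)
print(f"one look: {single_look:.3f}, ten looks: {ten_looks:.3f}")
```

With a single look at the end, the false positive rate should land near the nominal 5%. With ten looks and the same threshold, it should come out well above that, even though nothing about the product changed.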

But of course we likely want to get regular updates on how our ideas are performing in the real world. When an experiment is rolling out, we want to know that it’s configured correctly and that end users can use the experience we shipped. We also want to make sure we’re not causing any harm to site performance or key revenue metrics. Finally, we want to know if a primary metric is statistically significant at an earlier point than expected, potentially allowing us to ship the experiment earlier, meaning we can make a greater impact sooner.

To account for this, modern A/B testing software often uses a sequential test. Sequential tests allow for peeking, but they do so by sacrificing some of the statistical power of the test, widening the final confidence intervals around our metrics. In other words, we’re trading the ability to peek at the results for longer-running tests with less precise results. The size of this trade-off scales with how much we peek: the more peeks planned, the more power sacrificed. Knowing this trade-off, would we want to use sequential tests, or would we run fixed-sample tests in certain circumstances?
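A rough way to see why more planned peeks cost more precision is sketched below. As a stand-in for a real sequential method, it uses a Bonferroni correction, which splits the 5% error budget evenly across the planned peeks. This is deliberately conservative and is not how any specific tool implements sequential testing; production methods are typically more efficient. The function names are hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha, n_peeks):
    """Two-sided critical value when alpha is split evenly across planned peeks."""
    per_look_alpha = alpha / n_peeks  # Bonferroni: each look gets an equal share
    return NormalDist().inv_cdf(1 - per_look_alpha / 2)

def interval_width_ratio(alpha, n_peeks):
    """Confidence interval width relative to a single-look, fixed-sample interval."""
    return critical_z(alpha, n_peeks) / critical_z(alpha, 1)

for peeks in (1, 2, 5, 10):
    z = critical_z(0.05, peeks)
    ratio = interval_width_ratio(0.05, peeks)
    print(f"{peeks:>2} peeks: critical z = {z:.2f}, interval is {ratio:.2f}x as wide")
```

The bar for significance rises with every extra planned look, so the same observed lift is harder to call, and the intervals around the metric get wider.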

Additional Substitutes for Peeking

Asking PMs to practice self-discipline and seek a deep understanding of statistics is not a solution. But fighting the urge to peek doesn't mean flying blind until the experiment concludes. Instead, there are strategic practices and tools that can offer peace of mind without compromising the integrity of experiments:

Rollout Monitoring: The early stages of an experiment are critical for ensuring that everything is functioning as intended. This is where rollout monitoring comes into play, focusing on reducing risk by verifying an even user split and confirming the operational status of new features. It's a foundational step for safeguarding against performance issues or disruptions to core metrics.
Diagnostics: Checking the pulse of the experiment doesn't stop with the initial rollout; it's an ongoing rhythm that requires constant attention. Diagnostics serve as the stethoscope, listening for irregularities in user assignments and ensuring the experiment's health remains robust. This continuous oversight helps preserve the validity of the test, protecting it from unseen variables or shifts.
Notifications: The modern PM's dilemma of wanting to stay informed without succumbing to constant manual checks can be alleviated through smart notifications that push results to the PM, so they no longer need to go looking for them. By setting up alerts for significant milestones or statistical shifts, PMs can keep their finger on the pulse of their experiments without the temptation to peek, allowing them to make informed decisions based on substantial changes.
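The “even user split” check mentioned under rollout monitoring and diagnostics can be sketched as a sample ratio mismatch test. This is a minimal illustration, assuming a planned 50/50 split and a chi-square goodness-of-fit test with one degree of freedom. The function names and the 0.001 alert threshold are assumptions for the example, not settings from any particular product.

```python
import math

def sample_ratio_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """p-value for the observed split versus the planned split (chi-square, 1 df)."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def split_looks_healthy(control_users, treatment_users, threshold=0.001):
    """True when the split is plausible under the planned 50/50 assignment."""
    return sample_ratio_p_value(control_users, treatment_users) >= threshold

print(split_looks_healthy(5000, 5050))  # small imbalance, consistent with chance
print(split_looks_healthy(5000, 5500))  # large imbalance, worth investigating
```

A check like this looks only at assignment counts, not at the metric you are hoping to move, so running it as often as you like does not create the peeking problem.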

Incorporating these tools not only helps manage the peeking impulse but also ensures that our decisions are rooted in comprehensive, reliable data. By adopting a patient, methodical approach to experimentation, we can navigate the challenges of product development with confidence, ensuring that when we do look at our results, they're as meaningful and accurate as possible.

Conclusion

A/B testing with proper statistical methods is crucial for product development. But as I know all too well, impatience can undermine results. With the right mix of tools and knowledge of trade-offs, PMs can navigate the suspense of experiments with wisdom and confidence, ultimately producing the results that drive their career growth. If I can tackle my peeking problem, any PM can.