A/B Testing Pitfalls: What Works and What Doesn’t with Real Data
Why Most “Winning” Experiments Fail in Production and How Top Companies Avoid It
Image by Author

# Introduction

You've shipped what looks like a winning test: conversion up 8%, engagement metrics glowing green. Then it crashes in production or quietly fails a month later. If that sounds familiar, you're not alone.

Most A/B test failures don't come from bad product ideas; they come from bad experimentation practices. The data misled you, the stopping rule was ignored, or no one checked whether the "win" was just noise dressed as a signal.

Here's the uncomfortable truth: the infrastructure around your test matters more than the variant itself, and most teams get it wrong. Let's break down the four silent killers of A/B testing, from misleading data to flawed logic, and the disciplined practices that separate the best from the rest.

Image by Author

# When Data Lies: SRM and Data Quality Failures

Pitfall: Most "surprising" test results aren't insights; they're data-quality bugs wearing a disguise.

Sample Ratio Mismatch (SRM) is the canary in the coal mine. You expect a 50/50 split; you get 52/48. Sounds harmless. It's not. SRM signals broken randomization, biased traffic routing, or logging failures that silently corrupt your results.

Real-world cases: Microsoft found that SRM signals severe data-quality issues that invalidate experiment results, meaning tests with SRM often lead to wrong ship decisions. DoorDash detected SRM after low-intent users dropped out disproportionately from one group following a bug fix, skewing results and creating phantom wins.

What to check if you have SRM:

Image by Author

- Chi-squared test for traffic splits: automate this before any analysis (a minimal sketch appears at the end of this excerpt).
- User-level vs. session-level logging: mismatched granularity creates phantom effects.
- Time-based bucketing bugs: Monday users in control, Friday users in treatment = confounded results.

Solution: The fix isn't statistical cleverness. It's data hygiene. Run SRM checks before looking at metrics. If the test fails the ratio check, stop. Investigate. Fix the randomization. No exceptions.

Want to practice spotting data-quality issues like SRM or logging mismatches? Try a few real SQL data-cleaning and anomaly-detection challenges on StrataScratch. You'll find datasets from real companies to test your debugging and data validation skills.

Most teams skip this step. That's why most "successful" tests fail in production.

# Stop Peeking: How Early Looks Ruin Validity

Pitfall: Checking your test results every morning feels productive. It's not. It's systematically inflating your false positive rate.

Here's why: every time you look at p-values and decide whether to stop, you're giving randomness another chance to fool you. Run 20 peeks on a null effect, and you'll eventually see p < 0.05 by pure luck. Optimizely's research found that uncorrected peeking can raise false positives from 5% to over 25%, meaning one in four "wins" is noise.

How to recognize the naive approach: run the test for two weeks, check daily, and stop when p < 0.05. Result: you've run 14 comparisons without any adjustment for multiple looks.

Solution: Use sequential testing or always-valid inference methods that adjust for multiple looks (sketches of the problem and of simple corrections appear at the end of this excerpt).

Real-world cases:

- Spotify's approach: Group sequential tests (GST) with alpha-spending functions optimally account for multiple looks by exploiting the correlation structure between interim tests.
- Optimizely's solution: Always-valid p-values that account for continuous monitoring, allowing safe peeking without inflating error rates.
- Netflix's method: Sequential testing with…
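A minimal sketch of the automated chi-squared SRM check referenced above. The function name, the 50/50 expected split, and the p < 0.001 alarm threshold are illustrative choices, not a specification from the article.

```python
# Chi-squared goodness-of-fit check on assignment counts, run before any
# metric analysis. A tiny p-value means the observed split is very unlikely
# under the intended ratio, i.e. a probable sample ratio mismatch.
from scipy.stats import chisquare

def srm_check(control_users: int, treatment_users: int,
              expected_split=(0.5, 0.5), alarm_p=0.001):
    observed = [control_users, treatment_users]
    total = sum(observed)
    expected = [share * total for share in expected_split]
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    return p_value >= alarm_p, p_value  # (looks_healthy, p_value)

# A "harmless-looking" 52/48 split on 100k users fails the check decisively.
healthy, p = srm_check(52_000, 48_000)
print(f"split healthy: {healthy}, p-value: {p:.2e}")
```

If the check fails, the article's advice applies unchanged: stop, investigate the randomization and logging, and only then look at metrics.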
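A small simulation of the peeking problem, assuming a two-proportion z-test checked once per day on an A/A-style experiment with no true effect; the 14-day horizon, traffic volume, and base rate are illustrative. It also shows the crudest valid correction, a Bonferroni-adjusted per-look threshold; the group sequential and always-valid methods the article cites are more efficient but more involved.

```python
# With no true effect, stopping at the first daily p < 0.05 inflates the false
# positive rate well above 5%; splitting the threshold across the 14 looks
# (Bonferroni) is a blunt but valid fix.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def peeking_experiment(per_look_alpha, days=14, users_per_day=1_000, rate=0.10):
    """Return True if the analyst stops early and declares a (false) win."""
    a_conv = b_conv = n = 0
    for _ in range(days):
        n += users_per_day
        a_conv += rng.binomial(users_per_day, rate)  # control, null effect
        b_conv += rng.binomial(users_per_day, rate)  # treatment, null effect
        pooled = (a_conv + b_conv) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        z = (b_conv / n - a_conv / n) / se
        if 2 * norm.sf(abs(z)) < per_look_alpha:     # peek and maybe stop
            return True
    return False

runs = 2_000
naive = sum(peeking_experiment(0.05) for _ in range(runs)) / runs
bonferroni = sum(peeking_experiment(0.05 / 14) for _ in range(runs)) / runs
print(f"false positive rate, daily peeking at p < 0.05:   {naive:.1%}")
print(f"false positive rate, Bonferroni per-look threshold: {bonferroni:.1%}")
```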
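For the always-valid-inference side, a rough sketch of a mixture sequential probability ratio test (mSPRT), in the spirit of the approach the article attributes to Optimizely but not their implementation. It assumes the per-user treatment-minus-control differences are approximately normal with known variance, and the mixing variance `tau2` is a tuning choice.

```python
# Always-valid p-values via an mSPRT with a normal mixture over the effect
# size: the running p-value is monotone non-increasing and can be checked at
# every peek while keeping the false positive rate at or below alpha.
import numpy as np

def always_valid_p(diffs, sigma2, tau2=1.0, theta0=0.0):
    """Running always-valid p-values for a stream of paired differences."""
    diffs = np.asarray(diffs, dtype=float)
    n = np.arange(1, diffs.size + 1)
    s = np.cumsum(diffs - theta0)  # cumulative evidence for an effect
    # Mixture likelihood ratio against theta = theta0, prior N(theta0, tau2).
    log_lam = (0.5 * np.log(sigma2 / (sigma2 + n * tau2))
               + tau2 * s**2 / (2 * sigma2 * (sigma2 + n * tau2)))
    p_now = np.minimum(1.0, np.exp(-log_lam))  # 1 / Lambda_n, capped at 1
    return np.minimum.accumulate(p_now)        # enforce monotonicity over peeks

# A/A-style stream with no true effect: the running p-value rarely drops
# below 0.05 no matter how often it is inspected.
rng = np.random.default_rng(1)
p = always_valid_p(rng.normal(0.0, 1.0, size=10_000), sigma2=1.0)
print(f"final (minimum) always-valid p-value: {p[-1]:.3f}")
```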
This excerpt is published under fair use for community discussion. Read the full article at KDnuggets.