Product Experimentation with Propensity Scores: Causal Inference for LLM-Based Features in Python
Product teams often struggle to measure the true impact of LLM-based features when users self-select into them via opt-in toggles. Naïve comparisons between users who opt in and those who don't are biased by differences in engagement, intent, and risk tolerance. Propensity score methods help correct this bias by statistically reweighting or matching groups to approximate the results of a randomized experiment, provided the characteristics that drive opt-in are observed.
- Users who opt into AI features like 'Try agent mode' are not a random sample and differ systematically from non-users.
- The observed performance gap between opt-in and non-opt-in users often reflects pre-existing differences rather than the actual effect of the feature.
- Propensity score methods, such as inverse-probability weighting and matching, can adjust for selection bias by balancing observable user characteristics.
- This tutorial demonstrates the full pipeline using a synthetic dataset with known causal effects, including diagnostics and confidence intervals.
- The companion code provides an end-to-end implementation in Python for applying these methods to real-world product data.
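
The reweighting idea in the points above can be illustrated with a minimal sketch. This is not the article's companion code: the engagement tiers, opt-in rates, baseline completion rates, and the 5-point true effect are all hypothetical values chosen for illustration. It assumes a single discrete confounder (engagement tier), so the propensity score can be estimated as the observed opt-in share within each tier instead of with a fitted model, and it uses the normalized (Hajek) form of inverse-probability weighting.

```python
import random

random.seed(0)

# Assumption: the feature truly adds 5 points to completion probability.
TRUE_EFFECT = 0.05

# Hypothetical tiers: (share of users, opt-in probability, baseline completion rate)
TIERS = {
    "light": (0.5, 0.1, 0.30),
    "medium": (0.3, 0.4, 0.50),
    "heavy": (0.2, 0.8, 0.70),
}


def simulate(n):
    """Generate (tier, opted_in, completed) rows with self-selected opt-in."""
    names = list(TIERS)
    shares = [TIERS[t][0] for t in names]
    rows = []
    for _ in range(n):
        tier = random.choices(names, weights=shares)[0]
        _, p_opt_in, baseline = TIERS[tier]
        opted_in = random.random() < p_opt_in
        p_complete = baseline + (TRUE_EFFECT if opted_in else 0.0)
        completed = random.random() < p_complete
        rows.append((tier, opted_in, completed))
    return rows


def naive_difference(rows):
    """Raw gap in completion rate between opt-in and non-opt-in users."""
    treated = [c for _, d, c in rows if d]
    control = [c for _, d, c in rows if not d]
    return sum(treated) / len(treated) - sum(control) / len(control)


def estimate_propensity(rows):
    """Estimate P(opt-in | tier) as the observed opt-in share per tier."""
    counts = {}
    for tier, opted_in, _ in rows:
        n, k = counts.get(tier, (0, 0))
        counts[tier] = (n + 1, k + opted_in)
    return {tier: k / n for tier, (n, k) in counts.items()}


def ipw_difference(rows, propensity):
    """Normalized (Hajek) inverse-probability-weighted difference in means."""
    num_t = den_t = num_c = den_c = 0.0
    for tier, opted_in, completed in rows:
        e = propensity[tier]
        if opted_in:
            w = 1.0 / e            # up-weight opt-in users from low-propensity tiers
            num_t += w * completed
            den_t += w
        else:
            w = 1.0 / (1.0 - e)    # up-weight non-users from high-propensity tiers
            num_c += w * completed
            den_c += w
    return num_t / den_t - num_c / den_c


rows = simulate(50_000)
naive = naive_difference(rows)
ipw = ipw_difference(rows, estimate_propensity(rows))
print(f"true effect: {TRUE_EFFECT:.3f}  naive gap: {naive:.3f}  IPW estimate: {ipw:.3f}")
```

In this setup the naive gap is inflated because heavy users both opt in more often and complete more tasks regardless of the feature, while the weighted estimate should land near the assumed true effect. With real product data the confounders are rarely a single discrete tier, so the propensity would typically come from a fitted model (for example logistic regression), and overlap and balance diagnostics matter before trusting the estimate.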
Opening excerpt (first ~120 words)
April 30, 2026 / #product experimentation
Rudrendu Paul

Every product experimentation team running causal inference on LLM-based features eventually hits the same wall: when users click "Try our AI assistant," the volunteers aren't a random sample.

Your product shipped a new agent mode last quarter. Users have to tap the "Try agent mode" toggle to enable it. The dashboard numbers look stunning: agent-mode users complete 21 percentage points more tasks than non-users. The CPO calls it the best feature launch of the year.

But you know something's off. Heavy-engagement users opt into new features constantly, while light users ignore toggles entirely.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at freeCodeCamp.