Proper Scoring Rules for Agentic Uncertainty Quantification
The paper introduces the Trajectory Proper Score (TPS) for evaluating uncertainty quantification (UQ) in language-model agents. It argues that existing evaluation metrics fall short and shows how TPS can better elicit success probabilities. In the experiments, recalibrating probabilities changes TPS results substantially while rank metrics remain stable.
- The Trajectory Proper Score (TPS) is a new family of scoring rules for evaluating the per-step uncertainty signals that language-model agents emit.
- Existing metrics such as AUROC and Trajectory ECE do not fully capture the success-probability process.
- Experiments on various datasets show that probability recalibration can alter TPS results without affecting rank metrics.
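The last point can be illustrated with a small sketch. It does not implement TPS, whose definition the excerpt does not give; it uses the Brier score as a stand-in proper scoring rule and a hypothetical strictly increasing map (`recalibrate`) as the recalibration. The toy probabilities and outcomes are invented for illustration only. Because the map preserves the ordering of the predictions, AUROC is unchanged, while the proper score moves.

```python
# Toy data (assumed, not from the paper): per-step success probabilities
# and the binary outcome of the trajectory each step belongs to.
probs = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]


def brier(p, y):
    """Mean squared error between predicted probability and binary outcome."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(p)


def auroc(p, y):
    """Chance that a random success is ranked above a random failure (ties count 0.5)."""
    pos = [pi for pi, yi in zip(p, y) if yi == 1]
    neg = [pi for pi, yi in zip(p, y) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def recalibrate(p, power=3.0):
    """Hypothetical recalibration: a strictly increasing map on [0, 1]."""
    return [pi ** power for pi in p]


recal = recalibrate(probs)

auroc_before, auroc_after = auroc(probs, labels), auroc(recal, labels)
brier_before, brier_after = brier(probs, labels), brier(recal, labels)

print("AUROC before/after:", auroc_before, auroc_after)
print("Brier before/after:", brier_before, brier_after)
```

Any strictly increasing map would serve in place of the power transform; the point is only that a rank metric cannot distinguish the two sets of probabilities, whereas a proper scoring rule can.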
Opening excerpt (first ~120 words)
arXiv:2605.24756 (cs), Computer Science > Artificial Intelligence. Submitted on 23 May 2026.
Authors: Suresh Raghu, Satwik Pandey, Shashwat Pandey
Abstract: Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ranking usefulness with probabilistic truthfulness.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.