StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
The paper presents StepOPSD, a framework for reinforcement learning of multi-turn agents. It targets the credit-assignment mismatch between sparse, trajectory-level rewards and the few local decisions that determine success, and does so by supervising action-centered step segments rather than whole trajectories. The reported results show improved performance on tasks that are sensitive to local causal errors.
- StepOPSD decomposes agent trajectories into action-centered segments for credit redistribution.
- The framework ranked first on tasks such as ALFWorld Heat and PickTwo.
- The findings suggest that step-aware distillation is beneficial when trajectory-level rewards are weakly aligned with local actions.
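To make the idea of action-centered segments concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Turn` type, the role labels, the rule that a segment ends at each action, and the proportional weighting are all assumptions chosen for illustration. The summary above states only that trajectories are decomposed into action-centered segments and that credit is redistributed across them.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Turn:
    # Hypothetical role labels: "observation", "thought", or "action".
    role: str
    text: str


def segment_by_action(turns: Sequence[Turn]) -> List[List[Turn]]:
    """Group turns into segments, each ending at exactly one action.

    Assumption: a segment is the context leading up to an action plus
    the action itself. The paper may define segment boundaries differently.
    """
    segments: List[List[Turn]] = []
    current: List[Turn] = []
    for turn in turns:
        current.append(turn)
        if turn.role == "action":
            segments.append(current)
            current = []
    # Trailing turns without an action (e.g. a final observation) are
    # attached to the last segment, or kept alone if no action occurred.
    if current:
        if segments:
            segments[-1].extend(current)
        else:
            segments.append(current)
    return segments


def redistribute_reward(trajectory_reward: float,
                        weights: Sequence[float]) -> List[float]:
    """Split one trajectory-level reward across segments.

    Assumption: credit is shared in proportion to per-segment weights,
    with a uniform split when the weights carry no signal.
    """
    total = sum(weights)
    if total <= 0:
        return [trajectory_reward / len(weights) for _ in weights]
    return [trajectory_reward * w / total for w in weights]
```

For example, a trajectory with the turns observation, thought, action, observation, action yields two segments, and `redistribute_reward(1.0, [1.0, 3.0])` assigns three quarters of the reward to the second segment. How the weights are obtained is the substantive part of any such method and is left open here.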
Opening excerpt (first ~120 words)
arXiv:2605.27140 (cs) [Submitted on 26 May 2026]
Title: StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
Authors: Yanfei Zhang, Xu Lin, Chenglin Wu
Abstract: Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.