Token-level eval harness for tool-calling agents: what we wired up
Nexus Labs built a token-level evaluation harness for tool-calling agents. It replaces a single pass/fail metric with four separate signals, so failures in tool selection and argument handling that the aggregate pass rate used to hide now show up in evaluation.
- The harness measures tool selection accuracy, argument F1, recovery rate, and trajectory length delta separately.
- Previously, a single 73% pass rate hid problems with tool selection and argument handling, which prompted the redesign.
- The new system runs evaluations against multiple models through a single OpenAI-compatible endpoint, without rewriting the harness.
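The four signals named above can be sketched as small, independent scoring functions. This is a minimal illustration, not the article's implementation: the excerpt does not define the metrics, so the `ToolCall` record, its `ok` flag, the step-by-step alignment against a reference trajectory, and the "next call succeeds" definition of recovery are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool call in an agent trajectory."""
    name: str
    args: dict = field(default_factory=dict)
    ok: bool = True  # assumed flag: the call returned without an error


def tool_selection_accuracy(pred, gold):
    """Fraction of reference steps where the agent chose the same tool.

    Assumes predicted and reference calls are aligned by position.
    """
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(pred, gold) if p.name == g.name)
    return hits / len(gold)


def _arg_items(args):
    # Serialize values so nested lists and dicts become hashable.
    return {(k, json.dumps(v, sort_keys=True)) for k, v in args.items()}


def argument_f1(pred_args, gold_args):
    """F1 over exact (key, value) argument pairs for one call."""
    p, g = _arg_items(pred_args), _arg_items(gold_args)
    if not p and not g:
        return 1.0
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


def recovery_rate(pred):
    """Share of failed calls that are immediately followed by a successful one.

    Returns None when the trajectory contains no failed calls.
    """
    failures = [i for i, call in enumerate(pred) if not call.ok]
    if not failures:
        return None
    recovered = sum(1 for i in failures if i + 1 < len(pred) and pred[i + 1].ok)
    return recovered / len(failures)


def trajectory_length_delta(pred, gold):
    """Extra (positive) or missing (negative) calls relative to the reference."""
    return len(pred) - len(gold)


# Made-up example: the agent drops an argument, fails, then retries correctly.
gold = [
    ToolCall("search", {"query": "weather Paris"}),
    ToolCall("get_forecast", {"city": "Paris", "days": 3}),
]
pred = [
    ToolCall("search", {"query": "weather Paris"}),
    ToolCall("get_forecast", {"city": "Paris"}, ok=False),
    ToolCall("get_forecast", {"city": "Paris", "days": 3}),
]

print(tool_selection_accuracy(pred, gold))                   # 1.0
print(round(argument_f1(pred[1].args, gold[1].args), 2))     # 0.67
print(recovery_rate(pred))                                   # 1.0
print(trajectory_length_delta(pred, gold))                   # 1
```

In this made-up trajectory a plain pass/fail check would report success, while the separate scores show a dropped argument on the second call, a successful retry, and one extra step.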
Opening excerpt (first ~120 words)
Marcus Chen, posted on May 26
#mlops #llm #machinelearning #devops
TL;DR: We replaced our "did the agent finish the task" pass/fail eval with a token-level harness that scores tool selection, argument shape, and recovery behavior separately. Pass rate went from a single 73% number to four signals that actually tell us what broke.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to.