Token-level eval harness for tool-calling agents: what we wired up

May 26, 2026 · 4:03 PM UTC · 4 min read

TL;DR · WeSearch summary

Nexus Labs built a token-level evaluation harness for tool-calling agents. It replaces a single pass/fail metric with four separate signals, so failures in tool selection and argument handling that the pass rate used to hide now show up in the results.

Key facts

▪ The evaluation harness measures tool selection accuracy, argument F1, recovery rate, and trajectory length delta separately.
▪ Previously, a single pass rate of 73% hid problems with tool selection and argument handling, which prompted the redesign.
▪ Evaluations can run against multiple models through a single OpenAI-compatible endpoint, without rewriting the harness.
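The four signals listed above can be sketched in a few lines of Python. This is a minimal illustration, not the harness from the article: the `ToolCall` type, its `errored` flag, and the exact definition of each metric (position-wise tool matching, F1 over argument key/value pairs, recovery as a later successful call to the same tool, delta as extra steps over a reference trajectory) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """Hypothetical record of one tool call in a trajectory."""
    name: str
    args: dict = field(default_factory=dict)
    errored: bool = False  # assumption: the harness knows whether the call failed


def tool_selection_accuracy(expected: list, actual: list) -> float:
    """Fraction of expected steps where the same tool was chosen at the same position."""
    if not expected:
        return 1.0
    hits = sum(1 for e, a in zip(expected, actual) if e.name == a.name)
    return hits / len(expected)


def argument_f1(expected_args: dict, actual_args: dict) -> float:
    """F1 over (key, value) pairs of one call's arguments."""
    exp = {(k, repr(v)) for k, v in expected_args.items()}
    act = {(k, repr(v)) for k, v in actual_args.items()}
    if not exp and not act:
        return 1.0
    overlap = len(exp & act)
    if overlap == 0:
        return 0.0
    precision = overlap / len(act)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def recovery_rate(actual: list) -> Optional[float]:
    """Share of errored calls followed by a later successful call to the same tool.

    Returns None when the trajectory contains no errors.
    """
    errors = [i for i, call in enumerate(actual) if call.errored]
    if not errors:
        return None
    recovered = sum(
        1
        for i in errors
        if any(c.name == actual[i].name and not c.errored for c in actual[i + 1:])
    )
    return recovered / len(errors)


def trajectory_length_delta(expected: list, actual: list) -> int:
    """Extra (positive) or missing (negative) steps relative to the reference."""
    return len(actual) - len(expected)
```

Reporting each value on its own is the point of the split: a trajectory can pick the right tools and still score poorly on argument F1, which a single pass rate would not show.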

Original article

DEV.to (Top)

Opening excerpt (first ~120 words)

Marcus Chen · Posted on May 26 · #mlops #llm #machinelearning #devops

TL;DR: We replaced our "did the agent finish the task" pass/fail eval with a token-level harness that scores tool selection, argument shape, and recovery behavior separately. Pass rate went from a single 73% number to four signals that actually tell us what broke.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
