Skills Without Evals Are Just Markdown and Hope
Daniel Sogl developed an Anthropic Agent Skill for @ngrx/signals and evaluated it using benchmarks, token usage, and a description optimizer, finding it improved pass rates from 84% to 100% but increased latency and cost. Despite the performance gain, the skill's description did not improve through optimization, and perfect scores raised concerns about eval saturation. The results highlight that most teams deploy AI skills without proper evaluation, risking ineffective or underused functionality.
Daniel Sogl · Posted on May 1

#claude #ai #angular #ngrx

TL;DR. I built an Anthropic Agent Skill for @ngrx/signals and ran it through the full eval pipeline: capability A/B benchmarks, token and wall-time accounting, and a description-optimizer loop. The skill lifts pass rate from 84% to 100%. It also adds 14 seconds and ~12,000 tokens per invocation (about $0.04 at Sonnet 4.6 input pricing).
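The per-invocation cost figure in the TL;DR follows from simple token arithmetic. A minimal sketch of that calculation, assuming input pricing of $3 per million tokens (the function name and rate are illustrative assumptions, not from the article; check current Anthropic pricing):

```python
# Hedged sketch: estimating per-invocation overhead cost from extra tokens.
# The $3/MTok input rate is an assumed figure for illustration.
def invocation_cost(extra_tokens: int, usd_per_million_tokens: float = 3.0) -> float:
    """Return the marginal USD cost of the extra input tokens a skill adds."""
    return extra_tokens / 1_000_000 * usd_per_million_tokens

cost = invocation_cost(12_000)
print(f"${cost:.3f} per invocation")  # ~$0.036, consistent with "about $0.04"
```

Under those assumptions, 12,000 extra tokens works out to roughly $0.036 per call, matching the article's "about $0.04" estimate.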
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV Community.