A curated, non-BS library of the best resources for evaluating agents
A curated, non-BS library of the best resources for building and evaluating AI agents — papers, blogs, talks, tools, benchmarks. Maintained by BenchFlow. - benchflow-ai/awesome-evals
Awesome Agent Evals

A curated, opinionated, non-BS library of the best resources for building and evaluating AI agents: papers, blog posts, talks, courses, tools, and benchmarks. Maintained by BenchFlow.

Most "awesome" lists are link dumps. This one is annotated and verified: every entry says what it is and why it belongs, URLs are checked, quotes are verbatim, and dead/abandoned tools are pruned (not silently listed).

It was assembled by:

- a depth-4 recursive citation crawl (11.6k papers, ranked by in-degree) to surface the academic canon,
- targeted practitioner-web discovery for the industry sources citation graphs miss (Eugene Yan, Han-Chung Lee, Hamel Husain, Shreya Shankar, Nathan Lambert, …),
- 47 talks & podcasts transcribed and deep-noted (verbatim + timestamps), and
- per-section gap…
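The citation-crawl step mentioned above (follow references to a fixed depth, then rank papers by in-degree) can be illustrated with a minimal sketch. This is not the maintainers' pipeline: the function names `crawl_citations` and `rank_by_in_degree`, the `get_references` callback, and the toy graph are all hypothetical, and the sketch assumes in-degree is counted only over edges discovered during the crawl. It uses an iterative breadth-first traversal with a depth limit, which visits the same papers as a depth-limited recursion.

```python
from collections import Counter, deque


def crawl_citations(seeds, get_references, max_depth=4):
    """Follow references breadth-first, at most max_depth hops from the seeds.

    get_references is a hypothetical callback: paper id -> list of cited ids.
    Returns (set of visited papers, list of (citing, cited) edges).
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    visited = set(seeds)
    edges = []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # papers at the depth limit are recorded but not expanded
        for ref in get_references(paper):
            edges.append((paper, ref))
            if ref not in visited:
                visited.add(ref)
                queue.append((ref, depth + 1))
    return visited, edges


def rank_by_in_degree(edges):
    """Rank papers by how many crawled papers cite them (highest first)."""
    in_degree = Counter(cited for _, cited in edges)
    return in_degree.most_common()


# Toy citation graph (made-up data, for illustration only).
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}

if __name__ == "__main__":
    papers, edges = crawl_citations(["A"], lambda p: TOY_GRAPH.get(p, []))
    for paper, count in rank_by_in_degree(edges):
        print(paper, count)
```

In a real crawl, `get_references` would wrap whatever bibliographic data source is available; ranking by in-degree then favors papers that many others in the crawled set cite, which is one plausible reading of "surface the academic canon".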