CVE-Bench: testing LLM agents on real-world vulnerability patches

CVE-Bench · May 29, 2026 · 7:28 PM UTC · 20 min read

#ai #security #vulnerabilities #benchmarking #technology

via GitHub

TL;DR · WeSearch summary

A recent evaluation of AI models for fixing security vulnerabilities revealed mixed results. The CVE-Bench benchmark tested five models on 20 real-world CVEs, finding that no model consistently resolved vulnerabilities. The best-performing model achieved a 60% success rate under optimal conditions, highlighting the challenges AI faces in this domain.

Key facts

▪ The CVE-Bench benchmark was created to assess AI models' ability to fix real-world security vulnerabilities.
▪ Five models were tested on 20 CVEs, with the highest success rate being 60% under the best conditions.
▪ The study identified structured failure modes in the models, such as wrong-search drift and budget exhaustion.

Original article

Opening excerpt (first ~120 words)

I Tested Whether AI Can Fix Security Vulnerabilities. Well, It's Complicated.

~15 min read

Correction (2026-05-28): Five security tests in the original benchmark were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions in this post have been updated.
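For readers unfamiliar with the test named in the correction, here is a minimal sketch of McNemar's test with continuity correction for comparing two models on the same set of tasks. It is not the benchmark's own code. The function name `mcnemar_cc` and the discordant counts in the usage lines are hypothetical and are not taken from the article's results.

```python
import math


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """McNemar's test with continuity correction on paired outcomes.

    b: tasks solved by model A but not by model B (hypothetical input)
    c: tasks solved by model B but not by model A (hypothetical input)
    Tasks that both models solved, or both failed, do not enter the test.
    Returns (chi-square statistic, two-sided p-value).
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return 0.0, 1.0
    # Textbook form: (|b - c| - 1)^2 / (b + c), compared to chi-square, 1 df.
    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Illustrative counts only, not the article's data.
stat, p = mcnemar_cc(b=10, c=2)
print(f"chi2 = {stat:.3f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

With only 20 CVEs the number of discordant pairs is necessarily small, and the chi-square approximation is rough at such counts; an exact binomial version of the test is a common alternative in that setting.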

…

Excerpt limited to ~120 words for fair-use compliance. The full article is on GitHub.
