CVE-Bench: testing LLM agents on real-world vulnerability patches
A recent evaluation of AI models for fixing security vulnerabilities revealed mixed results. The CVE-Bench benchmark tested five models on 20 real-world CVEs, finding that no model consistently resolved vulnerabilities. The best-performing model achieved a 60% success rate under optimal conditions, highlighting the challenges AI faces in this domain.
- The CVE-Bench benchmark was created to assess AI models' ability to fix real-world security vulnerabilities.
- Five models were tested on 20 CVEs, with the highest success rate being 60% under the best conditions.
- The study identified structured failure modes in the models, such as wrong-search drift and budget exhaustion.
Opening excerpt (first ~120 words)
I Tested Whether AI Can Fix Security Vulnerabilities. Well, It's Complicated.
~15 min read
Correction (2026-05-28): Five security tests in the original benchmark were found to reject valid alternative fixes that nonetheless addressed the reported vulnerability. Results were recalculated after correcting the tests. Solve rates increased by 3–7 points per model; the ranking order is unchanged, but cross-family pairwise comparisons that previously fell short of significance now cross α = 0.05 under McNemar with continuity correction. All affected numbers and statistical conclusions in this post have been updated.
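For readers unfamiliar with the test named in the correction, here is a minimal sketch of McNemar's test with continuity correction for comparing two models on the same set of CVEs. It is not the benchmark's own code: the function name `mcnemar_cc` and the discordant counts in the usage lines are hypothetical, chosen only to show how a small shift in paired outcomes can move a comparison across α = 0.05.

```python
import math


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """McNemar's test with continuity correction for paired pass/fail outcomes.

    b: number of CVEs solved by model A but not by model B
    c: number of CVEs solved by model B but not by model A
    CVEs that both models solved, or both failed, do not enter the statistic.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of chi-square with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical discordant counts, NOT taken from the article.
for b, c in [(6, 1), (10, 1)]:
    stat, p = mcnemar_cc(b, c)
    print(f"b={b}, c={c}: chi2={stat:.3f}, p={p:.4f}, significant={p < 0.05}")
```

With only 20 CVEs the discordant count is small, so the chi-square approximation is rough; an exact binomial version of the test is the usual alternative in that regime.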
…
Excerpt limited to ~120 words for fair-use compliance. The full article is on GitHub.