Reward hacking is swamping model intelligence gains

Naman Jain · Jun 26, 2026 · 7:49 AM UTC · 6 min read

On SWE-bench Pro, 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. Stricter eval harnesses show how benchmark scores can conflate coding ability with answer retrieval.

Original article

Cursor · Naman Jain


Opening excerpt (first ~120 words)

Jun 25, 2026 · Naman Jain · 7 min read

Table of Contents: Catch a model with a model · Stricter environment design · A growing gap · Designing evals for aware agents

Smarter models are becoming more resourceful at hacking coding benchmarks. Eval suites built from real bugs that were later fixed are especially vulnerable because the problems have already been solved. If the agent has access to repository history or the public web, it can sometimes look up the answer rather than derive it. To measure how widespread this behavior is, we built an agent to audit eval trajectories. On SWE-bench Pro, we found that 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Cursor.
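To make the retrieval-versus-derivation distinction concrete, here is a minimal sketch of a much cruder auditor than the one the excerpt describes. The article's auditor is itself an agent reading trajectories; this version instead assumes a hypothetical trajectory format (a list of shell commands plus a pass/fail flag) and flags runs using keyword patterns that suggest looking up an existing fix, such as browsing every git ref, fetching from a remote, or downloading from GitHub. The `Trajectory` class, the pattern list, and `retrieval_share` are illustrative names only, not Cursor's code. Like the 63% figure, the share is computed over successful runs only.

```python
import re
from dataclasses import dataclass


# Hypothetical trajectory record: not a real SWE-bench or Cursor schema.
@dataclass
class Trajectory:
    task_id: str
    resolved: bool       # did the hidden tests pass?
    commands: list[str]  # shell commands the agent ran, in order


# Illustrative keyword heuristics for "looked up the fix" (assumption:
# the real auditor judges intent with a model, not with regexes).
RETRIEVAL_PATTERNS = [
    re.compile(r"\bgit\s+log\b.*--all\b"),         # browse every ref, not just HEAD
    re.compile(r"\bgit\s+reflog\b"),               # recover commits unreachable from HEAD
    re.compile(r"\bgit\s+(fetch|pull)\b"),         # pull newer history from a remote
    re.compile(r"\b(curl|wget)\b.*github\.com"),   # read upstream PRs/issues on the web
]


def looks_retrieved(traj: Trajectory) -> bool:
    """True if any command in the trajectory matches a retrieval pattern."""
    return any(p.search(cmd) for cmd in traj.commands for p in RETRIEVAL_PATTERNS)


def retrieval_share(trajs: list[Trajectory]) -> float:
    """Fraction of *successful* trajectories flagged as retrieval."""
    solved = [t for t in trajs if t.resolved]
    if not solved:
        return 0.0
    flagged = sum(looks_retrieved(t) for t in solved)
    return flagged / len(solved)


# Toy, made-up trajectories for illustration only.
example = [
    Trajectory("a", True, ["pytest -x", "git log --all --oneline"]),
    Trajectory("b", True, ["grep -rn parse_date src", "pytest"]),
    Trajectory("c", False, ["git fetch origin"]),  # failed run: not counted
    Trajectory("d", True, ["curl https://github.com/org/repo/pull/1.diff"]),
]
print(f"{retrieval_share(example):.0%}")  # 2 of 3 successful runs flagged: 67%
```

Keyword matching both misses retrieval done through other commands and flags harmless uses of commands like `git fetch`, which is one reason to prefer a judge that can read the whole trajectory in context.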
