
GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repos

14 min read
#ai coding models · #model comparison · #code generation · #benchmarking · #open source
⚡ TL;DR · AI summary

GPT-5.5 outperformed GPT-5.4 and Opus 4.7 on coding tasks across two open-source repositories, achieving higher test pass rates and higher code-review acceptance. While Opus 4.7 produced smaller patches, it often omitted required implementation details, leading to incomplete solutions. The results underscore the importance of repo-specific evaluation, as model performance varied significantly between the two codebases.
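To make the repo-specific point concrete, here is a minimal sketch of how per-repo pass rates could be aggregated from individual task results. The task counts match the article (27 Zod tasks, 29 graphql-go-tools tasks), but the result entries below are hypothetical placeholders, not the article's data.

```python
from collections import defaultdict

# Hypothetical per-task results: (repo, model, tests_passed).
# The article's real per-task data is not reproduced here.
results = [
    ("zod", "gpt-5.5", True),
    ("zod", "opus-4.7", False),
    ("graphql-go-tools", "gpt-5.5", True),
    # ... one entry per (task, model) run: 56 tasks x 3 models in total
]

def pass_rates(results):
    """Aggregate pass rate per (repo, model) pair."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for repo, model, ok in results:
        total[(repo, model)] += 1
        passed[(repo, model)] += ok
    return {key: passed[key] / total[key] for key in total}

# Reporting per repo rather than one pooled number is what surfaces
# the between-codebase variance the summary highlights.
for (repo, model), rate in sorted(pass_rates(results).items()):
    print(f"{repo:20s} {model:10s} {rate:.0%}")
```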

Original article: read the full post at Stet →
Opening excerpt (first ~120 words)

GPT-5.5 vs GPT-5.4 vs Opus 4.7 on 56 real coding tasks from 2 open source repos · May 1, 2026

Opus 4.7 writes smaller patches. GPT-5.5 writes patches that more often survive review. Which one you want depends on whether "small" means disciplined or incomplete in your repo. I ran both models, plus GPT-5.4, on 56 real coding tasks from two open-source repos: 27 tasks from Zod and 29 from graphql-go-tools (these codebases were selected arbitrarily and may not represent your experience; that's exactly why running your own benchmarks matters!). Each model ran in its native agent harness at default settings: Anthropic models in Claude Code, OpenAI models in OpenAI Codex CLI.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Stet.
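
The excerpt's setup (each model running in its native agent harness at default settings) can be sketched as a simple dispatch loop. The command lines below are illustrative assumptions, not the article's actual invocations; `claude` and `codex` are the Claude Code and Codex CLI entry points, and the model labels mirror the article.

```python
import subprocess

# Assumed mapping from model to its native harness command.
# The flags are an assumption for illustration; consult each CLI's
# documentation for the real non-interactive invocation.
HARNESS = {
    "opus-4.7": ["claude", "-p"],   # Claude Code, headless prompt mode
    "gpt-5.4": ["codex", "exec"],   # OpenAI Codex CLI, non-interactive
    "gpt-5.5": ["codex", "exec"],
}

def run_task(model: str, task_prompt: str, repo_dir: str) -> int:
    """Run one coding task in the model's native harness at defaults."""
    cmd = HARNESS[model] + [task_prompt]
    proc = subprocess.run(cmd, cwd=repo_dir)
    return proc.returncode

# After each run, the repo's own test suite decides pass/fail, e.g.:
#   Zod:              npx vitest run
#   graphql-go-tools: go test ./...
```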

