
GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repos

14 min read
#ai coding models · #model comparison · #code generation · #benchmarking · #open source
⚡ TL;DR · AI summary

GPT-5.5 outperformed GPT-5.4 and Opus 4.7 on coding tasks across two open-source repositories, achieving higher test pass rates and higher code-review acceptance. While Opus 4.7 produced smaller patches, it often omitted required implementation details, leading to incomplete solutions. The results underscore the importance of repo-specific evaluation, as model performance varied significantly between the two codebases.
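To make the repo-specific point concrete, here is a minimal sketch of how per-repo pass rates could be aggregated from individual task results. The task counts match the article (27 Zod tasks, 29 graphql-go-tools tasks), but the result entries below are hypothetical placeholders, not the article's data.

```python
from collections import defaultdict

# Hypothetical per-task results: (repo, model, tests_passed).
# The article's real per-task data is not reproduced here.
results = [
    ("zod", "gpt-5.5", True),
    ("zod", "opus-4.7", False),
    ("graphql-go-tools", "gpt-5.5", True),
    # ... one entry per (task, model) run: 56 tasks x 3 models in total
]

def pass_rates(results):
    """Aggregate pass rate per (repo, model) pair."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for repo, model, ok in results:
        total[(repo, model)] += 1
        passed[(repo, model)] += ok
    return {key: passed[key] / total[key] for key in total}

# Reporting per repo rather than one pooled number is what surfaces
# the between-codebase variance the summary highlights.
for (repo, model), rate in sorted(pass_rates(results).items()):
    print(f"{repo:20s} {model:10s} {rate:.0%}")
```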

Original article: read the full post at Stet →
Opening excerpt (first ~120 words)

GPT-5.5 vs GPT-5.4 vs Opus 4.7 on 56 real coding tasks from 2 open source repos · May 1, 2026

Opus 4.7 writes smaller patches. GPT-5.5 writes patches that more often survive review. Which one you want depends on whether "small" means disciplined or incomplete in your repo. I ran both models, plus GPT-5.4, on 56 real coding tasks from two open-source repos: 27 tasks from Zod and 29 from graphql-go-tools (these codebases were selected arbitrarily and may not represent your experience; that's exactly why running your own benchmarks matters!). Each model ran in its native agent harness at default settings: Anthropic models in Claude Code, OpenAI models in OpenAI Codex CLI.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Stet.
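
The excerpt's setup (each model running in its native agent harness at default settings) can be sketched as a simple dispatch loop. The command lines below are illustrative assumptions, not the article's actual invocations; `claude` and `codex` are the Claude Code and Codex CLI entry points, and the model labels mirror the article.

```python
import subprocess

# Assumed mapping from model to its native harness command.
# The flags are an assumption for illustration; consult each CLI's
# documentation for the real non-interactive invocation.
HARNESS = {
    "opus-4.7": ["claude", "-p"],   # Claude Code, headless prompt mode
    "gpt-5.4": ["codex", "exec"],   # OpenAI Codex CLI, non-interactive
    "gpt-5.5": ["codex", "exec"],
}

def run_task(model: str, task_prompt: str, repo_dir: str) -> int:
    """Run one coding task in the model's native harness at defaults."""
    cmd = HARNESS[model] + [task_prompt]
    proc = subprocess.run(cmd, cwd=repo_dir)
    return proc.returncode

# After each run, the repo's own test suite decides pass/fail, e.g.:
#   Zod:              npx vitest run
#   graphql-go-tools: go test ./...
```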

