
Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3

7 min read
Tags: ai research · machine learning · benchmarking · reasoning · model analysis · GPT-5.5 · Opus 4.7 · ARC-AGI-3 · OpenAI · Anthropic · Greg Kamradt · Codex · Claude Code
⚡ TL;DR · AI summary

GPT-5.5 and Opus 4.7 were evaluated on the ARC-AGI-3 benchmark, which exposes not just final scores but the reasoning behind them on novel problem-solving tasks. Reviewing 160 replays and reasoning traces, the analysis found common failure modes in which models observed the correct local effects of their actions yet still misinterpreted those actions and built inaccurate world models. The ARC Prize team open-sourced its analysis package to support further study of AI decision-making in ambiguous environments.

Original article: ARC Prize — Read the full article at ARC Prize →
Opening excerpt (first ~120 words)

By Greg Kamradt · Published 01 May 2026

AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome. This week we went through 160 replays and reasoning traces from OpenAI’s GPT-5.5 and Anthropic’s Opus 4.7 attempting novel, long-horizon environments. The scores were just one data point, but the interesting story is how they achieved their score. Today we’re open-sourcing our analysis package.

Excerpt limited to ~120 words for fair-use compliance. The full article is at ARC Prize.
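
The team's actual open-sourced package is linked from the full article. Purely as an illustration of what this kind of replay triage could look like, here is a minimal Python sketch. The replay directory, JSON schema, and field names (`model`, `solved`) are assumptions for the example, not the real ARC-AGI-3 replay format or the ARC Prize package's API.

```python
# Hypothetical sketch of replay triage like the analysis the article describes.
# The replay schema ("model", "solved") is an assumption for illustration,
# NOT the actual ARC-AGI-3 format or the team's open-sourced package.
import json
from collections import Counter
from pathlib import Path


def summarize_replays(replay_dir: str) -> dict[str, Counter]:
    """Tally solved/unsolved environments per model from a folder of replay JSON files."""
    tallies: dict[str, Counter] = {}
    for path in Path(replay_dir).glob("*.json"):
        replay = json.loads(path.read_text())
        model = replay["model"]  # e.g. "gpt-5.5" or "opus-4.7" (assumed field)
        outcome = "solved" if replay["solved"] else "unsolved"
        tallies.setdefault(model, Counter())[outcome] += 1
    return tallies


if __name__ == "__main__":
    for model, counts in summarize_replays("replays/").items():
        total = sum(counts.values())
        print(f"{model}: {counts['solved']}/{total} environments solved")
```

A per-model tally like this would only reproduce the scores; the article's point is that the replays and reasoning traces behind each outcome carry the more interesting signal.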
