Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3
GPT-5.5 and Opus 4.7 were evaluated on the ARC-AGI-3 benchmark, which exposes not just scores but the reasoning process behind them on novel problem-solving tasks. The analysis identified common failure modes in which models correctly observe the local effects of their actions yet still build incorrect world models. After reviewing 160 replays and reasoning traces to better understand AI decision-making in ambiguous environments, the researchers open-sourced their analysis package.
- GPT-5.5 achieved an ARC-AGI-3 score of 0.43%, outperforming Opus 4.7, which scored 0.18%.
- Three primary failure modes were identified: a true local effect paired with a false world model, the wrong level of abstraction carried over from training data, and solving a level without learning the underlying game mechanics (see the sketch after this list).
- ARC-AGI-3 consists of 135 hand-crafted environments designed to test AI's ability to adapt to novelty without relying on cultural or prior knowledge.
- The analysis included over 1,000,000 games played and used human-validated strategies to compare model reasoning against ground truth solutions.
- Models often misinterpreted ARC-AGI-3 mechanics by mapping them to known games like Tetris, Sokoban, or Pong, leading to incorrect strategies.
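To make the failure-mode taxonomy concrete, here is a minimal, hypothetical Python sketch of how annotated replays might be bucketed into the three categories. The dataclass fields, heuristics, and their ordering are assumptions made for illustration; they are not taken from the authors' open-sourced analysis package.

```python
# Hypothetical sketch only: class names, fields, and heuristics are illustrative
# and do not reproduce the authors' open-sourced analysis package.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FailureMode(Enum):
    """The three failure modes described in the article."""
    FALSE_WORLD_MODEL = auto()        # true local effect, false world model
    WRONG_ABSTRACTION = auto()        # mechanics mapped onto a known game (e.g. Tetris)
    SOLVED_WITHOUT_LEARNING = auto()  # level cleared without learning the mechanics


@dataclass
class ReplayAnnotation:
    """Minimal stand-in for one annotated replay (hypothetical fields)."""
    local_effects_correct: bool    # did the model describe each action's effect correctly?
    world_model_correct: bool      # did its stated rules match the environment's rules?
    referenced_known_game: bool    # did reasoning traces invoke Tetris/Sokoban/Pong analogies?
    level_cleared: bool


def classify(replay: ReplayAnnotation) -> Optional[FailureMode]:
    """Assign a replay to one failure mode, or None if no failure was observed."""
    if replay.referenced_known_game and not replay.world_model_correct:
        return FailureMode.WRONG_ABSTRACTION
    if replay.local_effects_correct and not replay.world_model_correct:
        return FailureMode.FALSE_WORLD_MODEL
    if replay.level_cleared and not replay.world_model_correct:
        return FailureMode.SOLVED_WITHOUT_LEARNING
    return None


if __name__ == "__main__":
    example = ReplayAnnotation(
        local_effects_correct=True,
        world_model_correct=False,
        referenced_known_game=False,
        level_cleared=False,
    )
    print(classify(example))  # FailureMode.FALSE_WORLD_MODEL
```

The check order matters in this sketch: a replay whose reasoning invokes a known-game analogy is tagged as a wrong abstraction before the more general false-world-model check can claim it.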
Opening excerpt (first ~120 words)
By Greg Kamradt · Published 01 May 2026

Analyzing GPT-5.5 & Opus 4.7 with ARC-AGI-3

AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome. This week we went through 160 replays and reasoning traces from OpenAI's GPT-5.5 and Anthropic's Opus 4.7 attempting novel, long-horizon environments. The scores were just one data point, but the interesting story is how they achieved their score. Today we're open-sourcing our analysis package.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at ARC Prize.