Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3
GPT-5.5 and Opus 4.7 were evaluated on the ARC-AGI-3 benchmark, which exposes not just scores but the reasoning process behind them on novel problem-solving tasks. The analysis identified common failure modes in which models correctly observe the local effects of their actions yet still build incorrect world models. After reviewing 160 replays and reasoning traces to better understand AI decision-making in ambiguous environments, the researchers open-sourced their analysis package.
- GPT-5.5 achieved an ARC-AGI-3 score of 0.43%, outperforming Opus 4.7, which scored 0.18%.
- Three primary failure modes were identified: a true local effect paired with a false world model, the wrong level of abstraction carried over from training data, and solving a level without learning the underlying game mechanics (see the sketch after this list).
- ARC-AGI-3 consists of 135 hand-crafted environments designed to test AI's ability to adapt to novelty without relying on cultural or prior knowledge.
- The analysis included over 1,000,000 games played and used human-validated strategies to compare model reasoning against ground truth solutions.
- Models often misinterpreted ARC-AGI-3 mechanics by mapping them to known games like Tetris, Sokoban, or Pong, leading to incorrect strategies.
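To make the failure-mode taxonomy concrete, here is a minimal, hypothetical Python sketch of how annotated replays might be bucketed into the three categories. The dataclass fields, heuristics, and their ordering are assumptions made for illustration; they are not taken from the authors' open-sourced analysis package.

```python
# Hypothetical sketch only: class names, fields, and heuristics are illustrative
# and do not reproduce the authors' open-sourced analysis package.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FailureMode(Enum):
    """The three failure modes described in the article."""
    FALSE_WORLD_MODEL = auto()        # true local effect, false world model
    WRONG_ABSTRACTION = auto()        # mechanics mapped onto a known game (e.g. Tetris)
    SOLVED_WITHOUT_LEARNING = auto()  # level cleared without learning the mechanics


@dataclass
class ReplayAnnotation:
    """Minimal stand-in for one annotated replay (hypothetical fields)."""
    local_effects_correct: bool    # did the model describe each action's effect correctly?
    world_model_correct: bool      # did its stated rules match the environment's rules?
    referenced_known_game: bool    # did reasoning traces invoke Tetris/Sokoban/Pong analogies?
    level_cleared: bool


def classify(replay: ReplayAnnotation) -> Optional[FailureMode]:
    """Assign a replay to one failure mode, or None if no failure was observed."""
    if replay.referenced_known_game and not replay.world_model_correct:
        return FailureMode.WRONG_ABSTRACTION
    if replay.local_effects_correct and not replay.world_model_correct:
        return FailureMode.FALSE_WORLD_MODEL
    if replay.level_cleared and not replay.world_model_correct:
        return FailureMode.SOLVED_WITHOUT_LEARNING
    return None


if __name__ == "__main__":
    example = ReplayAnnotation(
        local_effects_correct=True,
        world_model_correct=False,
        referenced_known_game=False,
        level_cleared=False,
    )
    print(classify(example))  # FailureMode.FALSE_WORLD_MODEL
```

The check order matters in this sketch: a replay whose reasoning invokes a known-game analogy is tagged as a wrong abstraction before the more general false-world-model check can claim it.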
Opening excerpt (first ~120 words)
By Greg Kamradt · Published 01 May 2026

Analyzing GPT-5.5 & Opus 4.7 with ARC-AGI-3

AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome. This week we went through 160 replays and reasoning traces from OpenAI's GPT-5.5 and Anthropic's Opus 4.7 attempting novel, long-horizon environments. The scores were just one data point, but the interesting story is how they achieved their score. Today we're open-sourcing our analysis package.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at ARC Prize.