AI Can Find the Code. It Didn't Know How the System Worked
21 bug fixes, two models, same failures. Better LLMs marginally improve things, but still fail on system boundaries and integration.
Full article excerpt:
AI Can Find the Code. It Didn't Know How the System Worked.
Why LLMs fail on real codebases
Posted on April 26, 2026 by Adam Wespiser

A quick change for a simple feature. The task was to add a basic UI panel to show a warning, a couple of API calls to verify references, and a little data validation. The hard part? Modifying a complex monorepo sitting at the core of our business. Not large like “been working on it for a while”, but large like thousands of contributors going back to the second Bush Administration. Even worse, the sections of code I’d be working on had lain dormant for nearly 10 years. But even that wasn’t the real problem. The real problem was that the solution depended on information that wasn’t locally visible.

The expectation was simple: AI would compress the onboarding entirely and we’d ship something fast. Get in and get out, and let the AI do the work. I jumped right in: creating an implementation plan, resolving classes to specific files, and decomposing the plan into small, verifiable steps. The context soon filled up, but running the AI coding agent in a sub-folder unblocked me.

Now, instead of fixing things, the AI agent was using libraries incorrectly, modifying the wrong files, and sometimes inserting nonsensical changes. The implementation plan hadn’t just gone off the rails; it never found the rails to begin with. The suggested fixes were in the correct place, but the agent made category errors about how the application worked. Template re-rendering, the dependency injection registry, missing surface area: all were a problem. The agent found the files. But it didn’t know how the system worked.

AI Non-obviousness

The fix wasn’t in the code the agent was looking at. It was somewhere else in the system. That’s what I mean by non-obviousness: the degree to which the information required to solve a task is not directly specified and must be discovered, selected, and composed from a large search space.

A quick pilot study

To understand how LLM coding agents fail, I used Claude Sonnet 4.6 to run a quick pilot. Sonnet 4.6 is close to what my team uses in practice: not the best model, but representative of AI agents in the wild. For the repo, I picked Jenkins. It’s 2 million lines of code and architecturally similar to the codebase I work in. I found 21 bug-fix commits, each with at least one test, that modify source files. For each test, I ran the following loop (a minimal sketch of this loop appears at the end of the excerpt):

1. Check out the commit before the bug fix
2. Add the new test and confirm it fails
3. Prompt the LLM to fix the issue
4. Re-run the test and record the result

I used two different prompts: a full description of the commit with the specific files, and just the issue description from the bug tracker.

Results: with the full description, Claude was able to fix 61.9% of the bugs; with just the issue description, 57.1%, with overlapping results for all but one bug. I expected this to be a code-search problem, but even when given the exact files, the model failed in the same way. None of the results were hallucinations: the code changes were coherent, they simply lacked understanding of concepts outside the implicated files. This is what I saw before: system correctness appears non-obvious at the location of the code change.

Results Deep Dive

To understand why the agent failed, I dove into a few examples where Claude was unable to fix the issue, even when the exact files needed for the change were given to the agent. Here’s what I found: PR #23859 is about API token expiration, with the required fix extending the doGenerateNewToken()…
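As a concrete illustration, here is a minimal sketch of the evaluation loop from the pilot study above. It is not the author's actual harness: the Maven test invocation, the repo handling, and the ask_agent_to_fix() helper are assumptions made for illustration, on the premise that Jenkins tests can be run with Maven and that the agent can edit the working tree.

```python
# Minimal sketch of the pilot-study loop: check out the pre-fix commit, add the
# fix's test, confirm it fails, let a coding agent attempt a fix, then re-run
# the test. The Maven invocation and ask_agent_to_fix() are illustrative
# assumptions, not the author's actual harness.
import subprocess

def run(cmd, cwd):
    """Run a shell command in the repo and return (exit code, combined output)."""
    proc = subprocess.run(cmd, cwd=cwd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def ask_agent_to_fix(repo, prompt):
    """Placeholder for invoking whichever coding agent is under evaluation
    (the pilot used Claude Sonnet 4.6) so it can edit files in `repo`."""
    raise NotImplementedError

def evaluate_bugfix(repo, fix_commit, test_file, test_class, prompt):
    # 1. Check out the commit immediately before the bug fix.
    run(f"git checkout {fix_commit}~1", repo)

    # 2. Pull in only the new/updated test from the fix commit and confirm it fails.
    run(f"git checkout {fix_commit} -- {test_file}", repo)
    before, _ = run(f"mvn test -Dtest={test_class}", repo)
    assert before != 0, "test should fail before any fix is applied"

    # 3. Prompt the agent to fix the issue (full commit description with specific
    #    files, or just the bug-tracker issue description, depending on the condition).
    ask_agent_to_fix(repo, prompt)

    # 4. Re-run the test and record whether the agent's change makes it pass.
    after, _ = run(f"mvn test -Dtest={test_class}", repo)
    return after == 0
```

Under these assumptions, the two prompt conditions from the pilot would simply be two different prompt strings passed through the same loop for each of the 21 commits.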
This excerpt is published under fair use for community discussion. Read the full article at Wespiser.