Why coding agents are so far ahead
and what it’ll take to bring reliable agents into every domain
Coding agents have proliferated faster and gone further than agents in any other domain. Claude Code and its peers now run autonomously for hours on tasks that would take human engineers a full workday. METR has been tracking the time horizon of frontier models: the length of software task, measured by how long it takes a human, that a model completes with 50% success. That horizon has been doubling roughly every 7 months since 2019, with the pace accelerating to about every 4 months in 2024-2025.1 Agents in other domains are nowhere close.
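To make those doubling times concrete, here is a back-of-the-envelope sketch. It assumes clean exponential growth, which is a simplification of METR's fitted trend, and the function name is mine:

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """How many times longer the time horizon gets after `months`,
    assuming it doubles every `doubling_months` (pure exponential growth)."""
    return 2 ** (months / doubling_months)


# Compare one year of progress at the two doubling rates quoted above.
for doubling in (7, 4):
    factor = growth_factor(12, doubling)
    print(f"doubling every {doubling} months -> about {factor:.1f}x per year")
```

Under that assumption, a 7-month doubling time works out to roughly 3.3x per year, and a 4-month doubling time to 8x per year.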
The usual explanations are valid but incomplete. Models understand code well, open source training data is abundant, agents are good at using the terminal, and version control makes iteration and collaboration natural.
But I believe the most important factor is that coding agents have a better feedback loop than agents in any other domain. That loop depends on two properties:
Verifiability with precise error attribution -> you can grade an outcome against ground truth and point to exactly where it went wrong
Cheap, realistic iteration -> you can run thousands of cycles in an environment that closely resembles deployment
An agent writes code, runs and tests it in a sandbox VM similar to production, and gets a ground-truth signal in seconds. When it fails, the stack trace or failing test usually points the agent to the exact line of code that caused the failure.
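Here is a toy version of that loop in Python, to show what "precise error attribution" looks like in practice. It is a sketch, not how any particular coding agent is implemented: `run_and_attribute` and the `<candidate>` / `<test>` labels are names I made up, and a real agent would run the code in an isolated sandbox rather than with `exec`.

```python
import traceback


def run_and_attribute(source: str, test: str):
    """Run candidate code plus a test; return (passed, line_no, error).

    line_no is the line of the candidate source where a runtime error was
    raised, or None if the failure did not originate inside that source
    (for example a failed assertion in the test itself).
    """
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        exec(compile(test, "<test>", "exec"), namespace)
    except Exception as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        # Walk from the innermost frame outward to find candidate code.
        for frame in reversed(frames):
            if frame.filename == "<candidate>":
                return False, frame.lineno, repr(exc)
        return False, None, repr(exc)
    return True, None, ""


candidate = (
    "def mean(xs):\n"
    "    total = sum(xs)\n"
    "    return total / len(xs)\n"
)
result = run_and_attribute(candidate, "assert mean([]) == 0")
print(result)
```

The empty-list test fails inside `mean`, and the traceback pins the failure to line 3 of the candidate, the division. That is the kind of signal a customer service or sales agent never gets.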
The Anthropic compiler experiment illustrates this well on a long-horizon task. In February 2026, Anthropic ran parallel agents using Claude Opus 4.6 to build a C compiler in Rust capable of compiling the Linux kernel.2 After roughly 2,000 sessions and $20,000 in API costs, the agents produced a functional 100,000-line compiler. While the code (published on GitHub) is a bit of a mess, it is an insanely impressive accomplishment for agents to have autonomously (for the most part) built such a complex piece of software.
The experiment worked because of a tight and accurate testing/validation loop. Nicholas Carlini at Anthropic used GCC (the de facto standard open-source compiler) as a known-good oracle. An oracle, in testing, is a trusted reference that tells you what the correct output should be.
For software validation, the test can be many things, like whether the syntax is correct, whether or not the code ran, whether it passes specific unit/integration tests, etc. But since the end goal was building a compiler, GCC was the perfect oracle. Carlini compiled most of the Linux kernel with GCC and a random subset with Claude’s compiler. Any divergence between the two outputs was, by definition, a bug in Claude’s version. Each failure pointed to a specific code path that the agents could then be directed to find and fix.
The loops ran in virtual machines, which are similar to real deployment environments and relatively cheap and fast to spin up and iterate on at scale.
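To show how an oracle turns "something is broken" into "this file is broken", here is a minimal sketch. It uses plain bisection rather than the random-subset approach described above, and it is not Carlini's actual harness. `isolate_failure` and `build_ok` are hypothetical names: `build_ok(subset)` stands in for "compile this subset with the candidate compiler, everything else with the oracle, and check that the result still works". The sketch also assumes failures are independent, so any build that includes a miscompiled file fails.

```python
from typing import Callable, FrozenSet, Optional, Sequence


def isolate_failure(
    units: Sequence[str],
    build_ok: Callable[[FrozenSet[str]], bool],
) -> Optional[str]:
    """Return one unit the candidate compiler gets wrong, or None.

    build_ok(subset) is a stand-in: it should build `subset` with the
    candidate, the rest with the trusted oracle, and report success.
    """
    suspects = list(units)
    if build_ok(frozenset(suspects)):
        return None  # candidate handles every unit correctly
    while len(suspects) > 1:
        mid = len(suspects) // 2
        first_half = suspects[:mid]
        if build_ok(frozenset(first_half)):
            # First half is clean, so a bad unit is in the second half.
            suspects = suspects[mid:]
        else:
            suspects = first_half
    return suspects[0]


# Simulated run: pretend the candidate miscompiles exactly one file.
files = [f"file{i}.c" for i in range(16)]
miscompiled = {"file11.c"}
culprit = isolate_failure(files, lambda subset: not (subset & miscompiled))
print(culprit)
```

With 16 files this takes five builds to find the culprit. The point is that a trusted oracle makes every failure localizable, which is exactly what the human-facing domains below are missing.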
Compare this to customer service agents, where outcomes are much more subjective. Customer service agents fail the verifiability test because there is no oracle for the right response to a frustrated human. Humans are varied and complicated! And there’s no way to attribute the outcome to a specific sentence: you can’t tell which line in the agent’s response caused the customer to churn or stay.
Similarly, AI sales rep agents fail the iteration speed test. The sales feedback cycle can run for months. There is no GCC for determining whether a cold email was perfectly crafted, and even if there were, it would probably take a multi-month sales cycle to find out whether it was right. Even with a perfect oracle, the loop is too slow to drive learning.
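The gap in iteration speed is easy to underestimate, so here is the arithmetic. The 30-second test loop and the 90-day sales cycle are illustrative numbers I picked, not measurements, and this counts a single sequential loop with no parallelism:

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def cycles_per_year(loop_seconds: float) -> float:
    """Feedback cycles one sequential loop can complete in a year."""
    return SECONDS_PER_YEAR / loop_seconds


coding_cycles = cycles_per_year(30)                   # assumed 30 s test loop
sales_cycles = cycles_per_year(90 * SECONDS_PER_DAY)  # assumed 90-day cycle

print(f"coding: {coding_cycles:,.0f} cycles/year")
print(f"sales:  {sales_cycles:.1f} cycles/year")
```

Under those assumptions a single coding loop gets about a million feedback signals a year, while a single sales loop gets about four.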
These are structural problems for domains involving humans. But the oracles and simulation environments don’t have to be perfect. They just have to be good enough and fast enough to drive learning.
Verifiable, cheap, and realistic loops are possible and already exist in domains much messier than code. Waymo’s simulator is a perfect example.3 Driving is chaotic: pedestrians, cyclists weaving through traffic, other drivers behaving unpredictably, weather, construction, and a whole host of edge cases. Waymo has built a simulator that runs roughly 25,000 virtual vehicles covering up to 10 million simulated miles per day. The other road users in it are reactive agents (simulated pedestrians, cyclists, and drivers) that respond to how the Waymo moves, and their behavior is calibrated against millions of miles of real-world driving data.
Of course, there is no ground truth for “did the car drive well” the way GCC is ground truth for compiling code. But the simulator is good enough, and calibrated on enough real-world data, to validate that trips can be completed safely and efficiently. The sandbox isn’t a perfect physics simulator, but it is plausible enough to mimic real-world behavior, and fast and inexpensive enough to run millions of simulated miles every day. The iterations are both verifiable and cheap.
I believe the next wave of agent progress will come from people manufacturing realistic simulations with fast iterations for domains that don’t have them yet. My guess is we will get progress from constructed oracles for narrow slices of a domain rather than full-fidelity simulations of the whole thing.
If you’re building simulation environments, constructed oracles, or fast feedback loops for messy domains, I want to hear from you!

