25 June 2026 · 4 min read

AI 'reasoning' has a cliff. Apple went and found the edge.

Apple tested frontier reasoning models on puzzles of rising complexity and watched accuracy fall off a cliff past a threshold, even when the models were handed the exact algorithm. The critics have a real point about the setup. The lesson for your hardest problems holds either way.

tsukumo

Short version: the reasoning models are sold on the hard cases, the gnarly logic a normal model fumbles. Apple ran the experiment and found the opposite of the pitch: on problems past a certain complexity, the reasoning collapses, all the way to near-zero accuracy. The most telling part is that it happened even when the model was handed the exact algorithm to follow. The paper drew sharp criticism, some of it fair. The operating lesson holds either way.

What Apple actually tested

Apple's "Illusion of Thinking" paper avoided the usual benchmarks, which are leaky and easy to memorize, and used controllable puzzles such as Tower of Hanoi, where the researchers could dial complexity up one notch at a time. They ran frontier reasoning models, including o3-mini, Claude 3.7 Sonnet Thinking, and DeepSeek R1, and watched what happened as the problems got harder.

Three regimes fell out:

Simple problems: plain models did fine, and the reasoning models often overthought them.
Medium complexity: the reasoning models earned their keep.
Hard problems: both collapsed. Accuracy did not degrade gracefully; past a threshold it fell toward zero.

A cliff, not a slope.

Three regimes Apple's puzzles fell into

Complexity	Plain model	Reasoning model
Simple	wins, fast	often overthinks
Medium	falls behind	wins, earns its keep
Hard	collapses to near-zero	collapses to near-zero

The part that should worry you

The collapse alone is interesting. The next finding is the one that matters for engineering work: the models failed to execute an algorithm even when Apple handed it to them. Given the exact steps, they still collapsed at roughly the same complexity as before. That's not a knowledge gap a bigger training run fixes. It's weak symbolic execution, the model losing the thread of a procedure it was told how to run.
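
For a sense of what "handed the algorithm" means, here is the textbook recursive Tower of Hanoi procedure in Python. It is a minimal sketch, not the prompt or pseudocode from Apple's paper; the function name `hanoi_moves` and the peg labels are our own. The procedure is only a few lines, but running it for n disks means emitting 2^n - 1 moves without losing track once.

```python
from typing import Iterator, Tuple

Move = Tuple[str, str]  # (from_peg, to_peg)


def hanoi_moves(n: int, src: str = "A", dst: str = "C", via: str = "B") -> Iterator[Move]:
    """Yield the optimal move sequence for n disks (textbook recursion)."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, src, via, dst)
    # Move the largest remaining disk to the destination.
    yield (src, dst)
    # Move the n-1 disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, via, dst, src)


if __name__ == "__main__":
    for n in (3, 7, 10):
        count = sum(1 for _ in hanoi_moves(n))
        print(n, count)  # count equals 2**n - 1
```

Code runs this without drifting because the call stack holds the state. A model asked to simulate the same steps in text has to carry that state itself, which is where the thread gets lost.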

Apple also saw something strange in the effort: as problems approached the hard end, the models spent fewer reasoning tokens, not more. They appeared to give up right when they should have dug in. Whatever is happening inside, it isn't "try harder when it gets harder".

The critics have a point

This paper got hit, and some of the hits land. A widely shared rebuttal, "The Illusion of the Illusion of Thinking", showed that part of the collapse was an artifact of the setup: models ran into output token limits on the longer puzzles (in some cases saying outright that the full answer was too long to write out), some puzzle instances were mathematically impossible, and the automated grader marked correct-but-truncated answers as failures. Adjust for that and the cliff softens.
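
The token-limit objection is easy to check with arithmetic. An optimal Tower of Hanoi solution for n disks has 2^n - 1 moves, so a fully written answer roughly doubles in length with every added disk. The sketch below estimates where a written-out answer stops fitting. The 10 tokens per move and the 64,000-token output budget are illustrative assumptions, not figures taken from the paper or the rebuttal.

```python
TOKENS_PER_MOVE = 10    # assumption: rough cost of writing out one move
OUTPUT_BUDGET = 64_000  # assumption: illustrative output token limit


def min_moves(n_disks: int) -> int:
    """Optimal Tower of Hanoi solution length: 2^n - 1 moves."""
    return 2 ** n_disks - 1


def tokens_needed(n_disks: int, tokens_per_move: int = TOKENS_PER_MOVE) -> int:
    """Estimated tokens to write out every move of the optimal solution."""
    return min_moves(n_disks) * tokens_per_move


def largest_writable(budget: int = OUTPUT_BUDGET,
                     tokens_per_move: int = TOKENS_PER_MOVE) -> int:
    """Largest disk count whose full move list still fits in the budget."""
    n = 0
    while tokens_needed(n + 1, tokens_per_move) <= budget:
        n += 1
    return n


if __name__ == "__main__":
    for n in (8, 10, 12, 13, 15):
        print(n, min_moves(n), tokens_needed(n))
    print("largest writable:", largest_writable())
```

Under these assumed numbers, 12 disks fit and 13 do not. A grader that demands the full move list will score the 13-disk case as a failure regardless of whether the model knows the procedure.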

So take the headline with that grain of salt. We're not claiming the models can't reason at all. But "the benchmark was partly broken" doesn't rescue the practical point. Whether the wall is at complexity N or N-plus-a-bit, your hardest problems are on the far side of it, and "a smarter model will get there" is a bet, not a plan.

What the cliff means for your hardest problems

This lines up with the rest of the independent research: Stanford found AI's gains evaporate on complex code; Apple found its reasoning evaporates on complex problems. Same shape, different layer. And the fix is the same shape too: don't put the hard thing in front of the model whole.

Decompose. Break the gnarly problem into steps each small enough to sit well inside the model's reliable range. The cliff is about complexity per step, so lower it.
Offload exact execution. When something needs a precise, many-step procedure, give the model a tool that runs it (a solver, a script, a checker) instead of asking it to simulate the steps in its head.
Verify each step. A run that drifts at step four should fail at step four, not at the end. Gates and tests are how you catch the drift the model won't catch itself.
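
Two of those moves, offloading and verifying, can be sketched for the Tower of Hanoi case. This is a minimal illustration under our own assumptions: `solve_exactly` stands in for a tool the model would call, `check_moves` is a gate that replays a proposed move list against the rules, and `StepError` is a hypothetical exception type. None of these names come from Apple's paper or any library.

```python
from typing import Dict, Iterable, List, Tuple

Move = Tuple[str, str]  # (from_peg, to_peg)


class StepError(Exception):
    """Raised at the first move that breaks the rules."""

    def __init__(self, step: int, reason: str) -> None:
        super().__init__(f"step {step}: {reason}")
        self.step = step


def solve_exactly(n: int, src: str = "A", dst: str = "C", via: str = "B") -> List[Move]:
    """The 'tool': deterministic code the model calls instead of simulating."""
    if n == 0:
        return []
    return (solve_exactly(n - 1, src, via, dst)
            + [(src, dst)]
            + solve_exactly(n - 1, via, dst, src))


def check_moves(n_disks: int, moves: Iterable[Move]) -> None:
    """Replay moves on a simulated board; fail fast on the first bad one."""
    # Each peg is a stack; the last element is the top disk.
    pegs: Dict[str, List[int]] = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    step = 0
    for step, (src, dst) in enumerate(moves, start=1):
        if src not in pegs or dst not in pegs:
            raise StepError(step, f"unknown peg in {src}->{dst}")
        if not pegs[src]:
            raise StepError(step, f"peg {src} is empty")
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            raise StepError(step, f"larger disk placed on a smaller one on {dst}")
        pegs[dst].append(pegs[src].pop())
    if len(pegs["C"]) != n_disks:
        raise StepError(step, "moves ran out before the puzzle was solved")


if __name__ == "__main__":
    check_moves(4, solve_exactly(4))  # passes silently
    drifted = solve_exactly(3)
    drifted[3] = ("A", "B")           # simulate a run that drifts at step four
    try:
        check_moves(3, drifted)
    except StepError as err:
        print(err)
```

The gate reports the step where the run went wrong, not just that the final answer is wrong, which is the difference between a retry of one step and a retry of the whole job.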

This is just the operating model again, pointed at reasoning. The model is one component. The structure you put around it is what crosses the complexity it can't.

What to do on Monday

Find where you're betting on raw reasoning. Any workflow that hands an agent a complex, multi-step problem and trusts the output is sitting near the cliff. Add decomposition and checks.
Give the agent tools, not raw thinking. For anything that needs exact execution, wire a tool and let the model orchestrate it. Stop asking it to be a calculator.
Test on your real complexity. A demo on a toy problem tells you nothing about the cliff. Run it on the messy, many-step thing you actually need it to do.
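
One way to run that last test is to sweep complexity and record the pass rate at each level, so the drop-off shows up as a number instead of a production incident. A minimal sketch: `solve` stands in for whatever calls your agent, `make_case` builds a task of a given size, and `is_correct` is your checker. All three are hypothetical placeholders, and the toy functions exist only to make the script runnable.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def accuracy_by_size(
    solve: Callable[[Any], Any],
    make_case: Callable[[int, int], Any],
    is_correct: Callable[[Any, Any], bool],
    sizes: Iterable[int],
    trials: int = 5,
) -> Dict[int, float]:
    """Run `trials` cases at each size and return the pass rate per size."""
    results: Dict[int, float] = {}
    for size in sizes:
        passed = 0
        for trial in range(trials):
            case = make_case(size, trial)
            if is_correct(case, solve(case)):
                passed += 1
        results[size] = passed / trials
    return results


def first_drop(results: Dict[int, float], floor: float = 0.9) -> Optional[int]:
    """Smallest size whose pass rate is below `floor`, or None if none is."""
    for size in sorted(results):
        if results[size] < floor:
            return size
    return None


# Toy stand-ins so the sketch runs end to end; replace with your own.
def toy_make_case(size: int, trial: int) -> list:
    return list(range(trial, trial + size))


def toy_solve(case: list) -> int:
    # Pretend agent: right on short inputs, wrong once the input gets long.
    return sum(case) if len(case) <= 8 else -1


def toy_is_correct(case: list, answer: int) -> bool:
    return answer == sum(case)


if __name__ == "__main__":
    rates = accuracy_by_size(toy_solve, toy_make_case, toy_is_correct, sizes=[2, 4, 8, 16])
    print(rates, first_drop(rates))
```

The sizes worth sweeping are the ones your real workload produces. If the first drop sits below them, that workflow needs decomposition or a tool before it ships.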

How we think about it

We build agent fleets that work on real, multi-step problems in production, so we plan around the cliff instead of pretending it isn't there. The agents that hold up aren't running a bigger model. They're running a smaller problem, with tools for the exact parts and checks at every step. The model is maybe 10% of a working setup. The other 90% is the structure that keeps it on the near side of the edge Apple went looking for. (The context layer is part of that 90%: trovex cuts roughly 60% of the tokens per lookup, which leaves more room for the steps that matter.)

If you're betting a hard workflow on a model reasoning it through, that bet is the thing to pressure-test.

We map where your agents are betting on raw reasoning, then put decomposition, tools, and checks around the steps that fall off the edge.

Common questions

Can AI reasoning models actually think?

They do something useful on simple and medium problems and fall apart on hard ones. Apple's study found frontier reasoning models collapsed to near-zero accuracy past a complexity threshold. Whether or not you call the rest "thinking", the practical limit is real: complexity has an edge, and models go over it.

What did Apple's 'Illusion of Thinking' study find?

Testing models like o3-mini, Claude 3.7 Sonnet Thinking, and DeepSeek R1 on controllable puzzles, Apple found three regimes: plain models win on simple tasks, reasoning models win at medium complexity, and both collapse to near-zero on hard ones. Models also failed to execute an algorithm even when it was handed to them.

Why do reasoning models collapse on hard problems?

Apple observed reasoning effort rising with complexity, then dropping near the hard end, as if the model gives up. It points to weak step-by-step execution rather than missing knowledge. Critics counter that some collapse came from output token limits and flawed puzzle instances, not reasoning itself.

What does the reasoning cliff mean for hard engineering problems?

Don't hand the model a hard problem whole and hope. Decompose it into steps the model handles reliably, give it tools to offload exact execution, and verify each step. The structure around the model, not a bigger model, is what gets you across the complexity it can't cross alone.
