25 June 2026 · 4 min read
AI 'reasoning' has a cliff. Apple went and found the edge.
Apple tested frontier reasoning models on puzzles of rising complexity and watched accuracy fall off a cliff past a threshold, even when the models were handed the exact algorithm. The critics have a real point about the setup. The lesson for your hardest problems survives the criticism.
Short version: the reasoning models are sold on the hard cases, the gnarly logic a normal model fumbles. Apple ran the experiment and found the opposite of the pitch: on problems past a certain complexity, the reasoning collapses, all the way to near-zero accuracy. The most telling part is that it happened even when the model was handed the exact algorithm to follow. The paper drew sharp criticism, some of it fair. The operating lesson holds either way.
What Apple actually tested
Apple's "Illusion of Thinking" paper avoided the usual benchmarks, which are leaky and easy to memorize, and used controllable puzzles like Tower of Hanoi, where the researchers could dial complexity up one notch at a time. They ran frontier reasoning models, including o3-mini, Claude 3.7 Sonnet Thinking, and DeepSeek R1, and watched what happened as the problems got harder.
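To make "one notch at a time" concrete, here is a minimal Python sketch of the Tower of Hanoi puzzle itself. It is not Apple's test harness; the function names, peg labels, and disk counts are illustrative assumptions. It shows the standard recursive solution and a simple checker, and why each extra disk is a big notch: the shortest solution for n disks takes 2^n - 1 moves, so the required output roughly doubles with every disk added.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear the n-1 smaller disks out of the way
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # restack the smaller disks on top of it
    )


def is_valid_solution(n, moves):
    """Replay a move list and check that every move is legal and the puzzle ends solved."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


# Hypothetical complexity levels: the move count is 2**n - 1 at each one.
for n in (3, 7, 10, 15):
    moves = hanoi_moves(n)
    print(n, len(moves), is_valid_solution(n, moves))
```

The rules never change as n grows, only the length of the move sequence that has to be produced without a single illegal step. Under a strict checker like this one, one bad move anywhere in the sequence fails the whole attempt.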
