The research is clear that AI underdelivers by default. It's just as clear about why, and that points straight at the fix. Five operating levers separate the teams that get real gains from the ones that get rework. This is the operating model we install, and none of it depends on a better AI model.
tsukumo
Short version: if AI is underwhelming on your team, the instinct is to wait for a smarter model. Don't. The independent research that found AI slowing teams down was already using frontier models. The problem was never intelligence. It was everything around the model: what it works on, how its output gets reviewed, what context it sees, and the state of the code it touches. Fix those and the same model that disappointed you starts paying off. That set of fixes has a name. It's an operating model, and it's the actual product.
We laid out the problem, with sources, separately. This is the other half: what the teams getting real value do differently. Five levers.
Stanford's data is blunt about this: AI delivered 35-40% gains on greenfield, low-complexity work and single-digit gains on the complex, brownfield code most teams live in. So the first decision is where you aim it. The high-volume, low-context work (boilerplate, tests, migrations, scaffolding) is where the gains are real. The gnarly core a senior holds in their head is where AI burns the time you thought you saved.
The greenfield bar is a range Stanford put at 35 to 40 percent. The brownfield bar is single digits. Same model, same developers, two very different answers. That gap is your targeting instruction.
Lever 2: Keep batches small, and make review mean something
DORA found rising AI adoption tracked with lower delivery stability, because AI inflates change size and big batches are riskier. The fix is the oldest one in delivery: small, reviewable changes behind a gate that actually catches problems. The twist now is that volume is higher, so a rubber-stamp review fails faster than it once did.
Cap PR size. Split AI-generated work into pieces a human can hold in their head. Make review about understanding the change, not approving it. The mechanics are in "AI makes it easy to ship more."
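A size cap only works if something enforces it. Here is a minimal sketch of a gate you could run in CI, assuming you can already get per-file added and deleted line counts from your version control system. The names (`FileChange`, `check_pr_size`) and both limits are hypothetical, not a recommendation from the research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChange:
    path: str
    added: int
    deleted: int


# Hypothetical limits; tune them to what your reviewers can actually absorb.
MAX_CHANGED_LINES = 400
MAX_FILES = 15


def check_pr_size(changes, max_lines=MAX_CHANGED_LINES, max_files=MAX_FILES):
    """Return a list of violations; an empty list means the PR is within the cap."""
    total = sum(c.added + c.deleted for c in changes)
    problems = []
    if total > max_lines:
        problems.append(f"{total} changed lines exceeds cap of {max_lines}")
    if len(changes) > max_files:
        problems.append(f"{len(changes)} files exceeds cap of {max_files}")
    return problems
```

Returning the violations instead of raising lets the gate post them as a review comment, which nudges the author to split the change rather than just blocking it.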
METR's randomized trial found developers felt about 20% faster with AI while measuring 19% slower. That gap is the trap: every activity metric (seats, "percent AI code", self-reported speed) can rise while delivery rots. So you measure outcomes instead: change-failure rate, time-to-merge, and rework, the numbers that tell the truth.
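To make "measure the outcome" concrete, here is a rough sketch that computes the three outcome numbers from a list of merged changes. The `Change` record and its fields are assumptions for illustration; how you decide that a change "caused a failure" or "was reworked" is your own definition, and it matters more than the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass(frozen=True)
class Change:
    opened_at: datetime
    merged_at: datetime
    caused_failure: bool  # assumed flag: needed a rollback, hotfix, or incident
    reworked: bool        # assumed flag: substantially rewritten soon after merge


def outcome_metrics(changes):
    """Summarize delivery outcomes for a non-empty list of merged changes."""
    if not changes:
        raise ValueError("need at least one merged change")
    n = len(changes)
    hours = [(c.merged_at - c.opened_at).total_seconds() / 3600 for c in changes]
    return {
        "change_failure_rate": sum(c.caused_failure for c in changes) / n,
        "rework_rate": sum(c.reworked for c in changes) / n,
        "median_hours_to_merge": median(hours),
    }
```

Track these per week or per sprint and compare the trend before and after an AI rollout. None of them can be inflated by generating more code.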
ETH Zurich found that stuffing a coding agent with a big context file made it worse and more expensive, while a short, current, scoped one helped. GitClear found AI driving duplication, partly because the agent can't see the code it should be reusing. Both point to the same lever: the agent needs the right context, served, not the whole repo, stuffed.
This is the one place we have a hard number. trovex cuts roughly 60% of the tokens per lookup by serving the currently-correct code and decisions for the task, so the agent fits your codebase instead of inventing around it.
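For a feel of what "served, not stuffed" means mechanically, here is a generic sketch of picking context under a token budget. It is not how trovex works internally; it is a plain greedy selection, and the `Snippet` fields, the relevance score, and the four-characters-per-token estimate are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    source: str
    text: str
    relevance: float  # assumed 0..1 score from whatever retrieval you already run
    is_current: bool  # False for deprecated or superseded code and decisions


def estimate_tokens(text):
    # Crude heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def select_context(snippets, token_budget):
    """Greedily keep the most relevant current snippets that fit the budget."""
    candidates = sorted(
        (s for s in snippets if s.is_current),
        key=lambda s: s.relevance,
        reverse=True,
    )
    chosen, used = [], 0
    for s in candidates:
        cost = estimate_tokens(s.text)
        if used + cost <= token_budget:
            chosen.append(s)
            used += cost
    return chosen, used
```

The two filters do the work: stale material never reaches the agent, and the budget forces a choice about what the task actually needs.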
Stanford's other finding: net AI productivity rises with test coverage, type coverage, documentation, and modularity. The cleaner the environment, the more of AI's speed survives as real output. Good tests catch the agent's mistakes; clear modules give it a target it can hit; docs are context it can use instead of guessing.
This is the lever leaders underrate, because it looks like unrelated maintenance work. It isn't. It's the ceiling on every other gain. The duplication problem it prevents is in "AI ships code by copy-paste."
None of these levers is exotic. That's the point, and also why they get skipped: they look like the boring fundamentals, so teams reach for a new tool instead. The teams that win treat the five as a system. Right work, small batches, honest metrics, served context, clean code. Pull one and the others sag.
Buying AI vs operating it

| Criterion | Buying a tool | Operating a model |
| --- | --- | --- |
| What you change | Procurement | How work flows |
| Where AI points | Wherever | The right tasks |
| Batch size | Whatever the agent emits | Capped, reviewable |
| Metric watched | Seats, percent AI code | Change-failure, rework |
| Context served | The whole repo | Current, scoped |
| Result | Felt faster | Measured faster |
When we install this with a team, we're not handing over a model or a tool. We're wiring the operating model around whatever tools they already bought: scoping where agents are pointed, tightening the review gates, standing up the context layer, and fixing the metrics so they tell the truth. Then we train the team's own developers to run it, so the capability stays after we leave. That's the difference between buying AI and operating it.
Find your weakest lever. Score your team on all five, or just eyeball it: PR size (batch), rework rate (gates and code quality), where AI is pointed (tasks), and what context it runs on. One of them is costing you most.
Fix the cheapest one first. Usually batch size or context. Quick to change, immediate signal.
Re-measure on outcomes. If change-failure rate and rework improve, you're operating AI. If only "lines shipped" improves, you're still just running a tool.
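The first two steps above can be sketched as a small scoring helper. Everything here is illustrative: the lever keys, the 0 to 5 scale, the `healthy_at` threshold, and the relative fix costs are assumptions you would replace with your own assessment.

```python
LEVERS = ("tasks", "batch_size", "metrics", "context", "code_quality")


def pick_first_fix(scores, fix_cost, healthy_at=4):
    """Return (weakest_lever, cheapest_broken_lever).

    scores:   lever -> 0 (broken) .. 5 (healthy), from your own assessment
    fix_cost: lever -> relative effort to improve (lower is cheaper)
    The second element is None when every lever is at or above healthy_at.
    """
    missing = [l for l in LEVERS if l not in scores or l not in fix_cost]
    if missing:
        raise ValueError(f"missing levers: {missing}")
    weakest = min(LEVERS, key=lambda l: scores[l])
    broken = [l for l in LEVERS if scores[l] < healthy_at]
    first_fix = min(broken, key=lambda l: fix_cost[l]) if broken else None
    return weakest, first_fix
```

Note that the weakest lever and the first fix can differ: a team whose worst score is code quality may still start with batch size, because it is cheap to change and the signal shows up quickly.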
We run agent fleets in production to build our own software, so we built this operating model because we needed it, not because it makes a tidy framework. The model is maybe 10% of a working setup. The other 90% is these five levers, and it's the same 90% five independent research teams just finished measuring from the other side.
If you've adopted AI and the results don't match the promise, the gap is the operating model, and that's exactly what we install. Book an assessment and we'll map which lever is costing you most.
Run it as an operating model, not a tool. Point AI at the right tasks, keep change sets small behind real review, measure outcomes over output, serve trusted context, and raise codebase quality. The independent research shows these five levers, not the model choice, decide whether AI helps or hurts.
What is an AI operating model?
The set of practices around the tool that determine its results: which work it does, how that work gets reviewed and shipped, what context it runs on, and how you measure it. The model writes code; the operating model decides whether that code makes your team faster or slower.
Why isn't a better model enough to get value from AI?
Because the studies that found AI slowing teams down used frontier models. The bottleneck wasn't intelligence; it was oversized batches, weak review, missing context, and messy codebases. A smarter model amplifies whatever operating model it lands in, good or bad.
How does a team start fixing its AI operating model?
Measure first. Look at PR size, rework rate, and where AI is pointed. Then fix the cheapest broken lever, usually batch size or context. A scoped assessment maps which of the five levers is costing you most before you invest in the others.