AI makes it easy to ship more. DORA's data says that's the problem.
Google's 2024 DORA report found that as teams adopted AI, delivery throughput and stability went down, not up. The culprit isn't bad AI code. It's batch size: AI removes the friction that used to keep change sets small, and big batches break delivery.
tsukumo
Short version: the pitch is that AI makes your team ship faster and more reliably. The largest study of software delivery available found the opposite: as teams adopted AI, their delivery got less stable. Not because the AI writes bad code. Because it writes a lot of code, fast, and volume without discipline is how delivery breaks. The tool didn't fail. The operating model around it did.
DORA is Google's long-running research program on what makes software teams deliver well. Its 2024 report surveyed a large population of engineers, about 76% of whom were already using AI for part of their work. The finding that didn't make the keynote:
A 25% increase in AI adoption was associated with an estimated 1.5% drop in delivery throughput and a 7.2% drop in delivery stability.
DORA's read on why: AI makes it easy to generate more code, which inflates change set size, and larger batches carry more risk. That's not a new claim. It's one of the most consistent results in their entire body of work.
So the thing AI is best at, producing more code with less effort, runs straight into the thing DORA has warned about for a decade. You didn't get faster delivery. You got bigger batches.
What a 25% rise in AI adoption tracked with: delivery throughput down 1.5%, delivery stability down 7.2%. (Source: Google DORA 2024 report)
This isn't a lone signal, either. A 2025 randomized trial from METR found experienced developers were about 19% slower with AI while feeling faster. Two serious, independent measurements, pointing the same direction: adoption is not the same as improvement.
Think about what used to keep your pull requests small. Writing code was expensive, so people wrote less of it per change. That friction was annoying, and it was also load-bearing. It forced small, reviewable batches almost by accident.
AI removes the friction. The blank page is free now. A developer can produce a 600-line change in the time a 60-line one used to take, and the natural thing to do is ship it as one batch. Now the reviewer faces a wall of plausible code, skims it, approves it, and a larger, less-understood change lands in production. Multiply that across a team and your stability number drifts down, exactly as DORA measured.
The mechanism is almost boring. Bigger change, more surface area, harder to review, more ways to fail. AI didn't introduce a new failure mode. It removed the brake that was hiding how risky your batch sizes always were.
This is the layer an agent-ops assessment looks at first: not the model, but the delivery discipline around it.
Read the data carefully and it isn't saying "AI is bad for delivery." It's saying AI is a force multiplier on whatever your delivery habits already are. If your batches were disciplined and your gates were real, AI lets you do more of that, well. If your review was already a rubber stamp, AI hands that rubber stamp a firehose.
The two outcomes aren't subtle. Same tool, opposite effect, decided entirely by the operating model underneath:
Same AI, two delivery cultures

Under AI | Disciplined team | Rubber-stamp team
Batch size | capped, splits stay reviewable | balloons, one giant diff
Code review | a human holds the change in their head | skims a wall, approves on trust
What gets watched | change-failure rate, time-to-restore | lines and percent written by AI
Net effect on stability | holds or improves | drifts down, as DORA measured
The teams in DORA's data who kept delivery strong weren't the ones who avoided AI. The fundamentals the report points back to, small batches and real testing, are operating choices that AI makes more important, not less. The tool raises the stakes on the discipline you bring to it.
If volume is the threat, the fixes are the unglamorous ones DORA has always pointed at, now with teeth:
Cap batch size. Put a real limit on PR size and split AI-generated work into reviewable pieces. A change a human can actually hold in their head is the whole point.
Make the gate mean something. A review the AI can talk past is not a review. It takes tests, checks, and a human who has to understand the change, not wave it through. We wrote up how we run this in "How we ship with an agent fleet."
Measure stability, not output. Track change-failure rate and time-to-restore, the metrics that survive AI. "Lines shipped" is the number that looks great while stability rots.
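A cap only has teeth if something enforces it. Here is a minimal sketch of a size gate, assuming your changes live in git and that a CI step can run Python. It reads the output of `git diff --numstat` and counts added plus deleted lines. The 400-line limit, the `origin/main` base, and every function name are placeholders for illustration, not recommendations from DORA.

```python
import subprocess

# Hypothetical cap. Pick a number your reviewers can actually hold in their head.
MAX_CHANGED_LINES = 400


def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        total += int(added) + int(deleted)
    return total


def within_cap(numstat: str, cap: int = MAX_CHANGED_LINES) -> bool:
    """True when the change is small enough to pass the size gate."""
    return changed_lines(numstat) <= cap


def branch_numstat(base: str = "origin/main") -> str:
    """Numstat for the current branch since it diverged from `base` (assumed name)."""
    result = subprocess.run(
        ["git", "diff", "--numstat", f"{base}...HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

In CI, something like `within_cap(branch_numstat())` could fail the build when a change is too big. Line count is a crude proxy for reviewability, so treat the gate as a prompt to split the work, not as proof that a small change is safe.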
There's a quieter cost under all of this: the agent generating those big diffs is often working from the wrong context, so it writes more than it needs to. This is the one place we have a hard number. trovex cuts roughly 60% of the tokens per lookup by serving the currently-correct context instead of stuffing the window. Less noise in, smaller and more correct changes out.
Look at your PR sizes since AI landed. If the median crept up, that's your stability risk, measured. Set a cap.
Audit one recent AI-heavy merge. Ask whether the reviewer actually understood it or just trusted it. The honest answer tells you where your gate stands.
Stop celebrating volume. Retire "percent of code written by AI" from the dashboard and put change-failure rate where everyone can see it.
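The three checks above come down to a few numbers you can compute from data you probably already have. This is a rough sketch, assuming you can export merged PRs with their size and deployments with a failure flag. The record shapes, field names, and the rollout date are hypothetical, and DORA's own survey-based definitions are more careful than this.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional, Sequence, Tuple


@dataclass
class PullRequest:
    merged_on: date
    changed_lines: int  # added + deleted lines


@dataclass
class Deployment:
    failed: bool  # needed a rollback, hotfix, or other remediation
    minutes_to_restore: Optional[float] = None  # only set for failed deployments


def median_size_shift(
    prs: Sequence[PullRequest], ai_rollout: date
) -> Tuple[float, float]:
    """Median PR size before vs. on or after the AI rollout date."""
    before = [p.changed_lines for p in prs if p.merged_on < ai_rollout]
    after = [p.changed_lines for p in prs if p.merged_on >= ai_rollout]
    if not before or not after:
        raise ValueError("need PRs on both sides of the rollout date")
    return median(before), median(after)


def change_failure_rate(deploys: Sequence[Deployment]) -> float:
    """Share of deployments that caused a failure in production."""
    if not deploys:
        raise ValueError("no deployments to measure")
    return sum(d.failed for d in deploys) / len(deploys)


def median_time_to_restore(deploys: Sequence[Deployment]) -> Optional[float]:
    """Median minutes to restore across failed deployments, or None if none failed."""
    times = [
        d.minutes_to_restore
        for d in deploys
        if d.failed and d.minutes_to_restore is not None
    ]
    return median(times) if times else None
```

If the second number from `median_size_shift` is clearly larger than the first, that is the batch-size creep to cap. The other two functions give you the figures to put on the dashboard in place of "percent of code written by AI".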
We run agent fleets in production to ship our own software, which means we generate a lot of code and have to keep it from sinking our own delivery. The thing that works isn't slowing the agents down. It's holding the line on batch size and gates so the extra volume stays shippable. The model is maybe 10% of a working setup. The other 90% is the delivery discipline DORA keeps measuring, and AI keeps testing.
Not automatically. The 2024 DORA report found rising AI adoption was associated with lower delivery throughput and stability, not higher. AI speeds up writing code, but delivery performance depends on how that code gets reviewed, batched, and shipped, which AI doesn't fix and can quietly make worse.
What did the 2024 DORA report find about AI?
With about 76% of respondents using AI for part of their work, DORA estimated that a 25% increase in AI adoption was associated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability. More adoption tracked with worse delivery outcomes, the opposite of the usual pitch.
Why does AI adoption lower delivery stability?
Because it inflates batch size. DORA has shown for years that large change sets are riskier. AI makes producing more code nearly free, so pull requests get bigger and harder to review, and big batches fail more often in production. The problem is the size of the change, not the origin of the code.
How do you adopt AI without hurting delivery?
Keep the batches small and the gates real. Cap PR size, require review the AI can't talk its way past, and watch change-failure rate, not lines shipped. The teams that stay stable with AI are the ones that kept their delivery discipline instead of letting volume erase it.