Agency · 16 June 2026 · 4 min read
Running agent fleets in production: what it actually takes
Going from one agent to a fleet in production isn't a prompt change. It's four engineering layers: context, orchestration, observability, and an operating model your devs run.
The short answer
Running a fleet of AI agents in production takes four engineering layers, not a better prompt: context the agents can trust, orchestration so they don't collide, observability so you operate on evidence, and an operating model your developers actually run. Skip any one and the fleet is expensive, unreliable, or quietly abandoned.
One agent doing a task is a script. A fleet doing real work in production is a system, and it needs the four layers above, none of which the demo ever shows. Here's the honest version of what's involved.
1. Context the agents can trust
An agent is only as good as what it knows about your codebase. At fleet scale, "let it read the repo" is both expensive (every agent, every session, re-deriving the same things) and wrong (it picks stale or duplicate docs). You need a context layer that serves the current, canonical answer cheaply and on demand. Get this right and agents are fast and correct; get it wrong and you're paying premium tokens for confident mistakes. We hit this ourselves and built trovex for it; the version we run does a doc lookup at about 60% fewer tokens than letting an agent re-read the repo. That's the whole game at fleet scale: the same lookup, many times a day, across many agents.
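To make the idea concrete, here is a minimal sketch of a context layer in Python. It is not how trovex works; the `Doc` and `ContextLayer` names, the file paths, and the "newest doc per topic wins" rule are all assumptions made for illustration. The only figure taken from above is the roughly 60% saving; the token and lookup counts in the usage lines are invented to show the arithmetic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    topic: str      # what the doc answers, e.g. "deploy" (hypothetical)
    path: str       # where it lives in the repo (hypothetical path)
    updated: date   # last-modified date, used to pick the current copy
    body: str


class ContextLayer:
    """Serves one canonical doc per topic so agents don't re-derive it."""

    def __init__(self, docs):
        self._canonical = {}
        for doc in docs:
            current = self._canonical.get(doc.topic)
            # Assumption: the most recently updated doc on a topic is canonical.
            if current is None or doc.updated > current.updated:
                self._canonical[doc.topic] = doc

    def lookup(self, topic):
        """Return the canonical doc for a topic, or None if there isn't one."""
        return self._canonical.get(topic)


def daily_tokens_saved(tokens_per_reread, lookups_per_day, saving_pct=60):
    """Tokens saved per day if each lookup costs saving_pct% less than a re-read."""
    return tokens_per_reread * saving_pct // 100 * lookups_per_day


# Two docs cover the same topic; the layer serves only the current one.
layer = ContextLayer([
    Doc("deploy", "docs/old/deploy.md", date(2024, 1, 1), "old steps"),
    Doc("deploy", "docs/deploy.md", date(2025, 6, 1), "current steps"),
])
print(layer.lookup("deploy").path)      # docs/deploy.md

# Illustrative numbers only: 20,000 tokens per re-read, 500 lookups a day.
print(daily_tokens_saved(20_000, 500))  # 6000000
```

A real context layer would need a better notion of "canonical" than a timestamp, but the shape is the point: resolve staleness and duplication once, in one place, and the per-lookup saving multiplies across every agent and every session.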