Journal
Field notes from running agent fleets in production — what survives past the demo, and how dev teams become agentic operators.
Most teams scale their AI coding agents right after the demo works, which is the wrong moment. The signal to scale isn't enthusiasm. It's that the constraints keeping solo agent use safe have started to break across the team.
The archive · 35
Most 'AI assessments' are a slide deck or a readiness quiz with a sales call attached. A real one happens in your repo, in production, and tells you what AI won't fix. Here's what it is and what you keep.
Most teams ship agent changes on vibes: it felt better in the demo. But agents regress silently, and "it seems good" isn't a measurement. Evals are a golden set of real tasks with gradeable outcomes, run on every change, with the judge itself checked.
The moment you give an AI coding agent tools, prompt injection stops being a content problem and becomes remote code execution. The agent reads a poisoned repo or issue, and the injected instruction runs with the agent's permissions. You don't prompt your way out of this. You treat the agent as an untrusted client.
"AI reads your documents" dies on the boring parts. Replacing a legacy DMS isn't a chatbot over your files. It's a split-classify-extract pipeline that has to beat the humans it replaces across an 89,000-document backlog and 142 real categories.
The license tag on an open-weight model often isn't the license you're actually bound by. Licenses inherit through fine-tunes, and the tag inherits wrong all the time. Ship on the tag and you can be running a restricted model in production without knowing it.
Your best senior tried the AI, got code that was almost right and took longer to fix than write, and quietly stopped. That's not resistance, it's judgment. Here's how a lead earns real buy-in instead of fighting it.
One agent needs no coordination. Five do. Run them in parallel without it and you get collisions, duplicated work, and a handoff loop where nobody owns the task. Orchestration is the discipline that turns a pile of agents into a team.
The PR looks clean, passes the tests, and imports a library that doesn't exist. AI-written code fails differently than human code, and shipping it safely is a review problem, not a model problem.
Your agents ran overnight. This morning there are merged changes, a token bill, and something that looks off. Can you reconstruct what they did and why? That question is what agent observability answers, and your existing dashboards don't.
Thirty engineers, a codebase that works, everyone now using AI, and delivery somehow isn't faster. Scale-ups hit a bind the startup and enterprise playbooks don't fix. Here's the one that fits.
If you check a user's permissions on every API call, you're doing auth at the wrong layer. Inject the claim into the JWT at login with a Supabase auth hook, so the token carries it. Here's the pattern, plus the two traps that make it fail silently.
Production RAG over regulations isn't embed-and-retrieve. It's a layered pipeline (hybrid search, reranking, hierarchical summaries, graph context), and the failure that bites hardest is a silent embedding-dimension mismatch that returns confident garbage.
A loose-cannon agent is dangerous and a shackled one is useless. The way out is to put the judgment in versioned, fail-closed skill definitions and to gate which tools the agent can touch per skill and per turn. Capable without being a liability.
The thing that takes down an AI system in production usually isn't the model or the app. It's the boring infrastructure underneath, and it fails green. Four real failures from a dockerized AI stack: a mount writing to the void, a runaway container, a tripped autoscaler, dead proxy routes.
A context window is not memory. For an agent that handles one client's accounting across nine months, we built memory as five distinct layers (facts, history, and decisions among them) with a promotion path that turns a one-off ruling into a standing rule.
Making a real ERP usable by AI agents isn't an integration project. It's a contract. We exposed a fiduciary back office as 13 MCP servers and 222 tools, one server per domain, one uniform envelope, per-principal auth. Here's the shape that holds.
In a regulated business, compliance isn't a layer you add after the AI works. It's the constraint that decides what the architecture is allowed to be. Here's how Swiss data and professional-secrecy law shaped every layer of an AI system we built for a fiduciary.
Most 'we use AI' stories are an autocomplete in someone's editor. Ours is an org chart: a CTO agent, domain leads, a coordination layer, tickets claimed off a board, one isolated worktree per agent, and a review gate nothing skips. Here's how it actually runs.
The reliable parts of a production AI agent aren't in the model. They're in the deterministic code wrapped around it. Stop trying to prompt your way to correctness and start constraining what the model is allowed to do.
If your agent token bill keeps climbing, the model price usually isn't the problem. The waste is in how much context you pay for on every call. The real cost drivers, and the levers that actually move them.
Lines of AI-written code and acceptance rates measure activity, not impact. The honest question is whether your team ships more of the right work at the same quality. How to read that, and the one number that's actually real.
Making an agent reliable is one problem. Governing a non-human actor with commit access is another: who owns its actions, how far a bad one reaches, and whether you can prove what happened. The governance layer, plainly.
The bugs that hurt a production pipeline don't crash. They return 200, paint the dashboard green, and quietly stop doing their job. Here are five we hit running an agentic accounting platform, and how we caught them.
AI readiness isn't a license count. It's whether your team can run agents on real work in production. Six dimensions tell you where you actually stand, and which gap is stalling you.
Most AI consultancies sell decks or dependency. A few transfer real capability onto your team. Six questions that tell the difference before you sign.
Agents lose context because a big repo doesn't fit a window, and a bigger window doesn't fix it. The fix is serving the canonical answer on demand.
Reliable agents aren't a better model, they're the engineering around it: scoped permissions, review gates, observability, and context the agent can trust.
If the goal of AI is to ship the same work cheaper, you'll be disappointed. The win is prod-grade output and roughly 10x from the team you already trust.
An agentic operator runs AI agents that do whole units of work instead of typing every line. We operate this way every day — here's the job, concretely, and the one skill that's actually new.
The fear that AI is there to replace developers is what quietly caps the capability you paid for. Augment-not-replace isn't ethics, it's what works.
Most AI training is a slide deck and a prompt cheatsheet. Turning a dev team into agentic operators is hands-on, on your own codebase, on real production work.
Buying an AI tool gives access, not capability. Building alone burns senior quarters. For most teams the real answer is neither. The honest build-vs-buy framing.
The AI demo always works. Then it meets your real codebase, standards, and scale, and quietly dies. The demo-to-production gap is where most AI initiatives fail.
Your team has AI autocomplete, maybe 10% of what coding agents can do. The gap to agents running work in production is an operating problem, not a license problem.
Going from one agent to a fleet in production isn't a prompt change. It's four engineering layers: context, orchestration, observability, and an operating model your devs run.
We write when we've shipped or learned something about running AI agents in production. No cadence quota, no filler.