GPT isn't enough: we wrap deterministic state machines around the LLM
The reliable parts of a production AI agent aren't in the model. They're in the deterministic code wrapped around it. Stop trying to prompt your way to correctness and start constraining what the model is allowed to do.
The short answer
You don't make an LLM reliable by giving it a better prompt or a bigger model. You wrap it in deterministic code that constrains what it is allowed to do. The model handles the fuzzy part (reading intent); a state machine owns every step that has to be correct (which client, which ID, which transition is legal). In production accounting AI, reliability lives in the shell around the model, not in the model.
tsukumo
Short version: the parts of a production AI agent you can trust are not in the model. They are in the deterministic code wrapped around it. Most teams try to reach reliability by prompting harder or upgrading the model. That moves the probability of a correct answer; it never makes a wrong answer impossible. For work where being wrong costs real money, you want the opposite approach: let the model do the fuzzy part, and put a state machine in charge of everything that has to be correct. We build accounting AI for a Swiss fiduciary, where "confidently wrong" is the worst outcome there is, and this is the pattern that holds up.
Why doesn't a better LLM make your agent reliable?
Because a model gives you a probability, not a guarantee, and the two are not close enough for money work.
A model that is 98% correct on "which client does this email refer to" sounds excellent. In a fiduciary processing thousands of documents, it means roughly one in fifty gets attached to the wrong client, silently, with no exception thrown. You cannot prompt that away. A better prompt takes you from 98% to 99%, which halves the error rate and leaves the problem intact: one document in a hundred still lands on the wrong client, just as silently.
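To make the arithmetic concrete, here is a minimal sketch. The volume of 5,000 documents is an assumed figure chosen for illustration (the text only says "thousands"), and `expected_misfiled` is a hypothetical helper name.

```python
# Assumed volume for illustration; the article only says "thousands of documents".
DOCUMENTS = 5_000


def expected_misfiled(accuracy: float, documents: int = DOCUMENTS) -> float:
    """Expected number of documents attached to the wrong client."""
    return (1.0 - accuracy) * documents


for accuracy in (0.98, 0.99):
    misfiled = expected_misfiled(accuracy)
    print(f"{accuracy:.0%} accurate: about {misfiled:.0f} of {DOCUMENTS} misfiled")
```

Under that assumption, 98% accuracy leaves about 100 misfiled documents and 99% leaves about 50. Neither number is zero, and none of those documents announces itself.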
The mental shift is to stop asking "how do I make the model more accurate" and start asking "how do I make the wrong action impossible to execute." Those are different engineering problems, and only the second one ends.
The pattern: deterministic shell, probabilistic core
The model sits in the middle and does one job: turn messy human input into a structured, checkable intent. Around it, ordinary deterministic code owns everything with a single correct answer.
The model reads an email and proposes: "book this invoice to client X, account Y."
The deterministic layer checks: is X a real, currently-active client in this context? Is Y a legal account for that operation? Is this transition allowed from the current state?
If any check fails, the action is rejected before it touches a system. The model gets to suggest. It never gets to commit.
This inverts the usual trust model. Instead of trusting the model and adding guardrails as an afterthought, you distrust the model by default and let it earn each action through code that cannot be sweet-talked.
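The propose-then-check flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the client IDs, account numbers, action name, and the `Intent`, `Rejected`, and `commit` names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    RECEIVED = "received"
    BOOKED = "booked"


# Hypothetical reference data; in practice this comes from your own systems.
ACTIVE_CLIENTS = {"client-001", "client-002"}
LEGAL_ACCOUNTS = {"book_invoice": {"4000", "4400"}}
LEGAL_TRANSITIONS = {(State.RECEIVED, "book_invoice"): State.BOOKED}


@dataclass(frozen=True)
class Intent:
    """Structured proposal produced by the model; never executed directly."""
    action: str
    client_id: str
    account: str


class Rejected(Exception):
    """Raised when a proposal fails a deterministic check."""


def commit(state: State, intent: Intent) -> State:
    """Run every check; return the next state or raise Rejected."""
    if intent.client_id not in ACTIVE_CLIENTS:
        raise Rejected(f"unknown or inactive client: {intent.client_id}")
    if intent.account not in LEGAL_ACCOUNTS.get(intent.action, set()):
        raise Rejected(f"account {intent.account} not legal for {intent.action}")
    next_state = LEGAL_TRANSITIONS.get((state, intent.action))
    if next_state is None:
        raise Rejected(f"{intent.action} not allowed from state {state.value}")
    # Only at this point would the real write to a system happen.
    return next_state
```

A valid proposal such as `commit(State.RECEIVED, Intent("book_invoice", "client-001", "4000"))` moves to `State.BOOKED`; a proposal naming an unknown client, or one arriving from the wrong state, raises `Rejected` before anything is written.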
The discipline that makes this work is being strict about the boundary. Some things are the model's; most things are not.
| The model owns (fuzzy) | The state machine owns (must be correct) |
| --- | --- |
| Reading intent from messy text | Which client / entity the work applies to |
| Drafting a human-readable summary | Which IDs chain to which records |
| Proposing the next action | Whether a transition is legal from this state |
| Classifying a document type | Executing the write to your systems |
| Suggesting a category | Pausing at anything irreversible |
If you find the model in the right-hand column, that is your bug. The most common failure we see in other teams' agents is letting the model hold a record ID across several turns and "remember" it. Models don't remember; they re-derive, and re-derivation is where the wrong ID sneaks in. The state machine holds the ID. The model is never asked to.
CapabilityIntercept: making the model's output safe to execute
Between the model's suggestion and the actual execution, we run an interception layer. Its whole job is to resolve the fuzzy references in the model's output into exact, verified identifiers before anything runs.
Three things it handles that break naive agents:
Sticky client. Once a conversation is about a specific client, that binding is held in deterministic state, not in the prompt. The model cannot accidentally drift to another client mid-task because it never held the client in the first place.
Coreference. When the user says "that invoice" or "the same account as last time," the deterministic layer resolves the reference against actual records. The model flags that a reference exists; the code decides what it points to.
ID chaining. Operations that depend on each other (this payment belongs to that invoice belongs to this client) are chained by the interceptor against real foreign keys, not by the model stringing IDs together in text.
The model produces intent. CapabilityIntercept turns intent into a verified, executable command, or refuses it. That refusal is a feature. It is the moment a silent error would have happened and didn't.
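The three behaviours above can be sketched together. This is an illustrative sketch only, not the actual CapabilityIntercept code: the record layout, the `"that_invoice"` marker, and the `Session`, `ModelIntent`, `Command`, and `intercept` names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical records; in practice these are rows with real foreign keys.
INVOICES = {
    "inv-17": {"client_id": "client-001"},
    "inv-42": {"client_id": "client-002"},
}


class Refused(Exception):
    """The intent could not be turned into a verified command."""


@dataclass
class Session:
    """Deterministic conversation state; the model never holds these values."""
    client_id: str                         # sticky client binding
    last_invoice_id: Optional[str] = None  # what "that invoice" resolves to


@dataclass(frozen=True)
class ModelIntent:
    """What the model emits: an action plus a reference, never a client ID."""
    action: str
    invoice_ref: str  # "that_invoice", or an invoice number quoted in the text


@dataclass(frozen=True)
class Command:
    action: str
    client_id: str
    invoice_id: str


def intercept(session: Session, intent: ModelIntent) -> Command:
    # Coreference: the code, not the model, decides what the reference points to.
    if intent.invoice_ref == "that_invoice":
        if session.last_invoice_id is None:
            raise Refused("no earlier invoice to refer to")
        invoice_id = session.last_invoice_id
    else:
        invoice_id = intent.invoice_ref
    record = INVOICES.get(invoice_id)
    if record is None:
        raise Refused(f"no such invoice: {invoice_id}")
    # ID chaining and sticky client: the foreign key must match the bound client.
    if record["client_id"] != session.client_id:
        raise Refused(f"{invoice_id} does not belong to {session.client_id}")
    session.last_invoice_id = invoice_id
    return Command(intent.action, session.client_id, invoice_id)
```

In a session bound to `client-001`, an intent referring to `inv-17` resolves to a full command, and a later "that invoice" resolves to the same record. An intent that names `inv-42` is refused, because that invoice chains to a different client.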
The deterministic layer is also what makes human-in-the-loop bearable. Without it, "keep a human in the loop" means a person babysitting every step, which no one sustains. With it, the state machine knows exactly which transitions are irreversible and pauses only there.
A payment, a tax filing, a destructive change: those stop and wait for a human. Everything reversible runs on its own. You get oversight where it matters and speed everywhere else, because the code knows the difference. A model asked to decide "is this risky enough to ask a human" would, again, only give you a probability.
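A minimal sketch of that pause, assuming a hypothetical set of action names: the list of irreversible actions is declared in code, so whether to stop for a human is a lookup rather than a judgment call by the model.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    EXECUTED = "executed"
    AWAITING_APPROVAL = "awaiting_approval"


# Hypothetical action names; the set is maintained by engineers, not by the model.
IRREVERSIBLE = frozenset({"send_payment", "submit_tax_filing", "delete_record"})


def run(action: str, approved_by: Optional[str] = None) -> Outcome:
    """Pause irreversible actions until a named human has approved them."""
    if action in IRREVERSIBLE and approved_by is None:
        return Outcome.AWAITING_APPROVAL
    # Reversible, or explicitly approved: the real side effect would happen here.
    return Outcome.EXECUTED
```

Here `run("draft_summary")` executes immediately, `run("send_payment")` waits, and `run("send_payment", approved_by="reviewer")` executes once a person has signed off.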
Two roads out of an unreliable agent. They lead to different places.
| Prompt harder | Constrain harder |
| --- | --- |
| Reliability is a probability you keep nudging | Reliability is a property of the code |
| Each new edge case is a new prompt tweak | Each new rule is a new transition, tested once |
| Failures are silent and statistical | Failures are explicit refusals you can see |
| Hard to audit (why did it do that?) | Auditable (every action passed a known check) |
| Gets worse as the task space grows | Scales with ordinary software discipline |
The second road is less exciting and far more durable. It is also, not incidentally, the road that lets a regulated business put an AI agent anywhere near its books.
If you are building an agent for anything that touches money, records, or compliance, the model is the easy part. The work is the deterministic shell: the state machine, the interception layer, the explicit human-in-the-loop boundaries. That is where reliability is won, and it is ordinary, testable engineering rather than prompt alchemy.
We learned this building it for a fiduciary, where the cost of "confidently wrong" is measured in someone else's accounts. If your team is trying to get an AI agent from impressive-in-a-demo to trusted-in-production, the shell around the model is the work, and it's the work we do. Talk to us about your agent.
Why can't you make an LLM agent reliable with better prompting?
Because prompting changes the probability of a correct answer, never the guarantee. For a task where being wrong costs money (posting to the wrong account, picking the wrong client), a 98% correct model is still wrong one time in fifty, silently. Reliability has to come from deterministic code that makes the wrong action impossible, not less likely.
What is a deterministic state machine in an AI agent?
It's ordinary code that defines the legal states and transitions for a task, and refuses anything outside them. The LLM proposes an action; the state machine decides whether that action is allowed from the current state and executes it. The model never writes to your systems directly. It makes a suggestion the deterministic layer is free to reject.
Doesn't constraining the model defeat the point of using AI?
No. You constrain what it can do, not what it can understand. The model still does the thing only a model can do well: turn messy human input into a structured intent. You just stop trusting it with the parts that have a single correct answer. The result feels flexible to the user and behaves like software.
When should a human stay in the loop?
On anything irreversible or hard to undo: a payment, a filing, a destructive change. The deterministic layer is what makes human-in-the-loop practical, because it can pause exactly at the irreversible transitions and nowhere else, instead of asking a human to babysit every step.