Why can't you just prompt an LLM to be reliable?

Because prompting only changes the probability of a correct answer; it never guarantees one. For a task where being wrong costs money (posting to the wrong account, picking the wrong client), a model that is right 98% of the time is still wrong one time in fifty, silently. Reliability has to come from deterministic code that makes the wrong action impossible, not merely less likely.
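To make the one-in-fifty figure concrete, here is a rough back-of-the-envelope sketch. It assumes each action fails independently with the same probability, which is a simplification; the function name is hypothetical and only there for illustration.

```python
def p_at_least_one_error(per_action_accuracy: float, actions: int) -> float:
    """Chance that at least one of `actions` independent actions is wrong.

    Assumption: errors are independent and equally likely on every action.
    """
    return 1.0 - per_action_accuracy ** actions


if __name__ == "__main__":
    # How the risk compounds for a 98%-accurate model as volume grows.
    for n in (1, 50, 500):
        print(n, round(p_at_least_one_error(0.98, n), 3))
```

Under that assumption, the chance of at least one silent error climbs quickly with the number of actions, which is why nudging per-action accuracy upward does not change the shape of the problem.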

What is a deterministic state machine in an AI agent?

It's ordinary code that defines the legal states and transitions for a task, and refuses anything outside them. The LLM proposes an action; the state machine decides whether that action is allowed from the current state and, only if it is, executes it. The model never writes to your systems directly. It makes a suggestion that the deterministic layer is free to reject.
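A minimal sketch of that idea, assuming a made-up three-state workflow for a ledger entry. The state names, action names, and the `apply` function are all hypothetical; the point is only that the set of legal moves lives in a table, not in a prompt.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    REVIEWED = "reviewed"
    POSTED = "posted"


# Legal transitions: (current state, action) -> next state.
# Anything not listed here is illegal by construction.
TRANSITIONS = {
    (State.DRAFT, "review"): State.REVIEWED,
    (State.REVIEWED, "reopen"): State.DRAFT,
    (State.REVIEWED, "post"): State.POSTED,
}


class IllegalTransition(Exception):
    """Raised when the proposed action is not allowed from the current state."""


def apply(state: State, proposed_action: str) -> State:
    """Accept or refuse an action proposed by the model."""
    key = (state, proposed_action)
    if key not in TRANSITIONS:
        raise IllegalTransition(
            f"{proposed_action!r} is not allowed from state {state.value!r}"
        )
    return TRANSITIONS[key]
```

If the model proposes "post" while the entry is still a draft, `apply` raises instead of guessing. The refusal is explicit, and the model's confidence has no bearing on the outcome.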

Doesn't constraining the model defeat the point of using AI?

No. You constrain what it can do, not what it can understand. The model still does the thing only a model can do well: turn messy human input into a structured intent. You just stop trusting it with the parts that have a single correct answer. The result feels flexible to the user and behaves like software.
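One way to picture "structured intent" is a small, strictly validated record that the model's output has to fit before anything else happens. The sketch below assumes the model is asked to reply in JSON; the field names and the allowed action list are hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical set of intents the surrounding code knows how to handle.
ALLOWED_ACTIONS = {"categorize", "draft_summary", "propose_posting"}


@dataclass(frozen=True)
class Intent:
    action: str
    client_hint: str  # free text from the user, NOT yet a trusted client ID


def parse_intent(model_output: str) -> Intent:
    """Turn raw model output into an Intent, or raise ValueError."""
    data = json.loads(model_output)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    client_hint = data.get("client_hint")
    if not isinstance(client_hint, str) or not client_hint.strip():
        raise ValueError("client_hint must be a non-empty string")

    return Intent(action=action, client_hint=client_hint.strip())
```

The model stays free to interpret whatever the user typed; the code only insists that the interpretation arrives in a shape it can check.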

When should a human stay in the loop?

On anything irreversible or hard to undo: a payment, a filing, a destructive change. The deterministic layer is what makes human-in-the-loop practical, because it can pause exactly at the irreversible transitions and nowhere else, instead of asking a human to babysit every step.
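As an illustration, the pause can be a single check in the deterministic layer. This is a sketch under assumptions: the action names in `IRREVERSIBLE`, the `Decision` type, and the `run` function are invented for the example, and a real system would persist the pending approval rather than return it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical list of actions that cannot be cheaply undone.
IRREVERSIBLE = {"send_payment", "submit_filing", "delete_records"}


@dataclass(frozen=True)
class Decision:
    action: str
    status: str  # "executed" or "awaiting_approval"


def run(action: str, approved_by: Optional[str] = None) -> Decision:
    """Execute reversible actions directly; hold irreversible ones for a human."""
    if action in IRREVERSIBLE and approved_by is None:
        return Decision(action, "awaiting_approval")
    # Placeholder for the real side effect, which is out of scope here.
    return Decision(action, "executed")
```

A reversible step such as drafting a summary goes straight through, while `run("send_payment")` stops until a named person approves it. The human is interrupted only where it matters.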

tsukumo

Deterministic state machines over LLMs: the reliable-agent pattern · tsukumo

Agency · 17 June 2026 · 5 min read

GPT isn't enough: we wrap deterministic state machines around the LLM

The reliable parts of a production AI agent aren't in the model. They're in the deterministic code wrapped around it. Stop trying to prompt your way to correctness and start constraining what the model is allowed to do.

The short answer

You don't make an LLM reliable by giving it a better prompt or a bigger model. You wrap it in deterministic code that constrains what it is allowed to do. The model handles the fuzzy part (reading intent); a state machine owns every step that has to be correct (which client, which ID, which transition is legal). In production accounting AI, reliability lives in the shell around the model, not in the model.

Short version: the parts of a production AI agent you can trust are not in the model. They are in the deterministic code wrapped around it. Most teams try to reach reliability by prompting harder or upgrading the model. That moves the probability of a correct answer; it never makes a wrong answer impossible. For work where being wrong costs real money, you want the opposite approach: let the model do the fuzzy part, and put a state machine in charge of everything that has to be correct. We build accounting AI for a Swiss fiduciary, where "confidently wrong" is the worst outcome there is, and this is the pattern that holds up.

The model owns (fuzzy)	The state machine owns (must be correct)
Reading intent from messy text	Which client / entity the work applies to
Drafting a human-readable summary	Which IDs chain to which records
Proposing the next action	Whether a transition is legal from this state
Classifying a document type	Executing the write to your systems
Suggesting a category	Pausing at anything irreversible
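The first row of the right-hand column can be sketched as a lookup that either finds exactly one client or refuses. The registry contents and the `resolve_client` function are hypothetical; a real system would query its own client database.

```python
# Hypothetical client registry: ID -> legal name.
CLIENTS = {
    "C-001": "Muster AG",
    "C-002": "Muster Holding AG",
    "C-003": "Beispiel GmbH",
}


def resolve_client(hint: str) -> str:
    """Map a name extracted by the model to exactly one client ID.

    Only a case-insensitive exact match on the legal name is accepted.
    Partial or ambiguous names are refused rather than guessed.
    """
    needle = hint.strip().casefold()
    matches = [cid for cid, name in CLIENTS.items() if name.casefold() == needle]
    if len(matches) != 1:
        raise LookupError(f"cannot resolve client from {hint!r}")
    return matches[0]
```

Here a hint of "Muster" alone is refused, because it could mean either Muster entity. The model may suggest which client the user meant, but the ID that reaches the ledger comes from this lookup.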

Prompt harder	Constrain harder
Reliability is a probability you keep nudging	Reliability is a property of the code
Each new edge case is a new prompt tweak	Each new rule is a new transition, tested once
Failures are silent and statistical	Failures are explicit refusals you can see
Hard to audit (why did it do that?)	Auditable (every action passed a known check)
Gets worse as the task space grows	Scales with ordinary software discipline
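The "explicit refusals" and "auditable" rows can be illustrated with a gate that records every check it runs. This is a sketch, not a prescribed design: the `AuditedGate` class, the check names, and the shape of the action dictionary are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[dict], bool]]


@dataclass
class AuditedGate:
    checks: List[Check]
    log: List[Dict[str, object]] = field(default_factory=list)

    def admit(self, action: dict) -> bool:
        """Run the checks in order and log each result; stop at the first failure."""
        for name, check in self.checks:
            passed = bool(check(action))
            self.log.append(
                {"action": action.get("type"), "check": name, "passed": passed}
            )
            if not passed:
                return False
        return True


def make_gate() -> AuditedGate:
    # Two illustrative rules; each rule is one named, testable function.
    return AuditedGate(checks=[
        ("known_action_type", lambda a: a.get("type") in {"categorize", "post"}),
        ("amount_is_positive", lambda a: a.get("amount", 0) > 0),
    ])
```

Every admitted action leaves a trail of the checks it passed, and every refusal names the check that stopped it, so "why did it do that?" has an answer in the log.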

Why doesn't a better LLM make your agent reliable?

The pattern: deterministic shell, probabilistic core

What the state machine owns

CapabilityIntercept: making the model's output safe to execute

Where humans stay in the loop

Prompt harder, or constrain harder?

What this means for your stack
