What is document intelligence versus a document-management system?

A document-management system (DMS) stores, versions, and routes files that humans have already filed and tagged. Document intelligence does the filing and tagging itself: it takes raw scans and emails, splits them into individual documents, classifies each, and extracts the key fields. The DMS is the cabinet; doc-intelligence is the clerk who decides what goes in which drawer.

Can an LLM replace a document-management system on its own?

Not on its own. An LLM is good at reading and answering questions about a document you hand it. It does not, by itself, split a 200-page scan into the right documents, classify reliably across a hundred-plus categories, or extract fields consistently at scale. Those need a pipeline of specialized stages around the model, which is most of the real work.

How do you classify documents into many real categories?

Not with a five-label demo classifier. A real business often has a hundred-plus categories with overlapping language, so classification needs the actual taxonomy, training or prompting against real examples, and a confidence threshold that routes uncertain documents to a human instead of guessing. Getting from 5 demo classes to 142 real ones is where most projects stall.

What is the hardest part of automating document processing?

Splitting and classification, not extraction. Everyone pictures the AI reading a clean invoice. The real difficulty is upstream: a scanned bundle that is actually twelve different documents, and a taxonomy with dozens of near-identical categories. Get those wrong and everything downstream is confidently filed in the wrong place.

Replacing a legacy DMS with document intelligence (the pipeline)

tsukumo


Short version: "AI reads your documents" is the easy demo and the wrong mental model. When we replaced a legacy document-management system with a doc-intelligence platform, the model reading a page was the part that already worked. The work was the boring pipeline around it: splitting a scanned bundle into the actual documents inside it, classifying each one into 142 real business categories, and extracting the entities that matter, reliably, across an 89,000-document backlog. A chatbot over your files impresses in a meeting. Beating the humans you are replacing on splitting and classification is what ships.

Why isn't "AI reads your documents" enough?

Because the document you hand the AI in the demo is already split, already classified, and already clean. Real ones are none of those.

The picture everyone has is an LLM reading a tidy invoice and pulling out the total. That part genuinely works now. But it assumes the upstream problems are solved: that someone already separated this invoice from the eleven other documents in the same scan, already decided it was an invoice and not a delivery note, and already produced legible text. In a real backlog, nothing upstream is done. The model is waiting on a pipeline that doesn't exist yet, and building that pipeline is the project.

The demo starts at the one place the real work has already finished. That gap, between "AI can read this page" and "AI can process this backlog," is the whole engagement.

The pipeline: split, then classify, then extract

A doc-intelligence system is three stages, and they run in this order for a reason.

Split. Turn a raw bundle (a 200-page scan, an email with attachments) into the individual documents it actually contains. Get this wrong and every later stage is operating on the wrong unit.
Classify. Sort each document into the real category it belongs to, out of the full business taxonomy, with a confidence score.
Extract. Pull the fields and entities that matter from each classified document.

Most teams build the third stage, extraction, first because it is the fun one, then discover the system fails at the first two: the documents arriving at the extractor are still bundled and unlabelled.
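To make the ordering concrete, here is a minimal sketch of the three stages wired together. It is an illustration under stated assumptions, not the production system: the `Document` type, `run_pipeline`, and the three toy stage functions are hypothetical names, and the toy rules (a page starting with "NEW:" opens a document, the word "invoice" decides the category) stand in for real models.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Pages = List[str]  # one string of text per page


@dataclass
class Document:
    pages: Pages
    category: Optional[str] = None
    confidence: float = 0.0
    fields: Dict[str, str] = field(default_factory=dict)


def run_pipeline(bundle, split, classify, extract):
    """Run split -> classify -> extract, in that order, over one raw bundle."""
    documents = []
    for pages in split(bundle):                 # stage 1: find the real units
        category, confidence = classify(pages)  # stage 2: label each unit
        fields = extract(category, pages)       # stage 3: pull fields per category
        documents.append(Document(pages, category, confidence, fields))
    return documents


# Toy stand-ins for the real stages (hypothetical rules, for illustration only).
def toy_split(bundle):
    docs = []
    for page in bundle:
        if not docs or page.startswith("NEW:"):
            docs.append([])
        docs[-1].append(page)
    return docs


def toy_classify(pages):
    text = " ".join(pages).lower()
    return ("invoice", 0.9) if "invoice" in text else ("unknown", 0.3)


def toy_extract(category, pages):
    return {"first_line": pages[0]}


bundle = ["NEW: Invoice 17", "total due 40", "NEW: Letter"]
for doc in run_pipeline(bundle, toy_split, toy_classify, toy_extract):
    print(doc.category, doc.confidence, len(doc.pages))
```

Because the extractor only ever sees what the splitter and classifier hand it, any mistake in the first two stages reaches it as a wrong input rather than as a visible failure.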

Splitting: the stage everyone forgets

A scan is not a document. It is a stack, and the AI has to find the seams.

A single PDF from a scanner is routinely several distinct documents joined end to end: a contract, then a form, then three invoices, then a letter. A human flipping through finds the boundaries without thinking. An automated pipeline has to detect them, and a wrong boundary poisons everything after it, because now an invoice's second page is classified as the start of a new document. Splitting is unglamorous, it is rarely in the demo, and it is the first thing that breaks at scale.
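One common way to frame boundary detection is as a per-page decision: does this page start a new document? The sketch below assumes some upstream model has already produced a "starts a new document" score for each page; the function name `split_bundle`, the scores, and the 0.5 threshold are all hypothetical and only illustrate the grouping step.

```python
from typing import List, Sequence


def split_bundle(pages: Sequence[str],
                 start_scores: Sequence[float],
                 threshold: float = 0.5) -> List[List[str]]:
    """Group pages into documents, opening a new one wherever the
    (assumed) start-of-document score reaches the threshold."""
    if len(pages) != len(start_scores):
        raise ValueError("need exactly one score per page")
    documents: List[List[str]] = []
    for page, score in zip(pages, start_scores):
        if not documents or score >= threshold:
            documents.append([page])      # boundary: open a new document
        else:
            documents[-1].append(page)    # continuation of the current one
    return documents


pages = ["contract p1", "contract p2", "invoice p1", "invoice p2"]

# Clean scores: two documents, as a human would split them.
print(split_bundle(pages, [0.99, 0.05, 0.97, 0.10]))

# One bad score on the invoice's second page produces a third, bogus document.
print(split_bundle(pages, [0.99, 0.05, 0.97, 0.60]))
```

The second call shows the failure mode described above: a single wrong boundary turns an invoice's second page into a standalone "document" that the classifier will then try to label.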

Classifying into 142 real categories, not 5

Demos classify into a handful of clean buckets. The business had 142, and many of them looked almost the same.

A real taxonomy is not "invoice, contract, other." It is dozens of categories with overlapping language and subtle distinctions that matter legally or operationally. A classifier trained on five labels collapses the moment it meets the real list. What works is the actual taxonomy, real labelled examples, and, critically, a confidence threshold: a document the model is unsure about goes to a human queue instead of being filed wrong with false confidence. The jump from 5 demo classes to 142 real ones is where most document-AI projects quietly stall.
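The confidence threshold can be sketched in a few lines. This is a minimal illustration, assuming the classifier returns a score per category; the category names, the 0.85 threshold, and the extra margin check between the top two categories are assumptions added here for near-identical categories, not values from the project.

```python
from typing import Dict, NamedTuple


class Routing(NamedTuple):
    category: str
    confidence: float
    needs_review: bool


def route(scores: Dict[str, float],
          min_confidence: float = 0.85,
          min_margin: float = 0.10) -> Routing:
    """File a document automatically only when the top category is both
    confident and clearly ahead of the runner-up; otherwise flag it for
    the human review queue."""
    if not scores:
        raise ValueError("classifier returned no scores")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_category, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    needs_review = (best_score < min_confidence
                    or best_score - runner_up < min_margin)
    return Routing(best_category, best_score, needs_review)


# Clear winner: filed automatically.
print(route({"supplier_invoice": 0.93, "credit_note": 0.04}))

# Two near-identical categories: sent to a human instead of guessed.
print(route({"supplier_invoice": 0.48, "pro_forma_invoice": 0.45}))
```

Where the thresholds sit is a business decision: a stricter setting sends more documents to people, a looser one files more documents wrong.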

Extraction that survives a messy backlog

Once a document is split and classified, extraction is the part that most resembles the demo, but it still has to survive bad scans.

This is where a focused entity model earns its place. We used GLiNER, an open named-entity model, for reliable extraction across document types, rather than asking a general LLM to free-form every field and hoping the format held. The point is the same one that runs through every production-AI story: use the right tool for each stage, and don't make one model carry a job a smaller, specialized one does more reliably.
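As a rough sketch of what that looks like in code: the open-source `gliner` package exposes `GLiNER.from_pretrained` and `predict_entities`, which returns a list of dicts carrying `text`, `label`, and `score`. The checkpoint name, the labels, the threshold, and the keep-the-highest-score rule below are illustrative assumptions, not the configuration we shipped.

```python
from typing import Dict, List


def load_gliner(model_name: str = "urchade/gliner_medium-v2.1"):
    """Load a GLiNER checkpoint. The checkpoint name is an assumption;
    requires `pip install gliner` and downloads weights on first use."""
    from gliner import GLiNER  # imported lazily so the rest runs without it
    return GLiNER.from_pretrained(model_name)


def extract_fields(model, text: str, labels: List[str],
                   threshold: float = 0.5) -> Dict[str, str]:
    """Ask the entity model for the given labels and keep the single
    highest-scoring span per label, so the output shape is always
    {label: text} instead of free-form model output."""
    entities = model.predict_entities(text, labels, threshold=threshold)
    best: Dict[str, dict] = {}
    for entity in entities:
        current = best.get(entity["label"])
        if current is None or entity["score"] > current["score"]:
            best[entity["label"]] = entity
    return {label: entity["text"] for label, entity in best.items()}


# Usage (not run here, because it downloads a model):
# model = load_gliner()
# fields = extract_fields(
#     model,
#     "Invoice 2024-117 issued by Acme GmbH, total due 4,200 EUR",
#     labels=["invoice number", "supplier", "total amount"],
# )
```

The design point is the fixed output shape: each classified document type gets its own label list, and a label the model cannot find is simply absent rather than invented.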

Legacy DMS vs a doc-intelligence pipeline

The old system and the new one are doing different jobs, which is exactly why "just buy a better DMS" misses the point.

	Legacy DMS (M-Files-style)	Doc-intelligence pipeline
Who files the document	A human, manually	The pipeline (split + classify)
Who tags / extracts	A human keys the fields	The model, with human review on low confidence
Unit of work	A file someone already separated	A raw bundle the system splits
Where it breaks	Human throughput and consistency	Splitting and classification accuracy
What it costs to scale

A DMS is a filing cabinet that assumes a clerk. Doc-intelligence is the clerk. Replacing one with the other is not a migration, it is automating the judgement the clerk was doing.

What it takes to beat the humans you're replacing

Automation only counts if it is more accurate than the people it replaces, not just faster.

That bar is unforgiving on an 89,000-document backlog: a 2% misclassification rate is roughly 1,780 documents filed wrong, quietly, in a system people will trust. So the design has to know its own uncertainty and escalate, treat splitting as a first-class accuracy problem, and prove itself against a human-labelled set before it is allowed to run unattended. The win is real, but it comes from the pipeline's discipline, not from the model's cleverness.
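The arithmetic and the acceptance gate are simple enough to write down. This is a sketch under assumptions: the function names are hypothetical, and comparing the pipeline and the clerks against the same adjudicated gold labels is one reasonable way to run the check, not necessarily the exact protocol used.

```python
from typing import Sequence


def expected_misfiles(backlog_size: int, error_rate: float) -> int:
    """Documents filed wrong at a given misclassification rate."""
    return round(backlog_size * error_rate)


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of documents whose predicted category matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need equal-length, non-empty label lists")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def may_run_unattended(pipeline_labels: Sequence[str],
                       human_labels: Sequence[str],
                       gold_labels: Sequence[str]) -> bool:
    """Gate: the pipeline must be strictly more accurate than the humans
    it replaces, measured on the same gold set."""
    return accuracy(pipeline_labels, gold_labels) > accuracy(human_labels, gold_labels)


# 2% of the 89,000-document backlog.
print(expected_misfiles(89_000, 0.02))  # 1780

# Tiny illustrative gold set (made-up labels).
gold = ["invoice", "contract", "letter", "invoice"]
clerks = ["invoice", "contract", "invoice", "invoice"]    # 3 of 4 correct
pipeline = ["invoice", "contract", "letter", "invoice"]   # 4 of 4 correct
print(may_run_unattended(pipeline, clerks, gold))
```

In practice the gold set needs to be large enough, and spread across enough of the 142 categories, that the comparison means something for the rare ones too.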

What this means for your team

If you are looking at an aging document system and a vendor is selling you "AI search over your files," ask what happens before the search: who splits the bundles, who classifies into your real taxonomy, what happens when the model is unsure. The model reading a page is solved. The pipeline that gets a messy backlog to that page is the engagement, and it is the part that decides whether the thing is trustworthy.

We build these where the document is the business and a misfile is a real problem. If your team is trying to retire a legacy document system with AI that actually beats the status quo, that's the work we do.

