It was around 3 AM on a Thursday when a single document broke my entire OCR pipeline and forced me to rethink the architecture from the ground up.
The document was 137 pages. It was a hybrid — a compilation of a declaration, a set of bylaws, an outdated rulebook that had since been superseded, articles of incorporation, and the recorder's stamps, notary blocks, and signature pages for all of the above, assembled into a single PDF whose scan quality varied throughout and averaged mediocre at best. The kind of document that exists in the real world of community association governance and nowhere else: decades of legal history stapled together, photocopied multiple times, scanned by someone who clearly had other things on their mind.
My single-stage pipeline choked on it completely. Not gracefully. The extraction output was wreckage: duplicate section numbers, misattributed headings, random character strings where sentences should have been, recorder's stamps indexed as governing language, and notary blocks treated as content. The table of contents bore almost no resemblance to what was actually extracted. I had started with pdf2json and already upgraded once — but I was staring at proof that one model, no matter how capable, was not going to handle the full spectrum of what BoardPath's document corpus would look like in production.
The scale of the problem became clear fast. When I started ingesting real governing documents from former client associations and pulling public document sets from county recorder websites for testing, the quality distribution was not what a clean benchmark corpus looks like. It was what real HOA governance actually looks like.
A pipeline optimized for the clean 10% is useless for a product that lives or dies on the quality of its answers across the other 90%. If BoardPath can't reliably extract a 1974 declaration that's been photocopied six times, it can't be trusted. And an AI governance platform that can't be trusted is worse than no platform at all — because it produces confident-sounding answers built on corrupted source material.
The three-stage architecture that emerged from that Thursday night is designed around one principle: every document gets the extraction quality it actually requires, not a one-size-fits-all pass that leaves the hard cases unresolved.
The 20% confidence floor at Stage 2 is a deliberate design decision worth explaining. When Mistral's OCR run produces output below that threshold, I don't send the document to Google Vision. The reason is straightforward: if the document is that degraded, Google Vision is unlikely to produce output that's reliable enough to cite in a governance answer. Spending the compute to confirm what's already apparent — that the source material is illegible — isn't efficient. It's expensive noise.
Those documents get flagged for human review instead. In a future version of BoardPath, that flag will trigger a human-in-the-loop workflow — a pool of dedicated reviewers who can locate a better quality version of the document, manually transcribe legible sections, or otherwise close the gap that technology can't close on genuinely degraded source material. That's an additional service tier, an additional cost to clients who need it, and an honest acknowledgment that some problems don't have a purely automated solution. Designing around that reality is better than pretending it doesn't exist.
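The routing logic those two sections describe can be sketched in a few lines. This is an illustrative TypeScript sketch, not BoardPath's actual code: the 20% floor and the Mistral-to-Google-Vision escalation path come from the text, while the function names, the `OcrResult` shape, and the 80% acceptance threshold are assumptions for the example.

```typescript
// Stage 2 routing sketch. Only the 0.20 floor and the escalation target
// come from the real pipeline; everything else here is hypothetical.
type Route = "accept" | "escalate_google_vision" | "human_review";

interface OcrResult {
  confidence: number; // 0.0–1.0, averaged over the Mistral OCR pass (assumed shape)
}

function routeAfterStage2(result: OcrResult): Route {
  if (result.confidence < 0.2) {
    // Below the floor: a second OCR engine would only confirm the scan
    // is illegible, so flag for a human instead of spending compute.
    return "human_review";
  }
  if (result.confidence < 0.8) {
    // Moderate confidence: worth escalating to a higher-accuracy pass.
    // The 0.8 cutoff is an illustrative assumption, not the real value.
    return "escalate_google_vision";
  }
  return "accept";
}
```

The point of keeping this as an explicit routing function, rather than a retry loop, is that "too degraded to bother" becomes a first-class outcome instead of an error path.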
The 3 AM document also taught me something about what to strip out of the extraction pipeline entirely. Recorder's stamps, document numbers, notary blocks, signature pages — these are legally required components of recorded governing documents and they appear on almost every page of a compiled document set. They are also completely useless for the purpose of building a queryable governance corpus. An OCR pipeline that doesn't know to ignore them will index them as content, producing noise that pollutes every downstream retrieval operation. The system prompt and tooling instructions for the extraction agents now explicitly exclude this class of content — not as a suggestion, but as a hardcoded constraint on what gets indexed.
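A minimal version of that exclusion constraint might look like the sketch below. The pattern list is hypothetical — BoardPath's real rules live in the extraction agents' tooling instructions — but the shape of the idea is the same: match the legally required boilerplate and keep it out of the index.

```typescript
// Illustrative filter for non-governing boilerplate. These patterns are
// assumptions for the example, not BoardPath's actual exclusion rules.
const EXCLUDED_PATTERNS: RegExp[] = [
  /recorded\s+at\s+the\s+request\s+of/i, // recorder's stamps
  /document\s+(no|number)[.:]?\s*\d+/i,  // recorder document numbers
  /notary\s+public/i,                    // notary blocks
  /in\s+witness\s+whereof/i,             // signature-page preambles
];

function isIndexable(block: string): boolean {
  // A block is indexed only if it matches none of the excluded patterns.
  return !EXCLUDED_PATTERNS.some((pattern) => pattern.test(block));
}

const blocks = [
  "Section 4.2: Assessments shall be levied annually.",
  "Subscribed before me, Notary Public in and for said County",
];
const governing = blocks.filter(isIndexable); // keeps only the first block
```

Because it runs before indexing, the filter doesn't depend on the model agreeing to ignore anything — the noise simply never enters the corpus.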
The confidence spectrum also exposed what I consider the genuinely dangerous category of OCR output — and it isn't the obviously broken extractions.
[Diagram: human review queue; human reviewed until protocol proven; … and retrieval]
The moderate-confidence category requires a specific architectural response that a system prompt bullet point cannot provide. If an agent is operating on an extraction that's missing two sections of Article IV, a well-designed general-purpose LLM will notice the gap — and fill it. Not maliciously. Because that's what language models do when they encounter incomplete text that they can pattern-complete from context. The resulting output will sound authoritative, cite Article IV correctly, and be factually wrong in ways that are difficult to detect without reading the original document alongside the answer.
A system prompt instruction that says "do not fill gaps" will not reliably prevent this. A hardcoded tool exclusion that structurally prevents the agent from generating answers from documents flagged as moderate-confidence — without human review — makes the failure mode impossible rather than discouraged. That's a different level of architectural protection, and it's the one BoardPath uses.
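The difference between a prompt-level rule and a structural one is easiest to see in code. The sketch below is hypothetical — the interface names, the confidence tiers, and the storage accessor are all invented for illustration — but it shows the shape of the guarantee: the retrieval tool itself throws on unreviewed moderate-confidence documents, so the text never reaches the model at all.

```typescript
// Structural gate sketch (all names hypothetical). The agent's retrieval
// tool refuses flagged documents outright, so no prompt can route around it.
interface DocMeta {
  id: string;
  confidenceTier: "high" | "moderate" | "low";
  humanReviewed: boolean;
}

class RetrievalGateError extends Error {}

function fetchForAnswering(doc: DocMeta): string {
  if (doc.confidenceTier !== "high" && !doc.humanReviewed) {
    // Hard failure, not a suggestion: the extraction never reaches the
    // model, so there is no gap for it to pattern-complete.
    throw new RetrievalGateError(
      `Document ${doc.id} requires human review before citation.`
    );
  }
  return loadExtractedText(doc.id);
}

function loadExtractedText(id: string): string {
  // Stub standing in for an assumed storage accessor.
  return `extracted text for ${id}`;
}
```

An instruction can be ignored under the right pressure; an exception cannot.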
The 137-page Thursday document that broke my pipeline was the best thing that could have happened to the architecture. It surfaced every edge case simultaneously — hybrid document types, mixed scan quality, irrelevant recorded content, format chaos — before any of those edge cases could reach a production user and produce a wrong answer. The system is more robust because of what broke it.
That's the only honest way to build something that handles real documents under real legal stakes. You run it against the worst material you can find. You watch it fail. You build the thing that doesn't.