It was around 3 AM on a Thursday when a single document broke my entire OCR pipeline and forced me to rethink the architecture from the ground up.
The document was 137 pages. It was a hybrid — a compilation of a declaration, a set of bylaws, an outdated rulebook that had since been superseded, articles of incorporation, and the recorder's stamps, notary blocks, and signature pages for all of the above, assembled into a single PDF whose scan quality varied throughout and averaged mediocre at best. The kind of document that exists in the real world of community association governance and nowhere else: decades of legal history stapled together, photocopied multiple times, scanned by someone who clearly had other things on their mind.
My single-stage pipeline choked on it completely. Not gracefully. The extraction output was wreckage: duplicate section numbers, misattributed headings, random character strings where sentences should have been, recorder's stamps indexed as governing language, and notary blocks treated as content. The table of contents bore almost no resemblance to what was actually extracted. I had started with pdf2json and already upgraded once — but I was staring at proof that one model, no matter how capable, was not going to handle the full spectrum of what BoardPath's document corpus would look like in production.
The scale of the problem became clear fast. When I started ingesting real governing documents from former client associations and pulling public document sets from county recorder websites for testing, the quality distribution was not what a clean benchmark corpus looks like. It was what real HOA governance actually looks like.
A pipeline optimized for the clean 10% is useless for a product that lives or dies on the quality of its answers across the other 90%. If BoardPath can't reliably extract a 1974 declaration that's been photocopied six times, it can't be trusted. And an AI governance platform that can't be trusted is worse than no platform at all — because it produces confident-sounding answers built on corrupted source material.
The three-stage architecture that emerged from that Thursday night is designed around one principle: every document gets the extraction quality it actually requires, not a one-size-fits-all pass that leaves the hard cases unresolved.
The 20% confidence floor at Stage 2 is a deliberate design decision worth explaining. When Mistral's OCR run produces output below that threshold, I don't send the document to Google Vision. The reason is straightforward: if the document is that degraded, Google Vision is unlikely to produce output that's reliable enough to cite in a governance answer. Spending the compute to confirm what's already apparent — that the source material is illegible — isn't efficient. It's expensive noise.
Those documents get flagged for human review instead. In a future version of BoardPath, that flag will trigger a human-in-the-loop workflow — a pool of dedicated reviewers who can locate a better quality version of the document, manually transcribe legible sections, or otherwise close the gap that technology can't close on genuinely degraded source material. That's an additional service tier, an additional cost to clients who need it, and an honest acknowledgment that some problems don't have a purely automated solution. Designing around that reality is better than pretending it doesn't exist.
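The routing logic those two sections describe can be sketched in a few lines. This is an illustrative TypeScript sketch, not BoardPath's actual code: the 20% floor and the Mistral-to-Google-Vision escalation path come from the text, while the function names, the `OcrResult` shape, and the 80% acceptance threshold are assumptions for the example.

```typescript
// Stage 2 routing sketch. Only the 0.20 floor and the escalation target
// come from the real pipeline; everything else here is hypothetical.
type Route = "accept" | "escalate_google_vision" | "human_review";

interface OcrResult {
  confidence: number; // 0.0–1.0, averaged over the Mistral OCR pass (assumed shape)
}

function routeAfterStage2(result: OcrResult): Route {
  if (result.confidence < 0.2) {
    // Below the floor: a second OCR engine would only confirm the scan
    // is illegible, so flag for a human instead of spending compute.
    return "human_review";
  }
  if (result.confidence < 0.8) {
    // Moderate confidence: worth escalating to a higher-accuracy pass.
    // The 0.8 cutoff is an illustrative assumption, not the real value.
    return "escalate_google_vision";
  }
  return "accept";
}
```

The point of keeping this as an explicit routing function, rather than a retry loop, is that "too degraded to bother" becomes a first-class outcome instead of an error path.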
The 3 AM document also taught me something about what to strip out of the extraction pipeline entirely. Recorder's stamps, document numbers, notary blocks, signature pages — these are legally required components of recorded governing documents and they appear on almost every page of a compiled document set. They are also completely useless for the purpose of building a queryable governance corpus. An OCR pipeline that doesn't know to ignore them will index them as content, producing noise that pollutes every downstream retrieval operation. The system prompt and tooling instructions for the extraction agents now explicitly exclude this class of content — not as a suggestion, but as a hardcoded constraint on what gets indexed.
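A minimal version of that exclusion constraint might look like the sketch below. The pattern list is hypothetical — BoardPath's real rules live in the extraction agents' tooling instructions — but the shape of the idea is the same: match the legally required boilerplate and keep it out of the index.

```typescript
// Illustrative filter for non-governing boilerplate. These patterns are
// assumptions for the example, not BoardPath's actual exclusion rules.
const EXCLUDED_PATTERNS: RegExp[] = [
  /recorded\s+at\s+the\s+request\s+of/i, // recorder's stamps
  /document\s+(no|number)[.:]?\s*\d+/i,  // recorder document numbers
  /notary\s+public/i,                    // notary blocks
  /in\s+witness\s+whereof/i,             // signature-page preambles
];

function isIndexable(block: string): boolean {
  // A block is indexed only if it matches none of the excluded patterns.
  return !EXCLUDED_PATTERNS.some((pattern) => pattern.test(block));
}

const blocks = [
  "Section 4.2: Assessments shall be levied annually.",
  "Subscribed before me, Notary Public in and for said County",
];
const governing = blocks.filter(isIndexable); // keeps only the first block
```

Because it runs before indexing, the filter doesn't depend on the model agreeing to ignore anything — the noise simply never enters the corpus.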
The confidence spectrum also exposed what I consider the genuinely dangerous category of OCR output — and it isn't the obviously broken extractions.
[Diagram: human review queue; human reviewed until protocol proven; … and retrieval]
The moderate-confidence category requires a specific architectural response that a system prompt bullet point cannot provide. If an agent is operating on an extraction that's missing two sections of Article IV, a well-designed general-purpose LLM will notice the gap — and fill it. Not maliciously. Because that's what language models do when they encounter incomplete text that they can pattern-complete from context. The resulting output will sound authoritative, cite Article IV correctly, and be factually wrong in ways that are difficult to detect without reading the original document alongside the answer.
A system prompt instruction that says "do not fill gaps" will not reliably prevent this. A hardcoded tool exclusion that structurally prevents the agent from generating answers from documents flagged as moderate-confidence — without human review — makes the failure mode impossible rather than discouraged. That's a different level of architectural protection, and it's the one BoardPath uses.
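The difference between a prompt-level rule and a structural one is easiest to see in code. The sketch below is hypothetical — the interface names, the confidence tiers, and the storage accessor are all invented for illustration — but it shows the shape of the guarantee: the retrieval tool itself throws on unreviewed moderate-confidence documents, so the text never reaches the model at all.

```typescript
// Structural gate sketch (all names hypothetical). The agent's retrieval
// tool refuses flagged documents outright, so no prompt can route around it.
interface DocMeta {
  id: string;
  confidenceTier: "high" | "moderate" | "low";
  humanReviewed: boolean;
}

class RetrievalGateError extends Error {}

function fetchForAnswering(doc: DocMeta): string {
  if (doc.confidenceTier !== "high" && !doc.humanReviewed) {
    // Hard failure, not a suggestion: the extraction never reaches the
    // model, so there is no gap for it to pattern-complete.
    throw new RetrievalGateError(
      `Document ${doc.id} requires human review before citation.`
    );
  }
  return loadExtractedText(doc.id);
}

function loadExtractedText(id: string): string {
  // Stub standing in for an assumed storage accessor.
  return `extracted text for ${id}`;
}
```

An instruction can be ignored under the right pressure; an exception cannot.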
The 137-page Thursday document that broke my pipeline was the best thing that could have happened to the architecture. It surfaced every edge case simultaneously — hybrid document types, mixed scan quality, irrelevant recorded content, format chaos — before any of those edge cases could reach a production user and produce a wrong answer. The system is more robust because of what broke it.
That's the only honest way to build something that handles real documents under real legal stakes. You run it against the worst material you can find. You watch it fail. You build the thing that doesn't.