Two weeks before the BoardPath demo, I sat down and asked myself a question I hadn't fully answered yet: what, exactly, happens when this thing is wrong?
Not wrong in a way that's obvious — a 500 error, a blank response, a clearly garbled answer. Those are easy. I mean wrong in the specific way that AI systems fail in governance contexts: confidently, plausibly, and with citations. A board member reads an answer. It sounds authoritative. The source document is cited by name. The confidence score is high. And the answer is wrong in a way that could cost the association thousands of dollars and months of legal exposure before anyone realizes it.
That's the failure mode that matters. And the only way to know whether you've built a system that can survive it is to try, deliberately, to break it.
The meaningfulCandidates threshold, the gate that decides which retrieved document sections are relevant enough to pass to the model, was set at a combined score of 0.05. The authority bonus for a Declaration section alone is 0.15. That means a Declaration section with zero semantic score and zero keyword match would clear the threshold by a factor of three. The model would receive it labeled "SOURCE 1 — Authority Level: Declaration / CC&Rs" and would attempt to construct an answer from it rather than reporting that the documents didn't address the question. High authority. Zero relevance. The most dangerous hallucination pathway in the pipeline, and it was open by design.
I hadn't put it there intentionally. It was the kind of thing that happens when you're building retrieval logic incrementally — you add an authority bonus to surface higher-quality documents, you set a reasonable-sounding threshold, and you don't notice until you sit down and do the arithmetic that the bonus alone eclipses the gate. The code worked exactly as written. The behavior it produced was quietly catastrophic.
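The arithmetic is easier to see in code. This is a minimal sketch, assuming the combined score is a plain sum of semantic score, keyword score, and authority bonus; the field names and the relevance-first variant are my own illustration, not necessarily what the pipeline does.

```python
from dataclasses import dataclass

MEANINGFUL_THRESHOLD = 0.05         # combined-score gate from the text
DECLARATION_AUTHORITY_BONUS = 0.15  # authority bonus for a Declaration section


@dataclass
class Candidate:
    # Hypothetical field names, chosen for illustration only.
    semantic_score: float
    keyword_score: float
    authority_bonus: float


def passes_original_gate(c: Candidate) -> bool:
    # Assumption: the combined score is a simple sum of all three parts.
    combined = c.semantic_score + c.keyword_score + c.authority_bonus
    return combined >= MEANINGFUL_THRESHOLD


def passes_relevance_first_gate(c: Candidate) -> bool:
    # One possible fix (an assumption): relevance must clear the threshold
    # on its own, and authority is left to influence ranking afterwards.
    return c.semantic_score + c.keyword_score >= MEANINGFUL_THRESHOLD


# A Declaration section with zero relevance to the question.
irrelevant_declaration = Candidate(0.0, 0.0, DECLARATION_AUTHORITY_BONUS)

print(passes_original_gate(irrelevant_declaration))         # True
print(passes_relevance_first_gate(irrelevant_declaration))  # False
```

The point of the second function is only that the gate and the bonus answer different questions: one asks "is this relevant?", the other asks "which relevant source should win?".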
Finding that one issue made me want to find all of them. So I built a systematic hardening gate — a formal testing protocol that had to be cleared before the demo rehearsal could begin. Not a checklist. A gate. The demo doesn't happen until the gate passes.
Here's how it works, and what building it taught me about designing AI systems for environments where failure has real consequences.
The gate is structured in three stages. Each depends on the previous one completing cleanly.
The classification system was the most important design decision in the whole protocol. I resisted the urge to just count failures and set a pass threshold. Pass rates are useful, but they're not sufficient: a system that fails 15% of the time uniformly across all question types is a very different product from one whose 15% failure rate is concentrated in fabrication detection and document scope. One is calibration work. The other is a safety problem.
So the battery is organized by failure mode, not by question topic. Ten categories, each one isolating a specific way the pipeline can produce a wrong answer that a user might believe.
Categories G and J are disqualifying. If the model answers a weather question, or produces a credible-sounding answer about a topic that isn't in the documents, the gate doesn't pass regardless of how everything else performed. These aren't edge cases I'm willing to accept at some low percentage. They're the foundation. If a board member can't trust that the system will say "I don't know" when it doesn't know, nothing else about the product matters.
The gate fails outright on any of the following:
Any FAIL in Category G (out-of-scope): the model must refuse non-governance questions without exception.
Any FAIL in Category J, Test J1 (known-silent topic): this is the core hallucination guardrail. It cannot have a failure rate.
Any fabrication paired with confidence_level: "high": high-confidence fabrications are the most dangerous demo scenario.
Any JSON parse failure or 500 error during testing: the pipeline must be stable before any behavioral evaluation means anything.
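Those disqualifiers translate fairly directly into a gate check. The sketch below is an assumption about shape, not the actual harness: the BatteryResult record, its field names, and the PASS / PARTIAL / FAIL labels are hypothetical, and only the four rules come from the criteria above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BatteryResult:
    # Hypothetical record for one test in the battery.
    test_id: str                           # e.g. "G1", "J1", "E2"
    outcome: str                           # "PASS", "PARTIAL", or "FAIL"
    fabricated: bool = False               # answer contained a fabrication
    confidence_level: Optional[str] = None
    pipeline_error: bool = False           # JSON parse failure or 500 error


def disqualifiers(results: List[BatteryResult]) -> List[str]:
    """Return every reason the gate cannot pass, regardless of pass rate."""
    reasons = []
    for r in results:
        if r.pipeline_error:
            reasons.append(f"{r.test_id}: pipeline error")
        if r.outcome == "FAIL" and r.test_id.startswith("G"):
            reasons.append(f"{r.test_id}: answered an out-of-scope question")
        if r.outcome == "FAIL" and r.test_id == "J1":
            reasons.append("J1: answered a known-silent topic")
        if r.fabricated and r.confidence_level == "high":
            reasons.append(f"{r.test_id}: high-confidence fabrication")
    return reasons


def gate_passes(results: List[BatteryResult]) -> bool:
    return not disqualifiers(results)


results = [
    BatteryResult("A1", "PASS"),
    BatteryResult("E2", "PARTIAL"),  # a documented partial does not block
    BatteryResult("J1", "FAIL", fabricated=True, confidence_level="high"),
]
for reason in disqualifiers(results):
    print(reason)
print("gate passes:", gate_passes(results))
```

Nothing here computes a percentage. A single disqualifying result fails the gate no matter how many other tests passed, which is the difference between a gate and a score.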
The battery also forced several code patches I hadn't prioritized. The most consequential one was adding a documents_silent boolean to the answer schema: a structured field that the system prompt explicitly instructs the model to set when the retrieved sections don't address the question. Before this change, "the corpus is silent" was communicated via a low confidence score and an ambiguity note. Both of those can be wrong. A boolean field cannot equivocate. It gets set or it doesn't. The downstream UI knows what to render. The client application doesn't have to interpret a confidence gradient to understand that no answer exists.
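On the client side, the field turns an interpretation problem into a branch. A minimal sketch, assuming the model returns a JSON object; documents_silent is the real field, while answer_text, parse_answer, render, and the message wording are hypothetical names for illustration.

```python
import json

SILENT_MESSAGE = "The governing documents do not address this question."


def parse_answer(raw: str) -> dict:
    """Parse the model's JSON and insist on an explicit boolean."""
    answer = json.loads(raw)
    if not isinstance(answer, dict):
        raise ValueError("answer must be a JSON object")
    # A missing or non-boolean field is treated as a pipeline failure,
    # never as an implicit "false".
    if not isinstance(answer.get("documents_silent"), bool):
        raise ValueError("documents_silent must be present and boolean")
    return answer


def render(answer: dict) -> str:
    # The UI branches on the boolean alone; it never infers silence
    # from a low confidence score.
    if answer["documents_silent"]:
        return SILENT_MESSAGE
    return answer["answer_text"]  # hypothetical field name


raw = '{"documents_silent": true, "answer_text": "", "confidence_level": "low"}'
print(render(parse_answer(raw)))
```

Rejecting a response that omits the field is a deliberate choice in this sketch: if the model forgets to set it, that should surface as a test failure rather than default quietly to "an answer exists."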
Another patch: removing hardcoded Ohio statute references from the system prompt. I'd built the initial Q&A pipeline against Ohio HOA law (it's what I knew), and somewhere in the iteration cycle, specific statute citations had made it into the model's instructions as examples. For an Ohio association they were correct. For an association in Florida, Nevada, or Texas, the model would produce Ohio legal citations with no qualification. That's not a performance problem. It's a factual error with a professional liability dimension, at scale, in every association that isn't in Ohio. The citations were replaced with generic language and a runtime injection path keyed to the association's stored jurisdiction.
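The injection path can be sketched as a small prompt builder. Everything named here is an assumption for illustration: the function, the lookup table, and the wording of the generic guidance. The notes entry is a placeholder rather than real statute text.

```python
from typing import Optional

# Generic language that stays in the prompt for every association.
GENERIC_LEGAL_GUIDANCE = (
    "When a question turns on state law, say so plainly. Do not cite "
    "specific statutes unless they appear in the jurisdiction notes."
)

# Hypothetical lookup keyed by state code. Real entries would be reviewed
# legal notes; the value below is a placeholder only.
JURISDICTION_NOTES = {
    "OH": "<reviewed notes for this state go here>",
}


def build_system_prompt(base_prompt: str, jurisdiction: Optional[str]) -> str:
    """Assemble the system prompt from generic text plus per-state notes."""
    parts = [base_prompt, GENERIC_LEGAL_GUIDANCE]
    code = (jurisdiction or "").upper()
    notes = JURISDICTION_NOTES.get(code)
    if notes:
        # Injected only when the association's stored jurisdiction matches.
        parts.append(f"Jurisdiction notes ({code}):\n{notes}")
    return "\n\n".join(parts)


print(build_system_prompt("You answer HOA governance questions.", "FL"))
```

An association with no matching entry gets the generic guidance and nothing else, so the failure direction is "no citation" rather than "wrong state's citation."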
The conversation history contamination fix was subtler. Prior Q&A pairs were prepended to the user prompt with the instruction to "answer the current question with awareness of this context" — which sounds reasonable until you think about what a language model does with that instruction. It has latitude to treat prior answers as established facts, to blend citations across different topics, to give a parking answer that silently inherits reasoning from the pets question that preceded it. The fix was a scoping instruction that I should have written the first time: use the history only to interpret follow-up references — "what about that rule?" — and answer each new question solely from the sections provided below. The model's context window is a liability as much as it's an asset if you don't define exactly what it's allowed to use it for.
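In prompt-assembly terms, the fix is mostly about where the instruction sits and how the history is labeled. A sketch under assumptions: the function name, section labels, and layout are hypothetical, and the instruction text paraphrases the scoping rule described above.

```python
from typing import List, Tuple

HISTORY_SCOPE_INSTRUCTION = (
    "Use the conversation history only to interpret follow-up references "
    '(for example, "what about that rule?"). Answer the current question '
    "solely from the sections provided below."
)


def build_user_prompt(
    history: List[Tuple[str, str]],  # prior (question, answer) pairs
    question: str,
    sections: List[str],             # retrieved document sections
) -> str:
    lines = []
    if history:
        lines.append("CONVERSATION HISTORY (for resolving references only):")
        for prior_q, prior_a in history:
            lines.append(f"Q: {prior_q}")
            lines.append(f"A: {prior_a}")
        lines.append("")
    # The scoping instruction comes before the sources it refers to.
    lines.append(HISTORY_SCOPE_INSTRUCTION)
    lines.append("")
    for i, section in enumerate(sections, start=1):
        lines.append(f"SOURCE {i}:")
        lines.append(section)
        lines.append("")
    lines.append(f"CURRENT QUESTION: {question}")
    return "\n".join(lines)


prompt = build_user_prompt(
    history=[("Can I have a dog?", "Yes, subject to the pet rules.")],
    question="What about that rule for guests?",
    sections=["Guest parking is limited to marked visitor spaces."],
)
print(prompt)
```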
What this process clarified is something I now treat as a first principle in agentic system design: you need two separate failure inventories. The first is a risk taxonomy — a systematic enumeration of every failure mode your system could exhibit, classified by severity, before you run a single test. The second is a behavioral test suite that deliberately triggers each failure mode and gives you a documented result. Neither is sufficient alone. The taxonomy without the tests is just speculation. The tests without the taxonomy will miss the failure modes that are hardest to think of — and those are, not coincidentally, the ones that show up in demos.
The other thing it clarified: the correct goal of pre-launch testing is not a perfect pass rate. It's documented known behavior at every failure mode, with a clear line between what's acceptable and what isn't. E2 in my battery — an orphaned amendment test — is a known partial. The amendment exists, the parent link is absent, the authority ranking degrades. I documented it. I classified the conditions under which it appears. I built a warning log that fires when it happens. It didn't block the gate because an orphaned amendment is a data quality issue I can surface to the board, not a silent fabrication they'd never know about.
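The warning log for that case can be very small. This is a sketch, assuming documents are dictionaries with id, type, and parent_id keys; those keys, the logger name, and the function are all hypothetical.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("boardpath.retrieval")  # hypothetical logger name


def find_orphaned_amendments(documents: List[Dict]) -> List[str]:
    """Return ids of amendments whose parent document cannot be found."""
    known_ids = {d["id"] for d in documents}
    orphans = []
    for d in documents:
        if d.get("type") != "amendment":
            continue
        parent = d.get("parent_id")
        if parent is None or parent not in known_ids:
            # Surface the data quality issue instead of failing silently.
            logger.warning(
                "Orphaned amendment %s: parent link %r not found",
                d["id"],
                parent,
            )
            orphans.append(d["id"])
    return orphans


docs = [
    {"id": "decl", "type": "declaration"},
    {"id": "am1", "type": "amendment", "parent_id": "decl"},
    {"id": "am2", "type": "amendment", "parent_id": None},
]
print(find_orphaned_amendments(docs))  # ['am2']
```

Returning the ids as well as logging them leaves room for the caller to show the board which amendments need a parent link.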
That distinction — between failure modes that require remediation before any user sees the system, and failure modes that require documentation and a mitigation path — is the practical output of building a hardening gate instead of just "doing QA." A gate has explicit pass criteria. The criteria encode what you've actually decided about your risk tolerance. Writing them down forces that decision to be made consciously rather than discovered retrospectively when someone's relying on the system for something that matters.
The gate is how I know the demo is ready. Not because I'm confident; I've done enough of this to know that confidence is not the relevant variable. It's ready because I have a documented record of what the system does under adversarial conditions, because I know which failure modes it handles cleanly, and because I know exactly what I built to address the ones it didn't.
That documentation is the deliverable. The passing test results are evidence. The gate itself is a commitment to the people who will be in the room.