BoardPath had a scoring engine before it had paying customers. Every answer the system gave a board came back with a 0–100 confidence score and a per-dimension breakdown — authority, grounding, retrieval quality, and the rest. It was the part of the product I was proudest of, because it was the part that made a skeptic willing to act on an AI answer.
Then I noticed I kept reaching for it in other projects. A forensic document tool wanted the same thing. A side experiment wanted it. Each time, I copy-pasted a slightly different version of the same logic — and each copy drifted a little further from the others.
That's the moment a feature is telling you it wants to be a library.
What It Actually Was
The scoring engine wasn't governance logic. It never had been. It didn't know what a CC&R was, didn't retrieve anything, didn't call a model. It took signals a RAG pipeline already produces — retrieval scores, document metadata, citation overlap, how much the retrieved chunks agreed with each other — and composed them into one auditable number with a reason attached to every point.
The governance lived in which documents got ranked how. The scoring was domain-agnostic the whole time. I just hadn't drawn the boundary.
Three projects, three slightly different copies of the same scorer, none of them the source of truth. A bug fixed in one didn't reach the others. A new dimension added in one made the others quietly out of date. The logic wasn't specific to any of the three products it lived inside — which is exactly why it shouldn't have lived inside any of them.
The Design Response
I pulled the scorer out and published it as transparent-confidence —
Apache-2.0, on npm. Three decisions defined what it became, and each one was a deliberate
constraint, not a default.
Zero runtime dependencies
If the job is scoring and policy — not retrieval, not inference — then it should run with what the caller already has. No ML stack, no server, no model calls. There is nothing to audit but the package itself. For a library whose entire purpose is trust, a dependency tree is a liability.
It runs at query time, not in an eval pipeline
This is the line that separates it from RAGAs, TruLens, and DeepEval. Those are evaluation
frameworks — they run offline, after the fact, and call an LLM to judge answer quality.
That's valuable, and this doesn't replace it. transparent-confidence runs
inline, the moment your system answers, using signals already in hand. No extra
round-trip. It sits next to the eval tools, not on top of them.
It returns an action, not just a number
Every scorecard comes back with a recommendation — answer,
review, or abstain — and a reason string. A score you have to
interpret is a score you'll ignore under load. The point is to gate on it: drop below your
threshold, route the question to a human before the user ever sees the response. The most
important thing one of these systems can say is the corpus does not address this —
and the package is built so it can say that out loud.
What It Revealed
Extracting it forced an honesty I'd skipped while it was buried in a product. Inside BoardPath, I could lean on context the scorer technically shouldn't know about. As a standalone library, every assumption had to be stated in a type or a default — and a few of them turned out to be wrong, or at least too specific to governance.
The dimension set got cleaner. The weights got documented. The thing got a calibration story it didn't have before: the score is not a probability of correctness until you calibrate it against your own labeled outcomes — and now the README says exactly that, because a confidence number that overpromises is worse than no number at all.
The version that shipped is v0.3 — eight dimensions, 412 tests, dual ESM/CJS, zero dependencies. It scores authority, grounding, retrieval quality, corpus coverage, freshness, consistency, answer relevance, and index integrity. BoardPath still uses it. So does everything else I build that touches retrieval. The copy-paste drift is gone, because there's one source now, versioned in the open.
The lesson worth keeping
It wasn't about scoring. It was about boundaries. The most reusable thing in a product is usually the part that doesn't know which product it's in.
The package is open source under Apache-2.0. Install it, read the algorithm docs, or take the dimensions apart — it's all in the open.