Open-Source Library — 04

transparent-confidence

Explainable Confidence Scoring for RAG Systems

A confidence number without a per-dimension breakdown and a recommended action isn't auditable. It's a vibe with a decimal point. This is the scoring engine that started inside BoardPath, extracted into a zero-dependency npm package because the most reusable part of a product is the part that doesn't know which product it's in.

Status: Published · npm
License: Apache-2.0 · v0.3
Stack: TypeScript · Vitest · tsup · dual ESM/CJS · zero runtime deps
Role: Author, sole builder

The same scorer, copy-pasted into three products, drifting apart.

BoardPath had a scoring engine before it had paying customers. Every answer the system gave a board came back with a 0–100 confidence score and a per-dimension breakdown — authority, grounding, retrieval quality, and the rest. It was the part of the product that made a skeptic willing to act on an AI answer.

Then the same logic kept getting reached for in other projects. A forensic document tool wanted it. A side experiment wanted it. Each time, a slightly different copy of the same scorer got pasted in — and each copy drifted a little further from the others. A bug fixed in one didn't reach the others. A new dimension added in one made the rest quietly out of date.

The logic wasn't specific to any of the three products it lived inside — which is exactly why it shouldn't have lived inside any of them. That's the moment a feature is telling you it wants to be a library.

Domain-agnostic scoring, mislabeled as governance logic.

The scoring engine never knew what a CC&R was. It didn't retrieve anything. It didn't call a model. It took signals a RAG pipeline already produces — retrieval scores, document metadata, citation overlap, how much the retrieved chunks agreed with each other — and composed them into one auditable number with a reason attached to every point.
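As a rough illustration of that idea (not the package's actual API), the sketch below takes two signals a pipeline might already have on hand and turns each into a 0 to 1 dimension score with a reason attached. The type names, function names, and the assumption that both signals are already normalized to the 0 to 1 range are hypothetical, chosen only for this example.

```typescript
// Hypothetical shapes for illustration; not the transparent-confidence API.
interface PipelineSignals {
  retrievalScores: number[]; // one similarity score per retrieved chunk, assumed in [0, 1]
  chunkAgreement: number[];  // pairwise agreement between chunks, assumed in [0, 1]
}

interface DimensionScore {
  dimension: string;
  score: number;  // normalized to [0, 1]
  reason: string; // human-readable justification for the score
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

function scoreRetrievalQuality(signals: PipelineSignals): DimensionScore {
  const n = signals.retrievalScores.length;
  if (n === 0) {
    return { dimension: "retrieval quality", score: 0, reason: "No chunks were retrieved." };
  }
  const avg = clamp01(mean(signals.retrievalScores));
  return {
    dimension: "retrieval quality",
    score: avg,
    reason: `Mean retrieval score ${avg.toFixed(2)} across ${n} chunk(s).`,
  };
}

function scoreConsistency(signals: PipelineSignals): DimensionScore {
  if (signals.chunkAgreement.length === 0) {
    // Conservative default for this sketch: nothing to compare means no evidence of agreement.
    return { dimension: "consistency", score: 0, reason: "No chunk pairs to compare." };
  }
  // Use the weakest pairwise agreement so a single contradiction is not averaged away.
  const worst = clamp01(Math.min(...signals.chunkAgreement));
  return {
    dimension: "consistency",
    score: worst,
    reason: `Lowest pairwise agreement is ${worst.toFixed(2)}.`,
  };
}
```

Neither function retrieves anything or calls a model; each only reads numbers the pipeline has already produced, which is the property the extraction depended on.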

The governance lived in which documents got ranked how. The scoring was domain-agnostic the whole time — the boundary just hadn't been drawn. Drawing it meant pulling the scorer out and publishing it as transparent-confidence.

npm install transparent-confidence

Eight dimensions. One auditable score. A required action.

8 scoring dimensions · 412 tests · 0 runtime deps

Every scorecard is built from eight independently weighted dimensions, each with its own reason string so the final number can always be taken apart and defended:

authority
grounding
retrieval quality
corpus coverage
freshness
consistency
answer relevance
index integrity

It ships as dual ESM/CJS, fully typed, with 412 tests and versioned algorithm schemas so a score computed today can be reproduced tomorrow. The library is small and has no runtime dependencies, so there is nothing to audit but the package itself.
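A minimal sketch of how eight weighted dimensions could compose into one 0–100 score that can still be taken apart afterwards. Everything here is an assumption made for illustration: the weights are placeholders, the `schemaVersion` string is invented, and the type and function names are not the package's real exports. The library documents its own weights and algorithm versions.

```typescript
// Hypothetical sketch; weights, names, and the version string are illustrative only.
type Dimension =
  | "authority" | "grounding" | "retrievalQuality" | "corpusCoverage"
  | "freshness" | "consistency" | "answerRelevance" | "indexIntegrity";

interface DimensionInput { score: number; reason: string } // score in [0, 1]

interface BreakdownEntry {
  dimension: Dimension;
  weight: number; // normalized share of the total, in [0, 1]
  points: number; // contribution to the 0 to 100 score
  reason: string;
}

interface Scorecard {
  schemaVersion: string; // recorded so the same inputs can be re-scored the same way later
  score: number;         // 0 to 100
  breakdown: BreakdownEntry[];
}

// Placeholder weights; they are normalized below, so they need not sum to 100.
const WEIGHTS: Record<Dimension, number> = {
  authority: 20, grounding: 20, retrievalQuality: 15, corpusCoverage: 10,
  freshness: 10, consistency: 10, answerRelevance: 10, indexIntegrity: 5,
};

function composeScorecard(inputs: Record<Dimension, DimensionInput>): Scorecard {
  const dims = Object.keys(WEIGHTS) as Dimension[];
  const totalWeight = dims.reduce((sum, d) => sum + WEIGHTS[d], 0);
  const breakdown = dims.map((d): BreakdownEntry => {
    const weight = WEIGHTS[d] / totalWeight;
    const clamped = Math.min(1, Math.max(0, inputs[d].score));
    return { dimension: d, weight, points: 100 * weight * clamped, reason: inputs[d].reason };
  });
  const score = Math.round(breakdown.reduce((sum, b) => sum + b.points, 0));
  return { schemaVersion: "example-v1", score, breakdown };
}
```

Because each breakdown entry carries its own points and reason, the final number is just the sum of eight defensible contributions.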

Three deliberate constraints, not three defaults.

Extracting the scorer forced an honesty that was easy to skip while it was buried in a product. Three decisions defined what it became.

Decision 1 — Zero runtime dependencies

If the job is scoring and policy — not retrieval, not inference — it should run with what the caller already has. No ML stack, no server, no model calls. For a library whose entire purpose is trust, a dependency tree is a liability. Zero-dependency is a hard architectural constraint here, not a preference.

Decision 2 — Runs at query time, not in an eval pipeline

This is the line that separates it from Ragas, TruLens, and DeepEval. Those are evaluation frameworks: they typically run offline, after the fact, and often call an LLM to judge answer quality. transparent-confidence runs inline, the moment the system answers, using signals already in hand. No extra round-trip. It sits next to the eval tools, not on top of them.

Decision 3 — Returns an action, not just a number

Every scorecard comes back with a recommendation (answer, review, or abstain) and a reason string. A score you have to interpret is a score you'll ignore under load. The point is to gate on it: drop below your threshold and the question routes to a human before the user ever sees the response. The most important thing one of these systems can say is "the corpus does not address this," and the package is built so it can say that out loud.
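Gating on the recommendation could look something like the sketch below. The thresholds (75 and 40), the `corpusAddressesQuestion` flag, and the function names are assumptions made for this example, not the library's defaults or exports.

```typescript
// Hypothetical sketch; thresholds and names are assumptions, not library defaults.
type Action = "answer" | "review" | "abstain";

interface Recommendation { action: Action; reason: string }

interface Thresholds { answerAt: number; reviewAt: number } // on the 0 to 100 scale

function recommend(
  score: number,
  corpusAddressesQuestion: boolean,
  t: Thresholds = { answerAt: 75, reviewAt: 40 },
): Recommendation {
  // A high score cannot rescue a question the corpus never covers.
  if (!corpusAddressesQuestion) {
    return { action: "abstain", reason: "The corpus does not address this question." };
  }
  if (score >= t.answerAt) {
    return { action: "answer", reason: `Score ${score} meets the answer threshold of ${t.answerAt}.` };
  }
  if (score >= t.reviewAt) {
    return { action: "review", reason: `Score ${score} is below ${t.answerAt}; a human should check it.` };
  }
  return { action: "abstain", reason: `Score ${score} is below the review threshold of ${t.reviewAt}.` };
}

interface GateResult { shown: string | null; routedToHuman: boolean }

// Decide what the user sees before the draft answer is ever displayed.
function gate(draft: string, rec: Recommendation): GateResult {
  switch (rec.action) {
    case "answer":
      return { shown: draft, routedToHuman: false };
    case "review":
      return { shown: null, routedToHuman: true };
    case "abstain":
      return { shown: `No answer given: ${rec.reason}`, routedToHuman: false };
  }
}
```

The caller branches on `action` rather than on the raw number, so the threshold policy lives in one place instead of being reinterpreted at every call site.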

TypeScript (strict) · Vitest · tsup · Biome · Dual ESM/CJS · Versioned schemas · Zero runtime deps · Apache-2.0

One source of truth, versioned in the open.

Extraction made the dimension set cleaner and the weights documented. It also gave the library a calibration story it didn't have before: the score is not a probability of correctness until you calibrate it against your own labeled outcomes — and the README now says exactly that, because a confidence number that overpromises is worse than no number at all.
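One simple way to check a score against labeled outcomes is histogram binning: group past answers by score range and measure how often each range was actually correct. The sketch below assumes you have such labels. The type names, the function name, and the default bin width of 20 are hypothetical, and this is a generic technique rather than a description of how the package itself handles calibration.

```typescript
// Hypothetical sketch of histogram-binning calibration; names are assumptions.
interface LabeledOutcome { score: number; correct: boolean } // score on the 0 to 100 scale

interface CalibrationBin {
  lower: number;
  upper: number;
  count: number;
  observedAccuracy: number | null; // null when the bin has no samples
}

function calibrationTable(outcomes: LabeledOutcome[], binWidth = 20): CalibrationBin[] {
  const binCount = Math.ceil(100 / binWidth);
  const counts = new Array<number>(binCount).fill(0);
  const hits = new Array<number>(binCount).fill(0);

  for (const { score, correct } of outcomes) {
    const clamped = Math.min(100, Math.max(0, score));
    // A score of exactly 100 belongs in the last bin rather than a bin of its own.
    const i = Math.min(binCount - 1, Math.floor(clamped / binWidth));
    counts[i] += 1;
    if (correct) hits[i] += 1;
  }

  return counts.map((count, i) => ({
    lower: i * binWidth,
    upper: Math.min(100, (i + 1) * binWidth),
    count,
    observedAccuracy: count === 0 ? null : hits[i] / count,
  }));
}
```

If the 80 to 100 bin turns out to be right only three times out of four on your data, a raw score of 90 should not be read as a 90% chance of being correct.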

BoardPath still uses it. So does everything else here that touches retrieval. The copy-paste drift is gone, because there's one source now — v0.3, Apache-2.0, published on npm with its algorithm docs and dimensions open for inspection. It wasn't about scoring. It was about boundaries.

Repo: github.com/emtcmca/transparent-confidence  ·  Companion write-up: The Confidence Layer Didn't Belong to BoardPath
