The Proxy Holder Problem — Eric Tetzlaff

The stress test had been running for two hours. Forty-nine questions, each more deliberately adversarial than the last — ambiguous language, missing context, governing documents that contradicted each other in ways that create actual legal exposure for actual boards. GPT-4o had answered all forty-nine of them correctly under the BoardPath system prompt. Conflict detection working. Authority hierarchy surfacing. Confidence scores calibrated. I was close enough to done that I was starting to think about what came next.

Then I threw the proxy holder question at it. Not because I thought it would fail. Because it was the hardest version of the thing the whole system was built to solve, and I wanted to see it win cleanly.

It didn't win cleanly. And the way it failed told me something I needed to know before this platform touches a real election.

The question wasn't hard because the answer was ambiguous. It was hard because three documents each gave a different answer, and all three had plausible claims to authority — including one that had been specifically written to override the others.

The Setup

Four documents. Three positions. One correct answer.

The question: Who can serve as a proxy holder at an annual meeting?

In isolation, that's a routine governance question. In this corpus, it wasn't routine at all. The test was specifically constructed around four documents with misaligned authority ranks: three that each addressed proxy holding differently, and one that was silent on it. It's the kind of thing that happens in real associations when bylaws were amended decades after the declaration was drafted, and then a board resolution attempted to "clarify" the result.

The document stack looked like this:

Test Corpus — Proxy Holding Provisions

Declaration §3.03 — authority_rank: 10 (highest) — proxy holders must be Members in good standing.

Bylaws §3.03 — authority_rank: 20 — proxy holders need not be Members.

Rules & Regulations — authority_rank: 25 — silent on proxy holding.

Board Resolution 2019-01 — authority_rank: 30 (lowest) — any person may hold a proxy; purports to "clarify" that the Declaration's Member requirement applies only in specific contexts.

The correct answer is unambiguous if you apply the hierarchy: the Declaration controls, full stop. The Bylaws provision is overridden. The Board Resolution — which attempted to carve out exceptions to a Declaration requirement — has no authority to do that, and its "clarification" is null. Only Members in good standing can hold a proxy.

Getting there requires the system to do three distinct things: detect that a conflict exists, correctly identify which document wins from the authority_rank field, and then definitively state that the lower-authority provisions don't just lose — they're overridden and have no legal effect on this question. The third piece is not optional. A board member acting on a wishy-washy answer about competing provisions at a contested election is a board member about to have a very bad night.

What Actually Happened

The model got most of the way there. Then it didn't.

The first run — call it H3 — produced a response that was technically accurate and practically useless. The model correctly identified the Declaration as the higher-authority document. It correctly surfaced the conflict. Then it described the tension and stopped: "the Bylaws allow non-members to serve as proxy holders, creating a conflict."

No resolution. A board member reading that answer knows there's a disagreement between documents. They do not know what to do. That's the failure mode that causes proxy challenges at annual meetings.

H3 · Partial

Conflict detected. Declaration surfaced as higher authority. Conflict described but not resolved: the lower-authority provisions were presented as live alternatives, not as overridden provisions. Actionable answer: no.

H4 · v1 · Partial

Added explicit CONFLICT RULE to system prompt: "State which provision controls. Explicitly state that the lower-authority provision IS OVERRIDDEN and has no legal effect." Model now said "The Declaration controls" and then immediately followed with "However, the Bylaws and Board Resolution allow non-members to serve, creating a conflict." Followed the instruction in form. Violated it in practice.

H4 · v2 · Regression

Escalated to a mandatory output template: "You MUST use this exact structure: first state the controlling provision, then state '[Lower document] is overridden by the [higher document] and has no legal effect.'" Model followed the template. Filled in the wrong document. The Declaration (authority_rank 10, highest in the corpus) was declared overridden by the Bylaws and Board Resolution. High Confidence. Completely wrong.

H5 · Pass

Deterministic hierarchy resolution pre-computed from authority_rank integers and injected into user prompt as established fact. Post-call inversion check in code. Model receives the winner; it doesn't determine it. One clean, definitive answer. No legal counsel recommendation needed. Confidence: High. Correct.

The H4v2 regression is worth sitting with for a moment. The more prescriptive the instruction, the more confidently the model applied the template pattern — and the more completely it got the substance backwards. It saw a template that said "[lower document] is overridden by [higher document]" and dutifully filled in the blank. The model didn't understand what "lower" and "higher" meant in the context of legal authority hierarchy. It pattern-matched on the instruction structure and produced a formally correct, substantively catastrophic result.

High Confidence. Wrong document controlling. In a legal context, that's not a miss. That's a trap.

The Diagnosis

Two hard problems dressed up as one.

Here's what the failed attempts revealed: we were asking the model to do two distinct things in a single inference pass.

Step 1 is hierarchy resolution — determining which document controls from the authority_rank field. Step 2 is language production — framing the answer in a way that explicitly declares the non-controlling provisions overridden.

Step 1 is not a language problem. It's arithmetic. SELECT MIN(authority_rank) tells you which document wins. The answer is already in the database. It does not require inference. It does not require the model to reason about legal document hierarchies. It requires an integer comparison.
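To make the "integer comparison" concrete, here is a minimal sketch of the same MIN() operation in TypeScript, run against the four-document test corpus. The RankedSection type and both function names are hypothetical, introduced only for illustration; only document_title and authority_rank come from the pipeline described here.

```typescript
// Hypothetical shape for a retrieved section. Only these two fields are
// taken from the article; everything else in this sketch is illustrative.
interface RankedSection {
  document_title: string;
  authority_rank: number; // lower number = higher authority
}

// Lowest authority_rank present, or null when there are no sections.
function controllingRank(sections: RankedSection[]): number | null {
  if (sections.length === 0) return null;
  return sections.reduce(
    (min, s) => Math.min(min, s.authority_rank),
    Number.POSITIVE_INFINITY,
  );
}

// De-duplicated titles of every document at the controlling rank.
function controllingTitles(sections: RankedSection[]): string[] {
  const rank = controllingRank(sections);
  if (rank === null) return [];
  const titles = sections
    .filter(s => s.authority_rank === rank)
    .map(s => s.document_title);
  return Array.from(new Set(titles));
}

// The proxy holder corpus, deliberately out of order.
const proxyCorpus: RankedSection[] = [
  { document_title: "Bylaws", authority_rank: 20 },
  { document_title: "Board Resolution 2019-01", authority_rank: 30 },
  { document_title: "Declaration", authority_rank: 10 },
  { document_title: "Rules & Regulations", authority_rank: 25 },
];

const winner = controllingTitles(proxyCorpus);
```

Nothing in this path depends on wording, templates, or sampling temperature, which is the whole point: the same input always yields the same controlling document.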

Step 2 is a language problem — and the model is excellent at it, provided you give it the answer to Step 1 before it starts. When the model is simultaneously determining the hierarchy winner and producing language that declares non-winners overridden, it's doing two hard things at once and failing at the combination. When you pre-solve Step 1 and inject the answer as a confirmed fact, Step 2 becomes straightforward.

The right division of labor is: code determines what is true, model determines how to say it. The moment you ask the model to do both, you're delegating a deterministic database operation to non-deterministic inference. That's not a prompt engineering problem. That's an architectural mistake.

The Solution

Three layers. One of them is the system prompt — and it's the weakest one.

The full implementation has three components. The system prompt behavioral instruction, which I'd already written before any of this, is the soft layer. The two code guardrails are the hard ones. Here are all three, in the order they execute.

Layer 0 — System prompt: CONFLICT RULE. This was in place before H3. It defines the intended behavior and remains part of the system prompt — not because it's load-bearing, but because it shapes the model's output framing in the uncontested case. It's the first line of defense. It's just not the last, and it was never meant to be.

System prompt: CONFLICT RULE (behavioral instruction, soft layer)

CONFLICT RULE:
- If sections from different authority levels address the same question
  and appear to say different things, identify the conflict explicitly.
- State which provision controls, and explicitly state that the
  lower-authority provision IS OVERRIDDEN and has no legal effect on
  this question. Use language such as: "The Declaration controls.
  The [Bylaws / Rules] provision is overridden and has no legal effect."
  Do not write "however, [lower doc] may allow X under certain
  circumstances" — an overridden provision does not apply and must not
  be presented as a live option.
- If a lower-authority document (e.g., a board resolution) attempts to
  "clarify," "limit," or "modify" a higher-authority document (e.g.,
  the Declaration), state explicitly that the lower document has no
  authority to do this and its clarification has no legal effect.
- Do not silently pick one source without acknowledging the other.
- Set has_hierarchy_conflict to true if this occurs.

This instruction failed in H4v2 — not because it's wrong, but because it was doing two jobs simultaneously: telling the model both which document wins and how to say so. When the model pattern-matched on the template structure rather than reasoning about the authority hierarchy, it produced a formally correct sentence with the hierarchy inverted. The instruction doesn't fail at language production. It fails at being the sole arbiter of a fact that's already in the database.

Guardrail 1 — Pre-call authority injection. During the ranking step — after candidates are scored and sorted, before the GPT call fires — the code computes which document controls from authority_rank integers directly. Lowest rank wins. No inference involved. The resolution is then built into a structured block and injected at the top of the user prompt — not the system prompt. This is intentional: the immediate user turn has higher effective compliance than instructions buried 500 tokens back in a system prompt the model is already attending to less carefully.

TypeScript: Guardrail 1, pre-call hierarchy injection

// ── Pre-compute hierarchy conflict resolution ─────────────────────
// Compute which document controls from authority_rank (code-level,
// not model inference). Inject as a confirmed fact into the user
// prompt so the model receives the resolution rather than inferring it.

const rankGroups = new Map<number, string[]>();
topCandidates.forEach(c => {
  const existing = rankGroups.get(c.authority_rank) ?? [];
  if (!existing.includes(c.document_title)) existing.push(c.document_title);
  rankGroups.set(c.authority_rank, existing);
});
const uniqueRanks = Array.from(rankGroups.keys()).sort((a, b) => a - b);

let conflictInjection = "";
if (uniqueRanks.length > 1) {
  const controllingRank = uniqueRanks[0];
  const controllingDocs = rankGroups.get(controllingRank)!;
  const lowerTiers = uniqueRanks.slice(1).map(rank => {
    const docs = rankGroups.get(rank)!.join(", ");
    return `${authorityLabel(rank)} (${docs})`;
  });
  conflictInjection = [
    "⚠️  HIERARCHY RESOLUTION (computed from document authority ranks",
    " — treat as established fact):",
    `Controlling authority: ${authorityLabel(controllingRank)}`,
    ` — ${controllingDocs.join(", ")}`,
    `Lower authority (overridden if in conflict): ${lowerTiers.join("; ")}`,
    "If any lower-authority provision addresses the same question",
    "differently than the controlling document, it IS overridden",
    "and has no legal effect. State this explicitly in your answer.",
  ].join("\n");
}

// conflictInjection is prepended to the user prompt, before evidence
const userPrompt = [
  conflictInjection,
  `Question: ${question_text}`,
  `\nDocument sections:\n\n${evidenceBlock}`,
].filter(Boolean).join("\n\n");

The model receives the resolution as an established fact, not an instruction to figure it out. Its job shifts from "determine which document wins" to "write an answer that reflects this already-determined winner." The distinction matters. Language models are reliably good at the second task. They are unreliable when asked to do both simultaneously, especially when the first task has a correct answer that's already in a database column.
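The injection code calls authorityLabel(rank), which is not shown in this write-up. One plausible shape for it is a plain lookup table with a fallback. This is a sketch only: the rank values are the ones in the test corpus, but the label wording and the fallback format are assumptions.

```typescript
// Hypothetical lookup. Rank values match the test corpus; the label text
// and the fallback string are assumptions made for this illustration.
const AUTHORITY_LABELS: Record<number, string> = {
  10: "Declaration",
  20: "Bylaws",
  25: "Rules & Regulations",
  30: "Board Resolution",
};

// Human-readable tier name for a rank, with a safe fallback for ranks
// the table does not know about.
function authorityLabel(rank: number): string {
  return AUTHORITY_LABELS[rank] ?? `Authority rank ${rank}`;
}

// Example: the "lower authority" tiers for the proxy holder corpus.
const lowerTierLabels = [20, 25, 30].map(authorityLabel).join("; ");
```

Keeping the labels in a table rather than in the prompt means the text the model sees is derived from the same integers that decided the winner.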

Guardrail 2 — Post-parse inversion check. Even with the injection in place, I added a post-parse validation layer. After the JSON response is parsed, if has_hierarchy_conflict: true, the code scans the answer text for a specific failure signature: does the name of the controlling document appear within 80 characters of the words "overridden," "no legal effect," or "superseded"? If it does, the hierarchy is inverted. The H4v2 failure mode has a detectable string fingerprint, and this check catches it before it reaches the client.

On detection, the pipeline fires one retry with an explicit correction block appended — naming what went wrong, naming the correct controlling document, and restating the hierarchy as scaffolding for the second pass. One retry. Not a loop. If the second pass fails the check, the response is flagged for human review rather than surfaced to the user. The behavior at every branch is deterministic.
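The control flow in that paragraph (one retry, then hold for human review) can be sketched independently of the model call. Everything named here is hypothetical: generate stands in for the chat completion call, isInverted for the string check, and the Outcome type for however the pipeline marks a response as held for review.

```typescript
// Hypothetical result type: either safe to surface, or held for review.
type Outcome<T> =
  | { status: "ok"; answer: T }
  | { status: "needs_review"; answer: T };

// Runs the model at most twice. The second call receives the correction
// block; a second failure is flagged instead of being surfaced.
async function answerWithOneRetry<T>(
  generate: (correction?: string) => Promise<T>,
  isInverted: (answer: T) => boolean,
  correction: string,
): Promise<Outcome<T>> {
  const first = await generate();
  if (!isInverted(first)) return { status: "ok", answer: first };

  // Exactly one retry. There is deliberately no loop here.
  const second = await generate(correction);
  if (!isInverted(second)) return { status: "ok", answer: second };

  return { status: "needs_review", answer: second };
}
```

Because the branching lives in code, every path ends in one of two known states, and the number of model calls is bounded at two.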

TypeScript: Guardrail 2, post-parse inversion check

// ── Hierarchy inversion check ─────────────────────────────────────
// Verify the model did not declare the controlling document overridden
// by a lower-authority one. If it did, retry once with a correction.

if (conflictInjection && answer.has_hierarchy_conflict) {
  const controllingDocs = rankGroups.get(uniqueRanks[0])!;
  const answerText = [
    answer.direct_answer,
    answer.plain_english_explanation,
    answer.hierarchy_conflict_note ?? "",
  ].join(" ").toLowerCase();

  // Inversion signature: controlling doc title appears within 80 chars
  // of "overridden", "no legal effect", or "superseded"
  const inversionDetected = controllingDocs.some(docTitle => {
    // Match on the first 30 characters of the title, lowercased
    const needle = docTitle.toLowerCase().substring(0, 30);
    const titleIdx = answerText.indexOf(needle);
    if (titleIdx === -1) return false;
    // Window covers 80 chars before the match and 80 chars after its end,
    // so the title's own length does not eat into the trailing window
    const window = answerText.substring(
      Math.max(0, titleIdx - 80),
      titleIdx + needle.length + 80
    );
    return window.includes("overridden")
        || window.includes("no legal effect")
        || window.includes("superseded");
  });

  if (inversionDetected) {
    // Retry once with an explicit correction block
    const correctionPrompt = [
      conflictInjection,
      `Question: ${question_text}`,
      `\nDocument sections:\n\n${evidenceBlock}`,
      `\nCORRECTION REQUIRED: Your previous answer incorrectly stated`,
      `that the ${authorityLabel(uniqueRanks[0])} is overridden.`,
      `The ${authorityLabel(uniqueRanks[0])} CONTROLS.`,
      `Lower-authority documents are overridden by it, not the reverse.`,
      `Restate your answer with the correct hierarchy.`,
    ].filter(Boolean).join("\n\n");

    const retry = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      max_tokens: 1500,
      temperature: 0.1,
      messages: [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "user",   content: correctionPrompt },
      ],
    });

    try {
      answer = JSON.parse(
        (retry.choices[0]?.message?.content ?? "")
          .replace(/```json|```/g, "").trim()
      );
    } catch {
      console.error("Hierarchy correction retry parse failed — using original.");
    }
  }
}

Note what the check is operating on: strings and integers. indexOf. Substring bounds. A scan over an array of titles. There's no model involved in the validation step; the model is only called again if the check fires, and the retry prompt tells it exactly what it got wrong. This is the right division of labor. The code validates. The model only generates language.
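The same "code validates" idea can be applied to the parse step itself. The sketch below gives the fields the inversion check reads a type and returns null instead of throwing when the response is malformed. The BoardAnswer name, the choice of required fields, and the validation rules are assumptions; only the four field names come from the code above.

```typescript
// Field names come from the inversion check; treating the first three as
// required is an assumption made for this sketch.
interface BoardAnswer {
  direct_answer: string;
  plain_english_explanation: string;
  has_hierarchy_conflict: boolean;
  hierarchy_conflict_note?: string;
}

// Strips Markdown code-fence markers, parses JSON, and checks field types.
// Returns null on any failure so the caller can branch deterministically.
function parseBoardAnswer(raw: string): BoardAnswer | null {
  const fence = new RegExp("`{3}(?:json)?", "g"); // three backticks, optional "json"
  const cleaned = raw.replace(fence, "").trim();

  let value: unknown;
  try {
    value = JSON.parse(cleaned);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;

  const fields = value as Record<string, unknown>;
  const direct = fields["direct_answer"];
  const explanation = fields["plain_english_explanation"];
  const conflict = fields["has_hierarchy_conflict"];
  const note = fields["hierarchy_conflict_note"];

  if (typeof direct !== "string") return null;
  if (typeof explanation !== "string") return null;
  if (typeof conflict !== "boolean") return null;

  return {
    direct_answer: direct,
    plain_english_explanation: explanation,
    has_hierarchy_conflict: conflict,
    hierarchy_conflict_note: typeof note === "string" ? note : undefined,
  };
}
```

A null result gives the caller an explicit signal to route the response to review, rather than relying on a caught exception and a log line.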

The Result

One proxy holder question. One clean answer. No hedge.

The H5 response to the proxy holder question:

H5 Output — Direct Answer

"Only another Member of the Association in good standing can serve as a proxy holder at an annual meeting."

H5 Output — Authority Conflict Detected Panel

"The Bylaws and Board Resolution suggest that a proxy holder need not be a Member, but the Declaration explicitly states that only Members in good standing can serve as proxy holders. The Declaration controls. The Bylaws provision and the Board Resolution are overridden and have no legal effect on this question."

Confidence level: High — appropriate, because once the hierarchy is correctly applied, the answer is unambiguous. No attorney referral surfaced — also correct, because the conflict is resolved by the documents' own authority structure. There's no legal ambiguity requiring counsel. There's only a hierarchy that, once applied, produces a clear answer.

That's the target output. Two sessions of system prompt iteration replaced by one session of deterministic code.

The Broader Principle

If the source of truth is in your database, it belongs in your code — not your prompt.

System prompts are instructions. Instructions can be misread, misapplied, or followed in form while violated in substance — especially when the model is simultaneously reasoning about complex multi-document relationships and trying to format output to a prescribed template. The more prescriptive you make the template, the more confident the model becomes in applying it incorrectly, because confidence is a function of pattern completion, not of correctness.

Code-enforced gates operate on data that the model never needs to touch. Authority rank is an integer. Hierarchy resolution is MIN(). These operations are deterministic by definition. Delegating them to model inference under a prompt instruction trades guaranteed correctness for probabilistic compliance — and in a legal context, probabilistic compliance is just a polite way of describing a system that will eventually get someone's election invalidated.

This same principle generalizes beyond document hierarchy. Any decision that can be pre-computed from structured data in your database should be pre-computed and injected — not delegated to inference. The model is excellent at reasoning over pre-established facts. It is unreliable as the arbiter of those facts when the source of truth is already in a column you control.

The right architecture: code answers "which document controls." The model answers "what does that mean for this board member's question." Keep those two jobs separate, enforce the boundary in code, and give the model a desk worth sitting at. The proxy holder question got there on the fifth attempt. The next one that looks like this should get there on the first.
