You're building an agent that manages billing records for a small law firm. The agent reads incoming time entries from attorneys, evaluates them for completeness, and prepares them for invoicing. It's handling real money, real clients, real billing compliance obligations.
You face a design choice: let the agent make judgment calls directly on the data — approve an entry, flag a narrative gap, compute budget utilization — or build a separate system where the agent can only request changes, and deterministic code validates and applies them.
The choice you make determines whether your system becomes reliable or a very smart way to introduce subtle errors into production.
The Problem With Letting Probabilistic Systems Decide
Most agent tutorials assume the agent writes data directly. The agent reasons about a situation, produces a tool call, and writes to the database. Simple. In practice, this creates a specific class of failure that's hard to see coming.
Language models are probabilistic. Given the same input, they don't always produce the same output. They don't always follow instructions. They occasionally hallucinate plausible-sounding information or miss constraints in context.
None of that is a problem if the worst outcome is a slightly awkward email draft that a human reads and edits.
It's a catastrophic problem if the output is a billing record that gets invoiced, a state transition that locks a matter, or a deadline confirmation that prevents escalation.
You end up spending most of your time building state management, validation layers, permissions boundaries, fallback logic, and guardrails just to stop the agent from freelancing itself into failure. Ironically, the more boring and deterministic the surrounding architecture becomes, the more reliable the AI layer gets.
The Architecture That Works: Deterministic/Probabilistic Separation
The principle is simple: any decision that can be computed from structured data should be deterministic code. Any decision that requires contextual judgment or natural language reasoning should be the agent's job.
Split the work. Keep them separate. Never let one role bleed into the other.
What Goes in the Deterministic Layer
Everything that produces a repeatable answer from structured input:
- State machine transitions — Is PENDING → APPROVED a valid transition? The answer is always yes or always no. Code answers it.
- Budget math — Is this entry within the client's remaining budget? Math doesn't change based on model temperature.
- Validation rules — Does this entry have a narrative? Is the rate on file for this attorney? Code validates these.
- Routing decisions — Is this a deadline signal or a billing signal? A deterministic classifier routes it.
- Audit logging — Every write needs to be logged. This happens in code, always, no exceptions.
- Idempotency checks — Did we already process this? Code detects it.
The pattern: Input → Code → Deterministic Output. Same input, same output, every time. If something breaks, you know exactly where to look.
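To make the first two bullets concrete, here is a minimal sketch of a transition table and a budget check. The `Status` members reuse statuses named in this article, but the table contents, the function names, and the use of `Decimal` are assumptions for illustration, not Litt's actual code.

```python
from decimal import Decimal
from enum import Enum


class Status(Enum):
    CAPTURED = "CAPTURED"
    PENDING = "PENDING"
    APPROVED = "APPROVED"


# Hypothetical transition table: every legal move is listed explicitly,
# so anything not listed is rejected.
ALLOWED_TRANSITIONS = {
    (Status.CAPTURED, Status.PENDING),
    (Status.PENDING, Status.APPROVED),
}


def can_transition(current: Status, target: Status) -> bool:
    """Pure lookup: the same pair always gives the same answer."""
    return (current, target) in ALLOWED_TRANSITIONS


def within_budget(hours: Decimal, rate: Decimal, remaining: Decimal) -> bool:
    """Exact decimal arithmetic, so there is no float drift on money."""
    return hours * rate <= remaining
```

Neither function touches a model, so a failure here is an ordinary bug that a unit test can reproduce.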
What Goes in the Probabilistic Layer
Everything where the right answer depends on judgment, context, or natural language:
- Brief assembly and framing — "What's the most important thing the attorney needs to know?" The agent synthesizes facts into narrative.
- Draft generation — Writing a client status update or escalation brief requires natural language reasoning.
- Deadline extraction — An email says "opposition due June 4." The agent identifies the date and entity.
- Anomaly detection — "This pattern looks wrong" is a judgment call. The agent detects it and flags it.
- Narrative review — "Does this billing narrative explain what the attorney did?" That's reading comprehension.
The pattern: Agent Reasoning → Structured Output → Code Validates and Writes. The agent produces a request for a state change. Code validates, applies business logic, and either commits it or rejects it with a structured error.
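A rough sketch of that request-then-validate loop might look like the following. The action names, the error codes, and the `handle_proposal` function are all hypothetical; the point is only that the model's output is parsed as untrusted input and answered with a structured result.

```python
import json
from dataclasses import dataclass

# Hypothetical allow-list: the only actions the agent may propose.
ALLOWED_ACTIONS = {"flag_narrative_gap", "flag_anomaly"}


@dataclass(frozen=True)
class Result:
    ok: bool
    code: str
    detail: str


def handle_proposal(raw: str, known_entry_ids: set) -> Result:
    """Validate an agent proposal; the model's text is never trusted as-is."""
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError:
        return Result(False, "MALFORMED_JSON", "proposal is not valid JSON")
    if not isinstance(proposal, dict):
        return Result(False, "MALFORMED_PROPOSAL", "proposal must be a JSON object")

    action = proposal.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return Result(False, "UNKNOWN_ACTION", f"action {action!r} is not permitted")

    entry_id = proposal.get("entry_id")
    if not isinstance(entry_id, str) or entry_id not in known_entry_ids:
        return Result(False, "UNKNOWN_ENTRY", f"entry {entry_id!r} does not exist")

    # A real system would apply the change and write an audit record here.
    return Result(True, "ACCEPTED", f"{action} accepted for {entry_id}")
```

A rejected proposal comes back as data rather than an exception, so the agent can be shown the error code and asked to try again.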
Litt: How It Works in Practice
Litt is an autonomous operations agent for small law firms. It monitors deadlines, reconciles billing, drafts client communications, and escalates anomalies. It handles real legal and financial decisions every day.
Here's how the boundary actually works:
Scenario 1: A Deadline Arrives via Email
Ingestion (deterministic): Gmail API reads an email from opposing counsel containing "opposition due June 4, 2026."
Extraction (probabilistic): The coordinator agent calls Gemini: "Extract the deadline date and description from this email." Gemini returns the structured data.
Validation (deterministic): Python code creates a deadline record with status UNVERIFIED. A state machine enforces that system-detected deadlines cannot enter active monitoring until an attorney confirms them.
Escalation (probabilistic): The agent synthesizes a brief: "Deadline detected: Opposition due June 4. Your response?"
Attorney confirmation (deterministic): Attorney clicks [Confirm]. Python code transitions the deadline to ACTIVE and logs the timestamp. No further agent inference.
Monitoring (deterministic): Code compares dates. If it's 7 days out, code flags it as HARD_LEGAL. No LLM reasoning.
The boundary: The agent cannot activate a deadline. The agent cannot advance a billing entry. The agent cannot generate an invoice. The agent can only propose these actions. Code applies them or rejects them.
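One way to enforce that boundary is to make the actor part of the transition check. The sketch below is an assumption about how such a guard could look; the `Actor` enum, the `confirm` function, and the `TransitionRejected` exception are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DeadlineStatus(Enum):
    UNVERIFIED = "UNVERIFIED"
    ACTIVE = "ACTIVE"


class Actor(Enum):
    AGENT = "agent"
    ATTORNEY = "attorney"


class TransitionRejected(Exception):
    """Raised when a requested state change breaks a rule."""


@dataclass
class Deadline:
    description: str
    status: DeadlineStatus = DeadlineStatus.UNVERIFIED
    confirmed_at: Optional[datetime] = None


def confirm(deadline: Deadline, actor: Actor, now: datetime) -> Deadline:
    """Move UNVERIFIED to ACTIVE, but only on an attorney's confirmation."""
    if actor is not Actor.ATTORNEY:
        raise TransitionRejected("only an attorney can activate a deadline")
    if deadline.status is not DeadlineStatus.UNVERIFIED:
        raise TransitionRejected(f"cannot confirm from {deadline.status.value}")
    deadline.status = DeadlineStatus.ACTIVE
    deadline.confirmed_at = now
    return deadline


# Usage: the click handler passes the attorney as the actor.
deadline = Deadline("Opposition due June 4, 2026")
confirm(deadline, Actor.ATTORNEY, datetime.now(timezone.utc))
```

Because the check lives in the write path, no prompt wording can talk the system into activating a deadline on the agent's behalf.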
Scenario 2: A Time Entry Needs Review
Capture (probabilistic + deterministic): Attorney's Claude Code session ends. A hook writes the session duration and a file activity log. Status: CAPTURED.
Draft assembly (probabilistic): The coordinator agent reads the activity log and asks Gemini to draft a billing narrative. Gemini suggests: "Reviewed opposing counsel motion and drafted responsive argument."
Attorney review (deterministic): Attorney confirms hours and edits the narrative. She clicks [Log Entry]. Python code validates: permission check, rate validation, rounding rules, duplicate detection. Status advances to PENDING. If any validation fails, a structured error explains why.
Pre-bill review (probabilistic + deterministic): Code detects anomalies. Gemini scores risk. Code surfaces the anomaly with a risk score.
Invoice generation (deterministic only): Attorney clicks [Generate Invoice]. Python computes LEDES fields, formats entries, generates the invoice. Everything has already been validated.
The boundary: The agent drafts narratives and scores anomalies. Code validates every transition and write.
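The validation step in the attorney review could be sketched as below. Everything specific here is an assumption: the 0.1-hour round-up policy, the fields used for duplicate detection, and the error codes are placeholders, since the article does not spell out Litt's actual rules.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, ROUND_UP


@dataclass(frozen=True)
class TimeEntry:
    attorney_id: str
    matter_id: str
    work_date: date
    hours: Decimal
    narrative: str


def round_hours(hours: Decimal) -> Decimal:
    # Assumed firm policy: bill in 0.1-hour increments, rounding up.
    return hours.quantize(Decimal("0.1"), rounding=ROUND_UP)


def dedupe_key(entry: TimeEntry) -> tuple:
    # Assumed definition of "duplicate": same attorney, matter, day, narrative.
    return (entry.attorney_id, entry.matter_id, entry.work_date, entry.narrative)


def validate(entry: TimeEntry, rates_on_file: dict, existing_keys: set) -> list:
    """Return structured errors; an empty list means the entry may advance."""
    errors = []
    if not entry.narrative.strip():
        errors.append({"code": "MISSING_NARRATIVE", "field": "narrative"})
    if entry.attorney_id not in rates_on_file:
        errors.append({"code": "NO_RATE_ON_FILE", "field": "attorney_id"})
    if entry.hours <= 0:
        errors.append({"code": "INVALID_HOURS", "field": "hours"})
    if dedupe_key(entry) in existing_keys:
        errors.append({"code": "DUPLICATE_ENTRY", "field": "entry"})
    return errors
```

Each error names a code and a field, which gives the interface enough to explain the failure to the attorney without any model involvement.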
Why This Matters
1. Auditability
When something goes wrong — a deadline missed, a billing entry questioned, a board asking "how did this happen?" — you need to produce a record.
With agents writing directly: "The agent decided to transition the deadline to ACTIVE. We don't know exactly why."
With deterministic boundaries: "The attorney clicked confirm on May 29 at 2:34 PM. The deadline transitioned to ACTIVE because that's the rule. Here's the log entry."
In legal contexts, the second answer is the one that survives scrutiny.
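For illustration, a log entry like the one quoted above could be produced by something as small as this. The field names, the `audit_record` function, the actor string, and the year in the example timestamp are hypothetical; the only idea taken from the article is that every state change records who, what, and when.

```python
import json
from datetime import datetime, timezone


def audit_record(actor: str, record_id: str, old_status: str,
                 new_status: str, reason: str, at: datetime) -> str:
    """Return one JSON line describing a state change."""
    if at.tzinfo is None:
        raise ValueError("audit timestamps must be timezone-aware")
    return json.dumps(
        {
            "at": at.astimezone(timezone.utc).isoformat(),
            "actor": actor,
            "record_id": record_id,
            "from": old_status,
            "to": new_status,
            "reason": reason,
        },
        sort_keys=True,
    )


# Usage: written by the same code path that applies the transition.
line = audit_record(
    actor="attorney:jdoe",          # hypothetical user id
    record_id="deadline-17",        # hypothetical record id
    old_status="UNVERIFIED",
    new_status="ACTIVE",
    reason="attorney_confirmation",
    at=datetime(2026, 5, 29, 14, 34, tzinfo=timezone.utc),
)
```

Writing the record inside the transition code, rather than asking the agent to log its own actions, is what makes the log complete.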
2. Reproducibility
If you run the same billing reconciliation pass twice on the same data, you should get the same output. Guaranteed.
With agents deciding: Gemini might score the same anomaly differently on the second pass, or classify an entry differently. You lose reproducibility.
With boundaries: Python code makes the same call every time. The agent's reasoning might be nuanced. The decision itself is repeatable.
3. Cost and Speed
Every inference call costs money and adds latency. If you can compute something in code, do it there, because code is effectively free and fast compared to token generation.
Litt runs deadline checks hourly. If each check involved asking Gemini "is this deadline 7 days out?", you'd burn through quota and slow the system down. Code does it in microseconds.
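The hourly check reduces to date arithmetic. Here is a minimal sketch; reading "7 days out" as "due within the next 7 days, inclusive" is an assumption, as are the function and constant names.

```python
from datetime import date

WARNING_WINDOW_DAYS = 7  # assumed threshold, taken from the "7 days out" example


def days_until(due: date, today: date) -> int:
    """Whole days from today to the due date; negative if already past."""
    return (due - today).days


def in_warning_window(due: date, today: date,
                      window: int = WARNING_WINDOW_DAYS) -> bool:
    """True when the deadline is today or within the next `window` days."""
    return 0 <= days_until(due, today) <= window
```

Passing `today` in as an argument, instead of calling `date.today()` inside the function, keeps the check reproducible in tests.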
The Orchestration Problem
The model is excellent at reasoning over pre-established facts and generating natural language from structured input. It is unreliable as the source of truth for facts that should live in a database.
The right architecture: Code answers "what is true about this data." The model answers "what does that mean in context." Keep those jobs separate. Enforce the boundary in code.
When you do, something interesting happens. The agent becomes more reliable, not less. It's no longer being asked to do things it's bad at. It's being asked to synthesize and reason over facts that are already verified. It can focus on judgment instead of validation.
The boring part — the state machines, the validation layers, the permissions boundaries — that's not a limitation you're adding to constrain the agent. That's the foundation that lets the agent be genuinely useful.
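As a rough illustration of that division of labor, code can gather the verified facts and hand them to the model as fixed input. The `VerifiedFacts` type, the prompt wording, and the `build_brief_prompt` function are assumptions; no model call is shown, only the deterministic step that prepares one.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VerifiedFacts:
    """Values read from the database or computed in code, never by the model."""
    matter_id: str
    deadline_description: str
    due: date
    days_remaining: int


def build_brief_prompt(facts: VerifiedFacts) -> str:
    """Assemble a prompt in which every date and number comes from code."""
    lines = [
        "Write a two-sentence brief for the responsible attorney.",
        "Use only the facts below. Do not add dates, amounts, or names.",
        f"Matter: {facts.matter_id}",
        f"Deadline: {facts.deadline_description}",
        f"Due: {facts.due.isoformat()}",
        f"Days remaining: {facts.days_remaining}",
    ]
    return "\n".join(lines)
```

The model is left to decide how to phrase the brief, while what the brief is allowed to assert has already been settled before the prompt is built.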
The Takeaway
If you're building an agent that touches financial records, legal decisions, or operational state — anything that needs to survive scrutiny — design the boundary first. Decide what must be deterministic before you write the agent code.
You'll spend more time on boring architecture. Your system will be more reliable, more auditable, and, ironically, more capable of using AI effectively.
The irony worth remembering
The more deterministic the surrounding architecture becomes, the more reliable the probabilistic AI layer can be.