Architecture Note · 02

Context window economics: keeping 110,000 documents out of my coordinator's context

Auris Intelligence · May 2026 · Eric Tetzlaff

The benchmark run made it 7% into the corpus before I stopped it.

Seven percent. Roughly 7,700 documents out of 110,000. Token consumption was already at a level that made me actively consider deleting my entire .env file, API keys and all, and walking away from the whole thing. The naive pipeline was fully functional, thoroughly tested, and architecturally sound in every way except the one that actually mattered at scale: it was feeding the coordinator everything. Every document. Every extracted page. Every line of every spreadsheet. All of it, upstream, into the context window of an agent whose job was analysis, not ingestion.

I knew before I ran it that the naive approach wasn't going to work at 110,000 documents. I ran it anyway, because I needed the benchmark. I needed to know exactly how bad "naive" was before I could design something better. The answer was: catastrophically bad. Not slightly inefficient. Not manageable with some tuning. Architecturally incompatible with the problem at the scale I was solving it.

The analogy I used with my attorneys

Imagine a named partner at a law firm. His name is on every document that leaves the firm. His reputation — and his liability — rides on the quality of every output. He is the coordinator. He is responsible for final analysis.

Now imagine dropping 110,000 documents on his desk and asking him to read every single one before producing his analysis. No associates. No paralegals. No research assistants. Just him, the documents, and a billing rate of several hundred dollars an hour applied to every minute he spends on administrative work that a $45-an-hour paralegal could have handled.

That's what feeding a coordinator agent an unfiltered 110,000-document corpus looks like. Expensive. Slow. And it produces worse analysis — not better — because the signal is buried so deep in noise that no amount of intelligence at the coordinator level can reliably surface it.

The attorneys understood immediately. The architecture that replaced the naive approach mirrors a well-run law firm so precisely that I've used the analogy ever since.

Named Partner / Coordinator (highest compute, reserved for final analysis only)
Receives only distilled signal from department heads. Performs final analysis, synthesis, and output generation. Never touches raw documents.

Managing Partners / Subagents (mid-tier compute, domain coordination)
Spearhead dedicated analytical teams. Limited lateral communication with peer subagents where cross-domain signal concentration is beneficial. Report distilled findings upward.

Partners, Associates, Paralegals / Workers (lowest compute, the bulk of the work)
Specialists responsible for one task and one task only within their domain. OCR. Keyword filtering. Zone hydration. Extraction. Each worker's context window is intentionally disposable.

The phrase that unlocked the architecture for me came from a developer I respect: "Subagent context windows are disposable."

Disposable. That word reframed everything. Most people building multi-agent systems think of context as a resource to be conserved — something precious that needs to be managed carefully up and down the stack. This framing leads to architectures where agents try to pass as much context as possible to the next layer, preserving everything in case it turns out to be useful later.

The disposable framing inverts that entirely. Worker context isn't precious. It's ephemeral by design. You extract signal, surface it upward, and let the rest evaporate at no cost. The coordinator never sees the noise because the architecture is designed so the noise never reaches it — not because you filtered it out at the coordinator level, but because you never passed it upward in the first place. That distinction is the entire design.
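Here's a minimal sketch of what that inversion looks like in code, with the subagent tier collapsed for brevity. Everything in it is illustrative rather than the production system: distill() stands in for a cheap worker-tier model call, and Finding is the only artifact that survives a worker.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Finding:
    doc_id: str
    summary: str   # a few hundred tokens of distilled signal, not the document
    relevant: bool

def distill(text: str) -> str:
    """Placeholder for a cheap worker-tier model call."""
    return text[:400].strip()

def worker(path: Path) -> Finding:
    """Reads the full document inside its own throwaway context and
    returns only the distilled finding. The raw text evaporates here."""
    raw = path.read_text(errors="ignore")
    summary = distill(raw)
    return Finding(path.name, summary, relevant=bool(summary))

def coordinator(paths: list[Path]) -> list[Finding]:
    """Sees Findings only. Raw document text never enters this scope."""
    return [f for f in map(worker, paths) if f.relevant]
```

The point is the type signature: the coordinator never accepts raw text. The architecture, not a filter, keeps the noise out.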

The retrieval strategy that makes this work came from an unexpected place: Grep and Glob — Unix tools that have existed since the 1970s. I was using both during smoke tests of increasing scope, running Grep to search document content for keywords and Glob to search file names. The compiled candidate lists were fast and cheap to produce. They were also full of false positives — documents that contained the target keywords in contexts completely unrelated to the query at hand.

During edge case identification sessions, I started manually reviewing documents from the Grep/Glob result sets to understand where the false positives were coming from. The pattern was immediate: in almost every case, you could tell within a few lines of surrounding text whether a keyword hit was signal or noise. The keyword itself wasn't enough context. The lines immediately around its first occurrence almost always were.
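The zone idea is small enough to sketch in a few lines of Python. A minimal illustration, not the production extractor; the window size here is arbitrary:

```python
def extract_zone(text: str, keyword: str, radius: int = 5) -> str | None:
    """Return the lines immediately surrounding the FIRST occurrence of
    keyword, or None if the keyword never appears. Enough context to
    judge signal vs. noise without paying full-document cost."""
    lines = text.splitlines()
    needle = keyword.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            return "\n".join(lines[max(0, i - radius): i + radius + 1])
    return None
```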

That observation became the architecture. Three stages. Each one dramatically more expensive per document than the one before it, but applied to a dramatically smaller candidate set, with false positives eliminated at every gate:

Stage 1 · Grep / Glob Filtering
Keyword presence in file content and file names. Fast, cheap, broad. Produces the initial candidate list. False positive rate: high, but this stage costs almost nothing.

Stage 2 · Zone Hydration
Extract only the lines immediately surrounding the first keyword instance: enough context to evaluate relevance without full-document cost. Eliminates 90%+ of the false positives from Stage 1.

Stage 3 · Full Document Hydration
Applied only to survivors of Stage 2. Full extraction confirms signal versus noise with certainty, clearing 98%+ of the original false positives. The expensive compute runs only on near-pure signal.

The net effect: zone hydration alone clears 90% of the false positives, and 98% of them are gone before full analysis ever runs.
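Strung together, the cascade fits in a page of Python. This is a sketch under simplifying assumptions: plain-text files on disk, a pure-Python scan standing in for Grep/Glob, and a caller-supplied is_relevant() standing in for the worker-tier model that judges a snippet. None of the names here come from the real pipeline.

```python
from pathlib import Path
from typing import Callable

def stage1_candidates(root: Path, keyword: str) -> list[Path]:
    """Grep/Glob analogue: keyword in file name or file content.
    Cheap and broad; false positives are expected and fine."""
    needle = keyword.lower()
    hits = []
    for path in root.rglob("*.txt"):
        if (needle in path.name.lower()
                or needle in path.read_text(errors="ignore").lower()):
            hits.append(path)
    return hits

def stage2_zone(path: Path, keyword: str, radius: int = 5) -> str | None:
    """Zone hydration: only the lines around the first keyword instance."""
    lines = path.read_text(errors="ignore").splitlines()
    needle = keyword.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            return "\n".join(lines[max(0, i - radius): i + radius + 1])
    return None

def cascade(root: Path, keyword: str,
            is_relevant: Callable[[str], bool]) -> list[Path]:
    """Each gate costs more per document than the last,
    but sees far fewer documents than the one before it."""
    confirmed = []
    for path in stage1_candidates(root, keyword):         # gate 1: ~free
        zone = stage2_zone(path, keyword)
        if zone is None or not is_relevant(zone):         # gate 2: small prompt
            continue
        if is_relevant(path.read_text(errors="ignore")):  # gate 3: full hydration
            confirmed.append(path)                        # near-pure signal upward
    return confirmed
```

Only documents that survive all three gates are ever worth a coordinator token.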

What I had built, without knowing the research terminology for it at the time, is what retrieval system designers call cascaded retrieval with progressive reranking. Coarse-to-fine. Each stage more expensive per document than the last, but each one operating on a dramatically smaller and cleaner candidate set than the one before it. By the time the coordinator receives anything for analysis, it is receiving signal: distilled, confirmed, ready for the work only a named partner can do.

The output quality improvement over the naive approach isn't marginal. It's categorical. The coordinator isn't wading through noise trying to find signal. It's analyzing signal that worker agents have already confirmed, organized, and handed upward through a hierarchy designed specifically to make that final analysis as good as it can possibly be.

The economic argument for this architecture is as straightforward as the quality argument. You don't pay a named partner $800 an hour to read every document in a 110,000-document corpus. You pay a paralegal to organize it first. Token cost maps onto hourly rate. Compute expenditure maps onto staff seniority. The architecture isn't just elegant — it's financially rational. The most expensive compute in the system runs only on the work that actually requires it. Everything else runs at the cost it deserves.
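The mapping is easy to make concrete with arithmetic. Every number below is assumed purely for illustration (token counts, prices, hit rates); none are the real figures from this system. What matters is the shape: the expensive model's bill scales with distilled findings, not corpus size.

```python
# Back-of-the-envelope only. Every number here is an assumption
# for illustration; none are real figures from this system.

DOCS = 110_000
TOKENS_PER_DOC = 3_000              # assumed average document length
PARTNER_RATE = 15 / 1_000_000       # assumed $/token, coordinator-tier model
PARALEGAL_RATE = 0.5 / 1_000_000    # assumed $/token, worker-tier model

# Naive: the named partner reads everything.
naive = DOCS * TOKENS_PER_DOC * PARTNER_RATE

# Cascade: grep is ~free; workers read zones, then full docs for survivors;
# the coordinator sees only short distilled findings.
candidates = int(DOCS * 0.20)                     # assumed stage-1 hit rate
survivors = int(candidates * 0.10)                # zone check clears ~90%
cascade = (candidates * 200 * PARALEGAL_RATE      # ~200-token zones
           + survivors * TOKENS_PER_DOC * PARALEGAL_RATE
           + survivors * 300 * PARTNER_RATE)      # ~300-token findings upward

print(f"naive: ${naive:,.0f}   cascade: ${cascade:,.2f}")
# With these made-up numbers: naive is ~$4,950, cascade is ~$15.
```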

One final observation worth stating plainly: this architecture emerged from a combination of a benchmark run that failed spectacularly, a developer's throwaway phrase about disposable context windows, and a manual review session studying false positives from Unix grep output. Not from a research paper. Not from a course. From building something real, watching it break, and thinking carefully about why.

That's the only way I know how to design systems. Start with the problem. Run it until it breaks. Understand exactly how it broke. Build the thing that doesn't.
