Context Windows, Explained for Operators
Why a 1M-token context window is not the win you think it is — and what to do instead.
TL;DR
A 1M-token context window doesn't mean "give the model everything and let it figure it out". It means "you now have the option to be sloppy". You probably shouldn't take it.
What changed
Frontier models now ship with context windows in the 200k–1M token range. The pricing is real: you pay for every input token on every call, so a stuffed context gets re-sent and re-billed on each turn, and latency grows with prompt length. The thing that didn't change: attention is not free, and recall over a stuffed context degrades long before you hit the limit.
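A quick back-of-envelope makes the cost point concrete. The per-token price and token counts below are assumptions for illustration, not any provider's actual pricing; only the ratio matters.

```python
# Back-of-envelope cost of context stuffing vs. retrieval, per call.
# The per-token price below is a placeholder -- check your provider's
# current pricing. Token counts are illustrative.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000  # assumed $3 per million input tokens

stuffed_prompt_tokens = 800_000   # "give the model everything"
retrieved_prompt_tokens = 12_000  # ~15 chunks of ~800 tokens each

def cost(tokens: int) -> float:
    return tokens * PRICE_PER_INPUT_TOKEN

print(f"stuffed:   ${cost(stuffed_prompt_tokens):.2f} per call")
print(f"retrieved: ${cost(retrieved_prompt_tokens):.2f} per call")
# At these assumed rates that is ~$2.40 vs ~$0.04, and the stuffed prompt
# is re-sent (and re-billed) on every turn of a conversation.
```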
What to do instead
- Retrieve, then reason. Build a small, well-scored retrieval step that pulls the 5–20 chunks the model actually needs (a minimal sketch follows this list).
- Cache the static prefix. Anthropic's prompt caching bills cached-prefix reads at roughly a tenth of the normal input rate; OpenAI caches repeated prefixes automatically at a smaller discount. Use it (see the caching sketch below).
- Measure recall. Add a tiny eval that checks whether the model answered using the right snippet. If recall is < 80%, your retrieval is the bug, not the model (see the eval sketch below).
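Here is a minimal retrieve-then-reason sketch. To keep it dependency-free it scores chunks by lexical overlap; in practice you would swap score() for an embedding or BM25 similarity. The corpus, chunk contents, and top_k are illustrative.

```python
# Minimal "retrieve, then reason": score every chunk against the query,
# keep the top k, and build the prompt from only those chunks.
from collections import Counter

def score(query: str, chunk: str) -> float:
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    overlap = sum((q & c).values())
    return overlap / (len(chunk.split()) ** 0.5 or 1.0)  # mild length penalty

def retrieve(query: str, chunks: list[str], top_k: int = 10) -> list[str]:
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:top_k]

corpus = [
    "Refunds are processed within 5 business days of approval.",
    "The API rate limit is 600 requests per minute per key.",
    "Support hours are 9am-5pm UTC, Monday through Friday.",
]
context = retrieve("what is the rate limit?", corpus, top_k=2)
prompt = "Answer using only the context below.\n\n" + "\n".join(context)
```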
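For the caching step, this is roughly how a static prefix gets marked cacheable with Anthropic's Python SDK. It's a sketch assuming the current Messages API shape; the model name and file path are placeholders, and you should check the docs for exact rates and minimum cacheable prefix lengths.

```python
# Prefix caching sketch with Anthropic's Python SDK (pip install anthropic).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STATIC_PREFIX = open("policies.md").read()  # placeholder: the part that never changes

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": STATIC_PREFIX,
                # Marks the prefix as cacheable; later calls that reuse the
                # exact same prefix are billed at the cheaper cached-read rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```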
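And a tiny recall eval, reusing retrieve() and corpus from the retrieval sketch above. The eval set and the 80% threshold are illustrative; the point is that the check is a few lines, not a framework.

```python
# Retrieval-recall eval: for each labeled question, did the gold snippet
# make it into the retrieved context?
from typing import Callable

eval_set = [
    # (question, substring the gold chunk is known to contain)
    ("what is the rate limit?", "600 requests per minute"),
    ("how long do refunds take?", "5 business days"),
]

def retrieval_recall(retrieve_fn: Callable[[str], list[str]]) -> float:
    hits = sum(
        any(gold in chunk for chunk in retrieve_fn(question))
        for question, gold in eval_set
    )
    return hits / len(eval_set)

# Using the retrieve()/corpus sketch above:
recall = retrieval_recall(lambda q: retrieve(q, corpus, top_k=2))
print(f"retrieval recall: {recall:.0%}")  # < 80%? the retrieval step is the bug
```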
When long context does earn its keep
- Multi-document synthesis where chunking destroys cross-references.
- Code review on a single large file you don't want to summarise.
- Long-running conversations where the prefix is stable and cached.
For everything else, retrieval beats stuffing on cost, latency, and quality. The 1M number sells decks; recall sells products.