A Friday afternoon. You paste the entire 180-page supplier handbook into the prompt, append the question, and wait. The model answers confidently and cites section 4.2. Section 4.2 doesn't exist. Section 4.1 covers payment terms; you wanted clause 7.3 on liability. The model read everything and remembered nothing in particular.
This happens more than people admit. The fix is rarely a bigger context window. The fix is usually a small retrieval layer in front of the model. RAG (retrieval-augmented generation) keeps getting dismissed as last year's pattern now that million-token windows are shipping. That dismissal is wrong for most jobs your team actually has.
The 200k-token trap
Long context works. We use it. For a single contract, a single research paper, a single onboarding doc, pasting the whole thing is fine. The model can hold a coherent argument across the document because nothing competes for its attention.
The trap is what happens past one document. Three things compound:
- Cost. A 180k-token input prompt at current Sonnet rates is roughly $0.54 per call before output. Run that 4,000 times a month and you are paying about $2,160 for input tokens alone, a four-figure invoice for a feature your team currently does in their head.
- Latency. Time-to-first-token climbs with prompt length. A tight retrieval query usually returns in under 200ms; a 180k-token prompt takes seconds before the first token appears.
- Attention dilution. The well-known "lost in the middle" effect. Models recall the start and the end of a long context better than the middle. Paste 80 documents and watch the model fixate on the first three and the last two.
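The cost bullet is simple arithmetic, and it is worth running with your own numbers. A minimal sketch, assuming the $3 per million input tokens implied by the $0.54 figure; the 2,500-token retrieval prompt (five chunks of roughly 500 tokens) is an illustrative assumption, not a measurement:

```javascript
// Assumption: $3 per million input tokens, the rate implied by $0.54 for 180k tokens.
const USD_PER_MILLION_INPUT_TOKENS = 3;

// Input cost of a single call, before any output tokens.
function inputCostUsd(promptTokens, ratePerMillion = USD_PER_MILLION_INPUT_TOKENS) {
  return (promptTokens / 1_000_000) * ratePerMillion;
}

// Input cost of running the same prompt size many times a month.
function monthlyCostUsd(promptTokens, callsPerMonth, ratePerMillion = USD_PER_MILLION_INPUT_TOKENS) {
  return inputCostUsd(promptTokens, ratePerMillion) * callsPerMonth;
}

// Whole handbook pasted on every call versus five retrieved chunks.
// The 2,500-token figure is illustrative, not measured.
const pasted = monthlyCostUsd(180_000, 4_000);
const retrieved = monthlyCostUsd(2_500, 4_000);
console.log(`pasted: $${pasted.toFixed(2)}/month, retrieved: $${retrieved.toFixed(2)}/month`);
```

Swap in your own rate and call volume; the ratio between the two numbers matters more than either one.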
What retrieval actually does
Strip the acronym and RAG is one sentence. You chunk your source material into small pieces, turn each piece into a vector, and at query time you pull the few chunks closest to the question. The model only ever sees those few chunks. It reads less. It answers from cited evidence.
The work is not in the embedding call. The work is in deciding what a "chunk" is, what your corpus actually contains, and how you measure whether the right chunks come back.
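To make "pull the few chunks closest to the question" concrete, here is a toy in-memory version. It is a sketch only: the three-dimensional vectors and chunk IDs are made up to stand in for real embeddings, and cosine similarity is one common choice of closeness measure.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks whose vectors are closest to the query vector.
function topK(queryVector, chunks, k = 3) {
  return chunks
    .map(chunk => ({ ...chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
const toyChunks = [
  { id: 'payment-terms', vector: [0.9, 0.1, 0.0] },
  { id: 'liability-cap', vector: [0.1, 0.9, 0.2] },
  { id: 'delivery-windows', vector: [0.0, 0.2, 0.9] },
];
const liabilityQuestion = [0.2, 0.8, 0.1];
console.log(topK(liabilityQuestion, toyChunks, 2).map(c => c.id));
```

A real setup replaces the toy vectors with embeddings and the array scan with an index, but the ranking logic is the same.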
When the longer prompt wins
Honestly assess the job before you reach for vectors. Long context is the right tool when:
- The whole corpus is one document and the answer requires cross-references the chunker would shred. A 60-page contract reviewed end-to-end, for example.
- You run the task once or twice. Setting up an index for a one-shot job is a waste of an afternoon.
- The source fits under 100k tokens and is read once per session.
If those describe your case, paste the document and move on. Don't add infrastructure you won't use again.
When retrieval wins by a wide margin
The other side of the line:
- The corpus changes. Product docs, ticket history, internal wiki, shipping policies updated weekly. Re-pasting on every query is unworkable.
- You have hundreds of documents and no idea in advance which ones the question touches.
- Cost per query matters because the feature runs thousands of times a day.
- You need citations the user can click. A long-context answer doesn't reliably tell you which page it came from. Retrieval knows the source of every chunk it handed to the model, by construction.
Long context is for one document read once. Retrieval is for a corpus queried often. Picking the wrong one is what makes your AI feature feel slow or vague.
A minimal retrieval setup
You do not need a vector database company. Postgres with the pgvector extension covers the first two years of most projects. Schema first:
```sql
create extension if not exists vector;

create table docs (
  id bigserial primary key,
  source text not null,
  chunk text not null,
  embedding vector(1536)  -- matches the output size of text-embedding-3-small
);

create index docs_embedding_idx
  on docs using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
```
Query side, in Node:
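```javascript
// Run as an ES module (.mjs or "type": "module") so top-level await works.
import pg from 'pg';
import OpenAI from 'openai';

const { Client } = pg; // default import works across pg versions in ESM
const db = new Client({ connectionString: process.env.DATABASE_URL });
const ai = new OpenAI(); // reads OPENAI_API_KEY from the environment
await db.connect();

// Embed the question and return the k nearest chunks by cosine distance.
async function retrieve(question, k = 5) {
  const { data } = await ai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
  // pgvector accepts a vector as a '[v1,v2,...]' text literal.
  const vec = `[${data[0].embedding.join(',')}]`;
  const { rows } = await db.query(
    `select source, chunk
       from docs
      order by embedding <=> $1::vector
      limit $2`,
    [vec, k]
  );
  return rows;
}

const hits = await retrieve(
  'What is the liability cap in the EU supplier contract?'
);
console.log(hits.map(r => `[${r.source}] ${r.chunk.slice(0, 120)}`).join('\n\n'));
await db.end();
```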
Five chunks. One database. No new vendor on the bill. The model receives the question plus the five most relevant snippets and is told to answer only from those snippets. That last constraint is what gives you trustworthy citations.
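The "answer only from those snippets" step is plain string assembly. A minimal sketch, assuming rows shaped like the `{ source, chunk }` objects returned above; the function name, the instruction wording, and the sample hits are ours for illustration, not a prescribed format:

```javascript
// Build a grounded prompt from retrieved rows of shape { source, chunk }.
function buildGroundedPrompt(question, hits) {
  const snippets = hits
    .map((hit, i) => `[${i + 1}] (${hit.source})\n${hit.chunk}`)
    .join('\n\n');
  return [
    'Answer the question using only the numbered snippets below.',
    'Cite the snippet number for every claim, for example [2].',
    'If the snippets do not contain the answer, say so.',
    '',
    snippets,
    '',
    `Question: ${question}`,
  ].join('\n');
}

// Made-up rows for illustration only.
const sampleHits = [
  { source: 'supplier-handbook 7.3', chunk: 'Liability under this agreement is capped at the fees paid in the prior 12 months.' },
  { source: 'supplier-handbook 4.1', chunk: 'Invoices are payable within 30 days of receipt.' },
];
const groundedPrompt = buildGroundedPrompt('What is the liability cap?', sampleHits);
console.log(groundedPrompt);
```

Because each snippet carries its number and source, a cited "[1]" in the answer maps straight back to a document the user can open.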
Chunking is where most setups fail
Naive chunking cuts text every N characters. It splits a sentence in half. It splits a table header from its rows. It splits a clause from its definition. The retrieval looks fine in tests, then breaks the moment a real question lands on a boundary.
Two patterns that hold up:
- Recursive splitting on structure. Split on headings first, then paragraphs, then sentences. Keep the path (Section 4 > Clause 3) in the chunk metadata so the model can cite it.
- Overlap. Each chunk shares around 15% of its content with its neighbour. Cheap insurance against bad boundaries.
If your corpus has structure (Confluence, Notion, a docs site with H2s and H3s), use it. Generic 500-token windows are the last resort, not the default.
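Both patterns fit in one small function. The sketch below assumes markdown-style `#` headings and blank-line paragraph breaks. It stops at paragraph level (no sentence splitting) and takes the overlap as raw characters, so an overlap can start mid-word. The function name and defaults are illustrative.

```javascript
// Split text into chunks that carry their heading path, with overlap between
// consecutive chunks of the same section.
// Assumptions: headings start with '#', paragraphs are separated by blank lines.
function chunkByStructure(text, { maxChars = 800, overlapRatio = 0.15 } = {}) {
  const chunks = [];
  const path = []; // current heading trail, e.g. ['Section 4', 'Clause 3']
  let buffer = '';

  const flush = () => {
    if (buffer.trim()) chunks.push({ path: path.join(' > '), text: buffer.trim() });
    buffer = '';
  };

  for (const rawBlock of text.split(/\n\s*\n/)) {
    let block = rawBlock.trim();
    const heading = block.match(/^(#+)[ \t]+(.+)/);
    if (heading) {
      flush();                                  // never let a chunk straddle a heading
      path.splice(heading[1].length - 1);       // drop this level and anything deeper
      path.push(heading[2].trim());
      block = block.slice(heading[0].length).trim(); // text directly under the heading
    }
    if (!block) continue;
    if (buffer && buffer.length + block.length > maxChars) {
      // Carry the tail of the full chunk into the next one as overlap.
      const overlapChars = Math.round(buffer.length * overlapRatio);
      const overlap = buffer.slice(buffer.length - overlapChars);
      flush();
      buffer = overlap;
    }
    buffer = buffer ? `${buffer}\n\n${block}` : block;
  }
  flush();
  return chunks;
}

const sampleDoc = [
  '# Section 4', 'Intro paragraph.',
  '## Clause 3', 'Liability is capped.',
  '## Clause 4', 'Payment in 30 days.',
].join('\n\n');
console.log(chunkByStructure(sampleDoc));
```

Storing `path` next to the text is what lets the model cite "Section 4 > Clause 3" instead of an opaque chunk ID.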
Beyond the basic pattern
If pure vector search disappoints, the next move is usually hybrid retrieval: combine BM25 (keyword) and vector scores. Some of the more interesting results we have seen come from graph-aware retrieval. The FastGraphRAG project on Hacker News (457 points, November 2024) layers PageRank over the chunk graph to find chunks that matter structurally, not just the ones that look most like the query. Worth a read before you decide your basic setup is "good enough".
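One way to do the combining is reciprocal rank fusion (RRF). Note the swap: RRF merges the two rankings rather than the raw scores, which sidesteps the problem that BM25 and cosine scores live on different scales. A minimal sketch, using the commonly cited constant k = 60 and made-up chunk IDs:

```javascript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank),
// with rank starting at 1. Each list is an array of chunk IDs, best first.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical rankings from a keyword search and a vector search.
const keywordRanking = ['c7', 'c2', 'c9'];
const vectorRanking = ['c2', 'c4', 'c7'];
console.log(reciprocalRankFusion([keywordRanking, vectorRanking]));
```

Chunks that appear in both lists rise to the top, which is usually the behaviour you want from a hybrid setup.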
Measuring whether it worked
The most common RAG failure is shipping with no evals. You add retrieval, the answers feel better, you call it done. Two months later a colleague says "the AI is hallucinating again" and you have nothing to debug.
The minimum bar: a CSV of 50 real questions with the chunk ID that should be retrieved for each. Run your retrieval over it nightly. Track hit@5 (is the right chunk in the top five) and MRR (mean reciprocal rank). When either drops, you know before the user does. If you cannot say what your hit@5 was last week, you do not have a RAG system. You have a demo.
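Both metrics are a few lines once you have, for each eval question, the expected chunk ID and the ranked IDs your retrieval returned. A sketch under that assumption; the field names are hypothetical, and MRR here is computed over the top k only, so a miss counts as zero:

```javascript
// results: [{ expectedChunkId, retrievedIds }] with retrievedIds ranked best first.
function scoreRetrieval(results, k = 5) {
  let hits = 0;
  let reciprocalRankSum = 0;
  for (const { expectedChunkId, retrievedIds } of results) {
    const position = retrievedIds.slice(0, k).indexOf(expectedChunkId);
    if (position !== -1) {
      hits += 1;
      reciprocalRankSum += 1 / (position + 1); // rank 1 scores 1, rank 2 scores 0.5, ...
    }
  }
  return {
    hitAtK: hits / results.length,
    mrr: reciprocalRankSum / results.length,
  };
}

// Four made-up eval rows: ranks 1, 2, miss, 4.
const evalResults = [
  { expectedChunkId: 'a', retrievedIds: ['a', 'b'] },
  { expectedChunkId: 'b', retrievedIds: ['x', 'b'] },
  { expectedChunkId: 'c', retrievedIds: ['x', 'y'] },
  { expectedChunkId: 'd', retrievedIds: ['p', 'q', 'r', 'd'] },
];
console.log(scoreRetrieval(evalResults));
```

Log the two numbers with a date on every nightly run and a drop becomes visible long before anyone files a complaint.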
From theory to a real corpus
When we built a knowledge agent for a Dutch logistics client this spring, the thing we kept tripping on was that their shipping policy doc had three versions in circulation. The model retrieved the wrong one with high confidence. We solved it by adding a supersedes field to every chunk and filtering before similarity, not after. Agent work like that is mostly debugging the corpus, not the model.
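The order of operations matters more than it looks. The toy sketch below shows why, under assumptions of our own: the field names (`docId`, `supersedes`), the sample chunks, and the dot-product scoring are illustrative, not the client's actual schema.

```javascript
// Assumption: each chunk has { id, docId, supersedes, vector }, where `supersedes`
// is the docId of the version this document replaces, or null.
function dropSuperseded(chunks) {
  const replaced = new Set(chunks.map(c => c.supersedes).filter(Boolean));
  return chunks.filter(c => !replaced.has(c.docId));
}

function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Filter first, then rank: only current versions compete for the k slots.
function retrieveCurrent(queryVector, chunks, k = 2) {
  return dropSuperseded(chunks)
    .map(c => ({ id: c.id, score: dotProduct(queryVector, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(c => c.id);
}

// Three versions of one policy plus an unrelated document (hypothetical data).
const policyChunks = [
  { id: 'policy-v1#1', docId: 'policy-v1', supersedes: null, vector: [1, 0] },
  { id: 'policy-v2#1', docId: 'policy-v2', supersedes: 'policy-v1', vector: [0.98, 0.1] },
  { id: 'policy-v3#1', docId: 'policy-v3', supersedes: 'policy-v2', vector: [0.95, 0.2] },
  { id: 'returns#1', docId: 'returns', supersedes: null, vector: [0.3, 0.9] },
];
console.log(retrieveCurrent([1, 0], policyChunks, 2));
```

With this data, ranking first and filtering afterwards would fill both slots with the two outdated versions and then discard them, leaving nothing for the model to read.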
Five-minute audit you can run today: open whatever AI feature you shipped last quarter, pull the prompt, and count tokens. If it pastes more than 20k tokens of source material on every request, you have a retrieval problem waiting to be discovered. Write down the top three questions users ask. If you cannot point to which document each answer should come from, your team needs a retrieval layer before it needs a bigger model.
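If you have no tokenizer at hand, a rough estimate is enough for this audit. The sketch assumes about four characters per token, a common rule of thumb for English text; use your provider's tokenizer when you need real numbers. The function names are ours.

```javascript
// Rough heuristic: about 4 characters per token for English text (an assumption,
// not an exact count).
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

// Flag prompts that paste more source material than the 20k-token threshold.
function auditPrompt(sourceMaterial, thresholdTokens = 20_000) {
  const tokens = estimateTokens(sourceMaterial);
  return { tokens, needsRetrieval: tokens > thresholdTokens };
}

// Stand-in for the source material pasted into a real prompt.
const pastedSource = 'x'.repeat(100_000);
console.log(auditPrompt(pastedSource));
```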




