
RAG without the hype: when retrieval beats a longer prompt

A 200k-token prompt feels powerful until the model picks the wrong paragraph from page 47. Retrieval is not yesterday's pattern. It is a way to make the model read less and answer better.

Jacob Molkenboer
Founder · A Brand New Company
Published: 12 May 2026
Reading time: 7 min read
Category: RAG

A Friday afternoon. You paste the entire 180-page supplier handbook into the prompt, append the question, and wait. The model answers confidently and cites section 4.2. Section 4.2 doesn't exist. Section 4.1 covers payment terms; you wanted clause 7.3 on liability. The model read everything and remembered nothing in particular.

This happens more than people admit. The fix is rarely a bigger context window. The fix is usually a small retrieval layer in front of the model. RAG (retrieval-augmented generation) keeps getting dismissed as last year's pattern now that million-token windows are shipping. That dismissal is wrong for most jobs your team actually has.

The 200k-token trap

Long context works. We use it. For a single contract, a single research paper, a single onboarding doc, pasting the whole thing is fine. The model can hold a coherent argument across the document because nothing competes for its attention.

The trap is what happens past one document. Three things compound:

  • Cost. A 180k-token input prompt at current Sonnet rates is roughly $0.54 per call before output. Run that 4,000 times a month and you are paying about $2,160 a month, a four-figure invoice for a feature your team currently does in their head.
  • Latency. Time-to-first-token climbs with prompt length. A tight retrieval query usually returns in under 200 ms; with a 180k-token prompt, the wait before the first token is measured in seconds.
  • Attention dilution. The well-known "lost in the middle" effect. Models recall the start and the end of a long context better than the middle. Paste 80 documents and watch the model fixate on the first three and the last two.
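
The cost bullet is plain arithmetic, and it is worth redoing with your own numbers. A minimal sketch, assuming an input price of $3 per million tokens (the rate implied by the $0.54 figure above; check your provider's current pricing) and a hypothetical retrieval prompt of about 3,000 tokens. It ignores output tokens and the small cost of embedding the question.

```javascript
// Assumption: $3 per million input tokens, the rate implied by $0.54 for 180k tokens.
const PRICE_PER_MILLION_INPUT_TOKENS = 3.0;

// Input cost in dollars for a number of calls at a given prompt size.
function inputCost(promptTokens, calls = 1) {
  return (promptTokens / 1_000_000) * PRICE_PER_MILLION_INPUT_TOKENS * calls;
}

const callsPerMonth = 4000;

// The long-context approach: the whole handbook on every call.
const longPrompt = inputCost(180_000, callsPerMonth);

// Hypothetical retrieval prompt: the question plus a handful of chunks.
const retrievalPrompt = inputCost(3_000, callsPerMonth);

console.log(`Long prompt: $${longPrompt.toFixed(2)} per month`);
console.log(`Retrieval:   $${retrievalPrompt.toFixed(2)} per month`);
```

Under those assumptions the long prompt costs about $2,160 a month and the retrieval prompt about $36. The exact ratio depends entirely on how many tokens your retrieved chunks add up to.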

What retrieval actually does

Strip the acronym and RAG is one sentence. You chunk your source material into small pieces, turn each piece into a vector, and at query time you pull the few chunks closest to the question. The model only ever sees those few chunks. It reads less. It answers from cited evidence.

The work is not in the embedding call. The work is in deciding what a "chunk" is, what your corpus actually contains, and how you measure whether the right chunks come back.
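
To make "pull the few chunks closest to the question" concrete, here is a toy version of the lookup with no database and no embedding model. The three-dimensional vectors and chunk IDs are invented for illustration; real embeddings have on the order of a thousand dimensions, and a vector index replaces the full scan.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query and keep the k most similar.
function topK(queryVec, chunks, k) {
  return chunks
    .map(c => ({ ...c, score: cosineSimilarity(queryVec, c.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Made-up vectors standing in for real embeddings.
const chunks = [
  { id: 'payment-terms', vec: [0.9, 0.1, 0.0] },
  { id: 'liability-cap', vec: [0.1, 0.9, 0.2] },
  { id: 'delivery-times', vec: [0.0, 0.2, 0.9] },
];
const question = [0.2, 0.8, 0.1]; // pretend embedding of a liability question

console.log(topK(question, chunks, 2).map(c => c.id));
```

The liability chunk ranks first because its vector points in nearly the same direction as the question's. Everything that makes this hard in practice lives outside this function: what goes into each chunk and how you check the ranking.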

When the longer prompt wins

Honestly assess the job before you reach for vectors. Long context is the right tool when:

  • The whole corpus is one document and the answer requires cross-references the chunker would shred. A 60-page contract reviewed end-to-end, for example.
  • You run the task once or twice. Setting up an index for a one-shot job is a waste of an afternoon.
  • The source fits under 100k tokens and is read once per session.

If those describe your case, paste the document and move on. Don't add infrastructure you won't use again.

When retrieval wins by a wide margin

The other side of the line:

  • The corpus changes. Product docs, ticket history, internal wiki, shipping policies updated weekly. Re-pasting on every query is unworkable.
  • You have hundreds of documents and no idea in advance which ones the question touches.
  • Cost per query matters because the feature runs thousands of times a day.
  • You need citations the user can click. Long-context answers don't tell you which page they came from. Retrieval does, by construction.

Takeaway

Long context is for one document read once. Retrieval is for a corpus queried often. Picking the wrong one is what makes your AI feature feel slow or vague.

A minimal retrieval setup

You do not need a vector database company. Postgres with the pgvector extension covers the first two years of most projects. Schema first:

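-- Enable pgvector (once per database).
create extension if not exists vector;

create table docs (
  id        bigserial primary key,
  source    text     not null,   -- file or URL the chunk came from
  chunk     text     not null,   -- the chunk text itself
  embedding vector(1536)         -- must match the embedding model's dimension
);

-- Approximate nearest-neighbour index for cosine distance.
-- Build it after loading data: ivfflat derives its clusters from existing rows.
create index docs_embedding_idx
  on docs using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);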

Query side, in Node:

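import { Client } from 'pg';
import OpenAI from 'openai';

// Run as an ES module (for example a .mjs file) so top-level await works.
const db = new Client({ connectionString: process.env.DATABASE_URL });
const ai = new OpenAI(); // reads OPENAI_API_KEY from the environment
await db.connect();

// Embed the question and return the k nearest chunks by cosine distance.
async function retrieve(question, k = 5) {
  const { data } = await ai.embeddings.create({
    model: 'text-embedding-3-small', // 1536 dimensions, matching the schema
    input: question,
  });
  // pgvector accepts a vector as a '[0.1,0.2,...]' text literal.
  const vec = `[${data[0].embedding.join(',')}]`;

  // <=> is pgvector's cosine-distance operator; smaller means closer.
  const { rows } = await db.query(
    `select source, chunk
       from docs
       order by embedding <=> $1::vector
       limit $2`,
    [vec, k]
  );
  return rows;
}

const hits = await retrieve(
  'What is the liability cap in the EU supplier contract?'
);
console.log(hits.map(r => `[${r.source}] ${r.chunk.slice(0, 120)}`).join('\n\n'));

await db.end(); // close the connection so the script can exit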

Five chunks. One database. No new vendor on the bill. The model receives the question plus the five most relevant snippets and is told to answer only from those snippets. That last constraint is what gives you trustworthy citations.
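
The "answer only from those snippets" constraint lives in the prompt, not in the database. A minimal sketch of how the retrieved rows might be assembled, assuming the { source, chunk } row shape used above; buildPrompt and the instruction wording are illustrative, not a tested template.

```javascript
// Build a prompt that restricts the model to the retrieved snippets.
// `hits` is an array of { source, chunk } rows.
function buildPrompt(question, hits) {
  // Number each snippet so the model has something concrete to cite.
  const context = hits
    .map((hit, i) => `[${i + 1}] (${hit.source})\n${hit.chunk}`)
    .join('\n\n');

  return [
    'Answer the question using only the numbered snippets below.',
    'Cite the snippet number for every claim, like [2].',
    'If the snippets do not contain the answer, say so instead of guessing.',
    '',
    'Snippets:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}

// Placeholder rows; the source labels and text are invented.
const prompt = buildPrompt('What is the liability cap?', [
  { source: 'supplier-handbook, clause 7.3', chunk: '(text of the liability clause)' },
  { source: 'supplier-handbook, section 4.1', chunk: '(text of the payment terms)' },
]);
console.log(prompt);
```

Numbering the snippets is what turns a citation into something you can check: a cited [2] maps straight back to a source value you can render as a link.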

Chunking is where most setups fail

Naive chunking cuts text every N characters. It splits a sentence in half. It splits a table header from its rows. It splits a clause from its definition. The retrieval looks fine in tests, then breaks the moment a real question lands on a boundary.

Two patterns that hold up:

  • Recursive splitting on structure. Split on headings first, then paragraphs, then sentences. Keep the path (Section 4 > Clause 3) in the chunk metadata so the model can cite it.
  • Overlap. Each chunk shares around 15% of its content with its neighbour. Cheap insurance against bad boundaries.

If your corpus has structure (Confluence, Notion, a docs site with H2s and H3s), use it. Generic 500-token windows are the last resort, not the default.
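
Both patterns fit in a short function. The sketch below assumes Markdown-style headings (#, ##, and so on) and measures size in characters rather than tokens; chunkDocument, maxChars and overlapRatio are names chosen for this example. The sentence splitter is deliberately naive and will also break on decimals such as "7.3", so treat it as a starting point.

```javascript
// Last-resort level: split an oversized paragraph into sentence groups.
// A single sentence longer than maxChars is kept whole.
function splitLongParagraph(paragraph, maxChars) {
  const sentences = paragraph.match(/[^.!?]+[.!?]*\s*/g) ?? [paragraph];
  const pieces = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && (current + sentence).length > maxChars) {
      pieces.push(current.trim());
      current = '';
    }
    current += sentence;
  }
  if (current.trim()) pieces.push(current.trim());
  return pieces;
}

// Split on headings first, then paragraphs, then sentences.
// Each chunk records its heading path and starts with the tail of the
// previous piece in the same section (the overlap).
function chunkDocument(text, { maxChars = 800, overlapRatio = 0.15 } = {}) {
  const chunks = [];
  let path = [];
  for (const section of text.split(/^(?=#{1,6} )/m)) {
    let body = section;
    const heading = section.match(/^(#{1,6}) (.*)/);
    if (heading) {
      const level = heading[1].length;
      path = path.slice(0, level - 1); // drop deeper headings from the path
      path[level - 1] = heading[2].trim();
      body = section.slice(heading[0].length);
    }
    const pieces = body
      .split(/\n\s*\n/)
      .map(p => p.trim())
      .filter(Boolean)
      .flatMap(p => (p.length > maxChars ? splitLongParagraph(p, maxChars) : [p]));

    let previous = '';
    for (const piece of pieces) {
      const overlapChars = Math.floor(previous.length * overlapRatio);
      const tail = overlapChars > 0 ? previous.slice(-overlapChars) : '';
      chunks.push({
        path: path.filter(Boolean).join(' > '),
        text: tail ? `${tail} ${piece}` : piece,
      });
      previous = piece;
    }
  }
  return chunks;
}

const sample = [
  '# Section 4',
  '',
  '## Clause 3',
  '',
  'The supplier carries the risk until delivery.',
  '',
  'The buyer inspects goods on arrival.',
].join('\n');

for (const chunk of chunkDocument(sample)) {
  console.log(`[${chunk.path}] ${chunk.text}`);
}
```

Both chunks carry the path "Section 4 > Clause 3", which is what lets the model cite a location instead of a guess. The overlap here is cut by character count, so it can start mid-word; snapping it to a sentence boundary is a sensible next refinement.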

Beyond the basic pattern

If pure vector search disappoints, the next move is usually hybrid retrieval: combine BM25 (keyword) and vector scores. Some of the more interesting results we have seen come from graph-aware retrieval. The FastGraphRAG project on Hacker News (457 points, November 2024) layers PageRank over the chunk graph to find chunks that matter structurally, not just lexically. Worth a read before you decide your basic setup is "good enough".
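
One practical snag with hybrid retrieval is that BM25 scores and cosine similarities sit on different scales. A common workaround, shown here as a swapped-in technique, not as anything the projects above prescribe, is reciprocal rank fusion: it ignores the raw scores and merges the two ranked lists by position. The chunk IDs are hypothetical, and k = 60 is a widely used default, not a tuned value.

```javascript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1. Items that rank well in several lists rise.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical chunk IDs, best match first, from each retriever.
const keywordHits = ['c7', 'c2', 'c9'];
const vectorHits = ['c2', 'c4', 'c7'];

console.log(reciprocalRankFusion([keywordHits, vectorHits]));
```

Chunks c2 and c7 appear in both lists, so they end up ahead of c4 and c9, which each appear in only one. If you would rather weight the raw scores, normalise them to a common range first.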

Measuring whether it worked

The most common RAG failure is shipping with no evals. You add retrieval, the answers feel better, you call it done. Two months later a colleague says "the AI is hallucinating again" and you have nothing to debug.

The minimum bar: a CSV of 50 real questions with the chunk ID that should be retrieved for each. Run your retrieval over it nightly. Track hit@5 (is the right chunk in the top five) and MRR (mean reciprocal rank). When either drops, you know before the user does. If you cannot say what your hit@5 was last week, you do not have a RAG system. You have a demo.
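
Both metrics are a few lines of code once you have, for each question, the expected chunk ID and the ranked IDs your retrieval returned. A minimal sketch; scoreRetrieval and the field names are assumptions for this example, and the MRR here only looks at the top k, so a correct chunk below rank k counts as a miss.

```javascript
// results: [{ expectedChunkId, rankedIds }] with rankedIds best match first.
function scoreRetrieval(results, k = 5) {
  let hits = 0;
  let reciprocalRankSum = 0;
  for (const { expectedChunkId, rankedIds } of results) {
    // 1-based rank of the expected chunk within the top k, or 0 if absent.
    const rank = rankedIds.slice(0, k).indexOf(expectedChunkId) + 1;
    if (rank > 0) {
      hits += 1;
      reciprocalRankSum += 1 / rank;
    }
  }
  return {
    hitAtK: hits / results.length,
    mrr: reciprocalRankSum / results.length,
  };
}

// Hypothetical nightly run over four questions.
const results = [
  { expectedChunkId: 'c1', rankedIds: ['c1', 'c5', 'c9'] },       // rank 1
  { expectedChunkId: 'c2', rankedIds: ['c7', 'c2', 'c3'] },       // rank 2
  { expectedChunkId: 'c3', rankedIds: ['c8', 'c4', 'c6'] },       // miss
  { expectedChunkId: 'c4', rankedIds: ['c9', 'c1', 'c2', 'c4'] }, // rank 4
];

console.log(scoreRetrieval(results));
```

For these four rows hit@5 is 0.75 and MRR is 0.4375. To feed this from the retrieve function shown earlier, the query would also need to select the chunk's id.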

From theory to a real corpus

When we built a knowledge agent for a Dutch logistics client this spring, the thing we kept tripping on was that their shipping policy doc had three versions in circulation. The model retrieved the wrong one with high confidence. We solved it by adding a supersedes field to every chunk and filtering before similarity, not after. Agent work like that is mostly debugging the corpus, not the model.
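
The order of operations is the point: if you rank first and filter afterwards, stale versions can fill the top k and leave nothing behind. A toy illustration, assuming each chunk carries a docId and a supersedes value naming the version it replaces (or null); the data model, the vectors and the function names are guesses for this sketch, not the client's actual schema.

```javascript
// Keep only chunks whose document version has not been replaced.
function currentChunks(chunks) {
  const superseded = new Set(
    chunks.map(c => c.supersedes).filter(id => id != null)
  );
  return chunks.filter(c => !superseded.has(c.docId));
}

// Dot product as a stand-in similarity for the toy vectors below.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Filter first, then rank by similarity.
function retrieveCurrent(queryVec, chunks, k) {
  return currentChunks(chunks)
    .map(c => ({ ...c, score: dot(queryVec, c.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Three versions of one policy plus an unrelated document.
const chunks = [
  { id: 'a', docId: 'shipping-v1', supersedes: null, vec: [0.9, 0.1] },
  { id: 'b', docId: 'shipping-v2', supersedes: 'shipping-v1', vec: [0.8, 0.2] },
  { id: 'c', docId: 'shipping-v3', supersedes: 'shipping-v2', vec: [0.7, 0.3] },
  { id: 'd', docId: 'returns-v1', supersedes: null, vec: [0.1, 0.9] },
];

console.log(retrieveCurrent([1, 0], chunks, 2).map(c => c.id));
```

Without the filter, the oldest version (chunk a) would score highest for this query. With it, only the current shipping policy and the returns document are candidates. In a Postgres setup the same idea would presumably be a where clause that runs ahead of the order by.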

Five-minute audit you can run today: open whatever AI feature you shipped last quarter, pull the prompt, and count tokens. If it pastes more than 20k tokens of source material on every request, you have a retrieval problem waiting to be discovered. Write down the top three questions users ask. If you cannot point to which document each answer should come from, your team needs a retrieval layer before it needs a bigger model.

Frequently asked

Isn't a million-token context window enough to skip RAG entirely?
For one document read once, yes. For a corpus queried many times a day, no. Cost, latency, and attention dilution all favour retrieval as soon as the corpus is bigger than a single file.

Do I need a dedicated vector database to start?
Not for the first year or two. Postgres with pgvector handles millions of chunks comfortably. Move to a specialised store only when you hit measured limits, not before.

How do I know my chunking is the problem?
Run 50 real questions through your retrieval and check whether the right chunk lands in the top five. If it doesn't, chunking is usually the cause before embeddings or the model itself.

What's the smallest useful RAG eval?
A CSV with 50 questions and the expected chunk ID for each, run nightly. Track hit@5 and MRR. Without that you cannot tell whether changes improve or regress retrieval.
