Databases

pgvector HNSW reindex: an 8-hour Black Friday outage

A Rotterdam logistics SaaS reindexed pgvector mid-Black-Friday and watched their RAG agent go dark for 8 hours. Here is the timeline and the procedure we now ship.

Jacob Molkenboer· Founder · A Brand New Company· 25 Jun 2026· 9 min

Open oak index-card drawer with cream card, green silk ribbon, brass tab, broken red wax seal on folded ledger slip.

Saturday, 28 November 2025. 14:47 CET. A logistics SaaS in Rotterdam — twenty-seven people, mostly couriers and ops, a four-engineer platform team — pushed REINDEX on a pgvector HNSW index that backed their customer-support RAG agent. The agent answered roughly 6,400 tickets per day during Black Friday week. At 14:48, the agent stopped answering anything. It would not answer again until 22:51.

This is the post-mortem of those eight hours, and the procedure we put in place afterwards so the next 4.2M-embedding re-index never touches an ACCESS EXCLUSIVE lock again. They asked us to share the timeline once the bonuses had been paid out. The names of the engineers are removed. The numbers are real.

The cluster on the Friday

The setup was the kind of thing a four-engineer team can keep in their head. One Postgres 16 primary on Hetzner, 64 GB RAM, NVMe, with a streaming replica in Falkenstein. The pgvector extension at 0.7.4, an HNSW index over a 4.2M-row embeddings table, vectors at 1536 dimensions from text-embedding-3-small. The index had been built four months earlier with m=16, ef_construction=64. It was 11 GB on disk.

The agent on top was a fairly standard retrieval pipeline: query, embed, top-k against the HNSW index, re-rank with a smaller cross-encoder, answer with Claude Sonnet. During Black Friday week, traffic ran at roughly 130 queries per minute peak.

On the Friday afternoon, an engineer noticed that recall against a held-out evaluation set had slowly drifted. New embeddings had been streaming in for four months and the HNSW graph had grown messy at the edges. Recall@10 on the eval set was 0.81 against a baseline of 0.93. Latency was fine. They wanted to rebuild the index before the weekend.

The Saturday morning decision

The decision, written in the ticket at 14:32 on Saturday, was a single line: rebuild the HNSW now, traffic is lower (it was not lower — Black Friday rolled into Saturday morning), expected duration forty minutes based on the test cluster, will update at the back end.

REINDEX on a Postgres B-tree index, with the default options, locks the table against writes but allows reads. REINDEX on a pgvector HNSW index also takes a lock, and crucially, queries that would normally use that index now sequentially scan the table. A sequential scan against 4.2M rows of 1536-dimension vectors, with cosine distance, is not a query. It is a forty-second timeout.

The engineer ran:

REINDEX INDEX embeddings_hnsw_idx;

No CONCURRENTLY. No swap. Single statement. At 14:48, the first RAG query timed out. By 14:55, the upstream tickets queue had backed up by 380 items. By 15:30, the support team had switched to manual triage and the COO was on the platform Slack.

The reindex did not take forty minutes. It took five hours and forty-three minutes. The engineer had tested on a 280k-row staging cluster. HNSW build time on pgvector scales worse than linearly with row count — the graph construction does k-NN queries against the partially-built graph for every new row, and the constant grows with the size of the graph itself. The pgvector documentation is clear about this. The engineer had not read it on the Saturday.

Warning

HNSW build time is roughly O(N · log N) but with a constant that grows with the size of the existing graph. A 40-minute build on 280k rows is a 6+ hour build on 4M rows. Test on a copy of production, not a staging slice.

Eight hours of ACCESS EXCLUSIVE

Once the lock was in flight, cancelling it would roll back the partial index and leave the cluster in the same state, only with five hours of build wasted. The team decided to let it finish. While the REINDEX ran:

Customer-support tickets queued in Intercom; the SLA breach counter ticked up.
The RAG agent's health check kept returning 500. The frontend showed a "we're temporarily unavailable" banner.
A second engineer wrote a query-side fallback that fell through to the cross-encoder against a random 50k sample. Recall was bad enough that they killed it within twenty minutes.
The COO drafted, then did not send, an email to the top fifteen customers.

The REINDEX completed at 20:31. At which point the team discovered that the new index had to vacuum and analyze before the planner picked it up. Another two hours and twenty minutes. The agent came back at 22:51.

Total: eight hours and three minutes of dark agent during a Black Friday Saturday.

Recovery and the cost

Counting only the things that have a euro number attached: 412 tickets escalated to human agents who would normally have been off-shift, two senior engineers paid weekend rates, one customer threatened cancellation (they did not cancel). The platform team called us on the Tuesday.

The interesting part of this post is not the eight hours. It is what we put in place afterwards.

Online build, alias swap

The procedure we now ship before any pgvector re-index hits production has three properties. The live HNSW index is never dropped or locked until the new one is fully built and warmed. The application code never knows the index is being swapped, because the index name stays stable from the planner's perspective. And if the new index is worse than the old one, the rollback is one renamed object away.

The trick is that Postgres lets you build an index with a different name, concurrently, and then atomically swap the names inside a single transaction. CREATE INDEX CONCURRENTLY does not take ACCESS EXCLUSIVE; it takes a lower lock that allows reads and writes. The swap itself takes ACCESS EXCLUSIVE for milliseconds.

Here is the shape of it.

-- 1. Build a fresh HNSW index alongside the live one, without locking.
-- Hours on 4.2M rows. The live index keeps serving queries throughout.
CREATE INDEX CONCURRENTLY embeddings_hnsw_idx_new
  ON embeddings
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 96);

-- 2. Warm the new index so the planner has stats and the OS has it in RAM.
ANALYZE embeddings;
SELECT count(*) FROM embeddings
  WHERE embedding <=> '[...]'::vector < 0.5
  LIMIT 100;  -- repeat with representative query vectors

-- 3. Atomic swap. Inside one transaction, rename the old, then the new.
BEGIN;
  ALTER INDEX embeddings_hnsw_idx     RENAME TO embeddings_hnsw_idx_old;
  ALTER INDEX embeddings_hnsw_idx_new RENAME TO embeddings_hnsw_idx;
COMMIT;

-- 4. After 24 hours of healthy operation, drop the rollback copy.
DROP INDEX CONCURRENTLY embeddings_hnsw_idx_old;

The swap in step 3 holds ACCESS EXCLUSIVE, but only on the embeddings table and only for the duration of two DDL statements. In practice, on a healthy cluster, this is under fifty milliseconds. Queries in flight either block briefly or finish on the old name before the rename lands.

There is a footgun. If CREATE INDEX CONCURRENTLY fails partway — disk pressure, OOM, a slow vacuum — it leaves an INVALID index behind. The planner will not use it, but the disk space is gone. Before retry, drop the invalid one:

SELECT indexname FROM pg_indexes
WHERE tablename = 'embeddings' AND indexname LIKE '%_new';

DROP INDEX CONCURRENTLY embeddings_hnsw_idx_new;

The runbook, end to end

The procedure we now ship is six steps. Three are checks before the build starts. Three are the build, swap, and rollback.

Capacity check. Free disk must be at least 2× the size of the current index. Run pg_size_pretty(pg_relation_size('embeddings_hnsw_idx')) on the primary, then df -h on the host.
Recall baseline. Run the held-out eval set against the live index. Record recall@10 and p95 latency. This is the gate the new index has to beat.
Raise maintenance_work_mem. Bump it to ~25% of host RAM for the build session only. HNSW build is memory-hungry; the default 64 MB is the slowest possible setting.
Build concurrently. The CREATE INDEX CONCURRENTLY statement above. Tag the session with application_name so it shows up in pg_stat_activity as something other than "psql".
Warm and verify. ANALYZE. Run the eval set against the new index by hinting it explicitly (EXPLAIN to confirm the new name is picked). If recall@10 is worse than the baseline, abort. If it is equal or better, continue.
Swap. The BEGIN/RENAME/RENAME/COMMIT transaction. Watch the agent's health check for one full cycle (typically 60s). The old index sits on disk for 24 hours as undo.

The rollback path is the same procedure in reverse. If something goes wrong in the first hour, run another rename pair and the planner is back on the old graph before the next query lands.

What this has to do with agentic systems

A recurring thread on the front page this week is about building reliable agentic systems. Most of the discussion focuses on prompt structure, evals, and tool wiring. The boring truth is that an agent's reliability is bounded by the reliability of the slowest, weirdest piece of infrastructure underneath it. For a RAG agent, that is almost always the vector store.

If you are building agents on top of pgvector — or pgvector-on-Supabase, or pgvector-on-RDS — the index re-build procedure is not a database problem. It is part of your agent's runtime contract. Treating it as routine database maintenance is what produces eight-hour outages.

When we built the support agent for the Rotterdam logistics client above, the thing we ran into during recovery was that there was no separation between "the database team's procedures" and "the agent team's procedures." We ended up writing the swap runbook as part of the agent's deployment manifest, in the same file that holds the system prompt and the eval set. It now ships with every AI agent we build, because the failure mode is the same regardless of stack.

The five-minute thing to do today

Open the database that backs your retrieval pipeline. Run \d <embeddings_table> and write down the name of the HNSW (or IVFFlat) index. Then write down, in your runbook, the exact CREATE INDEX CONCURRENTLY statement that would rebuild it, with the correct opclass and parameters. If your team cannot produce that statement in under five minutes during an incident, you do not have a rebuild procedure. You have an outage waiting for a recall drift to make it interesting.

Key takeaway

Build the new HNSW index alongside the old one, warm it, then swap names in a single transaction. The lock window collapses from hours to milliseconds.

FAQ

Can I just use REINDEX CONCURRENTLY instead of the alias swap?

Yes on recent Postgres for HNSW, but you give up the rollback path: the old index is gone the moment it finishes. The alias swap keeps the old index around for 24 hours so you can flip back if recall regresses.

How long does CREATE INDEX CONCURRENTLY take on a few million embeddings?

On 4M+ rows of 1536-dim vectors with maintenance_work_mem raised to ~25% of host RAM, plan for 3 to 6 hours. Test on a copy of production, not a sliced staging set, because build time scales worse than linearly.

Does the same procedure work for IVFFlat?

Same shape. IVFFlat builds faster but recall is more sensitive to centroid drift, so you rebuild it more often. The build-alongside, warm, swap-names pattern is identical.

How do we know when recall has drifted enough to rebuild?

Hold out an eval set of 500 to 2000 queries with known-good answers. Run it weekly and track recall@10. When it slips more than ~10% from the baseline, plan a rebuild in the next quiet window.

ragai agentscase studyarchitectureoperationsknowledge base

Building something?

Start a project