Scaling Postgres with pgvector — what we learned at 2M embeddings
Lessons from running a production HNSW vector index on Postgres for code search at scale. Recall, latency, and the operational tradeoffs nobody warns you about.
When we shipped HELIX.ai's semantic code search, we benched three vector stores: FAISS, Pinecone, and Postgres + pgvector. We expected pgvector to lose. It didn't — but only after we tuned three things almost nobody mentions in introductory blog posts.
Why we picked Postgres
Operational simplicity. We were already running Postgres for everything else — auth, billing, repo metadata. Adding a managed vector service meant another deploy target, another set of secrets, another paging surface. pgvector keeps embeddings inside the same transaction boundary as the rest of the data, which is more useful than it sounds.
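To make the transaction-boundary point concrete, here is a minimal sketch. The repos table, its columns, and the literal values are hypothetical; the shape is what matters: repo metadata and its embeddings commit or roll back as one unit.

-- Hypothetical schema: repo metadata and chunk embeddings live in the
-- same database, so a single transaction covers both writes.
BEGIN;

UPDATE repos
SET    last_indexed_at = now()
WHERE  repo_id = 42;

INSERT INTO embeddings (repo_id, chunk_id, content, vector)
VALUES (42, 'src/main.py:0', 'def main(): ...', '[0.01, -0.03, 0.12]');

COMMIT;  -- either both writes land, or neither does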
The first surprise: HNSW vs IVFFlat
pgvector ships two index types. We started with IVFFlat because every tutorial uses it. At 200k embeddings, recall was 92% at acceptable latency. At 2M, recall fell off a cliff — to ~78%. HNSW solved it, but at a cost: index build times went from minutes to over an hour, and memory usage roughly doubled.
-- IVFFlat: fast to build, weaker at scale
CREATE INDEX ON embeddings USING ivfflat (vector vector_cosine_ops)
WITH (lists = 1000);
-- HNSW: slow to build, far better recall
CREATE INDEX ON embeddings USING hnsw (vector vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

The numbers that matter: at 2M rows, HNSW gave us 96% recall@10 with p95 query latency of 38ms. IVFFlat at the same configuration gave 78% recall and 71ms p95.
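The hour-plus build time is partly tunable. As a hedged aside, since the right values depend on your RAM and pgvector version: HNSW builds are heavily influenced by maintenance_work_mem, and newer pgvector releases can parallelize the build. A minimal sketch, with an illustrative index name:

-- Session settings worth setting before a big HNSW build.
-- Values are illustrative; size them to your hardware.
SET maintenance_work_mem = '8GB';           -- HNSW builds want the graph in memory
SET max_parallel_maintenance_workers = 4;   -- parallel HNSW builds need a recent pgvector
CREATE INDEX embeddings_vector_hnsw
    ON embeddings USING hnsw (vector vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);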
The second surprise: ef_search at query time
HNSW has a runtime parameter — ef_search — that's invisible to most ORMs. The default of 40 was fine for our top-5 queries but tanked recall on top-50. Setting it per-query was the fix.
# Set ef_search per session, not at index creation
async def vector_search(query_embedding, k: int):
    # query_embedding is the already-embedded query vector; this assumes the
    # pool has pgvector's asyncpg codec registered so the vector binds cleanly
    async with pool.acquire() as conn:
        # SET LOCAL only takes effect inside a transaction block
        async with conn.transaction():
            # Tune for the result size we want
            ef = max(40, k * 4)
            await conn.execute(f"SET LOCAL hnsw.ef_search = {ef}")
            return await conn.fetch("""
                SELECT chunk_id, content, 1 - (vector <=> $1) AS score
                FROM embeddings
                ORDER BY vector <=> $1
                LIMIT $2
            """, query_embedding, k)

The third surprise: VACUUM and bloat
We were rebuilding embeddings nightly as the indexed corpus grew. Within a week, query latency degraded by 40%. Postgres' MVCC means deleted tuples stay around until VACUUM cleans them up — and HNSW indexes are sensitive to bloat.
- Run `VACUUM ANALYZE` after any large delete or bulk update
- Schedule `REINDEX` on the HNSW index weekly during low-traffic windows
- Monitor `idx_scan` in `pg_stat_user_indexes` alongside `n_dead_tup` in `pg_stat_user_tables` for the table backing the index (see the sketch below)
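Putting that list into commands, a sketch assuming the embeddings table from earlier and an index named embeddings_vector_hnsw; substitute your own names:

-- After the nightly rebuild: reclaim dead tuples and refresh planner stats
VACUUM ANALYZE embeddings;

-- Weekly, during a low-traffic window: rebuild the HNSW index without
-- blocking reads or writes (REINDEX CONCURRENTLY needs Postgres 12+)
REINDEX INDEX CONCURRENTLY embeddings_vector_hnsw;

-- Bloat check: dead tuples on the table vs. scans on the index
SELECT t.relname,
       t.n_dead_tup,
       i.indexrelname,
       i.idx_scan
FROM   pg_stat_user_tables  t
JOIN   pg_stat_user_indexes i USING (relid)
WHERE  t.relname = 'embeddings';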
What I'd do differently
If I started over, I'd commit to HNSW from day one and skip the IVFFlat phase entirely. The index build cost is real but pays for itself within a week of production use. I'd also bake a small reranker on top — semantic search alone gets you 80% of the way; a cross-encoder rerank closes the rest.
"Vector search is mostly about systems engineering. The model picks the candidates; the database determines whether the system can be operated by humans on call."