Vector databases are easy to “use” and surprisingly hard to tune. This guide focuses on what changes outcomes in real systems: index choice, recall vs latency, filters, and cost drivers.
Quickstart: pick the right index + settings in 10 minutes
If you only read one section, read this. Most production issues come from two knobs: how you index vectors and how hard you search the index.
Step 1 — Choose your index type
Start with the option that matches your scale and update pattern.
- Small/medium (≤ 1–5M vectors): HNSW is often the easiest win
- Very large (10M+ vectors): IVF-family indexes can be more memory efficient
- Need exact results: Flat (brute force) is correct but expensive
- Frequent updates: Prefer indexes that handle inserts well (often HNSW)
Step 2 — Tune recall vs latency (the “search effort”)
Approximate search is always a tradeoff. Decide what you’re optimizing.
- Increase effort → higher recall, higher latency, higher cost
- Decrease effort → lower latency, but more “missed” best matches
- Measure recall using a small exact baseline (Flat) on a sample
- Pick a target (e.g. “≥ 0.95 recall @ top-10”)
Step 3 — Reduce cost with metadata and hybrid search
Cost usually spikes when you search too much data. Your best cost lever is limiting the candidate set before vector math.
Use metadata filters
- Filter by tenant/user
- Filter by language, region, category
- Filter by time range (freshness)
Use a hybrid “pre-filter”
- Keyword/BM25 narrows to candidates
- Vector search reranks for meaning
- Often best quality for technical docs and product search
- Use smaller chunks than you think (often 200–500 tokens).
- Retrieve k=6–12, then rerank or use MMR to reduce redundancy.
- Track “answerable?” rate and citations, not only click metrics.
Overview: what a vector database actually does
A vector database stores embeddings (vectors) and lets you search by similarity. Under the hood, "vector search" usually boils down to nearest-neighbor search:
Similarity search = find nearest neighbors
Given a query embedding q, you want the vectors v that maximize similarity (often cosine similarity) or minimize distance.
# Conceptually: score the query against every stored vector, keep the best k
results = top_k(similarity(q, v) for v in all_vectors)
# In practice you rarely compute similarity against ALL vectors:
# an index narrows the search to a small candidate set, fast.
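To make the brute-force idea concrete, here is a minimal NumPy sketch of exact (Flat) search by cosine similarity. The `flat_search` name and the toy data are illustrative, not any vendor's API:

```python
import numpy as np

def flat_search(vectors, query, k=3):
    """Exact (Flat) nearest-neighbor search by cosine similarity.

    Brute force: score the query against every stored vector,
    then keep the indices of the top-k scores.
    """
    # Normalize so the dot product equals cosine similarity.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q                   # one similarity per stored vector
    return np.argsort(-scores)[:k]  # indices of the k best matches

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
query = vectors[42] + 0.01 * rng.normal(size=64)  # a query near vector 42
top = flat_search(vectors, query, k=3)
```

This is the "perfect recall but expensive" baseline from the table below: cost grows linearly with the number of vectors, which is exactly what indexes exist to avoid.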
The three tradeoffs you can’t avoid
| Thing you want | What it costs | Typical lever |
|---|---|---|
| High recall (don’t miss best matches) | More computation / more candidates | Higher search effort (e.g., efSearch / nprobe) |
| Low latency (fast queries) | Lower recall unless you add resources | Lower effort, better filters, smaller index |
| Lower cost (CPU/RAM/storage) | Less brute force capacity | Compression, IVF, smarter chunking, caching |
Most teams get stuck because they try to maximize all three at once. The winning approach is to pick a target recall, then engineer latency and cost around it.
Vector DB ≠ magic. It’s an index + storage + filtering + operational features (replication, sharding, backups). If you understand the index and the search knobs, you understand 80% of real-world performance.
Core concepts: indexes, recall, filters, and cost
1) Embeddings and distance metrics
Embeddings turn text/images/products into numbers. Similar meaning → vectors are close. Common similarity metrics:
| Metric | What it means | When it’s used |
|---|---|---|
| Cosine similarity | Angle between vectors (direction) | Most text embeddings; scale-invariant |
| Dot product | Similarity with magnitude | Some models assume this directly |
| Euclidean (L2) | Geometric distance | Some indexes and vision embeddings |
Your embedding model and vector DB should agree on normalization and metric. If your model expects cosine similarity, normalize vectors (or use cosine directly) and avoid mixing metrics across pipelines.
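A quick sanity check on why normalization makes metrics agree: for unit vectors, squared L2 distance is `2 - 2 * cosine`, so cosine and L2 produce the same ranking. A small NumPy illustration with toy vectors (no specific embedding model assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)

# Normalize to unit length, as many text-embedding pipelines do.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = a_n @ b_n                # for unit vectors, dot product == cosine
l2_sq = np.sum((a_n - b_n) ** 2)  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cosine,
# so L2 and cosine rank neighbors identically after normalization.
```

The practical takeaway: normalize once at ingestion and at query time, and the choice between cosine, dot, and L2 mostly stops mattering.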
2) Recall: the metric that actually matters
Recall answers: “Did the search return the true nearest neighbors?” Approximate methods speed up search by sacrificing some recall.
Practical recall definition
Compute exact neighbors on a small sample using brute force (Flat). Then compare how many of those appear in your approximate results.
# Pseudocode
true_top10 = flat_search(query, k=10)
approx_top10 = ann_search(query, k=10)
recall_at_10 = len(set(true_top10) & set(approx_top10)) / 10
If a RAG system “feels wrong,” it’s often because recall is low at the retrieval stage. Fix retrieval first before changing prompts or model temperature.
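The recall test above can be simulated end to end. The sketch below uses a random-subset "search" as a stand-in for a real ANN index (`ann_search` here is a toy, not a real index), which is enough to show the effort-vs-recall tradeoff:

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(5000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def flat_search(query, k=10):
    # Exact baseline: score everything, keep top-k.
    return set(np.argsort(-(vectors @ query))[:k])

def ann_search(query, k=10, effort=500):
    # Toy stand-in for an ANN index: only score a random subset of
    # `effort` candidates. More effort -> higher recall, more work.
    cand = rng.choice(len(vectors), size=effort, replace=False)
    scores = vectors[cand] @ query
    return set(cand[np.argsort(-scores)[:k]])

queries = [v / np.linalg.norm(v) for v in rng.normal(size=(20, 32))]

def recall_at_10(effort):
    hits = [len(flat_search(q) & ann_search(q, effort=effort)) / 10
            for q in queries]
    return float(np.mean(hits))

low, high = recall_at_10(500), recall_at_10(4500)
# Searching more candidates raises recall, at higher cost.
```

A real index reaches high recall far more cheaply than random sampling, but the shape of the curve (effort up, recall up, latency up) is the same.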
3) Indexes: Flat vs HNSW vs IVF
Vector search indexes reduce the number of comparisons you do per query. The most common families:
| Index | Best for | Pros | Tradeoffs |
|---|---|---|---|
| Flat (exact) | Small datasets, gold-standard evaluation | Perfect recall | Expensive at scale (latency/cost) |
| HNSW (graph) | General-purpose, fast, great quality | High recall at low latency | Memory-heavy; tuning matters |
| IVF (cluster) | Very large datasets | Memory/compute efficient | More tuning; quality varies with clusters |
HNSW: why it’s popular
HNSW builds a graph where vectors connect to neighbors. Search walks the graph to find close points quickly.
- M (connections): higher → better recall, more memory
- efConstruction: higher → better index quality, slower build
- efSearch: higher → better recall, slower query
IVF: why it scales
IVF clusters the space, then searches only the most relevant clusters.
- nlist (clusters): more → finer partitioning
- nprobe (clusters searched): higher → better recall, slower query
- Often combined with compression (PQ) to reduce RAM
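For intuition, here is a toy IVF in plain NumPy: a few k-means iterations partition the space into `nlist` cells, and search probes only the `nprobe` closest cells. This is a sketch of the idea, not FAISS's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.normal(size=(2000, 16)).astype(np.float32)

# "Train": partition the space into nlist cells with a tiny k-means.
nlist = 20
centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]
for _ in range(5):  # a few Lloyd iterations are plenty for a demo
    assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Final assignment + inverted lists: which vector ids live in each cell.
assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(query, k=5, nprobe=3):
    # Probe only the nprobe cells whose centroids are closest to the query.
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in order])
    dist = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dist)[:k]]

res = ivf_search(vectors[7], k=5, nprobe=3)
```

Note the failure mode this makes visible: if the true neighbor sits in a cell you didn't probe, no amount of within-cell work recovers it. That is why `nprobe` is the recall knob.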
4) Metadata filters and why they change everything
Filters are not just “nice to have.” They’re one of the biggest levers for cost and relevance because they shrink the candidate set.
Examples that usually pay off
- Multi-tenant RAG: filter by tenant/user first
- News/search freshness: filter by date range
- Product search: filter by category/brand/price bucket
- Docs: filter by version (v1/v2), language, product line
If your DB applies filters after approximate retrieval, you can waste work (and lose recall in filtered subsets). Prefer systems that support efficient filtering or partitioning strategies that keep filters “close” to the search.
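The difference between filtering before and after retrieval shows up in a few lines (toy data; `tenant` is a hypothetical metadata field):

```python
import numpy as np

rng = np.random.default_rng(4)
vectors = rng.normal(size=(1000, 8))
tenant = rng.integers(0, 10, size=1000)   # metadata: 10 tenants
query = rng.normal(size=8)
scores = vectors @ query
k = 5

# Post-filter: take the global top-k, THEN drop other tenants.
top = np.argsort(-scores)[:k]
post = [i for i in top if tenant[i] == 0]  # usually fewer than k survive

# Pre-filter: restrict candidates to tenant 0, THEN take the top-k.
idx = np.where(tenant == 0)[0]
pre = idx[np.argsort(-scores[idx])[:k]]
```

Post-filtering wastes the work spent scoring other tenants' vectors and can return far fewer than k results; pre-filtering searches a 10x smaller candidate set and always fills k.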
5) Cost: what you actually pay for
Vector DB cost is usually a mix of RAM (index + vectors), CPU (search), storage (persistence), and network (remote queries + replication).
The cost drivers checklist
- Vector count: more vectors → more RAM/storage
- Vector dimension: higher dims → more memory and compute per comparison
- Index type + params: HNSW memory grows with connectivity; IVF grows with clusters
- Search effort: high efSearch/nprobe increases CPU and latency
- Top-k: larger k often increases work and rerank cost
- Write rate: frequent updates may force rebuild/maintenance
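These drivers can be turned into a back-of-envelope sizing formula. The HNSW link overhead below is a rough rule of thumb (real overhead varies by implementation), not any vendor's published formula:

```python
def estimate_ram_gb(n_vectors, dim, bytes_per_value=4, hnsw_m=16):
    """Back-of-envelope RAM estimate (rule of thumb, not a vendor formula).

    - Raw vectors: n * d * 4 bytes (float32)
    - HNSW links: roughly n * M * 2 link slots of 4 bytes each
      (treat as an order-of-magnitude guide only)
    """
    raw = n_vectors * dim * bytes_per_value
    links = n_vectors * hnsw_m * 2 * 4
    return (raw + links) / 1e9

# 10M vectors at 768 dims, float32, HNSW with M=16:
ram = estimate_ram_gb(10_000_000, 768)
```

Running the numbers like this before picking a deployment tier catches the most common surprise: vector count times dimension dominates, which is why compression and smarter chunking are the big cost levers at scale.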
Step-by-step: how to choose and tune a vector database
This is a production-style workflow you can copy. It keeps you honest: you measure recall, then optimize latency and cost.
Step 1 — Define your workload (it changes the best index)
Write these down before you pick anything:
- N: number of vectors now and in 6 months
- d: vector dimension (e.g. 384, 768, 1024)
- QPS: queries per second (peak, not average)
- Write rate: inserts/updates per hour/day
- Filters: tenant, language, time, category
- Target: recall@k goal + latency SLA
Step 2 — Pick a baseline index (then tune)
| If you have… | Start with… | Why |
|---|---|---|
| ≤ 1M vectors, modest QPS | HNSW | Great quality, easy to tune |
| 10M+ vectors, tight memory | IVF (optionally + compression) | More memory efficient at scale |
| Need ground-truth evaluation | Flat (on a sample) | Measures true recall and regressions |
Step 3 — Measure recall against an exact baseline
Pick a representative sample (e.g. 5k–50k vectors) and run exact search on it. Then compare your approximate index settings.
# Simple recall test plan (no vendor specifics)
1) Sample vectors + queries from production distribution
2) For each query:
- true_top10 = flat_search(sample, query, k=10)
- approx_top10 = ann_search(sample, query, k=10)
3) recall@10 = mean(|intersection| / 10)
4) Record latency p50/p95 at your QPS target
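Step 4 of the plan (latency p50/p95) needs nothing beyond the standard library; the brute-force `search` below is just a placeholder workload:

```python
import time
import statistics
import numpy as np

rng = np.random.default_rng(5)
vectors = rng.normal(size=(20000, 64)).astype(np.float32)

def search(query, k=10):
    scores = vectors @ query
    return np.argpartition(-scores, k)[:k]  # top-k, unordered

latencies_ms = []
for _ in range(50):
    q = rng.normal(size=64).astype(np.float32)
    t0 = time.perf_counter()
    search(q)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

p50 = statistics.median(latencies_ms)
p95 = statistics.quantiles(latencies_ms, n=20)[-1]  # 95th-percentile cut point
```

Measure at your peak QPS target, not on an idle box: p95 under load is the number your SLA lives or dies on.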
Step 4 — Tune the search effort (the main performance knob)
Every index has a “how hard to search” knob. Increase it until recall hits your target, then stop.
HNSW tuning loop
- Hold M and efConstruction steady initially
- Increase efSearch until recall@k stabilizes
- If recall plateaus too low, rebuild with higher M/efConstruction
IVF tuning loop
- Pick nlist (clusters) based on N
- Increase nprobe until recall@k stabilizes
- If recall is unstable, adjust nlist and retrain clusters
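Both tuning loops share the same shape: raise the effort knob until recall hits the target, then stop. A generic sketch (the recall function passed in is a toy stand-in for a measured recall-vs-Flat curve):

```python
def tune_effort(recall_at_k, target=0.95, efforts=(16, 32, 64, 128, 256, 512)):
    """Walk the search-effort knob (efSearch for HNSW, nprobe for IVF)
    upward and stop at the first setting that hits the recall target."""
    for effort in efforts:
        r = recall_at_k(effort)
        if r >= target:
            return effort, r
    # Never hit the target: effort alone can't fix it. Rebuild the index
    # with better build params (M/efConstruction, or nlist) instead.
    return efforts[-1], r

# Toy stand-in: recall rises with effort and saturates near 0.99.
effort, r = tune_effort(lambda e: min(0.99, e / 300))
```

Doubling the effort at each step keeps the sweep cheap; once you bracket the target, a finer scan between the last two settings pins down the knob.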
Stop tuning when recall gains per latency cost get small. Past a point, it’s better to improve chunking, filtering, reranking, or embedding quality than to brute-force the index.
Step 5 — Improve quality without exploding cost
If recall is fine but results “feel” off, it’s usually not the index. It’s your data and retrieval strategy.
RAG quality upgrades
- Chunk by meaning (headings/sections), not fixed length only
- Add metadata (doc title, section, product, version)
- Use MMR or reranking to reduce duplicates
- Store citations (URL, heading) for traceable answers
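MMR, mentioned above, is simple enough to sketch directly: greedily pick the doc that balances relevance to the query against similarity to docs already picked (`lam` weights the tradeoff; the toy docs below make items 0 and 1 near-duplicates):

```python
import numpy as np

def mmr(query, docs, k=3, lam=0.7):
    """Maximal Marginal Relevance: trade off query relevance against
    redundancy with already-selected docs. Assumes `query` and the
    rows of `docs` are unit-normalized."""
    selected = []
    candidates = list(range(len(docs)))
    rel = docs @ query  # relevance of each doc to the query
    while candidates and len(selected) < k:
        if selected:
            red = docs[candidates] @ docs[selected].T  # redundancy
            score = lam * rel[candidates] - (1 - lam) * red.max(axis=1)
        else:
            score = rel[candidates]
        best = candidates[int(np.argmax(score))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy docs: 0 and 1 are near-duplicates; 2 is distinct but still relevant.
docs = np.array([[0.95, 0.312], [0.94, 0.34], [0.8, -0.6]])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = np.array([1.0, 0.0])
picked = mmr(query, docs, k=2)
```

Plain top-2 retrieval would return the two near-duplicates (0 and 1); MMR swaps the duplicate for the distinct doc 2, which is exactly the diversity effect you want in a RAG context window.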
Search relevance upgrades
- Hybrid: keyword prefilter + vector rerank
- Use query expansion for short queries
- Boost fresh/authoritative content with metadata
- Handle synonyms and model drift with eval sets
Step 6 — Operational checklist (what breaks in production)
- Backups: you want point-in-time recovery for index + metadata
- Versioning: record embedding model version and chunker version
- Monitoring: track latency p95, recall proxy metrics, and “no result” rate
- Cold start: ensure index rebuild time fits your recovery plan
- Multi-tenancy: enforce filters at query-time and schema-level
Once this is in place, you can make changes confidently—without “it felt worse” arguments.
Common mistakes (and the fixes)
These are the traps that waste the most time when building semantic search and RAG systems.
Mistake 1 — “k=5 is enough” (without reranking)
Small k can miss crucial context. But large k can add noise.
- Fix: retrieve k=6–12, then rerank or MMR
- Fix: measure answer quality vs k on a small eval set
Mistake 2 — Tuning index knobs before chunking
If chunks are too big, embeddings are “averages.” If too small, you lose context.
- Fix: chunk by headings/sections + a token budget
- Fix: store titles and section names as metadata
Mistake 3 — Ignoring filters (or filtering too late)
Unfiltered search is expensive and often less relevant.
- Fix: filter early (tenant/time/category)
- Fix: design metadata schema before indexing millions of vectors
Mistake 4 — No recall measurement
Without recall tests, you can’t tell whether quality drops are index-related or data-related.
- Fix: keep a small Flat baseline for evaluation
- Fix: record recall@k and latency p95 for changes
Changing the embedding model can change distance distributions. Your old index knobs may no longer be optimal. Plan for reindexing, versioning, and regression tests.
FAQ: vector databases, recall, and cost
What is a vector database (in one sentence)?
A vector database stores embeddings and retrieves nearest neighbors efficiently using specialized indexes, plus filtering and operational features for production use.
HNSW vs IVF: which one should I choose?
HNSW is often the best default for small-to-medium scale and high-quality retrieval. IVF tends to shine at very large scale and tighter memory budgets—but usually needs more tuning.
What’s the difference between recall and precision in vector search?
Recall here usually means “did we retrieve the true nearest neighbors?” (ANN quality). Precision is “are the retrieved items actually relevant to the user’s intent?” You can have high ANN recall but low relevance if your chunks/metadata/embedding model are wrong.
Why is my vector search slow?
The common causes are: searching too many candidates (high effort), too many vectors, high dimensions, missing filters, high top-k, or running reranking on too many documents. Fix order: add filters → reduce k/rerank set → tune effort → consider different index.
How do I reduce vector database cost?
Reduce the amount of data each query touches (filters, partitions, hybrid prefilter), reduce stored vectors (better chunking), and tune index params to your target recall—not “as high as possible.” Compression can help at high scale.
What are good starting settings for RAG retrieval?
Start with meaningful chunking, retrieve k=6–12, apply metadata filters, and consider reranking or MMR. Then build a small evaluation set (real questions) and tune recall/latency around that.
Cheatsheet: the fast “do this” checklist
Index + tuning (most common knobs)
- Flat: exact baseline (for eval)
- HNSW: M, efConstruction (build), efSearch (query)
- IVF: nlist (clusters), nprobe (searched clusters)
- Rule: tune query effort first, rebuild only if needed
Quality + cost (big wins)
- Chunking: 200–500 tokens is a strong start
- Metadata: tenant, language, category, version, time
- Hybrid: keyword/BM25 → candidates → vector rerank
- RAG: k=6–12 + MMR/rerank for diversity
If you remember only one thing
Vector DB performance is mostly: candidate set size × search effort × vector dimension. Reduce the candidate set (filters/hybrid) before you brute-force the index.
Wrap-up: how to pick a vector store without guessing
Pick a target (recall@k + latency), measure recall with a Flat baseline on a sample, then tune your index “search effort” until you hit the target. After that, optimize cost by shrinking the candidate set with filters and hybrid retrieval—and improve relevance with chunking, metadata, and reranking.