Vector databases are easy to “use” and surprisingly hard to tune. This guide focuses on what changes outcomes in real systems: index choice, recall vs latency, filters, and cost drivers.
Quickstart: pick the right index + settings in 10 minutes
If you only read one section, read this. Most production issues come from two knobs: how you index vectors and how hard you search the index.
Step 1 — Choose your index type
Start with the option that matches your scale and update pattern.
- Small/medium (≤ 1–5M vectors): HNSW is often the easiest win
- Very large (10M+ vectors): IVF-family indexes can be more memory efficient
- Need exact results: Flat (brute force) is correct but expensive
- Frequent updates: Prefer indexes that handle inserts well (often HNSW)
Step 2 — Tune recall vs latency (the “search effort”)
Approximate search is always a tradeoff. Decide what you’re optimizing.
- Increase effort → higher recall, higher latency, higher cost
- Decrease effort → lower latency, but more “missed” best matches
- Measure recall using a small exact baseline (Flat) on a sample
- Pick a target (e.g. “≥ 0.95 recall @ top-10”)
Step 3 — Reduce cost with metadata and hybrid search
Cost usually spikes when you search too much data. Your best cost lever is limiting the candidate set before vector math.
Use metadata filters
- Filter by tenant/user
- Filter by language, region, category
- Filter by time range (freshness)
Use a hybrid “pre-filter”
- Keyword/BM25 narrows to candidates
- Vector search reranks for meaning
- Often best quality for technical docs and product search
- Use smaller chunks than you think (often 200–500 tokens).
- Retrieve k=6–12, then rerank or use MMR to reduce redundancy.
- Track “answerable?” rate and citations, not only click metrics.
Overview: what a vector database actually does
A vector database stores embeddings (vectors) and lets you search by similarity. Under the hood, "vector search" usually boils down to nearest-neighbor search:
Similarity search = find nearest neighbors
Given a query embedding q, you want the vectors v that maximize similarity (often cosine similarity) or minimize distance.
# Conceptually: score the query against every stored vector, keep the best k
results = top_k(similarity(q, v) for v in all_vectors)
# In practice you rarely compute similarity against ALL vectors:
# an index narrows the search to a small candidate set, fast.
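To make the brute-force idea concrete, here is a minimal NumPy sketch of exact (Flat) search by cosine similarity. The `flat_search` name and the toy data are illustrative, not any vendor's API:

```python
import numpy as np

def flat_search(vectors, query, k=3):
    """Exact (Flat) nearest-neighbor search by cosine similarity.

    Brute force: score the query against every stored vector,
    then keep the indices of the top-k scores.
    """
    # Normalize so the dot product equals cosine similarity.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q                   # one similarity per stored vector
    return np.argsort(-scores)[:k]  # indices of the k best matches

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
query = vectors[42] + 0.01 * rng.normal(size=64)  # a query near vector 42
top = flat_search(vectors, query, k=3)
```

This is the "perfect recall but expensive" baseline from the table below: cost grows linearly with the number of vectors, which is exactly what indexes exist to avoid.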
The three tradeoffs you can’t avoid
| Thing you want | What it costs | Typical lever |
|---|---|---|
| High recall (don’t miss best matches) | More computation / more candidates | Higher search effort (e.g., efSearch / nprobe) |
| Low latency (fast queries) | Lower recall unless you add resources | Lower effort, better filters, smaller index |
| Lower cost (CPU/RAM/storage) | Less brute force capacity | Compression, IVF, smarter chunking, caching |
Most teams get stuck because they try to maximize all three at once. The winning approach is to pick a target recall, then engineer latency and cost around it.
Vector DB ≠ magic. It’s an index + storage + filtering + operational features (replication, sharding, backups). If you understand the index and the search knobs, you understand 80% of real-world performance.
Core concepts: indexes, recall, filters, and cost
1) Embeddings and distance metrics
Embeddings turn text/images/products into numbers. Similar meaning → vectors are close. Common similarity metrics:
| Metric | What it means | When it’s used |
|---|---|---|
| Cosine similarity | Angle between vectors (direction) | Most text embeddings; scale-invariant |
| Dot product | Similarity with magnitude | Some models assume this directly |
| Euclidean (L2) | Geometric distance | Some indexes and vision embeddings |
Your embedding model and vector DB should agree on normalization and metric. If your model expects cosine similarity, normalize vectors (or use cosine directly) and avoid mixing metrics across pipelines.
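A quick sanity check on why normalization makes metrics agree: for unit vectors, squared L2 distance is `2 - 2 * cosine`, so cosine and L2 produce the same ranking. A small NumPy illustration with toy vectors (no specific embedding model assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=8), rng.normal(size=8)

# Normalize to unit length, as many text-embedding pipelines do.
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = a_n @ b_n                # for unit vectors, dot product == cosine
l2_sq = np.sum((a_n - b_n) ** 2)  # squared Euclidean distance

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cosine,
# so L2 and cosine rank neighbors identically after normalization.
```

The practical takeaway: normalize once at ingestion and at query time, and the choice between cosine, dot, and L2 mostly stops mattering.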
2) Recall: the metric that actually matters
Recall answers: “Did the search return the true nearest neighbors?” Approximate methods speed up search by sacrificing some recall.
Practical recall definition
Compute exact neighbors on a small sample using brute force (Flat). Then compare how many of those appear in your approximate results.
# Pseudocode
true_top10 = flat_search(query, k=10)
approx_top10 = ann_search(query, k=10)
recall_at_10 = len(set(true_top10) & set(approx_top10)) / 10
If a RAG system “feels wrong,” it’s often because recall is low at the retrieval stage. Fix retrieval first before changing prompts or model temperature.
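The recall test above can be simulated end to end. The sketch below uses a random-subset "search" as a stand-in for a real ANN index (`ann_search` here is a toy, not a real index), which is enough to show the effort-vs-recall tradeoff:

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(5000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def flat_search(query, k=10):
    # Exact baseline: score everything, keep top-k.
    return set(np.argsort(-(vectors @ query))[:k])

def ann_search(query, k=10, effort=500):
    # Toy stand-in for an ANN index: only score a random subset of
    # `effort` candidates. More effort -> higher recall, more work.
    cand = rng.choice(len(vectors), size=effort, replace=False)
    scores = vectors[cand] @ query
    return set(cand[np.argsort(-scores)[:k]])

queries = [v / np.linalg.norm(v) for v in rng.normal(size=(20, 32))]

def recall_at_10(effort):
    hits = [len(flat_search(q) & ann_search(q, effort=effort)) / 10
            for q in queries]
    return float(np.mean(hits))

low, high = recall_at_10(500), recall_at_10(4500)
# Searching more candidates raises recall, at higher cost.
```

A real index reaches high recall far more cheaply than random sampling, but the shape of the curve (effort up, recall up, latency up) is the same.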
3) Indexes: Flat vs HNSW vs IVF
Vector search indexes reduce the number of comparisons you do per query. The most common families:
| Index | Best for | Pros | Tradeoffs |
|---|---|---|---|
| Flat (exact) | Small datasets, gold-standard evaluation | Perfect recall | Expensive at scale (latency/cost) |
| HNSW (graph) | General-purpose, fast, great quality | High recall at low latency | Memory-heavy; tuning matters |
| IVF (cluster) | Very large datasets | Memory/compute efficient | More tuning; quality varies with clusters |
HNSW: why it’s popular
HNSW builds a graph where vectors connect to neighbors. Search walks the graph to find close points quickly.
- M (connections): higher → better recall, more memory
- efConstruction: higher → better index quality, slower build
- efSearch: higher → better recall, slower query
IVF: why it scales
IVF clusters the space, then searches only the most relevant clusters.
- nlist (clusters): more → finer partitioning
- nprobe (clusters searched): higher → better recall, slower query
- Often combined with compression (PQ) to reduce RAM
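For intuition, here is a toy IVF in plain NumPy: a few k-means iterations partition the space into `nlist` cells, and search probes only the `nprobe` closest cells. This is a sketch of the idea, not FAISS's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.normal(size=(2000, 16)).astype(np.float32)

# "Train": partition the space into nlist cells with a tiny k-means.
nlist = 20
centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]
for _ in range(5):  # a few Lloyd iterations are plenty for a demo
    assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        members = vectors[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Final assignment + inverted lists: which vector ids live in each cell.
assign = np.argmin(((vectors[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(query, k=5, nprobe=3):
    # Probe only the nprobe cells whose centroids are closest to the query.
    order = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in order])
    dist = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dist)[:k]]

res = ivf_search(vectors[7], k=5, nprobe=3)
```

Note the failure mode this makes visible: if the true neighbor sits in a cell you didn't probe, no amount of within-cell work recovers it. That is why `nprobe` is the recall knob.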
4) Metadata filters and why they change everything
Filters are not just “nice to have.” They’re one of the biggest levers for cost and relevance because they shrink the candidate set.
Examples that usually pay off
- Multi-tenant RAG: filter by tenant/user first
- News/search freshness: filter by date range
- Product search: filter by category/brand/price bucket
- Docs: filter by version (v1/v2), language, product line
If your DB applies filters after approximate retrieval, you can waste work (and lose recall in filtered subsets). Prefer systems that support efficient filtering or partitioning strategies that keep filters “close” to the search.
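The difference between filtering before and after retrieval shows up in a few lines (toy data; `tenant` is a hypothetical metadata field):

```python
import numpy as np

rng = np.random.default_rng(4)
vectors = rng.normal(size=(1000, 8))
tenant = rng.integers(0, 10, size=1000)   # metadata: 10 tenants
query = rng.normal(size=8)
scores = vectors @ query
k = 5

# Post-filter: take the global top-k, THEN drop other tenants.
top = np.argsort(-scores)[:k]
post = [i for i in top if tenant[i] == 0]  # usually fewer than k survive

# Pre-filter: restrict candidates to tenant 0, THEN take the top-k.
idx = np.where(tenant == 0)[0]
pre = idx[np.argsort(-scores[idx])[:k]]
```

Post-filtering wastes the work spent scoring other tenants' vectors and can return far fewer than k results; pre-filtering searches a 10x smaller candidate set and always fills k.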
5) Cost: what you actually pay for
Vector DB cost is usually a mix of RAM (index + vectors), CPU (search), storage (persistence), and network (remote queries + replication).
The cost drivers checklist
- Vector count: more vectors → more RAM/storage
- Vector dimension: higher dims → more memory and compute per comparison
- Index type + params: HNSW memory grows with connectivity; IVF grows with clusters
- Search effort: high efSearch/nprobe increases CPU and latency
- Top-k: larger k often increases work and rerank cost
- Write rate: frequent updates may force rebuild/maintenance
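These drivers can be turned into a back-of-envelope sizing formula. The HNSW link overhead below is a rough rule of thumb (real overhead varies by implementation), not any vendor's published formula:

```python
def estimate_ram_gb(n_vectors, dim, bytes_per_value=4, hnsw_m=16):
    """Back-of-envelope RAM estimate (rule of thumb, not a vendor formula).

    - Raw vectors: n * d * 4 bytes (float32)
    - HNSW links: roughly n * M * 2 link slots of 4 bytes each
      (treat as an order-of-magnitude guide only)
    """
    raw = n_vectors * dim * bytes_per_value
    links = n_vectors * hnsw_m * 2 * 4
    return (raw + links) / 1e9

# 10M vectors at 768 dims, float32, HNSW with M=16:
ram = estimate_ram_gb(10_000_000, 768)
```

Running the numbers like this before picking a deployment tier catches the most common surprise: vector count times dimension dominates, which is why compression and smarter chunking are the big cost levers at scale.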
Step-by-step: how to choose and tune a vector database
This is a production-style workflow you can copy. It keeps you honest: you measure recall, then optimize latency and cost.
Step 1 — Define your workload (it changes the best index)
Write these down before you pick anything:
- N: number of vectors now and in 6 months
- d: vector dimension (e.g. 384, 768, 1024)
- QPS: queries per second (peak, not average)
- Write rate: inserts/updates per hour/day
- Filters: tenant, language, time, category
- Target: recall@k goal + latency SLA
Step 2 — Pick a baseline index (then tune)
| If you have… | Start with… | Why |
|---|---|---|
| ≤ 1M vectors, modest QPS | HNSW | Great quality, easy to tune |
| 10M+ vectors, tight memory | IVF (optionally + compression) | More memory efficient at scale |
| Need ground-truth evaluation | Flat (on a sample) | Measures true recall and regressions |
Step 3 — Measure recall against an exact baseline
Pick a representative sample (e.g. 5k–50k vectors) and run exact search on it. Then compare your approximate index settings.
# Simple recall test plan (no vendor specifics)
1) Sample vectors + queries from production distribution
2) For each query:
- true_top10 = flat_search(sample, query, k=10)
- approx_top10 = ann_search(sample, query, k=10)
3) recall@10 = mean(|intersection| / 10)
4) Record latency p50/p95 at your QPS target
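Step 4 of the plan (latency p50/p95) needs nothing beyond the standard library; the brute-force `search` below is just a placeholder workload:

```python
import time
import statistics
import numpy as np

rng = np.random.default_rng(5)
vectors = rng.normal(size=(20000, 64)).astype(np.float32)

def search(query, k=10):
    scores = vectors @ query
    return np.argpartition(-scores, k)[:k]  # top-k, unordered

latencies_ms = []
for _ in range(50):
    q = rng.normal(size=64).astype(np.float32)
    t0 = time.perf_counter()
    search(q)
    latencies_ms.append((time.perf_counter() - t0) * 1000)

p50 = statistics.median(latencies_ms)
p95 = statistics.quantiles(latencies_ms, n=20)[-1]  # 95th-percentile cut point
```

Measure at your peak QPS target, not on an idle box: p95 under load is the number your SLA lives or dies on.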
Step 4 — Tune the search effort (the main performance knob)
Every index has a “how hard to search” knob. Increase it until recall hits your target, then stop.
HNSW tuning loop
- Hold M and efConstruction steady initially
- Increase efSearch until recall@k stabilizes
- If recall plateaus too low, rebuild with higher M/efConstruction
IVF tuning loop
- Pick nlist (clusters) based on N
- Increase nprobe until recall@k stabilizes
- If recall is unstable, adjust nlist and retrain clusters
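Both tuning loops share the same shape: raise the effort knob until recall hits the target, then stop. A generic sketch (the recall function passed in is a toy stand-in for a measured recall-vs-Flat curve):

```python
def tune_effort(recall_at_k, target=0.95, efforts=(16, 32, 64, 128, 256, 512)):
    """Walk the search-effort knob (efSearch for HNSW, nprobe for IVF)
    upward and stop at the first setting that hits the recall target."""
    for effort in efforts:
        r = recall_at_k(effort)
        if r >= target:
            return effort, r
    # Never hit the target: effort alone can't fix it. Rebuild the index
    # with better build params (M/efConstruction, or nlist) instead.
    return efforts[-1], r

# Toy stand-in: recall rises with effort and saturates near 0.99.
effort, r = tune_effort(lambda e: min(0.99, e / 300))
```

Doubling the effort at each step keeps the sweep cheap; once you bracket the target, a finer scan between the last two settings pins down the knob.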
Stop tuning when recall gains per latency cost get small. Past a point, it’s better to improve chunking, filtering, reranking, or embedding quality than to brute-force the index.
Step 5 — Improve quality without exploding cost
If recall is fine but results “feel” off, it’s usually not the index. It’s your data and retrieval strategy.
RAG quality upgrades
- Chunk by meaning (headings/sections), not fixed length only
- Add metadata (doc title, section, product, version)
- Use MMR or reranking to reduce duplicates
- Store citations (URL, heading) for traceable answers
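MMR, mentioned above, is simple enough to sketch directly: greedily pick the doc that balances relevance to the query against similarity to docs already picked (`lam` weights the tradeoff; the toy docs below make items 0 and 1 near-duplicates):

```python
import numpy as np

def mmr(query, docs, k=3, lam=0.7):
    """Maximal Marginal Relevance: trade off query relevance against
    redundancy with already-selected docs. Assumes `query` and the
    rows of `docs` are unit-normalized."""
    selected = []
    candidates = list(range(len(docs)))
    rel = docs @ query  # relevance of each doc to the query
    while candidates and len(selected) < k:
        if selected:
            red = docs[candidates] @ docs[selected].T  # redundancy
            score = lam * rel[candidates] - (1 - lam) * red.max(axis=1)
        else:
            score = rel[candidates]
        best = candidates[int(np.argmax(score))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy docs: 0 and 1 are near-duplicates; 2 is distinct but still relevant.
docs = np.array([[0.95, 0.312], [0.94, 0.34], [0.8, -0.6]])
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = np.array([1.0, 0.0])
picked = mmr(query, docs, k=2)
```

Plain top-2 retrieval would return the two near-duplicates (0 and 1); MMR swaps the duplicate for the distinct doc 2, which is exactly the diversity effect you want in a RAG context window.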
Search relevance upgrades
- Hybrid: keyword prefilter + vector rerank
- Use query expansion for short queries
- Boost fresh/authoritative content with metadata
- Handle synonyms and model drift with eval sets
Step 6 — Operational checklist (what breaks in production)
- Backups: you want point-in-time recovery for index + metadata
- Versioning: record embedding model version and chunker version
- Monitoring: track latency p95, recall proxy metrics, and “no result” rate
- Cold start: ensure index rebuild time fits your recovery plan
- Multi-tenancy: enforce filters at query-time and schema-level
Once this is in place, you can make changes confidently—without “it felt worse” arguments.
Common mistakes (and the fixes)
These are the traps that waste the most time when building semantic search and RAG systems.
Mistake 1 — “k=5 is enough” (without reranking)
Small k can miss crucial context. But large k can add noise.
- Fix: retrieve k=6–12, then rerank or MMR
- Fix: measure answer quality vs k on a small eval set
Mistake 2 — Tuning index knobs before chunking
If chunks are too big, embeddings are “averages.” If too small, you lose context.
- Fix: chunk by headings/sections + a token budget
- Fix: store titles and section names as metadata
Mistake 3 — Ignoring filters (or filtering too late)
Unfiltered search is expensive and often less relevant.
- Fix: filter early (tenant/time/category)
- Fix: design metadata schema before indexing millions of vectors
Mistake 4 — No recall measurement
Without recall tests, you can’t tell whether quality drops are index-related or data-related.
- Fix: keep a small Flat baseline for evaluation
- Fix: record recall@k and latency p95 for changes
Changing the embedding model can change distance distributions. Your old index knobs may no longer be optimal. Plan for reindexing, versioning, and regression tests.
FAQ: vector databases, recall, and cost
What is a vector database (in one sentence)?
A vector database stores embeddings and retrieves nearest neighbors efficiently using specialized indexes, plus filtering and operational features for production use.
HNSW vs IVF: which one should I choose?
HNSW is often the best default for small-to-medium scale and high-quality retrieval. IVF tends to shine at very large scale and tighter memory budgets—but usually needs more tuning.
What’s the difference between recall and precision in vector search?
Recall here usually means “did we retrieve the true nearest neighbors?” (ANN quality). Precision is “are the retrieved items actually relevant to the user’s intent?” You can have high ANN recall but low relevance if your chunks/metadata/embedding model are wrong.
Why is my vector search slow?
The common causes are: searching too many candidates (high effort), too many vectors, high dimensions, missing filters, high top-k, or running reranking on too many documents. Fix order: add filters → reduce k/rerank set → tune effort → consider different index.
How do I reduce vector database cost?
Reduce the amount of data each query touches (filters, partitions, hybrid prefilter), reduce stored vectors (better chunking), and tune index params to your target recall—not “as high as possible.” Compression can help at high scale.
What are good starting settings for RAG retrieval?
Start with meaningful chunking, retrieve k=6–12, apply metadata filters, and consider reranking or MMR. Then build a small evaluation set (real questions) and tune recall/latency around that.
Cheatsheet: the fast “do this” checklist
Index + tuning (most common knobs)
- Flat: exact baseline (for eval)
- HNSW: M, efConstruction (build), efSearch (query)
- IVF: nlist (clusters), nprobe (searched clusters)
- Rule: tune query effort first, rebuild only if needed
Quality + cost (big wins)
- Chunking: 200–500 tokens is a strong start
- Metadata: tenant, language, category, version, time
- Hybrid: keyword/BM25 → candidates → vector rerank
- RAG: k=6–12 + MMR/rerank for diversity
If you remember only one thing
Vector DB performance is mostly: candidate set size × search effort × vector dimension. Reduce the candidate set (filters/hybrid) before you brute-force the index.
Wrap-up: how to pick a vector store without guessing
Pick a target (recall@k + latency), measure recall with a Flat baseline on a sample, then tune your index “search effort” until you hit the target. After that, optimize cost by shrinking the candidate set with filters and hybrid retrieval—and improve relevance with chunking, metadata, and reranking.