
Vector Databases Explained: Indexes, Recall, and Cost

What actually matters when picking a vector store.

Reading time: ~8–12 min
Level: All levels

Vector databases are easy to “use” and surprisingly hard to tune. This guide focuses on what changes outcomes in real systems: index choice, recall vs latency, filters, and cost drivers.


Quickstart: pick the right index + settings in 10 minutes

If you only read one section, read this. Most production issues come from two knobs: how you index vectors and how hard you search the index.

Step 1 — Choose your index type

Start with the option that matches your scale and update pattern.

  • Small/medium (up to ~1–5M vectors): HNSW is often the easiest win
  • Very large (10M+ vectors): IVF-family indexes can be more memory efficient
  • Need exact results: Flat (brute force) is correct but expensive
  • Frequent updates: Prefer indexes that handle inserts well (often HNSW)

Step 2 — Tune recall vs latency (the “search effort”)

Approximate search is always a tradeoff. Decide what you’re optimizing.

  • Increase effort → higher recall, higher latency, higher cost
  • Decrease effort → lower latency, but more “missed” best matches
  • Measure recall using a small exact baseline (Flat) on a sample
  • Pick a target (e.g. “≥ 0.95 recall @ top-10”)
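Measuring recall this way takes only a few lines. A minimal sketch in Python, assuming you already have ranked ID lists from an exact (Flat) search and from your ANN index (the IDs here are placeholders):

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

# Example: the ANN index returned 9 of the 10 true nearest neighbors.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]
print(recall_at_k(exact, approx))  # → 0.9
```

Average this over a sample of real queries and you have the recall@10 number to tune against.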

Step 3 — Reduce cost with metadata and hybrid search

Cost usually spikes when you search too much data. Your best cost lever is limiting the candidate set before vector math.

Use metadata filters

  • Filter by tenant/user
  • Filter by language, region, category
  • Filter by time range (freshness)

Use a hybrid “pre-filter”

  • Keyword/BM25 narrows to candidates
  • Vector search reranks for meaning
  • Often best quality for technical docs and product search
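The hybrid pattern above can be sketched end to end. This is a toy illustration with hand-made 2-d "embeddings" and a hypothetical `docs` list, not a real retrieval stack:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_search(query_terms, query_vec, docs, k=3):
    """Keyword prefilter narrows the candidate set; vector similarity reranks it."""
    candidates = [d for d in docs if any(t in d["text"].lower() for t in query_terms)]
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return candidates[:k]

# Toy corpus with hand-made 2-d "embeddings" (illustration only).
docs = [
    {"text": "Reset your password in settings", "vec": [0.9, 0.1]},
    {"text": "Password rotation policy for admins", "vec": [0.7, 0.3]},
    {"text": "Holiday schedule", "vec": [0.1, 0.9]},
]
top = hybrid_search(["password"], [1.0, 0.0], docs, k=2)
```

Note the cost win: the vector math only runs on the two keyword matches, not the whole corpus.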

If you’re building RAG

  • Use smaller chunks than you think (often 200–500 tokens).
  • Retrieve k=6–12, then rerank or use MMR to reduce redundancy.
  • Track “answerable?” rate and citations, not only click metrics.

Overview: what a vector database actually does

A vector database stores embeddings (vectors) and lets you search by similarity. Under the hood, “vector search” is usually:

Similarity search = find nearest neighbors

Given a query embedding q, you want the vectors v that maximize similarity (often cosine similarity) or minimize distance.

# Conceptually:
results = arg_top_k(similarity(q, v_i) for each vector v_i)

# In practice:
# You rarely compute similarity against ALL vectors.
# You use an index to search a small candidate set fast.
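A Flat (brute-force) search is exactly this "conceptual" version made concrete. A minimal pure-Python sketch using cosine similarity:

```python
import math
import heapq

def cosine(q, v):
    dot = sum(a * b for a, b in zip(q, v))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in v)))

def flat_search(query, vectors, k=2):
    """Brute force: score every vector, keep the top-k.
    Perfect recall, but O(N·d) work per query — this is why indexes exist."""
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]

vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(flat_search([1.0, 0.1], vectors, k=2))  # → [0, 1]
```

Everything else in this guide is about avoiding that full O(N·d) scan while losing as little recall as possible.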

The three tradeoffs you can’t avoid

Thing you want | What it costs | Typical lever
High recall (don’t miss best matches) | More computation / more candidates | Higher search effort (e.g., efSearch / nprobe)
Low latency (fast queries) | Lower recall unless you add resources | Lower effort, better filters, smaller index
Lower cost (CPU/RAM/storage) | Less brute-force capacity | Compression, IVF, smarter chunking, caching

Most teams get stuck because they try to maximize all three at once. The winning approach is to pick a target recall, then engineer latency and cost around it.

A mental model that prevents confusion

Vector DB ≠ magic. It’s an index + storage + filtering + operational features (replication, sharding, backups). If you understand the index and the search knobs, you understand 80% of the real-world performance.

Core concepts: indexes, recall, filters, and cost

1) Embeddings and distance metrics

Embeddings turn text/images/products into numbers. Similar meaning → vectors are close. Common similarity metrics:

Metric | What it means | When it’s used
Cosine similarity | Angle between vectors (direction) | Most text embeddings; scale-invariant
Dot product | Similarity with magnitude | Some models assume this directly
Euclidean (L2) | Geometric distance | Some indexes and vision embeddings

Keep it consistent

Your embedding model and vector DB should agree on normalization and metric. If your model expects cosine similarity, normalize vectors (or use cosine directly) and avoid mixing metrics across pipelines.
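A quick way to stay consistent is to L2-normalize vectors at both write and query time; for unit vectors, dot product and cosine similarity are the same number, so the index metric and the model's intended metric can't silently disagree. A small sketch:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# On unit vectors, dot product == cosine similarity.
cos = dot(a, b)  # ≈ 0.96
```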

2) Recall: the metric that actually matters

Recall answers: “Did the search return the true nearest neighbors?” Approximate methods speed up search by sacrificing some recall.

Practical recall definition

Compute exact neighbors on a small sample using brute force (Flat). Then compare how many of those appear in your approximate results.

# Pseudocode
true_top10 = flat_search(query, k=10)
approx_top10 = ann_search(query, k=10)

recall@10 = |true_top10 ∩ approx_top10| / 10

If a RAG system “feels wrong,” it’s often because recall is low at the retrieval stage. Fix retrieval first before changing prompts or model temperature.

3) Indexes: Flat vs HNSW vs IVF

Vector search indexes reduce the number of comparisons you do per query. The most common families:

Index | Best for | Pros | Tradeoffs
Flat (exact) | Small datasets, gold-standard evaluation | Perfect recall | Expensive at scale (latency/cost)
HNSW (graph) | General-purpose, fast, great quality | High recall at low latency | Memory-heavy; tuning matters
IVF (cluster) | Very large datasets | Memory/compute efficient | More tuning; quality varies with clusters

HNSW: why it’s popular

HNSW builds a graph where vectors connect to neighbors. Search walks the graph to find close points quickly.

  • M (connections): higher → better recall, more memory
  • efConstruction: higher → better index quality, slower build
  • efSearch: higher → better recall, slower query
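The graph walk can be illustrated with a deliberately stripped-down, single-layer sketch. Real HNSW adds hierarchical layers and an efSearch-sized candidate beam instead of a single greedy path, so treat this as intuition only:

```python
import math

def dist(a, b):
    return math.dist(a, b)  # Euclidean here; real HNSW supports several metrics

def greedy_search(graph, points, query, entry=0):
    """Walk the neighbor graph, always moving to a closer point; stop at a
    local minimum. efSearch widens this walk into a beam to raise recall."""
    current = entry
    improved = True
    while improved:
        improved = False
        for nbr in graph[current]:
            if dist(points[nbr], query) < dist(points[current], query):
                current = nbr
                improved = True
    return current

# Tiny hand-built graph over 1-d points (illustration only).
points = [[0.0], [1.0], [2.0], [3.0], [4.0]]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nearest = greedy_search(graph, points, [3.2])  # walks 0 → 1 → 2 → 3
```

This also shows why M matters: more connections per node mean more escape routes from local minima, at the cost of memory.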

IVF: why it scales

IVF clusters the space, then searches only the most relevant clusters.

  • nlist (clusters): more → finer partitioning
  • nprobe (clusters searched): higher → better recall, slower query
  • Often combined with compression (PQ) to reduce RAM
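A toy version of the IVF idea, with hand-picked centroids standing in for the k-means clusters a real index would train (illustration only):

```python
import math

def dist(a, b):
    return math.dist(a, b)

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid's inverted list
    (real IVF learns the centroids with k-means during training)."""
    lists = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda i: dist(v, centroids[i]))
        lists[c].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Score only vectors in the nprobe closest clusters;
    higher nprobe → more candidates → better recall, slower query."""
    probe = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))[:nprobe]
    candidates = [idx for c in probe for idx in lists[c]]
    return sorted(candidates, key=lambda idx: dist(query, vectors[idx]))[:k]

vectors = [[0.1], [0.2], [0.9], [1.0]]
centroids = [[0.15], [0.95]]           # nlist = 2
lists = build_ivf(vectors, centroids)  # {0: [0, 1], 1: [2, 3]}
hit = ivf_search([0.85], vectors, centroids, lists, nprobe=1, k=1)  # → [2]
```

With nprobe=1 the query touched only half the vectors; that skipped work is exactly where IVF's speed, and its recall risk, come from.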

4) Metadata filters and why they change everything

Filters are not just “nice to have.” They’re one of the biggest levers for cost and relevance because they shrink the candidate set.

Examples that usually pay off

  • Multi-tenant RAG: filter by tenant/user first
  • News/search freshness: filter by date range
  • Product search: filter by category/brand/price bucket
  • Docs: filter by version (v1/v2), language, product line

Filtering pitfall

If your DB applies filters after approximate retrieval, you can waste work (and lose recall in filtered subsets). Prefer systems that support efficient filtering or partitioning strategies that keep filters “close” to the search.

5) Cost: what you actually pay for

Vector DB cost is usually a mix of RAM (index + vectors), CPU (search), storage (persistence), and network (remote queries + replication).

The cost drivers checklist

  • Vector count: more vectors → more RAM/storage
  • Vector dimension: higher dims → more memory and compute per comparison
  • Index type + params: HNSW memory grows with connectivity; IVF grows with clusters
  • Search effort: high efSearch/nprobe increases CPU and latency
  • Top-k: larger k often increases work and rerank cost
  • Write rate: frequent updates may force rebuild/maintenance
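A back-of-envelope estimator helps sanity-check these drivers. The numbers assume float32 vectors and a very rough HNSW link cost, so treat the output as order-of-magnitude only:

```python
def vector_memory_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw vector storage only (float32 by default); indexes add overhead on top."""
    return n_vectors * dim * bytes_per_value

def hnsw_link_bytes(n_vectors, M, bytes_per_link=4):
    """Very rough HNSW graph overhead: ~2·M links per vector at the base layer.
    Real overhead varies by implementation; treat this as an order-of-magnitude guess."""
    return n_vectors * 2 * M * bytes_per_link

# 5M vectors × 768 dims at float32 ≈ 15.36 GB before any index overhead.
raw = vector_memory_bytes(5_000_000, 768)
links = hnsw_link_bytes(5_000_000, M=16)
```

Running the numbers like this before indexing millions of vectors makes "compress, shrink dimensions, or shard?" a calculation instead of a surprise.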

Step-by-step: how to choose and tune a vector database

This is a production-style workflow you can copy. It keeps you honest: you measure recall, then optimize latency and cost.

Step 1 — Define your workload (it changes the best index)

Write these down before you pick anything:

  • N: number of vectors now and in 6 months
  • d: vector dimension (e.g. 384, 768, 1024)
  • QPS: queries per second (peak, not average)
  • Write rate: inserts/updates per hour/day
  • Filters: tenant, language, time, category
  • Target: recall@k goal + latency SLA

Step 2 — Pick a baseline index (then tune)

If you have… | Start with… | Why
≤ 1M vectors, modest QPS | HNSW | Great quality, easy to tune
10M+ vectors, tight memory | IVF (optionally + compression) | More memory efficient at scale
Need ground-truth evaluation | Flat (on a sample) | Measures true recall and regressions

Step 3 — Measure recall against an exact baseline

Pick a representative sample (e.g. 5k–50k vectors) and run exact search on it. Then compare your approximate index settings.

# Simple recall test plan (no vendor specifics)
1) Sample vectors + queries from production distribution
2) For each query:
   - true_top10 = flat_search(sample, query, k=10)
   - approx_top10 = ann_search(sample, query, k=10)
3) recall@10 = mean(|intersection| / 10)
4) Record latency p50/p95 at your QPS target
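The plan above can be simulated end to end with no vendor library. Here a "fake" ANN that only scores a random half of the vectors stands in for a real index; that is enough to exercise the recall math:

```python
import math
import random

def flat_search(vectors, query, k=10):
    """Exact search: rank every vector by distance (the ground truth in step 2)."""
    return sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))[:k]

def fake_ann_search(vectors, query, k=10, candidate_frac=0.5, rng=None):
    """Stand-in for an ANN index: only scores a random subset of vectors,
    which is how approximate methods trade recall for speed."""
    rng = rng or random.Random(0)
    subset = rng.sample(range(len(vectors)), int(len(vectors) * candidate_frac))
    return sorted(subset, key=lambda i: math.dist(vectors[i], query))[:k]

rng = random.Random(42)
vectors = [[rng.random(), rng.random()] for _ in range(1000)]
queries = [[rng.random(), rng.random()] for _ in range(20)]

recalls = []
for q in queries:
    true_top10 = flat_search(vectors, q, k=10)
    approx_top10 = fake_ann_search(vectors, q, k=10, rng=random.Random(7))
    recalls.append(len(set(true_top10) & set(approx_top10)) / 10)
mean_recall = sum(recalls) / len(recalls)  # well below 1.0: half the data is skipped
```

Swap `fake_ann_search` for calls to your actual index, and record latency p50/p95 alongside each recall run.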

Step 4 — Tune the search effort (the main performance knob)

Every index has a “how hard to search” knob. Increase it until recall hits your target, then stop.

HNSW tuning loop

  • Hold M and efConstruction steady initially
  • Increase efSearch until recall@k stabilizes
  • If recall plateaus too low, rebuild with higher M/efConstruction

IVF tuning loop

  • Pick nlist (clusters) based on N
  • Increase nprobe until recall@k stabilizes
  • If recall is unstable, adjust nlist and retrain clusters

A simple rule that prevents over-tuning

Stop tuning when recall gains per latency cost get small. Past a point, it’s better to improve chunking, filtering, reranking, or embedding quality than to brute-force the index.

Step 5 — Improve quality without exploding cost

If recall is fine but results “feel” off, it’s usually not the index. It’s your data and retrieval strategy.

RAG quality upgrades

  • Chunk by meaning (headings/sections), not fixed length only
  • Add metadata (doc title, section, product, version)
  • Use MMR or reranking to reduce duplicates
  • Store citations (URL, heading) for traceable answers
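MMR itself is a short algorithm. A sketch of the standard formulation, where `lam` trades relevance against redundancy (toy 2-d vectors, illustration only):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr(query, docs, k=2, lam=0.7):
    """Maximal Marginal Relevance: each pick maximizes
    lam·sim(query, doc) − (1 − lam)·max_sim(doc, already_selected),
    so near-duplicates of already-selected docs get penalized."""
    selected, remaining = [], list(range(len(docs)))
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda i: lam * cosine(query, docs[i])
            - (1 - lam) * max((cosine(docs[i], docs[j]) for j in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is distinct but less relevant.
docs = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]
picked = mmr([1.0, 0.2], docs, k=2, lam=0.5)  # → [1, 2]: duplicate 0 is skipped
```

With `lam=1.0` the penalty disappears and MMR degrades to plain relevance ranking, which would return both near-duplicates.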

Search relevance upgrades

  • Hybrid: keyword prefilter + vector rerank
  • Use query expansion for short queries
  • Boost fresh/authoritative content with metadata
  • Handle synonyms and model drift with eval sets

Step 6 — Operational checklist (what breaks in production)

  • Backups: you want point-in-time recovery for index + metadata
  • Versioning: record embedding model version and chunker version
  • Monitoring: track latency p95, recall proxy metrics, and “no result” rate
  • Cold start: ensure index rebuild time fits your recovery plan
  • Multi-tenancy: enforce filters at query-time and schema-level

Once this is in place, you can make changes confidently—without “it felt worse” arguments.

Common mistakes (and the fixes)

These are the traps that waste the most time when building semantic search and RAG systems.

Mistake 1 — “k=5 is enough” (without reranking)

Small k can miss crucial context. But large k can add noise.

  • Fix: retrieve k=6–12, then rerank or MMR
  • Fix: measure answer quality vs k on a small eval set

Mistake 2 — Tuning index knobs before chunking

If chunks are too big, embeddings are “averages.” If too small, you lose context.

  • Fix: chunk by headings/sections + a token budget
  • Fix: store titles and section names as metadata

Mistake 3 — Ignoring filters (or filtering too late)

Unfiltered search is expensive and often less relevant.

  • Fix: filter early (tenant/time/category)
  • Fix: design metadata schema before indexing millions of vectors

Mistake 4 — No recall measurement

Without recall tests, you can’t tell whether quality drops are index-related or data-related.

  • Fix: keep a small Flat baseline for evaluation
  • Fix: record recall@k and latency p95 for changes

Mistake 5 — Treating embeddings as “set and forget”

Changing the embedding model can change distance distributions. Your old index knobs may no longer be optimal. Plan for reindexing, versioning, and regression tests.

FAQ: vector databases, recall, and cost

What is a vector database (in one sentence)?

A vector database stores embeddings and retrieves nearest neighbors efficiently using specialized indexes, plus filtering and operational features for production use.

HNSW vs IVF: which one should I choose?

HNSW is often the best default for small-to-medium scale and high-quality retrieval. IVF tends to shine at very large scale and tighter memory budgets—but usually needs more tuning.

What’s the difference between recall and precision in vector search?

Recall here usually means “did we retrieve the true nearest neighbors?” (ANN quality). Precision is “are the retrieved items actually relevant to the user’s intent?” You can have high ANN recall but low relevance if your chunks/metadata/embedding model are wrong.

Why is my vector search slow?

The common causes are: searching too many candidates (high effort), too many vectors, high dimensions, missing filters, high top-k, or running reranking on too many documents. Fix order: add filters → reduce k/rerank set → tune effort → consider different index.

How do I reduce vector database cost?

Reduce the amount of data each query touches (filters, partitions, hybrid prefilter), reduce stored vectors (better chunking), and tune index params to your target recall—not “as high as possible.” Compression can help at high scale.

What are good starting settings for RAG retrieval?

Start with meaningful chunking, retrieve k=6–12, apply metadata filters, and consider reranking or MMR. Then build a small evaluation set (real questions) and tune recall/latency around that.

Cheatsheet: the fast “do this” checklist

Index + tuning (most common knobs)

  • Flat: exact baseline (for eval)
  • HNSW: M, efConstruction (build), efSearch (query)
  • IVF: nlist (clusters), nprobe (searched clusters)
  • Rule: tune query effort first, rebuild only if needed

Quality + cost (big wins)

  • Chunking: 200–500 tokens is a strong start
  • Metadata: tenant, language, category, version, time
  • Hybrid: keyword/BM25 → candidates → vector rerank
  • RAG: k=6–12 + MMR/rerank for diversity

If you remember only one thing

Vector DB performance is mostly: candidate set size × search effort × vector dimension. Reduce the candidate set (filters/hybrid) before you brute-force the index.

Wrap-up: how to pick a vector store without guessing

Pick a target (recall@k + latency), measure recall with a Flat baseline on a sample, then tune your index “search effort” until you hit the target. After that, optimize cost by shrinking the candidate set with filters and hybrid retrieval—and improve relevance with chunking, metadata, and reranking.
