
LLM Cost Optimization: Reduce Tokens Without Losing Quality

Prompt compression, caching, and routing strategies.

Reading time: ~10–14 min
Level: Beginner → Advanced

LLM features are addictive: add a chatbot, a summarizer, an “AI helper”, and suddenly your monthly bill looks like a surprise rent payment. The good news: most LLM spend is preventable. You can reduce cost dramatically without hurting quality by focusing on three levers: tokens, requests, and routing.


Quickstart: cut LLM costs today (highest-impact steps)

If you want immediate savings, do these in order. Each step is common, low-risk, and often gives double-digit reductions.

Fast wins (usually 30–70% cheaper)

  • Stop over-prompting: move long “rules” into a compact system template
  • Cap output: set max tokens + “be concise” + stop sequences
  • Cache: reuse results for repeated prompts and repeated context
  • Route: cheap model for easy tasks; expensive model only when needed
  • RAG over paste: retrieve only relevant chunks, don’t paste entire docs

A 5-minute baseline you should measure

Before optimizing, measure where tokens go. Most teams discover the same pattern: inputs are huge (docs + history) and outputs are wordy.

Metric | What to log | Why it matters
Input tokens | System + user + retrieved context | Usually the biggest cost driver
Output tokens | Assistant tokens | Controls verbosity and latency
Calls per action | Number of model calls per user event | Multipliers hurt fast
Fallback rate | Share of requests routed to the expensive model | Shows how well routing works

Rule of thumb

The cheapest token is the one you never send. Optimize context size first, then outputs, then model choice.

Overview: where LLM costs actually come from

LLM pricing varies by provider, but the shape is consistent: you pay for tokens processed and sometimes for extras (tools, embeddings, reranking, images, etc.). For a typical text app, the bill is mostly: input tokens + output tokens multiplied by how often you call the model.

The cost equation (practical)

Total cost ≈ (input_tokens + output_tokens) × calls × price_per_token

That’s why “just add more context” becomes expensive: it increases every call.
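The equation is easy to turn into a back-of-envelope calculator. A minimal sketch, using hypothetical placeholder prices (not any provider's real rates):

```python
# Back-of-envelope cost model. The per-1k prices below are illustrative
# placeholders, not a real provider's rates.
def estimate_cost(input_tokens, output_tokens, calls,
                  price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    """Approximate spend for one feature: (in + out) x calls x price."""
    per_call = (input_tokens / 1000) * price_in_per_1k \
             + (output_tokens / 1000) * price_out_per_1k
    return per_call * calls

# Example: 2,000 input tokens, 300 output tokens, 10,000 calls/month.
monthly = estimate_cost(2000, 300, 10_000)
print(f"${monthly:.2f}")  # note how context size multiplies every call
```

Plugging in your own logged numbers makes the "which lever first?" decision obvious.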

The 3 levers that matter most

Lever | What you change | Typical savings
Tokens | Shorter prompts + smaller context + tighter output | 20–80%
Requests | Fewer calls via caching, batching, and better UX flow | 10–60%
Routing | Use cheaper models/paths when confidence is high | 10–70%

This guide focuses on tactics that work across stacks and providers. You’ll get checklists, patterns, and “when to use what” so you can optimize without turning your product into a brittle prompt museum.

Core concepts: token budget, quality budget, and “wasted context”

1) Token budget: your real “compute budget”

Tokens are the unit of work. If you want to cut cost, you need a token budget per user action (or per workflow). It’s like setting performance budgets in web apps: you can’t optimize what you don’t cap.

A simple budget that works

  • Input: keep under a target (example: 1–3k tokens) for common requests
  • Output: cap per response (example: 150–400 tokens) unless explicitly requested
  • Calls: aim for 1 call per user action; 2 only when necessary

You can still support “deep dives”—just make them an explicit mode so you don’t pay deep-dive costs on every casual query.
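The budget above can be enforced as a pre-flight check before each call. A minimal sketch, assuming you already have a token count for the prompt (in practice from a tokenizer such as tiktoken for OpenAI models):

```python
# Per-feature token budgets; the numbers mirror the examples above
# and should be tuned to your own traffic.
BUDGETS = {
    "chat":     {"input": 3000, "output": 400},
    "classify": {"input": 1000, "output": 120},
}

def within_budget(feature, prompt_tokens):
    """Flag prompts that blow the input budget before you pay for them."""
    return prompt_tokens <= BUDGETS[feature]["input"]

print(within_budget("classify", 800))   # fits the budget
print(within_budget("classify", 5000))  # trim context or switch modes
```

Logging budget violations per feature tells you exactly where context is bloating.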

2) Quality budget: spend tokens where they matter

Not all tokens are equal. A lot of what you send is low-value repetition: long rules pasted every time, irrelevant chat history, entire documentation pages, verbose outputs, duplicated context. Quality improves when you keep only the parts that change the model’s answer.

High-value tokens

  • User goal + constraints (“what good looks like”)
  • Relevant facts (retrieved chunks, structured data)
  • Examples of desired format (few-shot, 1–3 examples)
  • Acceptance criteria (schema, bullet checklist)

Low-value tokens

  • Re-stating rules the model already follows
  • Entire docs instead of targeted excerpts
  • Old conversation history that’s no longer relevant
  • Overlong outputs (“explain everything”) by default

3) Wasted context: the silent cost killer

Most LLM apps overpay because they include too much context “just in case”. The model then spends tokens processing irrelevant text, which increases cost and often reduces accuracy (noise competes with signal).

Counterintuitive truth

More context can make answers worse. Treat context like a search result page: you want the top few relevant items, not the entire internet pasted into the prompt.

4) Cost and latency usually move together

Fewer tokens and fewer calls typically mean lower latency too, so cost optimization often improves UX.

Step-by-step: a practical LLM cost optimization playbook

Here’s a proven sequence: measure → trim context → compress prompts → control outputs → reduce calls → route smartly. Each step includes tactics you can apply immediately.

Step 0 — Instrument the basics (do this first)

Log input tokens, output tokens, model name, latency, cache hit/miss, and route/fallback decisions. If possible, tag by feature (“summarize”, “chat”, “extract”, “classify”) so you can see what’s expensive.

Step 1 — Trim context aggressively (biggest ROI)

Most prompts are bloated. Reduce the context to only what changes the answer.

Context trimming tactics

  • Windowing: include only the last N turns, not the full chat
  • Relevance filter: only include messages referenced by the user’s current ask
  • Summarize history: keep a running short summary + the latest turns
  • Structured state: store facts in JSON fields instead of repeating prose

A clean “memory” pattern

Maintain two things: (1) a short summary of stable info, and (2) the last 3–8 turns. Refresh the summary occasionally with a cheap model.
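The pattern fits in a few lines. A minimal sketch (the summary refresh via a cheap model is left out; `summary` is just a string field here):

```python
# Sketch of the "short summary + last N turns" memory pattern.
from collections import deque

class ChatMemory:
    def __init__(self, max_turns=6):
        self.summary = ""                      # short, stable facts
        self.turns = deque(maxlen=max_turns)   # only the latest turns survive

    def add(self, role, text):
        self.turns.append((role, text))

    def build_context(self):
        """What actually gets sent to the model: summary + recent turns."""
        lines = []
        if self.summary:
            lines.append(f"Summary: {self.summary}")
        lines += [f"{role}: {text}" for role, text in self.turns]
        return "\n".join(lines)

mem = ChatMemory(max_turns=3)
for i in range(10):
    mem.add("user", f"message {i}")
print(mem.build_context())  # only the last 3 turns remain in context
```

The `deque(maxlen=...)` does the windowing for you: old turns fall off automatically instead of accumulating into an ever-growing prompt.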

Step 2 — Use RAG instead of pasting documents

If you paste large docs into prompts, you pay for them every time and you overload the model. Retrieval-Augmented Generation (RAG) fetches only relevant passages.

RAG cost tips that actually matter

  • Chunk size: keep chunks small enough to be targeted (but not fragmentary)
  • Top-k: retrieve fewer chunks (often 3–6 is enough)
  • Rerank: optional, but can reduce k while improving relevance
  • Deduplicate: avoid overlapping chunks
  • Prefer structured data: if you have tables/fields, pass those instead of prose

The “needle” test

Ask: “If I remove this context, does the answer change?” If not, it’s probably wasted tokens.
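Deduplication, one of the tips above, is easy to sketch. Jaccard word overlap is a crude stand-in for a real reranker or embedding similarity, but it illustrates the idea:

```python
# Sketch: drop near-duplicate retrieved chunks before they hit the prompt.
# Word-level Jaccard similarity is a deliberately crude proxy here.
def dedupe_chunks(chunks, threshold=0.6):
    """Keep a chunk only if it doesn't heavily overlap an already-kept one."""
    kept = []
    for chunk in chunks:
        words = set(chunk.lower().split())
        is_dup = False
        for k in kept:
            k_words = set(k.lower().split())
            jaccard = len(words & k_words) / len(words | k_words)
            if jaccard > threshold:
                is_dup = True
                break
        if not is_dup:
            kept.append(chunk)
    return kept

docs = [
    "refund within 30 days",
    "refund within 30 days of purchase",   # near-duplicate, dropped
    "shipping takes 5 business days",
]
print(dedupe_chunks(docs))
```

Every dropped chunk is tokens you don't pay for on every single call that uses this retrieval result.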

Step 3 — Prompt compression: shorter instructions, same behavior

Many prompts repeat long policies, style guides, and rules. You can compress them into a compact template and still get the same outputs.

Compression tactics

  • Replace paragraphs with bullets (models follow bullets well)
  • Use “do/don’t” lists instead of essays
  • Make formats explicit (schemas, section headers)
  • Remove synonyms and repeated requirements
  • Keep one example instead of five

A reusable “compact system template”

Role: You are a helpful assistant.
Output: Use the requested format. Be concise by default.
Quality: If unsure, ask a single clarifying question or state assumptions.
Safety: Don’t invent citations or data.
Style: Use bullets and short paragraphs.

Step 4 — Control output tokens (often overlooked)

Output tokens are easy to cap and they directly affect cost and latency. Default to shorter answers unless the user asks for depth.

Output controls that work

  • Set max output tokens based on the feature (e.g., 120 for classification, 300 for summaries)
  • Use stop sequences for structured outputs (JSON end, delimiter end)
  • Ask for “short” by default and provide “expand” UI for more detail
  • Require a format (JSON / bullet list / table) to reduce rambling
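These controls are easiest to keep honest when they live in one per-feature config. A minimal sketch; the request shape is generic and not tied to any specific provider SDK:

```python
# Per-feature output limits; the numbers echo the examples above.
OUTPUT_LIMITS = {
    "classify":  {"max_tokens": 120, "stop": ["\n\n"]},
    "summarize": {"max_tokens": 300, "stop": None},
}

def build_request(feature, prompt):
    """Attach the feature's output caps to every outgoing request."""
    limits = OUTPUT_LIMITS[feature]
    return {
        "prompt": prompt,
        "max_tokens": limits["max_tokens"],  # hard cap on output spend
        "stop": limits["stop"],              # cut structured output off cleanly
    }

req = build_request("classify", "Label this ticket: ...")
print(req["max_tokens"])
```

Centralizing the caps also means one place to tighten when a feature's output costs creep up.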

Step 5 — Cache everything that repeats

Caching is the most underused cost strategy. Many prompts are identical or nearly identical. Cache results with a sensible key (inputs + relevant settings) and a TTL.

What to cache (highest value)

  • Embeddings for documents and queries
  • RAG retrieval results (query → chunks)
  • Tool outputs (database lookups, metadata)
  • Common completions (templates, boilerplate responses)
  • Conversation summaries (update incrementally)

Cache gotchas

  • Include model + temperature in the cache key for generation
  • Use short TTL for volatile content
  • Don’t cache sensitive user data across users
  • Log cache hit rate (it’s your free money metric)
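The gotchas above translate directly into code: the key must include the model and settings, and entries need a TTL. A minimal in-memory sketch (production systems would typically use Redis or similar):

```python
# Sketch of a generation cache whose key includes model + temperature,
# with a TTL. In-memory dict stands in for a real cache store.
import hashlib, json, time

_cache = {}

def cache_key(model, temperature, prompt):
    """Model + settings must be in the key, or you'll serve stale answers."""
    payload = json.dumps({"m": model, "t": temperature, "p": prompt},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_generate(model, temperature, prompt, generate, ttl=3600):
    key = cache_key(model, temperature, prompt)
    hit = _cache.get(key)
    if hit and time.time() - hit["at"] < ttl:
        return hit["value"]                  # cache hit: zero token spend
    value = generate(prompt)                 # cache miss: pay for the call
    _cache[key] = {"value": value, "at": time.time()}
    return value

calls = []
fake_llm = lambda p: calls.append(p) or f"answer to {p}"
cached_generate("cheap-model", 0.0, "hi", fake_llm)
cached_generate("cheap-model", 0.0, "hi", fake_llm)
print(len(calls))  # 1 -- the second request was served from cache
```

The `len(calls)` counter is exactly the hit-rate metric the checklist asks you to log.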

Step 6 — Route smarter: cheap-first, expensive-only-when-needed

Not every task needs a top model. Use a “router” strategy: start with a cheaper model for easy tasks, then escalate only when confidence is low or the task is complex.

Routing patterns that don’t break quality

Task type | Try cheap model first | Escalate when…
Classification / tagging | Yes | Low confidence or unclear input
Summarization | Often | Requires precise citations or technical accuracy
Extraction to JSON | Yes | Schema violations, messy input
Complex reasoning | Maybe | Multi-step constraints, planning, deep code changes
Customer-facing final answers | Sometimes | Brand tone risk or high-stakes correctness

A simple router heuristic

  • If input is short + structured → cheap model.
  • If the user asks for “exact” / “legal” / “financial” → escalate.
  • If output must be perfect JSON → cheap model first, then validate, then escalate on failure.
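The heuristic above fits in one function. A minimal sketch; the keyword list and JSON check are illustrative placeholders, and the model callables stand in for real API clients:

```python
# Sketch of the cheap-first router: validate the cheap answer,
# escalate only on failure or high-stakes keywords.
import json

def validate_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def route(prompt, cheap_llm, expensive_llm):
    """Return (answer, tier) so the fallback rate can be logged."""
    if any(w in prompt.lower() for w in ("legal", "financial", "exact")):
        return expensive_llm(prompt), "expensive"   # high-stakes: skip cheap
    answer = cheap_llm(prompt)
    if validate_json(answer):
        return answer, "cheap"                      # validated: done
    return expensive_llm(prompt), "fallback"        # track this rate!

answer, tier = route("Extract fields as JSON: ...",
                     lambda p: '{"name": "Ada"}',
                     lambda p: '{"name": "Ada", "role": "eng"}')
print(tier)  # cheap
```

Returning the tier alongside the answer is what makes the "fallback rate" metric from the baseline table measurable.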

Step 7 — Batch and parallelize where possible

If you do multiple small calls per user action, consider batching or using a single call that returns multiple fields. (But don’t cram unrelated tasks into one prompt if it harms accuracy.)
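The savings come from paying for the shared system/context portion once instead of once per call. A toy arithmetic sketch with illustrative token counts:

```python
# Toy comparison: three separate calls vs one batched call returning
# all three fields. Token counts are illustrative, not measured.
def token_cost(system_tokens, task_tokens, calls):
    # the system/context portion is re-sent on every call
    return (system_tokens + task_tokens) * calls

separate = sum(token_cost(500, 100, 1) for _ in range(3))   # 3 calls
batched  = token_cost(500, 300, 1)                          # 1 call, 3 tasks
print(separate, batched)  # 1800 vs 800: shared context is paid once
```

The bigger your shared context relative to the per-task payload, the more batching saves.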

Common mistakes (and fixes) in LLM cost optimization

These are the cost leaks that keep showing up in real products.

Mistake 1 — Pasting the entire knowledge base

This increases cost and often reduces answer quality.

  • Fix: use RAG; retrieve 3–6 relevant chunks
  • Fix: summarize or structure data before sending

Mistake 2 — No output caps

Verbose outputs can quietly become your biggest line item.

  • Fix: set max tokens per feature
  • Fix: use “short by default, expand on demand”

Mistake 3 — Calling the model multiple times for one thing

Each extra call multiplies your token spend.

  • Fix: merge steps when safe (one call with multiple outputs)
  • Fix: cache intermediate results

Mistake 4 — Using the best model for everything

Most tasks don’t need it.

  • Fix: route cheap-first and escalate on failures
  • Fix: validate outputs automatically (schema, constraints)

Don’t optimize blindly

If you cut tokens and your product becomes unreliable, you didn’t “optimize”—you just moved the cost to support tickets. Always measure quality (accuracy, user satisfaction, task success).

FAQ: LLM cost questions people search

Why is my LLM bill so high?

Usually because of one (or more) of these: oversized prompts (too much context), verbose outputs, multiple calls per user action, and no caching. Start by logging input/output tokens and calls per workflow.

What’s the fastest way to reduce tokens?

Trim context first: remove irrelevant chat history, stop pasting docs, and adopt RAG with a small top-k. Then compress prompts into bullet rules and cap output length.

Does caching really help with LLM costs?

Yes—often massively. Many user requests repeat (or are similar enough to normalize). Cache retrieval results, summaries, and common generations. Track cache hit rate to quantify savings.

How do I use cheaper models without losing quality?

Use routing: cheap model for simple tasks, then validate outputs (schema checks, confidence scoring) and escalate only on failures. This preserves quality while reducing average cost.

Does RAG reduce cost or increase it?

Done right, it reduces cost because you send less context (only relevant chunks) instead of large documents. The extra retrieval steps are usually cheaper than processing thousands of extra prompt tokens repeatedly.

How do I stop the model from writing too much?

Combine: explicit brevity instruction, max output token limits, a required format (bullets/JSON), and stop sequences. Also consider UX: offer “Expand” to fetch a longer follow-up only when users want it.

Cheatsheet: the “reduce LLM cost” checklist

Cut input tokens

  • Replace full-doc paste with RAG (top-k 3–6)
  • Summarize long history into a short memory
  • Use bullets instead of long instruction paragraphs
  • Pass structured fields (JSON) instead of prose when possible
  • Remove duplicated rules across prompts

Cut output tokens

  • Set max output tokens per feature
  • Use “short by default” and an “expand” option
  • Require format (JSON/bullets/tables)
  • Use stop sequences for clean endings
  • Lower temperature for fewer rambles

Reduce calls

  • Cache repeated prompts and repeated retrieval results
  • Batch small tasks when safe
  • Don’t call the model on every keystroke / UI change
  • Precompute expensive things (summaries, embeddings)

Route smarter

  • Cheap model first
  • Validate (schema/constraints/confidence)
  • Escalate only on failure or high-stakes tasks
  • Track fallback rate and optimize it

The 80/20 rule

Most savings come from: (1) smaller context, (2) output caps, and (3) caching. Start there before chasing exotic tricks.

Wrap-up: save money without making the product worse

LLM cost optimization is mostly about being intentional: measure tokens and calls, trim context, compress prompts, cap outputs, cache repeats, and route intelligently. Done right, you’ll ship a faster product that costs less and often answers better (because it sees less noise).

Your next step

  • Pick one workflow and log token usage end-to-end.
  • Cut context by 50% (remove paste, add RAG, summarize history).
  • Add output caps + stop sequences.
  • Implement caching for the top 5 repeated requests.

Quiz

Quick self-check.

1) Which change usually reduces LLM cost the most?
2) What is a safe way to use cheaper models without losing quality?
3) Why can pasting whole documents be counterproductive?
4) What should you include in the cache key for generated text?