Large Language Models (LLMs) look like magic: you type words, they reply with reasoning, code, or stories. Under the hood, it’s a surprisingly simple loop repeated billions of times: turn text into tokens → mix token information with attention → predict the next token. This post explains that loop in plain engineering terms—no heavy math, but enough detail that you can implement a tiny version.
Quickstart: the whole LLM pipeline in 7 minutes
If you only read one section, read this. Here’s the “mental executable” for how a transformer LLM works at inference time. Keep this in mind and the rest of the article will feel obvious.
What happens when you press “Send”
- Tokenize: text → token IDs (numbers)
- Embed: token IDs → vectors (dense meaning-ish coordinates)
- Add position: so order matters (“dog bites man” ≠ “man bites dog”)
- Repeat N layers: attention + MLP transforms vectors
- Project to vocabulary: vectors → logits (scores per token)
- Sample next token: choose token using temperature/top-p/top-k
- Append and loop: feed the new token back in until done
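The seven steps above can be sketched as a toy loop. Everything here is a stand-in: `tokenize`, `model_forward`, `sample`, and the six-word vocabulary are invented for illustration, not a real model's API.

```python
import random

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]

def tokenize(text):
    # Toy tokenizer: split on whitespace, map known words to IDs.
    return [VOCAB.index(w) for w in text.split() if w in VOCAB]

def model_forward(token_ids):
    # Stand-in for the transformer: returns one fake score (logit) per
    # vocabulary entry. A real model computes these from the context.
    random.seed(sum(token_ids))  # deterministic for the demo
    return [random.uniform(-1, 1) for _ in VOCAB]

def sample(logits):
    # Greedy "sampling": pick the highest-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt, max_new_tokens=5):
    ids = tokenize(prompt)               # 1. tokenize
    for _ in range(max_new_tokens):
        logits = model_forward(ids)      # 2-5. embed, layers, logits
        next_id = sample(logits)         # 6. sample next token
        ids.append(next_id)              # 7. append and loop
        if VOCAB[next_id] == "<eos>":
            break
    return " ".join(VOCAB[i] for i in ids)

print(generate("the cat"))
```

The structure is the point, not the fake model: tokenize once, then predict → sample → append until a stop condition.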
Two phrases that prevent confusion forever
- LLMs predict the next token. Everything else is an emergent side effect.
- Training changes weights; inference doesn’t. At inference you only compute, cache, and sample.
If you ever feel lost, come back to those two lines.
You don’t need to understand calculus to “get” LLMs. You just need a good picture of representations (vectors), routing (attention), and prediction (next-token sampling).
Overview: the simplest correct model of an LLM
A transformer LLM is a function that maps a sequence of tokens to a probability distribution over the next token. The “large” part is scale: lots of parameters, trained on lots of text, with a big vocabulary.
In one sentence
An LLM compresses patterns of language into weights, then uses attention to combine information from earlier tokens to predict the next token.
A plain-English picture of what’s inside
| Component | What it stores/does | Why it matters |
|---|---|---|
| Tokenizer | Turns text into token IDs | Defines the model’s “alphabet” and cost (tokens = billing/latency) |
| Embeddings | Lookup table: ID → vector | Transforms discrete symbols into something neural nets can process |
| Positional info | Encoding of order | Without it, the model is a bag-of-words blender |
| Attention | Mixes information across tokens | Lets later tokens “look back” at relevant earlier tokens |
| MLP / Feed-forward | Per-token transformation | Adds nonlinearity and capacity (skills live here too) |
| Output head | Vectors → token scores | Creates logits for sampling the next token |
| Sampler | Turns scores into an actual choice | Controls creativity vs reliability |
If you’re an engineer, it’s helpful to think of the model as a very expensive, very smart autocomplete engine. It can appear to “reason” because next-token prediction on huge corpora rewards internal circuits that track facts, logic patterns, and multi-step structure—but it’s still a prediction game.
Core concepts: tokens, attention, training (without pain)
1) Tokens: the model’s basic units
LLMs don’t read characters like humans. They read tokens—chunks of text (often ~3–4 characters on average in English, but it varies by language). Tokenization makes text manageable and lets the model reuse patterns like “ing”, “://”, or “function”.
Why tokenization exists
- Reduces sequence length (faster than character-level)
- Captures common subwords (“un-”, “-tion”, “micro-”)
- Provides a stable vocabulary for training and deployment
Practical impact
- Cost: you pay per token (often input + output)
- Limits: context window measured in tokens
- Prompting: short prompts are not always “short” in tokens
Think of tokens like Lego bricks. The model doesn’t see your sentence—only a sequence of numbered bricks. Everything it “knows” must be expressed through those bricks.
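To make the brick analogy concrete, here is a toy greedy longest-match tokenizer over a tiny hand-made vocabulary (real tokenizers like BPE learn their vocabulary from data; this one is invented to show why "unhappiness" can cost three tokens while "the" costs one):

```python
# Multi-character pieces plus single letters as a fallback, so any
# lowercase text can always be tokenized.
VOCAB = ["the", "un", "happi", "ness", "ing", " "] + list("abcdefghijklmnopqrstuvwxyz")

def toy_tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Greedy: take the longest vocabulary piece that matches here.
        piece = max((p for p in VOCAB if text.startswith(p, i)), key=len)
        tokens.append(piece)
        i += len(piece)
    return tokens

print(toy_tokenize("the unhappiness"))
# ['the', ' ', 'un', 'happi', 'ness']
```

Note how the rare word splits into reusable subwords while the common word stays whole; that is the core trade-off tokenization makes.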
2) Embeddings: turning IDs into meaning-ish vectors
Token IDs are just integers. To do anything useful, the model maps each ID to an embedding vector. Similar tokens end up with vectors that are “near” each other in a high-dimensional space—because that helps prediction.
What an embedding is (no math)
It’s a lookup table: token_id → vector. The vector is a learned fingerprint. During training, vectors move around until they help the model predict what comes next.
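The lookup-table nature of embeddings is easy to show. This sketch uses random vectors where a trained model would have learned ones; the sizes are made up:

```python
import numpy as np

np.random.seed(0)
vocab_size, d_model = 100, 8                  # tiny made-up sizes
embedding_table = np.random.randn(vocab_size, d_model)  # learned in reality

token_ids = [12, 47, 12]                      # hypothetical IDs from a tokenizer
vectors = embedding_table[token_ids]          # pure lookup: one row per token

print(vectors.shape)                          # (3, 8): three tokens, 8 dims each
assert np.array_equal(vectors[0], vectors[2]) # same ID -> same vector
```

There is no computation here at all, just indexing; the "meaning" lives entirely in where training moved each row.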
3) Attention: the “routing” system
Attention answers a simple question: when predicting the next token, which previous tokens matter most? Instead of compressing the entire past into one fixed summary, attention lets the model create a different summary depending on what it needs right now.
The intuition
If the text says: “Alice gave Bob the book. He thanked her.” then “He” should attend strongly to “Bob” and “her” should attend strongly to “Alice”.
- Attention is how the model resolves references.
- It’s how long-range dependencies become possible.
The engineering version
- Queries: what the current token is looking for
- Keys: what each previous token offers
- Values: the information to actually mix in
- Masking: prevents “seeing the future” during generation
Attention is powerful, but not free. Naively, its cost grows quadratically with sequence length (every token attends to every earlier token). That’s why long-context serving relies on optimizations like KV caching and efficient attention variants.
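The queries/keys/values story fits in a few lines of NumPy. This is a minimal single-head causal self-attention sketch with random weights (a real model has learned projections and many heads):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
seq_len, d = 4, 8
x = np.random.randn(seq_len, d)          # one vector per token

# Learned projection matrices in a real model; random here.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)            # how well each query matches each key
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf                   # causal mask: no peeking at the future
weights = softmax(scores)                # each row sums to 1
out = weights @ V                        # weighted mix of values per token

assert np.allclose(weights.sum(axis=1), 1.0)
assert weights[0, 1:].sum() == 0         # token 0 can only see itself
```

The two asserts capture the two rules of the game: attention weights form a probability distribution, and the mask enforces "look back only".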
4) Layers: repeated “think blocks”
A transformer stacks many nearly identical layers. Each layer first mixes context via attention, then transforms each token via an MLP. Early layers tend to learn local patterns; later layers capture higher-level structure.
Why so many layers?
- Each layer can refine representations (like repeatedly editing a draft)
- Deeper networks can represent more complex functions
- Stacking creates “circuits” for skills like syntax, facts, and reasoning patterns
5) Training vs inference: what changes, what doesn’t
This is where most misunderstandings come from. During training, the model sees lots of text and updates weights to reduce prediction error. During inference, the weights are frozen. You’re just running the forward pass and sampling.
Training (learning)
- Data: huge text corpus
- Goal: predict next token correctly
- Weights: updated via optimization
- Outcome: patterns stored in parameters
Inference (using)
- Data: your prompt + conversation
- Goal: generate useful continuation
- Weights: fixed
- Outcome: tokens sampled from a distribution
When people say “the model learned from my prompt”, they usually mean in-context learning: the model adapts behavior based on the prompt content, but it does not permanently change weights.
Step-by-step: generate one token like a transformer
Let’s walk the actual generation loop in an implementation mindset. This is the simplest accurate story of what happens.
Step 1 — Tokenize your prompt
Input: "Explain attention simply" →
Output: a list of token IDs like [1012, 8912, 2331, ...].
The exact IDs depend on the tokenizer.
Step 2 — Convert IDs to embeddings + positions
Each token ID becomes a vector. Then positional information is added so the model can distinguish “first token” vs “fifth token”.
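Here is why adding positional information matters, in a tiny sketch (learned positional embeddings are one common style; rotary embeddings are another — the sizes and vectors below are made up):

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 5, 8
token_vecs = np.random.randn(seq_len, d_model)  # rows from the embedding table
pos_vecs = np.random.randn(seq_len, d_model)    # learned positional embeddings

token_vecs[4] = token_vecs[0]                   # same token appears twice
x = token_vecs + pos_vecs                       # what the first layer sees

# Without positions the two occurrences would be indistinguishable;
# with positions added, they get different input vectors.
assert not np.allclose(x[0], x[4])
```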
Step 3 — Run the transformer layers
What each layer does (conceptually)
- Self-attention: each token mixes information from previous tokens (causal mask)
- MLP: each token is transformed independently (adds capacity)
- Residual + normalization: stabilizes and helps learning (implementation detail, big practical impact)
After many layers, the last token’s vector contains a rich summary of the prompt + how to continue it.
Step 4 — Convert the final vector into token scores (logits)
The model projects the last hidden vector into vocabulary space: one score per possible next token. Higher score means more likely continuation.
A useful mental model
Logits are like “unnormalized preferences” for each token. Turning logits into probabilities is done with softmax, but you don’t need to memorize the formula to use the concept.
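If you do want to see the conversion once, softmax is three lines (the shift by the max is a standard numerical-stability trick; the logit values below are invented):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])       # unnormalized preferences
probs = softmax(logits)
print(probs.round(3))                    # higher logit -> higher probability
assert abs(probs.sum() - 1.0) < 1e-9     # now a valid distribution
```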
Step 5 — Sample the next token (controls creativity)
If you always pick the highest-probability token, output can become repetitive and dull. Sampling methods trade off determinism and variety.
Common sampling knobs (what they actually do)
| Setting | Effect | When to use |
|---|---|---|
| Temperature | Higher = more randomness | Creative writing, brainstorming |
| Top-p (nucleus) | Sample from the smallest set of tokens totaling p probability | Good default for “natural” variation |
| Top-k | Sample only from the k most likely tokens | Helps avoid weird low-probability tokens |
| Stop sequences | Force generation to stop at patterns | Structured outputs, tool calls, safety constraints |
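The three numeric knobs in the table can be combined in one sampler. This is a simplified sketch (real serving stacks differ in details like the order knobs are applied; `sample_next` and its defaults are invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = softmax(logits)
    if top_k is not None:
        # Zero out everything below the k-th largest probability.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose mass reaches top_p.
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs /= probs.sum()                 # renormalize after filtering
    return int(rng.choice(len(probs), p=probs))

# Very low temperature ~ greedy: always the top-scoring token.
print(sample_next([5.0, 1.0, 1.0], temperature=0.01))
```

Lower temperature sharpens the distribution toward the top token; top-k and top-p both prune the long tail of unlikely tokens before sampling.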
Step 6 — Append the token and repeat
The chosen token is appended to the sequence, then the model predicts the next one, and so on. This loop continues until it hits a stop condition (length limit, stop token, stop sequence).
Generation is sequential: you must generate token 1 before token 2. Also, long prompts increase attention work. Efficient serving relies on batching and KV caching (storing past keys/values so you don’t recompute them).
Bonus: KV cache in 30 seconds
Without caching, each new token would re-run attention over the entire prompt from scratch. With KV caching, the model stores the “keys” and “values” from previous tokens, and each new step only computes the new token’s pieces. Result: much faster token-by-token generation for long contexts.
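A minimal sketch of the cache idea, assuming a single attention head with random weights (the `CachedAttention` class is invented for illustration, not a real library API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class CachedAttention:
    """Single-head attention that appends K/V per step instead of recomputing."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) for _ in range(3))
        self.keys, self.values = [], []      # the KV cache

    def step(self, x_new):
        # Only the NEW token's projections are computed each step;
        # earlier keys/values are reused from the cache.
        q = x_new @ self.Wq
        self.keys.append(x_new @ self.Wk)
        self.values.append(x_new @ self.Wv)
        K, V = np.stack(self.keys), np.stack(self.values)
        w = softmax(q @ K.T / np.sqrt(len(q)))
        return w @ V

d = 8
attn = CachedAttention(d)
rng = np.random.default_rng(1)
for _ in range(3):                           # three generation steps
    out = attn.step(rng.standard_normal(d))

assert len(attn.keys) == 3                   # cache grows by one entry per step
```

The trade-off: generation gets much faster, but the cache itself consumes memory proportional to context length, which is why long contexts are expensive to serve.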
Common mistakes (and fixes) when learning LLMs
These are the misunderstandings that keep people stuck. Fix them and your intuition improves fast.
Mistake 1 — “The model understands like a human”
LLMs are trained to predict text, not to build grounded world models by default. They can be impressive, but they can also confidently generate wrong answers.
- Fix: treat outputs as probabilistic suggestions
- Fix: verify facts, use citations, add retrieval (RAG) for grounding
Mistake 2 — “Attention is the whole model”
Attention is crucial, but a transformer also relies on embeddings, MLP blocks, normalization, and scale. Many “skills” live across components.
- Fix: think “attention routes info, MLP transforms it”
- Fix: remember there are many layers, not one
Mistake 3 — Confusing training with prompting
Prompts influence behavior temporarily (within the context). Training changes the actual weights.
- Fix: use prompts for formatting, constraints, examples
- Fix: use fine-tuning when you need consistent behavior across many prompts
Mistake 4 — “Longer prompts always help”
Long prompts can dilute focus and raise cost. Structured prompts with only the relevant context usually work better.
- Fix: give the model only what it needs for the task
- Fix: move bulky info to retrieval (RAG) or summarize it
Fluency is not accuracy. The model is optimized to produce plausible text, which can include plausible errors. For high-stakes use, add verification, tools, or human review.
FAQ: questions people actually search
What is an LLM in simple terms?
An LLM (Large Language Model) is a neural network trained to predict the next token in text. Because it’s trained on huge datasets, it learns patterns for grammar, style, facts, and common reasoning steps.
Are all LLMs transformers?
Most modern, high-performing LLMs are transformer-based, because attention scales well and models long-range context. Older architectures (like RNNs) exist but are rarely used for state-of-the-art LLMs today.
What does “attention” mean in transformers?
Attention is a mechanism that lets each token decide which earlier tokens to focus on when building its representation. It’s like a dynamic routing system: the model can pull relevant information from different parts of the prompt as needed.
Why do LLMs hallucinate?
Because the model is trained to generate the most likely continuation, not to guarantee truth. If the prompt implies an answer exists, the model may generate a plausible one even if it’s not grounded. Techniques like retrieval (RAG), better prompting, and tool use reduce hallucinations.
What is a context window?
The context window is how many tokens the model can consider at once (prompt + conversation + generated output). If you exceed it, older parts are truncated or summarized depending on the system.
Fine-tuning vs RAG: which should I use?
Use RAG when you need up-to-date or private knowledge and citations. Use fine-tuning when you need the model to consistently follow a style, format, or behavior across many tasks.
Cheatsheet: the fast “remember this” list
One-liners
- Token: a chunk of text the model processes as one unit
- Embedding: a learned vector representation of a token
- Attention: weighted mixing of information across tokens
- Logits: scores for each possible next token
- Sampling: choosing the next token from the distribution
- Training: updating weights on huge corpora
- Inference: generating with fixed weights
If you’re building with LLMs
- Give clear constraints (format, tone, length)
- Provide relevant context (not maximum context)
- Use RAG for factual/enterprise knowledge
- Lower temperature for reliability
- Validate outputs (schemas, tests, citations) for critical tasks
The core loop
If you remember nothing else, remember this:
text → tokens → embeddings → (attention + MLP) × N → logits → sample → repeat
Wrap-up: what you now understand
You now have the “not scary” but accurate model of how LLMs work: they tokenize text, turn tokens into vectors, route context with attention, and repeatedly predict the next token. The magic comes from scale and training—not from a hidden rules engine.
- Read your own prompts as inputs to a next-token predictor: “What continuation would be likely?”
- Try one knob: set temperature lower for more reliable answers.
- If you need facts: use retrieval (RAG) or ask for citations and verify.