Large Language Models (LLMs) look like magic: you type words, they reply with reasoning, code, or stories. Under the hood, it’s a surprisingly simple loop repeated billions of times: turn text into tokens → mix token information with attention → predict the next token. This post explains that loop in plain engineering terms—no heavy math, but enough detail that you can implement a tiny version.
Quickstart: the whole LLM pipeline in 7 minutes
If you only read one section, read this. Here’s the “mental executable” for how a transformer LLM works at inference time. Keep this in mind and the rest of the article will feel obvious.
What happens when you press “Send”
- Tokenize: text → token IDs (numbers)
- Embed: token IDs → vectors (dense meaning-ish coordinates)
- Add position: so order matters (“dog bites man” ≠ “man bites dog”)
- Repeat N layers: attention + MLP transforms vectors
- Project to vocabulary: vectors → logits (scores per token)
- Sample next token: choose token using temperature/top-p/top-k
- Append and loop: feed the new token back in until done
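The seven steps above can be sketched as a toy loop. Everything here is a stand-in: `tokenize`, `model_forward`, `sample`, and the six-word vocabulary are invented for illustration, not a real model's API.

```python
import random

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]

def tokenize(text):
    # Toy tokenizer: split on whitespace, map known words to IDs.
    return [VOCAB.index(w) for w in text.split() if w in VOCAB]

def model_forward(token_ids):
    # Stand-in for the transformer: returns one fake score (logit) per
    # vocabulary entry. A real model computes these from the context.
    random.seed(sum(token_ids))  # deterministic for the demo
    return [random.uniform(-1, 1) for _ in VOCAB]

def sample(logits):
    # Greedy "sampling": pick the highest-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt, max_new_tokens=5):
    ids = tokenize(prompt)               # 1. tokenize
    for _ in range(max_new_tokens):
        logits = model_forward(ids)      # 2-5. embed, layers, logits
        next_id = sample(logits)         # 6. sample next token
        ids.append(next_id)              # 7. append and loop
        if VOCAB[next_id] == "<eos>":
            break
    return " ".join(VOCAB[i] for i in ids)

print(generate("the cat"))
```

The structure is the point, not the fake model: tokenize once, then predict → sample → append until a stop condition.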
Two phrases that prevent confusion forever
- LLMs predict the next token. Everything else is an emergent side effect.
- Training changes weights; inference doesn’t. At inference you only compute, cache, and sample.
If you ever feel lost, come back to those two lines.
You don’t need to understand calculus to “get” LLMs. You just need a good picture of representations (vectors), routing (attention), and prediction (next-token sampling).
Overview: the simplest correct model of an LLM
A transformer LLM is a function that maps a sequence of tokens to a probability distribution over the next token. The “large” part is scale: lots of parameters, trained on lots of text, with a big vocabulary.
In one sentence
An LLM compresses patterns of language into weights, then uses attention to combine information from earlier tokens to predict the next token.
A plain-English picture of what’s inside
| Component | What it stores/does | Why it matters |
|---|---|---|
| Tokenizer | Turns text into token IDs | Defines the model’s “alphabet” and cost (tokens = billing/latency) |
| Embeddings | Lookup table: ID → vector | Transforms discrete symbols into something neural nets can process |
| Positional info | Encoding of order | Without it, the model is a bag-of-words blender |
| Attention | Mixes information across tokens | Lets later tokens “look back” at relevant earlier tokens |
| MLP / Feed-forward | Per-token transformation | Adds nonlinearity and capacity (skills live here too) |
| Output head | Vectors → token scores | Creates logits for sampling the next token |
| Sampler | Turns scores into an actual choice | Controls creativity vs reliability |
If you’re an engineer, it’s helpful to think of the model as a very expensive, very smart autocomplete engine. It can appear to “reason” because next-token prediction on huge corpora rewards internal circuits that track facts, logic patterns, and multi-step structure—but it’s still a prediction game.
Core concepts: tokens, attention, training (without pain)
1) Tokens: the model’s basic units
LLMs don’t read characters like humans. They read tokens—chunks of text (often ~3–4 characters on average in English, but it varies by language). Tokenization makes text manageable and lets the model reuse patterns like “ing”, “://”, or “function”.
Why tokenization exists
- Reduces sequence length (faster than character-level)
- Captures common subwords (“un-”, “-tion”, “micro-”)
- Provides a stable vocabulary for training and deployment
Practical impact
- Cost: you pay per token (often input + output)
- Limits: context window measured in tokens
- Prompting: short prompts are not always “short” in tokens
Think of tokens like Lego bricks. The model doesn’t see your sentence—only a sequence of numbered bricks. Everything it “knows” must be expressed through those bricks.
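To make the brick analogy concrete, here is a toy greedy longest-match tokenizer over a tiny hand-made vocabulary (real tokenizers like BPE learn their vocabulary from data; this one is invented to show why "unhappiness" can cost three tokens while "the" costs one):

```python
# Multi-character pieces plus single letters as a fallback, so any
# lowercase text can always be tokenized.
VOCAB = ["the", "un", "happi", "ness", "ing", " "] + list("abcdefghijklmnopqrstuvwxyz")

def toy_tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Greedy: take the longest vocabulary piece that matches here.
        piece = max((p for p in VOCAB if text.startswith(p, i)), key=len)
        tokens.append(piece)
        i += len(piece)
    return tokens

print(toy_tokenize("the unhappiness"))
# ['the', ' ', 'un', 'happi', 'ness']
```

Note how the rare word splits into reusable subwords while the common word stays whole; that is the core trade-off tokenization makes.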
2) Embeddings: turning IDs into meaning-ish vectors
Token IDs are just integers. To do anything useful, the model maps each ID to an embedding vector. Similar tokens end up with vectors that are “near” each other in a high-dimensional space—because that helps prediction.
What an embedding is (no math)
It’s a lookup table: token_id → vector. The vector is a learned fingerprint. During training, vectors move around until they help the model predict what comes next.
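The lookup-table nature of embeddings is easy to show. This sketch uses random vectors where a trained model would have learned ones; the sizes are made up:

```python
import numpy as np

np.random.seed(0)
vocab_size, d_model = 100, 8                  # tiny made-up sizes
embedding_table = np.random.randn(vocab_size, d_model)  # learned in reality

token_ids = [12, 47, 12]                      # hypothetical IDs from a tokenizer
vectors = embedding_table[token_ids]          # pure lookup: one row per token

print(vectors.shape)                          # (3, 8): three tokens, 8 dims each
assert np.array_equal(vectors[0], vectors[2]) # same ID -> same vector
```

There is no computation here at all, just indexing; the "meaning" lives entirely in where training moved each row.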
3) Attention: the “routing” system
Attention answers a simple question: when predicting the next token, which previous tokens matter most? Instead of compressing the entire past into one fixed summary, attention lets the model create a different summary depending on what it needs right now.
The intuition
If the text says: “Alice gave Bob the book. He thanked her.” then “He” should attend strongly to “Bob” and “her” should attend strongly to “Alice”.
- Attention is how the model resolves references.
- It’s how long-range dependencies become possible.
The engineering version
- Queries: what the current token is looking for
- Keys: what each previous token offers
- Values: the information to actually mix in
- Masking: prevents “seeing the future” during generation
Attention is powerful, but not free. Naively, its cost grows quadratically with sequence length (every token attends to every earlier token). That’s why long-context serving relies on optimizations like KV caching and efficient attention variants.
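The queries/keys/values story fits in a few lines of NumPy. This is a minimal single-head causal self-attention sketch with random weights (a real model has learned projections and many heads):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
seq_len, d = 4, 8
x = np.random.randn(seq_len, d)          # one vector per token

# Learned projection matrices in a real model; random here.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)            # how well each query matches each key
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores[mask] = -np.inf                   # causal mask: no peeking at the future
weights = softmax(scores)                # each row sums to 1
out = weights @ V                        # weighted mix of values per token

assert np.allclose(weights.sum(axis=1), 1.0)
assert weights[0, 1:].sum() == 0         # token 0 can only see itself
```

The two asserts capture the two rules of the game: attention weights form a probability distribution, and the mask enforces "look back only".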
4) Layers: repeated “think blocks”
A transformer stacks many nearly identical layers. Each layer first mixes context via attention, then transforms each token via an MLP. Early layers tend to learn local patterns; later layers capture higher-level structure.
Why so many layers?
- Each layer can refine representations (like repeatedly editing a draft)
- Deeper networks can represent more complex functions
- Stacking creates “circuits” for skills like syntax, facts, and reasoning patterns
5) Training vs inference: what changes, what doesn’t
This is where most misunderstandings come from. During training, the model sees lots of text and updates weights to reduce prediction error. During inference, the weights are frozen. You’re just running the forward pass and sampling.
Training (learning)
- Data: huge text corpus
- Goal: predict next token correctly
- Weights: updated via optimization
- Outcome: patterns stored in parameters
Inference (using)
- Data: your prompt + conversation
- Goal: generate useful continuation
- Weights: fixed
- Outcome: tokens sampled from a distribution
When people say “the model learned from my prompt”, they usually mean in-context learning: the model adapts behavior based on the prompt content, but it does not permanently change weights.
Step-by-step: generate one token like a transformer
Let’s walk the actual generation loop in an implementation mindset. This is the simplest accurate story of what happens.
Step 1 — Tokenize your prompt
Input: "Explain attention simply" →
Output: a list of token IDs like [1012, 8912, 2331, ...].
The exact IDs depend on the tokenizer.
Step 2 — Convert IDs to embeddings + positions
Each token ID becomes a vector. Then positional information is added so the model can distinguish “first token” vs “fifth token”.
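Here is why adding positional information matters, in a tiny sketch (learned positional embeddings are one common style; rotary embeddings are another — the sizes and vectors below are made up):

```python
import numpy as np

np.random.seed(0)
seq_len, d_model = 5, 8
token_vecs = np.random.randn(seq_len, d_model)  # rows from the embedding table
pos_vecs = np.random.randn(seq_len, d_model)    # learned positional embeddings

token_vecs[4] = token_vecs[0]                   # same token appears twice
x = token_vecs + pos_vecs                       # what the first layer sees

# Without positions the two occurrences would be indistinguishable;
# with positions added, they get different input vectors.
assert not np.allclose(x[0], x[4])
```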
Step 3 — Run the transformer layers
What each layer does (conceptually)
- Self-attention: each token mixes information from previous tokens (causal mask)
- MLP: each token is transformed independently (adds capacity)
- Residual + normalization: stabilizes and helps learning (implementation detail, big practical impact)
After many layers, the last token’s vector contains a rich summary of the prompt + how to continue it.
Step 4 — Convert the final vector into token scores (logits)
The model projects the last hidden vector into vocabulary space: one score per possible next token. Higher score means more likely continuation.
A useful mental model
Logits are like “unnormalized preferences” for each token. Turning logits into probabilities is done with softmax, but you don’t need to memorize the formula to use the concept.
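If you do want to see the conversion once, softmax is three lines (the shift by the max is a standard numerical-stability trick; the logit values below are invented):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])       # unnormalized preferences
probs = softmax(logits)
print(probs.round(3))                    # higher logit -> higher probability
assert abs(probs.sum() - 1.0) < 1e-9     # now a valid distribution
```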
Step 5 — Sample the next token (controls creativity)
If you always pick the highest-probability token, output can become repetitive and dull. Sampling methods trade off determinism and variety.
Common sampling knobs (what they actually do)
| Setting | Effect | When to use |
|---|---|---|
| Temperature | Higher = more randomness | Creative writing, brainstorming |
| Top-p (nucleus) | Sample from the smallest set of tokens totaling p probability | Good default for “natural” variation |
| Top-k | Sample only from the k most likely tokens | Helps avoid weird low-probability tokens |
| Stop sequences | Force generation to stop at patterns | Structured outputs, tool calls, safety constraints |
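The three numeric knobs in the table can be combined in one sampler. This is a simplified sketch (real serving stacks differ in details like the order knobs are applied; `sample_next` and its defaults are invented for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = softmax(logits)
    if top_k is not None:
        # Zero out everything below the k-th largest probability.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose mass reaches top_p.
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs /= probs.sum()                 # renormalize after filtering
    return int(rng.choice(len(probs), p=probs))

# Very low temperature ~ greedy: always the top-scoring token.
print(sample_next([5.0, 1.0, 1.0], temperature=0.01))
```

Lower temperature sharpens the distribution toward the top token; top-k and top-p both prune the long tail of unlikely tokens before sampling.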
Step 6 — Append the token and repeat
The chosen token is appended to the sequence, then the model predicts the next one, and so on. This loop continues until it hits a stop condition (length limit, stop token, stop sequence).
Generation is sequential: you must generate token 1 before token 2. Also, long prompts increase attention work. Efficient serving relies on batching and KV caching (storing past keys/values so you don’t recompute them).
Bonus: KV cache in 30 seconds
Without caching, each new token would re-run attention over the entire prompt from scratch. With KV caching, the model stores the “keys” and “values” from previous tokens, and each new step only computes the new token’s pieces. Result: much faster token-by-token generation for long contexts.
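A minimal sketch of the cache idea, assuming a single attention head with random weights (the `CachedAttention` class is invented for illustration, not a real library API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class CachedAttention:
    """Single-head attention that appends K/V per step instead of recomputing."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) for _ in range(3))
        self.keys, self.values = [], []      # the KV cache

    def step(self, x_new):
        # Only the NEW token's projections are computed each step;
        # earlier keys/values are reused from the cache.
        q = x_new @ self.Wq
        self.keys.append(x_new @ self.Wk)
        self.values.append(x_new @ self.Wv)
        K, V = np.stack(self.keys), np.stack(self.values)
        w = softmax(q @ K.T / np.sqrt(len(q)))
        return w @ V

d = 8
attn = CachedAttention(d)
rng = np.random.default_rng(1)
for _ in range(3):                           # three generation steps
    out = attn.step(rng.standard_normal(d))

assert len(attn.keys) == 3                   # cache grows by one entry per step
```

The trade-off: generation gets much faster, but the cache itself consumes memory proportional to context length, which is why long contexts are expensive to serve.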
Common mistakes (and fixes) when learning LLMs
These are the misunderstandings that keep people stuck. Fix them and your intuition improves fast.
Mistake 1 — “The model understands like a human”
LLMs are trained to predict text, not to build grounded world models by default. They can be impressive, but they can also confidently generate wrong answers.
- Fix: treat outputs as probabilistic suggestions
- Fix: verify facts, use citations, add retrieval (RAG) for grounding
Mistake 2 — “Attention is the whole model”
Attention is crucial, but a transformer also relies on embeddings, MLP blocks, normalization, and scale. Many “skills” live across components.
- Fix: think “attention routes info, MLP transforms it”
- Fix: remember there are many layers, not one
Mistake 3 — Confusing training with prompting
Prompts influence behavior temporarily (within the context). Training changes the actual weights.
- Fix: use prompts for formatting, constraints, examples
- Fix: use fine-tuning when you need consistent behavior across many prompts
Mistake 4 — “Longer prompts always help”
Long prompts can dilute focus and raise cost. Structured prompts with only the relevant context usually work better.
- Fix: give the model only what it needs for the task
- Fix: move bulky info to retrieval (RAG) or summarize it
Fluency is not accuracy. The model is optimized to produce plausible text, which can include plausible errors. For high-stakes use, add verification, tools, or human review.
FAQ: questions people actually search
What is an LLM in simple terms?
An LLM (Large Language Model) is a neural network trained to predict the next token in text. Because it’s trained on huge datasets, it learns patterns for grammar, style, facts, and common reasoning steps.
Are all LLMs transformers?
Most modern, high-performing LLMs are transformer-based, because attention scales well and models long-range context. Older architectures (like RNNs) exist but are rarely used for state-of-the-art LLMs today.
What does “attention” mean in transformers?
Attention is a mechanism that lets each token decide which earlier tokens to focus on when building its representation. It’s like a dynamic routing system: the model can pull relevant information from different parts of the prompt as needed.
Why do LLMs hallucinate?
Because the model is trained to generate the most likely continuation, not to guarantee truth. If the prompt implies an answer exists, the model may generate a plausible one even if it’s not grounded. Techniques like retrieval (RAG), better prompting, and tool use reduce hallucinations.
What is a context window?
The context window is how many tokens the model can consider at once (prompt + conversation + generated output). If you exceed it, older parts are truncated or summarized depending on the system.
Fine-tuning vs RAG: which should I use?
Use RAG when you need up-to-date or private knowledge and citations. Use fine-tuning when you need the model to consistently follow a style, format, or behavior across many tasks.
Cheatsheet: the fast “remember this” list
One-liners
- Token: a chunk of text the model processes as one unit
- Embedding: a learned vector representation of a token
- Attention: weighted mixing of information across tokens
- Logits: scores for each possible next token
- Sampling: choosing the next token from the distribution
- Training: updating weights on huge corpora
- Inference: generating with fixed weights
If you’re building with LLMs
- Give clear constraints (format, tone, length)
- Provide relevant context (not maximum context)
- Use RAG for factual/enterprise knowledge
- Lower temperature for reliability
- Validate outputs (schemas, tests, citations) for critical tasks
The core loop
If you remember nothing else, remember this:
text → tokens → embeddings → (attention + MLP) × N → logits → sample → repeat
Wrap-up: what you now understand
You now have the “not scary” but accurate model of how LLMs work: they tokenize text, turn tokens into vectors, route context with attention, and repeatedly predict the next token. The magic comes from scale and training—not from a hidden rules engine.
- Read your own prompts as inputs to a next-token predictor: “What continuation would be likely?”
- Try one knob: set temperature lower for more reliable answers.
- If you need facts: use retrieval (RAG) or ask for citations and verify.