Monday, June 01, 2026 - 26 mins
This post is a walkthrough of how LLMs work. Modern LLMs are mostly built by stacking transformer blocks over and over, so understanding the transformer machinery gets you most of the way there.
I’ll cover the core mechanisms inside modern transformer-based LLMs, without all that sticky math stuff. Don’t get me wrong, you should learn the math, but this can serve as an introduction.
Most modern LLMs share the same transformer-family skeleton. The differences come from what each one was trained on, the scale and configuration choices, and the post-training done on top. By the end, you should be able to read many modern LLM papers or model cards and know which piece of the architecture each section is talking about.
Here’s the path:
Tokens, how a string of text becomes a sequence of integers
Embeddings, how those integers get meaning
Positional encoding, how the model knows what order the tokens came in
Attention, how tokens share information with each other
Multi-head attention, how the model tracks many kinds of relationships at once
The feed-forward network, where a large share of the model’s stored structure lives
The residual stream and layer normalization, what makes deep stacks trainable
Predicting the next token, what the model actually outputs and how the generation loop works
Architecture vs trained weights, what’s broadly shared across modern LLMs, and what’s different
Tiny explainers appear throughout so anyone can follow along, regardless of background.
Tokenization
Models don’t read text directly. They read integer IDs, so the first step is converting your prompt into a sequence of those integers.
That conversion step is called tokenization. A tokenizer takes a string and produces a sequence of integers, where each integer points to an entry in a fixed vocabulary. Modern LLM vocabularies usually contain tens of thousands to a few hundred thousand entries.
Tiny explainer: token ID A token ID is the integer the model uses for one vocabulary entry. The model works with the number, not the written word itself.
Tokens aren’t usually whole words. They’re usually subword pieces. The word “tokenization” might split into [“token”, “ization”]. The word “running” might split into [“run”, “ning”]. The reason is efficiency. Whole-word vocabularies are too big and don’t generalize to new words. Character-level vocabularies are too small and force the model to learn even the simplest patterns from scratch. Subword tokenization sits in the middle. The most common pieces become single tokens, and rare or novel words get composed from smaller pieces.
Tiny explainer: vocabulary The vocabulary is the tokenizer’s fixed list of pieces. Each piece has an ID, and the model can only directly receive IDs from that list.
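To make the string-to-integers step concrete, here is a minimal sketch of a greedy longest-match subword tokenizer. The vocabulary and its IDs are made up for illustration. Real tokenizers such as BPE learn their pieces from data and apply merge rules rather than this simple scan, so treat this as a toy, not as how any particular model does it.

```python
# Toy vocabulary: hypothetical pieces and IDs, not taken from any real tokenizer.
TOY_VOCAB = {"token": 0, "ization": 1, "run": 2, "ning": 3,
             "t": 4, "o": 5, "k": 6, "e": 7, "n": 8, "s": 9}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match: at each position, take the longest piece in the vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink it one character at a time.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            # No piece matched, not even a single character.
            raise ValueError(f"no vocabulary piece matches {text[i:]!r}")
    return ids


print(tokenize("tokenization"))  # [0, 1]
print(tokenize("running"))       # [2, 3]
print(tokenize("tokens"))        # [0, 9]
```

Notice that the model downstream would only ever see `[0, 1]`, never the letters inside "tokenization".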
The trade-off shows up in places people don’t expect. The classic example: ask an LLM how many R’s are in “strawberry.” LLMs used to get it wrong. That’s not the model failing at counting. It’s the model not operating on letters directly, only token IDs that happen to spell out a word a human would split letter by letter.
Different model families use different tokenizers. GPT models use Byte Pair Encoding variants. SentencePiece is common in LLaMA-style models. The choice matters for compute (fewer tokens means less work) and for things like multilingual coverage, but the basic shape is the same. Text in, integers out.
Now that the prompt is a sequence of integers, the next step is to give those integers meaning.
Embeddings
A token ID like 1024 is just a row index. It doesn’t mean anything by itself. The thing that gives it meaning is a giant table called the embedding matrix.
Every model has one. It has one row per entry in the vocabulary, and each row is a long vector of numbers. The length of each row is the model’s hidden size. In many 7B-class models, that means 4,096 numbers per token. Larger models usually use wider vectors.
Tiny explainer: vector A vector is a list of numbers. In a transformer, each token becomes a vector so the model can do math with it.
When the tokenizer hands the model an integer, the model looks up that row and uses the vector instead. That vector is the token’s embedding. It’s the model’s representation of what that token “means,” learned during training.
Tiny explainer: embedding matrix The embedding matrix is a lookup table. Token ID in, learned vector out.
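The lookup itself is about as simple as it sounds. Here is a small sketch with a made-up table: the sizes are tiny on purpose, and random numbers stand in for the values a real model would learn during training.

```python
import random

VOCAB_SIZE = 8   # toy size; real vocabularies have tens of thousands of entries or more
HIDDEN_SIZE = 4  # toy size; many 7B-class models use 4,096

rng = random.Random(0)  # fixed seed so the toy table is reproducible

# One row per vocabulary entry, one column per hidden dimension.
# Random values are placeholders for learned weights.
embedding_matrix = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]
                    for _ in range(VOCAB_SIZE)]


def embed(token_ids):
    """Replace each token ID with its row of the embedding matrix."""
    return [embedding_matrix[token_id] for token_id in token_ids]


vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same ID gives the same vector
```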
The interesting property of these embeddings is that semantically similar tokens end up with similar vectors. The vector for “king” is close in space to the vector for “queen,” and the vector for “Paris” is close to “France.” None of this is hard-coded. It emerges from training on enough text, and the model learns these positions because they let it predict text well.
You can do arithmetic on embeddings and it sometimes works. The famous example is king − man + woman ≈ queen. The geometry of embedding space carries real semantic structure, even though nobody told the model to build it that way.
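Here is a toy version of that arithmetic. The vectors are hand-built with three readable dimensions so the result is easy to follow. Real embeddings are learned, have thousands of dimensions, and the analogy only holds approximately (and not always), so this shows the idea rather than a real model's behavior.

```python
import math

# Hand-built toy vectors. Dimensions are (royalty, gender, fruit-ness).
# These values are invented for illustration, not taken from any model.
TOY_EMBEDDINGS = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, embeddings=TOY_EMBEDDINGS):
    """Return the word whose embedding points most nearly the same way."""
    return max(embeddings, key=lambda word: cosine_similarity(vector, embeddings[word]))


target = [k - m + w for k, m, w in zip(TOY_EMBEDDINGS["king"],
                                       TOY_EMBEDDINGS["man"],
                                       TOY_EMBEDDINGS["woman"])]
print(nearest(target))  # queen
```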
Worth being clear on: at this stage every token has been replaced by its embedding, but the embedding alone says nothing about where the token sits in the sequence. The vector for “dog” is the same vector whether “dog” is the first word in your prompt or the fifth. That’s a problem.
That’s the gap positional encoding fills.
Positional encoding
Plain self-attention doesn’t have a built-in representation of word order. Without some positional signal, it has no direct way to know that “dog” came before “bites” instead of after it.
Word order changes meaning. So the model needs another piece. It needs a way to inject the position of each token into the math.
Tiny explainer: positional encoding Positional encoding is how the model gets order information. It tells the model where each token sits in the sequence.
The original transformer paper (Vaswani et al. 2017) did this by giving each position its own pattern of numbers and adding it directly to each token’s embedding before any other processing. Position 1 had one pattern, position 5 had a different pattern, position 100 had another. The patterns came from sine and cosine waves at different frequencies. Now the embedding for “dog” at position 1 was different from the embedding for “dog” at position 5, just because the position pattern added to it was different.
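For the curious, here is a short sketch of that sine-and-cosine pattern, following the formula in Vaswani et al. 2017. The embedding for “dog” is a made-up four-number vector, and positions are counted from 0 in the function itself.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Position pattern: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different wavelength.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]


def add_position(embedding, position):
    """Additive scheme: token vector plus position pattern, element by element."""
    pattern = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pattern)]


dog = [0.2, -0.1, 0.4, 0.3]  # hypothetical embedding for "dog"

print(sinusoidal_encoding(0, 4))                   # [0.0, 1.0, 0.0, 1.0]
print(add_position(dog, 1) != add_position(dog, 5))  # True: same token, different vectors
```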
That worked, and sinusoidal encodings were chosen partly in the hope that they would extrapolate beyond the sequence lengths seen during training. But additive position schemes still had two problems that became important as models scaled up.
First, the embedding had to carry both meaning and position in the same set of numbers. There’s only so much you can pack in.
Second, learned absolute position embeddings in particular don’t generalize cleanly. If you trained on prompts up to 2,048 tokens long, the model never saw position 5,000 during training, and the embedding for that position was not learned in the same way.
Modern models mostly use a different scheme called Rotary Position Embeddings (RoPE), introduced by Su et al. in 2021 and now used in LLaMA, Mistral, Gemma, Qwen, and most other open-weight families. The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its position. A token at position 1 gets a small turn, a token at position 100 gets a bigger turn. When two tokens are later compared during attention, what matters is the difference between their rotations, which encodes how far apart they are.
Tiny explainer: RoPE RoPE stands for Rotary Position Embeddings. Instead of adding a position vector, it rotates token vectors so relative distance shows up during attention.
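A small sketch makes the “only the difference matters” claim easy to check. It rotates consecutive pairs of numbers, which is one common way to write RoPE (real implementations differ in how they pair up dimensions). The query and key vectors are made up, and the base of 10000 follows the usual convention.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate each consecutive pair of numbers by an angle that grows with position."""
    d = len(vector)  # assumed to be even
    rotated = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)  # each pair turns at its own speed
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.append(x * cos_a - y * sin_a)
        rotated.append(x * sin_a + y * cos_a)
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0, 0.5, -0.5]  # made-up vectors
key = [0.3, 0.7, -0.2, 0.1]

near_start = dot(rope(query, 3), rope(key, 1))      # two positions apart
far_along = dot(rope(query, 103), rope(key, 101))   # still two positions apart
other_gap = dot(rope(query, 6), rope(key, 1))       # five positions apart

print(math.isclose(near_start, far_along, abs_tol=1e-9))  # True
print(math.isclose(near_start, other_gap, abs_tol=1e-9))  # False
```

The first two scores match because both pairs are two positions apart, even though one pair sits a hundred tokens later in the sequence.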
The practical advantages are real. RoPE encodes relative position naturally (which is closer to what attention actually wants). It generalizes better to longer contexts. And it doesn’t add new parameters to the model.
Even with good positional encoding, modern LLMs have a documented “lost in the middle” problem (Liu et al. 2023). They use information at the start and end of long prompts more reliably than information buried in the middle. That’s why prompt engineering tips like “put important context first” or “repeat key info at the end” actually help. The model isn’t using every part of your prompt equally well.
With token meaning and position both encoded, the next question is how tokens actually exchange information.
Attention
This is the mechanism that gave the original transformer paper its title. Attention.
Inside every transformer layer, attention does one thing. It lets each token look at the other tokens it is allowed to see and decide which ones matter for what comes next.
It does this by giving each token three roles at once. Each token gets transformed into three new vectors, called Query, Key, and Value (Q, K, V).
Tiny explainer: Q, K, V Query means “what am I looking for,” Key means “what do I match with,” and Value is the information that gets copied when the match is strong.
The Query asks, “what am I looking for from other tokens?”
The Key says, “this is what I offer to tokens looking at me.”
The Value carries, “this is what gets passed along when a match happens.”
The same token plays all three roles at the same time. The Q, K, V transformations are learned matrices, so the model figures out during training what each token should look for and what it should offer.
Matching happens through a similarity score. Each token’s Query is compared against the Key of each token it is allowed to see, using a scaled dot product. Intuitively, this measures how much the two vectors line up. The scaling keeps the numbers stable before softmax.
Tiny explainer: dot product A dot product is a simple way to score how aligned two vectors are. Higher alignment means a stronger match.
The match scores then get turned into weights using softmax. Softmax takes any set of numbers and turns them into a probability-like distribution that sums to 1. Tokens with higher match scores get higher weights, and the weights are then used to take a weighted average of the value vectors.
Tiny explainer: softmax Softmax turns raw scores into weights that add up to 1. Big scores get big weights, small scores get small weights.
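Softmax is short enough to write out in full. This is a minimal sketch with made-up scores. Subtracting the largest score first is a common trick to avoid overflow, and it doesn't change the result.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    max_score = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax([2.0, 1.0, 0.1])
print([round(w, 3) for w in weights])  # [0.659, 0.242, 0.099]
```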
An example. Consider the sentence “The cat that I saw yesterday was sleeping.” When the model processes “was,” it needs to figure out what’s doing the sleeping. The Query vector for “was” gets compared against the Key vectors of the tokens it is allowed to see. The dot product with “cat” is high, because the model has learned that verbs like “was” need a subject and that subjects like “cat” produce Key vectors that line up well. The dot product with “yesterday” is low. Softmax turns those scores into weights, “cat” gets a high weight, “yesterday” gets a low one. The model then takes a weighted sum of the corresponding value vectors, so the value for “cat” dominates the result. The new representation of “was” is now mostly shaped by the value of “cat.” That’s how a token several positions back becomes the referent.
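Here is that example as a toy computation for a single query. The Key, Value, and Query numbers are hand-picked so that “cat” lines up with “was”. In a real model they would come from learned matrices and have far more dimensions, and only three of the sentence's tokens are included to keep it short.

```python
import math


def softmax(scores):
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over the tokens it can see."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one output number per dimension.
    output = [sum(w * value[j] for w, value in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output


tokens = ["cat", "yesterday", "was"]
keys = [[4.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hand-picked toy Keys
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hand-picked toy Values
query_was = [2.0, 0.0]                         # toy Query for "was"

weights, output = attend(query_was, keys, values)
print(tokens[weights.index(max(weights))])  # cat
```

The output vector ends up close to the value for “cat” because that is where nearly all of the weight went.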
There’s a constraint specific to GPT-style language models, which is that they generate text left to right. A token at position 5 is only allowed to attend to positions 1 through 5. It cannot attend to tokens at positions 6, 7, 8, because those haven’t been generated yet. This is called causal masking. The implementation is simple: future tokens get match scores so low they end up with effectively zero weight after softmax.
Tiny explainer: causal masking Causal masking hides future tokens. It keeps a decoder-only language model from looking ahead while predicting the next token.
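A minimal sketch of the masking step, assuming a small grid of raw match scores where row i holds the scores for token i. This version sets future scores to negative infinity, which is one common way to do it and makes the future weights exactly zero rather than just very small. All scores here are 0.0 so the effect of the mask is easy to see.

```python
import math


def causal_attention_weights(scores):
    """scores[i][j] is the raw match of query i against key j. Future keys are masked out."""
    weights = []
    for i, row in enumerate(scores):
        # Token i may only see positions 0..i, so anything later is set to -inf.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        max_score = max(masked)
        exps = [math.exp(s - max_score) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_attention_weights(uniform_scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```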
One of the most interesting findings in interpretability research is about specialized attention heads called induction heads, found by Anthropic in 2022. These heads learn to spot patterns of the form “A B … A” in the prompt and predict that B comes next. When the model sees “A” the second time, the induction head looks back to where “A” appeared before, sees what came after, and copies that. They’re one of the clearest known mechanisms behind in-context learning, the ability of an LLM to pick up a pattern from your prompt and continue it.
Tiny explainer: induction head An induction head is an attention head that notices repeated patterns in the prompt and helps continue them.
Attention has one big cost. In full attention, each token compares against all the tokens it is allowed to see, so doubling the prompt length roughly quadruples the work. This is why long prompts are expensive to run, and why a lot of recent research is about making attention more efficient (FlashAttention, sparse attention, linear attention).
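The arithmetic behind “roughly quadruples” is quick to check. This sketch just counts query-key comparisons for one causally masked attention pass. It ignores everything else a real model does, so it's a back-of-the-envelope count, not a runtime estimate.

```python
def causal_comparisons(num_tokens):
    """Query-key comparisons in one causal attention pass: token i sees i tokens."""
    return num_tokens * (num_tokens + 1) // 2


for n in (1_000, 2_000, 4_000):
    print(n, causal_comparisons(n))
# 1000 500500
# 2000 2001000
# 4000 8002000
```

Each doubling of the prompt multiplies the count by just under 4.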
But one attention head only gives the model one learned view of those relationships.
Multi-head attention
A single attention pass gives the model one way of deciding which tokens matter to which other tokens. That’s not enough. Language has many relationships happening at the same time. Subject and verb agreement. Pronouns and the names they refer to. Long-range references between sentences. Word order and local phrases.
Multi-head attention solves this by running attention many times in parallel, with each parallel pass operating in its own smaller space. Each parallel pass is called a head.
Tiny explainer: attention head An attention head is one independent attention pass with its own learned projections.
Here’s the part that’s often described wrong, including in plenty of tutorials. Each head doesn’t get a literal slice of the original token vector. Each head has its own learned projection matrices that map the full token vector down to its own smaller Q, K, and V vectors. So if a model has 4,096 numbers per token and 32 heads, each head usually works in a 128-dimensional space, but those 128 numbers are a learned projection of the full 4,096, not a fixed slice. Different “views” of the same token, not different chunks of it.
Each head runs its attention pass independently. Then the outputs of all the heads get concatenated and passed through a final linear layer that mixes them back into one full-size vector. The model learns that final mixing too.
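Putting the pieces together, here is a toy multi-head attention pass. The sizes are tiny (8 numbers per token, 2 heads), the weights are random placeholders for what training would learn, and each head is written as its own set of projection matrices to match the description above. Production code usually fuses these into a few large matrices for speed, which computes the same thing.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Causal scaled dot-product attention: token i only sees tokens 0..i."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))])
    return out


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, heads, w_out):
    # Each head projects the FULL token vector down to its own small Q, K, V.
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concatenate the heads' outputs per token, then mix with the final linear layer.
    concatenated = [sum((h[t] for h in head_outputs), []) for t in range(len(x))]
    return matmul(concatenated, w_out)


D_MODEL, NUM_HEADS = 8, 2
HEAD_DIM = D_MODEL // NUM_HEADS
rng = random.Random(0)
heads = [tuple(random_matrix(D_MODEL, HEAD_DIM, rng) for _ in range(3))
         for _ in range(NUM_HEADS)]
w_out = random_matrix(NUM_HEADS * HEAD_DIM, D_MODEL, rng)
x = random_matrix(3, D_MODEL, rng)  # three made-up token vectors

y = multi_head_attention(x, heads, w_out)
print(len(y), len(y[0]))  # 3 8
```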
What makes this interesting is that different heads often end up partially specialized. The model is never told what each head should do. Specialization emerges naturally during training. Researchers have found heads that track grammar (linking verbs to their objects, articles to their nouns), heads that figure out which pronoun refers to which name, heads that track positional patterns, induction heads, and many more. A single transformer layer might have 32 heads. A modern frontier model has dozens of layers. So a typical LLM has thousands of attention heads in total, each adding its own learned view.
There’s a practical cost concern that drove a recent architectural change. Each head needs to keep its Key and Value vectors in memory for all the tokens already generated, so that when a new token gets generated the model doesn’t have to recompute everything from scratch. This is called the KV cache, and it’s the main memory cost of running an LLM at long context lengths.
Tiny explainer: KV cache The KV cache stores old Key and Value vectors during generation. It saves the model from recomputing the whole prompt every time it adds a token.
Modern decoder-only LLMs mostly use a variant called Grouped-Query Attention (GQA). Instead of every head having its own keys and values, groups of heads share the same key and value heads. LLaMA-2 70B has 64 query heads but only 8 key/value heads. Mistral 7B has 32 query heads and 8 key/value heads. The result is nearly the same accuracy as full multi-head attention but with much less memory pressure and inference cost.
Tiny explainer: GQA Grouped-Query Attention lets multiple query heads share fewer key/value heads. That cuts KV-cache memory while keeping many query views.
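A rough size estimate shows why this matters. The head counts below are the LLaMA-2 70B numbers from the text. The layer count (80), head size (128), 16-bit storage, and the 4,096-token context are assumptions added for this sketch, and the estimate covers a single sequence only.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size for one sequence."""
    # The leading 2 is one Key vector plus one Value vector,
    # stored per token, per layer, per key/value head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


GIB = 2 ** 30

# Assumed shape: 80 layers, head_dim 128, 16-bit values, 4,096 tokens of context.
full_mha = kv_cache_bytes(4096, num_layers=80, num_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(4096, num_layers=80, num_kv_heads=8, head_dim=128)

print(full_mha / GIB)   # 10.0
print(gqa / GIB)        # 1.25
print(full_mha // gqa)  # 8
```

Under these assumptions, sharing 8 key/value heads across 64 query heads shrinks the cache by a factor of 8.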
Feed-forward network
After attention finishes mixing information between tokens, every layer has a second step that nobody talks about as much. The feed-forward network.