Transformers and QKV Attention: A Primer

LLMs
Transformers
Attention
Teaching
How a Transformer moves information between words — queries, keys, values, the residual stream, and why attention is the only channel where tokens talk to each other. Part 1 of 3.
Author

Luca Erzegovesi

Published

May 31, 2026

First in the series “Understanding LLMs to use them better in management and finance.” The three notes describe one machine — a Transformer language model — from three complementary angles. This note opens the machine up and lays out the attention machinery: how a Transformer moves information between words. The second note is about the embeddings — the vocabulary of vectors the machine reads from and writes to, and how to look at a space with hundreds of dimensions without fooling yourself. The third note watches the final step — the moment all that work collapses into a single next word, which turns out to be a relentless search. Read in order, the three move from mechanism, to dictionary, to read-out.

A word on how to read this one. Think of it as the technical booklet that ships with an electronic appliance: it aims to give an accessible but complete view of the machinery and its inner workings. The end user normally need not open the booklet; the technician must. But in business applications of AI the boundary between technician and end user is blurring fast — anyone who deploys these models in management or finance increasingly has to look inside the machine to judge what it can and cannot reliably do. This first note is accordingly longer and more abstract than the two that follow, which are focused and example-driven. The architecture it describes is the Transformer, introduced by Vaswani et al. (2017) in the paper whose title became a slogan, “Attention Is All You Need”; the technical terms are collected in Appendix C and the references at the very end.

1. Transformers: an evolutionary step from neural networks

Before opening up the attention machinery, it helps to see where a Transformer comes from. It is not an exotic invention out of nowhere; it is one more step in a long line of neural networks that all do the same basic thing — convert input data, step by step, into a representation whose position in some space encodes the answer. What changes from one architecture to the next is how that conversion is organized. Attention is the organizational idea that made networks good at language.

A network you may already picture: the OCR convolutional net

Think of a classic optical-character-recognition (OCR) network — the kind that reads a handwritten digit or letter from a small image. Its workhorse is the convolution. The input is a grid of pixels. A small filter (say a 3 \times 3 patch of weights) slides across the image; at each location it multiplies the pixels under it by its weights and sums them into a single number. Sweeping the filter over the whole image produces a feature map — a new grid that lights up wherever the filter’s little pattern (an edge, a curve, a stroke) is present. A layer has many such filters, so it produces many feature maps. Stack a few convolution layers (with pooling in between to shrink the grid), and the network builds up from edges → strokes → loops → whole-character shapes.

It is worth stating the underlying operation precisely, because a clean version of it carries over to attention. A convolution is a weighted sum over a small local window of the input: at each position the filter computes a dot product between its fixed weights and the pixels it currently covers. (It is a weighted sum, not strictly an average: the weights need not sum to one and are often negative, which is how a filter can reward ink in one place and penalise it in another.) Two things are then layered on top. First, the same weights are reused at every position (weight sharing), so the operation is really one small pattern-detector swept across the whole image. Second, a layer stacks many such detectors and a later layer takes weighted combinations of their feature maps; it is those combination weights, learned by training, that select which mixtures of low-level features best predict the correct character. So the informal picture of a convolution as a weighted average that “selects the useful combinations” is right once it is split in two: each convolution is a local weighted sum, and the cross-feature mixing that “selects the useful combinations” happens when later layers combine feature maps. (Appendix B works a full convolution through by hand.)
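The local weighted sum described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the OCR network itself: the 5×5 image, the 3×3 filter, and the helper name `conv2d_valid` are all made up for the example, and the filter is applied without flipping, as deep-learning libraries usually do.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]           # the local receptive field
            feature_map[i, j] = np.sum(window * kernel)  # dot product of weights and pixels
    return feature_map

# Hypothetical 5x5 "image" with a vertical stroke in the middle column.
image = np.zeros((5, 5))
image[:, 2] = 1.0

# Hypothetical filter: rewards ink in its centre column, penalises ink beside it.
kernel = np.array([[-1.0, 2.0, -1.0],
                   [-1.0, 2.0, -1.0],
                   [-1.0, 2.0, -1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The same nine weights are reused at every position, and the feature map is largest where the stroke sits under the filter's centre column.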

At the very end the features are flattened and a softmax classifier turns them into a probability over the possible characters.

flowchart LR
    A["pixels<br/>(grid of ink/blank)"] -->|"filters slide<br/>shared 3×3 weights<br/>local receptive field,<br/>reused at every position"| B["feature maps<br/>one grid per filter:<br/>edges, strokes, ..."]
    B -->|"flatten +<br/>linear layer"| C["softmax over<br/>26 letters"]
    C --> D["a .02<br/>...<br/>e .91 ◄ answer<br/>..."]

Two properties make this work for images. First, each filter has a fixed, local receptive field: it only ever looks at a small neighborhood of pixels. Second, the same filter is reused at every position (weight sharing), so a stroke is recognized the same way wherever it sits on the page. Both properties are exactly right for images, because the clues needed to recognize a stroke are local and their meaning does not depend on where on the page they appear.

Why language breaks this, and why we need attention

Language does not have those two convenient properties. The piece of context a word depends on can be right next to it or hundreds of words back — and which earlier words matter depends on the content, not on a fixed offset. To resolve “it” you must find the noun it refers to, and that noun could be anywhere. A fixed, local filter cannot do this: it always looks in the same small window, regardless of what the sentence is about.

Attention is the fix. Instead of a fixed local window, an attention head computes, on the fly and for each token, how much to pull from every other token — and then pulls. The “receptive field” is no longer fixed by the architecture; it is decided at runtime, from the content, by the query–key matching we will detail in Section 2. In one sentence:

An attention head is the language analogue of a convolutional filter, except its receptive field is learned, dynamic, and content-dependent instead of fixed and local.

That single change — a receptive field the data gets to choose, every time — is what let neural networks finally handle long-range, content-driven dependencies, and it is the heart of the Transformer.

It is important not to over-claim, though: attention is the new idea, not the only idea. A Transformer interleaves two kinds of block. The attention heads are the novelty just described — the content-driven, dynamic receptive field that moves information between positions. Alongside them sit ordinary MLP (multi-layer perceptron) blocks, the same fully-connected, learned weighted-combination machinery that does the cross-feature mixing in a CNN — except here each MLP refines one token’s representation in place, with no reference to its neighbours. The two blocks divide the labour cleanly: attention is the only channel that lets tokens talk to each other, while the MLP is where each token’s representation is reshaped on its own. Sections 5 and 6 build both explicitly, and Appendix A shows what is lost if you switch the attention off and keep only the MLP.

What the output data actually is

It pays to be precise about what these networks produce, because the same description fits both the OCR net and the language model.

In both cases the network turns its input into a representation: a vector whose position in a high-dimensional space encodes the answer. In OCR, the final feature vector’s location says “this image sits in the region of the space that means the letter e.” In a Transformer, the inputs are pieces of text — tokens — and each token is represented by an embedding: a vector of numbers giving its coordinates in a high-dimensional “language space,” where direction and proximity stand in for meaning. The vector the model processes and maintains for each token — the residual stream, formally introduced in Section 4 — is best read as a modified embedding: it starts life as the raw token embedding and each layer nudges it to a new position that encodes everything the model has worked out about that token in its context.

So the output data are representations of entities, describing each entity’s most likely position in a solution space. And from that position the model reads out one of two things:

  • a single crisp solution — the one best answer (take the highest-scoring class, i.e. the \arg\max, equivalently temperature \to 0); or
  • a probability distribution over solutions — a graded set of plausible answers with weights.

The readout is the same machine in both networks: a linear projection followed by a softmax. OCR projects the final feature vector onto the alphabet and gets a distribution over letters; a language model projects the final token representation onto the vocabulary and gets a distribution over next tokens. Same structure, different solution space. (Section 5 shows exactly how the final representation’s direction encodes which tokens are likely and its magnitude encodes how confident the model is — the geometric version of “position in a solution space.”)

The logic of the processing: params convert inputs into entity representations

How does the input get from raw symbols to these context-aware representations? The recipe is the through-line of this whole note:

  1. Look up. Each input token ID is replaced by its embedding — a first, context-free guess at its position in the space (a row of the embedding matrix E).
  2. Read, transform, write — repeatedly. A stack of parameterized blocks then reads the current representations and writes adjustments back. Two kinds of block alternate:
    • Attention blocks move information between tokens (the dynamic receptive field above), letting each token’s representation absorb what it needs from the others.
    • MLP blocks refine each token’s representation in place, one position at a time (detailed in Section 6).
  3. Read out. After L such rounds (model layers) the representation is “finished,” and the unembedding projects it into the solution space to give the crisp answer or the distribution.

The parameters (W_Q, W_K, W_V, W_O in attention; W_1, W_2 in the MLP) are frozen after training. They are the learned knowledge: they encode the rule for how to move each representation, step by step, from a bare embedding toward its correct final position. Training is just the search for parameter values that put every entity in the right place in the solution space.

Why this framing matters for what follows. Because all the “thinking” lives in these incrementally-modified representations flowing through the residual stream, we can later ask very sharp questions about a trained model: where is a particular piece of behavior carried, and which block put it there? In small, fully-understood models one can even disable a single block and watch a specific behavior appear or vanish — the basis of the interpretability experiments these notes are meant to accompany (see Appendix A). The rest of this note builds the mechanism precisely enough to make those questions answerable.

2. The QKV trio

Every token, at every attention layer, produces three vectors derived from the token’s current hidden state via three learned weight matrices (W_Q, W_K, W_V):

  • Q (query) — “what am I looking for?”
  • K (key) — “what do I offer to be matched against?”
  • V (value) — “what information do I carry, if matched?”

Attention is the operation that uses these three to mix information across tokens. For a given token’s query Q, the model computes a similarity score against the K of every token up to and including itself (a dot product, scaled, then softmaxed into weights). Those weights are then used to take a weighted sum of the corresponding V vectors. The result is the attention output for that token: a blend of values from the tokens it can see, weighted by how well their keys matched this token’s query.

In compact form (Q, K, and V are obtained by applying the corresponding W to the tokens’ hidden states, which in the first layer are the embeddings; see Section 4):

\text{attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) \cdot V

This formula is per head: Q, K \in \mathbb{R}^{T \times d_k} and V \in \mathbb{R}^{T \times d_v}, giving an output in \mathbb{R}^{T \times d_v}. In multi-head attention, the same formula is applied h times in parallel on independent Q/K/V slices, and the h resulting T \times d_v blocks are concatenated and then mixed by W_O into the residual.

So Q and K together produce the attention pattern (who attends to whom, and how strongly), and V is what actually flows along those attention edges: the attention weights select and blend the V vectors. V is the payload, QK is the routing.
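The per-head formula can be written out directly. The sketch below is one plausible NumPy rendering, assuming a single head and adding the causal mask that Section 5 uses; the function names and the random inputs are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Single-head attention. Q, K: (T, d_k); V: (T, d_v). Returns (output, weights)."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                       # (T, T) routing scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # positions j > i
    scores = np.where(future, -np.inf, scores)            # a token may not look ahead
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    return weights @ V, weights                           # payload blended along the routes

rng = np.random.default_rng(0)            # arbitrary demo inputs, not trained values
Q, K, V = rng.normal(size=(3, 4, 8))      # T = 4 tokens, d_k = d_v = 8
out, weights = causal_attention(Q, K, V)
print(out.shape, weights.shape)
```

Because the first token can only see itself, its output row is exactly its own V row, which is a quick sanity check on any implementation.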

3. Why only K and V are cached, not Q

During generation, when the model is producing token N+1:

  • It needs the Q of the new token only — Q is computed fresh each step and discarded; it has no use beyond this single attention computation.
  • It needs the K and V of every previous token, because the new token’s Q has to attend back to all of them.

That’s why the cache is called the KV cache and not “QKV cache”. K and V are the historical state that accumulates as the conversation grows; Q is ephemeral, recomputed and thrown away every step.

This also explains the asymmetry between prefill (prompt processing) and generation (prompt continuation). During prefill, Q is computed for every input token too — but it’s used immediately to compute attention for that token and then discarded. Only K and V get written into the cache for future reuse. Generation is just prefill-of-one-token, repeated, with the cache growing by one row of K and one row of V at each layer per step.
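A toy generation loop makes the asymmetry concrete. This is a sketch under simplifying assumptions (one layer, one head, random weights, and a hypothetical helper `attend_with_cache`), not how any particular library implements its cache.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_k = 8, 8                       # toy sizes: one head, one layer
W_Q, W_K, W_V = rng.normal(size=(3, d_model, d_k))

def attend_with_cache(x_new, cache):
    """Process one new token's hidden state x_new (d_model,) using cached K and V."""
    q = x_new @ W_Q                       # used once, never stored
    cache["K"].append(x_new @ W_K)        # one new row of K ...
    cache["V"].append(x_new @ W_V)        # ... and one new row of V per step
    K = np.stack(cache["K"])              # (t, d_k): every token so far
    V = np.stack(cache["V"])
    weights = softmax(K @ q / np.sqrt(d_k))
    return weights @ V

cache = {"K": [], "V": []}
hidden_states = rng.normal(size=(5, d_model))    # stand-ins for 5 successive tokens
outputs = [attend_with_cache(x, cache) for x in hidden_states]
print(len(cache["K"]), outputs[-1].shape)
```

Processing the tokens one at a time with the cache gives the same outputs as a single causal-attention pass over all five, which is why nothing is lost by discarding Q.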

4. Dimensionality: a complete inventory

Let me define the symbols once and then track every shape through the network.

| Symbol | Meaning | Typical value (GPT-2 small) | Typical value (modern 7B) |
| --- | --- | --- | --- |
| V | vocabulary size | 50,257 | 32,000–150,000 |
| L | number of transformer layers | 12 | 32 |
| T | sequence length (tokens in the context) | up to 1024 | up to 128K |
| d_{\text{model}} | residual stream / embedding dimension | 768 | 4096 |
| h | number of attention heads | 12 | 32 |
| d_k = d_v | per-head Q/K/V dimension, often d_{\text{model}}/h | 64 | 128 |
| d_{\text{ff}} | MLP hidden dimension, usually \sim 4 \cdot d_{\text{model}} | 3072 | 11008 |

Model parameters (frozen weights)

| Weight | Shape | What it does |
| --- | --- | --- |
| Token embedding E | V \times d_{\text{model}} | maps token IDs to vectors |
| Positional embedding (if any) | T_{\max} \times d_{\text{model}} | adds positional information |
| Per layer \ell: W_Q^{(\ell)} | d_{\text{model}} \times (h \cdot d_k) | projects hidden → all heads’ Q |
| Per layer: W_K^{(\ell)} | d_{\text{model}} \times (h \cdot d_k) | projects hidden → all heads’ K |
| Per layer: W_V^{(\ell)} | d_{\text{model}} \times (h \cdot d_v) | projects hidden → all heads’ V |
| Per layer: W_O^{(\ell)} | (h \cdot d_v) \times d_{\text{model}} | mixes head outputs back to residual |
| Per layer: MLP W_1, W_2 | d_{\text{model}} \times d_{\text{ff}}, d_{\text{ff}} \times d_{\text{model}} | non-linear transformation |
| Final LayerNorm | d_{\text{model}} | normalization scale/shift |
| Unembedding U (often E^\top) | d_{\text{model}} \times V | maps final hidden → logits |

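Excluding the embeddings, the parameter count is roughly 12 \cdot L \cdot d_{\text{model}}^2 when d_{\text{ff}} = 4 \cdot d_{\text{model}} and h \cdot d_k = d_{\text{model}}: each layer contributes 4 \cdot d_{\text{model}}^2 from the four attention matrices and 8 \cdot d_{\text{model}}^2 from the two MLP matrices (the approximation used by Kaplan et al.).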
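The count can be checked against the GPT-2 small column of the symbol table. The helper below is a rough sketch with a made-up name; it counts only the weight matrices listed above and ignores biases and LayerNorm parameters, so it lands slightly under the published figure.

```python
def count_parameters(V, T_max, L, d_model, d_ff, tied_unembedding=True):
    """Weight-matrix parameter count; biases and LayerNorm parameters are ignored."""
    attention = 4 * d_model * d_model         # W_Q, W_K, W_V, W_O with h * d_k = d_model
    mlp = 2 * d_model * d_ff                  # W_1 and W_2
    per_layer = attention + mlp
    embeddings = V * d_model + T_max * d_model
    unembedding = 0 if tied_unembedding else d_model * V
    return {
        "per_layer": per_layer,
        "layers": L * per_layer,
        "embeddings": embeddings,
        "total": L * per_layer + embeddings + unembedding,
    }

gpt2_small = count_parameters(V=50257, T_max=1024, L=12, d_model=768, d_ff=3072)
print(gpt2_small)
```

With d_ff = 4 · d_model the per-layer term is exactly 12 · d_model², about 85M across the 12 layers; the token and positional embeddings add roughly 39M more, for about 124M in total.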

Activations (recomputed every forward pass)

| Activation | Shape | Notes |
| --- | --- | --- |
| Input token IDs | T | integers in [0, V) |
| Hidden state / residual stream h^{(\ell)} | T \times d_{\text{model}} | the running “meaning” per token, threaded through layers |
| Q^{(\ell)}, K^{(\ell)}, V^{(\ell)} per head | T \times d_k each, per head | computed from h^{(\ell)} via W_Q, W_K, W_V |
| Attention scores QK^\top | T \times T per head | the routing pattern |
| Attention weights (post-softmax) | T \times T per head | row-stochastic, lower-triangular (causal mask) |
| Attention output (single head) | T \times d_v | weighted sum of V rows |
| Attention output (all heads concat) | T \times (h \cdot d_v) | usually h \cdot d_v = d_{\text{model}} |
| Output of W_O added to residual | T \times d_{\text{model}} | written back into the residual stream |
| MLP output | T \times d_{\text{model}} | also written back into the residual stream |
| Final hidden state (after layer L, LayerNorm) | T \times d_{\text{model}} | input to the unembedding |
| Logits | T \times V | one row per token position |
| Logits of the last token | V | the one that matters for next-token prediction |
| Probability distribution over vocabulary | V, sums to 1 | \text{softmax} of last-token logits |

From last-token logits to the next-token distribution

After the final layer, the hidden state of the last position h^{(L)}_T \in \mathbb{R}^{d_{\text{model}}} is multiplied by the unembedding matrix:

\ell = h^{(L)}_T \cdot U \in \mathbb{R}^{V}

These are the logits: one real number per vocabulary token, unnormalized. Softmax converts them into probabilities:

p(\text{next token} = i \mid \text{context}) = \frac{\exp(\ell_i / \tau)}{\sum_{j=1}^{V} \exp(\ell_j / \tau)}

where \tau is the sampling temperature. At \tau \to 0 this becomes greedy (\arg\max). At \tau = 1 you get the raw model distribution. The sampler then draws the next token from this distribution, appends it to the sequence, and the loop continues.
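The effect of the temperature is easy to see numerically. The snippet below is a small sketch with made-up logits for a three-token vocabulary; `next_token_distribution` is a hypothetical helper name, and the temperature is assumed to be strictly positive.

```python
import numpy as np

def next_token_distribution(logits, tau=1.0):
    """Softmax of logits / tau; tau must be positive."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()                       # subtracting the max leaves the result unchanged
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])        # made-up logits for a 3-token vocabulary
for tau in (1.0, 0.5, 0.1):
    print(tau, np.round(next_token_distribution(logits, tau), 3))

rng = np.random.default_rng(0)
sampled_id = rng.choice(len(logits), p=next_token_distribution(logits, tau=1.0))
greedy_id = int(np.argmax(logits))        # the tau -> 0 limit
```

As \tau shrinks, almost all of the probability mass moves onto the highest logit, which is why greedy decoding is described as the \tau \to 0 limit.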

KV cache size

The KV cache has total size:

\text{KV cache size} = 2 \cdot L \cdot T \cdot h \cdot d_k \cdot (\text{bytes per element})

The factor 2 is for K and V together. For a 7B model with L=32, h=32, d_k=128, at FP16 (2 bytes), 32K context:

2 \cdot 32 \cdot 32768 \cdot 32 \cdot 128 \cdot 2 \approx 17 \text{ GB}

This is why long-context inference is memory-hungry, and why DeepSeek’s Multi-head Latent Attention (which compresses K and V into a low-rank latent space) is such a big deal — it can cut this by an order of magnitude.
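The arithmetic behind the 17 GB figure can be reproduced with a one-line function. This is a sketch of the formula above for a single sequence in which every head stores its own K and V; the function name is made up.

```python
def kv_cache_bytes(L, T, h, d_k, bytes_per_element=2):
    """2 (K and V) x layers x tokens x heads x per-head dim x element size."""
    return 2 * L * T * h * d_k * bytes_per_element

size = kv_cache_bytes(L=32, T=32 * 1024, h=32, d_k=128)   # FP16, 32K context
print(f"{size / 1e9:.1f} GB ({size / 2**30:.0f} GiB)")
```

The size is linear in T, so doubling the context doubles the cache.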

5. A worked example: tiny model, tiny vocabulary

Let me build a deliberately small model so every matrix fits on a page. We’ll watch one attention step end-to-end.

Setup

  • Vocabulary size V = 20. Say the vocabulary is {the, cat, dog, sat, ran, on, mat, floor, big, small, red, blue, fast, slow, a, and, ., is, was, .EOS} — 20 tokens indexed 0–19.
  • Embedding dimension d_{\text{model}} = 5.
  • One attention head, so d_k = d_v = 5.
  • One layer (we’ll ignore MLPs for clarity).
  • Context: 3 tokens. We’re going to compute attention for the sequence the cat sat, token IDs [0, 1, 3].

Step 1: token embedding

The embedding matrix E is 20 \times 5. After training it might look like (showing only the three rows we need):

E[0]  = [ 0.10, -0.20,  0.05,  0.40,  0.15]   "the"
E[1]  = [ 0.30,  0.50, -0.10,  0.20,  0.00]   "cat"
E[3]  = [-0.40,  0.10,  0.60, -0.30,  0.20]   "sat"

After looking up these three rows, the input to layer 1 is a 3 \times 5 matrix — three tokens, each as a 5-dimensional embedding:

X = [[ 0.10, -0.20,  0.05,  0.40,  0.15],
     [ 0.30,  0.50, -0.10,  0.20,  0.00],
     [-0.40,  0.10,  0.60, -0.30,  0.20]]

This is the initial residual stream h^{(0)}, shape T \times d_{\text{model}} = 3 \times 5.

Step 2: compute Q, K, V

The weight matrices W_Q, W_K, W_V are each 5 \times 5 here.

General shape: d_{\text{model}} \times (h \cdot d_k) for W_Q, W_K and d_{\text{model}} \times (h \cdot d_v) for W_V. In this toy example we have one head with d_k = d_v = d_{\text{model}} = 5, so the shape collapses to 5 \times 5. In a real multi-head model the per-head V slice has shape T \times d_v, not T \times d_{\text{model}} — the equality holds only after concatenating all h heads.

W_V is also called the IN matrix.

Let’s say:

W_Q = [[ 1.0,  0.0,  0.0,  0.5,  0.0],
       [ 0.0,  1.0,  0.0,  0.0,  0.5],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.5,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.5,  0.0,  0.0,  1.0]]

W_K = [[ 0.8,  0.2,  0.0,  0.0,  0.0],
       [ 0.2,  0.8,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.9,  0.1,  0.0],
       [ 0.0,  0.0,  0.1,  0.9,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]

W_V = [[ 0.5,  0.5,  0.0,  0.0,  0.0],
       [ 0.5, -0.5,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]

Compute Q = X \cdot W_Q, K = X \cdot W_K, V = X \cdot W_V. Each is 3 \times 5 = T \times d_k in this single-head example. In general, per head, the shapes are T \times d_k for Q and K, and T \times d_v for V. In the three matrices below, each row is a token and each column indexes one coordinate of the per-head Q/K/V vector.

Q = [[ 0.300, -0.125,  0.050,  0.450,  0.050],   # for "the"
     [ 0.400,  0.500, -0.100,  0.350,  0.250],   # for "cat"
     [-0.550,  0.200,  0.600, -0.500,  0.250]]   # for "sat"

K = [[ 0.040, -0.140,  0.085,  0.365,  0.150],   # for "the"
     [ 0.340,  0.460, -0.070,  0.170,  0.000],   # for "cat"
     [-0.300,  0.000,  0.510, -0.210,  0.200]]   # for "sat"

V = [[-0.050,  0.150,  0.050,  0.400,  0.150],   # for "the"
     [ 0.400, -0.100, -0.100,  0.200,  0.000],   # for "cat"
     [-0.150, -0.250,  0.600, -0.300,  0.200]]   # for "sat"

(Shown to three decimals, at which every entry is exact; the principle is what matters.)

A fourth weight matrix, W_O (the OUT matrix), is also a learned parameter of the attention block. Its job is to project the concatenated per-head attention outputs back into the residual stream’s dimension and (in multi-head attention) to mix information across heads.

General shape: (h \cdot d_v) \times d_{\text{model}}. Here, with one head and d_v = d_{\text{model}} = 5, it collapses to 5 \times 5. Note: h \cdot d_v is the concatenated head-output dimension, which in most architectures equals d_{\text{model}} by design — that’s why W_O usually looks like a d_{\text{model}} \times d_{\text{model}} square in implementations.

W_O = [[ 1.0,  0.0,  0.0,  0.0,  0.0],
       [ 0.0,  1.0,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]

(For simplicity we’ve set W_O to the identity matrix here, meaning the attention output passes through unchanged. In a trained model W_O would be a learned dense matrix that performs the head-mixing and projection described above.)

Step 3: attention scores QK^\top

This is a 3 \times 3 matrix where entry (i, j) is the dot product of token i’s query with token j’s key — a similarity score between tokens (not between heads), measuring how strongly token i wants to attend to token j.

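QK^T = [[ 0.21,  0.12, -0.15],
        [ 0.10,  0.43, -0.19],
        [-0.14, -0.22,  0.63]]

Now divide by \sqrt{d_k} = \sqrt{5} \approx 2.24 (the division uses the unrounded scores, so the last digit can differ from dividing the two-decimal values above):

QK^T / sqrt(5) = [[ 0.092,  0.053, -0.067],
                  [ 0.046,  0.193, -0.087],
                  [-0.064, -0.099,  0.280]]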

Step 4: causal mask + softmax

Since we’re doing autoregressive language modeling, each token can only attend to itself and earlier tokens. We mask the upper triangle to -\infty (so softmax gives them weight 0):

masked = [[ 0.092,  -inf,  -inf],
          [ 0.046,  0.193,  -inf],
          [-0.064, -0.099,  0.280]]

Apply softmax row by row:

attention_weights = [[1.000, 0.000, 0.000],
                     [0.463, 0.537, 0.000],
                     [0.296, 0.286, 0.418]]

Read this carefully: the third row says that when computing the output for “sat”, the model attends about 30% to “the”, 29% to “cat”, and 42% to itself. This is the attention pattern, the “who attends to whom” that QK produced.

Step 5: weighted sum of V

Multiply \text{attention\_weights} \cdot V. Per head this is (T \times T) \cdot (T \times d_v) = T \times d_v. With one head and d_v = 5, the result is 3 \times 5:

attention_output = [[-0.050,  0.150,  0.050,  0.400,  0.150],   # "the" attends only to itself
                    [ 0.192,  0.016, -0.031,  0.293,  0.069],   # "cat" blends "the" + "cat"
                    [ 0.037, -0.089,  0.237,  0.050,  0.128]]   # "sat" blends all three

The third row is the interesting one: it’s a weighted blend (about 30%/29%/42%) of the three V vectors. This is the V “payload” being routed along the attention edges that QK set up. The output for “sat” now incorporates information drawn from “the” and “cat”; this mixing across positions is how attention lets one token’s representation depend on others, at any distance.

(With multiple heads, you’d have h such T \times d_v blocks, one per head, computed in parallel from independent Q^{(j)}, K^{(j)}, V^{(j)} slices. They get concatenated along the last axis into a single T \times (h \cdot d_v) tensor before the next step. Section 6 carries out exactly this multi-head case numerically.)

Step 6: through W_O, back into the residual stream

The concatenated attention output (here just one head, so “concatenation” is a no-op) is projected by W_O:

h^{(1)} = h^{(0)} + \text{attention\_output} \cdot W_O

Shape arithmetic: (T \times (h \cdot d_v)) \cdot ((h \cdot d_v) \times d_{\text{model}}) = T \times d_{\text{model}}. In our single-head toy, that’s (3 \times 5) \cdot (5 \times 5) = 3 \times 5.

This is the residual addition — the attention output is added to the previous residual stream, not replacing it. The hidden state h^{(1)} is still shape T \times d_{\text{model}} = 3 \times 5.

In a real model, MLP layers would now run on top, also reading from and writing to the residual stream. We’d repeat the whole thing L times. Here we have one layer, so we go straight to the output. (Section 6 adds the MLP and a second layer explicitly.)
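Steps 2 to 6 can be recomputed directly from X and the weight matrices given above. The script below is a minimal NumPy sketch of that chain for the single-head toy model; variable names follow the text, and the printed values are rounded to three decimals.

```python
import numpy as np

X = np.array([[ 0.10, -0.20,  0.05,  0.40,  0.15],    # "the"
              [ 0.30,  0.50, -0.10,  0.20,  0.00],    # "cat"
              [-0.40,  0.10,  0.60, -0.30,  0.20]])   # "sat"
W_Q = np.array([[1.0, 0.0, 0.0, 0.5, 0.0],
                [0.0, 1.0, 0.0, 0.0, 0.5],
                [0.0, 0.0, 1.0, 0.0, 0.0],
                [0.5, 0.0, 0.0, 1.0, 0.0],
                [0.0, 0.5, 0.0, 0.0, 1.0]])
W_K = np.array([[0.8, 0.2, 0.0, 0.0, 0.0],
                [0.2, 0.8, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.9, 0.1, 0.0],
                [0.0, 0.0, 0.1, 0.9, 0.0],
                [0.0, 0.0, 0.0, 0.0, 1.0]])
W_V = np.array([[0.5,  0.5, 0.0, 0.0, 0.0],
                [0.5, -0.5, 0.0, 0.0, 0.0],
                [0.0,  0.0, 1.0, 0.0, 0.0],
                [0.0,  0.0, 0.0, 1.0, 0.0],
                [0.0,  0.0, 0.0, 0.0, 1.0]])
W_O = np.eye(5)                                        # identity, as in the text

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                    # Step 2
scores = Q @ K.T / np.sqrt(Q.shape[1])                 # Step 3: scaled QK^T
future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)             # Step 4: causal mask
weights = np.exp(masked - masked.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)          # Step 4: row-wise softmax
attention_output = weights @ V                         # Step 5
h1 = X + attention_output @ W_O                        # Step 6: residual addition

np.set_printoptions(precision=3, suppress=True)
for name, value in [("Q", Q), ("K", K), ("V", V), ("weights", weights), ("h1", h1)]:
    print(name, value, sep="\n")
```

Every intermediate keeps the shape the text predicts: Q, K, V and h1 are 3 × 5, and the weights are 3 × 3 with zeros above the diagonal.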

Step 7: final hidden state → logits → next-token distribution

We only care about the next token, so we take the last row of h^{(1)}: a single 5-vector, the final hidden state for “sat”. Carrying the computation through from X and the weights above (with W_O the identity, so the attention output is added unchanged) gives:

h_last = [-0.363,  0.011,  0.837, -0.250,  0.328]

Multiply by the unembedding matrix U, shape 5 \times 20 (one column per vocabulary token):

\ell = h_{\text{last}} \cdot U \in \mathbb{R}^{20}

Result might look like:

logits = [-0.8, -0.2,  0.1,  0.5,  1.2,  2.4,  3.1,  2.8,  0.4, -0.1,
          -0.3,  0.0, -0.5, -0.4,  0.6,  0.2,  1.0, -0.2,  0.3, -1.5]
            the  cat  dog  sat  ran   on   mat floor  big small  red ...

The model thinks “on” (logit 2.4), “mat” (3.1), “floor” (2.8) are the most likely continuations of “the cat sat”. Applying softmax at temperature 1:

P("mat"   | "the cat sat") ≈ 0.31
P("floor" | "the cat sat") ≈ 0.23
P("on"    | "the cat sat") ≈ 0.16
P(others)                  ≈ 0.30 in total

This is the conditional distribution p(\text{next} \mid \text{context}) that the language model has been trained to approximate. The sampler picks a token from this — greedy would pick “mat” — appends it to the sequence, and the next forward pass begins.
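The softmax over the twenty illustrative logits can be checked in a few lines. The sketch below reuses the vocabulary and logits from this example; nothing in it is trained, and the variable names are chosen for the illustration.

```python
import numpy as np

vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat", "floor", "big", "small",
         "red", "blue", "fast", "slow", "a", "and", ".", "is", "was", ".EOS"]
logits = np.array([-0.8, -0.2, 0.1, 0.5, 1.2, 2.4, 3.1, 2.8, 0.4, -0.1,
                   -0.3, 0.0, -0.5, -0.4, 0.6, 0.2, 1.0, -0.2, 0.3, -1.5])

probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax at temperature 1

top3 = np.argsort(probs)[::-1][:3]             # indices of the three largest probabilities
for i in top3:
    print(f"{vocab[i]:>6s}  {probs[i]:.2f}")

greedy_token = vocab[int(np.argmax(probs))]    # what greedy decoding would pick
```

Note how three tokens with logits only 0.3 to 0.7 apart still split the mass quite unevenly: the exponential amplifies small logit gaps.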

Final step calculations in detail.

The setup

After the last transformer layer (L), the residual stream is a T \times d_{\text{model}} tensor — one d_{\text{model}}-dimensional vector per token position. Call it h^{(L)}.

At inference time, when you want to predict the next token, you only care about the last row: h^{(L)}_T \in \mathbb{R}^{d_{\text{model}}}. This vector is the model’s final, fully-processed representation of “what comes next” given the entire context so far.

The two operations

1. Final LayerNorm (or RMSNorm). Before unembedding, virtually all modern transformers apply one last normalization to the residual stream. It rescales the vector so its components have controlled magnitude, then applies a learned per-dimension scale (and sometimes shift):

\tilde{h} = \text{LayerNorm}(h^{(L)}_T)

The shape stays d_{\text{model}}. This step is easy to forget but it matters — without it, the magnitudes coming out of the residual stream would be wild, since every layer has been adding to it.
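For reference, here is a minimal sketch of the two normalizations applied to one d_{\text{model}}-vector. The residual vector is made up, the gain and bias are left at their untrained defaults, and the `eps` value is a common choice rather than anything specified in this note.

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    """Normalise one d_model-vector to zero mean and unit variance, then scale and shift."""
    mean = h.mean()
    var = h.var()
    return gain * (h - mean) / np.sqrt(var + eps) + bias

def rms_norm(h, gain, eps=1e-5):
    """RMSNorm: rescale by the root mean square, with no mean subtraction and no shift."""
    return gain * h / np.sqrt(np.mean(h ** 2) + eps)

h = np.array([3.0, -1.0, 12.0, -8.0, 4.0, 2.0])   # made-up residual vector, d_model = 6
gain, bias = np.ones(6), np.zeros(6)              # untrained defaults
h_tilde = layer_norm(h, gain, bias)
print(h_tilde.shape, round(float(h_tilde.std()), 3))
```

Whatever magnitude the residual stream has accumulated, the normalized vector has controlled scale before it meets the unembedding; the learned gain then decides how much magnitude each dimension keeps.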

2. Unembedding (the linear projection to vocabulary). The normalized vector is multiplied by the unembedding matrix U \in \mathbb{R}^{d_{\text{model}} \times V}:

\ell = \tilde{h} \cdot U \in \mathbb{R}^{V}

Each column of U is a d_{\text{model}}-dimensional vector — one per vocabulary token. The matrix multiplication computes, for every vocabulary token i, the dot product between the final hidden state and that token’s column:

\ell_i = \tilde{h} \cdot U_{:,i}

So each logit \ell_i is literally a similarity score between the final hidden state and the i-th vocabulary token’s representation in U. Tokens whose column points in the same direction as \tilde{h} get high logits; tokens whose column points elsewhere get low logits.

Interpreting the similarity score

It is tempting to read this one step further and say: the model is trained to produce a last-token vector that points in the same direction as the embedding vectors of the likely next tokens. That intuition is sound, and it is exactly the geometry the third note develops in full — so here we only state the result and move on.

Training (gradient descent on the cross-entropy loss) shapes \tilde{h} to have high dot product with the columns of likely next tokens and low dot product with the rest. Under weight tying (U = E^\top, below), those columns are the input embedding vectors, so the picture is almost literal. Two refinements keep it honest, both elaborated in Note 3: the vector does not point at one token but positions itself among many plausible continuations at once (which is how the model expresses uncertainty), and its magnitude — not just its direction — matters, because a longer vector sharpens the softmax (more confident) and a shorter one flattens it (less sure). The compact statement:

The model learns to produce a final residual-stream vector whose direction encodes which next tokens are likely and whose magnitude encodes how confident the prediction is.

This is the foundation of the logit lens, which applies the unembedding U to intermediate residual streams to ask what the model would predict if forced to commit early; it works because the residual stream lives, throughout the network, in the same space the final read-out uses. Note 3 turns this whole picture — the last step as a similarity search over the vocabulary — into its central theme.

Weight tying

A detail worth knowing: in many models (GPT-2 is the classic example), U is the same matrix as the token embedding E, specifically U = E^\top. This is called weight tying. The intuition is elegant: the embedding matrix maps token ID → vector (each row is a token’s representation); the unembedding maps vector → token logits (each column is a token’s representation). It’s the same dictionary, used in two directions.

Weight tying cuts parameter count noticeably — that matrix is V \times d_{\text{model}}, which can be hundreds of millions of parameters for large vocabularies — and empirically it often improves quality.

Not all models tie weights (some recent ones keep them separate to give the unembedding more flexibility), but it’s a very common default.
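In code, weight tying amounts to reusing one array in two roles. The sketch below uses the toy sizes from this section with a random matrix standing in for a trained E, so the numbers themselves carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 20, 5                         # sizes of the toy model in this section
E = rng.normal(size=(V, d_model))          # random stand-in for a trained embedding matrix
U = E.T                                    # weight tying: a transposed view, not a copy

token_id = 6
x = E[token_id]                            # embedding lookup: one row of E
logits = x @ U                             # unembedding: one dot product per column of U

print(np.shares_memory(E, U))              # both names refer to the same parameters
print(logits.shape)
```

Because `U` is only a view of `E`, an update to one is an update to the other, and the logit for token i is simply the dot product of the hidden vector with row i of E.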

From logits to a distribution

Logits are just real numbers, unnormalized. They can be negative, can be huge — they don’t sum to anything meaningful on their own. To turn them into a probability distribution over the vocabulary, apply softmax:

p_i = \frac{\exp(\ell_i / \tau)}{\sum_{j=1}^{V} \exp(\ell_j / \tau)}

where \tau is the sampling temperature. The sampler then draws the next token ID from this distribution (or picks \arg\max for greedy decoding), and the next forward pass begins.

The compact picture

\underbrace{h^{(L)}_T}_{d_{\text{model}}} \xrightarrow{\text{LayerNorm}} \underbrace{\tilde{h}}_{d_{\text{model}}} \xrightarrow{\times U} \underbrace{\ell}_{V} \xrightarrow{\text{softmax}} \underbrace{p}_{V}

One normalization, one matrix multiplication, and one softmax turn the last hidden state into a probability distribution over the entire vocabulary. The residual stream did the heavy lifting; the unembedding just reads out the answer.

Two things to keep in mind

The residual stream “decides” everything before the unembedding. The unembedding is a fixed linear readout — it has no capacity to think, only to project. By the time you reach U, all the work of conditioning on the context has already been done by the L transformer layers writing into h^{(L)}_T. The unembedding is a translator from “internal representation space” to “vocabulary space.”

During training, you compute logits for every position. At inference you only need the last row, but training computes \ell for all T positions in parallel — each row predicts its successor — so the loss can be evaluated everywhere at once (teacher forcing). That’s why the full logits tensor in training has shape T \times V, while at inference you typically only materialize the last row.

6. A second worked example: two heads, two layers, with an MLP

The first example deliberately stripped the model down to a single head and a single layer, and skipped the MLP entirely. That was the right move for seeing attention clearly, but it hides three things a real Transformer does on every forward pass: it splits the work across several heads, it refines each token with an MLP, and it stacks layers so the residual stream is processed again and again. This example puts all three back, kept just small enough to do by hand.

The point of this section is dimensionality: watching the shape of the data at every step as it flows through two complete layers. The numbers below are real (computed and checked), but don’t memorize them — the weights are random, not trained, so the specific final token is meaningless. Follow the shapes. (Every matrix is shown rounded to two decimals, so re-adding the displayed intermediates by hand may differ from a shown result by \pm 0.01; the full-precision computation is consistent.)

Setup

  • d_{\text{model}} = 6.
  • h = 2 heads, so the per-head dimension is d_k = d_v = d_{\text{model}} / h = 6 / 2 = 3.
  • L = 2 layers, each one a full attention block + MLP block.
  • MLP hidden size d_{\text{ff}} = 8. (Real models use \sim 4 \cdot d_{\text{model}} = 24; we shrink it so the matrices stay on the page.)
  • ReLU nonlinearity in the MLP: \text{ReLU}(z) = \max(0, z), applied element-wise — it simply zeros out negatives.
  • Context: the same 3 tokens, the cat sat, so T = 3.

Here is the whole journey as a shape table. Everything below just fills in the numbers for these rows.

| Step | Operation | Output shape |
|---|---|---|
| token IDs | lookup | [T] = [3] |
| h^{(0)} | embedding + positional | T \times d_{\text{model}} = 3 \times 6 |
| Q, K, V (all heads) | h^{(0)} W_Q, etc. | 3 \times 6 each |
| per-head Q, K, V | slice into h=2 blocks | 3 \times 3 each, ×2 |
| scores | QK^\top / \sqrt{d_k} per head | 3 \times 3 per head |
| weights | mask + softmax | 3 \times 3 per head |
| head output | weights \cdot V | 3 \times 3 per head |
| concat | join 2 heads | 3 \times (h \cdot d_v) = 3 \times 6 |
| attn write-back | concat \cdot W_O | 3 \times 6 |
| h_{\text{mid}} | h^{(0)} + \text{attn} | 3 \times 6 |
| MLP pre-activation | h_{\text{mid}} W_1 | 3 \times d_{\text{ff}} = 3 \times 8 |
| MLP activation | \text{ReLU} | 3 \times 8 |
| MLP output | \cdot W_2 | 3 \times 6 |
| h^{(1)} | h_{\text{mid}} + \text{MLP} | 3 \times 6 |
| … repeat for layer 2 … | | |
| h^{(2)} | final hidden state | 3 \times 6 |
| last row \cdot U | unembedding | [V] |
| softmax | distribution | [V] |

Step 1: embedding + positional → h^{(0)}

Each token’s 6-dimensional embedding, plus a positional embedding for its slot, gives the initial residual stream h^{(0)}, shape 3 \times 6:

h⁰ = [[ 0.10, -0.10,  0.05,  0.30,  0.15, -0.05],   # the  (pos 1)
      [ 0.40,  0.50, -0.20,  0.20,  0.10,  0.25],   # cat  (pos 2)
      [-0.50,  0.15,  0.65, -0.20,  0.15,  0.05]]   # sat  (pos 3)

Step 2: project to Q, K, V, then split into heads

W_Q is now 6 \times 6 (general shape d_{\text{model}} \times (h \cdot d_k) = 6 \times 6), and likewise W_K, W_V. Multiplying Q = h^{(0)} \cdot W_Q gives a 3 \times 6 matrix. The crucial new idea: those 6 columns are two heads’ worth of Q stacked side by side — columns 1–3 are head 1’s query, columns 4–6 are head 2’s. The vertical bar marks the split:

Q = [[-0.24, -0.27, -0.43 | -0.05, -0.30, -0.01],   # the
     [-0.37,  0.29, -0.61 | -0.56,  0.17, -0.13],   # cat
     [ 0.20, -0.18, -0.00 |  0.37, -0.14,  0.09]]   # sat
       └──── head 1 ────┘   └──── head 2 ────┘

K and V are computed and split the same way. Each head now has its own 3 \times 3 query, key, and value. This is what “multi-head” means: one matrix multiply, then carve the result into h independent lanes, each running the attention formula on its own slice.
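The “carve into lanes” step is just a reshape. A small sketch using the Q matrix displayed above (two-decimal values; the variable names are mine):

```python
import numpy as np

# Q as displayed in Step 2 (rounded to two decimals); rows: the, cat, sat.
Q = np.array([
    [-0.24, -0.27, -0.43, -0.05, -0.30, -0.01],
    [-0.37,  0.29, -0.61, -0.56,  0.17, -0.13],
    [ 0.20, -0.18, -0.00,  0.37, -0.14,  0.09],
])
T, d_model = Q.shape
n_heads = 2                      # h in the text
d_k = d_model // n_heads         # 6 / 2 = 3

# Reshape (T, h*d_k) -> (T, h, d_k), then put the head axis first: (h, T, d_k).
Q_heads = Q.reshape(T, n_heads, d_k).transpose(1, 0, 2)
Q1, Q2 = Q_heads                 # each is T x d_k = 3 x 3
print(Q_heads.shape)             # (2, 3, 3)
```

`Q1` is the left block of columns and `Q2` the right block, exactly the two sides of the vertical bar. The same reshape would be applied to K and V.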

Step 3: attention inside each head

Run Steps 3–5 of the previous example separately in each lane.

Head 1. Scaled scores Q_1 K_1^\top / \sqrt{3}, then causal mask and softmax:

scores₁ = [[-0.01,  -inf,  -inf],          weights₁ = [[1.00, 0.00, 0.00],
           [ 0.04,  0.07,  -inf],     →               [0.49, 0.51, 0.00],
           [-0.01, -0.04, -0.07]]                     [0.34, 0.33, 0.32]]

Weighted sum of head 1’s V gives head 1’s output, shape 3 \times 3:

head₁_out = [[-0.12, -0.04, -0.07],
             [-0.15,  0.40, -0.07],
             [-0.12,  0.20, -0.20]]

Head 2. Same procedure on head 2’s slice, giving its own pattern and output:

weights₂ = [[1.00, 0.00, 0.00],         head₂_out = [[ 0.06, -0.03,  0.01],
            [0.45, 0.55, 0.00],     →                [ 0.04, -0.06, -0.05],
            [0.36, 0.32, 0.32]]                      [-0.16,  0.11,  0.03]]

Notice the two heads produce different attention patterns from the same input — head 1 weights “cat” slightly more on row 2 (0.51), head 2 weights it more strongly (0.55). Each head is free to specialize. This is the structural fact the interpretability experiments hinge on: distinct heads can do distinct jobs, and you can study them one at a time.
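To check the mask-then-softmax step by hand, here is a sketch that starts from head 1's displayed scores. The helper name is hypothetical, and the entries that the text shows as -inf are entered as 0.0 placeholders because the mask overwrites them anyway:

```python
import numpy as np

def causal_softmax(scores):
    """Mask future positions with -inf, then softmax each row."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)   # stable; the diagonal is never masked
    e = np.exp(masked)                                     # exp(-inf) = 0
    return e / e.sum(axis=1, keepdims=True)

# Head 1 scaled scores from the text (two decimals); upper triangle is a placeholder.
scores1 = np.array([
    [-0.01,  0.00,  0.00],
    [ 0.04,  0.07,  0.00],
    [-0.01, -0.04, -0.07],
])
weights1 = causal_softmax(scores1)
print(np.round(weights1, 2))
```

Rounded to two decimals this reproduces the weights₁ matrix shown above; each row sums to 1 before rounding, which is why the displayed last row adds up to 0.99.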

Step 4: concatenate the heads

Glue the two 3 \times 3 head outputs back together, side by side, into one 3 \times (h \cdot d_v) = 3 \times 6 matrix:

concat = [[-0.12, -0.04, -0.07 |  0.06, -0.03,  0.01],
          [-0.15,  0.40, -0.07 |  0.04, -0.06, -0.05],
          [-0.12,  0.20, -0.20 | -0.16,  0.11,  0.03]]
            └─ head 1 out ──┘    └─ head 2 out ──┘
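In NumPy the gluing is a single call. A sketch with the two displayed head outputs:

```python
import numpy as np

head1_out = np.array([[-0.12, -0.04, -0.07],
                      [-0.15,  0.40, -0.07],
                      [-0.12,  0.20, -0.20]])
head2_out = np.array([[ 0.06, -0.03,  0.01],
                      [ 0.04, -0.06, -0.05],
                      [-0.16,  0.11,  0.03]])

# Join along the feature axis (axis=1): head 1 fills columns 0-2, head 2 columns 3-5.
concat = np.concatenate([head1_out, head2_out], axis=1)
print(concat.shape)   # (3, 6)
```

Nothing is mixed yet: each head's numbers sit untouched in their own block of columns. The mixing is the job of W_O in the next step.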

Step 5: W_O and the residual addition

W_O has shape (h \cdot d_v) \times d_{\text{model}} = 6 \times 6. It mixes the two heads’ information together and maps it back to the residual width. The product \text{concat} \cdot W_O is the attention block’s write-back, shape 3 \times 6:

attn = [[-0.06,  0.10, -0.11,  0.05,  0.06,  0.01],
        [-0.03, -0.09, -0.10, -0.15,  0.30,  0.08],
        [ 0.19, -0.20,  0.23,  0.09,  0.34, -0.10]]

Add it to the residual stream — h_{\text{mid}} = h^{(0)} + \text{attn}, still 3 \times 6:

h_mid = [[ 0.04, -0.00, -0.06,  0.35,  0.21, -0.04],
         [ 0.37,  0.41, -0.30,  0.05,  0.40,  0.33],
         [-0.31, -0.05,  0.88, -0.11,  0.49, -0.05]]

Step 6: the MLP block

Now the piece the first example skipped. The MLP acts on the residual stream one token at a time: the same W_1, W_2 applied to every row independently, with no interaction between positions. (Hold onto that fact; Appendix A turns on it.)

First, expand from width 6 to width d_{\text{ff}} = 8 via W_1 (shape 6 \times 8): \text{pre} = h_{\text{mid}} \cdot W_1, shape 3 \times 8:

pre = [[ 0.16, -0.07, -0.14,  0.10, -0.12, -0.05, -0.06, -0.29],
       [ 0.17, -0.38,  0.18,  0.24, -0.11, -0.26,  0.35, -0.24],
       [-0.02,  0.22,  0.51, -0.66, -0.29, -0.36, -0.45,  0.36]]

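Apply ReLU: every negative entry becomes 0. This is the MLP block’s only nonlinearity (the attention block has its own, the softmax), and it’s what lets the MLP do more than a plain matrix multiply, since without it W_1 and W_2 would collapse into a single linear map: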

act = [[0.16, 0.00, 0.00, 0.10, 0.00, 0.00, 0.00, 0.00],
       [0.17, 0.00, 0.18, 0.24, 0.00, 0.00, 0.35, 0.00],
       [0.00, 0.22, 0.51, 0.00, 0.00, 0.00, 0.00, 0.36]]

Then contract back from width 8 to width 6 via W_2 (shape 8 \times 6): \text{mlp} = \text{act} \cdot W_2, shape 3 \times 6:

mlp = [[ 0.07,  0.03, -0.04, -0.05,  0.08,  0.07],
       [ 0.08,  0.07, -0.22, -0.48,  0.24,  0.05],
       [ 0.79,  0.29, -0.27, -0.34,  0.26, -0.42]]

Add it back to the residual stream — h^{(1)} = h_{\text{mid}} + \text{mlp}, shape 3 \times 6. Layer 1 is now complete:

h¹ = [[ 0.12,  0.03, -0.10,  0.31,  0.29,  0.04],
      [ 0.45,  0.48, -0.52, -0.43,  0.64,  0.38],
      [ 0.47,  0.24,  0.61, -0.45,  0.75, -0.47]]

Note the shape went 6 \to 8 \to 6 inside the MLP: the residual stream stays width 6 everywhere; the width-8 expansion happens only inside the block and is contracted away before the write-back. The residual stream’s width d_{\text{model}} is invariant — that constancy is what lets every block read and write the same T \times d_{\text{model}} tape.
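The two steps that can be checked from the displayed numbers alone are the ReLU and the residual addition. W_1 and W_2 are not printed in this note, so the sketch below starts from the displayed `pre`, `h_mid`, and `mlp` matrices rather than recomputing them:

```python
import numpy as np

pre = np.array([
    [ 0.16, -0.07, -0.14,  0.10, -0.12, -0.05, -0.06, -0.29],
    [ 0.17, -0.38,  0.18,  0.24, -0.11, -0.26,  0.35, -0.24],
    [-0.02,  0.22,  0.51, -0.66, -0.29, -0.36, -0.45,  0.36],
])
act = np.maximum(pre, 0.0)        # ReLU, element-wise: negatives become 0

h_mid = np.array([
    [ 0.04, -0.00, -0.06,  0.35,  0.21, -0.04],
    [ 0.37,  0.41, -0.30,  0.05,  0.40,  0.33],
    [-0.31, -0.05,  0.88, -0.11,  0.49, -0.05],
])
mlp = np.array([
    [ 0.07,  0.03, -0.04, -0.05,  0.08,  0.07],
    [ 0.08,  0.07, -0.22, -0.48,  0.24,  0.05],
    [ 0.79,  0.29, -0.27, -0.34,  0.26, -0.42],
])
h1 = h_mid + mlp                  # residual addition; the width stays 6
print(act.shape, h1.shape)        # (3, 8) (3, 6)
```

`h1` agrees with the h¹ shown above to within the \pm 0.01 rounding slack mentioned at the start of the section.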

Step 7: layer 2 (same shapes, new weights)

Layer 2 has its own W_Q, W_K, W_V, W_O, W_1, W_2, but the shapes and the procedure are identical to layer 1. It reads h^{(1)}, runs two-head attention, writes back, runs its MLP, writes back, and produces h^{(2)}. We show only the write-backs and the result:

attn² (write-back)   h_mid² = h¹ + attn²        mlp²                 h² = h_mid² + mlp²
[[-0.09,-0.27, 0.01,  [[ 0.02,-0.24,-0.09,       [[-0.04,-0.27,-0.37,  [[-0.02,-0.51,-0.46,
  -0.10,-0.02,-0.08],    0.21, 0.27,-0.05],         -0.66, 0.06, 0.18],   -0.46, 0.33, 0.14],
 [-0.20,-0.48, 0.08,   [ 0.25, 0.01,-0.44,        [-0.17,-0.40,-0.58,   [ 0.08,-0.39,-1.01,
   0.12,-0.06,-0.39],    -0.31, 0.58,-0.02],        -0.93, 0.17, 0.33],   -1.24, 0.75, 0.32],
 [-0.46,-0.16, 0.31,   [ 0.02, 0.08, 0.92,        [-0.22,-0.53,-1.20,   [-0.21,-0.45,-0.28,
   0.21,-0.28,-0.47]]    -0.24, 0.47,-0.94]]        -2.21, 0.90, 0.26]]   -2.45, 1.37,-0.67]]

Each of these is 3 \times 6. After L = 2 layers the residual stream is still 3 \times 6 — the same shape it started as. Every layer reshaped the contents, never the shape.
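The whole two-layer pass can be sketched compactly. This is a simplified illustration under stated assumptions: the weights are freshly drawn random values (not the ones behind the numbers above), there is no LayerNorm and no bias, and all function and variable names are mine:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_block(h, W_Q, W_K, W_V, W_O, n_heads):
    T, d_model = h.shape
    d_k = W_Q.shape[1] // n_heads
    # One multiply each, then split into heads: (n_heads, T, d_k).
    split = lambda M: M.reshape(T, n_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(h @ W_Q), split(h @ W_K), split(h @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)         # (n_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    weights = softmax(np.where(future, -np.inf, scores))     # causal mask + softmax
    heads = weights @ V                                      # (n_heads, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, n_heads * d_k)
    return concat @ W_O                                      # write-back, (T, d_model)

def mlp_block(h, W_1, W_2):
    return np.maximum(h @ W_1, 0.0) @ W_2                    # expand, ReLU, contract

def forward(h0, layers, n_heads):
    h, shapes = h0, [h0.shape]
    for p in layers:
        h = h + attention_block(h, p["W_Q"], p["W_K"], p["W_V"], p["W_O"], n_heads)
        h = h + mlp_block(h, p["W_1"], p["W_2"])
        shapes.append(h.shape)
    return h, shapes

rng = np.random.default_rng(0)
T, d_model, n_heads, d_ff, n_layers = 3, 6, 2, 8, 2

def init(*shape):
    return rng.normal(scale=0.5, size=shape)

layers = [
    {"W_Q": init(d_model, d_model), "W_K": init(d_model, d_model),
     "W_V": init(d_model, d_model), "W_O": init(d_model, d_model),
     "W_1": init(d_model, d_ff), "W_2": init(d_ff, d_model)}
    for _ in range(n_layers)
]
h0 = init(T, d_model)
h_final, shapes = forward(h0, layers, n_heads)
print(shapes)   # the residual stream's shape after each layer
```

The list `shapes` holds the same (3, 6) entry three times: once for h⁰ and once after each layer. Only the contents change.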

Step 8: read out the last token

Exactly as in Section 5. Take the last row of h^{(2)} (the representation for “sat”, now informed by two full layers of attention + MLP):

h²_last = [-0.21, -0.45, -0.28, -2.45,  1.37, -0.67]

Multiply by the unembedding U (shape d_{\text{model}} \times V = 6 \times V) to get logits, then softmax. Showing the first 10 vocabulary columns:

logits = [-0.48,  0.27,  2.01, -0.91,  0.35, -1.04, -1.80,  1.03,  0.60,  0.36]
probs  = [ 0.04,  0.07,  0.42,  0.02,  0.08,  0.02,  0.01,  0.16,  0.10,  0.08]

(With random, untrained weights the specific winner carries no meaning — what matters is that the pipeline produced a clean [V]-shaped distribution that sums to 1.)

What this example added

Reading the shape table top to bottom, three things are now visible that the single-head example could not show:

  1. Heads are lanes. One W_Q multiply produces all heads at once; the result is sliced into h independent T \times d_k blocks, each running attention alone, then concatenated and remixed by W_O. The width bookkeeping is d_{\text{model}} \to (h \cdot d_k) \to d_{\text{model}}, and here h \cdot d_k = 2 \cdot 3 = 6 = d_{\text{model}} exactly.
  2. The MLP is a per-token refinery. It expands each token’s vector to d_{\text{ff}}, applies ReLU, contracts back, and adds the result to the residual — touching each position in isolation. The width bookkeeping is d_{\text{model}} \to d_{\text{ff}} \to d_{\text{model}}.
  3. Layers stack without changing shape. The residual stream is a T \times d_{\text{model}} tape that every block reads from and writes to additively. Stacking L layers just repeats the read–transform–write cycle; the shape is invariant from h^{(0)} to h^{(L)}.

7. Recap of the data flow

Shapes shown per head where the per-head structure matters; h \cdot d_k = h \cdot d_v = d_{\text{model}} in most architectures, but they are conceptually distinct.

flowchart TB
    A["token IDs &nbsp; [T]"] -->|"embedding lookup<br/>E: V × d_model"| B["hidden state h⁰ &nbsp; [T × d_model]<br/>initial residual stream"]
    B -->|"project via W_Q, W_K: d_model × (h·d_k),<br/>W_V: d_model × (h·d_v)"| C["Q, K [T × d_k]; V [T × d_v]<br/>per head"]
    C -->|"per-head QKᵀ / √d_k"| D["attention scores &nbsp; [T × T] per head"]
    D -->|"causal mask + softmax"| E["attention weights &nbsp; [T × T] per head<br/>rows sum to 1"]
    E -->|"multiply by V, per head"| F["per-head output &nbsp; [T × d_v]"]
    F -->|"concatenate the h heads"| G["concat output &nbsp; [T × (h·d_v)]"]
    G -->|"W_O: (h·d_v) × d_model,<br/>add to residual"| H["hidden state after attention &nbsp; [T × d_model]"]
    H -->|"MLP: W₁, ReLU, W₂,<br/>add to residual"| I["hidden state h¹ &nbsp; [T × d_model]<br/>after MLP block"]
    I -->|"more layers ... eventually layer L"| J["final hidden state &nbsp; [T × d_model]"]
    J -->|"take last row × U: d_model × V"| K["logits (last token) &nbsp; [V]"]
    K -->|"softmax"| L["next-token distribution &nbsp; [V]<br/>sums to 1"]
    L -->|"sample"| M["next token ID &nbsp; scalar in [0, V)"]

Things to keep clear in your head:

  1. Q is one-shot, K and V persist. That’s why the cache is “KV” — Q for past tokens is never needed again.
  2. The residual stream is the spine. Every attention and MLP block reads from it and writes back to it (additively). All the “thinking” passes through this T \times d_{\text{model}} tensor.
  3. Per-head V has shape T \times d_v, not T \times d_{\text{model}}. The full-width T \times d_{\text{model}} shape only appears after concatenating the h heads (and only if h \cdot d_v = d_{\text{model}}, which is the usual but not mandatory choice). W_O is the operator that maps that concatenated block back into the residual.
  4. Only the last token’s logits matter for next-token prediction at inference time. During training you compute logits for every position (to predict each next token in parallel), but at generation time you only need the last.
  5. Logits are unnormalized; softmax produces the actual distribution. Temperature, top-k, top-p sampling all operate on logits or the resulting distribution to control output diversity.
  6. Attention is the only cross-token channel; the MLP is per-position. Attention blocks are the only place where one token’s representation can be influenced by another’s. MLP blocks refine each token in isolation. So every relational thing a model does — agreement, coreference, copying, “don’t repeat the previous speaker” — must be carried by attention. Appendix A makes this precise by switching attention off; Appendix B works a small CNN by hand to sharpen the filter-vs-head contrast from Section 1.
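Point 1 can be made concrete with a small sketch of a KV cache for a single head in a single layer. The sizes, the random stand-in for the residual rows, and the variable names are assumptions for illustration; a real model keeps one such cache per layer and per head:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 6, 3            # illustrative sizes, one head only
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
h = rng.normal(size=(T, d_model))    # stand-in residual rows, one per token

K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
outputs = []
for t in range(T):
    q = h[t] @ W_Q                                 # used once, then discarded
    K_cache = np.vstack([K_cache, h[t] @ W_K])     # keys persist
    V_cache = np.vstack([V_cache, h[t] @ W_V])     # values persist
    w = softmax(K_cache @ q / np.sqrt(d_k))        # attend over the history so far
    outputs.append(w @ V_cache)
cached = np.stack(outputs)

# Reference: full causal attention computed in one shot.
Q, K, V = h @ W_Q, h @ W_K, h @ W_V
scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
e = np.exp(scores - scores.max(axis=1, keepdims=True))
full = (e / e.sum(axis=1, keepdims=True)) @ V
print(np.allclose(cached, full))
```

The step-by-step loop and the one-shot computation give the same head output, yet the loop never stores a query: each `q` is consumed in the step that creates it, while every row of K and V is read again at every later step.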

Appendix A: What if attention heads were disabled?

A Transformer, stripped to its skeleton, is a stack of MLP blocks with attention blocks inserted between them, all communicating through one residual stream. A classic feed-forward neural network is essentially just the MLP part. So a natural question — and a useful one for understanding what attention buys you — is: what happens if we switch the attention off? There are two clean ways to do it, and they arrive at the same destination.

The baseline: a network with only MLP blocks

Recall from Section 6 that an MLP block processes the residual stream one token at a time: the same W_1, ReLU, W_2 applied to each row independently, with no reference to any other position. So a network built only from MLP blocks processes every token in complete isolation. It can learn a fixed mapping “this token (at this position) → that output,” i.e. per-token and position-conditioned statistics — but it has no mechanism for one token to influence another’s representation. Whatever “sat” becomes, it becomes without ever consulting “the” or “cat.”

Method 1: remove the attention layers entirely

Delete the attention sub-blocks and wire the embeddings straight into the MLP stack. The update rule becomes simply

h^{(\ell+1)} = h^{(\ell)} + \text{MLP}(h^{(\ell)}), \quad \text{per token, no mixing.}

This is the pure-MLP baseline. In the running example, the representation of “sat” can now never absorb anything from “the” or “cat”: each row of the residual stream (one token position) is processed down its own private pipe. A relational rule is therefore impossible: anything that requires comparing two positions, for instance “the next item must differ from the current one”, cannot be expressed, because the two positions never meet.
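A small experiment makes the isolation visible. The sketch assumes random weights, reuses the same W_1, W_2 in both layers for brevity, and uses random vectors as stand-ins for the three tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_ff = 3, 6, 8
W_1 = rng.normal(size=(d_model, d_ff))
W_2 = rng.normal(size=(d_ff, d_model))

def mlp_only_layer(h):
    """h + MLP(h): every row (token) is transformed on its own."""
    return h + np.maximum(h @ W_1, 0.0) @ W_2

h = rng.normal(size=(T, d_model))          # rows: the, cat, sat (random stand-ins)
out = mlp_only_layer(mlp_only_layer(h))    # two stacked MLP-only layers

h_changed = h.copy()
h_changed[0] += 5.0                        # perturb "the" only
out_changed = mlp_only_layer(mlp_only_layer(h_changed))

rows_unaffected = np.allclose(out[1:], out_changed[1:])   # "cat" and "sat" do not notice
row0_changed = not np.allclose(out[0], out_changed[0])
print(rows_unaffected, row0_changed)
```

However hard the first token is perturbed, the outputs for the other two stay where they were. No number of stacked MLP-only layers changes that.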

Method 2: keep the architecture, make attention transparent

The second way disables attention without deleting anything. Keep the full Transformer architecture — all the W_Q, W_K, W_V, W_O machinery — but fabricate the weights so that the attention block writes nothing into the residual stream.

Recall the write-back from Section 6: the attention block’s contribution is \text{attn} = \text{concat}(\text{head outputs}) \cdot W_O, and it is added to the residual. So set

W_O = 0 \quad (\text{equivalently } W_V = 0).

Now, no matter what attention pattern QK^\top computes, the value added to the residual is the zero vector:

h_{\text{mid}} = h^{(\ell)} + 0 = h^{(\ell)}.

The residual stream passes through the attention block untouched. The Q/K/V machinery still runs, still computes attention patterns — but those patterns are transparent: they have no effect on anything downstream. Functionally, the model is once again the pure-MLP network of Method 1.
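The same claim in code: a sketch with random Q/K/V projections and a zeroed W_O, assuming no bias terms. The function name and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads = 3, 6, 2
d_k = d_model // n_heads

def attention_write_back(h, W_Q, W_K, W_V, W_O):
    """Two-head causal attention; returns (write-back, attention weights)."""
    split = lambda M: M.reshape(T, n_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(h @ W_Q), split(h @ W_K), split(h @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    concat = (weights @ V).transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_O, weights

h = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_O = np.zeros((d_model, d_model))          # the fabricated write-back

attn, weights = attention_write_back(h, W_Q, W_K, W_V, W_O)
h_mid = h + attn
print(np.array_equal(h_mid, h))
```

The attention weights are still perfectly ordinary (each row sums to 1, and later tokens spread their weight over earlier ones), but `h_mid` is identical to `h`: the pattern exists and has no effect.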

(You might ask whether there’s a subtler “identity” fabrication: make each token attend only to itself, so attention copies each value through. That doesn’t give transparency: the block would still add each token’s own value, passed through W_V and W_O, on top of what is already in the stream, and in the limiting case where W_V W_O acts as the identity it doubles the vector rather than leaving it unchanged. True transparency means the block adds zero, which is what zeroing the write-back achieves.)

Both roads lead to the same place

Whether you disable attention by deletion (Method 1) or by making it transparent (Method 2), the Transformer collapses to a per-position MLP network — a stack that refines each token in isolation and can never move information between positions. This is the precise sense in which:

Attention is the only channel through which tokens communicate. Everything relational a language model does must be carried by attention, because it is the only operation that moves information across positions.

That is also why the tiny-model story these notes accompany is about attention heads specifically: if a rule relates one token to another, the circuit that enforces it has to live in the attention machinery — there is nowhere else for it to be.

Why Method 2 is the important one: ablation

Method 2 has a feature deletion lacks: it is selective. The write-back matrix W_O is organized in blocks — one slice of rows per head (Section 6, Step 5). Zero out just one head’s slice, and you make exactly that head transparent while leaving every other head working. Re-run the model, measure what changes, and you have a causal probe: what does this one head actually do?

This per-head version of Method 2 is called ablation, and it is the workhorse of mechanistic interpretability. Two findings from tiny, fully-understood models show why it matters:

  • A rule can rest on a single head. Ablate that one head and a behavior the model performed perfectly collapses; ablate any other head and nothing changes. The behavior was carried, causally, by one specific lane in one specific layer.
  • Attention patterns can mislead. A head whose attention looks like it implements a rule — say it stares almost entirely at the relevant earlier token — may, when ablated, turn out to change nothing: it was not load-bearing. Conversely a head with a messy, unremarkable-looking pattern may be the one holding the rule. The attention pattern tells you what a head looks at; only ablation tells you what it does.

That second lesson is worth underlining, because it is exactly where intuition goes wrong: you cannot read a head’s function off its attention picture. You have to switch the head off — Method 2, one head at a time — and watch what breaks. Disabling attention, far from being a destructive curiosity, is therefore the single most useful tool for finding out where in a network a behavior lives — which is the question the accompanying tiny-language-model experiments are built to answer.
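Mechanically, the per-head zeroing looks like the sketch below. It uses random stand-ins for the concatenated head outputs and for W_O, and the helper `ablate_head` is a hypothetical name, not a function from any interpretability library. Heads are numbered from 0 in the code, so `head=0` is “head 1” in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads = 3, 6, 2
d_v = d_model // n_heads

concat = rng.normal(size=(T, n_heads * d_v))    # stand-in for [head 1 out | head 2 out]
W_O = rng.normal(size=(n_heads * d_v, d_model))

def ablate_head(W_O, head, d_v):
    """Return a copy of W_O with the row slice that reads `head` set to zero."""
    W = W_O.copy()
    W[head * d_v:(head + 1) * d_v, :] = 0.0
    return W

full = concat @ W_O
without_head1 = concat @ ablate_head(W_O, head=0, d_v=d_v)

# The write-back is a sum of per-head contributions, so removing head 1
# leaves exactly head 2's contribution.
head2_only = concat[:, d_v:] @ W_O[d_v:, :]
effect_of_head1 = full - without_head1
print(np.allclose(without_head1, head2_only))
```

This only shows the change in one block's write-back. An actual ablation experiment would re-run the whole model with the modified W_O and compare its behavior on the task of interest.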

Appendix B: A convolutional OCR pass, by hand

Section 1 sketched the OCR convolutional network in words. Here we run one through, with numbers small enough to check by hand, so the filter is as concrete as the head. Then we lay the two side by side.

The image

Take a tiny 5 \times 5 grayscale image. Each pixel is 0 (blank) or 1 (ink). This one shows a vertical stroke down the middle column — the kind of mark that distinguishes, say, a 1 or the spine of a T:

       col: 0 1 2 3 4
row 0:      0 0 1 0 0
row 1:      0 0 1 0 0
row 2:      0 0 1 0 0
row 3:      0 0 1 0 0
row 4:      0 0 1 0 0

One filter, sliding

A filter is a small fixed grid of weights. Here is a 3 \times 3 vertical-stroke detector: it rewards ink in its center column and punishes ink on either side, so it responds most strongly to a vertical line.

F_vert = [[-1,  2, -1],
          [-1,  2, -1],
          [-1,  2, -1]]

To convolve, we slide this filter over every 3 \times 3 window of the image; at each stop we multiply overlapping cells and sum to a single number. A 5 \times 5 image with a 3 \times 3 filter has 3 \times 3 = 9 valid stops, so the output (the feature map) is 3 \times 3.

Look at two stops to see the mechanism:

Top-left window (rows 0–2, cols 0–2) — the stroke is off to the right, so the filter’s center column sits on blanks:

window      = [[0,0,1],     elementwise·F_vert, summed:
               [0,0,1],     each row: 0·(-1) + 0·(2) + 1·(-1) = -1
               [0,0,1]]     three rows → -3

Top-center window (rows 0–2, cols 1–3) — now the stroke lines up under the filter’s center column:

window      = [[0,1,0],     each row: 0·(-1) + 1·(2) + 0·(-1) = +2
               [0,1,0],     three rows → +6
               [0,1,0]]

Do this at all nine stops and you get the feature map. The center column lights up (+6); the flanks are suppressed (−3):

conv = [[-3,  6, -3],
        [-3,  6, -3],
        [-3,  6, -3]]
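The nine stops can be reproduced with a short loop. The function name is mine; as in most deep-learning usage, the filter is applied without flipping it, which makes no difference for this left-right symmetric filter:

```python
import numpy as np

image = np.zeros((5, 5), dtype=int)
image[:, 2] = 1                       # vertical stroke down the middle column

F_vert = np.array([[-1, 2, -1],
                   [-1, 2, -1],
                   [-1, 2, -1]])

def slide_filter(img, filt):
    """Slide filt over every valid window; multiply overlapping cells and sum."""
    fh, fw = filt.shape
    out_h, out_w = img.shape[0] - fh + 1, img.shape[1] - fw + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(img[r:r + fh, c:c + fw] * filt)
    return out

conv = slide_filter(image, F_vert)
print(conv)
```

The printed feature map is the 3 \times 3 grid above: +6 down the center column, −3 on both flanks.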

ReLU, then pooling

Apply ReLU (zero out negatives) — the map keeps only the positive evidence “a vertical stroke is here”:

relu(conv) = [[0, 6, 0],
              [0, 6, 0],
              [0, 6, 0]]

Then max-pool with a 2 \times 2 window, moved one cell at a time (stride 1), to shrink the grid and add a little position tolerance (each output cell is the max of a 2 \times 2 patch). The 3 \times 3 map becomes 2 \times 2:

pool = [[6, 6],
        [6, 6]]

The pooled map is uniformly high: this filter is shouting “vertical stroke, present.” Pooling means it would keep shouting even if the stroke shifted a pixel — the detector is now slightly position-invariant, exactly the property Section 1 said convolution buys you.
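A sketch of the ReLU and pooling steps, assuming the pooling window moves one cell at a time (stride 1), which is what takes a 3 \times 3 map down to 2 \times 2. The helper name and its defaults are illustrative:

```python
import numpy as np

conv = np.array([[-3, 6, -3],
                 [-3, 6, -3],
                 [-3, 6, -3]])

relu = np.maximum(conv, 0)            # keep only the positive evidence

def max_pool(x, size=2, stride=1):
    """Max over each size x size patch, stepping by stride."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for r in range(out_h):
        for c in range(out_w):
            patch = x[r * stride:r * stride + size, c * stride:c * stride + size]
            out[r, c] = patch.max()
    return out

pool = max_pool(relu)
print(pool)
```

Every 2 \times 2 patch of the rectified map contains one cell of the hot center column, so every pooled cell is 6.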

A layer has many filters

A real layer applies many filters in parallel, each producing its own feature map. Add a second filter, a horizontal-stroke detector (ink rewarded in the center row):

F_horiz = [[-1, -1, -1],
           [ 2,  2,  2],
           [-1, -1, -1]]

Run it on the same vertical-stroke image and every window cancels to zero — there is no horizontal ink to reward:

conv = relu = pool = all zeros

So the two filters disagree, informatively: the vertical detector fires, the horizontal detector is silent. That contrast is the feature the classifier wants.

Flatten and classify

Flatten the two pooled maps into one feature vector (four numbers per filter, eight in all):

features = [6, 6, 6, 6,   0, 0, 0, 0]
            └ vertical ┘   └ horizontal ┘

A final linear layer reads these features into one score per class — let the “vertical” class sum the vertical-filter features and the “horizontal” class sum the horizontal ones — and softmax turns the scores into probabilities:

logits = [24,  0]            # vertical, horizontal
softmax ≈ [1.00, 0.00]       # "this is a vertical stroke"

The network has converted a grid of pixels, step by step, into a point in a tiny two-class solution space, and read out a crisp answer — the same arc as the language model, just over characters instead of next tokens. (The probability is emphatic because this is a noise-free toy; on real handwriting the distribution would be softer.)
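The read-out can be sketched the same way. The classifier weights below are set by hand to implement “sum the vertical features, sum the horizontal features”; in a trained network they would be learned, and the name `W_cls` is mine:

```python
import numpy as np

pool_vert = np.array([[6, 6], [6, 6]])     # pooled map of the vertical filter
pool_horiz = np.zeros((2, 2), dtype=int)   # pooled map of the horizontal filter

features = np.concatenate([pool_vert.ravel(), pool_horiz.ravel()])   # length 8

# Hand-set readout weights (not learned): column 0 sums the vertical
# features, column 1 sums the horizontal ones.
W_cls = np.zeros((8, 2))
W_cls[:4, 0] = 1.0
W_cls[4:, 1] = 1.0

logits = features @ W_cls                  # [24, 0]
e = np.exp(logits - logits.max())
probs = e / e.sum()
print(logits, np.round(probs, 2))
```

A gap of 24 between the two logits leaves the second class with a probability on the order of e^{-24}, which is why the rounded distribution reads as 1.00 and 0.00.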

Filter vs head, side by side

Now the comparison that motivated this appendix. A filter and a head are both feature detectors that get reused across positions, but they differ in the one respect that matters for language:

| | Convolutional filter (CNN) | Attention head (Transformer) |
|---|---|---|
| What it stores | a fixed 3 \times 3 pattern of weights | three projections W_Q, W_K, W_V (and shares W_O) |
| Receptive field | fixed and local: always the same small window | dynamic and global: chosen at runtime, can reach any earlier token |
| How it matches | slides the same pattern over every position; fires where the pixels match it | computes, from the content, a query–key similarity to decide which positions to read |
| What “reuse across positions” means | weight sharing: identical weights at every location | the same W_Q, W_K, W_V at every position, but the attention pattern is recomputed per input |
| Output at a position | one activation = how well the local patch matches the pattern | a weighted blend of other positions’ V payloads |
| Set by the data… | at training time (the filter weights are learned, then frozen) | at training time and at run time (the pattern depends on the actual tokens) |

The crucial row is receptive field. The vertical filter above can only ever see a 3 \times 3 patch; to relate ink in the top-left corner to ink in the bottom-right, a CNN must stack many layers until their windows overlap. That is fine for images, where the clues are local. A head pays no such toll: the query for one token can match the key of a token hundreds of positions away in a single step, and which token it matches is decided by the content, not fixed by the architecture. That is the whole reason attention displaced convolution for language — and, looping back to Appendix A, it is also why a relational rule in a language model lives in a head: the head is the only component whose reach is wide enough, and content-driven enough, to relate one token to another.

Appendix C: Glossary

For readers who would like the basics or a refresher. Terms are grouped roughly by where they first appear; Notes 2 and 3 carry their own glossaries for the terms specific to them.

Token, vocabulary

A token is the unit of text the model reads — a whole word in our toy examples, usually a sub-word piece in production models. The vocabulary (V) is the fixed set of all possible tokens (20 in Section 5’s toy, 50,257 in GPT-2).

Embedding, embedding matrix E

The vector of real numbers that represents a token — its coordinates in the model’s high-dimensional “language space.” The embedding matrix E has one row per vocabulary token; “looking up” a token means reading its row.

Residual stream

The running vector the Transformer maintains for each position and updates layer by layer; every attention and MLP block adds its output to it. It is the only place the model’s “thinking” lives, and the prediction is read from its final state.

Attention head, Q / K / V

A sub-mechanism that, for each token, decides how much to read from that token and from every earlier one. It does so with three learned projections of the residual: the query (“what am I looking for?”), the key (“what do I offer to match against?”), and the value (“what payload do I carry if matched?”). Query–key similarity sets the attention pattern; the values are what flow along it.

Multi-head attention

Running h attention heads in parallel on independent slices of Q/K/V, then concatenating their outputs and mixing them with W_O. Each head can specialise in a different relation.

W_Q, W_K, W_V, W_O

The four learned weight matrices of an attention block: three that project the residual into queries, keys and values, and W_O that maps the concatenated head outputs back into the residual stream.

MLP block (W_1, W_2, ReLU)

The fully-connected “feed-forward” sub-layer that refines each token’s vector in place. It expands the vector to a wider hidden size via W_1, applies a non-linearity (here ReLU, which zeroes negatives), and contracts back via W_2. It never mixes information across positions.

Logits, softmax, temperature

The model’s raw, unnormalised scores over the vocabulary are logits. Softmax turns them into a probability distribution (exponentiate, then normalise to sum to one). Temperature \tau rescales the logits before softmax: low \tau sharpens the distribution (greedy at \tau\to 0), high \tau flattens it.

Unembedding U, weight tying

The matrix U that maps the final residual vector to one logit per vocabulary token; each column is a token’s representation. Weight tying sets U = E^\top, so the same dictionary serves for input lookup and output scoring (used in GPT-2, Gemma, and our toys; the larger Llama models keep the two matrices separate).

LayerNorm / RMSNorm

A normalisation applied to the residual vector (notably just before the unembedding) that rescales its components to a controlled magnitude and applies a learned per-dimension scale.

Causal mask

The rule that each token may attend only to itself and earlier tokens. Implemented by setting the upper triangle of the attention scores to -\infty before softmax, so future positions get weight zero.

KV cache

The stored keys and values of all past tokens, reused at each generation step so the new token’s query can attend back to the whole history. Queries are not cached — hence “KV”, not “QKV”.

Prefill vs. generation

Prefill processes the whole prompt at once, writing every token’s K and V into the cache. Generation then adds one token at a time, growing the cache by one row of K and V per step.

Ablation

Switching off one component — e.g. zeroing one head’s slice of W_O — and re-measuring behaviour, to test what that component causally does (Appendix A).

Convolution, filter, feature map (CNN terms)

A convolution slides a small fixed filter of weights over an image, computing a local weighted sum at each position; the resulting grid is a feature map. The contrast with an attention head — fixed local window vs. dynamic content-driven reach — motivates Section 1 and Appendix B.

References

Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse autoencoders find highly interpretable features in language models. arXiv. https://arxiv.org/abs/2309.08600

DeepSeek-AI. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv. https://arxiv.org/abs/2405.04434

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., … Olah, C. (2022). Toy models of superposition. Transformer Circuits Thread. https://transformer-circuits.pub/2022/toy_model/index.html

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). Scaling laws for neural language models. arXiv. https://arxiv.org/abs/2001.08361

nostalgebraist. (2020, August 31). Interpreting GPT: The logit lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762


Written with substantial help from Claude (Anthropic); directed, reviewed, and verified by me.