The Workbench

The Final Step: The Language Model as a Relentless Seeker of the Best Next Word

Luca Erzegovesi — Mon, 01 Jun 2026 22:00:00 GMT

Third and last in the series “Understanding LLMs to use them better in management and finance.” It closes the loop opened by QKV Attention (how a Transformer moves information between tokens) and Embeddings and the Maps We Draw of Them (where the vectors come from and how to read them). Those two notes built the machinery and the dictionary; this one watches the last step — the moment the model turns all that work into a single next word — and argues that the step is, at heart, a search. It is written for a reader who is not a computer scientist; the technical terms (logit, softmax, unembedding, …) are collected in the Glossary at the end and explained where first used.

1. The intuition: a relentless search for the next word

Strip a language model down to what it actually does, moment to moment, and it is almost embarrassingly simple. It has a fixed list of possible words — its vocabulary — and at every step it gives each word in that list a score, sorts them, and picks one. Then it appends the chosen word to the text and does the whole thing again. And again. Thousands of times, to write you a paragraph.

The useful image is a search engine. Generating text is like running a Google search at every step — but a peculiar one. You are not searching the web; you are searching the model’s own vocabulary. And you are not typing the query; the query is a vector the model has computed from everything said so far. The “search results” are the whole vocabulary, ranked by how well each word fits as the continuation, and the model reads off the top of the list.

The first two notes were really about the two halves of that sentence. The QKV note showed how the model builds the query: each token is carried forward as a vector (the residual stream), and attention heads and MLP blocks keep modifying it until the last token’s vector encodes everything the model has worked out about what should come next. The embeddings note showed what is being searched: the vocabulary as a cloud of vectors — the dictionary — and how its geometry encodes meaning. This note connects them: the query meets the dictionary, and a word comes out.
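The loop described at the start of this section can be sketched in a few lines. This is a minimal illustration, not the trained model: the four-word vocabulary and the `score_next` rule are hypothetical stand-ins for the real scoring machinery that the later sections describe.

```python
# Hypothetical 4-word vocabulary and a stand-in scoring rule (NOT a trained model).
VOCAB = ["Pietro", "chiama", "Paolo", "Tarso"]

def score_next(context):
    """Toy stand-in for the model: return one score per vocabulary word."""
    # Favour the word that follows the last one in a fixed cycle.
    last = VOCAB.index(context[-1])
    return [3.0 if i == (last + 1) % len(VOCAB) else 0.0 for i in range(len(VOCAB))]

def generate(prompt, steps):
    text = list(prompt)
    for _ in range(steps):
        scores = score_next(text)                               # score every word
        best = max(range(len(VOCAB)), key=lambda i: scores[i])  # rank, take the top
        text.append(VOCAB[best])                                # append, then repeat
    return text

result = generate(["Pietro"], 3)
print(" ".join(result))
```

Everything interesting in a real model hides inside the scoring function; the loop around it is as plain as it looks here.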

2. The mechanism, precisely

After the last layer, the model holds one vector for the final position: call it $h$, of length $d$. It is the fully-processed representation of “what comes next.” To turn it into word scores, the model multiplies it by the unembedding matrix $W_U$, whose columns $u_w$ are one vector per vocabulary word:

$$\ell = W_U^{\top}\,\tilde{h}, \qquad \ell_w = \tilde{h} \cdot u_w$$

(The tilde is a final LayerNorm that tames the vector’s magnitude first; the glossary has the detail.) Read the second equation slowly, because it is the whole note: each score $\ell_w$, each logit, is the dot product of the model’s query vector with word $w$’s vector in the dictionary. A dot product is the most basic similarity measure there is: it is large when two vectors point the same way. So the score of a word is how aligned the model’s query is with that word’s entry in the dictionary. Ranking words by their logit is therefore a similarity search over the vocabulary, what the literature calls a maximum-inner-product search.

And here is where the two notes fuse. In our models (and in GPT-2 and many others, though not every large model ties its weights) the embedding and unembedding matrices are tied: $W_U = W_E^{\top}$. The columns the query is scored against are the very embedding rows the previous note mapped. The dictionary you search at the end is literally the dictionary you looked up at the start. The maps in note 2 are maps of the index this search runs over.

The raw logits are then passed through softmax, which exponentiates and normalises them into a probability distribution over the vocabulary — the ranked results page, now with percentages. A tiny worked version (three words, by hand) is in Appendix A; the shape of it is all that matters here:

flowchart LR
    A["query vector (the final residual)"] -->|"dot product"| B["one score per word (logits)"]
    B -->|"softmax"| C["probabilities (the ranking)"]

The query did all the hard thinking; the search itself is a single matrix multiply followed by a normalisation. The unembedding cannot reason — it can only compare.
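The three boxes of the flowchart can be written out as a short sketch. The matrix names `W_E` and `W_U`, the three-word dictionary, and the query values are illustrative assumptions, and the LayerNorm here is the plain version without the learned gain and bias a real model carries.

```python
import math

def layer_norm(v, eps=1e-5):
    # Plain LayerNorm without learned gain or bias (a simplifying assumption).
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

# Hypothetical embedding matrix W_E: one row per word, vector length 3.
words = ["Tarso", "Cefa", "Paolo"]
W_E = [[1.2, 0.4, -0.1],
       [0.3, 1.0, 0.2],
       [-0.5, 0.2, 0.9]]

# Tied weights: the unembedding is the transpose, so its columns are the rows of W_E.
W_U = [list(col) for col in zip(*W_E)]

def unembed(h):
    h_tilde = layer_norm(h)
    # logit for word w = dot product of h_tilde with column w of W_U
    return [sum(h_tilde[i] * W_U[i][w] for i in range(len(h_tilde)))
            for w in range(len(words))]

def softmax(z):
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [x / total for x in exps]

h = [1.0, 0.5, -0.2]                    # assumed final residual for the last position
probs = softmax(unembed(h))
top = words[max(range(len(words)), key=lambda w: probs[w])]
print(top, [round(p, 3) for p in probs])
```

Note that nothing in `unembed` depends on the prompt: all the context-specific work is already inside `h` by the time this code runs.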

3. The search-result page

Let us actually run it. Our worked example throughout is the small word-level Transformer from note 2 (two layers, four heads, vectors of 64 numbers, a 28-word vocabulary), trained on the turn-taking “calling game” (Pietro chiama Paolo, with epithets that depend on who calls whom). Give it the prompt Pietro chiama Paolo and ask for the next word’s ranking:

[Figure: The search-result page: confident vs uncertain]

The left panel is the results page for that prompt. One result dominates utterly: Tarso, the epithet the game’s rule assigns to Paolo when Pietro calls him, with probability 0.9998. The other 27 words in the vocabulary share the remaining 0.0002 between them. This is what a confident search looks like: the query points almost exactly at one entry in the dictionary, and the dot product with that entry towers over all the others.

Three things are worth pinning down about that distribution, because they correct the naïve “find the single nearest word” picture:

It is a ranking over the whole vocabulary, not one hit. The model does not retrieve a word; it scores all of them and reports a graded list. Usually the list is informative well below the top.
Direction says which words, magnitude says how sure. The direction of the query picks out which entries it aligns with; its length controls how peaked the softmax is. A long query makes one word dominate (high confidence); a short one leaves the results flat (high uncertainty). The model’s confidence is encoded in the geometry, not bolted on afterwards.
“Similarity” here means “trained to fit,” not human synonymy. Two words sit close in the dictionary because the model learned they play similar roles in predicting text — which usually looks like meaning, but is defined by the training task, exactly as note 2 argued.

4. Confident searches and uncertain ones

Now shorten the prompt to Pietro chiama — caller named, but no callee yet. The right panel above is the result. There is no dominant hit; instead eight near-tied results (the player tokens 2, 6, 3, 4, 7, 1, 5, 8, each around 0.10–0.12), because at this point any valid next player is an equally good continuation. The flatness is not a bug; it is the model correctly reporting that it does not know which player comes next, only that it must be a player.

That single contrast — one towering bar versus eight stubby equal ones — is the most operationally useful thing in this note, because three everyday LLM behaviours fall straight out of it:

Sampling and “temperature.” When the results are flat, something still has to be chosen. Greedy decoding takes the top bar; sampling rolls a weighted die over the ranking. Temperature reshapes the distribution before the roll — high temperature flattens it (more adventurous), low temperature sharpens it (more predictable). Generation is a chain of sampled searches, not deterministic lookups.
Hallucination, demystified. The search always returns a ranking — even when nothing in the vocabulary genuinely fits. Ask a model for a fact it never learned and the query points nowhere in particular, but the nearest-by-accident words still get scored, softmax still sums to one, and the model still emits its confident-looking top result. A hallucination is a low-quality search the machinery has no choice but to complete.
Why prompts and context matter so much. Everything the model knows about this step is compressed into the query vector, and the query is built from the prompt. A better prompt is, quite literally, a better search query.
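The first of these behaviours is easy to make concrete. The sketch below assumes the common convention of dividing the logits by a temperature before softmax; the three logits are illustrative, and the function names are ours rather than any library's.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]   # divide the logits by T first
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    # Take the top bar of the ranking.
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    # Roll a weighted die over the ranking.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]                         # illustrative scores for three words
cold = softmax_with_temperature(logits, 0.5)     # sharper
warm = softmax_with_temperature(logits, 1.0)     # unchanged
hot = softmax_with_temperature(logits, 2.0)      # flatter

rng = random.Random(0)                           # fixed seed so the run is repeatable
draws = [sample(hot, rng) for _ in range(1000)]
print([round(p, 2) for p in cold], [round(p, 2) for p in hot], greedy(hot))
```

Greedy decoding returns the same word at any temperature, because rescaling never changes the order of the ranking; only sampling feels the difference.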

5. The fulcrum: how one vector comes to hold a whole world

Step back to ask what is really remarkable here, because it is easy to lose it in the arithmetic. The query vector is produced by nothing but multiplications and additions — matrix products, dot products, a normalisation. And yet, by the time it reaches the final step, that one short list of numbers has absorbed the meaning of the entire prompt: who is speaking, who was called, what the game’s rule demands, which word is therefore due. Pure numerical manipulation has condensed a whole context into a single point in space.

There is an Archimedean quality to this. Give me a place to stand, and a lever long enough, and I will move the world — and the place the model stands is exactly this fulcrum, the last token’s vector. The whole world of the prompt, with all its accumulated context, bears down on that one point, and what gets lifted into existence is the next word. The lever is built by the parts the earlier notes described: attention reaches back across the sentence and binds the relevant tokens into the residual (note 1), and the MLP / feed-forward blocks act, position by position, as a kind of learned lookup table — given “Pietro called Paolo,” they fetch the direction that means “Tarso.” Layer by layer the residual is loaded until it is the long arm of the lever, and the tiny final search is the short arm that the loaded context throws upward.

We can watch the lever load. Project the last token’s residual, at each stage of the forward pass, onto a plane built (by the same Gram-Schmidt method as note 2) from two dictionary directions — the epithet Tarso and the callee Paolo:

[Figure: The residual stream migrating toward the answer]

The vector starts near the called name and travels across the plane toward Tarso as the layers run — its coordinate along the Tarso axis climbs from −3.1 to +10.6 — and decomposing the path shows that block-0 attention alone supplies +7.62 of that push. Moving the residual toward a word’s direction is raising that word’s logit (tied weights again), so this picture is the search ranking being rewritten in real time. The same thing read as rankings rather than geometry is the logit lens — applying the unembedding at each intermediate stage to ask “what would the model predict if it had to commit now?”:

[Figure: Logit lens: the ranking locks on early, then sharpens]

By the end of the very first block the search already ranks Tarso top (0.92); the second block only sharpens it to near-certainty. The decisive work — the lever’s heave — happens early, exactly where the trajectory said it did. None of this is mysticism: it is binding, then lookup, then a dot product. But it is worth pausing on the fact that that is enough to make a vector carry a world.
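The logit lens itself is a small amount of code: the same readout, applied to earlier snapshots of the residual. In the sketch below the two-dimensional dictionary and the three residual snapshots are invented for illustration (they are not the trained model's values), and the final LayerNorm is left out for brevity.

```python
import math

# Hypothetical 2-D dictionary directions (not the trained model's values).
dictionary = {"Tarso": [1.0, 0.0], "Paolo": [0.0, 1.0], "Cefa": [-0.7, -0.7]}

# Hypothetical snapshots of the last token's residual, stage by stage.
stages = {
    "embedding":     [-0.5, 2.0],
    "after block 0": [2.5, 1.0],
    "after block 1": [6.0, 0.5],
}

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [x / total for x in exps]

def logit_lens(residual):
    """Apply the final readout to an intermediate residual."""
    words = list(dictionary)
    logits = [sum(r * v for r, v in zip(residual, dictionary[w])) for w in words]
    return dict(zip(words, softmax(logits)))

lens = {name: logit_lens(r) for name, r in stages.items()}
tops = {name: max(p, key=p.get) for name, p in lens.items()}
for name in stages:
    print(f"{name:14s} top = {tops[name]:6s} p = {lens[name][tops[name]]:.3f}")
```

With these made-up numbers the top word flips from the called name to the epithet after the first block and only sharpens after the second, which is the qualitative pattern the figure reports.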

6. From a toy to a thousand-page contract

It would be fair to object that lifting the word “Tarso” out of a 28-word vocabulary is a parlour trick. So here is the part that should genuinely give pause: the machinery is identical at the top of the field. GPT-class models do exactly what our toy does — score every word in the vocabulary by dot product with a computed query, softmax, sample, append, repeat. Nothing more exotic is bolted on. What differs is only scale: a query vector of 12,288 numbers instead of 64 (note 2), a dictionary of 50,000-plus word-pieces instead of 28, and a lever built from dozens of layers trained on a sizeable fraction of the written internet.

And from that — from a relentless loop of “rank the vocabulary, pick a word” — comes the entire observed richness: a comedy sketch with a setup and a punchline, a sonnet that scans, a scientific paper with a coherent argument, an investor report that ties its narrative to its numbers, a full contract with cross-referencing clauses. Each of those is produced one next-word search at a time, each search standing on the fulcrum the previous words built. The wonder of large language models is not that they do something other than this. It is that this, at scale, is enough.

7. The conveyor belt grows: tools, search, and code

Modern systems add one more move, and it is the move that turns a text generator into an assistant. The find-next-word loop is orchestrated with real outside power: a web search, a calculator, a code interpreter, or other applications reached through MCP servers. It is tempting to imagine the model “using” these tools the way a person does. It does something stranger and simpler.

Picture the prompt as a conveyor belt of text that the model endlessly reads and extends. When a tool is involved, the model does not leave the belt to operate machinery. It simply writes onto the belt a request — a few tokens that mean “search the web for X” or “run this code.” An external harness, sitting outside the model, notices that text, performs the actual action, and lays the result back onto the belt as more text: the search snippets, the computed number, the program’s output. Then the same next-word loop resumes, now reading a context enriched with fresh, grounded material. The tool’s answer is not stored in some special memory; it becomes ordinary prompt text, indistinguishable in kind from what the user typed.

This is why the picture matters for the rest of the note. The lever’s “world” (the context pressing on the query) is no longer limited to what the user wrote and what the model already knew. It can now include this morning’s web page, an exact calculation, the output of a freshly compiled program. But the mechanism underneath never changes: every one of those additions is just more text on the belt, and every word the system produces is still one dot-product search over the vocabulary. The intelligence of an agent is, in the end, the intelligence of what gets written onto the belt, and of a search that keeps reading it.
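The conveyor-belt picture can be acted out with two small functions. Everything here is a hypothetical stand-in: `toy_model` plays the next-word loop, `harness` plays the external program, and the `CALL calculator:` and `RESULT:` markers are an invented convention, not the format of MCP or of any real tool API.

```python
def toy_model(belt):
    """Hypothetical stand-in for the next-word loop: reads the belt, returns text to append."""
    if "RESULT:" in belt:
        answer = belt.rsplit("RESULT:", 1)[1].strip()
        return f"The total is {answer}."
    return "CALL calculator: 1200 + 345"

def harness(belt):
    """Runs outside the model: spots a tool request and writes the result back as text."""
    marker = "CALL calculator:"
    if marker in belt:
        expression = belt.rsplit(marker, 1)[1].strip()
        left, right = expression.split("+")
        return f"\nRESULT: {int(left) + int(right)}\n"
    return ""

belt = "User: what is 1200 + 345?\n"
belt += toy_model(belt)   # the model writes a request onto the belt
belt += harness(belt)     # the harness performs the action and appends the result
belt += toy_model(belt)   # the same loop resumes on the enriched context
print(belt)
```

The model function never computes the sum. It only reads a belt that, on its second pass, happens to contain the answer as ordinary text.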

8. Where this sits — and where the analogy breaks

Honesty first, since this series tries to locate its ideas rather than oversell them. Nothing in the mechanism here is a new finding; a specialist would recognise every piece. That the unembedding is a dot-product readout you can even apply to intermediate layers is the logit lens (nostalgebraist; refined as the tuned lens, Belrose et al. 2023). That the feed-forward blocks behave like a lookup table is Transformer Feed-Forward Layers Are Key-Value Memories (Geva et al. 2021). The output-layer-as-similarity-search idea goes back to word2vec (Mikolov et al. 2013). And the pedagogy of doing the whole thing transparently has been done before — Ishan Anand’s Spreadsheets Are All You Need implements GPT-2 in a spreadsheet for exactly this reason. The contribution here is only the exposition for a non-technical audience and the live, perturbable toy behind the figures; the toy game itself is reproduced from the public ToyDialogueGames exercise.

With that said, the “search engine” image earns its keep but must not be pushed too far. Four places it breaks, each worth keeping in mind:

There is no external corpus. A web search ranks billions of documents; this search ranks only the model’s own fixed vocabulary (tens of thousands of word pieces). It cannot return anything that is not already a token.
You don’t type the query — the model computes it. All the work, and all the cleverness, is in constructing the query vector. The search step itself is a trivial linear operation. “It’s just doing search” is true and deeply misleading at once: the search is dumb; the query is the model.
It returns a distribution and then gambles. Unlike a search box that shows you a fixed list, generation samples from the ranking, so the same prompt can yield different continuations. Determinism is a special case (temperature zero), not the rule.
“Relevance” is the training objective, not human judgement. A word scores high because the model was trained to make the right next token score high — which approximates meaning but is not the same thing, and is exactly why the results can be fluent and wrong together.

This last point connects the whole series back to its second note. A vector database (the engine of retrieval-augmented generation) also performs a similarity search — but over an external corpus, using a query produced by a separately trained encoder. A language model performs a similarity search over its own vocabulary, using a query it computes internally. Same operation, different index and different source of query. Seen this way, RAG is just a way to put better text on the conveyor belt: it drops genuinely relevant documents into the context so that the model’s computed query — and therefore its next-word search — points somewhere grounded.
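“Same operation, different index” can be shown literally by calling one ranking function on two indexes. The vectors, document titles, and query values below are hypothetical, and the sketch uses an exact dot product for both cases; production vector databases typically use cosine similarity and approximate search instead.

```python
def rank(query, index):
    """Rank the entries of an index by dot product with the query, largest first."""
    scores = {key: sum(q * v for q, v in zip(query, vec)) for key, vec in index.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical vectors throughout.
document_index = {"Q3 revenue memo": [0.9, 0.1], "Holiday policy": [0.1, 0.9]}
vocabulary_index = {"Tarso": [1.0, 0.0], "Cefa": [0.0, 1.0]}

encoder_query = [0.8, 0.2]   # would come from a separately trained encoder
model_query = [0.7, 0.1]     # would be computed internally by the language model

best_document = rank(encoder_query, document_index)[0]
best_word = rank(model_query, vocabulary_index)[0]
print(best_document, "|", best_word)
```

The function cannot tell which of the two jobs it is doing; the difference lies entirely in what the index holds and where the query came from.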

9. What this means for using LLMs in management and finance

If you carry one mental model away from these three notes, let it be this: a language model is not a database that you query for facts, and not a mind that “knows” things. It is a machine that, at every step, draws on the knowledge compressed into its weights to build a vector summarising the context, and then runs a similarity search over its own vocabulary for the best next word. Strike that key again and again and an articulated response takes shape, one word at a time.

And what comes out is unlike the result of a database query. It is not a selection of pre-baked text or data retrieved from a store; it is closer to a kind of mechanical expert judgement — a chain of small inferences over knowledge the model has metabolised from training on many similar cases. That distinction is the practical heart of what follows.

That single picture pays off in practice:

Hallucinations are not malfunctions; they are the search completing when it shouldn’t. The cure is not to scold the model but to improve the query — by putting the right facts on the conveyor belt (RAG, tools, better context).
Prompting is query construction. Time spent shaping the context is time spent aiming the search. It is the highest-leverage thing a non-technical user controls.
Confidence is readable, and worth reading. A model that is “sure” has a peaked distribution; a flat one is a warning. Where a system exposes token probabilities or lets you vary temperature, those are direct windows onto how strong the search hit actually was.
Tools extend the world the model can lift, not the mechanism. Web search, a calculator, a code runner, an MCP-connected application — each one just enriches the text the next-word search reads. Understanding that boundary is what lets you reason about what these systems can and cannot reliably do.
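Where token probabilities are exposed, “peaked versus flat” can be reduced to a single number. The sketch below uses Shannon entropy in bits, which is our choice of summary rather than something the note's tooling reports, and the two distributions are illustrative reconstructions of the confident and uncertain pages of section 3, not the model's exact output.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: near 0 for one dominant word, higher when flat."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative 28-word distributions.
confident = [0.9998] + [0.0002 / 27] * 27    # one towering bar
uncertain = [0.12] * 8 + [0.04 / 20] * 20    # eight near-tied bars, the rest tiny

h_confident = entropy_bits(confident)
h_uncertain = entropy_bits(uncertain)
print(f"confident: top p = {max(confident):.4f}, entropy = {h_confident:.3f} bits")
print(f"uncertain: top p = {max(uncertain):.4f}, entropy = {h_uncertain:.3f} bits")
```

A reading near zero bits says the search had one clear hit; a reading of about three bits says the model was choosing among roughly eight equally good candidates.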

The model is a relentless seeker of the best next word. Everything else (the poems, the reports, the contracts, the agentic tool use) is what that one tireless search becomes when it stands on a rich enough world and is run, patiently, again and again.

Appendix A: the search, by hand

Take a deliberately tiny model: a query vector of length 3 and a three-word vocabulary, each word a row of the (tied) dictionary.

A caveat on what is being skipped, so the example is not mistaken for the whole machine. In a real run there is a prompt (text such as Pietro chiama Paolo) and a query-building mechanism: the embedding lookup, the attention blocks, and the MLP blocks of Notes 1 and 2, which between them read the prompt and grind it down into the single final-residual vector $h$. All of that is the hard, interesting part. Here we simply posit the finished $h$ and a three-word dictionary, so that the search step itself (dot product, then softmax) stands alone and can be checked by hand. Read the numbers below as “suppose the upstream layers handed us this query against this dictionary.”

query  h     = ( 1.0,  0.5, -0.2)        the final residual

dictionary:
   Tarso      = ( 1.2,  0.4, -0.1)
   Cefa       = ( 0.3,  1.0,  0.2)
   Paolo      = (-0.5,  0.2,  0.9)

Score each word — a dot product with the query (this is the “search”):

ℓ(Tarso) = 1.0·1.2 + 0.5·0.4 + (-0.2)(-0.1) =  1.20 + 0.20 + 0.02 =  1.42
ℓ(Cefa)  = 1.0·0.3 + 0.5·1.0 + (-0.2)( 0.2) =  0.30 + 0.50 − 0.04 =  0.76
ℓ(Paolo) = 1.0(-0.5)+ 0.5·0.2 + (-0.2)( 0.9) = −0.50 + 0.10 − 0.18 = −0.58

Softmax — turn scores into a ranked probability list:

exp:   e^1.42 = 4.14,   e^0.76 = 2.14,   e^-0.58 = 0.56     sum = 6.84
prob:  Tarso 4.14/6.84 = 0.61    Cefa 0.31    Paolo 0.08

Tarso wins, because its dictionary entry points most nearly the same way as the query. Note what each stage did: the dot products are the search; softmax only turns the scores into a tidy ranking. Make the query longer (multiply by 3, keeping its direction) and the logits become (4.26, 2.28, −1.74) with probabilities (0.88, 0.12, 0.002): same winner, sharper confidence. Direction chose the word; magnitude set the certainty.
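For readers who would rather check the arithmetic by machine than by hand, the same numbers can be reproduced with a short script. It uses only the query and dictionary given above; the function name `search` is ours.

```python
import math

query = [1.0, 0.5, -0.2]
dictionary = {
    "Tarso": [1.2, 0.4, -0.1],
    "Cefa":  [0.3, 1.0, 0.2],
    "Paolo": [-0.5, 0.2, 0.9],
}

def search(q):
    """Dot product with every dictionary entry, then softmax."""
    logits = {w: sum(a * b for a, b in zip(q, v)) for w, v in dictionary.items()}
    total = sum(math.exp(s) for s in logits.values())
    probs = {w: math.exp(s) / total for w, s in logits.items()}
    return logits, probs

logits, probs = search(query)
long_logits, long_probs = search([3 * x for x in query])   # same direction, 3x the length

for w in dictionary:
    print(f"{w:6s} logit {logits[w]:5.2f}  p {probs[w]:.2f}"
          f"   | x3: logit {long_logits[w]:5.2f}  p {long_probs[w]:.3f}")
```

Tripling the query leaves the order of the three words untouched and moves Tarso's share from about 0.61 to about 0.88.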

Appendix B: Glossary

Continues the glossary of note 2; here are the terms specific to the final step.

Logit

A word’s raw, unnormalised score — the dot product of the model’s query vector with that word’s dictionary entry. One logit per vocabulary word.

Softmax

The function that turns a vector of logits into a probability distribution: exponentiate each, divide by the total. Bigger logits become bigger probabilities; the result sums to 1.

Unembedding

The matrix that maps the final vector to one logit per word. Its columns are word vectors. With tied weights it is the transpose of the embedding matrix — the same dictionary used for input lookup and output scoring.

Maximum-inner-product search (MIPS)

“Find the items whose vectors have the largest dot product with a query.” Ranking the vocabulary by logit is exactly this — a similarity search, with the dot product as the similarity.

Logit lens

A diagnostic: apply the unembedding to the residual at an intermediate layer to see what the model would predict if forced to commit there. It works because the residual lives, throughout the network, in the space the final readout uses.

Key-value memory (feed-forward block)

A way of reading the MLP/feed-forward sub-layers: each behaves like a bank of stored key–value pairs. A key recognises a pattern in the residual (“Pietro called Paolo”) and its value writes an associated direction back (“toward Tarso”). The model’s per-token “lookups.”

Greedy decoding, sampling, temperature, top-k / top-p

Ways to pick a word from the ranked distribution. Greedy takes the top one. Sampling draws at random in proportion to probability. Temperature divides the logits by a constant before softmax, which rescales the distribution before drawing: high = flatter/more varied, low = sharper/more predictable. Top-k restricts the draw to the k most probable words; top-p restricts it to the smallest set of top-ranked words whose probabilities add up to p.

RAG (retrieval-augmented generation)

Fetching relevant documents from an external store (via a vector-database similarity search) and inserting them into the prompt, so the model’s next-word search runs over a context grounded in real text.

MCP (Model Context Protocol)

A standard by which a model’s host application connects to external tools and data sources. In the picture of this note: a way for results computed outside the model to be written back onto the prompt “conveyor belt” as text the next-word loop then reads.

References

Anand, I. (2024). Spreadsheets are all you need: A spreadsheet implementation of GPT-2. https://spreadsheets-are-all-you-need.ai/

Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., & Steinhardt, J. (2023). Eliciting latent predictions from transformers with the tuned lens. arXiv. https://arxiv.org/abs/2303.08112

Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 5484–5495). Association for Computational Linguistics. https://arxiv.org/abs/2012.14913

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv. https://arxiv.org/abs/1301.3781

nostalgebraist. (2020, August 31). Interpreting GPT: The logit lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762

Models and tooling behind the figures (in the julia-impromptu project): the trained dialogue-game model DialogueGame-Tiny-Epithet-Trained.json, whose Forward_TopK ranking and logit-lens recompute live as you edit the prompt; the series’ live companion EmbeddingAtlas.json, whose final_step_search report shows the confident/uncertain search pages and the logit-lens locking on; the figure data computed bundle-pure in tools/compute_note3.jl and rendered by tools/render_note3.py. The residual-trajectory figure is reused from note 2.

Embeddings and the Maps We Draw of Them

Luca Erzegovesi — Sun, 31 May 2026 22:00:00 GMT

Second in the series “Understanding LLMs to use them better in management and finance.” It follows QKV Attention (how a Transformer moves information between tokens) and sets up The Final Step (how the final layer reads an answer out of the vocabulary — a relentless search for the best next word). This note is about the thing both the attention machinery and that final read-out quietly depend on: the vocabulary embeddings — what they are, where they come from, and how to look at a space with hundreds of dimensions without fooling yourself. It is written for a reader who is not a computer scientist; the technical terms (token, residual stream, PCA, softmax, …) are collected in the Glossary at the end, and each is also explained where it is first used.

1. What an embedding is — and the chicken-and-egg of finding it

Everyone who has met a language model has met the idea of an embedding: each word (more precisely, each token) is a point in a high-dimensional space, and “nearby” points are supposed to mean “similar” things. The picture is so common that it is easy to skip the question that actually matters for using these models well: where do those vectors come from?

There is a tempting wrong answer, and it is worth naming because it is the mental model most people import from vector databases. In a vector database — the engine behind semantic search and retrieval-augmented generation — you take an already-trained encoder, push each document through it, and store the vector it produces. The encoder is fixed; the vectors are read off it; “similarity” is a property of a model someone else trained earlier. The embedding is an input to your system.

Inside a language model, embeddings work the other way around. They are not downloaded from a standard generator and they are not fixed in advance. They are parameters — the rows of the embedding matrix — and they are learned jointly with the rest of the model, by the same gradient descent that trains the attention and MLP weights. There is a chicken-and-egg quality to this that is the whole point:

The embedding of a word is whatever vector makes this particular model predict well. The model shapes the embeddings; the embeddings shape the model; they are solved for together.

So when GPT-2 places “king” at some location in its 768-dimensional space, that location is not a statement about the timeless meaning of king. It is a statement about what direction is useful for GPT-2’s next-token predictions, given everything else GPT-2’s weights are simultaneously doing. Train a different model on different data and you get different embeddings, even for the same word.
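The chicken-and-egg can be watched in miniature. The sketch below is a deliberately tiny stand-in, not the note's model: a four-word vocabulary, two training pairs, hand-picked starting values, and an output matrix `U` kept separate from the embedding matrix `E` to keep the gradients short. It shows only that the embedding rows are ordinary parameters, moved by the same gradient descent as everything else.

```python
import math

vocab = ["king", "rules", "lion", "roars"]
pairs = [(0, 1), (2, 3)]   # (current word, next word): "king rules", "lion roars"
d = 2                      # embedding length

# Hypothetical small starting values; a real model would start from random ones.
E = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]]   # input embeddings, one row per word
U = [[0.0, 0.1], [0.1, 0.0], [0.0, -0.1], [-0.1, 0.0]]   # output vectors (untied for simplicity)

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [x / total for x in exps]

def train_epoch(lr=0.5):
    """One pass of gradient descent on next-word cross-entropy; returns the mean loss."""
    loss = 0.0
    for cur, nxt in pairs:
        e = E[cur][:]                                    # embedding lookup = reading a row
        logits = [sum(U[w][k] * e[k] for k in range(d)) for w in range(len(vocab))]
        p = softmax(logits)
        loss += -math.log(p[nxt])
        err = [p[w] - (1.0 if w == nxt else 0.0) for w in range(len(vocab))]
        grad_e = [sum(err[w] * U[w][k] for w in range(len(vocab))) for k in range(d)]
        for w in range(len(vocab)):                      # the model's other weights move...
            for k in range(d):
                U[w][k] -= lr * err[w] * e[k]
        for k in range(d):                               # ...and so does the embedding row
            E[cur][k] -= lr * grad_e[k]
    return loss / len(pairs)

start = [row[:] for row in E]
losses = [train_epoch() for _ in range(200)]
print("loss:", round(losses[0], 3), "->", round(losses[-1], 3))
print("king moved from", start[0], "to", [round(x, 2) for x in E[0]])
```

Where the vector for king ends up depends on the starting values, the data, and the output weights it was trained alongside, which is the sense in which the location is not a statement about the timeless meaning of the word.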

A fair objection: aren’t the vectors in a vector database also learned? They are — and being precise about this softens the contrast in the right way, because the real difference is not “learned versus not learned.” The encoder behind a vector database is itself a trained neural network — OpenAI’s text-embedding-3, the open sentence-transformers family, or token-level late-interaction retrievers like ColBERT are all optimised models. What differs is the objective they were optimised for. A retrieval encoder is trained so that the similarity between two pieces of text tracks their relevance — exactly the quantity a search index needs. A language model’s embeddings are trained so that the model predicts the next token. Same mechanism (vectors solved for by gradient descent), different goal — and a different geometry results. So the lesson of this section is not “LM embeddings are special because they are learned,” but the sharper: every embedding is shaped by the task it was trained for; there is no neutral, universal embedding you simply look up. The vector you get for king depends on whether you asked “what retrieves documents about kings?” or “what predicts the word after king?”

This is the first thing to internalise, because it explains everything that follows — including why the most natural-seeming way to build embeddings by hand, dictating the axes yourself, does not work.

	Vector-database embedding	Language-model embedding
Where it comes from	a pre-trained encoder you call	a parameter learned during training
Fixed or learned?	learned by the encoder, then frozen at lookup	learned jointly with all other weights
What “similar” means	a property of the encoder’s training	whatever helps this model predict next tokens
Role in the system	an input (you store it)	an output of training (the model owns it)
Can you choose the axes?	no (encoder decides)	no — and §2 shows that trying breaks the model

2. The pivot: a cautionary tale about wiring the meaning by hand

Because embeddings are “just” vectors with interpretable neighbourhoods, a very reasonable engineer’s instinct is: why not build them myself? If I know that words differ along a few common-sense axes — gender, age, power, size, whether they are animate, to name a few — why not assign each word a short vector on those axes, feed a small model some sentences, and let it learn the rest? The axes would be human-readable, the space would be tidy, and interpretability would come for free.

This note’s running example was built precisely to test that instinct, and the result is the most useful thing in it. I designed a tiny vocabulary (~40 content words: king, queen, lion, lioness, stag, mountain, pebble, ship…) and a six-dimensional hand-wired latent space (gender, age, power, size, animacy, mobility), with each word assigned a coordinate on every axis (king = male/adult/strong/—/animate/mobile; pebble = —/—/—/small/inanimate/still; and so on). From that latent table I generated a corpus of simple definition and comparison sentences (“the king is older than the …”, “a lion is bigger than a …”), and trained a small Transformer (2 layers, 2 heads, 32-dimensional embeddings) on it. The model was never shown the latent table: the experiment was whether, trained on data whose every regularity is a function of those six axes, the model would recover them in its own learned embedding.

It did not. Here is the picture that tells the story.

[Figure: Hand-wired vs learned embedding space]

On the left is the hand-wired latent space (the six designed dimensions, projected to 2-D by PCA). It is exactly what you would hope for: clean, structured, the three classes (human / animal / thing) separated, the discrete lattice of designed coordinates visible — its top two principal components capture 70% of the variance, because we put the structure there. On the right is the model’s learned 32-dimensional embedding for the same words, projected the same way. The tidy structure is gone: the classes interleave, the top two PCs capture only 20% of the variance, and — the giveaway — the famous word2vec analogy fails. Computing king − man + woman and asking for the nearest word returns lioness (cosine 0.42), with child and cub next; queen is not in the top five. The model has not learned a clean “gender axis”; it has learned a diffuse blob of feminine-animate mass and a separate male-animate blob, and the arithmetic lands wherever those blobs happen to sit.
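For reference, the analogy test is the following computation. The vectors below are hypothetical hand-wired coordinates on three clean axes (gender, power, animal-ness), chosen so that the arithmetic succeeds by construction; they are not the model's learned embeddings, where the same procedure returned lioness.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Rank words by cosine similarity to a - b + c, excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scored = {w: cosine(target, v) for w, v in vectors.items() if w not in (a, b, c)}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Hypothetical coordinates: (gender, power, animal-ness).
vectors = {
    "king":    [1.0, 1.0, 0.0],
    "queen":   [-1.0, 1.0, 0.0],
    "man":     [1.0, 0.0, 0.0],
    "woman":   [-1.0, 0.0, 0.0],
    "lioness": [-1.0, 0.5, 1.0],
}

ranked = analogy("king", "man", "woman", vectors)
for word, score in ranked:
    print(f"{word:8s} cosine {score:.2f}")
```

On axes this clean the answer is queen with cosine 1.0. The test only works when a concept really does live on a single shared direction, which is exactly what the learned space declined to provide.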

By every acceptance test we set in advance, the experiment failed: principal components flat (50% cumulative over six PCs, against a 75% target), zero of six linear probes able to read a latent axis back out at the threshold we had set, and only two of five designed analogies recovered. So why does hand-wiring the meaning fail so thoroughly?

2a. The child and the draughtsman

A child asked to draw a person assembles a kit of symbols: a circle for the head, two dots for eyes, a triangle for the nose, a line for the mouth, sticks for the limbs. Each part is a discrete token, drawn on its own, sitting in its own patch of paper, meaning one thing. The picture is a sum of labelled pieces.

A trained draughtsman does the opposite. A single line traces the underside of the eye and, without lifting, becomes the bridge of the nose; one shadow gives the cheekbone and the eye socket at once; whole regions are set not by the outline of a “part” but by tone. No stroke belongs to one feature, and no feature lives in one stroke. The face emerges from the configuration of marks as a whole — which is the central lesson of Betty Edwards’ Drawing on the Right Side of the Brain: the beginner’s real obstacle is cognitive, not manual. The mind insists on what it already knows — “an eye is an almond with a circle in it” — and stamps that symbol onto the page, drowning out what the eye reports. To draw, you switch off the symbol-maker and let the subject come out of the whole.

This is exactly the difference between the two ways of holding meaning in a vector. When I fabricated embeddings by hand — one axis for gender, one for age, one for power, one for size — I was drawing like the child: one feature per symbol, each concept walled off on its own dedicated dimension, the vocabulary a kit of labelled parts. It trained poorly, and the failure was not a bug to fix; it was the model refusing to draw like a child. A set of embeddings that works is drawn like the artist: each direction in the space serves several meanings at once, and each meaning is spread across many directions. Concepts are not parked on private axes — they are shared out over the available strokes, and the word comes out of the whole.

Hand-wired embeddings draw a face the way a child does: a symbol per part. Learned embeddings draw it the way an artist does: every line doing several jobs at once, no part living in a single stroke.

2b. From symbols to distribution and superposition

The analogy is worth pinning exactly where it holds. It explains why a working code is distributed rather than symbolic — meaning smeared across the strokes instead of filed under labels. It does not, on its own, explain the stronger and more specific fact that a model stores more features than it has dimensions, tolerating a little interference to do so. That step is not about artistry but about geometry: it is possible only because a high-dimensional space has room for very many almost-non-overlapping directions. The drawing is the intuition; superposition is the mechanism — which is where we turn next.
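The geometric claim can be probed numerically. The sketch below is a minimal illustration, not part of this note's tooling: it draws random directions (an assumption made for convenience, since a trained model chooses its directions rather than sampling them) and reports the worst overlap between any two of them. The function names, the seed and the sizes (200 directions, 8 versus 128 dimensions) are all hypothetical choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalising a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(n_vectors, dim, seed=0):
    """Largest |cosine| between any two of n random unit vectors in `dim` dimensions."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    worst = 0.0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            cos = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            worst = max(worst, abs(cos))
    return worst


# In both cases there are more directions (200) than dimensions.
overlap_low_dim = max_abs_cosine(200, 8)
overlap_high_dim = max_abs_cosine(200, 128)
print(f"worst overlap in   8 dims: {overlap_low_dim:.2f}")
print(f"worst overlap in 128 dims: {overlap_high_dim:.2f}")
```

In the low-dimensional case some pair of directions ends up nearly parallel; in the higher-dimensional case even the worst pair stays far from parallel, which is the room that superposition exploits.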

Back to our question: why does hand-wiring the meaning fail? There are four linked reasons, and together they are the core lesson of this whole note.

Hand-wiring forces axis-aligned concepts — one meaning per dimension. My design said “dimension 1 is gender, dimension 3 is power.” But a learned embedding has no reason to put one concept on one coordinate axis. Which brings us to:
Real features are distributed, and packed in superposition. Gradient descent does not store “gender” on axis 1. It stores many features as directions — often many more directions than there are dimensions — squeezed in at angles chosen so they interfere with each other as little as possible. A direction that means “feminine” can be a diagonal combination of dozens of coordinates, sharing space with hundreds of other such diagonals. This packing of more features than dimensions is called superposition, and it is the normal state of affairs in a self-organising model trained to minimise empirical error, not a pathology.
Descent wants freedom, not your axes. The reason it packs things this way is that it is optimising for prediction, and the geometry that predicts best is the one that places each useful direction where interference is lowest — not the one a human finds legible. Dictating the axes removes exactly the freedom the optimiser needs. A handful of clean symbolic dimensions simply cannot supply the statistical degrees of freedom a model uses to fit language.
It is the same phenomenon as distributed coding. Neuroscientists have long argued that a concept in the brain is carried by a population of neurons rather than a single “grandmother cell”; interpretability researchers find the same in networks — individual units are usually not cleanly interpretable. The reason you cannot read meaning off a single dimension of a real embedding is the same reason hand-wiring one meaning per dimension fails: meaning does not live on the coordinate axes. It lives in directions, and the axes are an arbitrary basis.
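The packing described in these four points can be seen in the smallest possible case: three features stored in two dimensions. This is a hedged sketch, not the model from this note. It loosely follows the ReLU read-out used in Toy Models of Superposition (Elhage et al., 2022) with the bias term left out, and the directions are placed by hand at 120 degrees rather than learned. All names are hypothetical.

```python
import math

N_FEATURES = 3   # features to store
N_DIMS = 2       # dimensions available (fewer than features)

# One unit direction per feature, spread evenly around the circle (120 degrees apart).
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(features):
    """Compress a feature vector into N_DIMS numbers: a weighted sum of directions."""
    x = sum(f * d[0] for f, d in zip(features, directions))
    y = sum(f * d[1] for f, d in zip(features, directions))
    return (x, y)


def decode(hidden):
    """Read each feature back with a dot product, clipping negatives to zero (ReLU)."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in directions]


one_active = decode(encode([1, 0, 0]))   # sparse input: only feature 0 is on
two_active = decode(encode([1, 1, 0]))   # features 0 and 1 on together

print([round(v, 2) for v in one_active])
print([round(v, 2) for v in two_active])
```

With a single active feature the read-out is clean, because the interference on the other two directions is negative and gets clipped. With two features active at once each is read back at reduced strength: that is the "little interference" the model tolerates in exchange for storing more features than it has dimensions.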

The lesson, stated plainly: if you want embeddings that work, you have to give up wiring the meaning by hand. The price of a model that predicts well is a space whose coordinates are not individually meaningful. (This particular toy also illustrates a narrower trap — a corpus mechanically generated from a feature spec gives the model nothing to do but memorise the templates, so it never has to infer the latent axes at all. But the deeper point stands for real models trained on real text: their useful structure is distributed, not axis-aligned.)

Is this too strong? It is worth checking against the literature before generalising, because the assertion is forceful — and three well-known results sharpen it rather than overturn it. First, sparse autoencoders trained on a model’s activations (Anthropic’s Towards Monosemanticity, 2023; Cunningham et al., 2023) do succeed in recovering clean, human-readable features — but they recover them as directions teased out of the superposition, which is precisely the claim that the meaning is present yet not aligned with the raw axes. Second, injecting hand-built structure is not worthless: retrofitting word vectors toward a lexicon or ontology (Faruqui et al., 2015) measurably improves them — but it adjusts already-learned vectors, it does not replace learning with a dictated table, which is the thing that fails. Third, the difficulty is principled: recovering clean, axis-aligned factors without strong inductive biases is provably impossible in the unsupervised case (Locatello et al., 2019). So the careful statement is not “structure never helps” — it sometimes does — but: you cannot hand-author the whole embedding and freeze out the model’s freedom to place features where prediction wants them, and what a model learns will be distributed rather than one-concept-per-axis. That is the robust core, and it is what the toy demonstrates.

3. What real embeddings are actually like

So the dimensions are not individually meaningful, and there are a lot of them: their number is 768 in GPT-2 small, 4096 in a typical 7-billion-parameter model, 12,288 in GPT-3 (175 billion parameters, one of the most-studied LLMs), and larger still in frontier models. Two consequences follow immediately, and they set up the rest of the note.

Reading single dimensions is hopeless, partly because there are hundreds or thousands of them (far too many to eyeball), and more importantly because, per §2, an individual coordinate usually means nothing. The interesting structure is in combinations of dimensions, that is, in directions.
Therefore we need compressed views. To see a high-dimensional space we must squash it down to two or three dimensions we can plot. And here is the part that is easy to forget: every way of squashing is a choice of what to preserve, and therefore a choice of which question you are asking. A 2-D map is never “the” embedding space; it is an answer to one question about it. Pick the wrong method for your question and the map will mislead you with total confidence.

The next section lays out the three families of method you will meet, what question each one answers, and — just as important — what each one quietly destroys.

4. Three ways to look at an embedding space

The three workhorses are concept directions (you bring the meaning), PCA (the data brings the directions of greatest spread), and t-SNE/UMAP (the data brings local neighbourhoods, nonlinearly). They differ on every axis that matters:

	Concept directions / Gram-Schmidt	PCA	t-SNE / UMAP
Supervised?	Yes — you choose the axes	No	No
Linear?	Yes	Yes	No
A true projection?	Yes (onto chosen axes)	Yes (onto top eigenvectors)	No — a learned 2-D embedding
Readable directions / arithmetic?	Yes — that is the point	Sort of (PC axes have a sign)	No — axes mean nothing
What it preserves	your chosen contrasts	global variance	local neighbourhoods only
What it destroys	everything off your plane	small-variance structure	global geometry, distances, density
Best question	“where do words fall on this meaning?”	“what are the biggest directions of spread?”	“what clusters together locally?”

To make the comparison concrete, here is one embedding space, shown all three ways at once: the 28-token vocabulary of our trained dialogue-game model (a tiny word-level Transformer, 2 layers × 4 heads, 64-dimensional embeddings, that plays a turn-taking “calling game”: Pietro chiama Paolo, with epithets that depend on who calls whom). The vocabulary splits into roles: players (Pietro, Paolo, the numbers 1–8), epithets (Tarso, Cefa, capo, vice), verbs (chiama, perde), absurd distractor words, and special tokens.

The same 28-token vocabulary under all three methods, left to right: panel (a) concept directions / Gram–Schmidt (axes Paolo − Pietro, peer − subordinate epithets), panel (b) PCA, panel (c) t-SNE.

Same 28 points, three pictures, three different stories. Now each method in turn.

4a. Concept directions — you bring the meaning

The most honest method is also the most opinionated: you decide what the axes mean. You pick a direction in embedding space that stands for a concept, either by naming a token (“the direction of Tarso”) or, more usefully, by taking a difference of tokens (“Paolo minus Pietro” = the which-leader direction), then orthogonalise a second concept against the first (Gram-Schmidt) so the two axes are independent, and read off where every word lands. The leftmost panel (a) of the figure above uses $e_1$ = (Paolo − Pietro) and $e_2$ = (peer epithets − subordinate epithets): the epithets fly to the corners exactly as their roles predict.

This is the king − man + woman ≈ queen world, and it is the method this project already uses elsewhere — applied not to the vocabulary but to the residual stream, the running vector the model updates token by token (see the QKV note). Because our model ties its embedding and unembedding matrices (The Final Step develops why this makes the dictionary the model searches literally the embedding matrix), moving the residual toward a token’s embedding direction is raising that token’s probability. So we can build a Gram-Schmidt plane from two embedding rows and watch the prediction travel across it. The next figure is a separate, single-panel plot — not one of the three panels above, and built on a different pair of axes:

Residual-stream trajectory in a concept plane (a single-panel figure, distinct from the three-panel figure above). Here $e_1$ is the Tarso direction and $e_2$ is the callee Paolo orthogonalised against it, chosen to track the prediction of Tarso.

For the prompt Pietro chiama Paolo, the model must emit Paolo’s epithet, Tarso. Note this is a different concept plane from panel (a): here $e_1$ is the Tarso direction itself and $e_2$ is the callee Paolo with its Tarso-component removed (Gram-Schmidt again, on a different pair of rows), chosen because we are now tracking the prediction of Tarso, not laying out the whole vocabulary by leader and register. The residual starts near the called name and migrates across the plane toward the Tarso direction as the layers run (the $e_1$ coordinate climbs from −3.1 to +10.6), and decomposing the path by which write moved it shows that the attention block in layer 0 does the +7.62 push toward the epithet. That pins the behaviour to a specific component, the same kind of causal claim the QKV note’s ablation appendix makes. The point for this note: concept directions are supervised. They show you exactly what you ask about and nothing else.
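The tied-weights claim, that moving the residual toward a token's embedding row raises that token's score, can be checked on a toy. The sketch below is an illustration under stated assumptions, not the dialogue-game model: it borrows the four 3-dimensional vectors of Appendix A as the embedding matrix, skips the final layer normalisation a real Transformer applies, and starts from an all-zero residual. The function names are hypothetical.

```python
import math

# Toy embedding matrix (one row per token), reusing the Appendix A vectors.
EMBEDDING = {
    "king": (2.0, 1.0, 0.0),
    "queen": (2.0, -1.0, 0.0),
    "man": (1.0, 1.0, 1.0),
    "woman": (1.0, -1.0, 1.0),
}


def next_token_probs(residual):
    """Tied weights: score the residual against every embedding row, then softmax."""
    logits = {
        tok: sum(r * e for r, e in zip(residual, row))
        for tok, row in EMBEDDING.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}


def push_toward(residual, token, step):
    """Add `step` times the token's embedding row to the residual."""
    return tuple(r + step * e for r, e in zip(residual, EMBEDDING[token]))


start = (0.0, 0.0, 0.0)                  # uninformative residual: all tokens tie
moved = push_toward(start, "queen", 1.0)

p_before = next_token_probs(start)
p_after = next_token_probs(moved)
print({tok: round(p, 3) for tok, p in p_before.items()})
print({tok: round(p, 3) for tok, p in p_after.items()})
print("top token after the push:", max(p_after, key=p_after.get))
```

The push always raises the target token's logit (by the step times the squared length of its row). Whether its probability also rises depends on how the other rows overlap with it; in this toy it does, and queen becomes the top-ranked token.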

That last clause is the catch, and the numbers make it honest. The concept plane in panel (a) captures only 15% of the vocabulary’s spread, and the residual plane captures 28% of the trajectory’s spread (against a best-possible 72% — see §4b). A supervised plane is chosen for meaning, not for variance, so it generally is not where the data spreads most. That is a feature, not a bug — but it means you are seeing your hypothesis, not the data’s own structure. Caveat: the famous analogies are partly fragile and cherry-picked (the literature has known this for a decade — see §7); they are cleanest on classic static word embeddings and patchier inside trained language models, as our own king − man + woman → lioness already warned.
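The "share of spread captured by a plane" figures are simple to compute: centre the points, project them onto the two orthonormal axes, and divide the squared length that survives by the total. A minimal sketch follows, using the four toy vectors of Appendix A rather than the model's real embeddings; the function names and the choice of a second comparison plane are hypothetical.

```python
import math

WORDS = {
    "king": (2.0, 1.0, 0.0),
    "queen": (2.0, -1.0, 0.0),
    "man": (1.0, 1.0, 1.0),
    "woman": (1.0, -1.0, 1.0),
}


def centred(points):
    """Subtract the mean vector so that spread is measured around the centroid."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(len(points[0]))]
    return [[p[i] - mean[i] for i in range(len(mean))] for p in points]


def plane_fraction(points, e1, e2):
    """Share of total variance captured by the plane spanned by orthonormal e1, e2."""
    rows = centred(points)
    total = sum(x * x for row in rows for x in row)
    in_plane = sum(
        sum(a * b for a, b in zip(row, e1)) ** 2
        + sum(a * b for a, b in zip(row, e2)) ** 2
        for row in rows
    )
    return in_plane / total


points = list(WORDS.values())
s = 1 / math.sqrt(2)

# Plane A: the Appendix A concept axes (gender, royalty).
frac_concept = plane_fraction(points, (0.0, 1.0, 0.0), (s, 0.0, -s))
# Plane B: another orthonormal pair, raw coordinate 1 and raw coordinate 3.
frac_raw = plane_fraction(points, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))

print(round(frac_concept, 3), round(frac_raw, 3))
```

The hand-built toy is the flattering case: its concept plane holds all of the spread, because the space was designed around it, while the other plane holds a third. In a learned space a plane chosen for meaning behaves more like the second case, which is what the 15% and 28% figures report.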

4b. PCA — the data’s own biggest directions

PCA asks a different, unsupervised question: along which directions does the data spread most? It finds them by an eigen-decomposition of the covariance of the centred data (in practice a single svd call on the centred matrix), and projects onto the top two. Unlike concept directions, you bring no hypothesis; unlike t-SNE, it is a genuine linear projection: the axes have a fixed meaning, you can read coordinates off them, and (with care) do arithmetic.

Panel (b) above is the PCA of our 28 embeddings; its top two components hold 34.7% of the variance, top four 55.3%. That single number — “only a third of the structure fits in the best possible 2-D linear view” — is itself the most useful thing PCA tells you, and it is honest in a way a t-SNE plot never is: it quantifies how much you are not seeing.
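For readers who want to see where a "variance explained" percentage comes from, here is a dependency-free sketch. It is not the project's pipeline, which uses an svd call: to stay within the Python standard library it swaps in power iteration with deflation, which finds the same leading directions on small, well-behaved data. The input is the four toy vectors of Appendix A, and all function names are hypothetical.

```python
import math

WORDS = {
    "king": (2.0, 1.0, 0.0),
    "queen": (2.0, -1.0, 0.0),
    "man": (1.0, 1.0, 1.0),
    "woman": (1.0, -1.0, 1.0),
}


def centred(points):
    """Subtract the mean vector from every point."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(len(points[0]))]
    return [[p[i] - mean[i] for i in range(len(mean))] for p in points]


def scatter_matrix(rows):
    """Unnormalised covariance: sum over points of the outer product row * row^T."""
    dim = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]


def top_eigenpair(matrix, iterations=200):
    """Power iteration: apply the matrix repeatedly and renormalise.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    dim = len(matrix)
    v = [1.0 / (k + 1) for k in range(dim)]   # arbitrary, deliberately asymmetric start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * matrix[i][j] * v[j] for i in range(dim) for j in range(dim))
    return value, v


def pca_variance_fractions(points, n_components=2):
    """Fraction of total variance carried by each of the top principal components."""
    s = scatter_matrix(centred(points))
    dim = len(s)
    total = sum(s[i][i] for i in range(dim))   # trace = total spread
    fractions = []
    for _ in range(n_components):
        value, vec = top_eigenpair(s)
        fractions.append(value / total)
        # Deflate: remove the direction just found so the next pass finds the runner-up.
        s = [[s[i][j] - value * vec[i] * vec[j] for j in range(dim)] for i in range(dim)]
    return fractions


fractions = pca_variance_fractions(list(WORDS.values()))
print([round(f, 3) for f in fractions], "top-2 total:", round(sum(fractions), 3))
```

On the toy the top two components carry everything, since the four words lie in a plane by construction. The 34.7% reported for the 28 real embeddings is the same quantity computed on data that does not fit in a plane.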

But PCA has a caveat that bites constantly and is widely under-appreciated: variance is not meaning. The directions of greatest spread are very often boring — token frequency, vector norm, or positional artefacts — rather than clean semantics. In word embeddings this is so reliable that a standard preprocessing trick (“All-but-the-Top”, §7) is to delete the top few PCs because they encode frequency. The practical defence, which our tooling also offers, is to project the rows onto the unit sphere first (a “cosine” view): that removes the magnitude/frequency effect and lifts our 2-D fraction from 34.7% to 43.5%, placing rare tokens by their direction instead of letting their small norm collapse them to the centre.
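The "cosine" view is a one-line preprocessing step: divide each row by its own length before doing anything else. The sketch below shows its effect on three made-up rows (hypothetical numbers, not taken from the model): a large-norm token, a small-norm token pointing the same way, and an unrelated token. It assumes no row is all zeros.

```python
import math


def unit_rows(rows):
    """Scale every row to length 1, keeping its direction and discarding its norm."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row])
    return out


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical rows: same direction with very different norms, plus an unrelated one.
frequent = [6.0, 8.0]
rare = [0.3, 0.4]
other = [8.0, -6.0]

raw_gap = distance(frequent, rare)
f_unit, r_unit, o_unit = unit_rows([frequent, rare, other])
cosine_gap = distance(f_unit, r_unit)
unrelated_gap = distance(f_unit, o_unit)

print(round(raw_gap, 2), round(cosine_gap, 2), round(unrelated_gap, 2))
```

Before normalisation the two same-direction tokens sit far apart purely because of their lengths; afterwards they coincide, and only the genuinely different direction remains distant. PCA run on the normalised rows therefore spends its variance on direction rather than on magnitude.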

4c. t-SNE (and UMAP) — local neighbourhoods, with loud caveats

t-SNE answers a third question: which points are each other’s nearest neighbours? It is nonlinear and local — it tries to keep neighbours together while caring nothing about anything else — and it usually produces the prettiest, most cluster-y pictures, which is exactly why it is the most dangerous. Four caveats deserve to be stated loudly, because every one of them is routinely violated in practice:

It is not a projection. There are no axes, so a t-SNE plot has no readable directions and supports no arithmetic. “Right” and “up” mean nothing. (UMAP, the popular faster cousin, shares this — and additionally is not designed to preserve global structure either, despite a common belief otherwise.)
Inter-cluster distances are meaningless. Two clusters drawn far apart are not “more different” than two drawn close. The gaps between blobs carry no information.
It is stochastic. A different random seed gives a different layout. Here is the same 28-token data, same perplexity, two seeds:

t-SNE seed sensitivity

Paolo lands top-left in one run and bottom-centre in the other; the global arrangement reshuffles entirely. If your conclusion would change with the seed, it was never a conclusion about the data.
It is perplexity-sensitive. The one knob (roughly, “how many neighbours count as local”) changes the picture qualitatively; there is no single right value, and small datasets are especially unstable. The classic, still-essential reference is Wattenberg, Viégas & Johnson, “How to Use t-SNE Effectively” (Distill, 2016) — read it before you trust any t-SNE plot, your own included.

Used within its remit — “what clusters with what, locally” — t-SNE is genuinely useful. Read as a map with meaningful axes and distances, it is a confident liar.

5. Capstone — does the intuition survive contact with real language?

Our dialogue game has 28 words and a hand-built rule. Do the same intuitions hold when we scale up to a model trained on something language-like? To check, I took a real TinyStories-trained GPT-2 (segestic/Tinystories-gpt-0.1-3m — a small model trained on the TinyStories corpus of simple children’s stories, with a 50,257-token vocabulary and, conveniently, the same 64-dimensional embeddings as our toy), curated 84 readable whole-word tokens across clear semantic groups (animals, people, nature, colours, verbs, adjectives, function words), and ran the same two unsupervised methods.

TinyStories capstone — PCA and t-SNE

Two things survive the jump to a real corpus. First, the warnings hold: PCA captures only 33% of the variance in 2-D — the same “most of the structure is off-plane” story as the toy — and the function words (grey) peel off along the first PC, a textbook case of frequency dominating the top component (§4b). Second, structure does appear, but softly: in both views you can see animals loosely grouping, function words separating, colours and adjectives drifting together — real, but smeared, not the clean clusters a t-SNE plot’s prettiness might tempt you to over-read. The lesson scales exactly: the maps are useful for orientation, never for measurement, and they get harder to read, not easier, as the model becomes more operationally relevant.

6. How this connects to the rest of the series

This note and its two companions describe one machine from three angles, and they share one matrix:

The vectors visualised here are the columns of the unembedding matrix that the model scores the final residual against to pick a next token. With tied weights — as in both our toy models — the embedding matrix and the unembedding are the same numbers. So this note is “what the searched dictionary looks like,” and The Final Step is “how the search works.” The maps in §4 are maps of the very dictionary the model’s last step looks words up in.
The QKV note explains how attention assembles the residual vector that then gets scored. Fig 5 here is the bridge: it watches that residual move through a concept-direction plane built from the embedding rows, and attributes the decisive move to one attention block — the visualization technique of §4a applied to the mechanism of the QKV note.

The through-line: an embedding is a direction in a space the model owns; attention moves the residual through that space; the unembedding searches the space for the nearest word. Visualisation is how we — who cannot see in 768 dimensions — get a partial, question-shaped glimpse of where everything sits.

7. Where this sits in the literature (an honesty box)

None of the findings here are new, and a specialist would recognise every piece; the contribution is expository and the live, perturbable spreadsheet models behind the figures. To locate the material honestly:

Concept directions and analogies go back to word2vec (Mikolov et al., 2013), whose king − man + woman ≈ queen is the origin of the whole “linear semantics” picture. That picture is real but partly fragile: later work (Levy & Goldberg, 2014; Linzen, 2016) showed the analogies are sensitive to normalisation and to excluding the input words, and are easy to cherry-pick, which is why our in-model analogy landed on lioness, not queen.
The idea that concepts are linear directions in representation space is the linear representation hypothesis (e.g. Park, Choe & Veitch, 2023, and a long interpretability lineage). That distributed features are packed more than one per dimension is superposition, made precise in Anthropic’s Toy Models of Superposition (Elhage et al., 2022) — the formal version of §2’s lesson and of “single dimensions aren’t interpretable.”
Does hand-built structure ever help? (the §2 counter-example check.) With care, yes — and none of it rescues hand-wiring. Sparse autoencoders recover monosemantic feature directions from superposition (Bricken et al., Towards Monosemanticity, 2023; Cunningham et al., 2023) — confirming meaning lives in directions, not axes; retrofitting nudges already-learned vectors toward a lexicon to improve them (Faruqui et al., 2015); and the impossibility of unsupervised axis-aligned disentanglement without inductive bias is a theorem (Locatello et al., 2019). Together these say the §2 failure is about dictating and freezing the embedding, not about structure being useless.
The PCA caveat that top components track frequency/norm rather than clean semantics is well documented; the “delete the top components” fix is All-but-the-Top (Mu & Viswanath, 2018).
The t-SNE caveats are from van der Maaten & Hinton (2008) and, for practice, Wattenberg, Viégas & Johnson’s How to Use t-SNE Effectively (Distill, 2016); UMAP is McInnes, Healy & Melville (2018), with Coenen & Pearce’s Understanding UMAP as the matching cautionary companion.
Honesty caveat on the toy: the dialogue-game model is reproduced from the public ToyDialogueGames exercise; the novelty is the transparent, cell-by-cell spreadsheet realisation, the epithet/binding extension, and this exposition for a non-CS audience — not the toy or the methods themselves. The TinyStories capstone uses a community checkpoint (segestic/Tinystories-gpt-0.1-3m); my own from-scratch TinyStories replication is awaiting hardware.

Appendix A: a Gram-Schmidt projection you can check by hand

Concept-direction maps look like magic but are three dot products. Take a toy 3-dimensional embedding with four words:

king   = ( 2,  1,  0)
queen  = ( 2, -1,  0)
man    = ( 1,  1,  1)
woman  = ( 1, -1,  1)

Suppose we want a “gender” axis and a “royalty” axis. Define them as token differences and build an orthonormal plane.

Axis 1, gender, as (king − queen):

v1 = king − queen = (0, 2, 0)        e1 = v1/‖v1‖ = (0, 1, 0)

So e1 is just the second coordinate. Good: in this toy, dimension 2 happens to carry gender (the male words have +1, the female words −1).

Axis 2, royalty, as (king − man), then Gram-Schmidt against e1:

v2  = king − man = (1, 0, -1)
v2·e1 = (1,0,-1)·(0,1,0) = 0          (already orthogonal to e1)
e2  = v2/‖v2‖ = (1, 0, -1)/√2 ≈ (0.71, 0, -0.71)

Project each word onto the (e1, e2) plane, two dot products per word:

        e1 (gender)          e2 (royalty)
king    (2,1,0)·e1 =  1      (2,1,0)·e2 = (2−0)/√2 ≈  1.41
queen   (2,-1,0)·e1 = -1     (2,-1,0)·e2 ≈  1.41
man     (1,1,1)·e1 =  1      (1,1,1)·e2 = (1−1)/√2 =  0.00
woman   (1,-1,1)·e1 = -1     (1,-1,1)·e2 =  0.00

Plotted, that is exactly the parallelogram the analogy promises:

   royalty (e2)
   1.41 |  queen ●         ● king
        |
   0.00 |  woman ●         ● man
        +------------------------- gender (e1)
          -1                +1

king − man + woman = (2,1,0) − (1,1,1) + (1,-1,1) = (2,-1,0) = queen, exactly, because we built a space where it works. The sobering content of §2 is that a model trained to predict text does not build such a space for you; it builds whatever predicts best, and the clean parallelogram is the exception, not the rule. Visualisation lets you look for the parallelograms and, just as importantly, lets you measure how often they are not there.
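The hand calculation in this appendix can be replayed in a few lines. This is a minimal sketch in Python (the project's own pipeline is the Julia script listed at the end of the note, so the helper names here are hypothetical): it builds the two axes by Gram-Schmidt, projects the four words, and repeats the analogy arithmetic.

```python
import math

WORDS = {
    "king": (2.0, 1.0, 0.0),
    "queen": (2.0, -1.0, 0.0),
    "man": (1.0, 1.0, 1.0),
    "woman": (1.0, -1.0, 1.0),
}


def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)


def gram_schmidt_plane(v1, v2):
    """Return orthonormal axes: e1 along v1, e2 along the part of v2 orthogonal to e1."""
    e1 = normalise(v1)
    overlap = dot(v2, e1)
    e2 = normalise(tuple(b - overlap * a for a, b in zip(e1, v2)))
    return e1, e2


gender = sub(WORDS["king"], WORDS["queen"])   # v1 = king - queen
royalty = sub(WORDS["king"], WORDS["man"])    # v2 = king - man
e1, e2 = gram_schmidt_plane(gender, royalty)

coords = {word: (dot(vec, e1), dot(vec, e2)) for word, vec in WORDS.items()}
for word, (g, r) in coords.items():
    print(f"{word:6s} gender={g:+.2f} royalty={r:+.2f}")

# The analogy: king - man + woman, compared with queen.
analogy = tuple(k - m + w for k, m, w in zip(WORDS["king"], WORDS["man"], WORDS["woman"]))
print(analogy, analogy == WORDS["queen"])
```

Swapping in rows from a trained embedding matrix is the interesting experiment: the same three dot products apply, but the four points will generally no longer form a clean parallelogram.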

Appendix B: Glossary

For readers who would like the basics or a refresher. Terms are grouped roughly by where they appear.

Token, vocabulary

A token is the unit of text the model reads — here a whole word; in production models usually a sub-word piece. The vocabulary is the fixed set of all possible tokens (28 in our dialogue game, 50,257 in GPT-2/GPT-3).

Embedding

The vector of real numbers that represents a token. The embedding matrix has one row per vocabulary token; “looking up” a token means reading its row. The rows are parameters — numbers learned during training — not values fetched from elsewhere (§1).

Encoder (retrieval / vector database)

A separately-trained model that turns a piece of text into one vector so that similar texts get nearby vectors. Used to fill a vector database for semantic search. Examples: OpenAI text-embedding-3, sentence-transformers, ColBERT. It is trained for relevance similarity, a different objective from next-token prediction (§1).

Gradient descent

The training procedure: nudge every parameter a little in the direction that reduces the model’s error, repeat millions of times. It is what “learns” the embeddings.

Residual stream

The running vector the Transformer keeps for each position and updates layer by layer; each attention/MLP block adds its output to it. The model’s prediction is read from the final residual. (See the QKV note.)

Attention head

A sub-mechanism inside a layer that, for each position, decides how much to read from every earlier position. Our toy has 2 layers × 4 heads. “Which head does what” is the central question of mechanistic interpretability.

Logits, softmax

The model’s raw output scores over the vocabulary are logits. Softmax turns them into a probability distribution (exponentiate, then normalise to sum 1).

Tied weights (embedding / unembedding)

Using the same matrix to map tokens→vectors (input) and vectors→token-scores (output, the unembedding). Standard for small models; it means the “dictionary” the model searches at the end is the embedding matrix (§6).

Cosine similarity

A measure of how aligned two vectors are: the cosine of the angle between them (+1 = same direction, 0 = orthogonal, −1 = opposite). Ignores length, so it compares direction only.

PCA (Principal Component Analysis)

An unsupervised method that finds the directions along which the data spreads most (the principal components), computed from an eigen-decomposition / SVD of the centred data. Projecting onto the top two gives a 2-D map; the variance explained (e.g. “34.7%”) is the fraction of the data’s total spread those two directions capture — a built-in honesty meter (§4b).

Eigenvector / SVD

The linear-algebra machinery PCA runs on: the singular value decomposition (SVD) factorises the data matrix and hands back the principal directions and how much variance each carries. You do not need the details — just that one svd call yields the PCA axes.

Projection

Mapping high-dimensional points onto a lower-dimensional plane by taking dot products with chosen axes. PCA and concept-direction maps are true projections (the axes keep a fixed meaning); t-SNE is not (§4).

Gram-Schmidt / orthonormal

A recipe for turning two chosen direction vectors into a clean perpendicular (orthonormal) pair of axes, so the two coordinates you read off are independent. Used to build concept-direction planes (§4a, Appendix A).

Concept direction

An axis you choose to stand for a meaning — either a token’s own direction or a difference of tokens (e.g. Paolo − Pietro = “which leader”). Supervised: you bring the meaning (§4a).

t-SNE, UMAP, perplexity

t-SNE and UMAP are nonlinear methods that place points so that local neighbourhoods are preserved, producing cluster-y 2-D pictures. They are not projections: the axes, the distances between clusters, and the global layout carry no meaning, and the result changes with the random seed. Perplexity is t-SNE’s main knob, controlling roughly how many neighbours count as “local” (§4c).

Distributed representation / superposition

Distributed: a concept is carried by a pattern across many dimensions, not one. Superposition: a model packs more feature-directions into a space than it has dimensions, at angles chosen to minimise interference — which is why the axes are not individually meaningful (§2).

Linear representation hypothesis

The empirical idea that many high-level concepts correspond to straight-line directions in representation space — the reason concept-direction maps and analogies (king − man + woman) work at all, when they do (§4a, §7).

Sparse autoencoder (SAE)

A tool that decomposes a model’s activations into many sparse, often human-interpretable feature directions — used to read meaning out of superposition without dictating it in advance (§2, §7).

Ablation

Switching off one component (e.g. one attention head) and re-measuring behaviour, to test what it causally does — as opposed to what its attention pattern looks like. Fig 5’s “which write moved the residual” is the same spirit (§4a; the QKV note’s appendix).

References

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., … Olah, C. (2023). Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html

Coenen, A., & Pearce, A. (n.d.). Understanding UMAP. Google PAIR. https://pair-code.github.io/understanding-umap/

Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse autoencoders find highly interpretable features in language models. arXiv. https://arxiv.org/abs/2309.08600

Edwards, B. (2012). Drawing on the right side of the brain (4th ed.). TarcherPerigee.

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., … Olah, C. (2022). Toy models of superposition. Transformer Circuits Thread. https://transformer-circuits.pub/2022/toy_model/index.html

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2015). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT 2015 (pp. 1606–1615). Association for Computational Linguistics. https://arxiv.org/abs/1411.4166

Levy, O., & Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL) (pp. 171–180). Association for Computational Linguistics. https://aclanthology.org/W14-1618/

Linzen, T. (2016). Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP (pp. 13–18). Association for Computational Linguistics. https://arxiv.org/abs/1606.07736

Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning (ICML) (pp. 4114–4124). https://arxiv.org/abs/1811.12359

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv. https://arxiv.org/abs/1802.03426

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv. https://arxiv.org/abs/1301.3781

Mu, J., & Viswanath, P. (2018). All-but-the-top: Simple and effective postprocessing for word representations. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1702.01417

Park, K., Choe, Y. J., & Veitch, V. (2023). The linear representation hypothesis and the geometry of large language models. arXiv. https://arxiv.org/abs/2311.03658

van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86), 2579–2605. https://jmlr.org/papers/v9/vandermaaten08a.html

Wattenberg, M., Viégas, F., & Johnson, I. (2016). How to use t-SNE effectively. Distill. https://distill.pub/2016/misread-tsne/

Models and tooling behind the figures (all in the julia-impromptu project): the trained dialogue-game model DialogueGame-Tiny-Epithet-Trained.json with its live vocab_map and residual_trajectory reports; the failed hand-wired SemanticTiny.json; the projection pipeline tools/project_embeddings_note.jl (PCA via svd, concept axes via Gram-Schmidt, t-SNE via TSne.jl — one source of truth for every coordinate) and the renderer tools/render_embeddings_note.py.

Transformers and QKV Attention: A Primer

Luca Erzegovesi — Sat, 30 May 2026 22:00:00 GMT

First in the series “Understanding LLMs to use them better in management and finance.” The three notes describe one machine — a Transformer language model — from three complementary angles. This note opens the machine up and lays out the attention machinery: how a Transformer moves information between words. The second note is about the embeddings — the vocabulary of vectors the machine reads from and writes to, and how to look at a space with hundreds of dimensions without fooling yourself. The third note watches the final step — the moment all that work collapses into a single next word, which turns out to be a relentless search. Read in order, the three move from mechanism, to dictionary, to read-out.

A word on how to read this one. Think of it as the technical booklet that ships with an electronic appliance: it aims to give an accessible but complete view of the machinery and its inner workings. The end user normally need not open the booklet; the technician must. But in business applications of AI the boundary between technician and end user is blurring fast — anyone who deploys these models in management or finance increasingly has to look inside the machine to judge what it can and cannot reliably do. This first note is accordingly longer and more abstract than the two that follow, which are focused and example-driven. The architecture it describes is the Transformer, introduced by Vaswani et al. (2017) in the paper whose title became a slogan, “Attention Is All You Need”; the technical terms are collected in Appendix C and the references at the very end.

1. Transformers: an evolutionary step from neural networks

Before opening up the attention machinery, it helps to see where a Transformer comes from. It is not an exotic invention out of nowhere; it is one more step in a long line of neural networks that all do the same basic thing — convert input data, step by step, into a representation whose position in some space encodes the answer. What changes from one architecture to the next is how that conversion is organized. Attention is the organizational idea that made networks good at language.

A network you may already picture: the OCR convolutional net

Think of a classic optical-character-recognition (OCR) network — the kind that reads a handwritten digit or letter from a small image. Its workhorse is the convolution. The input is a grid of pixels. A small filter (say a patch of weights) slides across the image; at each location it multiplies the pixels under it by its weights and sums them into a single number. Sweeping the filter over the whole image produces a feature map — a new grid that lights up wherever the filter’s little pattern (an edge, a curve, a stroke) is present. A layer has many such filters, so it produces many feature maps. Stack a few convolution layers (with pooling in between to shrink the grid), and the network builds up from edges → strokes → loops → whole-character shapes.

It is worth stating the underlying operation precisely, because a clean version of it carries over to attention. A convolution is a weighted sum over a small local window of the input: at each position the filter computes a dot product between its fixed weights and the pixels it currently covers. (It is a weighted sum, not strictly an average: the weights need not sum to one and are often negative, which is how a filter can reward ink in one place and penalise it in another.) Two things are then layered on top. First, the same weights are reused at every position (weight sharing), so the operation is really one small pattern-detector swept across the whole image. Second, a layer stacks many such detectors and a later layer takes weighted combinations of their feature maps, and it is those combination weights, learned by training, that select which mixtures of low-level features best predict the correct character. So the common intuition that a convolution is a weighted sum that “selects the useful combinations” is right once it is split in two: each convolution is a local weighted sum, and the cross-feature mixing that selects the useful combinations happens when later layers combine feature maps. (Appendix B works a full convolution through by hand.)
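As a minimal sketch of the "local weighted sum, same weights at every position" idea, the snippet below slides one filter over a tiny image. The 4×4 image, the 3×3 filter and the function name `conv2d_valid` are all hypothetical, chosen only so the arithmetic is easy to follow; real OCR networks add padding, strides, many filters and learned weights.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1): one dot product per position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Hypothetical 4x4 "image": a vertical stroke of ink (1) in column 1.
image = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
# Hypothetical 3x3 filter: rewards ink in its middle column, penalises ink on either side.
kernel = [
    [-1, 2, -1],
    [-1, 2, -1],
    [-1, 2, -1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is large where the stroke sits under the filter's middle column and negative where it sits under a penalised column, which is the sense in which negative weights let a filter discriminate.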

At the very end the features are flattened and a softmax classifier turns them into a probability over the possible characters.

flowchart LR
    A["pixels<br/>(grid of ink/blank)"] -->|"filters slide<br/>shared 3×3 weights<br/>local receptive field,<br/>reused at every position"| B["feature maps<br/>one grid per filter:<br/>edges, strokes, ..."]
    B -->|"flatten +<br/>linear layer"| C["softmax over<br/>26 letters"]
    C --> D["a .02<br/>...<br/>e .91 ◄ answer<br/>..."]

Two properties make this work for images. First, each filter has a fixed, local receptive field: it only ever looks at a small neighborhood of pixels. Second, the same filter is reused at every position (weight sharing), so a stroke is recognized the same way wherever it sits on the page. Both properties are exactly right for images, because the clues needed to recognize a stroke are local and their meaning does not depend on where on the page they appear.

Why language breaks this, and why we need attention

Language does not have those two convenient properties. The piece of context a word depends on can be right next to it or hundreds of words back — and which earlier words matter depends on the content, not on a fixed offset. To resolve “it” you must find the noun it refers to, and that noun could be anywhere. A fixed, local filter cannot do this: it always looks in the same small window, regardless of what the sentence is about.

Attention is the fix. Instead of a fixed local window, an attention head computes, on the fly and for each token, how much to pull from every other token — and then pulls. The “receptive field” is no longer fixed by the architecture; it is decided at runtime, from the content, by the query–key matching we will detail in Section 2. In one sentence:

An attention head is the language analogue of a convolutional filter, except its receptive field is learned, dynamic, and content-dependent instead of fixed and local.

That single change — a receptive field the data gets to choose, every time — is what let neural networks finally handle long-range, content-driven dependencies, and it is the heart of the Transformer.

It is important not to over-claim, though: attention is the new idea, not the only idea. A Transformer interleaves two kinds of block. The attention heads are the novelty just described — the content-driven, dynamic receptive field that moves information between positions. Alongside them sit ordinary MLP (multi-layer perceptron) blocks, the same fully-connected, learned weighted-combination machinery that does the cross-feature mixing in a CNN — except here each MLP refines one token’s representation in place, with no reference to its neighbours. The two blocks divide the labour cleanly: attention is the only channel that lets tokens talk to each other, while the MLP is where each token’s representation is reshaped on its own. Sections 5 and 6 build both explicitly, and Appendix A shows what is lost if you switch the attention off and keep only the MLP.

What the output data actually is

It pays to be precise about what these networks produce, because the same description fits both the OCR net and the language model.

In both cases the network turns its input into a representation: a vector whose position in a high-dimensional space encodes the answer. In OCR, the final feature vector’s location says “this image sits in the region of the space that means the letter e.” In a Transformer, the inputs are pieces of text — tokens — and each token is represented by an embedding: a vector of numbers giving its coordinates in a high-dimensional “language space,” where direction and proximity stand in for meaning. The vector the model processes and maintains for each token — the residual stream, formally introduced in Section 4 — is best read as a modified embedding: it starts life as the raw token embedding and each layer nudges it to a new position that encodes everything the model has worked out about that token in its context.

So the output data are representations of entities, describing each entity’s most likely position in a solution space. And from that position the model reads out one of two things:

a single crisp solution: the one best answer (take the highest-scoring class, i.e. the $\arg\max$, equivalently temperature $T \to 0$); or
a probability distribution over solutions — a graded set of plausible answers with weights.

The readout is the same machine in both networks: a linear projection followed by a softmax. OCR projects the final feature vector onto the alphabet and gets a distribution over letters; a language model projects the final token representation onto the vocabulary and gets a distribution over next tokens. Same structure, different solution space. (Section 5 shows exactly how the final representation’s direction encodes which tokens are likely and its magnitude encodes how confident the model is — the geometric version of “position in a solution space.”)

The logic of the processing: params convert inputs into entity representations

How does the input get from raw symbols to these context-aware representations? The recipe is the through-line of this whole note:

1. Look up. Each input token ID is replaced by its embedding, a first, context-free guess at its position in the space (a row of the embedding matrix $E$).
2. Read, transform, write, repeatedly. A stack of parameterized blocks then reads the current representations and writes adjustments back. Two kinds of block alternate:
- Attention blocks move information between tokens (the dynamic receptive field above), letting each token’s representation absorb what it needs from the others.
- MLP blocks refine each token’s representation in place, one position at a time (detailed in Section 6).
3. Read out. After $L$ such rounds (model layers) the representation is “finished,” and the unembedding projects it into the solution space to give the crisp answer or the distribution.
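The recipe can be sketched as a loop over a residual stream. Everything here is a stand-in under stated assumptions: `mixing_block` only imitates what attention does (moving information between positions) and `per_token_block` only imitates the MLP (acting on each row alone); the two-token vocabulary and the function names are hypothetical.

```python
def forward(token_ids, embedding, blocks, read_out):
    """Look up, let each block add its adjustment to the residual stream, then read out."""
    h = [list(embedding[t]) for t in token_ids]             # 1. look up
    for block in blocks:                                     # 2. read, transform, write
        update = block(h)
        h = [[a + b for a, b in zip(row, upd)] for row, upd in zip(h, update)]
    return read_out(h[-1])                                   # 3. read out (last token only)

def mixing_block(h):
    """Stand-in for attention: each position averages itself and the earlier positions."""
    out = []
    for i in range(len(h)):
        visible = h[: i + 1]
        out.append([sum(col) / len(visible) for col in zip(*visible)])
    return out

def per_token_block(h):
    """Stand-in for the MLP: acts on each row alone (here, just a ReLU)."""
    return [[max(0.0, x) for x in row] for row in h]

embedding = {0: [1.0, 0.0], 1: [0.0, 1.0]}                   # hypothetical 2-token vocabulary
prediction = forward([0, 1], embedding, [mixing_block, per_token_block],
                     read_out=lambda v: v.index(max(v)))
print(prediction)
```

The design point to notice is that blocks never overwrite `h`; they return an update that is added to it, which is what lets later analysis ask which block contributed what.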

The parameters (the weight matrices in attention and in the MLP) are frozen after training. They are the learned knowledge: they encode the rule for how to move each representation, step by step, from a bare embedding toward its correct final position. Training is just the search for parameter values that put every entity in the right place in the solution space.

Why this framing matters for what follows. Because all the “thinking” lives in these incrementally-modified representations flowing through the residual stream, we can later ask very sharp questions about a trained model: where is a particular piece of behavior carried, and which block put it there? In small, fully-understood models one can even disable a single block and watch a specific behavior appear or vanish — the basis of the interpretability experiments these notes are meant to accompany (see Appendix A). The rest of this note builds the mechanism precisely enough to make those questions answerable.

2. The QKV trio

Every token, at every attention layer, produces three vectors derived from the token’s current hidden state via three learned weight matrices ($W_Q$, $W_K$, $W_V$):

Q (query) — “what am I looking for?”
K (key) — “what do I offer to be matched against?”
V (value) — “what information do I carry, if matched?”

Attention is the operation that uses these three to mix information across tokens. For a given token’s query Q, the model computes a similarity score against every previous token’s K (a dot product, scaled, then softmaxed into weights). Those weights are then used to take a weighted sum of the corresponding V vectors. The result is the attention output for that token — a blend of values from earlier tokens, weighted by how well their keys matched this token’s query.

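In compact form (Q, K, and V are the corresponding W applied to the tokens’ embeddings, see Section 4):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

This formula is per head: $Q, K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$, giving an output in $\mathbb{R}^{n \times d_v}$. In multi-head attention, the same formula is applied $n_{\text{heads}}$ times in parallel on independent Q/K/V slices, and the resulting blocks are concatenated and then mixed by $W_O$ into the residual.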

So Q and K together produce the attention pattern (who attends to whom, and how strongly), and V is what actually flows along those attention edges: the attention weights select and blend the V vectors. V is the payload; QK is the routing.
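Seen from a single token, the whole operation is short enough to write out. The sketch below computes the attention output for one query against the keys and values it is allowed to see; the function name `attend` and the 2-dimensional vectors are hypothetical, and a real implementation would do this for all positions at once with matrix products.

```python
import math

def attend(query, keys, values):
    """Attention output for one token: blend `values` by how well `keys` match `query`."""
    d_k = len(query)
    # Scaled dot-product score against each visible (earlier or current) token.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Softmax turns scores into weights that sum to 1 (max subtracted for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of value vectors: the payload routed along the attention edges.
    d_v = len(values[0])
    output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_v)]
    return weights, output

# Hypothetical 2-dimensional vectors for a 3-token context.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]   # query of the newest token
weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights], [round(o, 3) for o in output])
```

The query lines up with the first and third keys, so those two values dominate the blend; the second token still contributes, only with a smaller weight.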

3. Why only K and V are cached, not Q

During generation, when the model is producing token $n + 1$:

It needs the Q of the new token only — Q is computed fresh each step and discarded; it has no use beyond this single attention computation.
It needs the K and V of every previous token, because the new token’s Q has to attend back to all of them.

That’s why the cache is called the KV cache and not “QKV cache”. K and V are the historical state that accumulates as the conversation grows; Q is ephemeral, recomputed and thrown away every step.

This also explains the asymmetry between prefill (prompt processing) and generation (prompt continuation). During prefill, Q is computed for every input token too — but it’s used immediately to compute attention for that token and then discarded. Only K and V get written into the cache for future reuse. Generation is just prefill-of-one-token, repeated, with the cache growing by one row of K and one row of V at each layer per step.
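A minimal sketch of that bookkeeping, for one head in one layer: each step appends one key row and one value row to the cache and uses the query once. The class name `KVCache`, the identity projections and the token vectors are all hypothetical; production caches are preallocated tensors, not Python lists.

```python
import math

def project(x, W):
    """Row vector times matrix."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

class KVCache:
    """Store of every past token's key and value (one head, one layer)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, W_Q, W_K, W_V):
        # K and V of the new token are appended and kept for all future steps.
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        # Q is used once, right here, and never stored.
        q = project(x, W_Q)
        scale = math.sqrt(len(q))
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in self.keys])
        return [sum(w * v[c] for w, v in zip(weights, self.values))
                for c in range(len(self.values[0]))]

# Hypothetical 2x2 identity projections and three token vectors.
I2 = [[1.0, 0.0], [0.0, 1.0]]
cache = KVCache()
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    out = cache.step(x, I2, I2, I2)
    print(len(cache.keys), [round(o, 3) for o in out])
```

Prefill is the same loop run over the prompt tokens; nothing about the cache distinguishes prompt rows from generated rows.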

4. Dimensionality: a complete inventory

Let me define the symbols once and then track every shape through the network.

Symbol	Meaning	Typical value (GPT-2 small)	Typical value (modern 7B)
$n_{\text{vocab}}$	vocabulary size	50,257	32,000–150,000
$L$	number of transformer layers	12	32
$n$	sequence length (tokens in the context)	up to 1024	up to 128K
$d_{\text{model}}$	residual stream / embedding dimension	768	4096
$n_{\text{heads}}$	number of attention heads	12	32
$d_k = d_v$	per-head Q/K/V dimension, often $d_{\text{model}} / n_{\text{heads}}$	64	128
$d_{\text{ff}}$	MLP hidden dimension, usually $4\,d_{\text{model}}$	3072	11008

Model parameters (frozen weights)

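Weight	Shape	What it does
Token embedding $E$	$n_{\text{vocab}} \times d_{\text{model}}$	maps token IDs to vectors
Positional embedding (if any)	$n \times d_{\text{model}}$	adds positional information
Per layer $\ell$: $W_Q$	$d_{\text{model}} \times (n_{\text{heads}} \cdot d_k)$	projects hidden → all heads’ Q
Per layer: $W_K$	$d_{\text{model}} \times (n_{\text{heads}} \cdot d_k)$	projects hidden → all heads’ K
Per layer: $W_V$	$d_{\text{model}} \times (n_{\text{heads}} \cdot d_v)$	projects hidden → all heads’ V
Per layer: $W_O$	$(n_{\text{heads}} \cdot d_v) \times d_{\text{model}}$	mixes head outputs back to residual
Per layer: MLP	$d_{\text{model}} \times d_{\text{ff}}$, $d_{\text{ff}} \times d_{\text{model}}$	non-linear transformation
Final LayerNorm	$d_{\text{model}}$ (per scale/shift vector)	normalization scale/shift
Unembedding $W_U$ (often $E^\top$)	$d_{\text{model}} \times n_{\text{vocab}}$	maps final hidden → logits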

Total parameters, excluding the embeddings, scale roughly as $12\,L\,d_{\text{model}}^2$ (the approximation used by Kaplan et al.).
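The rule of thumb can be checked against the shapes in the table. The sketch assumes the usual approximation of about $12\,d_{\text{model}}^2$ weights per layer: four square matrices in attention plus an MLP four times as wide as the residual stream, ignoring biases, LayerNorm and embeddings. The function name is hypothetical.

```python
def approx_block_params(n_layers, d_model):
    """Non-embedding estimate: 4*d^2 for attention + 8*d^2 for a 4x-wide MLP, per layer."""
    attention = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
    mlp = 2 * d_model * (4 * d_model)          # up- and down-projection
    return n_layers * (attention + mlp)

gpt2_small = approx_block_params(n_layers=12, d_model=768)   # sizes from the symbol table
print(f"{gpt2_small:,}")
```

For GPT-2 small this gives roughly 85 million weights in the blocks; the token embedding matrix accounts for most of the rest of the model's published size.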

Activations (recomputed every forward pass)

Activation	Shape	Notes
Input token IDs	$n$	integers in $[0, n_{\text{vocab}})$
Hidden state / residual stream $h$	$n \times d_{\text{model}}$	the running “meaning” per token, threaded through layers
$Q$, $K$, $V$ per head	$n \times d_k$ each, per head	computed from $h$ via $W_Q$, $W_K$, $W_V$
Attention scores	$n \times n$ per head	the routing pattern
Attention weights (post-softmax)	$n \times n$ per head	row-stochastic, lower-triangular (causal mask)
Attention output (single head)	$n \times d_v$	weighted sum of V rows
Attention output (all heads concat)	$n \times (n_{\text{heads}} \cdot d_v)$	usually $n \times d_{\text{model}}$
Output of $W_O$ added to residual	$n \times d_{\text{model}}$	written back into the residual stream
MLP output	$n \times d_{\text{model}}$	also written back into the residual stream
Final hidden state (after layer $L$, LayerNorm)	$n \times d_{\text{model}}$	input to the unembedding
Logits	$n \times n_{\text{vocab}}$	one row per token position
Logits of the last token	$n_{\text{vocab}}$	the one that matters for next-token prediction
Probability distribution over vocabulary	$n_{\text{vocab}}$, sums to 1	softmax of last-token logits

From last-token logits to the next-token distribution

After the final layer, the hidden state of the last position is multiplied by the unembedding matrix:

$$\text{logits} = h_{\text{last}} \, W_U \in \mathbb{R}^{n_{\text{vocab}}}$$

These are the logits: one real number per vocabulary token, unnormalized. Softmax converts them into probabilities:

$$P(\text{token}_i) = \frac{\exp(\text{logit}_i / T)}{\sum_{j} \exp(\text{logit}_j / T)}$$

where $T$ is the sampling temperature. At $T \to 0$ this becomes greedy ($\arg\max$). At $T = 1$ you get the raw model distribution. The sampler then draws the next token from this distribution, appends it to the sequence, and the loop continues.
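To see what the temperature does, here is a small sketch of the softmax with the logits divided by a temperature. The three logits and the function name are hypothetical; greedy decoding is the limit of small temperature, which the code approaches but cannot reach because it divides by the temperature.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(logits / T); T must be > 0 (greedy decoding is the limit T -> 0)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                   # hypothetical logits for a 3-token vocabulary
for T in (1.0, 0.5, 0.1):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

Lowering the temperature leaves the ranking untouched and moves probability mass toward the top-ranked token, which is why low-temperature sampling behaves almost deterministically.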

KV cache size

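The KV cache has total size:

$$\text{KV cache bytes} = 2 \times L \times n_{\text{heads}} \times d_k \times n \times \text{bytes per value}$$

The factor 2 is for K and V together. For a 7B model with $L = 32$, $n_{\text{heads}} = 32$, $d_k = 128$, at FP16 (2 bytes), 32K context:

$$2 \times 32 \times 32 \times 128 \times 32{,}768 \times 2 \text{ bytes} = 17{,}179{,}869{,}184 \text{ bytes} = 16\ \text{GiB}$$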

This is why long-context inference is memory-hungry, and why DeepSeek’s Multi-head Latent Attention (which compresses K and V into a low-rank latent space) is such a big deal — it can cut this by an order of magnitude.
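The arithmetic is easy to reproduce. The sketch below assumes one sequence (batch size 1), one key/value head per attention head (no grouped-query sharing), and reads "32K" as 32 × 1024 tokens; the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, n_tokens, bytes_per_value=2):
    """2 (K and V) x layers x heads x per-head dim x tokens x bytes per stored number."""
    return 2 * n_layers * n_heads * d_head * n_tokens * bytes_per_value

size = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, n_tokens=32 * 1024)
print(size, "bytes =", size / 2**30, "GiB")
```

Because the size is linear in the number of tokens, doubling the context doubles the cache, and every concurrent conversation needs its own copy.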

5. A worked example: tiny model, tiny vocabulary

Let me build a deliberately small model so every matrix fits on a page. We’ll watch one attention step end-to-end.

Setup

Vocabulary size $n_{\text{vocab}} = 20$. Say the vocabulary is {the, cat, dog, sat, ran, on, mat, floor, big, small, red, blue, fast, slow, a, and, ., is, was, .EOS} — 20 tokens indexed 0–19.
Embedding dimension $d_{\text{model}} = 5$.
One attention head, so $d_k = d_v = d_{\text{model}} = 5$.
One layer (we’ll ignore MLPs for clarity).
Context: 3 tokens. We’re going to compute attention for the sequence the cat sat, token IDs $[0, 1, 3]$.

Step 1: token embedding

The embedding matrix $E$ is $20 \times 5$. After training it might look like (showing only the three rows we need):

E[0]  = [ 0.10, -0.20,  0.05,  0.40,  0.15]   "the"
E[1]  = [ 0.30,  0.50, -0.10,  0.20,  0.00]   "cat"
E[3]  = [-0.40,  0.10,  0.60, -0.30,  0.20]   "sat"

After looking up these three rows, the input to layer 1 is a $3 \times 5$ matrix: three tokens, each as a 5-dimensional embedding.

X = [[ 0.10, -0.20,  0.05,  0.40,  0.15],
     [ 0.30,  0.50, -0.10,  0.20,  0.00],
     [-0.40,  0.10,  0.60, -0.30,  0.20]]

This is the initial residual stream $h^0$, shape $3 \times 5$.

Step 2: compute Q, K, V

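The weight matrices $W_Q$, $W_K$, $W_V$ are each $5 \times 5$ here.

General shape: $d_{\text{model}} \times d_k$ for $W_Q$, $W_K$ and $d_{\text{model}} \times d_v$ for $W_V$. In this toy example we have one head with $d_k = d_v = d_{\text{model}} = 5$, so the shape collapses to $5 \times 5$. In a real multi-head model the per-head slice has shape $d_{\text{model}} \times d_k$, not $d_{\text{model}} \times d_{\text{model}}$; the equality holds only after concatenating all heads.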

is also called the IN matrix.

Let’s say:

W_Q = [[ 1.0,  0.0,  0.0,  0.5,  0.0],
       [ 0.0,  1.0,  0.0,  0.0,  0.5],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.5,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.5,  0.0,  0.0,  1.0]]

W_K = [[ 0.8,  0.2,  0.0,  0.0,  0.0],
       [ 0.2,  0.8,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.9,  0.1,  0.0],
       [ 0.0,  0.0,  0.1,  0.9,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]

W_V = [[ 0.5,  0.5,  0.0,  0.0,  0.0],
       [ 0.5, -0.5,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]

Compute $Q = X W_Q$, $K = X W_K$, $V = X W_V$. Each is $3 \times 5$ in this single-head example. In general, per head, the shapes are $n \times d_k$ for $Q$ and $K$, and $n \times d_v$ for $V$. In the three matrices below, each row is a token and each column indexes one coordinate of the per-head Q/K/V vector.

Q = [[ 0.300, -0.125,  0.050,  0.450,  0.050],   # for "the"
     [ 0.400,  0.500, -0.100,  0.350,  0.250],   # for "cat"
     [-0.550,  0.200,  0.600, -0.500,  0.250]]   # for "sat"

K = [[ 0.040, -0.140,  0.085,  0.365,  0.150],   # for "the"
     [ 0.340,  0.460, -0.070,  0.170,  0.000],   # for "cat"
     [-0.300,  0.000,  0.510, -0.210,  0.200]]   # for "sat"

V = [[-0.050,  0.150,  0.050,  0.400,  0.150],   # for "the"
     [ 0.400, -0.100, -0.100,  0.200,  0.000],   # for "cat"
     [-0.150, -0.250,  0.600, -0.300,  0.200]]   # for "sat"

(Shown to three decimals, which is exact for these inputs; the principle is what matters.)
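The three projections are plain matrix products, so they can be recomputed mechanically from $X$ and the weight matrices printed above. The helper `matmul` is a hypothetical name for a small list-of-lists product, used here only to keep the sketch dependency-free.

```python
def matmul(A, B):
    """Plain row-by-column matrix product for small lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[ 0.10, -0.20,  0.05,  0.40,  0.15],   # "the"
     [ 0.30,  0.50, -0.10,  0.20,  0.00],   # "cat"
     [-0.40,  0.10,  0.60, -0.30,  0.20]]   # "sat"

W_Q = [[1.0, 0.0, 0.0, 0.5, 0.0],
       [0.0, 1.0, 0.0, 0.0, 0.5],
       [0.0, 0.0, 1.0, 0.0, 0.0],
       [0.5, 0.0, 0.0, 1.0, 0.0],
       [0.0, 0.5, 0.0, 0.0, 1.0]]
W_K = [[0.8, 0.2, 0.0, 0.0, 0.0],
       [0.2, 0.8, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.9, 0.1, 0.0],
       [0.0, 0.0, 0.1, 0.9, 0.0],
       [0.0, 0.0, 0.0, 0.0, 1.0]]
W_V = [[0.5,  0.5, 0.0, 0.0, 0.0],
       [0.5, -0.5, 0.0, 0.0, 0.0],
       [0.0,  0.0, 1.0, 0.0, 0.0],
       [0.0,  0.0, 0.0, 1.0, 0.0],
       [0.0,  0.0, 0.0, 0.0, 1.0]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
for name, M in (("Q", Q), ("K", K), ("V", V)):
    print(name)
    for row in M:
        print("  ", [round(v, 3) for v in row])
```

Each row of the result depends only on the same row of $X$: the projections never mix tokens, which is why mixing has to wait for the attention step.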

A fourth weight matrix, $W_O$ (the OUT matrix), is also a learned parameter of the attention block. Its job is to project the concatenated per-head attention outputs back into the residual stream’s dimension and (in multi-head attention) to mix information across heads.

General shape: $(n_{\text{heads}} \cdot d_v) \times d_{\text{model}}$. Here, with one head and $d_v = 5$, it collapses to $5 \times 5$. Note: $n_{\text{heads}} \cdot d_v$ is the concatenated head-output dimension, which in most architectures equals $d_{\text{model}}$ by design; that’s why $W_O$ usually looks like a square $d_{\text{model}} \times d_{\text{model}}$ matrix in implementations.

W_O = [[ 1.0,  0.0,  0.0,  0.0,  0.0],
       [ 0.0,  1.0,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]

(For simplicity we’ve set $W_O$ to the identity matrix here, meaning the attention output passes through unchanged. In a trained model $W_O$ would be a learned dense matrix that performs the head-mixing and projection described above.)

Step 3: attention scores

Compute $Q K^\top$. This is a $3 \times 3$ matrix where entry $(i, j)$ is the dot product of token $i$’s query with token $j$’s key: a similarity score between tokens (not between heads), measuring how strongly token $i$ wants to attend to token $j$.

QK^T = [[ 0.21,  0.12, -0.15],
        [ 0.10,  0.43, -0.19],
        [-0.14, -0.22,  0.63]]

Now divide by $\sqrt{d_k} = \sqrt{5} \approx 2.236$:

QK^T / sqrt(5) = [[ 0.092,  0.053, -0.067],
                  [ 0.046,  0.193, -0.087],
                  [-0.064, -0.099,  0.280]]

Step 4: causal mask + softmax

Since we’re doing autoregressive language modeling, each token can only attend to itself and earlier tokens. We mask the upper triangle to $-\infty$ (so softmax gives them weight 0):

masked = [[ 0.092,  -inf,  -inf],
          [ 0.046,  0.193,  -inf],
          [-0.064, -0.099,  0.280]]

Apply softmax row by row:

attention_weights = [[1.000, 0.000, 0.000],
                     [0.463, 0.537, 0.000],
                     [0.296, 0.286, 0.418]]

Read this carefully: the third row says that when computing the output for “sat”, the model attends roughly 30% to “the”, 29% to “cat”, and 42% to itself. This is the attention pattern, the “who attends to whom” that QK produced.

Step 5: weighted sum of V

Multiply $\text{attention\_weights} \cdot V$. Per head this is $(n \times n) \cdot (n \times d_v) = n \times d_v$. With one head and $d_v = 5$, the result is $3 \times 5$:

attention_output = [[-0.050,  0.150,  0.050,  0.400,  0.150],   # "the" attends only to itself
                    [ 0.192,  0.016, -0.031,  0.293,  0.069],   # "cat" blends "the" + "cat"
                    [ 0.037, -0.089,  0.237,  0.050,  0.128]]   # "sat" blends all three

The third row is the interesting one: it’s a weighted blend (roughly 30%/29%/42%) of the three V vectors. This is the V “payload” being routed along the attention edges that QK set up. The output for “sat” now incorporates information drawn from “the” and “cat”; this routing is the mechanism that lets a model capture long-range dependencies.
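Steps 3 to 5 fit in one short function. This sketch starts from Q, K and V recomputed at full precision from $X$ and the weight matrices printed in Step 2 (that starting point is an assumption of the sketch), applies the causal mask by simply not scoring later positions, and returns both the weights and the blended outputs. The function name is hypothetical.

```python
import math

# Q, K, V recomputed from X and W_Q, W_K, W_V at full precision (rows: the, cat, sat).
Q = [[ 0.30, -0.125, 0.05,  0.45,  0.05],
     [ 0.40,  0.50, -0.10,  0.35,  0.25],
     [-0.55,  0.20,  0.60, -0.50,  0.25]]
K = [[ 0.04, -0.14,  0.085, 0.365, 0.15],
     [ 0.34,  0.46, -0.07,  0.17,  0.00],
     [-0.30,  0.00,  0.51, -0.21,  0.20]]
V = [[-0.05,  0.15,  0.05,  0.40,  0.15],
     [ 0.40, -0.10, -0.10,  0.20,  0.00],
     [-0.15, -0.25,  0.60, -0.30,  0.20]]

def causal_attention(Q, K, V):
    d_k = len(Q[0])
    weights, outputs = [], []
    for i, q in enumerate(Q):
        # Token i may only look at positions 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k) for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(Q) - i - 1)
        weights.append(row)
        outputs.append([sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))])
    return weights, outputs

weights, outputs = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
for row in outputs:
    print([round(o, 3) for o in row])
```

Skipping the masked positions is equivalent to setting their scores to minus infinity before the softmax: either way they receive weight zero.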

(With multiple heads, you’d have $n_{\text{heads}}$ such blocks, one per head, computed in parallel from independent slices. They get concatenated along the last axis into a single $n \times (n_{\text{heads}} \cdot d_v)$ tensor before the next step. Section 6 carries out exactly this multi-head case numerically.)

Step 6: through $W_O$, back into the residual stream

The concatenated attention output (here just one head, so “concatenation” is a no-op) is projected by $W_O$:

$$h^1 = h^0 + \text{attention\_output} \cdot W_O$$

Shape arithmetic: $(n \times n_{\text{heads}} d_v) \cdot (n_{\text{heads}} d_v \times d_{\text{model}}) = n \times d_{\text{model}}$. In our single-head toy, that’s $(3 \times 5) \cdot (5 \times 5) = 3 \times 5$.

This is the residual addition: the attention output is added to the previous residual stream, not replacing it. The hidden state $h^1$ is still shape $3 \times 5$.

In a real model, MLP layers would now run on top, also reading from and writing to the residual stream. We’d repeat the whole thing $L$ times. Here we have one layer, so we go straight to the output. (Section 6 adds the MLP and a second layer explicitly.)

Step 7: final hidden state → logits → next-token distribution

We only care about the next token, so we take the last row of $h^1$: a single 5-vector, the final hidden state for “sat”. With $W_O$ set to the identity, the residual addition gives the embedding of “sat” plus its attention output:

h_last = [-0.363,  0.011,  0.837, -0.250,  0.328]

Multiply by the unembedding matrix $W_U$, shape $5 \times 20$ (one column per vocabulary token):

$$\text{logits} = h_{\text{last}} \, W_U$$

Result might look like:

logits = [-0.8, -0.2,  0.1,  0.5,  1.2,  2.4,  3.1,  2.8,  0.4, -0.1,
          -0.3,  0.0, -0.5, -0.4,  0.6,  0.2,  1.0, -0.2,  0.3, -1.5]
            the  cat  dog  sat  ran   on   mat floor  big small  red ...

The model thinks “on” (logit 2.4), “mat” (3.1), “floor” (2.8) are the most likely continuations of “the cat sat”. Applying softmax at temperature 1:

P("mat"   | "the cat sat") ≈ 0.31
P("floor" | "the cat sat") ≈ 0.23
P("on"    | "the cat sat") ≈ 0.16
P(others)                  ≈ remainder

This is the conditional distribution that the language model has been trained to approximate. The sampler picks a token from this — greedy would pick “mat” — appends it to the sequence, and the next forward pass begins.
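The read-out on these twenty logits can be reproduced directly. The sketch below applies the softmax at temperature 1, lists the three most probable tokens, and contrasts the greedy pick with one sampled pick. The fixed random seed is only there to make reruns repeatable and is not part of any model.

```python
import math
import random

vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat", "floor", "big", "small",
         "red", "blue", "fast", "slow", "a", "and", ".", "is", "was", ".EOS"]
logits = [-0.8, -0.2, 0.1, 0.5, 1.2, 2.4, 3.1, 2.8, 0.4, -0.1,
          -0.3, 0.0, -0.5, -0.4, 0.6, 0.2, 1.0, -0.2, 0.3, -1.5]

top = max(logits)
exps = [math.exp(z - top) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked[:3]:
    print(f"{token:6s} {p:.2f}")

greedy = vocab[probs.index(max(probs))]                 # what temperature -> 0 would pick
rng = random.Random(0)                                  # fixed seed, only for repeatability
sampled = rng.choices(vocab, weights=probs, k=1)[0]     # one draw at temperature 1
print("greedy:", greedy, "| sampled:", sampled)
```

The three leading tokens hold about 70% of the probability between them, so a sampler at temperature 1 will pick something else in roughly three draws out of ten.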

Final step calculations in detail.

The setup

After the last transformer layer ($\ell = L$), the residual stream is an $n \times d_{\text{model}}$ tensor: one $d_{\text{model}}$-dimensional vector per token position. Call it $h^L$.

At inference time, when you want to predict the next token, you only care about the last row: $h^L_n$. This vector is the model’s final, fully-processed representation of “what comes next” given the entire context so far.

The two operations

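1. Final LayerNorm (or RMSNorm). Before unembedding, virtually all modern transformers apply one last normalization to the residual stream. It rescales the vector so its components have controlled magnitude, then applies a learned per-dimension scale (and sometimes shift):

$$\tilde{h} = \gamma \odot \frac{h^L_n - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Here $h^L_n$ is the last-position vector, $\mu$ and $\sigma^2$ are the mean and variance of its components, and $\gamma$, $\beta$ are the learned scale and shift (RMSNorm drops $\mu$ and $\beta$). The shape stays $d_{\text{model}}$. This step is easy to forget but it matters: without it, the magnitudes coming out of the residual stream would be wild, since every layer has been adding to it.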
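Both normalizations are a few lines each. The sketch below uses a hypothetical 5-dimensional vector and untrained parameters (scale 1, shift 0) so that only the rescaling itself is visible; the small constant `eps` guards against division by zero and its value here is an assumption, not a model setting.

```python
import math

def layer_norm(h, gamma, beta, eps=1e-5):
    """Centre and rescale the components of h, then apply learned scale and shift."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [g * (x - mean) / math.sqrt(var + eps) + b for x, g, b in zip(h, gamma, beta)]

def rms_norm(h, gamma, eps=1e-5):
    """RMSNorm variant: rescale by the root-mean-square only, no centring and no shift."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [g * x / rms for x, g in zip(h, gamma)]

h = [3.0, -1.0, 2.0, 0.0, 1.0]                 # hypothetical last-position hidden state
ones, zeros = [1.0] * 5, [0.0] * 5             # untrained scale and shift (an assumption)
ln_out = layer_norm(h, ones, zeros)
rms_out = rms_norm(h, ones)
print([round(x, 3) for x in ln_out])
print([round(x, 3) for x in rms_out])
```

Whatever the size of the incoming vector, the output components have roughly unit spread, so the unembedding always sees inputs on a comparable scale.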

2. Unembedding (the linear projection to vocabulary). The normalized vector $\tilde{h}$ is multiplied by the unembedding matrix $W_U \in \mathbb{R}^{d_{\text{model}} \times n_{\text{vocab}}}$:

$$\text{logits} = \tilde{h} \, W_U \in \mathbb{R}^{n_{\text{vocab}}}$$

Each column of $W_U$ is a $d_{\text{model}}$-dimensional vector, one per vocabulary token. The matrix multiplication computes, for every vocabulary token $i$, the dot product between the final hidden state and that token’s column:

$$\text{logit}_i = \tilde{h} \cdot W_U[:, i]$$

So each logit is literally a similarity score between the final hidden state and the $i$-th vocabulary token’s representation in $W_U$. Tokens whose column points in the same direction as $\tilde{h}$ get high logits; tokens whose column points elsewhere get low logits.

Interpreting the similarity score

It is tempting to read this one step further and say: the model is trained to produce a last-token vector that points in the same direction as the embedding vectors of the likely next tokens. That intuition is sound, and it is exactly the geometry the third note develops in full — so here we only state the result and move on.

Training (gradient descent on the cross-entropy loss) shapes the normalized last-token vector $\tilde{h}$ to have high dot product with the columns of likely next tokens and low dot product with the rest. Under weight tying ($W_U = E^\top$, below), those columns are the input embedding vectors, so the picture is almost literal. Two refinements keep it honest, both elaborated in Note 3: the vector does not point at one token but positions itself among many plausible continuations at once (which is how the model expresses uncertainty), and its magnitude, not just its direction, matters, because a longer vector sharpens the softmax (more confident) and a shorter one flattens it (less sure). The compact statement:

The model learns to produce a final residual-stream vector whose direction encodes which next tokens are likely and whose magnitude encodes how confident the prediction is.
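A tiny numerical sketch makes the direction-versus-magnitude point concrete. The three 2-dimensional token vectors and the hidden-state direction are hypothetical; the only thing varied between the two runs is the length of the vector.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(h, token_vectors):
    """One dot product per vocabulary token: how well h lines up with that token's vector."""
    return [sum(a * b for a, b in zip(h, vec)) for vec in token_vectors]

# Hypothetical unit-length token vectors in a 2-dimensional space.
tokens = ["mat", "floor", "ran"]
token_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
direction = [1.0, 0.2]                      # leans toward "mat", nearly as much toward "floor"

p_short = softmax(logits_for(direction, token_vectors))                    # magnitude x1
p_long = softmax(logits_for([5.0 * x for x in direction], token_vectors))  # magnitude x5
print(dict(zip(tokens, (round(p, 3) for p in p_short))))
print(dict(zip(tokens, (round(p, 3) for p in p_long))))
```

The ranking of the three tokens is the same in both runs because the direction is the same; the longer vector simply concentrates more probability on the top of the ranking.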

This is the foundation of the logit lens, which applies the unembedding to intermediate residual streams to ask what the model would predict if forced to commit early; it works because the residual stream lives, throughout the network, in the same space the final read-out uses. Note 3 turns this whole picture — the last step as a similarity search over the vocabulary — into its central theme.

Weight tying

A detail worth knowing: in many models (GPT-2, Gemma, and others), $W_U$ is the same matrix as the token embedding, transposed: specifically, $W_U = E^\top$. This is called weight tying. The intuition is elegant: the embedding matrix maps token ID → vector (each row is a token’s representation); the unembedding maps vector → token logits (each column is a token’s representation). It’s the same dictionary, used in two directions.

Weight tying cuts parameter count noticeably (that matrix is $n_{\text{vocab}} \times d_{\text{model}}$, which can be hundreds of millions of parameters for large vocabularies) and empirically it often improves quality.

Not all models tie weights (the 7B-and-larger Llama models, for example, keep the two matrices separate to give the unembedding more flexibility), but it’s a very common default.
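Under weight tying the same rows serve both directions, which the sketch below shows with the three embedding rows from the first worked example. The function names `embed` and `unembed` are hypothetical, and reading a token's own embedding straight back out is purely illustrative: in a real model the vector passes through every layer first.

```python
# Rows of the embedding matrix from the first worked example (only the three shown there).
E = {
    "the": [ 0.10, -0.20,  0.05,  0.40,  0.15],
    "cat": [ 0.30,  0.50, -0.10,  0.20,  0.00],
    "sat": [-0.40,  0.10,  0.60, -0.30,  0.20],
}

def embed(token):
    """Input direction: token -> vector (a row lookup)."""
    return E[token]

def unembed(h):
    """Output direction under weight tying: vector -> one logit per token (dot with each row)."""
    return {token: sum(a * b for a, b in zip(h, row)) for token, row in E.items()}

logits = unembed(embed("cat"))          # read a token's own embedding back out
print({t: round(z, 3) for t, z in logits.items()})

# Parameters saved by sharing one matrix, using the GPT-2 small sizes from Section 4.
saved = 50_257 * 768
print(f"{saved:,}")
```

For GPT-2 small the shared matrix is about 38.6 million numbers, a large share of a model whose blocks hold roughly 85 million.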

From logits to a distribution

Logits are just real numbers, unnormalized. They can be negative, can be huge, and they don’t sum to anything meaningful on their own. To turn them into a probability distribution over the vocabulary, apply softmax:

$$P(\text{token}_i \mid \text{context}) = \frac{\exp(\text{logit}_i / T)}{\sum_{j} \exp(\text{logit}_j / T)}$$

where $T$ is the sampling temperature. The sampler then draws the next token ID from this distribution (or picks $\arg\max_i \text{logit}_i$ for greedy decoding), and the next forward pass begins.

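The compact picture

$$P(\text{next token} \mid \text{context}) = \mathrm{softmax}\!\left(\frac{\mathrm{LayerNorm}(h^L_n)\, W_U}{T}\right)$$

Two operations (one normalization, one matrix multiplication) away from a probability distribution over the entire vocabulary. The residual stream did the heavy lifting; the unembedding just reads out the answer.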

Two things to keep in mind

The residual stream “decides” everything before the unembedding. The unembedding is a fixed linear readout: it has no capacity to think, only to project. By the time you reach $W_U$, all the work of conditioning on the context has already been done by the transformer layers writing into $h$. The unembedding is a translator from “internal representation space” to “vocabulary space.”

During training, you compute logits for every position. At inference you only need the last row, but training computes for all positions in parallel (each row predicts its successor) so the loss can be evaluated everywhere at once (teacher forcing). That’s why the full logits tensor in training has shape $n \times n_{\text{vocab}}$ per sequence, while at inference you typically only materialize the last row.
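The "every row predicts its successor" idea is the whole training loss in miniature. The sketch below scores each row of a hypothetical logits table against the token that actually came next and averages the result; the three-token vocabulary, the logits and the function name are all made up for illustration.

```python
import math

def cross_entropy_all_positions(logits_per_position, token_ids):
    """Mean next-token loss: row t is scored against the token at position t + 1."""
    losses = []
    for t in range(len(token_ids) - 1):
        row = logits_per_position[t]
        top = max(row)
        log_norm = top + math.log(sum(math.exp(z - top) for z in row))
        losses.append(log_norm - row[token_ids[t + 1]])     # -log softmax(row)[target]
    return sum(losses) / len(losses)

# Hypothetical 3-token vocabulary and a 3-token training sequence.
token_ids = [0, 2, 1]
logits = [[0.0, 0.0, 2.0],     # after token 0: favours token 2 (the true successor)
          [0.0, 1.0, 0.0],     # after token 2: favours token 1 (the true successor)
          [0.5, 0.5, 0.5]]     # last row has no successor inside this sequence; unused
loss = cross_entropy_all_positions(logits, token_ids)
print(round(loss, 4))
```

One forward pass over a sequence therefore yields as many training signals as there are successor pairs, which is what makes training on long texts efficient.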

6. A second worked example: two heads, two layers, with an MLP

The first example deliberately stripped the model down to a single head and a single layer, and skipped the MLP entirely. That was the right move for seeing attention clearly, but it hides three things a real Transformer does on every forward pass: it splits the work across several heads, it refines each token with an MLP, and it stacks layers so the residual stream is processed again and again. This example puts all three back, kept just small enough to do by hand.

The point of this section is dimensionality: watching the shape of the data at every step as it flows through two complete layers. The numbers below are real (computed and checked), but don’t memorize them: the weights are random, not trained, so the specific final token is meaningless. Follow the shapes. (Every matrix is shown rounded to two decimals, so re-adding the displayed intermediates by hand may differ from a shown result by about $\pm 0.01$; the full-precision computation is consistent.)

Setup

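$d_{\text{model}} = 6$.
$n_{\text{heads}} = 2$ heads, so the per-head dimension is $d_k = d_v = d_{\text{model}} / n_{\text{heads}} = 3$.
$L = 2$ layers, each one a full attention block + MLP block.
MLP hidden size $d_{\text{ff}}$. (Real models use $d_{\text{ff}} \approx 4\,d_{\text{model}}$; we shrink it so the matrices stay on the page.)
ReLU nonlinearity in the MLP: $\mathrm{ReLU}(x) = \max(0, x)$, applied element-wise; it simply zeros out negatives.
Context: the same 3 tokens, the cat sat, so $n = 3$.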

Here is the whole journey as a shape table. Everything below just fills in the numbers for these rows.

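Step	Operation	Output shape
token IDs	lookup	$3$
$h^0$	embedding + positional	$3 \times 6$
$Q$, $K$, $V$ (all heads)	$h^0 W_Q$, etc.	$3 \times 6$ each
per-head $Q$, $K$, $V$	slice into blocks	$3 \times 3$ each, ×2
scores	$Q_i K_i^\top / \sqrt{3}$ per head	$3 \times 3$ per head
weights	mask + softmax	$3 \times 3$ per head
head output	weights $\cdot\, V_i$	$3 \times 3$ per head
concat	join 2 heads	$3 \times 6$
attn write-back	concat $\cdot\, W_O$	$3 \times 6$
$h_{\text{mid}}$	$h^0 +$ attn write-back	$3 \times 6$
MLP pre-activation	first MLP matrix applied to $h_{\text{mid}}$	$3 \times d_{\text{ff}}$
MLP activation	ReLU, element-wise	$3 \times d_{\text{ff}}$
MLP output	second MLP matrix	$3 \times 6$
$h^1$	$h_{\text{mid}} +$ MLP output	$3 \times 6$
… repeat for layer 2 …
$h^2$	final hidden state	$3 \times 6$
last row	unembedding	$n_{\text{vocab}}$
softmax	distribution	$n_{\text{vocab}}$, sums to 1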

Step 1: embedding + positional → $h^0$

Each token’s 6-dimensional embedding, plus a positional embedding for its slot, gives the initial residual stream $h^0$, shape $3 \times 6$:

h⁰ = [[ 0.10, -0.10,  0.05,  0.30,  0.15, -0.05],   # the  (pos 1)
      [ 0.40,  0.50, -0.20,  0.20,  0.10,  0.25],   # cat  (pos 2)
      [-0.50,  0.15,  0.65, -0.20,  0.15,  0.05]]   # sat  (pos 3)

Step 2: project to Q, K, V, then split into heads

$W_Q$ is now $6 \times 6$ (general shape $d_{\text{model}} \times (n_{\text{heads}} \cdot d_k)$), and likewise $W_K$ and $W_V$. Multiplying $h^0 W_Q$ gives a $3 \times 6$ matrix. The crucial new idea: those 6 columns are two heads’ worth of Q stacked side by side. Columns 1–3 are head 1’s query, columns 4–6 are head 2’s. The vertical bar marks the split:

Q = [[-0.24, -0.27, -0.43 | -0.05, -0.30, -0.01],   # the
     [-0.37,  0.29, -0.61 | -0.56,  0.17, -0.13],   # cat
     [ 0.20, -0.18, -0.00 |  0.37, -0.14,  0.09]]   # sat
       └──── head 1 ────┘   └──── head 2 ────┘

$K$ and $V$ are computed and split the same way. Each head now has its own query, key, and value. This is what “multi-head” means: one matrix multiply, then carve the result into independent lanes, each running the attention formula on its own slice.
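The split itself is nothing more than column slicing, as this sketch shows on the Q matrix displayed above (rounded to two decimals). The function name `split_heads` is hypothetical; tensor libraries do the same thing with a reshape.

```python
def split_heads(M, n_heads):
    """Carve each row of M into n_heads equal, contiguous slices (one matrix per head)."""
    width = len(M[0]) // n_heads
    return [[row[h * width:(h + 1) * width] for row in M] for h in range(n_heads)]

# Q from Step 2 (rows: the, cat, sat), as displayed.
Q = [[-0.24, -0.27, -0.43, -0.05, -0.30, -0.01],
     [-0.37,  0.29, -0.61, -0.56,  0.17, -0.13],
     [ 0.20, -0.18, -0.00,  0.37, -0.14,  0.09]]
Q_head1, Q_head2 = split_heads(Q, n_heads=2)
print(Q_head1)
print(Q_head2)
```

No arithmetic happens in the split: the two heads differ only because the columns of the projection matrix that produced their slices differ.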

Step 3: attention inside each head

Run Steps 3–5 of the previous example separately in each lane.

Head 1. Scaled scores $Q_1 K_1^\top / \sqrt{3}$, then causal mask and softmax:

scores₁ = [[-0.01,  -inf,  -inf],          weights₁ = [[1.00, 0.00, 0.00],
           [ 0.04,  0.07,  -inf],     →               [0.49, 0.51, 0.00],
           [-0.01, -0.04, -0.07]]                     [0.34, 0.33, 0.32]]
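The mask-then-softmax step can be checked with a few lines of plain Python (no libraries). This is only a sketch: the function name causal_softmax is ours, and the input is head 1's scores as printed above, so the result should agree with weights₁ to two decimals.

```python
import math

NEG_INF = float("-inf")

def causal_softmax(scores):
    """Row-wise softmax where row i may only see columns 0..i."""
    weights = []
    for i, row in enumerate(scores):
        # Causal mask: positions after i become -inf, so exp() gives 0.
        masked = [s if j <= i else NEG_INF for j, s in enumerate(row)]
        m = max(masked)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Head 1's scaled scores from the text. The entries above the diagonal
# are placeholders: the mask overwrites them anyway.
scores_1 = [[-0.01, 0.00, 0.00],
            [0.04, 0.07, 0.00],
            [-0.01, -0.04, -0.07]]

weights_1 = causal_softmax(scores_1)
for row in weights_1:
    print([round(w, 2) for w in row])
```

Each row sums to 1, and the zeros sit exactly where the mask put -inf: "the" can see only itself, "cat" sees two tokens, "sat" sees all three.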

Weighted sum of head 1’s V rows gives head 1’s output, shape 3 × 3:

head₁_out = [[-0.12, -0.04, -0.07],
             [-0.15,  0.40, -0.07],
             [-0.12,  0.20, -0.20]]

Head 2. Same procedure on head 2’s slice, giving its own pattern and output:

weights₂ = [[1.00, 0.00, 0.00],         head₂_out = [[ 0.06, -0.03,  0.01],
            [0.45, 0.55, 0.00],     →                [ 0.04, -0.06, -0.05],
            [0.36, 0.32, 0.32]]                      [-0.16,  0.11,  0.03]]

Notice the two heads produce different attention patterns from the same input — head 1 weights “cat” slightly more on row 2 (0.51), head 2 weights it more strongly (0.55). Each head is free to specialize. This is the structural fact the interpretability experiments hinge on: distinct heads can do distinct jobs, and you can study them one at a time.

Step 4: concatenate the heads

Glue the two head outputs back together, side by side, into one 3 × 6 matrix:

concat = [[-0.12, -0.04, -0.07 |  0.06, -0.03,  0.01],
          [-0.15,  0.40, -0.07 |  0.04, -0.06, -0.05],
          [-0.12,  0.20, -0.20 | -0.16,  0.11,  0.03]]
            └─ head 1 out ──┘    └─ head 2 out ──┘
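Splitting into lanes and gluing them back are pure bookkeeping, with no arithmetic involved. A minimal sketch in plain Python, with helper names (split_heads, concat_heads) that are ours and values copied from Steps 2 and 3:

```python
def split_heads(matrix, n_heads):
    """Slice each row into n_heads equal blocks; returns one matrix per head."""
    width = len(matrix[0]) // n_heads
    return [[row[k * width:(k + 1) * width] for row in matrix]
            for k in range(n_heads)]

def concat_heads(head_outputs):
    """Join per-head matrices side by side, row by row."""
    return [sum((head[i] for head in head_outputs), [])
            for i in range(len(head_outputs[0]))]

# Q from Step 2 (rows: the, cat, sat).
Q = [[-0.24, -0.27, -0.43, -0.05, -0.30, -0.01],
     [-0.37,  0.29, -0.61, -0.56,  0.17, -0.13],
     [ 0.20, -0.18, -0.00,  0.37, -0.14,  0.09]]
Q1, Q2 = split_heads(Q, n_heads=2)

# Head outputs from Step 3.
head1_out = [[-0.12, -0.04, -0.07], [-0.15, 0.40, -0.07], [-0.12, 0.20, -0.20]]
head2_out = [[0.06, -0.03, 0.01], [0.04, -0.06, -0.05], [-0.16, 0.11, 0.03]]
concat = concat_heads([head1_out, head2_out])

print(Q1[1], Q2[1])  # head 1's and head 2's query for "cat"
print(concat[2])     # the concatenated row for "sat"
```

The same split_heads call would be applied to K and V; only the attention step between the split and the concat does any computing.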

Step 5: W_O and the residual addition

W_O has shape 6 × 6 (in general (h·d_v) × d_model). It mixes the two heads’ information together and maps it back to the residual width. The product concat · W_O is the attention block’s write-back, shape 3 × 6:

attn = [[-0.06,  0.10, -0.11,  0.05,  0.06,  0.01],
        [-0.03, -0.09, -0.10, -0.15,  0.30,  0.08],
        [ 0.19, -0.20,  0.23,  0.09,  0.34, -0.10]]

Add it to the residual stream — h_mid = h⁰ + attn, still 3 × 6:

h_mid = [[ 0.04, -0.00, -0.06,  0.35,  0.21, -0.04],
         [ 0.37,  0.41, -0.30,  0.05,  0.40,  0.33],
         [-0.31, -0.05,  0.88, -0.11,  0.49, -0.05]]

Step 6: the MLP block

Now the piece the first example skipped. The MLP acts on the residual stream one token at a time — the same W₁, ReLU, W₂ applied to every row independently, with no interaction between positions. (Hold onto that fact; Appendix A turns on it.)

First, expand from width 6 to width 8 via W₁ (shape 6 × 8): pre = h_mid W₁, shape 3 × 8:

pre = [[ 0.16, -0.07, -0.14,  0.10, -0.12, -0.05, -0.06, -0.29],
       [ 0.17, -0.38,  0.18,  0.24, -0.11, -0.26,  0.35, -0.24],
       [-0.02,  0.22,  0.51, -0.66, -0.29, -0.36, -0.45,  0.36]]

Apply ReLU — every negative entry becomes 0. This is the MLP’s only nonlinearity (the softmax inside attention is the model’s other one), and it’s what lets the MLP do more than a plain matrix multiply:

act = [[0.16, 0.00, 0.00, 0.10, 0.00, 0.00, 0.00, 0.00],
       [0.17, 0.00, 0.18, 0.24, 0.00, 0.00, 0.35, 0.00],
       [0.00, 0.22, 0.51, 0.00, 0.00, 0.00, 0.00, 0.36]]

Then contract back from width 8 to width 6 via W₂ (shape 8 × 6): mlp = act · W₂, shape 3 × 6:

mlp = [[ 0.07,  0.03, -0.04, -0.05,  0.08,  0.07],
       [ 0.08,  0.07, -0.22, -0.48,  0.24,  0.05],
       [ 0.79,  0.29, -0.27, -0.34,  0.26, -0.42]]

Add it back to the residual stream — h¹ = h_mid + mlp, shape 3 × 6. Layer 1 is now complete:

h¹ = [[ 0.12,  0.03, -0.10,  0.31,  0.29,  0.04],
      [ 0.45,  0.48, -0.52, -0.43,  0.64,  0.38],
      [ 0.47,  0.24,  0.61, -0.45,  0.75, -0.47]]

Note the shape went 6 → 8 → 6 inside the MLP: the residual stream stays width 6 everywhere; the width-8 expansion happens only inside the block and is contracted away before the write-back. The residual stream’s width is invariant — that constancy is what lets every block read and write the same tape.
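Two facts about the MLP block are easy to confirm in plain Python: ReLU on the pre-activation printed above gives back the act matrix, and the block treats each row on its own. The sketch below uses random stand-in weights for W₁ and W₂ (an assumption: the article's actual weights are not printed), so only the shapes and the row independence carry over, not the numbers.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def relu(m):
    """Element-wise max(0, x)."""
    return [[max(0.0, x) for x in row] for row in m]

def mlp_block(h, W1, W2):
    """Residual MLP update: h + ReLU(h W1) W2."""
    out = matmul(relu(matmul(h, W1)), W2)
    return [[a + b for a, b in zip(h_row, o_row)] for h_row, o_row in zip(h, out)]

# 1) ReLU on the pre-activation from the text reproduces `act`.
pre = [[0.16, -0.07, -0.14, 0.10, -0.12, -0.05, -0.06, -0.29],
       [0.17, -0.38, 0.18, 0.24, -0.11, -0.26, 0.35, -0.24],
       [-0.02, 0.22, 0.51, -0.66, -0.29, -0.36, -0.45, 0.36]]
act = relu(pre)

# 2) Hypothetical random weights (NOT the article's), shapes 6x8 and 8x6.
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(6)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(8)]
h_mid = [[0.04, -0.00, -0.06, 0.35, 0.21, -0.04],
         [0.37, 0.41, -0.30, 0.05, 0.40, 0.33],
         [-0.31, -0.05, 0.88, -0.11, 0.49, -0.05]]

together = mlp_block(h_mid, W1, W2)
one_by_one = [mlp_block([row], W1, W2)[0] for row in h_mid]
print(together == one_by_one)            # True: rows never interact
print(len(together), len(together[0]))   # 3 6: the residual shape is unchanged
```

Feeding the three tokens through one at a time gives exactly the same result as feeding them together, which is the sense in which the MLP is per-position.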

Step 7: layer 2 (same shapes, new weights)

Layer 2 has its own W_Q, W_K, W_V, W_O, W₁, W₂, but the shapes and the procedure are identical to layer 1. It reads h¹, runs two-head attention, writes back, runs its MLP, writes back, and produces h². We show only the write-backs and the result:

attn² (write-back)   h_mid² = h¹ + attn²        mlp²                 h² = h_mid² + mlp²
[[-0.09,-0.27, 0.01,  [[ 0.02,-0.24,-0.09,       [[-0.04,-0.27,-0.37,  [[-0.02,-0.51,-0.46,
  -0.10,-0.02,-0.08],    0.21, 0.27,-0.05],         -0.66, 0.06, 0.18],   -0.46, 0.33, 0.14],
 [-0.20,-0.48, 0.08,   [ 0.25, 0.01,-0.44,        [-0.17,-0.40,-0.58,   [ 0.08,-0.39,-1.01,
   0.12,-0.06,-0.39],    -0.31, 0.58,-0.02],        -0.93, 0.17, 0.33],   -1.24, 0.75, 0.32],
 [-0.46,-0.16, 0.31,   [ 0.02, 0.08, 0.92,        [-0.22,-0.53,-1.20,   [-0.21,-0.45,-0.28,
   0.21,-0.28,-0.47]]    -0.24, 0.47,-0.94]]        -2.21, 0.90, 0.26]]   -2.45, 1.37,-0.67]]

Each of these is 3 × 6. After L = 2 layers the residual stream is still 3 × 6 — the same shape it started as. Every layer reshaped the contents, never the shape.

Step 8: read out the last token

Exactly as in Section 5. Take the last row of h² (the representation for “sat”, now informed by two full layers of attention + MLP):

h²_last = [-0.21, -0.45, -0.28, -2.45,  1.37, -0.67]

Multiply by the unembedding U (shape d_model × V, here 6 × V) to get logits, then softmax. Showing the first 10 vocabulary columns:

logits = [-0.48,  0.27,  2.01, -0.91,  0.35, -1.04, -1.80,  1.03,  0.60,  0.36]
probs  = [ 0.04,  0.07,  0.42,  0.02,  0.08,  0.02,  0.01,  0.16,  0.10,  0.08]

(With random, untrained weights the specific winner carries no meaning — what matters is that the pipeline produced a clean length-V distribution that sums to 1.)

What this example added

Reading the shape table top to bottom, three things are now visible that the single-head example could not show:

Heads are lanes. One multiply produces all heads at once; the result is sliced into independent blocks, each running attention alone, then concatenated and remixed by W_O. The width bookkeeping is h · d_v = d_model, and here 2 × 3 = 6 exactly.
The MLP is a per-token refinery. It expands each token’s vector to d_ff = 8, applies ReLU, contracts back, and adds the result to the residual — touching each position in isolation. The width bookkeeping is d_model → d_ff → d_model, here 6 → 8 → 6.
Layers stack without changing shape. The residual stream is a tape that every block reads from and writes to additively. Stacking layers just repeats the read–transform–write cycle; the shape is invariant from h⁰ to h².

7. Recap of the data flow

Shapes shown per head where the per-head structure matters; d_k = d_v in most architectures, but they are conceptually distinct.

flowchart TB
    A["token IDs   [T]"] -->|"embedding lookup
E: V × d_model"| B["hidden state h⁰   [T × d_model]
initial residual stream"]
    B -->|"project via W_Q, W_K: d_model × (h·d_k),
W_V: d_model × (h·d_v)"| C["Q, K [T × d_k]; V [T × d_v]
per head"]
    C -->|"per-head QKᵀ / √d_k"| D["attention scores   [T × T] per head"]
    D -->|"causal mask + softmax"| E["attention weights   [T × T] per head
rows sum to 1"]
    E -->|"multiply by V, per head"| F["per-head output   [T × d_v]"]
    F -->|"concatenate the h heads"| G["concat output   [T × (h·d_v)]"]
    G -->|"W_O: (h·d_v) × d_model,
add to residual"| H["hidden state after attention   [T × d_model]"]
    H -->|"MLP: W₁, ReLU, W₂,
add to residual"| I["hidden state h¹   [T × d_model]
after MLP block"]
    I -->|"more layers ... eventually layer L"| J["final hidden state   [T × d_model]"]
    J -->|"take last row × U: d_model × V"| K["logits (last token)   [V]"]
    K -->|"softmax"| L["next-token distribution   [V]
sums to 1"]
    L -->|"sample"| M["next token ID   scalar in [0, V)"]

Things to keep clear in your head:

Q is one-shot, K and V persist. That’s why the cache is “KV” — Q for past tokens is never needed again.
The residual stream is the spine. Every attention and MLP block reads from it and writes back to it (additively). All the “thinking” passes through this tensor.
Per-head V has shape T × d_v, not T × d_model. The full-width shape only appears after concatenating the heads (and only if h·d_v = d_model, which is the usual but not mandatory choice). W_O is the operator that maps that concatenated block back into the residual.
Only the last token’s logits matter for next-token prediction at inference time. During training you compute logits for every position (to predict each next token in parallel), but at generation time you only need the last.
Logits are unnormalized; softmax produces the actual distribution. Temperature, top-k, top-p sampling all operate on logits or the resulting distribution to control output diversity.
Attention is the only cross-token channel; the MLP is per-position. Attention blocks are the only place where one token’s representation can be influenced by another’s. MLP blocks refine each token in isolation. So every relational thing a model does — agreement, coreference, copying, “don’t repeat the previous speaker” — must be carried by attention. Appendix A makes this precise by switching attention off; Appendix B works a small CNN by hand to sharpen the filter-vs-head contrast from Section 1.
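The point about logits, temperature and top-k can be made concrete with a short sketch in plain Python. The helper names (softmax, top_k_filter) are ours, and the ten logits are the ones displayed in Section 6, Step 8, used here purely as illustrative input.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise; zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Illustrative logits only: the ten values displayed in Step 8.
logits = [-0.48, 0.27, 2.01, -0.91, 0.35, -1.04, -1.80, 1.03, 0.60, 0.36]

cold = softmax(logits, temperature=0.5)  # sharper: the winner takes more mass
warm = softmax(logits, temperature=2.0)  # flatter: mass spreads out
top3 = top_k_filter(softmax(logits), k=3)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
token_id = rng.choices(range(len(logits)), weights=top3, k=1)[0]
print(round(max(cold), 2), round(max(warm), 2), token_id)
```

Lowering the temperature raises the top probability and raising it lowers the top probability, while top-k guarantees the sampled token comes from the three highest-scoring columns.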

Appendix A: What if attention heads were disabled?

A Transformer, stripped to its skeleton, is a stack of MLP blocks with attention blocks inserted between them, all communicating through one residual stream. A classic feed-forward neural network is essentially just the MLP part. So a natural question — and a useful one for understanding what attention buys you — is: what happens if we switch the attention off? There are two clean ways to do it, and they arrive at the same destination.

The baseline: a network with only MLP blocks

Recall from Section 6 that an MLP block processes the residual stream one token at a time: the same W₁, ReLU, W₂ applied to each row independently, with no reference to any other position. So a network built only from MLP blocks processes every token in complete isolation. It can learn a fixed mapping “this token (at this position) → that output,” i.e. per-token and position-conditioned statistics — but it has no mechanism for one token to influence another’s representation. Whatever “sat” becomes, it becomes without ever consulting “the” or “cat.”

Method 1: remove the attention layers entirely

Delete the attention sub-blocks and wire the embeddings straight into the MLP stack. The update rule becomes simply

h^ℓ = h^(ℓ−1) + MLP_ℓ(h^(ℓ−1))

This is the pure-MLP baseline. In the running example, the representation of “sat” can now never absorb anything from “the” or “cat” — each row of the residual stream (one row per token) is processed down its own private pipe. A relational rule is therefore impossible: anything that requires comparing two positions — for instance “the next item must differ from the current one” — cannot be expressed, because the two positions never meet.

Method 2: keep the architecture, make attention transparent

The second way disables attention without deleting anything. Keep the full Transformer architecture — all the machinery — but fabricate the weights so that the attention block writes nothing into the residual stream.

Recall the write-back from Section 6: the attention block’s contribution is concat · W_O, and it is added to the residual. So set

W_O = 0

Now, no matter what attention pattern the heads compute, the value added to the residual is the zero vector:

attn = concat · W_O = 0, so h + attn = h

The residual stream passes through the attention block untouched. The Q/K/V machinery still runs, still computes attention patterns — but those patterns are transparent: they have no effect on anything downstream. Functionally, the model is once again the pure-MLP network of Method 1.

(You might ask whether there’s a subtler “identity” fabrication — make each token attend only to itself, so attention copies each value through. That doesn’t give transparency: copying each token’s value and adding it back doubles the contribution rather than leaving the stream unchanged. True transparency means the block adds zero, which is what zeroing the write-back achieves.)
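A small sketch of Method 2 in plain Python, reusing h⁰ and the concatenated head outputs from Section 6. The W_O used for comparison is a random stand-in (an assumption, since the article's weights are not printed); the point is only that an all-zero W_O leaves the residual stream exactly as it was.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Residual stream and concatenated head outputs from Section 6.
h0 = [[0.10, -0.10, 0.05, 0.30, 0.15, -0.05],
      [0.40, 0.50, -0.20, 0.20, 0.10, 0.25],
      [-0.50, 0.15, 0.65, -0.20, 0.15, 0.05]]
concat = [[-0.12, -0.04, -0.07, 0.06, -0.03, 0.01],
          [-0.15, 0.40, -0.07, 0.04, -0.06, -0.05],
          [-0.12, 0.20, -0.20, -0.16, 0.11, 0.03]]

# Hypothetical W_O (random stand-in) versus the fabricated all-zero W_O.
rng = random.Random(0)
W_O = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(6)]
W_O_zero = [[0.0] * 6 for _ in range(6)]

h_normal = add(h0, matmul(concat, W_O))
h_transparent = add(h0, matmul(concat, W_O_zero))

print(h_transparent == h0)  # True: attention ran, but wrote nothing
print(h_normal == h0)       # False: a non-zero W_O does change the stream
```

The attention pattern and the head outputs are untouched in both runs; only the write-back differs.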

Both roads lead to the same place

Whether you disable attention by deletion (Method 1) or by making it transparent (Method 2), the Transformer collapses to a per-position MLP network — a stack that refines each token in isolation and can never move information between positions. This is the precise sense in which:

Attention is the only channel through which tokens communicate. Everything relational a language model does must be carried by attention, because it is the only operation that moves information across positions.

That is also why the tiny-model story these notes accompany is about attention heads specifically: if a rule relates one token to another, the circuit that enforces it has to live in the attention machinery — there is nowhere else for it to be.

Why Method 2 is the important one: ablation

Method 2 has a feature deletion lacks: it is selective. The write-back matrix W_O is organized in blocks — one slice of rows per head (Section 6, Step 5). Zero out just one head’s slice, and you make exactly that head transparent while leaving every other head working. Re-run the model, measure what changes, and you have a causal probe: what does this one head actually do?

This per-head version of Method 2 is called ablation, and it is the workhorse of mechanistic interpretability. Two findings from tiny, fully-understood models show why it matters:

A rule can rest on a single head. Ablate that one head and a behavior the model performed perfectly collapses; ablate any other head and nothing changes. The behavior was carried, causally, by one specific lane in one specific layer.
Attention patterns can mislead. A head whose attention looks like it implements a rule — say it stares almost entirely at the relevant earlier token — may, when ablated, turn out to change nothing: it was not load-bearing. Conversely a head with a messy, unremarkable-looking pattern may be the one holding the rule. The attention pattern tells you what a head looks at; only ablation tells you what it does.

That second lesson is worth underlining, because it is exactly where intuition goes wrong: you cannot read a head’s function off its attention picture. You have to switch the head off — Method 2, one head at a time — and watch what breaks. Disabling attention, far from being a destructive curiosity, is therefore the single most useful tool for finding out where in a network a behavior lives — which is the question the accompanying tiny-language-model experiments are built to answer.
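Per-head ablation can be sketched in the same style. The helper ablate_head is a name of ours, and W_O is again a random stand-in rather than the article's weights. The sketch covers only the write-back of one attention block; a real ablation study would re-run the whole model and compare its outputs.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def ablate_head(W_O, head, d_v):
    """Return a copy of W_O with the rows of one head (0-based) set to zero."""
    lo, hi = head * d_v, (head + 1) * d_v
    return [[0.0] * len(row) if lo <= i < hi else list(row)
            for i, row in enumerate(W_O)]

# Head outputs from Section 6, Step 3 (rows: the, cat, sat).
head1_out = [[-0.12, -0.04, -0.07], [-0.15, 0.40, -0.07], [-0.12, 0.20, -0.20]]
head2_out = [[0.06, -0.03, 0.01], [0.04, -0.06, -0.05], [-0.16, 0.11, 0.03]]
concat = [r1 + r2 for r1, r2 in zip(head1_out, head2_out)]

# Hypothetical W_O (6 x 6): random stand-in, NOT the article's weights.
rng = random.Random(0)
W_O = [[rng.uniform(-0.5, 0.5) for _ in range(6)] for _ in range(6)]

full = matmul(concat, W_O)
without_head1 = matmul(concat, ablate_head(W_O, head=0, d_v=3))
head2_alone = matmul(head2_out, W_O[3:6])

# Largest gap between "head 1 ablated" and "head 2's contribution alone".
gap = max(abs(a - b)
          for ra, rb in zip(without_head1, head2_alone)
          for a, b in zip(ra, rb))
print(gap < 1e-12)  # True
```

Zeroing head 1's three rows of W_O leaves exactly what head 2 writes on its own, which is why the slice-wise zeroing isolates a single head.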

Appendix B: A convolutional OCR pass, by hand

Section 1 sketched the OCR convolutional network in words. Here we run one through, with numbers small enough to check by hand, so the filter is as concrete as the head. Then we lay the two side by side.

The image

Take a tiny 5 × 5 grayscale image. Each pixel is 0 (blank) or 1 (ink). This one shows a vertical stroke down the middle column — the kind of mark that distinguishes, say, a 1 or the spine of a T:

       col: 0 1 2 3 4
row 0:      0 0 1 0 0
row 1:      0 0 1 0 0
row 2:      0 0 1 0 0
row 3:      0 0 1 0 0
row 4:      0 0 1 0 0

One filter, sliding

A filter is a small fixed grid of weights. Here is a 3 × 3 vertical-stroke detector: it rewards ink in its center column and punishes ink on either side, so it responds most strongly to a vertical line.

F_vert = [[-1,  2, -1],
          [-1,  2, -1],
          [-1,  2, -1]]

To convolve, we slide this filter over every 3 × 3 window of the image; at each stop we multiply overlapping cells and sum to a single number. A 5 × 5 image with a 3 × 3 filter has 3 × 3 = 9 valid stops, so the output (the feature map) is 3 × 3.

Look at two stops to see the mechanism:

Top-left window (rows 0–2, cols 0–2) — the stroke is off to the right, so the filter’s center column sits on blanks:

window      = [[0,0,1],     elementwise·F_vert, summed:
               [0,0,1],     each row: 0·(-1) + 0·(2) + 1·(-1) = -1
               [0,0,1]]     three rows → -3

Top-center window (rows 0–2, cols 1–3) — now the stroke lines up under the filter’s center column:

window      = [[0,1,0],     each row: 0·(-1) + 1·(2) + 0·(-1) = +2
               [0,1,0],     three rows → +6
               [0,1,0]]

Do this at all nine stops and you get the feature map. The center column lights up (+6); the flanks are suppressed (−3):

conv = [[-3,  6, -3],
        [-3,  6, -3],
        [-3,  6, -3]]
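The sliding multiply-and-sum is short enough to write out in plain Python. The function name conv2d_valid is ours; it assumes no padding and a stride of 1, which is what the nine-stop count above implies.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1); multiply and sum at each stop."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# 5 x 5 image with a vertical stroke in column 2.
image = [[0, 0, 1, 0, 0] for _ in range(5)]

F_vert = [[-1, 2, -1],
          [-1, 2, -1],
          [-1, 2, -1]]

conv = conv2d_valid(image, F_vert)
for row in conv:
    print(row)  # each row prints [-3, 6, -3]
```

Strictly speaking this is cross-correlation (the filter is not flipped), which is the operation deep-learning libraries conventionally call convolution.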

ReLU, then pooling

Apply ReLU (zero out negatives) — the map keeps only the positive evidence “a vertical stroke is here”:

relu(conv) = [[0, 6, 0],
              [0, 6, 0],
              [0, 6, 0]]

Then max-pool with a 2 × 2 window, moving one cell at a time, to shrink the grid and add a little position tolerance (each output cell is the max of a 2 × 2 patch). The 3 × 3 map becomes 2 × 2:

pool = [[6, 6],
        [6, 6]]
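ReLU and pooling in plain Python, starting from the feature map printed above. The sketch assumes a 2 × 2 window with stride 1, the only valid-window setting that turns a 3 × 3 map into a 2 × 2 one; max_pool is a name of ours.

```python
def relu(m):
    """Element-wise max(0, x)."""
    return [[max(0, x) for x in row] for row in m]

def max_pool(m, size=2, stride=1):
    """Max over each size x size patch, moving `stride` cells per step."""
    rows = range(0, len(m) - size + 1, stride)
    cols = range(0, len(m[0]) - size + 1, stride)
    return [[max(m[r + i][c + j] for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]

conv = [[-3, 6, -3],
        [-3, 6, -3],
        [-3, 6, -3]]

activated = relu(conv)
pooled = max_pool(activated)
print(activated)  # [[0, 6, 0], [0, 6, 0], [0, 6, 0]]
print(pooled)     # [[6, 6], [6, 6]]
```

Every 2 × 2 patch of the activated map contains the center column, so every pooled cell picks up the 6.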

The pooled map is uniformly high: this filter is shouting “vertical stroke, present.” Pooling means it would keep shouting even if the stroke shifted a pixel — the detector is now slightly position-invariant, exactly the property Section 1 said convolution buys you.

A layer has many filters

A real layer applies many filters in parallel, each producing its own feature map. Add a second filter, a horizontal-stroke detector (ink rewarded in the center row):

F_horiz = [[-1, -1, -1],
           [ 2,  2,  2],
           [-1, -1, -1]]

Run it on the same vertical-stroke image and every window cancels to zero — there is no horizontal ink to reward:

conv = relu = pool = all zeros

So the two filters disagree, informatively: the vertical detector fires, the horizontal detector is silent. That contrast is the feature the classifier wants.

Flatten and classify

Flatten the two pooled maps into one feature vector (four numbers per filter, eight in all):

features = [6, 6, 6, 6,   0, 0, 0, 0]
            └ vertical ┘   └ horizontal ┘

A final linear layer reads these features into one score per class — let the “vertical” class sum the vertical-filter features and the “horizontal” class sum the horizontal ones — and softmax turns the scores into probabilities:

logits = [24,  0]            # vertical, horizontal
softmax ≈ [1.00, 0.00]       # "this is a vertical stroke"
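The read-out can be written as one small weight matrix, one row per class. The matrix below is the hand-picked "sum your own filter's features" rule described above, not a trained layer, and the variable names are ours.

```python
import math

features = [6, 6, 6, 6, 0, 0, 0, 0]  # four vertical-filter cells, four horizontal

# One weight row per class: "vertical" sums the first four features,
# "horizontal" sums the last four.
W = [[1, 1, 1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1, 1, 1]]

logits = [sum(w * f for w, f in zip(row, features)) for row in W]

m = max(logits)  # subtract the max for numerical stability
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

print(logits)                        # [24, 0]
print([round(p, 2) for p in probs])  # [1.0, 0.0]
```

The horizontal class is not exactly zero (its probability is exp(−24) relative to the winner), it just rounds to zero at two decimals.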

The network has converted a grid of pixels, step by step, into a point in a tiny two-class solution space, and read out a crisp answer — the same arc as the language model, just over characters instead of next tokens. (The probability is emphatic because this is a noise-free toy; on real handwriting the distribution would be softer.)

Filter vs head, side by side

Now the comparison that motivated this appendix. A filter and a head are both feature detectors that get reused across positions, but they differ in the one respect that matters for language:

	Convolutional filter (CNN)	Attention head (Transformer)
What it stores	a fixed pattern of weights	three projections W_Q, W_K, W_V (and shares W_O)
Receptive field	fixed and local — always the same small window	dynamic and global — chosen at runtime, can reach any earlier token
How it matches	slides the same pattern over every position; fires where the pixels match it	computes, from the content, a query–key similarity to decide which positions to read
What “reuse across positions” means	weight sharing: identical weights at every location	the same W_Q, W_K, W_V at every position, but the attention pattern is recomputed per input
Output at a position	one activation = how well the local patch matches the pattern	a weighted blend of other positions’ V payloads
Set by the data…	at training time (the filter weights are learned, then frozen)	at training time and at run time (the pattern depends on the actual tokens)

The crucial row is receptive field. The vertical filter above can only ever see a 3 × 3 patch; to relate ink in the top-left corner to ink in the bottom-right, a CNN must stack many layers until their windows overlap. That is fine for images, where the clues are local. A head pays no such toll: the query for one token can match the key of a token hundreds of positions away in a single step, and which token it matches is decided by the content, not fixed by the architecture. That is the whole reason attention displaced convolution for language — and, looping back to Appendix A, it is also why a relational rule in a language model lives in a head: the head is the only component whose reach is wide enough, and content-driven enough, to relate one token to another.

Appendix C: Glossary

For readers who would like the basics or a refresher. Terms are grouped roughly by where they first appear; Notes 2 and 3 carry their own glossaries for the terms specific to them.

Token, vocabulary

A token is the unit of text the model reads — a whole word in our toy examples, usually a sub-word piece in production models. The vocabulary (of size V) is the fixed set of all possible tokens (20 in Section 5’s toy, 50,257 in GPT-2).

Embedding, embedding matrix

The vector of real numbers that represents a token — its coordinates in the model’s high-dimensional “language space.” The embedding matrix E (shape V × d_model) has one row per vocabulary token; “looking up” a token means reading its row.

Residual stream

The running vector the Transformer maintains for each position and updates layer by layer; every attention and MLP block adds its output to it. It is the only place the model’s “thinking” lives, and the prediction is read from its final state.

Attention head, Q / K / V

A sub-mechanism that, for each token, decides how much to read from every earlier token. It does so with three learned projections of the residual: the query (“what am I looking for?”), the key (“what do I offer to match against?”), and the value (“what payload do I carry if matched?”). Query–key similarity sets the attention pattern; the values are what flow along it.

Multi-head attention

Running attention heads in parallel on independent slices of Q/K/V, then concatenating their outputs and mixing them with W_O. Each head can specialise in a different relation.

W_Q, W_K, W_V, W_O

The four learned weight matrices of an attention block: W_Q, W_K and W_V project the residual into queries, keys and values, and W_O maps the concatenated head outputs back into the residual stream.

MLP block (W₁, W₂, ReLU)

The fully-connected “feed-forward” sub-layer that refines each token’s vector in place. It expands the vector to a wider hidden size via W₁, applies a non-linearity (here ReLU, which zeroes negatives), and contracts back via W₂. It never mixes information across positions.

Logits, softmax, temperature

The model’s raw, unnormalised scores over the vocabulary are logits. Softmax turns them into a probability distribution (exponentiate, then normalise to sum to one). Temperature rescales the logits before softmax by dividing them by a positive number: a low temperature sharpens the distribution (greedy in the limit of temperature 0), a high temperature flattens it.

Unembedding U, weight tying

The matrix U (shape d_model × V) that maps the final residual vector to one logit per vocabulary token; each column is a token’s representation. Weight tying sets U = Eᵀ — the same dictionary used for input lookup and output scoring (used in GPT-2 and our toys; many larger models, including most Llama releases, keep the two matrices separate).

LayerNorm / RMSNorm

A normalisation applied to the residual vector (notably just before the unembedding) that rescales its components to a controlled magnitude and applies a learned per-dimension scale.

Causal mask

The rule that each token may attend only to itself and earlier tokens. Implemented by setting the upper triangle of the attention scores to −∞ before softmax, so future positions get weight zero.

KV cache

The stored keys and values of all past tokens, reused at each generation step so the new token’s query can attend back to the whole history. Queries are not cached — hence “KV”, not “QKV”.

Prefill vs. generation

Prefill processes the whole prompt at once, writing every token’s K and V into the cache. Generation then adds one token at a time, growing the cache by one row of K and V per step.

Ablation

Switching off one component — e.g. zeroing one head’s slice of W_O — and re-measuring behaviour, to test what that component causally does (Appendix A).

Convolution, filter, feature map (CNN terms)

A convolution slides a small fixed filter of weights over an image, computing a local weighted sum at each position; the resulting grid is a feature map. The contrast with an attention head — fixed local window vs. dynamic content-driven reach — motivates Section 1 and Appendix B.

References

Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse autoencoders find highly interpretable features in language models. arXiv. https://arxiv.org/abs/2309.08600

DeepSeek-AI. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv. https://arxiv.org/abs/2405.04434

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). Scaling laws for neural language models. arXiv. https://arxiv.org/abs/2001.08361

nostalgebraist. (2020, August 31). Interpreting GPT: The logit lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens