<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>The Workbench</title>
<link>https://theworkbench.lerzegov.org/</link>
<atom:link href="https://theworkbench.lerzegov.org/index.xml" rel="self" type="application/rss+xml"/>
<description>An academic&#39;s workshop notes on building small tools for finance teaching and AI experimentation — long essays and short field notes from work in progress.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Mon, 01 Jun 2026 22:00:00 GMT</lastBuildDate>
<item>
  <title>The Final Step: The Language Model as a Relentless Seeker of the Best Next Word</title>
  <dc:creator>Luca Erzegovesi</dc:creator>
  <link>https://theworkbench.lerzegov.org/posts/the-final-step/</link>
  <description><![CDATA[ 




<p><em>Third and last in the series “Understanding LLMs to use them better in management and finance.” It closes the loop opened by <a href="../../posts/transformers-qkv-attention/index.html">QKV Attention</a> (how a Transformer moves information between tokens) and <a href="../../posts/embeddings-and-visualization/index.html">Embeddings and the Maps We Draw of Them</a> (where the vectors come from and how to read them). Those two notes built the machinery and the dictionary; this one watches the <strong>last step</strong> — the moment the model turns all that work into a single next word — and argues that the step is, at heart, a search. It is written for a reader who is not a computer scientist; the technical terms (logit, softmax, unembedding, …) are collected in the Glossary at the end and explained where first used.</em></p>
<hr>
<section id="the-intuition-a-relentless-search-for-the-next-word" class="level2">
<h2 class="anchored" data-anchor-id="the-intuition-a-relentless-search-for-the-next-word">1. The intuition: a relentless search for the next word</h2>
<p>Strip a language model down to what it actually <em>does</em>, moment to moment, and it is almost embarrassingly simple. It has a fixed list of possible words — its <strong>vocabulary</strong> — and at every step it gives each word in that list a score, sorts them, and picks one. Then it appends the chosen word to the text and does the whole thing again. And again. Thousands of times, to write you a paragraph.</p>
<p>The useful image is a <strong>search engine</strong>. Generating text is like running a Google search at every step — but a peculiar one. You are not searching the web; you are searching the model’s own vocabulary. And you are not typing the query; the query is a <em>vector the model has computed</em> from everything said so far. The “search results” are the whole vocabulary, ranked by how well each word fits as the continuation, and the model reads off the top of the list.</p>
<p>The first two notes were really about the two halves of that sentence. The <a href="../../posts/transformers-qkv-attention/index.html">QKV note</a> showed how the model builds the query: each token is carried forward as a vector (the <strong>residual stream</strong>), and attention heads and MLP blocks keep modifying it until the last token’s vector encodes everything the model has worked out about what should come next. The <a href="../../posts/embeddings-and-visualization/index.html">embeddings note</a> showed what is being searched: the vocabulary as a cloud of vectors — the dictionary — and how its geometry encodes meaning. This note connects them: the <strong>query meets the dictionary</strong>, and a word comes out.</p>
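<p>The score, sort, pick, append loop described at the start of this section can be sketched in a few lines. This is only an illustration under loud assumptions: <code>toy_scores</code> is a hand-written stand-in for the entire network, and the four-word <code>VOCAB</code> is hypothetical rather than the trained model's vocabulary.</p>

```python
# Hypothetical four-word vocabulary; the real toy model has 28 words.
VOCAB = ["Pietro", "chiama", "Paolo", "Tarso"]


def toy_scores(context):
    """Stand-in for the whole network: one score per vocabulary word."""
    # Assumption: a hand-written rule replaces the learned computation.
    follow = {"Pietro": "chiama", "chiama": "Paolo", "Paolo": "Tarso"}
    wanted = follow.get(context[-1], "Pietro")
    return [5.0 if word == wanted else 0.0 for word in VOCAB]


def generate(prompt, steps):
    """Repeat: score every word, sort, take the top one, append it."""
    text = list(prompt)
    for _ in range(steps):
        scores = toy_scores(text)
        ranking = sorted(zip(scores, VOCAB), reverse=True)
        text.append(ranking[0][1])
    return text


print(generate(["Pietro"], 3))
```

<p>Everything interesting in a real model hides inside the function that produces the scores; the loop around it is as plain as it looks here.</p>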
<hr>
</section>
<section id="the-mechanism-precisely" class="level2">
<h2 class="anchored" data-anchor-id="the-mechanism-precisely">2. The mechanism, precisely</h2>
<p>After the last layer, the model holds one vector for the final position — call it <img src="https://latex.codecogs.com/png.latex?h">, of length <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D">. It is the fully-processed representation of “what comes next.” To turn it into word scores, the model multiplies it by the <strong>unembedding matrix</strong> <img src="https://latex.codecogs.com/png.latex?W_U">, whose columns are one vector per vocabulary word:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20%5Ctilde%20h%20%5C,%20W_U,%20%5Cqquad%20%5Cell_i%20=%20%5Ctilde%20h%20%5Ccdot%20W_%7BU%7D%5B:,i%5D."></p>
<p>(The tilde is a final <strong>LayerNorm</strong> that tames the vector’s magnitude first; the glossary has the detail.) Read the second equation slowly, because it is the whole note: <strong>each score <img src="https://latex.codecogs.com/png.latex?%5Cell_i"> — each logit — is the dot product of the model’s query vector with word <img src="https://latex.codecogs.com/png.latex?i">’s vector in the dictionary.</strong> A dot product is the most basic similarity measure there is: it is large when two vectors point the same way. So the score of a word is <em>how aligned the model’s query is with that word’s entry in the dictionary</em>. Ranking words by their logit is therefore a <strong>similarity search over the vocabulary</strong> — what the literature calls a maximum-inner-product search.</p>
<p>And here is where the two notes fuse. In our models (and in GPT-2 and many others, though not all: several larger models, most Llama releases among them, keep separate input and output matrices) the embedding and unembedding matrices are <strong>tied</strong>: <img src="https://latex.codecogs.com/png.latex?W_U%20=%20E%5E%5Ctop">. The columns the query is scored against <em>are the very embedding rows the previous note mapped</em>. The dictionary you search at the end is literally the dictionary you looked up at the start. The maps in note 2 are maps of the index this search runs over.</p>
<p>The raw logits are then passed through <strong>softmax</strong>, which exponentiates and normalises them into a probability distribution over the vocabulary — the ranked results page, now with percentages. A tiny worked version (three words, by hand) is in Appendix A; the shape of it is all that matters here:</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    A["query vector&lt;br/&gt;(the final residual)"] --&gt;|"dot product"| B["one score per word&lt;br/&gt;(logits)"]
    B --&gt;|"softmax"| C["probabilities&lt;br/&gt;(the ranking)"]
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p>The query did all the hard thinking; the search itself is a single matrix multiply followed by a normalisation. The unembedding cannot reason — it can only compare.</p>
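<p>As a minimal sketch of the readout just described, the following pure-Python fragment multiplies a query by a tied unembedding matrix and applies softmax. The matrix <code>E</code>, the query <code>h</code>, and the helper names are all invented for illustration, and the final LayerNorm is left out.</p>

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def vec_mat(v, m):
    """Row vector times matrix: one output per column of m."""
    return [sum(a * b for a, b in zip(v, col)) for col in zip(*m)]


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embedding matrix E: 4 words (rows), d_model = 2.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
W_U = transpose(E)   # tied weights: W_U is E transposed
h = [2.0, 1.0]       # posited query; LayerNorm omitted in this sketch

logits = vec_mat(h, W_U)   # one dot product per vocabulary word
probs = softmax(logits)    # the ranked results, as probabilities
print(logits)
print([round(p, 3) for p in probs])
```

<p>Because the weights are tied, each logit is simply the dot product of <code>h</code> with one row of <code>E</code>, which is the point of the second equation above.</p>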
<hr>
</section>
<section id="the-search-result-page" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="the-search-result-page">3. The search-result page</h2>
<p>Let us actually run it. Our worked example throughout is the small word-level Transformer from note 2 — two layers, four heads, <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D=64">, a 28-word vocabulary — trained on the turn-taking “calling game” (<em>Pietro chiama Paolo</em>, with <strong>epithets</strong> that depend on who calls whom). Give it the prompt <code>&lt;BOS&gt; Pietro chiama Paolo</code> and ask for the next word’s ranking:</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="figs/n3_fig1_search_results.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-1" title="The search-result page: confident vs uncertain"><img src="https://theworkbench.lerzegov.org/posts/the-final-step/figs/n3_fig1_search_results.png" class="column-page img-fluid figure-img" alt="The search-result page: confident vs uncertain"></a></p>
<figcaption>The search-result page: confident vs uncertain</figcaption>
</figure>
</div>
<p>The left panel is the results page for that prompt. One result dominates utterly: <strong>Tarso</strong>, the epithet the game’s rule assigns to Paolo when Pietro calls him, with probability <strong>0.9998</strong>. Every other word in the vocabulary is down in the one-in-ten-thousand range. This is what a <em>confident</em> search looks like — the query points almost exactly at one entry in the dictionary, and the dot product with that entry towers over all the others.</p>
<p>Three things are worth pinning down about that distribution, because they correct the naïve “find the single nearest word” picture:</p>
<ul>
<li><strong>It is a ranking over the whole vocabulary, not one hit.</strong> The model does not retrieve a word; it scores <em>all</em> of them and reports a graded list. Usually the list is informative well below the top.</li>
<li><strong>Direction says which words, magnitude says how sure.</strong> The <em>direction</em> of the query picks out which entries it aligns with; its <em>length</em> controls how peaked the softmax is. A long query makes one word dominate (high confidence); a short one leaves the results flat (high uncertainty). The model’s confidence is encoded in the geometry, not bolted on afterwards.</li>
<li><strong>“Similarity” here means “trained to fit,” not human synonymy.</strong> Two words sit close in the dictionary because the model learned they play similar roles in predicting text — which usually <em>looks</em> like meaning, but is defined by the training task, exactly as note 2 argued.</li>
</ul>
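<p>The second point in the list above, direction versus magnitude, can be checked numerically. The sketch below keeps a query's direction fixed and only changes its length; the two-dimensional dictionary and the direction are made-up values, and the final LayerNorm is ignored.</p>

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical unit-length dictionary entries and a fixed query direction.
dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
direction = [0.8, 0.6]


def ranking_for(length):
    """Probabilities when the query has this length along the fixed direction."""
    query = [length * x for x in direction]
    logits = [sum(q * w for q, w in zip(query, row)) for row in dictionary]
    return softmax(logits)


for length in (0.5, 2.0, 8.0):
    print(length, [round(p, 3) for p in ranking_for(length)])
```

<p>The winner is the same at every length, because the direction never changes; only the gap between the winner and the rest grows as the query gets longer.</p>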
<hr>
</section>
<section id="confident-searches-and-uncertain-ones" class="level2">
<h2 class="anchored" data-anchor-id="confident-searches-and-uncertain-ones">4. Confident searches and uncertain ones</h2>
<p>Now shorten the prompt to <code>&lt;BOS&gt; Pietro chiama</code> — caller named, but no callee yet. The right panel above is the result. There is no dominant hit; instead <strong>eight near-tied results</strong> (the player tokens <code>2, 6, 3, 4, 7, 1, 5, 8</code>, each around 0.10–0.12), because at this point <em>any</em> valid next player is an equally good continuation. The flatness is not a bug; it is the model <strong>correctly reporting that it does not know</strong> which player comes next, only that it must be a player.</p>
<p>That single contrast — one towering bar versus eight stubby equal ones — is the most operationally useful thing in this note, because three everyday LLM behaviours fall straight out of it:</p>
<ul>
<li><strong>Sampling and “temperature.”</strong> When the results are flat, <em>something</em> still has to be chosen. Greedy decoding takes the top bar; sampling rolls a weighted die over the ranking. Temperature reshapes the distribution before the roll — high temperature flattens it (more adventurous), low temperature sharpens it (more predictable). Generation is a chain of <em>sampled</em> searches, not deterministic lookups.</li>
<li><strong>Hallucination, demystified.</strong> The search <strong>always returns a ranking</strong> — even when nothing in the vocabulary genuinely fits. Ask a model for a fact it never learned and the query points nowhere in particular, but the nearest-by-accident words still get scored, softmax still sums to one, and the model still emits its confident-looking top result. A hallucination is a low-quality search the machinery has no choice but to complete.</li>
<li><strong>Why prompts and context matter so much.</strong> Everything the model knows about <em>this</em> step is compressed into the query vector, and the query is built from the prompt. A better prompt is, quite literally, a better search query.</li>
</ul>
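<p>To make the first bullet concrete, here is a small sketch of greedy decoding and temperature sampling. It reuses the three logits from Appendix A; the function names and the fixed random seed are choices made for this example, not part of any particular library's decoding API.</p>

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalise."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(words, logits):
    """Take the top of the ranking."""
    return max(zip(logits, words))[1]


def sample(words, logits, temperature, rng):
    """Roll a weighted die over the reshaped ranking."""
    probs = softmax(logits, temperature)
    return rng.choices(words, weights=probs, k=1)[0]


words = ["Tarso", "Cefa", "Paolo"]
logits = [1.42, 0.76, -0.58]   # scores from Appendix A

rng = random.Random(0)         # fixed seed so the run repeats
print("greedy:", greedy(words, logits))
for t in (0.2, 1.0, 5.0):
    draws = [sample(words, logits, t, rng) for _ in range(1000)]
    print(t, {w: draws.count(w) for w in words})
```

<p>At low temperature almost every draw is the top word; at high temperature the counts drift toward an even split across all three.</p>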
<hr>
</section>
<section id="the-fulcrum-how-one-vector-comes-to-hold-a-whole-world" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="the-fulcrum-how-one-vector-comes-to-hold-a-whole-world">5. The fulcrum: how one vector comes to hold a whole world</h2>
<p>Step back to ask what is really remarkable here, because it is easy to lose it in the arithmetic. The query vector <img src="https://latex.codecogs.com/png.latex?h"> is produced by nothing but multiplications and additions — matrix products, dot products, a normalisation. And yet, by the time it reaches the final step, that one short list of numbers has absorbed the <em>meaning of the entire prompt</em>: who is speaking, who was called, what the game’s rule demands, which word is therefore due. Pure numerical manipulation has condensed a whole context into a single point in space.</p>
<p>There is an Archimedean quality to this. <em>Give me a place to stand, and a lever long enough, and I will move the world</em> — and the place the model stands is exactly this fulcrum, the last token’s vector. The whole world of the prompt, with all its accumulated context, bears down on that one point, and what gets lifted into existence is the next word. The lever is built by the parts the earlier notes described: <strong>attention</strong> reaches back across the sentence and binds the relevant tokens into the residual (note 1), and the <strong>MLP / feed-forward</strong> blocks act, position by position, as a kind of learned lookup table — given “Pietro called Paolo,” they fetch the direction that means “Tarso.” Layer by layer the residual is loaded until it is the long arm of the lever, and the tiny final search is the short arm that the loaded context throws upward.</p>
<p>We can watch the lever load. Project the last token’s residual, at each stage of the forward pass, onto a plane built (by the same Gram-Schmidt method as note 2) from two dictionary directions — the epithet <em>Tarso</em> and the callee <em>Paolo</em>:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="figs/fig5_residual_trajectory.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="The residual stream migrating toward the answer"><img src="https://theworkbench.lerzegov.org/posts/the-final-step/figs/fig5_residual_trajectory.png" class="img-fluid figure-img" alt="The residual stream migrating toward the answer"></a></p>
<figcaption>The residual stream migrating toward the answer</figcaption>
</figure>
</div>
<p>The vector starts near the called name and <strong>travels across the plane toward Tarso</strong> as the layers run — its coordinate along the Tarso axis climbs from −3.1 to +10.6 — and decomposing the path shows that <strong>block-0 attention alone supplies +7.62 of that push</strong>. Moving the residual toward a word’s direction <em>is</em> raising that word’s logit (tied weights again), so this picture is the search ranking being rewritten in real time. The same thing read as rankings rather than geometry is the <strong>logit lens</strong> — applying the unembedding at each intermediate stage to ask “what would the model predict if it had to commit <em>now</em>?”:</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="figs/n3_fig2_logit_lens.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-3" title="Logit lens: the ranking locks on early, then sharpens"><img src="https://theworkbench.lerzegov.org/posts/the-final-step/figs/n3_fig2_logit_lens.png" class="column-page img-fluid figure-img" alt="Logit lens: the ranking locks on early, then sharpens"></a></p>
<figcaption>Logit lens: the ranking locks on early, then sharpens</figcaption>
</figure>
</div>
<p>By the end of the very first block the search already ranks Tarso top (0.92); the second block only sharpens it to near-certainty. The decisive work — the lever’s heave — happens early, exactly where the trajectory said it did. None of this is mysticism: it is binding, then lookup, then a dot product. But it is worth pausing on the fact that <em>that</em> is enough to make a vector carry a world.</p>
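<p>The logit lens itself is a very small computation: the same readout, applied at several stages. The sketch below shows the idea with invented numbers; the residuals, the two-word dictionary, and the stage names are hypothetical and are not taken from the trained model behind the figures. The final LayerNorm is again omitted.</p>

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(residuals, dictionary, words):
    """Apply the same dot-product readout to each intermediate residual."""
    report = []
    for stage, h in residuals:
        logits = [sum(a * b for a, b in zip(h, row)) for row in dictionary]
        probs = softmax(logits)
        best = max(range(len(words)), key=lambda i: probs[i])
        report.append((stage, words[best], round(probs[best], 2)))
    return report


# Hypothetical values chosen to mimic "locks on early, then sharpens".
words = ["Paolo", "Tarso"]
dictionary = [[1.0, 0.0], [0.0, 1.0]]
residuals = [
    ("embedding", [2.0, -1.0]),
    ("after block 0", [1.0, 3.0]),
    ("after block 1", [0.5, 8.0]),
]

for row in logit_lens(residuals, dictionary, words):
    print(row)
```

<p>Reading the report top to bottom is the ranking being rewritten stage by stage, which is what the figure above shows for the real toy model.</p>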
<hr>
</section>
<section id="from-a-toy-to-a-thousand-page-contract" class="level2">
<h2 class="anchored" data-anchor-id="from-a-toy-to-a-thousand-page-contract">6. From a toy to a thousand-page contract</h2>
<p>It would be fair to object that lifting the word “Tarso” out of a 28-word vocabulary is a parlour trick. So here is the part that should genuinely give pause: <strong>the machinery is identical at the top of the field.</strong> GPT-class models do exactly what our toy does — score every word in the vocabulary by dot product with a computed query, softmax, sample, append, repeat. Nothing more exotic is bolted on. What differs is only <em>scale</em>: a query vector of 12,288 numbers instead of 64 (note 2), a dictionary of 50,000-plus word-pieces instead of 28, and a lever built from dozens of layers trained on a sizeable fraction of the written internet.</p>
<p>And from that — from a relentless loop of “rank the vocabulary, pick a word” — comes the entire observed richness: a comedy sketch with a setup and a punchline, a sonnet that scans, a scientific paper with a coherent argument, an investor report that ties its narrative to its numbers, a full contract with cross-referencing clauses. Each of those is produced one next-word search at a time, each search standing on the fulcrum the previous words built. The wonder of large language models is not that they do something other than this. It is that <em>this</em>, at scale, is enough.</p>
<hr>
</section>
<section id="the-conveyor-belt-grows-tools-search-and-code" class="level2">
<h2 class="anchored" data-anchor-id="the-conveyor-belt-grows-tools-search-and-code">7. The conveyor belt grows: tools, search, and code</h2>
<p>Modern systems add one more move, and it is the move that turns a text generator into an assistant. The find-next-word loop is coupled to real outside capabilities: a web search, a calculator, a code interpreter, or other applications reached through <strong>MCP</strong> servers. It is tempting to imagine the model “using” these tools the way a person does. It does something stranger and simpler.</p>
<p>Picture the prompt as a conveyor belt of text that the model endlessly reads and extends. When a tool is involved, the model does not leave the belt to operate machinery. It simply <em>writes onto the belt</em> a request — a few tokens that mean “search the web for X” or “run this code.” An external harness, sitting outside the model, notices that text, performs the actual action, and lays the <strong>result back onto the belt as more text</strong>: the search snippets, the computed number, the program’s output. Then the same next-word loop resumes, now reading a context enriched with fresh, grounded material. The tool’s answer is not stored in some special memory; it becomes ordinary prompt text, indistinguishable in kind from what the user typed.</p>
<p>This is why the picture matters for the rest of the note. The lever’s “world” (the context pressing on the query) is no longer limited to what the user wrote and what the model already knew. It can now include this morning’s web page, an exact calculation, the output of a freshly compiled program. But the mechanism underneath never changes: every one of those additions is just more text on the belt, and every word the system produces is still one dot-product search over the vocabulary. The intelligence of an agent is, in the end, the intelligence of <em>what gets written onto the belt</em>, together with a search that keeps reading it.</p>
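<p>The conveyor-belt picture can be sketched as a tiny harness. Everything here is an assumption made for illustration: <code>fake_model</code> stands in for the next-word loop, and the <code>CALC(...)</code> and <code>RESULT:</code> markers are an invented request syntax, not the format of MCP or of any real tool-calling API.</p>

```python
def fake_model(belt):
    """Stand-in for the next-word loop: returns the text it would append."""
    # Assumption: the CALC(...) request syntax is invented for this sketch.
    if "RESULT:" in belt:
        number = belt.rsplit("RESULT:", 1)[1].strip()
        return " The total is " + number + "."
    return " CALC(1200 * 3)"


def run_tool(request):
    """The harness side: parse the request and do the real work."""
    left, right = request.split("*")
    return str(int(left) * int(right))


def harness(user_text):
    belt = user_text
    belt += fake_model(belt)                      # the model writes a request
    if "CALC(" in belt:
        request = belt.rsplit("CALC(", 1)[1].rstrip(")")
        belt += "\nRESULT: " + run_tool(request)  # the result returns as text
        belt += "\n" + fake_model(belt).strip()   # the same loop resumes
    return belt


print(harness("What do three units at 1200 each cost?"))
```

<p>Note that the model function never performs the multiplication. It only writes text and reads text; the arithmetic happens outside and comes back as more characters on the belt.</p>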
<hr>
</section>
<section id="where-this-sits-and-where-the-analogy-breaks" class="level2">
<h2 class="anchored" data-anchor-id="where-this-sits-and-where-the-analogy-breaks">8. Where this sits — and where the analogy breaks</h2>
<p>Honesty first, since this series tries to locate its ideas rather than oversell them. Nothing in the mechanism here is a new finding; a specialist would recognise every piece. That the unembedding is a dot-product readout you can even apply to intermediate layers is the <strong>logit lens</strong> (nostalgebraist; refined as the <em>tuned lens</em>, Belrose et al.&nbsp;2023). That the feed-forward blocks behave like a lookup table is <em>Transformer Feed-Forward Layers Are Key-Value Memories</em> (Geva et al.&nbsp;2021). The output-layer-as-similarity-search idea goes back to <strong>word2vec</strong> (Mikolov et al.&nbsp;2013). And the pedagogy of doing the whole thing transparently has been done before — Ishan Anand’s <em>Spreadsheets Are All You Need</em> implements GPT-2 in a spreadsheet for exactly this reason. The contribution here is only the exposition for a non-technical audience and the live, perturbable toy behind the figures; the toy game itself is reproduced from the public <em>ToyDialogueGames</em> exercise.</p>
<p>With that said, the “search engine” image earns its keep but must not be pushed too far. Four places it breaks, each worth keeping in mind:</p>
<ul>
<li><strong>There is no external corpus.</strong> A web search ranks billions of documents; this search ranks only the model’s own fixed vocabulary (tens of thousands of word pieces). It cannot return anything that is not already a token.</li>
<li><strong>You don’t type the query — the model computes it.</strong> All the work, and all the cleverness, is in <em>constructing</em> the query vector. The search step itself is a trivial linear operation. “It’s just doing search” is true and deeply misleading at once: the search is dumb; the query is the model.</li>
<li><strong>It returns a distribution and then gambles.</strong> Unlike a search box that shows you a fixed list, generation <em>samples</em> from the ranking, so the same prompt can yield different continuations. Determinism is a special case (temperature zero), not the rule.</li>
<li><strong>“Relevance” is the training objective, not human judgement.</strong> A word scores high because the model was trained to make the right next token score high — which approximates meaning but is not the same thing, and is exactly why the results can be fluent and wrong together.</li>
</ul>
<p>This last point connects the whole series back to its second note. A <strong>vector database</strong> (the engine of retrieval-augmented generation) also performs a similarity search — but over an <em>external</em> corpus, using a query produced by a <em>separately trained encoder</em>. A language model performs a similarity search over its <em>own vocabulary</em>, using a query it <em>computes internally</em>. Same operation, different index and different source of query. Seen this way, <strong>RAG is just a way to put better text on the conveyor belt</strong>: it drops genuinely relevant documents into the context so that the model’s computed query — and therefore its next-word search — points somewhere grounded.</p>
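<p>For comparison, the retrieval half of RAG is the same dot-product ranking pointed at a different index. In this sketch the document store, its three-number embeddings, and the question vector are all made up; a real system would obtain the vectors from a separately trained encoder.</p>

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, store, k):
    """Rank stored documents by dot product with the query; keep the top k."""
    ranked = sorted(store, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Lay the retrieved text onto the belt, ahead of the question."""
    context = "\n".join("- " + p for p in passages)
    return "Context:\n" + context + "\n\nQuestion: " + question


# Hypothetical store of (text, embedding) pairs with invented vectors.
store = [
    ("Covenant: net debt / EBITDA must stay below 3.5x.", [0.9, 0.1, 0.0]),
    ("The office moved to Trento in 2019.", [0.0, 0.2, 0.9]),
    ("Interest is payable quarterly in arrears.", [0.6, 0.7, 0.1]),
]
question = "What leverage covenant applies?"
question_vec = [1.0, 0.2, 0.0]   # assumed output of the encoder

print(build_prompt(question, retrieve(question_vec, store, 2)))
```

<p>The language model never sees the vectors. It sees only the assembled prompt, and runs its own next-word search over a context that now contains the relevant clause.</p>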
<hr>
</section>
<section id="what-this-means-for-using-llms-in-management-and-finance" class="level2">
<h2 class="anchored" data-anchor-id="what-this-means-for-using-llms-in-management-and-finance">9. What this means for using LLMs in management and finance</h2>
<p>If you carry one mental model away from these three notes, let it be this: a language model is not a database you query for facts, and not a mind that “knows” things. It is a machine that, at every step, draws on the knowledge compressed into its weights to build a vector summarising the context, and then runs a similarity search over its own vocabulary for the best next word. Repeat that step again and again and an articulated response takes shape, one word at a time.</p>
<p>And what comes out is unlike the result of a database query. It is not a selection of pre-baked text or data retrieved from a store; it is closer to a kind of mechanical expert judgement — a chain of small inferences over knowledge the model has metabolised from training on many similar cases. That distinction is the practical heart of what follows.</p>
<p>That single picture pays off in practice:</p>
<ul>
<li><strong>Hallucinations are not malfunctions; they are the search completing when it shouldn’t.</strong> The cure is not to scold the model but to improve the query — by putting the right facts on the conveyor belt (RAG, tools, better context).</li>
<li><strong>Prompting is query construction.</strong> Time spent shaping the context is time spent aiming the search. It is the highest-leverage thing a non-technical user controls.</li>
<li><strong>Confidence is readable, and worth reading.</strong> A model that is “sure” has a peaked distribution; a flat one is a warning. Where a system exposes token probabilities or lets you vary temperature, those are direct windows onto how strong the search hit actually was.</li>
<li><strong>Tools extend the world the model can lift, not the mechanism.</strong> Web search, a calculator, a code runner, an MCP-connected application — each one just enriches the text the next-word search reads. Understanding that boundary is what lets you reason about what these systems can and cannot reliably do.</li>
</ul>
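<p>One way to read confidence off a list of token probabilities, where a system exposes them, is the "effective number of choices": the exponential of the distribution's entropy. The sketch below is an illustration only; the two distributions are idealised versions of the confident and uncertain panels in section 3, not the model's actual output.</p>

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_choices(probs):
    """exp(entropy): roughly how many words are seriously in play."""
    return math.exp(entropy(probs))


# Idealised 28-word distributions (assumed numbers, for illustration).
confident = [0.9998] + [0.0002 / 27] * 27   # one towering bar
uncertain = [0.125] * 8 + [0.0] * 20        # eight equal bars

print(round(effective_choices(confident), 3))
print(round(effective_choices(uncertain), 3))
```

<p>A value close to 1 means one word dominates; a value close to 8 means the model is choosing among eight equally good candidates.</p>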
<p>The model is a relentless seeker of the best next word. Everything else (the poems, the reports, the contracts, the agentic tool use) is what that one tireless search becomes when it stands on a rich enough world and is run, patiently, again and again.</p>
<hr>
</section>
<section id="appendix-a-the-search-by-hand" class="level2">
<h2 class="anchored" data-anchor-id="appendix-a-the-search-by-hand">Appendix A: the search, by hand</h2>
<p>Take a deliberately tiny model: a query vector of length 3 and a three-word vocabulary, each word a row of the (tied) dictionary.</p>
<p>A caveat on what is being skipped, so the example is not mistaken for the whole machine. In a real run there is a <strong>prompt</strong> — text such as <code>&lt;BOS&gt; Pietro chiama Paolo</code> — and a <strong>query-building mechanism</strong>: the embedding lookup, the attention blocks, and the MLP blocks of Notes 1 and 2, which between them read the prompt and grind it down into the single final-residual vector <img src="https://latex.codecogs.com/png.latex?h">. All of that is the hard, interesting part. Here we simply <em>posit</em> the finished <img src="https://latex.codecogs.com/png.latex?h"> and a three-word dictionary, so that the <strong>search step itself</strong> — dot product, then softmax — stands alone and can be checked by hand. Read the numbers below as “suppose the upstream layers handed us this query against this dictionary.”</p>
<pre><code>query  h     = ( 1.0,  0.5, -0.2)        the final residual

dictionary:
   Tarso      = ( 1.2,  0.4, -0.1)
   Cefa       = ( 0.3,  1.0,  0.2)
   Paolo      = (-0.5,  0.2,  0.9)</code></pre>
<p><strong>Score each word — a dot product with the query (this is the “search”):</strong></p>
<pre><code>ℓ(Tarso) = 1.0·1.2 + 0.5·0.4 + (-0.2)(-0.1) =  1.20 + 0.20 + 0.02 =  1.42
ℓ(Cefa)  = 1.0·0.3 + 0.5·1.0 + (-0.2)( 0.2) =  0.30 + 0.50 − 0.04 =  0.76
ℓ(Paolo) = 1.0(-0.5)+ 0.5·0.2 + (-0.2)( 0.9) = −0.50 + 0.10 − 0.18 = −0.58</code></pre>
<p><strong>Softmax — turn scores into a ranked probability list:</strong></p>
<pre><code>exp:   e^1.42 = 4.14,   e^0.76 = 2.14,   e^-0.58 = 0.56     sum = 6.84
prob:  Tarso 4.14/6.84 = 0.61    Cefa 0.31    Paolo 0.08</code></pre>
<p>Tarso wins, because its dictionary entry points most nearly the same way as the query. Note what each stage did: the dot products <em>are</em> the search; softmax only turns the scores into a tidy ranking. Make the query longer (multiply <img src="https://latex.codecogs.com/png.latex?h"> by 3, keeping its direction) and the logits become <img src="https://latex.codecogs.com/png.latex?(4.26,%202.28,%20-1.74)"> with probabilities <img src="https://latex.codecogs.com/png.latex?(0.88,%200.12,%200.002)">: same winner, sharper confidence. Direction chose the word; magnitude set the certainty.</p>
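<p>The hand calculation in this appendix can be re-run in a few lines of Python. The query and dictionary are the ones posited above; the helper names <code>logits_for</code> and <code>softmax</code> are simply labels chosen for this sketch.</p>

```python
import math

query = [1.0, 0.5, -0.2]
dictionary = {
    "Tarso": [1.2, 0.4, -0.1],
    "Cefa": [0.3, 1.0, 0.2],
    "Paolo": [-0.5, 0.2, 0.9],
}


def logits_for(q):
    """One dot product per dictionary entry: this is the search."""
    return {w: sum(a * b for a, b in zip(q, v)) for w, v in dictionary.items()}


def softmax(scores):
    """Exponentiate and normalise the scores into probabilities."""
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Scale 1 is the query as given; scale 3 is the same direction, three times longer.
for scale in (1, 3):
    scores = logits_for([scale * x for x in query])
    probs = softmax(scores)
    print(scale,
          {w: round(s, 2) for w, s in scores.items()},
          {w: round(p, 3) for w, p in probs.items()})
```

<p>Changing any number in <code>query</code> or <code>dictionary</code> and re-running is a quick way to see how the ranking responds.</p>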
<hr>
</section>
<section id="appendix-b-glossary" class="level2">
<h2 class="anchored" data-anchor-id="appendix-b-glossary">Appendix B: Glossary</h2>
<p>Continues the glossary of note 2; here are the terms specific to the final step.</p>
<section id="logit" class="level3">
<h3 class="anchored" data-anchor-id="logit">Logit</h3>
<p>A word’s raw, unnormalised score — the dot product of the model’s query vector with that word’s dictionary entry. One logit per vocabulary word.</p>
</section>
<section id="softmax" class="level3">
<h3 class="anchored" data-anchor-id="softmax">Softmax</h3>
<p>The function that turns a vector of logits into a probability distribution: exponentiate each, divide by the total. Bigger logits become bigger probabilities; the result sums to 1.</p>
</section>
<section id="unembedding" class="level3">
<h3 class="anchored" data-anchor-id="unembedding">Unembedding</h3>
<p>The matrix <img src="https://latex.codecogs.com/png.latex?W_U"> that maps the final vector to one logit per word. Its columns are word vectors. With <strong>tied weights</strong> it is the transpose of the embedding matrix — the same dictionary used for input lookup and output scoring.</p>
</section>
<section id="maximum-inner-product-search-mips" class="level3">
<h3 class="anchored" data-anchor-id="maximum-inner-product-search-mips">Maximum-inner-product search (MIPS)</h3>
<p>“Find the items whose vectors have the largest dot product with a query.” Ranking the vocabulary by logit is exactly this — a similarity search, with the dot product as the similarity.</p>
</section>
<section id="logit-lens" class="level3">
<h3 class="anchored" data-anchor-id="logit-lens">Logit lens</h3>
<p>A diagnostic: apply the unembedding to the residual at an <em>intermediate</em> layer to see what the model would predict if forced to commit there. It works because the residual lives, throughout the network, in the space the final readout uses.</p>
</section>
<section id="key-value-memory-feed-forward-block" class="level3">
<h3 class="anchored" data-anchor-id="key-value-memory-feed-forward-block">Key-value memory (feed-forward block)</h3>
<p>A way of reading the MLP/feed-forward sub-layers: each behaves like a bank of stored key–value pairs. A key recognises a pattern in the residual (“Pietro called Paolo”) and its value writes an associated direction back (“toward Tarso”). These are the model’s per-token “lookups.”</p>
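<p>A minimal sketch of that reading, with loud simplifications: biases are dropped, a plain ReLU stands in for whatever activation a real model uses, and the two-dimensional residual space, keys, and values are invented so that the arithmetic can be followed by eye.</p>

```python
def relu(x):
    return max(0.0, x)


def feed_forward(h, keys, values):
    """Sum of value vectors, each weighted by how strongly its key matches h."""
    out = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        match = relu(sum(a * b for a, b in zip(h, key)))   # pattern detector
        out = [o + match * v for o, v in zip(out, value)]  # write a direction
    return out


# Hypothetical axes: 0 = "Pietro called Paolo", 1 = "toward Tarso".
keys = [[1.0, 0.0], [-1.0, 0.0]]
values = [[0.0, 2.0], [0.0, -2.0]]   # what each memory writes back

h = [1.5, 0.0]                                 # residual carrying the pattern
update = feed_forward(h, keys, values)
new_h = [a + b for a, b in zip(h, update)]     # residual connection
print(update, new_h)
```

<p>Only the first key fires on this residual, so only its value is written back, pushing the vector along the second axis.</p>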
</section>
<section id="greedy-decoding-sampling-temperature-top-k-top-p" class="level3">
<h3 class="anchored" data-anchor-id="greedy-decoding-sampling-temperature-top-k-top-p">Greedy decoding, sampling, temperature, top-k / top-p</h3>
<p>Ways to pick a word from the ranked distribution. <strong>Greedy</strong> takes the top one. <strong>Sampling</strong> draws at random in proportion to probability. <strong>Temperature</strong> rescales the distribution before drawing — high = flatter/more varied, low = sharper/more predictable. <strong>Top-k / top-p</strong> restrict the draw to the most probable words.</p>
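<p>Top-k and top-p can each be sketched as a filter applied to the ranking before the draw. The function names and the dictionary-based representation are choices made for this example; the probabilities are the rounded values from Appendix A.</p>

```python
def top_k(probs, k):
    """Keep the k most probable words and renormalise."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[w] for w in keep)
    return {w: probs[w] / total for w in keep}


def top_p(probs, p):
    """Keep the smallest top-ranked set whose total probability reaches p."""
    keep, mass = [], 0.0
    for w in sorted(probs, key=probs.get, reverse=True):
        keep.append(w)
        mass += probs[w]
        if mass >= p:
            break
    return {w: probs[w] / mass for w in keep}


probs = {"Tarso": 0.61, "Cefa": 0.31, "Paolo": 0.08}   # Appendix A values
print(top_k(probs, 2))
print(top_p(probs, 0.9))
```

<p>Both filters drop <code>Paolo</code> here: top-k because it is ranked third, top-p because the first two words already cover more than 90% of the probability.</p>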
</section>
<section id="rag-retrieval-augmented-generation" class="level3">
<h3 class="anchored" data-anchor-id="rag-retrieval-augmented-generation">RAG (retrieval-augmented generation)</h3>
<p>Fetching relevant documents from an external store (via a vector-database similarity search) and inserting them into the prompt, so the model’s next-word search runs over a context grounded in real text.</p>
</section>
<section id="mcp-model-context-protocol" class="level3">
<h3 class="anchored" data-anchor-id="mcp-model-context-protocol">MCP (Model Context Protocol)</h3>
<p>A standard by which a model’s host application connects to external tools and data sources. In the picture of this note: a way for results computed outside the model to be written back onto the prompt “conveyor belt” as text the next-word loop then reads.</p>
</section>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p>Anand, I. (2024). <em>Spreadsheets are all you need: A spreadsheet implementation of GPT-2.</em> https://spreadsheets-are-all-you-need.ai/</p>
<p>Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., &amp; Steinhardt, J. (2023). <em>Eliciting latent predictions from transformers with the tuned lens.</em> arXiv. https://arxiv.org/abs/2303.08112</p>
<p>Geva, M., Schuster, R., Berant, J., &amp; Levy, O. (2021). Transformer feed-forward layers are key-value memories. In <em>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)</em> (pp.&nbsp;5484–5495). Association for Computational Linguistics. https://arxiv.org/abs/2012.14913</p>
<p>Mikolov, T., Chen, K., Corrado, G., &amp; Dean, J. (2013). <em>Efficient estimation of word representations in vector space.</em> arXiv. https://arxiv.org/abs/1301.3781</p>
<p>nostalgebraist. (2020, August 31). <em>Interpreting GPT: The logit lens.</em> LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens</p>
<p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. <em>Advances in Neural Information Processing Systems, 30.</em> https://arxiv.org/abs/1706.03762</p>
<hr>
<p><em>Models and tooling behind the figures (in the <code>julia-impromptu</code> project): the trained dialogue-game model <code>DialogueGame-Tiny-Epithet-Trained.json</code>, whose <code>Forward_TopK</code> ranking and logit-lens recompute live as you edit the prompt; the series’ live companion <code>EmbeddingAtlas.json</code>, whose <code>final_step_search</code> report shows the confident/uncertain search pages and the logit-lens locking on; the figure data computed bundle-pure in <code>tools/compute_note3.jl</code> and rendered by <code>tools/render_note3.py</code>. The residual-trajectory figure is reused from note 2.</em></p>


</section>

 ]]></description>
  <category>LLMs</category>
  <category>Transformers</category>
  <category>Search</category>
  <category>Teaching</category>
  <guid>https://theworkbench.lerzegov.org/posts/the-final-step/</guid>
  <pubDate>Mon, 01 Jun 2026 22:00:00 GMT</pubDate>
  <media:content url="https://theworkbench.lerzegov.org/posts/the-final-step/figs/n3_fig1_search_results.png" medium="image" type="image/png" height="56" width="144"/>
</item>
<item>
  <title>Embeddings and the Maps We Draw of Them</title>
  <dc:creator>Luca Erzegovesi</dc:creator>
  <link>https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/</link>
  <description><![CDATA[ 




<p><em>Second in the series “Understanding LLMs to use them better in management and finance.” It follows <a href="../../posts/transformers-qkv-attention/index.html">QKV Attention</a> (how a Transformer moves information between tokens) and sets up <a href="../../posts/the-final-step/index.html">The Final Step</a> (how the final layer reads an answer out of the vocabulary — a relentless search for the best next word). This note is about the thing both the attention machinery and that final read-out quietly depend on: the <strong>vocabulary embeddings</strong> — what they are, where they come from, and how to look at a space with hundreds of dimensions without fooling yourself. It is written for a reader who is not a computer scientist; the technical terms (token, residual stream, PCA, softmax, …) are collected in the Glossary at the end, and each is also explained where it is first used.</em></p>
<hr>
<section id="what-an-embedding-is-and-the-chicken-and-egg-of-finding-it" class="level2">
<h2 class="anchored" data-anchor-id="what-an-embedding-is-and-the-chicken-and-egg-of-finding-it">1. What an embedding is — and the chicken-and-egg of finding it</h2>
<p>Everyone who has met a language model has met the idea of an <strong>embedding</strong>: each word (more precisely, each <em>token</em>) is a point in a high-dimensional space, and “nearby” points are supposed to mean “similar” things. The picture is so common that it is easy to skip the question that actually matters for using these models well: <em>where do those vectors come from?</em></p>
<p>There is a tempting wrong answer, and it is worth naming because it is the mental model most people import from <strong>vector databases</strong>. In a vector database — the engine behind semantic search and retrieval-augmented generation — you take an <em>already-trained</em> encoder, push each document through it, and store the vector it produces. The encoder is fixed; the vectors are read off it; “similarity” is a property of a model someone else trained earlier. The embedding is an <strong>input</strong> to your system.</p>
<p>Inside a language model, embeddings work the other way around. They are <strong>not downloaded</strong> from a standard generator and they are <strong>not fixed in advance</strong>. They are <strong>parameters</strong> — the rows of the embedding matrix <img src="https://latex.codecogs.com/png.latex?E"> — and they are learned <strong>jointly with the rest of the model</strong>, by the same gradient descent that trains the attention and MLP weights. There is a chicken-and-egg quality to this that is the whole point:</p>
<blockquote class="blockquote">
<p>The embedding of a word is whatever vector makes <em>this particular model</em> predict well. The model shapes the embeddings; the embeddings shape the model; they are solved for together.</p>
</blockquote>
<p>So when GPT-2 places “king” at some location in its 768-dimensional space, that location is not a statement about the timeless meaning of <em>king</em>. It is a statement about <strong>what direction is useful for GPT-2’s next-token predictions</strong>, given everything else GPT-2’s weights are simultaneously doing. Train a different model on different data and you get different embeddings, even for the same word.</p>
<p>A fair objection: <em>aren’t the vectors in a vector database also learned?</em> They are — and being precise about this softens the contrast in the right way, because the real difference is <strong>not</strong> “learned versus not learned.” The encoder behind a vector database is itself a trained neural network — OpenAI’s <code>text-embedding-3</code>, the open <code>sentence-transformers</code> family, or token-level late-interaction retrievers like <strong>ColBERT</strong> are all <em>optimised</em> models. What differs is the <strong>objective</strong> they were optimised for. A retrieval encoder is trained so that the <strong>similarity between two pieces of text tracks their relevance</strong> — exactly the quantity a search index needs. A language model’s embeddings are trained so that <strong>the model predicts the next token</strong>. Same mechanism (vectors solved for by gradient descent), different goal — and a different geometry results. So the lesson of this section is not “LM embeddings are special because they are learned,” but the sharper: <strong>every embedding is shaped by the task it was trained for; there is no neutral, universal embedding you simply look up.</strong> The vector you get for <em>king</em> depends on whether you asked “what retrieves documents about kings?” or “what predicts the word after <em>king</em>?”</p>
<p>This is the first thing to internalise, because it explains everything that follows — including why the most natural-seeming way to <em>build</em> embeddings by hand, dictating the axes yourself, does not work.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 38%">
<col style="width: 40%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Vector-database embedding</strong></th>
<th><strong>Language-model embedding</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Where it comes from</td>
<td>a pre-trained encoder you call</td>
<td>a parameter learned during training</td>
</tr>
<tr class="even">
<td>Fixed or learned?</td>
<td>learned by the encoder, then frozen at lookup</td>
<td>learned jointly with all other weights</td>
</tr>
<tr class="odd">
<td>What “similar” means</td>
<td>a property of the encoder’s training</td>
<td>whatever helps <em>this</em> model predict next tokens</td>
</tr>
<tr class="even">
<td>Role in the system</td>
<td>an <strong>input</strong> (you store it)</td>
<td>an <strong>output</strong> of training (the model owns it)</td>
</tr>
<tr class="odd">
<td>Can you choose the axes?</td>
<td>no (encoder decides)</td>
<td>no — and §2 shows that trying breaks the model</td>
</tr>
</tbody>
</table>
<hr>
</section>
<section id="the-pivot-a-cautionary-tale-about-wiring-the-meaning-by-hand" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="the-pivot-a-cautionary-tale-about-wiring-the-meaning-by-hand">2. The pivot: a cautionary tale about wiring the meaning by hand</h2>
<p>Because embeddings are “just” vectors with interpretable neighbourhoods, a very reasonable engineer’s instinct is: <em>why not build them myself?</em> If I know that words differ along a few common-sense axes — gender, age, power, size, whether they are animate, to name a few — why not assign each word a short vector on those axes, feed a small model some sentences, and let it learn the rest? The axes would be human-readable, the space would be tidy, and interpretability would come for free.</p>
<p>This note’s running example was built precisely to test that instinct, and the result is the most useful thing in it. I designed a tiny vocabulary (~40 content words: <em>king, queen, lion, lioness, stag, mountain, pebble, ship…</em>) and a <strong>six-dimensional hand-wired latent space</strong> — <code>gender</code>, <code>age</code>, <code>power</code>, <code>size</code>, <code>animacy</code>, <code>mobility</code> — with each word assigned coordinates in <img src="https://latex.codecogs.com/png.latex?%5C%7B-1,%200,%20+1%5C%7D"> (king = male/adult/strong/—/animate/mobile; pebble = —/—/—/small/inanimate/still; and so on). From that latent table I generated a corpus of simple definition and comparison sentences (“<em>the king is older than the …</em>”, “<em>a lion is bigger than a …</em>”), and trained a small Transformer (2 layers, 2 heads, <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D=32">) on it. The model was <strong>never shown the latent table</strong> — the experiment was whether, trained on data whose every regularity is a function of those six axes, the model would <strong>recover</strong> them in its own learned embedding.</p>
<p>It did not. Here is the picture that tells the story.</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="figs/fig1_handwired_vs_learned.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-1" title="Hand-wired vs learned embedding space"><img src="https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/figs/fig1_handwired_vs_learned.png" class="column-page img-fluid figure-img" alt="Hand-wired vs learned embedding space"></a></p>
<figcaption>Hand-wired vs learned embedding space</figcaption>
</figure>
</div>
<p>On the <strong>left</strong> is the hand-wired latent space (the six designed dimensions, projected to 2-D by PCA). It is exactly what you would hope for: clean, structured, the three classes (human / animal / thing) separated, the discrete lattice of designed coordinates visible — its top two principal components capture <strong>70%</strong> of the variance, because <em>we</em> put the structure there. On the <strong>right</strong> is the model’s <strong>learned</strong> 32-dimensional embedding for the <em>same words</em>, projected the same way. The tidy structure is gone: the classes interleave, the top two PCs capture only <strong>20%</strong> of the variance, and — the giveaway — the famous word2vec analogy fails. Computing <code>king − man + woman</code> and asking for the nearest word returns <strong><code>lioness</code></strong> (cosine 0.42), with <code>child</code> and <code>cub</code> next; <code>queen</code> is not in the top five. The model has not learned a clean “gender axis”; it has learned a diffuse <em>blob</em> of feminine-animate mass and a separate male-animate blob, and the arithmetic lands wherever those blobs happen to sit.</p>
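<p>For readers who want to see what the analogy test actually computes, here is a minimal sketch. The coordinates are assumptions: the <code>king</code> and <code>pebble</code> rows follow the descriptions given above (with male, adult, strong, animate and mobile arbitrarily coded as +1), and the other rows are invented in the same spirit. This is not the latent table used in the experiment.</p>

```python
import math

# Illustrative coordinates on the six designed axes
# (gender, age, power, size, animacy, mobility), each in {-1, 0, +1}.
# Assumed values, not the experiment's actual table.
latent = {
    "king":    [1, 1, 1, 0, 1, 1],
    "queen":   [-1, 1, 1, 0, 1, 1],
    "man":     [1, 1, 0, 0, 1, 1],
    "woman":   [-1, 1, 0, 0, 1, 1],
    "lion":    [1, 1, 1, 1, 1, 1],
    "lioness": [-1, 1, 1, 1, 1, 1],
    "pebble":  [0, 0, 0, -1, -1, -1],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(table, a, b, c, top=3):
    """Rank words by cosine similarity to a - b + c, excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    others = [w for w in table if w not in (a, b, c)]
    ranked = sorted(others, key=lambda w: cosine(table[w], target), reverse=True)
    return [(w, round(cosine(table[w], target), 2)) for w in ranked[:top]]


result = analogy(latent, "king", "man", "woman")
print(result)
```

<p>In this hand-wired table the arithmetic lands exactly on <code>queen</code>, but only because gender sits on its own axis by construction. The result reported above comes from running the same kind of test on the <em>learned</em> embedding, where no such axis exists.</p>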
<p>By every acceptance test we set in advance, the experiment failed: principal components flat (50% cumulative over six PCs, against a 75% target), <strong>zero of six</strong> linear probes able to read a latent axis back out at <img src="https://latex.codecogs.com/png.latex?R%5E2%20%3E%200.8">, and only two of five designed analogies recovered. So why does hand-wiring the meaning fail so thoroughly?</p>
<section id="a.-the-child-and-the-draughtsman" class="level3">
<h3 class="anchored" data-anchor-id="a.-the-child-and-the-draughtsman">2a. The child and the draughtsman</h3>
<p>A child asked to draw a person assembles a kit of symbols: a circle for the head, two dots for eyes, a triangle for the nose, a line for the mouth, sticks for the limbs. Each part is a discrete token, drawn on its own, sitting in its own patch of paper, meaning one thing. The picture is a sum of labelled pieces.</p>
<p>A trained draughtsman does the opposite. A single line traces the underside of the eye and, without lifting, becomes the bridge of the nose; one shadow gives the cheekbone and the eye socket at once; whole regions are set not by the outline of a “part” but by tone. No stroke belongs to one feature, and no feature lives in one stroke. The face emerges from the configuration of marks as a whole — which is the central lesson of Betty Edwards’ <em>Drawing on the Right Side of the Brain</em>: the beginner’s real obstacle is cognitive, not manual. The mind insists on what it already knows — “an eye is an almond with a circle in it” — and stamps that symbol onto the page, drowning out what the eye reports. To draw, you switch off the symbol-maker and let the subject come out of the whole.</p>
<p>This is exactly the difference between the two ways of holding meaning in a vector. When I fabricated embeddings by hand — one axis for gender, one for age, one for power, one for size — I was drawing like the child: <strong>one feature per symbol</strong>, each concept walled off on its own dedicated dimension, the vocabulary a kit of labelled parts. It trained poorly, and the failure was not a bug to fix; it was the model refusing to draw like a child. A set of embeddings that works is drawn like the artist: each direction in the space serves several meanings at once, and each meaning is spread across many directions. Concepts are not parked on private axes — they are <strong>shared out over the available strokes</strong>, and the word comes out of the whole.</p>
<blockquote class="blockquote">
<p>Hand-wired embeddings draw a face the way a child does: a symbol per part. Learned embeddings draw it the way an artist does: every line doing several jobs at once, no part living in a single stroke.</p>
</blockquote>
</section>
<section id="b.-from-symbols-to-distribution-and-superposition" class="level3">
<h3 class="anchored" data-anchor-id="b.-from-symbols-to-distribution-and-superposition">2b. From symbols to distribution and superposition</h3>
<p>The analogy is worth pinning exactly where it holds. It explains why a working code is <strong>distributed</strong> rather than <strong>symbolic</strong> — meaning smeared across the strokes instead of filed under labels. It does not, on its own, explain the stronger and more specific fact that a model stores <em>more features than it has dimensions</em>, tolerating a little interference to do so. That step is not about artistry but about geometry: it is possible only because a high-dimensional space has room for very many almost-non-overlapping directions. The drawing is the intuition; superposition is the mechanism — which is where we turn next.</p>
<p>Back to our question: why does hand-wiring the meaning fail? There are four linked reasons, and together they are the core lesson of this whole note.</p>
<ol type="1">
<li><p><strong>Hand-wiring forces axis-aligned concepts — one meaning per dimension.</strong> My design said “dimension 1 <em>is</em> gender, dimension 3 <em>is</em> power.” But a learned embedding has no reason to put one concept on one coordinate axis. Which brings us to:</p></li>
<li><p><strong>Real features are distributed, and packed in superposition.</strong> Gradient descent does not store “gender” on axis 1. It stores many features as <strong>directions</strong> — often many <em>more</em> directions than there are dimensions — squeezed in at angles chosen so they interfere with each other as little as possible. A direction that means “feminine” can be a diagonal combination of dozens of coordinates, sharing space with hundreds of other such diagonals. This packing of more features than dimensions is called <strong>superposition</strong>, and it is the normal state of affairs in a self-organising model trained to minimise empirical error, not a pathology.</p></li>
<li><p><strong>Descent wants freedom, not your axes.</strong> The reason it packs things this way is that it is optimising for prediction, and the geometry that predicts best is the one that places each useful direction where interference is lowest — <em>not</em> the one a human finds legible. Dictating the axes removes exactly the freedom the optimiser needs. A handful of clean symbolic dimensions simply cannot supply the statistical degrees of freedom a model uses to fit language.</p></li>
<li><p><strong>It is the same phenomenon as distributed coding.</strong> Neuroscientists have long argued that a concept in the brain is carried by a <em>population</em> of neurons rather than a single “grandmother cell”; interpretability researchers find the same in networks — individual units are <em>usually</em> not cleanly interpretable. The reason you cannot read meaning off a single dimension of a real embedding is <em>the same reason</em> hand-wiring one meaning per dimension fails: meaning does not live on the coordinate axes. It lives in directions, and the axes are an arbitrary basis.</p></li>
</ol>
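<p>The geometric fact behind reason 2, that a high-dimensional space has room for many nearly non-overlapping directions, can be checked numerically. The sketch below is an illustration under a loud assumption: it draws directions <em>at random</em>, whereas a trained model chooses its angles by gradient descent. It therefore shows only that the room exists, not how a model uses it. All names are hypothetical.</p>

```python
import math
import random


def random_unit_vector(dim, rng):
    """A random direction: Gaussian coordinates scaled to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_cosine(dim, n_vectors, rng):
    """Average |cosine| over all pairs of n random unit vectors in `dim` dimensions."""
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            total += abs(sum(a * b for a, b in zip(vecs[i], vecs[j])))
            pairs += 1
    return total / pairs


# 100 directions in every case, so always more directions than dimensions.
overlap = {dim: mean_abs_cosine(dim, 100, random.Random(0)) for dim in (4, 16, 64)}
for dim, value in overlap.items():
    print(dim, round(value, 3))
```

<p>The average overlap between directions shrinks as the dimension grows, which is the sense in which interference can be kept small even when features outnumber dimensions.</p>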
<blockquote class="blockquote">
<p><strong>The lesson, stated plainly:</strong> if you want embeddings that <em>work</em>, you have to give up wiring the meaning by hand. The price of a model that predicts well is a space whose coordinates are not individually meaningful. (This particular toy also illustrates a narrower trap — a corpus mechanically generated from a feature spec gives the model nothing to do but memorise the templates, so it never has to <em>infer</em> the latent axes at all. But the deeper point stands for real models trained on real text: their useful structure is distributed, not axis-aligned.)</p>
</blockquote>
<p><strong>Is this too strong?</strong> It is worth checking against the literature before generalising, because the assertion is forceful — and three well-known results <em>sharpen</em> it rather than overturn it. First, <strong>sparse autoencoders</strong> trained on a model’s activations (Anthropic’s <em>Towards Monosemanticity</em>, 2023; Cunningham et al., 2023) do succeed in recovering clean, human-readable features — but they recover them as <strong>directions</strong> teased out of the superposition, which is precisely the claim that the meaning is present yet <em>not</em> aligned with the raw axes. Second, injecting hand-built structure is not worthless: <strong>retrofitting</strong> word vectors toward a lexicon or ontology (Faruqui et al., 2015) measurably improves them — but it <em>adjusts already-learned vectors</em>, it does not <em>replace</em> learning with a dictated table, which is the thing that fails. Third, the difficulty is principled: recovering clean, axis-aligned factors without strong inductive biases is provably impossible in the unsupervised case (Locatello et al., 2019). So the careful statement is not “structure never helps” — it sometimes does — but: <strong>you cannot hand-author the whole embedding and freeze out the model’s freedom to place features where prediction wants them, and what a model learns will be distributed rather than one-concept-per-axis.</strong> That is the robust core, and it is what the toy demonstrates.</p>
<hr>
</section>
</section>
<section id="what-real-embeddings-are-actually-like" class="level2">
<h2 class="anchored" data-anchor-id="what-real-embeddings-are-actually-like">3. What real embeddings are actually like</h2>
<p>So the dimensions are not individually meaningful, and there are a lot of them: their number <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D"> is 768 in GPT-2 small, 4096 in a typical 7-billion-parameter model, <strong>12,288 in GPT-3</strong> (175 billion parameters, one of the most-studied LLMs), and larger still in frontier models. Two consequences follow immediately, and they set up the rest of the note.</p>
<ul>
<li><p><strong>Reading single dimensions is hopeless</strong>, partly because there are hundreds or thousands of them (far more than anyone can eyeball), and <em>more importantly</em> because, per §2, an individual coordinate usually means nothing. The interesting structure is in <em>combinations</em> of dimensions, that is, in directions.</p></li>
<li><p><strong>Therefore we need compressed views.</strong> To see a high-dimensional space we must squash it down to two or three dimensions we can plot. And here is the part that is easy to forget: <strong>every way of squashing is a choice of what to preserve, and therefore a choice of which question you are asking.</strong> A 2-D map is never “the” embedding space; it is <em>an answer to one question</em> about it. Pick the wrong method for your question and the map will mislead you with total confidence.</p></li>
</ul>
<p>The next section lays out the three families of method you will meet, what question each one answers, and — just as important — what each one quietly destroys.</p>
<hr>
</section>
<section id="three-ways-to-look-at-an-embedding-space" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="three-ways-to-look-at-an-embedding-space">4. Three ways to look at an embedding space</h2>
<p>The three workhorses are <strong>concept directions</strong> (you bring the meaning), <strong>PCA</strong> (the data brings the directions of greatest spread), and <strong>t-SNE/UMAP</strong> (the data brings local neighbourhoods, nonlinearly). They differ on every axis that matters:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 21%">
<col style="width: 26%">
<col style="width: 28%">
<col style="width: 23%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Concept directions / Gram-Schmidt</strong></th>
<th><strong>PCA</strong></th>
<th><strong>t-SNE / UMAP</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Supervised?</td>
<td><strong>Yes</strong> — you choose the axes</td>
<td>No</td>
<td>No</td>
</tr>
<tr class="even">
<td>Linear?</td>
<td>Yes</td>
<td>Yes</td>
<td><strong>No</strong></td>
</tr>
<tr class="odd">
<td>A true projection?</td>
<td>Yes (onto chosen axes)</td>
<td>Yes (onto top eigenvectors)</td>
<td><strong>No</strong> — a learned 2-D <em>embedding</em></td>
</tr>
<tr class="even">
<td>Readable directions / arithmetic?</td>
<td><strong>Yes</strong> — that <em>is</em> the point</td>
<td>Sort of (PC axes have a sign)</td>
<td><strong>No</strong> — axes mean nothing</td>
</tr>
<tr class="odd">
<td>What it preserves</td>
<td>your chosen contrasts</td>
<td>global variance</td>
<td><strong>local</strong> neighbourhoods only</td>
</tr>
<tr class="even">
<td>What it destroys</td>
<td>everything off your plane</td>
<td>small-variance structure</td>
<td>global geometry, distances, density</td>
</tr>
<tr class="odd">
<td>Best question</td>
<td>“where do words fall on <em>this</em> meaning?”</td>
<td>“what are the biggest directions of spread?”</td>
<td>“what clusters together locally?”</td>
</tr>
</tbody>
</table>
<p>To make the comparison concrete, here is one embedding space — the 28-token vocabulary of our trained <em>dialogue-game</em> model (a tiny word-level Transformer, 2 layers × 4 heads, <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D=64">, that plays a turn-taking “calling game”: <em>Pietro chiama Paolo</em>, with <strong>epithets</strong> that depend on who calls whom) — shown all three ways at once. The vocabulary splits into roles: <strong>players</strong> (Pietro, Paolo, the numbers 1–8), <strong>epithets</strong> (Tarso, Cefa, capo, vice), <strong>verbs</strong> (chiama, perde), <strong>absurd</strong> distractor words, and <strong>special</strong> tokens.</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="figs/fig2_three_families.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-2" title="The same 28-token vocabulary under all three methods, left to right: panel (a) concept directions / Gram–Schmidt (axes e_1= Paolo − Pietro, e_2= peer − subordinate epithets), panel (b) PCA, panel (c) t-SNE. Click the figure to enlarge — the panel titles and axis labels are only legible at full size."><img src="https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/figs/fig2_three_families.png" class="column-page img-fluid figure-img" alt="The same 28-token vocabulary under all three methods, left to right: panel (a) concept directions / Gram–Schmidt (axes e_1= Paolo − Pietro, e_2= peer − subordinate epithets), panel (b) PCA, panel (c) t-SNE. Click the figure to enlarge — the panel titles and axis labels are only legible at full size."></a></p>
<figcaption>The same 28-token vocabulary under all three methods, left to right: <strong>panel (a)</strong> concept directions / Gram–Schmidt (axes <img src="https://latex.codecogs.com/png.latex?e_1="> Paolo − Pietro, <img src="https://latex.codecogs.com/png.latex?e_2="> peer − subordinate epithets), <strong>panel (b)</strong> PCA, <strong>panel (c)</strong> t-SNE. <em>Click the figure to enlarge — the panel titles and axis labels are only legible at full size.</em></figcaption>
</figure>
</div>
<p>Same 28 points, three pictures, three different stories. Now each method in turn.</p>
<section id="a.-concept-directions-you-bring-the-meaning" class="level3">
<h3 class="anchored" data-anchor-id="a.-concept-directions-you-bring-the-meaning">4a. Concept directions — you bring the meaning</h3>
<p>The most honest method is also the most opinionated: you <strong>decide</strong> what the axes mean. You pick a direction in embedding space that stands for a concept, either by naming a token (“the direction of <em>Tarso</em>”) or, more usefully, by taking a <strong>difference of tokens</strong> (“<em>Paolo</em> minus <em>Pietro</em>” = the <em>which-leader</em> direction), then orthogonalise a second concept against the first (Gram-Schmidt) so the two axes are independent, and read off where every word lands. The <strong>leftmost panel (a)</strong> of the figure above uses <img src="https://latex.codecogs.com/png.latex?e_1="> (Paolo − Pietro) and <img src="https://latex.codecogs.com/png.latex?e_2="> (peer epithets − subordinate epithets): the epithets fly to the corners exactly as their roles predict.</p>
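<p>The recipe (difference of tokens, Gram-Schmidt, read off coordinates) fits in a short sketch. The four-dimensional vectors and the labels <code>peer_epithet</code> and <code>sub_epithet</code> are placeholders invented for the example; the real model has 64 dimensions and its own learned rows.</p>

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scale(a, s):
    return [x * s for x in a]


def unit(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))


def concept_plane(d1, d2):
    """Orthonormal axes: e1 along d1; e2 is d2 with its e1-component removed."""
    e1 = unit(d1)
    e2 = unit(sub(d2, scale(e1, dot(d2, e1))))  # the Gram-Schmidt step
    return e1, e2


def project(vec, e1, e2):
    """Coordinates of a vector in the concept plane."""
    return dot(vec, e1), dot(vec, e2)


# Made-up 4-dimensional "embeddings", for illustration only.
emb = {
    "Pietro":       [1.0, 0.2, 0.0, 0.3],
    "Paolo":        [-1.0, 0.1, 0.2, 0.3],
    "peer_epithet": [-0.6, 0.9, 0.1, 0.0],
    "sub_epithet":  [0.5, -0.8, 0.2, 0.1],
}

e1, e2 = concept_plane(
    sub(emb["Paolo"], emb["Pietro"]),             # which-leader direction
    sub(emb["peer_epithet"], emb["sub_epithet"]),  # register direction
)
coords = {word: project(vec, e1, e2) for word, vec in emb.items()}
for word, (x, y) in coords.items():
    print(word, round(x, 2), round(y, 2))
```

<p>Because the plane is built from the two contrasts, everything orthogonal to it is discarded: the plot answers the question that was asked and nothing else.</p>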
<p>This is the <strong>king − man + woman ≈ queen</strong> world, and it is the method this project already uses elsewhere — applied not to the vocabulary but to the <strong>residual stream</strong>, the running vector the model updates token by token (see the <a href="../../posts/transformers-qkv-attention/index.html">QKV note</a>). Because our model <strong>ties</strong> its embedding and unembedding matrices (<a href="../../posts/the-final-step/index.html">The Final Step</a> develops why this makes the dictionary the model <em>searches</em> literally the embedding matrix), moving the residual toward a token’s embedding direction <em>is</em> raising that token’s probability. So we can build a Gram-Schmidt plane from two embedding rows and watch the prediction travel across it. The next figure is a <strong>separate, single-panel plot</strong> — <em>not</em> one of the three panels above, and built on a different pair of axes:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="figs/fig5_residual_trajectory.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Residual-stream trajectory in a concept plane (a single-panel figure, distinct from the three-panel figure above). Here e_1 is the Tarso direction and e_2 is the callee Paolo orthogonalised against it — chosen to track the prediction of Tarso."><img src="https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/figs/fig5_residual_trajectory.png" class="img-fluid figure-img" alt="Residual-stream trajectory in a concept plane (a single-panel figure, distinct from the three-panel figure above). Here e_1 is the Tarso direction and e_2 is the callee Paolo orthogonalised against it — chosen to track the prediction of Tarso."></a></p>
<figcaption>Residual-stream trajectory in a concept plane (a single-panel figure, distinct from the three-panel figure above). Here <img src="https://latex.codecogs.com/png.latex?e_1"> is the <strong>Tarso</strong> direction and <img src="https://latex.codecogs.com/png.latex?e_2"> is the callee <strong>Paolo</strong> orthogonalised against it — chosen to track the prediction <em>of Tarso</em>.</figcaption>
</figure>
</div>
<p>For the prompt <code>&lt;BOS&gt; Pietro chiama Paolo</code>, the model must emit Paolo’s epithet, <strong>Tarso</strong>. Note this is a <em>different</em> concept plane from panel (a): here <img src="https://latex.codecogs.com/png.latex?e_1"> is the <strong>Tarso</strong> direction itself and <img src="https://latex.codecogs.com/png.latex?e_2"> is the callee <strong>Paolo</strong> with its Tarso-component removed (Gram-Schmidt again, on a different pair of rows) — chosen because we are now tracking the prediction <em>of Tarso</em>, not laying out the whole vocabulary by leader and register. The residual starts near the called name and <strong>migrates across the plane toward the Tarso direction</strong> as the layers run (the <img src="https://latex.codecogs.com/png.latex?e_1"> coordinate climbs from −3.1 to +10.6), and decomposing the path by <em>which write moved it</em> shows that <strong>the attention block in layer 0 does the +7.62 push</strong> toward the epithet — pinning the behaviour to a specific component, the same kind of causal claim the QKV note’s ablation appendix makes. The point for <em>this</em> note: <strong>concept directions are supervised</strong>. They show you exactly what you ask about and nothing else.</p>
<p>That last clause is the catch, and the numbers make it honest. The concept plane in panel (a) captures only <strong>15%</strong> of the vocabulary’s spread, and the residual plane captures <strong>28%</strong> of the trajectory’s spread (against a best-possible 72% — see §4b). A supervised plane is chosen for <em>meaning</em>, not for <em>variance</em>, so it generally is <strong>not</strong> where the data spreads most. That is a feature, not a bug — but it means you are seeing your hypothesis, not the data’s own structure. <strong>Caveat:</strong> the famous analogies are partly fragile and cherry-picked (the literature has known this for a decade — see §7); they are cleanest on classic <em>static</em> word embeddings and patchier inside trained language models, as our own <code>king − man + woman → lioness</code> already warned.</p>
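<p>The tied-weights point made earlier in this section (moving the residual toward a token’s embedding row raises that token’s score) can be seen in a toy calculation. The sketch assumes the simplest possible read-out, a dot product with each embedding row followed by softmax, with no final normalisation layer; the three-dimensional vectors are invented and have nothing to do with the trained model’s values.</p>

```python
import math


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probs(residual, embedding):
    """Tied weights: a word's logit is the dot product of its embedding row with the residual."""
    logits = [sum(e * r for e, r in zip(row, residual)) for row in embedding.values()]
    return dict(zip(embedding, softmax(logits)))


# Made-up rows, for illustration only.
embedding = {
    "Paolo": [1.0, 0.0, 0.2],
    "Tarso": [0.1, 1.0, 0.0],
    "Cefa":  [0.0, 0.3, 1.0],
}
residual = [1.5, 0.2, 0.1]  # starts close to the called name

before = next_word_probs(residual, embedding)

# A hypothetical "write" that adds a multiple of Tarso's row to the residual.
step = 2.0
pushed = [r + step * t for r, t in zip(residual, embedding["Tarso"])]
after = next_word_probs(pushed, embedding)

print("before:", {w: round(p, 3) for w, p in before.items()})
print("after: ", {w: round(p, 3) for w, p in after.items()})
```

<p>In this toy the push makes <code>Tarso</code> the top-ranked word. In general the write also shifts every other word’s logit by its overlap with the Tarso row, so how much the probability moves depends on the geometry of the whole dictionary.</p>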
</section>
<section id="b.-pca-the-datas-own-biggest-directions" class="level3">
<h3 class="anchored" data-anchor-id="b.-pca-the-datas-own-biggest-directions">4b. PCA — the data’s own biggest directions</h3>
<p>PCA asks a different, unsupervised question: <em>along which directions does the data spread most?</em> It finds them by an eigen-decomposition of the covariance matrix of the centred data (equivalently, a single <code>svd</code> call on the centred data matrix), and projects onto the top two. Unlike concept directions, you bring no hypothesis; unlike t-SNE, it is a genuine <strong>linear projection</strong>: the axes have a fixed meaning, you can read coordinates off them, and (with care) do arithmetic.</p>
<p>Panel (b) above is the PCA of our 28 embeddings; its top two components hold <strong>34.7%</strong> of the variance, top four <strong>55.3%</strong>. That single number — “only a third of the structure fits in the best possible 2-D linear view” — is itself the most useful thing PCA tells you, and it is <em>honest in a way a t-SNE plot never is</em>: it quantifies how much you are <strong>not</strong> seeing.</p>
<p>But PCA has a caveat that bites constantly and is widely under-appreciated: <strong>variance is not meaning.</strong> The directions of greatest spread are very often boring — token <strong>frequency</strong>, vector <strong>norm</strong>, or positional artefacts — rather than clean semantics. In word embeddings this is so reliable that a standard preprocessing trick (“All-but-the-Top”, §7) is to <em>delete</em> the top few PCs because they encode frequency. The practical defence, which our tooling also offers, is to project the rows onto the <strong>unit sphere</strong> first (a “cosine” view): that removes the magnitude/frequency effect and lifts our 2-D fraction from 34.7% to 43.5%, placing rare tokens by their <em>direction</em> instead of letting their small norm collapse them to the centre.</p>
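<p>The PCA recipe above can be sketched in a few lines of Python with NumPy. This is an illustration under stated assumptions, not the note's own pipeline (that is the Julia script credited at the end): the six-row matrix is made-up data rather than the 28 embeddings, the function names are hypothetical, and re-centring after the unit-sphere step is my assumption. The sketch centres the rows, makes one <code>svd</code> call, and reports the share of variance held by the top two components, first for the raw rows and then for the rows scaled to unit length.</p>

```python
import numpy as np

def pca_variance_fractions(X):
    """Return (2-D coordinates, variance share per component) for the rows of X."""
    Xc = X - X.mean(axis=0)                      # centre every column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    fractions = s**2 / np.sum(s**2)              # squared singular values ~ variance
    coords_2d = Xc @ Vt[:2].T                    # project onto the top two directions
    return coords_2d, fractions

def to_unit_sphere(X):
    """Scale every row to length 1, so only its direction is left (the cosine view)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Made-up data: long vectors along x, medium along y, short along z.
X = np.array([
    [ 2.0,  0.0,  0.0],
    [-2.0,  0.0,  0.0],
    [ 0.0,  1.0,  0.0],
    [ 0.0, -1.0,  0.0],
    [ 0.0,  0.0,  0.5],
    [ 0.0,  0.0, -0.5],
])

coords, frac = pca_variance_fractions(X)
top2_raw = frac[:2].sum()                        # share seen in the raw 2-D view

_, frac_unit = pca_variance_fractions(to_unit_sphere(X))
top2_unit = frac_unit[:2].sum()                  # share seen in the cosine 2-D view

print(f"raw top-2 share:  {top2_raw:.3f}")
print(f"unit top-2 share: {top2_unit:.3f}")
```

<p>On this toy data the top-two share is about 95% for the raw rows and 67% on the unit sphere: normalising removes the advantage that long vectors have in a variance ranking. Which way the share moves depends on the data; in the note's 28-token example it goes up.</p>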
</section>
<section id="c.-t-sne-and-umap-local-neighbourhoods-with-loud-caveats" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="c.-t-sne-and-umap-local-neighbourhoods-with-loud-caveats">4c. t-SNE (and UMAP) — local neighbourhoods, with loud caveats</h3>
<p>t-SNE answers a third question: <em>which points are each other’s nearest neighbours?</em> It is <strong>nonlinear</strong> and <strong>local</strong> — it tries to keep neighbours together while caring nothing about anything else — and it usually produces the prettiest, most cluster-y pictures, which is exactly why it is the most dangerous. Four caveats deserve to be stated loudly, because every one of them is routinely violated in practice:</p>
<ul class="page-columns page-full">
<li><p><strong>It is not a projection.</strong> There are no axes, so a t-SNE plot has <strong>no readable directions and supports no arithmetic.</strong> “Right” and “up” mean nothing. (UMAP, the popular faster cousin, shares this limitation and, despite a common belief otherwise, is <em>not</em> designed to preserve global structure either.)</p></li>
<li><p><strong>Inter-cluster distances are meaningless.</strong> Two clusters drawn far apart are not “more different” than two drawn close. The gaps between blobs carry no information.</p></li>
<li class="page-columns page-full"><p><strong>It is stochastic.</strong> A different random seed gives a different layout. Here is the <em>same</em> 28-token data, same perplexity, two seeds:</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="figs/fig3_tsne_seeds.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-4" title="t-SNE seed sensitivity"><img src="https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/figs/fig3_tsne_seeds.png" class="column-page img-fluid figure-img" alt="t-SNE seed sensitivity"></a></p>
<figcaption>t-SNE seed sensitivity</figcaption>
</figure>
</div>
<p>Paolo lands top-left in one run and bottom-centre in the other; the global arrangement reshuffles entirely. If your conclusion would change with the seed, it was never a conclusion about the data.</p></li>
<li><p><strong>It is perplexity-sensitive.</strong> The one knob (roughly, “how many neighbours count as local”) changes the picture qualitatively; there is no single right value, and small datasets are especially unstable. The classic, still-essential reference is Wattenberg, Viégas &amp; Johnson, <em>“How to Use t-SNE Effectively”</em> (Distill, 2016) — read it before you trust any t-SNE plot, your own included.</p></li>
</ul>
<p>Used within its remit — “what clusters with what, locally” — t-SNE is genuinely useful. Read as a map with meaningful axes and distances, it is a confident liar.</p>
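<p>The perplexity knob can be made concrete with a small sketch. In the t-SNE paper (van der Maaten &amp; Hinton, 2008) each point gets a Gaussian-weighted probability distribution over the other points, and perplexity is 2 raised to the entropy of that distribution, which is why it reads as an effective number of neighbours. The Python below computes it for a single point; the function names and the distances are hypothetical. Real t-SNE works the other way round, searching for the bandwidth <code>sigma</code> of each point that hits the perplexity you asked for.</p>

```python
import math

def neighbour_probs(sq_dists, sigma):
    """Gaussian-weighted probabilities over the other points, seen from one point."""
    weights = [math.exp(-d2 / (2.0 * sigma ** 2)) for d2 in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** (Shannon entropy in bits): the effective number of neighbours."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

# Four neighbours at the same distance: all four count equally.
four_equal = perplexity(neighbour_probs([1.0, 1.0, 1.0, 1.0], sigma=1.0))

# Two near, two far, narrow bandwidth: only the near pair counts.
two_near = perplexity(neighbour_probs([1.0, 1.0, 100.0, 100.0], sigma=1.0))

# Same distances, very wide bandwidth: the far pair counts again.
wide = perplexity(neighbour_probs([1.0, 1.0, 100.0, 100.0], sigma=100.0))

print(round(four_equal, 3), round(two_near, 3), round(wide, 3))
```

<p>The same four distances give an effective neighbourhood of about two points or about four, depending only on the bandwidth. That is the sense in which changing perplexity changes what “local” means, and with it the picture.</p>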
<hr>
</section>
</section>
<section id="capstone-does-the-intuition-survive-contact-with-real-language" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="capstone-does-the-intuition-survive-contact-with-real-language">5. Capstone — does the intuition survive contact with real language?</h2>
<p>Our dialogue game has 28 words and a hand-built rule. Do the same intuitions hold when we scale up to a model trained on something language-like? To check, I took a real <strong>TinyStories</strong>-trained GPT-2 (<code>segestic/Tinystories-gpt-0.1-3m</code> — a small model trained on the TinyStories corpus of simple children’s stories, with a 50,257-token vocabulary and, conveniently, the <em>same</em> 64-dimensional embeddings as our toy), curated 84 readable whole-word tokens across clear semantic groups (animals, people, nature, colours, verbs, adjectives, function words), and ran the same two unsupervised methods.</p>
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="figs/fig4_tinystories.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-5" title="TinyStories capstone — PCA and t-SNE"><img src="https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/figs/fig4_tinystories.png" class="column-page img-fluid figure-img" alt="TinyStories capstone — PCA and t-SNE"></a></p>
<figcaption>TinyStories capstone — PCA and t-SNE</figcaption>
</figure>
</div>
<p>Two things survive the jump to a real corpus. First, <strong>the warnings hold</strong>: PCA captures only <strong>33%</strong> of the variance in 2-D — the same “most of the structure is off-plane” story as the toy — and the function words (grey) peel off along the first PC, a textbook case of <em>frequency</em> dominating the top component (§4b). Second, <strong>structure does appear, but softly</strong>: in both views you can see animals loosely grouping, function words separating, colours and adjectives drifting together — real, but smeared, not the clean clusters a t-SNE plot’s prettiness might tempt you to over-read. The lesson scales exactly: the maps are useful for <em>orientation</em>, never for <em>measurement</em>, and they get harder to read, not easier, as the model becomes more operationally relevant.</p>
<hr>
</section>
<section id="how-this-connects-to-the-rest-of-the-series" class="level2">
<h2 class="anchored" data-anchor-id="how-this-connects-to-the-rest-of-the-series">6. How this connects to the rest of the series</h2>
<p>This note and its two companions describe one machine from three angles, and they share one matrix:</p>
<ul>
<li>The vectors visualised here <strong>are</strong> the columns of the unembedding matrix <img src="https://latex.codecogs.com/png.latex?W_U"> that the model scores the final residual against to pick a next token. With <strong>tied weights</strong> — as in both our toy models — the embedding matrix <img src="https://latex.codecogs.com/png.latex?E"> and the unembedding <img src="https://latex.codecogs.com/png.latex?W_U%20=%20E%5E%5Ctop"> are <em>the same numbers</em>. So this note is <strong>“what the searched dictionary looks like,”</strong> and <a href="../../posts/the-final-step/index.html">The Final Step</a> is <strong>“how the search works.”</strong> The maps in §4 are maps of the very dictionary the model’s last step looks words up in.</li>
<li>The <a href="../../posts/transformers-qkv-attention/index.html">QKV note</a> explains how attention assembles the residual vector that then gets scored. Fig 5 here is the bridge: it watches that residual move through a <em>concept-direction</em> plane built from the embedding rows, and attributes the decisive move to one attention block — the visualization technique of §4a applied to the mechanism of the QKV note.</li>
</ul>
<p>The through-line: an embedding is a <strong>direction in a space the model owns</strong>; attention <strong>moves</strong> the residual through that space; the unembedding <strong>searches</strong> the space for the nearest word. Visualisation is how we — who cannot see in 768 dimensions — get a partial, question-shaped glimpse of where everything sits.</p>
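<p>The last link in that chain, scoring the final residual against the tied dictionary, can be sketched in plain Python. The token names are borrowed from the dialogue game, but the 2-D rows and the residual are made-up numbers, not the trained model's, and the function names are hypothetical.</p>

```python
import math

# Hypothetical 2-D embedding rows (made-up numbers, not the trained model's).
E = {
    "Pietro": (1.0, 0.0),
    "Paolo":  (0.0, 1.0),
    "Tarso":  (1.0, 1.0),
}

def logits_from_residual(residual, embedding):
    """Tied weights: score the residual against every embedding row (W_U = E^T)."""
    return {tok: sum(r * e for r, e in zip(residual, row))
            for tok, row in embedding.items()}

def softmax(scores):
    """Exponentiate and normalise; subtracting the max keeps exp() stable."""
    top = max(scores.values())
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}

residual = (2.0, 3.0)                        # pretend final residual vector
logits = logits_from_residual(residual, E)   # one dot product per vocabulary token
probs = softmax(logits)
best = max(probs, key=probs.get)             # the token with the highest probability

print(logits)
print(best)
```

<p>One design point worth noticing: a dot product rewards both alignment and length, so “nearest word” here means “highest score against the residual”, not smallest Euclidean distance.</p>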
<hr>
</section>
<section id="where-this-sits-in-the-literature-an-honesty-box" class="level2">
<h2 class="anchored" data-anchor-id="where-this-sits-in-the-literature-an-honesty-box">7. Where this sits in the literature (an honesty box)</h2>
<p>None of the <em>findings</em> here are new, and a specialist would recognise every piece; the contribution is the exposition and the live, perturbable spreadsheet models behind the figures. To locate the material honestly:</p>
<ul>
<li><strong>Concept directions and analogies</strong> go back to <strong>word2vec</strong> (Mikolov et al., 2013), whose <code>king − man + woman ≈ queen</code> is the origin of the whole “linear semantics” picture. That picture is real but <strong>partly fragile</strong>: later work (Levy &amp; Goldberg, 2014; Linzen, 2016) showed the analogies are sensitive to normalisation and to <em>excluding the input words</em>, and are easy to cherry-pick, which is why our in-model analogy landed on <code>lioness</code>, not <code>queen</code>.</li>
<li>The idea that concepts are <strong>linear directions</strong> in representation space is the <strong>linear representation hypothesis</strong> (e.g.&nbsp;Park, Choe &amp; Veitch, 2023, and a long interpretability lineage). That distributed features are packed <strong>more than one per dimension</strong> is <strong>superposition</strong>, made precise in Anthropic’s <em>Toy Models of Superposition</em> (Elhage et al., 2022) — the formal version of §2’s lesson and of “single dimensions aren’t interpretable.”</li>
<li><strong>Does hand-built structure ever help? (the §2 counter-example check.)</strong> With care, yes — and none of it rescues hand-wiring. <strong>Sparse autoencoders</strong> recover monosemantic feature <em>directions</em> from superposition (Bricken et al., <em>Towards Monosemanticity</em>, 2023; Cunningham et al., 2023) — confirming meaning lives in directions, not axes; <strong>retrofitting</strong> nudges already-learned vectors toward a lexicon to improve them (Faruqui et al., 2015); and the impossibility of unsupervised axis-aligned disentanglement without inductive bias is a theorem (Locatello et al., 2019). Together these say the §2 failure is about <em>dictating and freezing</em> the embedding, not about structure being useless.</li>
<li>The <strong>PCA caveat</strong> that top components track <strong>frequency/norm</strong> rather than clean semantics is well documented; the “delete the top components” fix is <em>All-but-the-Top</em> (Mu &amp; Viswanath, 2018).</li>
<li>The <strong>t-SNE caveats</strong> are from van der Maaten &amp; Hinton (2008) and, for practice, Wattenberg, Viégas &amp; Johnson’s <em>How to Use t-SNE Effectively</em> (Distill, 2016); <strong>UMAP</strong> is McInnes, Healy &amp; Melville (2018), with Coenen &amp; Pearce’s <em>Understanding UMAP</em> as the matching cautionary companion.</li>
<li><strong>Honesty caveat on the toy:</strong> the dialogue-game model is reproduced from the public <em>ToyDialogueGames</em> exercise; the novelty is the transparent, cell-by-cell spreadsheet realisation, the epithet/binding extension, and this exposition for a non-CS audience — not the toy or the methods themselves. The TinyStories capstone uses a community checkpoint (<code>segestic/Tinystories-gpt-0.1-3m</code>); my own from-scratch TinyStories replication is awaiting hardware.</li>
</ul>
<hr>
</section>
<section id="appendix-a-a-gram-schmidt-projection-you-can-check-by-hand" class="level2">
<h2 class="anchored" data-anchor-id="appendix-a-a-gram-schmidt-projection-you-can-check-by-hand">Appendix A: a Gram-Schmidt projection you can check by hand</h2>
<p>Concept-direction maps look like magic but are just a handful of dot products. Take a toy 3-dimensional embedding with four words:</p>
<pre><code>king   = ( 2,  1,  0)
queen  = ( 2, -1,  0)
man    = ( 1,  1,  1)
woman  = ( 1, -1,  1)</code></pre>
<p>Suppose we want a “gender” axis and a “royalty” axis. Define them as token differences and build an orthonormal plane.</p>
<p><strong>Axis 1 — gender</strong>, as (king − queen):</p>
<pre><code>v1 = king − queen = (0, 2, 0)        e1 = v1/‖v1‖ = (0, 1, 0)</code></pre>
<p>So <img src="https://latex.codecogs.com/png.latex?e_1"> is just the second coordinate. Good: in this toy, dimension 2 happens to carry gender (the male words have +1, the female words −1).</p>
<p><strong>Axis 2 — royalty</strong>, as (king − man), then Gram-Schmidt against <img src="https://latex.codecogs.com/png.latex?e_1">:</p>
<pre><code>v2  = king − man = (1, 0, -1)
v2·e1 = (1,0,-1)·(0,1,0) = 0          (already orthogonal to e1)
e2  = v2/‖v2‖ = (1, 0, -1)/√2 ≈ (0.71, 0, -0.71)</code></pre>
<p><strong>Project each word</strong> onto <img src="https://latex.codecogs.com/png.latex?(e_1,%20e_2)"> — two dot products per word:</p>
<pre><code>        e1 (gender)          e2 (royalty)
king    (2,1,0)·e1 =  1      (2,1,0)·e2 = (2−0)/√2 ≈  1.41
queen   (2,-1,0)·e1 = -1     (2,-1,0)·e2 ≈  1.41
man     (1,1,1)·e1 =  1      (1,1,1)·e2 = (1−1)/√2 =  0.00
woman   (1,-1,1)·e1 = -1     (1,-1,1)·e2 =  0.00</code></pre>
<p>Plotted, that is exactly the parallelogram the analogy promises:</p>
<pre><code>   royalty (e2)
   1.41 |  queen ●         ● king
        |
   0.00 |  woman ●         ● man
        +------------------------- gender (e1)
          -1                +1</code></pre>
<p><code>king − man + woman = (2,1,0) − (1,1,1) + (1,-1,1) = (2,-1,0)</code>, which is <strong>queen</strong> exactly, <em>because we built a space where it works.</em> The sobering content of §2 is that a model trained to predict text does <strong>not</strong> build such a space for you; it builds whatever predicts best, and the clean parallelogram is the exception, not the rule. Visualisation lets you look for the parallelograms and, just as importantly, lets you measure how often they are not there.</p>
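<p>For readers who would rather check the arithmetic by machine than by hand, here is the same appendix as a short plain-Python sketch. The four vectors are the ones given above; the helper names are hypothetical. It builds the two axes with one Gram-Schmidt step, projects every word, and tests the analogy.</p>

```python
import math

words = {
    "king":  (2.0,  1.0, 0.0),
    "queen": (2.0, -1.0, 0.0),
    "man":   (1.0,  1.0, 1.0),
    "woman": (1.0, -1.0, 1.0),
}

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(v, k):
    return tuple(k * x for x in v)

def normalise(v):
    return scale(v, 1.0 / math.sqrt(dot(v, v)))

# Axis 1 (gender): king - queen, scaled to unit length.
e1 = normalise(sub(words["king"], words["queen"]))

# Axis 2 (royalty): king - man, with its e1 component removed (Gram-Schmidt).
v2 = sub(words["king"], words["man"])
e2 = normalise(sub(v2, scale(e1, dot(v2, e1))))

# Two dot products per word give its position on the plane.
coords = {w: (dot(v, e1), dot(v, e2)) for w, v in words.items()}

# The analogy, computed in the full 3-D space.
analogy = tuple(k - m + w for k, m, w in
                zip(words["king"], words["man"], words["woman"]))

for w, (g, r) in coords.items():
    print(f"{w:6s} gender={g:+.2f} royalty={r:+.2f}")
print("king - man + woman =", analogy)
```

<p>The printed coordinates reproduce the table above, and <code>analogy</code> equals the <code>queen</code> vector exactly. Swap in vectors that were not built for the purpose and the parallelogram usually fails to close, which is the point of §2.</p>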
<hr>
</section>
<section id="appendix-b-glossary" class="level2">
<h2 class="anchored" data-anchor-id="appendix-b-glossary">Appendix B: Glossary</h2>
<p>For readers who would like the basics or a refresher. Terms are grouped roughly by where they appear.</p>
<section id="token-vocabulary" class="level3">
<h3 class="anchored" data-anchor-id="token-vocabulary">Token, vocabulary</h3>
<p>A <strong>token</strong> is the unit of text the model reads — here a whole word; in production models usually a sub-word piece. The <strong>vocabulary</strong> is the fixed set of all possible tokens (28 in our dialogue game, 50,257 in GPT-2/GPT-3).</p>
</section>
<section id="embedding" class="level3">
<h3 class="anchored" data-anchor-id="embedding">Embedding</h3>
<p>The vector of real numbers that represents a token. The <strong>embedding matrix</strong> <img src="https://latex.codecogs.com/png.latex?E"> has one row per vocabulary token; “looking up” a token means reading its row. The rows are <strong>parameters</strong> — numbers learned during training — not values fetched from elsewhere (§1).</p>
</section>
<section id="encoder-retrieval-vector-database" class="level3">
<h3 class="anchored" data-anchor-id="encoder-retrieval-vector-database">Encoder (retrieval / vector database)</h3>
<p>A separately-trained model that turns a piece of text into one vector so that similar texts get nearby vectors. Used to fill a <strong>vector database</strong> for semantic search. Examples: OpenAI <code>text-embedding-3</code>, <code>sentence-transformers</code>, ColBERT. It is trained for <em>relevance similarity</em>, a different objective from next-token prediction (§1).</p>
</section>
<section id="gradient-descent" class="level3">
<h3 class="anchored" data-anchor-id="gradient-descent">Gradient descent</h3>
<p>The training procedure: nudge every parameter a little in the direction that reduces the model’s error, repeat millions of times. It is what “learns” the embeddings.</p>
</section>
<section id="residual-stream" class="level3">
<h3 class="anchored" data-anchor-id="residual-stream">Residual stream</h3>
<p>The running vector the Transformer keeps for each position and updates layer by layer; each attention/MLP block <em>adds</em> its output to it. The model’s prediction is read from the final residual. (See the <a href="../../posts/transformers-qkv-attention/index.html">QKV note</a>.)</p>
</section>
<section id="attention-head" class="level3">
<h3 class="anchored" data-anchor-id="attention-head">Attention head</h3>
<p>A sub-mechanism inside a layer that, for each position, decides how much to read from every earlier position. Our toy has 2 layers × 4 heads. “Which head does what” is the central question of mechanistic interpretability.</p>
</section>
<section id="logits-softmax" class="level3">
<h3 class="anchored" data-anchor-id="logits-softmax">Logits, softmax</h3>
<p>The model’s raw output scores over the vocabulary are <strong>logits</strong>. <strong>Softmax</strong> turns them into a probability distribution (exponentiate, then normalise to sum 1).</p>
</section>
<section id="tied-weights-embedding-unembedding" class="level3">
<h3 class="anchored" data-anchor-id="tied-weights-embedding-unembedding">Tied weights (embedding / unembedding)</h3>
<p>Using the <em>same</em> matrix to map tokens→vectors (input) and vectors→token-scores (output, the <strong>unembedding</strong> <img src="https://latex.codecogs.com/png.latex?W_U%20=%20E%5E%5Ctop">). Standard for small models; it means the “dictionary” the model searches at the end <em>is</em> the embedding matrix (§6).</p>
</section>
<section id="cosine-similarity" class="level3">
<h3 class="anchored" data-anchor-id="cosine-similarity">Cosine similarity</h3>
<p>A measure of how aligned two vectors are: the cosine of the angle between them (+1 = same direction, 0 = orthogonal, −1 = opposite). Ignores length, so it compares <em>direction</em> only.</p>
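<p>As a minimal illustration in plain Python (the function name and the example vectors are hypothetical):</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity((1.0, 2.0), (3.0, 6.0))        # same direction, three times longer
orthogonal = cosine_similarity((1.0, 0.0), (0.0, 5.0))  # at right angles
opposite = cosine_similarity((1.0, 2.0), (-2.0, -4.0))  # pointing the other way

print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

<p>The first pair differs in length by a factor of three and still scores +1, which is what “ignores length” means in practice.</p>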
</section>
<section id="pca-principal-component-analysis" class="level3">
<h3 class="anchored" data-anchor-id="pca-principal-component-analysis">PCA (Principal Component Analysis)</h3>
<p>An unsupervised method that finds the directions along which the data spreads most (the <strong>principal components</strong>), computed from an eigen-decomposition / <strong>SVD</strong> of the centred data. Projecting onto the top two gives a 2-D map; the <strong>variance explained</strong> (e.g.&nbsp;“34.7%”) is the fraction of the data’s total spread those two directions capture — a built-in honesty meter (§4b).</p>
</section>
<section id="eigenvector-svd" class="level3">
<h3 class="anchored" data-anchor-id="eigenvector-svd">Eigenvector / SVD</h3>
<p>The linear-algebra machinery PCA runs on: the <strong>singular value decomposition</strong> (SVD) factorises the data matrix and hands back the principal directions and how much variance each carries. You do not need the details — just that one <code>svd</code> call yields the PCA axes.</p>
</section>
<section id="projection" class="level3">
<h3 class="anchored" data-anchor-id="projection">Projection</h3>
<p>Mapping high-dimensional points onto a lower-dimensional plane by taking dot products with chosen axes. PCA and concept-direction maps are <strong>true projections</strong> (the axes keep a fixed meaning); t-SNE is <strong>not</strong> (§4).</p>
</section>
<section id="gram-schmidt-orthonormal" class="level3">
<h3 class="anchored" data-anchor-id="gram-schmidt-orthonormal">Gram-Schmidt / orthonormal</h3>
<p>A recipe for turning two chosen direction vectors into a clean perpendicular (<strong>orthonormal</strong>) pair of axes, so the two coordinates you read off are independent. Used to build concept-direction planes (§4a, Appendix A).</p>
</section>
<section id="concept-direction" class="level3">
<h3 class="anchored" data-anchor-id="concept-direction">Concept direction</h3>
<p>An axis you <em>choose</em> to stand for a meaning — either a token’s own direction or a <strong>difference of tokens</strong> (e.g.&nbsp;<code>Paolo − Pietro</code> = “which leader”). Supervised: you bring the meaning (§4a).</p>
</section>
<section id="t-sne-umap-perplexity" class="level3">
<h3 class="anchored" data-anchor-id="t-sne-umap-perplexity">t-SNE, UMAP, perplexity</h3>
<p><strong>t-SNE</strong> and <strong>UMAP</strong> are nonlinear methods that place points so that <em>local</em> neighbourhoods are preserved, producing cluster-y 2-D pictures. They are <strong>not</strong> projections: the axes, the distances between clusters, and the global layout carry no meaning, and the result changes with the random seed. <strong>Perplexity</strong> is t-SNE’s main knob, controlling roughly how many neighbours count as “local” (§4c).</p>
</section>
<section id="distributed-representation-superposition" class="level3">
<h3 class="anchored" data-anchor-id="distributed-representation-superposition">Distributed representation / superposition</h3>
<p><strong>Distributed:</strong> a concept is carried by a pattern across many dimensions, not one. <strong>Superposition:</strong> a model packs <em>more</em> feature-directions into a space than it has dimensions, at angles chosen to minimise interference — which is why the axes are not individually meaningful (§2).</p>
</section>
<section id="linear-representation-hypothesis" class="level3">
<h3 class="anchored" data-anchor-id="linear-representation-hypothesis">Linear representation hypothesis</h3>
<p>The empirical idea that many high-level concepts correspond to <em>straight-line directions</em> in representation space — the reason concept-direction maps and analogies (<code>king − man + woman</code>) work at all, when they do (§4a, §7).</p>
</section>
<section id="sparse-autoencoder-sae" class="level3">
<h3 class="anchored" data-anchor-id="sparse-autoencoder-sae">Sparse autoencoder (SAE)</h3>
<p>A tool that decomposes a model’s activations into many sparse, often human-interpretable feature <em>directions</em> — used to read meaning out of superposition without dictating it in advance (§2, §7).</p>
</section>
<section id="ablation" class="level3">
<h3 class="anchored" data-anchor-id="ablation">Ablation</h3>
<p>Switching off one component (e.g.&nbsp;one attention head) and re-measuring behaviour, to test what it <em>causally</em> does — as opposed to what its attention pattern <em>looks</em> like. Fig 5’s “which write moved the residual” is the same spirit (§4a; the QKV note’s appendix).</p>
</section>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p>Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., … Olah, C. (2023). <em>Towards monosemanticity: Decomposing language models with dictionary learning.</em> Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html</p>
<p>Coenen, A., &amp; Pearce, A. (n.d.). <em>Understanding UMAP.</em> Google PAIR. https://pair-code.github.io/understanding-umap/</p>
<p>Cunningham, H., Ewart, A., Riggs, L., Huben, R., &amp; Sharkey, L. (2023). <em>Sparse autoencoders find highly interpretable features in language models.</em> arXiv. https://arxiv.org/abs/2309.08600</p>
<p>Edwards, B. (2012). <em>Drawing on the right side of the brain</em> (4th ed.). TarcherPerigee.</p>
<p>Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., … Olah, C. (2022). <em>Toy models of superposition.</em> Transformer Circuits Thread. https://transformer-circuits.pub/2022/toy_model/index.html</p>
<p>Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., &amp; Smith, N. A. (2015). Retrofitting word vectors to semantic lexicons. In <em>Proceedings of NAACL-HLT 2015</em> (pp.&nbsp;1606–1615). Association for Computational Linguistics. https://arxiv.org/abs/1411.4166</p>
<p>Levy, O., &amp; Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In <em>Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL)</em> (pp.&nbsp;171–180). Association for Computational Linguistics. https://aclanthology.org/W14-1618/</p>
<p>Linzen, T. (2016). Issues in evaluating semantic spaces using word analogies. In <em>Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP</em> (pp.&nbsp;13–18). Association for Computational Linguistics. https://arxiv.org/abs/1606.07736</p>
<p>Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., Schölkopf, B., &amp; Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In <em>Proceedings of the 36th International Conference on Machine Learning (ICML)</em> (pp.&nbsp;4114–4124). https://arxiv.org/abs/1811.12359</p>
<p>McInnes, L., Healy, J., &amp; Melville, J. (2018). <em>UMAP: Uniform manifold approximation and projection for dimension reduction.</em> arXiv. https://arxiv.org/abs/1802.03426</p>
<p>Mikolov, T., Chen, K., Corrado, G., &amp; Dean, J. (2013). <em>Efficient estimation of word representations in vector space.</em> arXiv. https://arxiv.org/abs/1301.3781</p>
<p>Mu, J., &amp; Viswanath, P. (2018). All-but-the-top: Simple and effective postprocessing for word representations. In <em>International Conference on Learning Representations (ICLR)</em>. https://arxiv.org/abs/1702.01417</p>
<p>Park, K., Choe, Y. J., &amp; Veitch, V. (2023). <em>The linear representation hypothesis and the geometry of large language models.</em> arXiv. https://arxiv.org/abs/2311.03658</p>
<p>van der Maaten, L., &amp; Hinton, G. (2008). Visualizing data using t-SNE. <em>Journal of Machine Learning Research, 9</em>(86), 2579–2605. https://jmlr.org/papers/v9/vandermaaten08a.html</p>
<p>Wattenberg, M., Viégas, F., &amp; Johnson, I. (2016). How to use t-SNE effectively. <em>Distill.</em> https://distill.pub/2016/misread-tsne/</p>
<hr>
<p><em>Models and tooling behind the figures (all in the <code>julia-impromptu</code> project): the trained dialogue-game model <code>DialogueGame-Tiny-Epithet-Trained.json</code> with its live <code>vocab_map</code> and <code>residual_trajectory</code> reports; the failed hand-wired <code>SemanticTiny.json</code>; the projection pipeline <code>tools/project_embeddings_note.jl</code> (PCA via <code>svd</code>, concept axes via Gram-Schmidt, t-SNE via <code>TSne.jl</code> — one source of truth for every coordinate) and the renderer <code>tools/render_embeddings_note.py</code>.</em></p>


</section>

 ]]></description>
  <category>LLMs</category>
  <category>Embeddings</category>
  <category>Visualization</category>
  <category>Interpretability</category>
  <category>Teaching</category>
  <guid>https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/</guid>
  <pubDate>Sun, 31 May 2026 22:00:00 GMT</pubDate>
  <media:content url="https://theworkbench.lerzegov.org/posts/embeddings-and-visualization/figs/fig1_handwired_vs_learned.png" medium="image" type="image/png" height="62" width="144"/>
</item>
<item>
  <title>Transformers and QKV Attention: A Primer</title>
  <dc:creator>Luca Erzegovesi</dc:creator>
  <link>https://theworkbench.lerzegov.org/posts/transformers-qkv-attention/</link>
  <description><![CDATA[ 




<p><em>First in the series “Understanding LLMs to use them better in management and finance.” The three notes describe one machine — a Transformer language model — from three complementary angles. This note opens the machine up and lays out the <strong>attention machinery</strong>: how a Transformer moves information between words. The <a href="../../posts/embeddings-and-visualization/index.html">second note</a> is about the <strong>embeddings</strong> — the vocabulary of vectors the machine reads from and writes to, and how to look at a space with hundreds of dimensions without fooling yourself. The <a href="../../posts/the-final-step/index.html">third note</a> watches the <strong>final step</strong> — the moment all that work collapses into a single next word, which turns out to be a relentless search. Read in order, the three move from mechanism, to dictionary, to read-out.</em></p>
<p><em>A word on how to read this one. Think of it as the technical booklet that ships with an electronic appliance: it aims to give an accessible but complete view of the machinery and its inner workings. The end user normally need not open the booklet; the technician must. But in business applications of AI the boundary between technician and end user is blurring fast — anyone who deploys these models in management or finance increasingly has to look inside the machine to judge what it can and cannot reliably do. This first note is accordingly longer and more abstract than the two that follow, which are focused and example-driven. The architecture it describes is the <strong>Transformer</strong>, introduced by Vaswani et al. (2017) in the paper whose title became a slogan, “Attention Is All You Need”; the technical terms are collected in Appendix C and the references at the very end.</em></p>
<section id="transformers-an-evolutionary-step-from-neural-networks" class="level2">
<h2 class="anchored" data-anchor-id="transformers-an-evolutionary-step-from-neural-networks">1. Transformers: an evolutionary step from neural networks</h2>
<p>Before opening up the attention machinery, it helps to see where a Transformer <em>comes from</em>. It is not an exotic invention out of nowhere; it is one more step in a long line of neural networks that all do the same basic thing — <strong>convert input data, step by step, into a representation whose position in some space encodes the answer.</strong> What changes from one architecture to the next is <em>how</em> that conversion is organized. Attention is the organizational idea that made networks good at language.</p>
<section id="a-network-you-may-already-picture-the-ocr-convolutional-net" class="level3">
<h3 class="anchored" data-anchor-id="a-network-you-may-already-picture-the-ocr-convolutional-net">A network you may already picture: the OCR convolutional net</h3>
<p>Think of a classic optical-character-recognition (OCR) network — the kind that reads a handwritten digit or letter from a small image. Its workhorse is the <strong>convolution</strong>. The input is a grid of pixels. A small <strong>filter</strong> (say a <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> patch of weights) slides across the image; at each location it multiplies the pixels under it by its weights and sums them into a single number. Sweeping the filter over the whole image produces a <strong>feature map</strong> — a new grid that lights up wherever the filter’s little pattern (an edge, a curve, a stroke) is present. A layer has many such filters, so it produces many feature maps. Stack a few convolution layers (with pooling in between to shrink the grid), and the network builds up from edges → strokes → loops → whole-character shapes.</p>
<p>It is worth stating the underlying operation precisely, because a clean version of it carries over to attention. A convolution is a <strong>weighted sum over a small local window</strong> of the input: at each position the filter computes a dot product between its fixed weights and the pixels it currently covers. (It is a weighted <em>sum</em>, not strictly an average: the weights need not sum to one and are often negative, which is how a filter can <em>reward</em> ink in one place and <em>penalise</em> it in another.) Two things are then layered on top. First, the same weights are reused at every position (<strong>weight sharing</strong>), so the operation is really one small pattern-detector swept across the whole image. Second, a layer stacks many such detectors and a later layer takes <strong>weighted combinations of their feature maps</strong>, and it is <em>those</em> combination weights, learned by training, that select which mixtures of low-level features best predict the correct character. So the popular intuition that a convolutional net “selects the useful combinations” of features is right once it is split in two: each convolution is a <em>local</em> weighted sum, and the <em>cross-feature</em> mixing that does the selecting happens when later layers combine feature maps. (Appendix B works a full convolution through by hand.)</p>
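<p>The local weighted sum is short enough to write out in plain Python. The sketch below is an illustration with made-up pixel values and hypothetical names, not code from any library. Like deep-learning frameworks, it slides the filter without flipping it (strictly a cross-correlation) and keeps only positions where the filter fits entirely inside the image.</p>

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; each output cell is a local weighted sum."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    # Same weights at every (r, c): this is weight sharing.
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output

# A 4x5 image with a vertical stroke of ink (1) two pixels wide.
image = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
]

# A 3x3 filter that rewards ink on its left column and penalises ink on its right.
edge_filter = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = convolve2d_valid(image, edge_filter)
for row in feature_map:
    print(row)
```

<p>The feature map is negative where the stroke sits under the filter's right column and positive where it sits under the left, which is the reward-and-penalise behaviour that negative weights make possible.</p>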
<p>At the very end the features are flattened and a <strong>softmax classifier</strong> turns them into a probability over the possible characters.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart LR
    A["pixels&lt;br/&gt;(grid of ink/blank)"] --&gt;|"filters slide&lt;br/&gt;shared 3×3 weights&lt;br/&gt;local receptive field,&lt;br/&gt;reused at every position"| B["feature maps&lt;br/&gt;one grid per filter:&lt;br/&gt;edges, strokes, ..."]
    B --&gt;|"flatten +&lt;br/&gt;linear layer"| C["softmax over&lt;br/&gt;26 letters"]
    C --&gt; D["a .02&lt;br/&gt;...&lt;br/&gt;e .91 ◄ answer&lt;br/&gt;..."]
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p>Two properties make this work for images. First, each filter has a <strong>fixed, local receptive field</strong>: it only ever looks at a small neighborhood of pixels. Second, the <em>same</em> filter is reused at every position (<strong>weight sharing</strong>), so a stroke is recognized the same way wherever it sits on the page. Both properties are exactly right for images, because the clues needed to recognize a stroke are <em>local</em> and their meaning does not depend on <em>where</em> on the page they appear.</p>
</section>
<section id="why-language-breaks-this-and-why-we-need-attention" class="level3">
<h3 class="anchored" data-anchor-id="why-language-breaks-this-and-why-we-need-attention">Why language breaks this, and why we need attention</h3>
<p>Language does not have those two convenient properties. The piece of context a word depends on can be right next to it or hundreds of words back — and <em>which</em> earlier words matter depends on the <strong>content</strong>, not on a fixed offset. To resolve “it” you must find the noun it refers to, and that noun could be anywhere. A fixed, local filter cannot do this: it always looks in the same small window, regardless of what the sentence is about.</p>
<p><strong>Attention is the fix.</strong> Instead of a fixed local window, an attention <em>head</em> computes, on the fly and for each token, <em>how much to pull from every other token</em> — and then pulls. The “receptive field” is no longer fixed by the architecture; it is <strong>decided at runtime, from the content</strong>, by the query–key matching we will detail in Section 2. In one sentence:</p>
<blockquote class="blockquote">
<p>An attention head is the language analogue of a convolutional filter, except its receptive field is <strong>learned, dynamic, and content-dependent</strong> instead of fixed and local.</p>
</blockquote>
<p>That single change — a receptive field the data gets to choose, every time — is what let neural networks finally handle long-range, content-driven dependencies, and it is the heart of the Transformer.</p>
<p>It is important not to over-claim, though: attention is the <em>new</em> idea, not the <em>only</em> idea. A Transformer interleaves two kinds of block. The <strong>attention heads</strong> are the novelty just described — the content-driven, dynamic receptive field that moves information <em>between</em> positions. Alongside them sit ordinary <strong>MLP</strong> (multi-layer perceptron) blocks, the same fully-connected, learned weighted-combination machinery that does the cross-feature mixing in a CNN — except here each MLP refines one token’s representation <em>in place</em>, with no reference to its neighbours. The two blocks divide the labour cleanly: attention is the <em>only</em> channel that lets tokens talk to each other, while the MLP is where each token’s representation is reshaped on its own. Sections 5 and 6 build both explicitly, and Appendix A shows what is lost if you switch the attention off and keep only the MLP.</p>
</section>
<section id="what-the-output-data-actually-is" class="level3">
<h3 class="anchored" data-anchor-id="what-the-output-data-actually-is">What the output data actually <em>is</em></h3>
<p>It pays to be precise about what these networks produce, because the same description fits both the OCR net and the language model.</p>
<p>In both cases the network turns its input into a <strong>representation</strong>: a vector whose <em>position in a high-dimensional space</em> encodes the answer. In OCR, the final feature vector’s location says “this image sits in the region of the space that means the letter <em>e</em>.” In a Transformer, the inputs are pieces of text — <strong>tokens</strong> — and each token is represented by an <strong>embedding</strong>: a vector of numbers giving its coordinates in a high-dimensional “language space,” where direction and proximity stand in for meaning. The vector the model processes and maintains for each token — the <strong>residual stream</strong>, formally introduced in Section 4 — is best read as a <strong>modified embedding</strong>: it starts life as the raw token embedding and each layer nudges it to a new position that encodes everything the model has worked out about that token <em>in its context</em>.</p>
<p>So the output data are <strong>representations of entities, describing each entity’s most likely position in a solution space.</strong> And from that position the model reads out one of two things:</p>
<ul>
<li>a <strong>single crisp solution</strong> — the one best answer (take the highest-scoring class, i.e.&nbsp;the <img src="https://latex.codecogs.com/png.latex?%5Carg%5Cmax">, equivalently temperature <img src="https://latex.codecogs.com/png.latex?%5Cto%200">); or</li>
<li>a <strong>probability distribution over solutions</strong> — a graded set of plausible answers with weights.</li>
</ul>
<p>The readout is the same machine in both networks: a linear projection followed by a softmax. OCR projects the final feature vector onto the alphabet and gets a distribution over letters; a language model projects the final token representation onto the vocabulary and gets a distribution over <em>next tokens</em>. Same structure, different solution space. (Section 5 shows exactly how the final representation’s <strong>direction</strong> encodes <em>which</em> tokens are likely and its <strong>magnitude</strong> encodes <em>how confident</em> the model is — the geometric version of “position in a solution space.”)</p>
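<p>The shared readout can be sketched in a few lines. Everything numeric here is hypothetical (a 3-dimensional final representation, a 3 × 4 projection, four labels); the point is only that the same function yields both the graded distribution and the crisp answer.</p>

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def readout(representation, projection):
    """Linear projection onto the solution space, followed by a softmax."""
    n_classes = len(projection[0])
    scores = [
        sum(representation[d] * projection[d][c] for d in range(len(representation)))
        for c in range(n_classes)
    ]
    return softmax(scores)

# Hypothetical values: letters for OCR, or tokens for a language model.
labels = ["a", "e", "o", "u"]
final_vector = [0.2, 1.5, -0.3]
projection = [
    [0.5, 0.1, 0.0, -0.2],
    [0.0, 2.0, 0.5, 0.1],
    [0.3, -0.4, 0.2, 0.0],
]

distribution = readout(final_vector, projection)          # graded set of answers
crisp = labels[distribution.index(max(distribution))]     # single best answer (argmax)
print(dict(zip(labels, [round(p, 3) for p in distribution])), crisp)
```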
</section>
<section id="the-logic-of-the-processing-params-convert-inputs-into-entity-representations" class="level3">
<h3 class="anchored" data-anchor-id="the-logic-of-the-processing-params-convert-inputs-into-entity-representations">The logic of the processing: params convert inputs into entity representations</h3>
<p>How does the input get from raw symbols to these context-aware representations? The recipe is the through-line of this whole note:</p>
<ol type="1">
<li><strong>Look up.</strong> Each input token ID is replaced by its embedding — a first, context-free guess at its position in the space (a row of the embedding matrix <img src="https://latex.codecogs.com/png.latex?E">).</li>
<li><strong>Read, transform, write — repeatedly.</strong> A stack of parameterized blocks then reads the current representations and writes adjustments back. Two kinds of block alternate:
<ul>
<li><strong>Attention blocks</strong> move information <em>between</em> tokens (the dynamic receptive field above), letting each token’s representation absorb what it needs from the others.</li>
<li><strong>MLP blocks</strong> refine each token’s representation <em>in place</em>, one position at a time (detailed in Section 6).</li>
</ul></li>
<li><strong>Read out.</strong> After <img src="https://latex.codecogs.com/png.latex?L"> such rounds (model <strong>layers</strong>) the representation is “finished,” and the unembedding projects it into the solution space to give the crisp answer or the distribution.</li>
</ol>
<p>The <strong>parameters</strong> — <img src="https://latex.codecogs.com/png.latex?W_Q,%20W_K,%20W_V,%20W_O"> in attention, <img src="https://latex.codecogs.com/png.latex?W_1,%20W_2"> in the MLP — are frozen after training. They <em>are</em> the learned knowledge: they encode the rule for <em>how to move</em> each representation, step by step, from a bare embedding toward its correct final position. Training is just the search for parameter values that put every entity in the right place in the solution space.</p>
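<p>The three-step recipe above can be sketched as a skeleton. This is not a real Transformer: <code>causal_mean_block</code> and <code>relu_block</code> are hypothetical stand-ins for an attention block and an MLP block, chosen only so the “look up, read and write, read out” control flow is runnable with tiny made-up matrices.</p>

```python
def lookup(token_ids, E):
    """Step 1: replace each token ID by its row of the embedding matrix."""
    return [list(E[t]) for t in token_ids]

def causal_mean_block(residual):
    """Stand-in for attention: each position reads the mean of itself and earlier positions."""
    updates = []
    for t in range(len(residual)):
        visible = residual[: t + 1]
        updates.append([sum(col) / len(visible) for col in zip(*visible)])
    return updates

def relu_block(residual):
    """Stand-in for an MLP: each position is transformed on its own."""
    return [[max(0.0, x) for x in row] for row in residual]

def forward(token_ids, E, blocks, U):
    residual = lookup(token_ids, E)                      # look up
    for block in blocks:                                 # read, transform, write
        update = block(residual)
        residual = [
            [r + u for r, u in zip(row, upd)]            # writing means adding to the stream
            for row, upd in zip(residual, update)
        ]
    last = residual[-1]                                  # read out from the last position
    n_vocab = len(U[0])
    return [sum(last[d] * U[d][v] for d in range(len(last))) for v in range(n_vocab)]

E = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]               # 3 tokens, 2 dimensions (made up)
U = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]                 # here simply E transposed
logits = forward([0, 2], E, [causal_mean_block, relu_block], U)
print(logits)
```

<p>Note that each block only ever <em>adds</em> to the running representation; nothing is overwritten, which is what makes it possible to ask later which block contributed what.</p>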
<blockquote class="blockquote">
<p><strong>Why this framing matters for what follows.</strong> Because all the “thinking” lives in these incrementally-modified representations flowing through the residual stream, we can later ask very sharp questions about a trained model: <em>where</em> is a particular piece of behavior carried, and <em>which</em> block put it there? In small, fully-understood models one can even disable a single block and watch a specific behavior appear or vanish — the basis of the interpretability experiments these notes are meant to accompany (see Appendix A). The rest of this note builds the mechanism precisely enough to make those questions answerable.</p>
</blockquote>
</section>
</section>
<section id="the-qkv-trio" class="level2">
<h2 class="anchored" data-anchor-id="the-qkv-trio">2. The QKV trio</h2>
<p>Every token, at every attention layer, produces three vectors derived from the token’s current hidden state via three learned weight matrices (<img src="https://latex.codecogs.com/png.latex?W_Q">, <img src="https://latex.codecogs.com/png.latex?W_K">, <img src="https://latex.codecogs.com/png.latex?W_V">):</p>
<ul>
<li><strong>Q (query)</strong> — “what am I looking for?”</li>
<li><strong>K (key)</strong> — “what do I offer to be matched against?”</li>
<li><strong>V (value)</strong> — “what information do I carry, if matched?”</li>
</ul>
<p>Attention is the operation that uses these three to mix information across tokens. For a given token’s query Q, the model computes a similarity score against the K of that token itself and of every earlier token (a dot product, scaled, then softmaxed into weights). Those weights are then used to take a weighted sum of the corresponding V vectors. The result is the attention output for that token: a blend of <em>values</em> from the current and earlier tokens, weighted by how well their keys matched this token’s query.</p>
<p>In compact form (Q, K, and V are obtained by applying the corresponding W to the tokens’ current hidden states, which at the first layer are the embeddings; see Section 4):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Battention%7D(Q,%20K,%20V)%20=%20%5Ctext%7Bsoftmax%7D%5C!%5Cleft(%5Cfrac%7BQ%20K%5E%5Ctop%7D%7B%5Csqrt%7Bd_k%7D%7D%5Cright)%20%5Ccdot%20V"></p>
<p>This formula is <strong>per head</strong>: <img src="https://latex.codecogs.com/png.latex?Q,%20K%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BT%20%5Ctimes%20d_k%7D"> and <img src="https://latex.codecogs.com/png.latex?V%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BT%20%5Ctimes%20d_v%7D">, giving an output in <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D%5E%7BT%20%5Ctimes%20d_v%7D">. In multi-head attention, the same formula is applied <img src="https://latex.codecogs.com/png.latex?h"> times in parallel on independent Q/K/V slices, and the <img src="https://latex.codecogs.com/png.latex?h"> resulting <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_v"> blocks are concatenated and then mixed by <img src="https://latex.codecogs.com/png.latex?W_O"> into the residual.</p>
<p>So Q and K together produce the <em>attention pattern</em> (who attends to whom, and how strongly), and V is what actually flows along those attention edges. <strong>Q combines with K to produce attention weights, which then select/blend the V vectors.</strong> V is the payload, QK is the routing.</p>
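<p>A minimal single-head sketch of the formula, in plain Python so that every step is visible. The causal restriction (each token sees itself and earlier tokens only) is built in by slicing; the 3-token, 2-dimensional inputs are hypothetical.</p>

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, with a causal mask."""
    d_k = len(K[0])
    outputs, patterns = [], []
    for i, q in enumerate(Q):
        # Routing: scaled dot products against keys 0..i only.
        scores = [dot(q, K[j]) / math.sqrt(d_k) for j in range(i + 1)]
        weights = softmax(scores)
        # Payload: weighted sum of the matching value rows.
        out = [sum(w * V[j][c] for j, w in enumerate(weights)) for c in range(len(V[0]))]
        outputs.append(out)
        patterns.append(weights + [0.0] * (len(Q) - i - 1))   # pad masked positions with 0
    return outputs, patterns

# Hypothetical 3-token, 2-dimensional example.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
outputs, patterns = causal_attention(Q, K, V)
for row in patterns:
    print([round(w, 3) for w in row])
```

<p>The printed rows are the attention pattern (the routing); <code>outputs</code> holds what actually flowed along it (the payload).</p>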
</section>
<section id="why-only-k-and-v-are-cached-not-q" class="level2">
<h2 class="anchored" data-anchor-id="why-only-k-and-v-are-cached-not-q">3. Why only K and V are cached, not Q</h2>
<p>During generation, when the model is producing token <img src="https://latex.codecogs.com/png.latex?N+1">:</p>
<ul>
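<li>It needs the <strong>Q of the newest token only</strong>, that is, token <img src="https://latex.codecogs.com/png.latex?N">, whose output predicts token <img src="https://latex.codecogs.com/png.latex?N+1">. Q is computed fresh each step and discarded; it has no use beyond this single attention computation.</li>
<li>It needs the <strong>K and V of every token so far</strong>, because the newest token’s Q has to attend back to all of them (and to itself).</li>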
</ul>
<p>That’s why the cache is called the <strong>KV cache</strong> and not “QKV cache”. K and V are the <em>historical</em> state that accumulates as the conversation grows; Q is ephemeral, recomputed and thrown away every step.</p>
<p>This also explains the asymmetry between <strong>prefill</strong> (prompt processing) and <strong>generation</strong> (prompt continuation). During prefill, Q is computed for every input token too — but it’s used immediately to compute attention for that token and then discarded. Only K and V get written into the cache for future reuse. Generation is just prefill-of-one-token, repeated, with the cache growing by one row of K and one row of V at each layer per step.</p>
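<p>A toy sketch of the bookkeeping, for one head in one layer. The class name <code>KVCache</code>, the helper functions and all numbers are hypothetical; real implementations store tensors per layer and per head, but the asymmetry is the same: Q is used once, K and V are appended and kept.</p>

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(x, W):
    """Row vector x times matrix W."""
    return [sum(x[d] * W[d][c] for d in range(len(x))) for c in range(len(W[0]))]

def attend(q, keys, values):
    """One query against all cached keys and values (single head)."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]

class KVCache:
    """Store that grows by one K row and one V row per processed token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, W_Q, W_K, W_V):
        q = matvec(x, W_Q)                    # ephemeral: used once, never stored
        self.keys.append(matvec(x, W_K))      # historical: kept for later tokens
        self.values.append(matvec(x, W_V))
        return attend(q, self.keys, self.values)

# Hypothetical 2-dimensional weights and token vectors.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5]]
W_V = [[1.0, 1.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
for x in tokens:                              # prefill and generation share this step
    out = cache.step(x, W_Q, W_K, W_V)
print(len(cache.keys), len(cache.values))
```

<p>Because earlier keys and values never change, the output for the newest token is the same whether they are read from the cache or recomputed from scratch; the cache only saves the recomputation.</p>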
</section>
<section id="dimensionality-a-complete-inventory" class="level2">
<h2 class="anchored" data-anchor-id="dimensionality-a-complete-inventory">4. Dimensionality: a complete inventory</h2>
<p>Let me define the symbols once and then track every shape through the network.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 13%">
<col style="width: 46%">
<col style="width: 20%">
<col style="width: 19%">
</colgroup>
<thead>
<tr class="header">
<th>Symbol</th>
<th>Meaning</th>
<th>Typical value (GPT-2 small)</th>
<th>Typical value (modern 7B)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?V"></td>
<td>vocabulary size</td>
<td>50,257</td>
<td>32,000–150,000</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?L"></td>
<td>number of transformer layers</td>
<td>12</td>
<td>32</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?T"></td>
<td>sequence length (tokens in the context)</td>
<td>up to 1024</td>
<td>up to 128K</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>residual stream / embedding dimension</td>
<td>768</td>
<td>4096</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?h"></td>
<td>number of attention heads</td>
<td>12</td>
<td>32</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?d_k%20=%20d_v"></td>
<td>per-head Q/K/V dimension, often <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D/h"></td>
<td>64</td>
<td>128</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bff%7D%7D"></td>
<td>MLP hidden dimension, usually <img src="https://latex.codecogs.com/png.latex?%5Csim%204%20%5Ccdot%20d_%7B%5Ctext%7Bmodel%7D%7D"> (gated-MLP models such as LLaMA use about 2.7 times, hence 11008)</td>
<td>3072</td>
<td>11008</td>
</tr>
</tbody>
</table>
<section id="model-parameters-frozen-weights" class="level3">
<h3 class="anchored" data-anchor-id="model-parameters-frozen-weights">Model parameters (frozen weights)</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 21%">
<col style="width: 54%">
<col style="width: 23%">
</colgroup>
<thead>
<tr class="header">
<th>Weight</th>
<th>Shape</th>
<th>What it does</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Token embedding <img src="https://latex.codecogs.com/png.latex?E"></td>
<td><img src="https://latex.codecogs.com/png.latex?V%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>maps token IDs to vectors</td>
</tr>
<tr class="even">
<td>Positional embedding (if any)</td>
<td><img src="https://latex.codecogs.com/png.latex?T_%7B%5Cmax%7D%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>adds positional information</td>
</tr>
<tr class="odd">
<td>Per layer <img src="https://latex.codecogs.com/png.latex?%5Cell">: <img src="https://latex.codecogs.com/png.latex?W_Q%5E%7B(%5Cell)%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20(h%20%5Ccdot%20d_k)"></td>
<td>projects hidden → all heads’ Q</td>
</tr>
<tr class="even">
<td>Per layer: <img src="https://latex.codecogs.com/png.latex?W_K%5E%7B(%5Cell)%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20(h%20%5Ccdot%20d_k)"></td>
<td>projects hidden → all heads’ K</td>
</tr>
<tr class="odd">
<td>Per layer: <img src="https://latex.codecogs.com/png.latex?W_V%5E%7B(%5Cell)%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20(h%20%5Ccdot%20d_v)"></td>
<td>projects hidden → all heads’ V</td>
</tr>
<tr class="even">
<td>Per layer: <img src="https://latex.codecogs.com/png.latex?W_O%5E%7B(%5Cell)%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?(h%20%5Ccdot%20d_v)%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>mixes head outputs back to residual</td>
</tr>
<tr class="odd">
<td>Per layer: MLP <img src="https://latex.codecogs.com/png.latex?W_1,%20W_2"></td>
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20d_%7B%5Ctext%7Bff%7D%7D">, <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bff%7D%7D%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>non-linear transformation</td>
</tr>
<tr class="even">
<td>Final LayerNorm</td>
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>normalization scale/shift</td>
</tr>
<tr class="odd">
<td>Unembedding <img src="https://latex.codecogs.com/png.latex?U"> (often <img src="https://latex.codecogs.com/png.latex?E%5E%5Ctop">)</td>
<td><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20V"></td>
<td>maps final hidden → logits</td>
</tr>
</tbody>
</table>
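<p>Non-embedding parameters total roughly <img src="https://latex.codecogs.com/png.latex?12%20%5Ccdot%20L%20%5Ccdot%20d_%7B%5Ctext%7Bmodel%7D%7D%5E2">: per layer, the four attention matrices contribute <img src="https://latex.codecogs.com/png.latex?4%20d_%7B%5Ctext%7Bmodel%7D%7D%5E2"> and the two MLP matrices <img src="https://latex.codecogs.com/png.latex?8%20d_%7B%5Ctext%7Bmodel%7D%7D%5E2"> when <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bff%7D%7D%20=%204%20d_%7B%5Ctext%7Bmodel%7D%7D"> (the approximation used by Kaplan et al. in their scaling-law analysis). The embedding matrix adds <img src="https://latex.codecogs.com/png.latex?V%20%5Ccdot%20d_%7B%5Ctext%7Bmodel%7D%7D"> on top.</p>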
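<p>As a check on the rule of thumb, the sketch below counts weights directly from the shapes in the table, using the GPT-2 small values from the symbol table. Biases and LayerNorm parameters are deliberately ignored, and the function name <code>parameter_counts</code> is mine.</p>

```python
def parameter_counts(V, L, T_max, d_model, h, d_k, d_v, d_ff):
    """Count weights from the matrix shapes (biases and LayerNorm ignored)."""
    attention = (
        2 * d_model * h * d_k        # W_Q and W_K
        + d_model * h * d_v          # W_V
        + h * d_v * d_model          # W_O
    )
    mlp = 2 * d_model * d_ff         # W_1 and W_2
    return {
        "token_embedding": V * d_model,
        "positional_embedding": T_max * d_model,
        "per_layer": attention + mlp,
        "all_layers": L * (attention + mlp),
    }

# GPT-2 small values from the symbol table.
counts = parameter_counts(
    V=50257, L=12, T_max=1024, d_model=768, h=12, d_k=64, d_v=64, d_ff=3072
)
rule_of_thumb = 12 * 12 * 768 ** 2   # 12 * L * d_model^2
print(counts["all_layers"], rule_of_thumb)
print(counts["token_embedding"] + counts["positional_embedding"] + counts["all_layers"])
```

<p>For these values the layer stack matches the rule of thumb exactly (about 85 million), and adding the token and positional embeddings brings the total to roughly 124 million, the size usually quoted for GPT-2 small.</p>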
</section>
<section id="activations-recomputed-every-forward-pass" class="level3">
<h3 class="anchored" data-anchor-id="activations-recomputed-every-forward-pass">Activations (recomputed every forward pass)</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 36%">
<col style="width: 21%">
<col style="width: 41%">
</colgroup>
<thead>
<tr class="header">
<th>Activation</th>
<th>Shape</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Input token IDs</td>
<td><img src="https://latex.codecogs.com/png.latex?T"></td>
<td>integers in <img src="https://latex.codecogs.com/png.latex?%5B0,%20V)"></td>
</tr>
<tr class="even">
<td>Hidden state / residual stream <img src="https://latex.codecogs.com/png.latex?h%5E%7B(%5Cell)%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>the running “meaning” per token, threaded through layers</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?Q%5E%7B(%5Cell)%7D">, <img src="https://latex.codecogs.com/png.latex?K%5E%7B(%5Cell)%7D">, <img src="https://latex.codecogs.com/png.latex?V%5E%7B(%5Cell)%7D"> per head</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_k"> each, per head</td>
<td>computed from <img src="https://latex.codecogs.com/png.latex?h%5E%7B(%5Cell)%7D"> via <img src="https://latex.codecogs.com/png.latex?W_Q">, <img src="https://latex.codecogs.com/png.latex?W_K">, <img src="https://latex.codecogs.com/png.latex?W_V"></td>
</tr>
<tr class="even">
<td>Attention scores <img src="https://latex.codecogs.com/png.latex?QK%5E%5Ctop"></td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20T"> per head</td>
<td>the routing pattern</td>
</tr>
<tr class="odd">
<td>Attention weights (post-softmax)</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20T"> per head</td>
<td>row-stochastic, lower-triangular (causal mask)</td>
</tr>
<tr class="even">
<td>Attention output (single head)</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_v"></td>
<td>weighted sum of V rows</td>
</tr>
<tr class="odd">
<td>Attention output (all heads concat)</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20(h%20%5Ccdot%20d_v)"></td>
<td>usually <img src="https://latex.codecogs.com/png.latex?h%20%5Ccdot%20d_v%20=%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
</tr>
<tr class="even">
<td>Output of <img src="https://latex.codecogs.com/png.latex?W_O"> added to residual</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>written back into the residual stream</td>
</tr>
<tr class="odd">
<td>MLP output</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>also written back into the residual stream</td>
</tr>
<tr class="even">
<td>Final hidden state (after layer <img src="https://latex.codecogs.com/png.latex?L">, LayerNorm)</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"></td>
<td>input to the unembedding</td>
</tr>
<tr class="odd">
<td>Logits</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20V"></td>
<td>one row per token position</td>
</tr>
<tr class="even">
<td>Logits of the <em>last</em> token</td>
<td><img src="https://latex.codecogs.com/png.latex?V"></td>
<td>the one that matters for next-token prediction</td>
</tr>
<tr class="odd">
<td>Probability distribution over vocabulary</td>
<td><img src="https://latex.codecogs.com/png.latex?V">, sums to 1</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bsoftmax%7D"> of last-token logits</td>
</tr>
</tbody>
</table>
</section>
<section id="from-last-token-logits-to-the-next-token-distribution" class="level3">
<h3 class="anchored" data-anchor-id="from-last-token-logits-to-the-next-token-distribution">From last-token logits to the next-token distribution</h3>
<p>After the final layer, the hidden state of the last position <img src="https://latex.codecogs.com/png.latex?h%5E%7B(L)%7D_T%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%7B%5Ctext%7Bmodel%7D%7D%7D"> is multiplied by the unembedding matrix:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20h%5E%7B(L)%7D_T%20%5Ccdot%20U%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BV%7D"></p>
<p>These are the <strong>logits</strong>: one real number per vocabulary token, unnormalized. Softmax converts them into probabilities:</p>
<p><img src="https://latex.codecogs.com/png.latex?p(%5Ctext%7Bnext%20token%7D%20=%20i%20%5Cmid%20%5Ctext%7Bcontext%7D)%20=%20%5Cfrac%7B%5Cexp(%5Cell_i%20/%20%5Ctau)%7D%7B%5Csum_%7Bj=1%7D%5E%7BV%7D%20%5Cexp(%5Cell_j%20/%20%5Ctau)%7D"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is the sampling temperature. At <img src="https://latex.codecogs.com/png.latex?%5Ctau%20%5Cto%200"> this becomes greedy (<img src="https://latex.codecogs.com/png.latex?%5Carg%5Cmax">). At <img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%201"> you get the raw model distribution. The sampler then draws the next token from this distribution, appends it to the sequence, and the loop continues.</p>
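<p>The temperature formula and the sampling step can be sketched as follows. The logits are hypothetical, and the random generator is seeded only so that repeated runs behave the same.</p>

```python
import math
import random

def softmax_with_temperature(logits, tau):
    """p_i = exp(l_i / tau) / sum_j exp(l_j / tau), for tau strictly positive."""
    scaled = [x / tau for x in logits]
    peak = max(scaled)                               # stability shift: does not change the result
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """The tau -> 0 limit: pick the index of the highest logit."""
    return logits.index(max(logits))

def sample(logits, tau, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, tau)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                             # hypothetical last-token logits
rng = random.Random(0)                               # seeded for repeatability
for tau in (0.1, 1.0, 2.0):
    print(tau, [round(p, 3) for p in softmax_with_temperature(logits, tau)])
print("greedy:", greedy(logits), "sampled:", sample(logits, 1.0, rng))
```

<p>Lower temperatures concentrate the probability on the top logit; higher temperatures flatten the distribution and give the runners-up a real chance of being drawn.</p>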
</section>
<section id="kv-cache-size" class="level3">
<h3 class="anchored" data-anchor-id="kv-cache-size">KV cache size</h3>
<p>The KV cache has total size:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BKV%20cache%20size%7D%20=%202%20%5Ccdot%20L%20%5Ccdot%20T%20%5Ccdot%20h%20%5Ccdot%20d_k%20%5Ccdot%20(%5Ctext%7Bbytes%20per%20element%7D)"></p>
<p>The factor 2 is for K and V together. For a 7B model with <img src="https://latex.codecogs.com/png.latex?L=32">, <img src="https://latex.codecogs.com/png.latex?h=32">, <img src="https://latex.codecogs.com/png.latex?d_k=128">, at FP16 (2 bytes), 32K context:</p>
<p><img src="https://latex.codecogs.com/png.latex?2%20%5Ccdot%2032%20%5Ccdot%2032768%20%5Ccdot%2032%20%5Ccdot%20128%20%5Ccdot%202%20%5Capprox%2017%20%5Ctext%7B%20GB%7D"></p>
<p>This is why long-context inference is memory-hungry, and why DeepSeek’s Multi-head Latent Attention (which compresses K and V into a low-rank latent space) is such a big deal — it can cut this by an order of magnitude.</p>
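<p>The arithmetic is easy to reproduce. The sketch below assumes, as the formula does, plain multi-head attention with one K and one V vector per head, per layer, per token, for a single sequence; the function name is mine.</p>

```python
def kv_cache_bytes(L, T, h, d_k, bytes_per_element):
    """2 (K and V) x layers x tokens x heads x per-head dimension x element size."""
    return 2 * L * T * h * d_k * bytes_per_element

size = kv_cache_bytes(L=32, T=32 * 1024, h=32, d_k=128, bytes_per_element=2)
per_token = kv_cache_bytes(L=32, T=1, h=32, d_k=128, bytes_per_element=2)

print(size, "bytes in total")
print(round(size / 1e9, 1), "GB (decimal units)")
print(size / 2 ** 30, "GiB (binary units)")
print(per_token / 1024, "KiB per additional token")
```

<p>Seen per token, every extra token of context costs half a mebibyte in this configuration, which is why the cache grows linearly with the length of the conversation.</p>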
</section>
</section>
<section id="a-worked-example-tiny-model-tiny-vocabulary" class="level2">
<h2 class="anchored" data-anchor-id="a-worked-example-tiny-model-tiny-vocabulary">5. A worked example: tiny model, tiny vocabulary</h2>
<p>Let me build a deliberately small model so every matrix fits on a page. We’ll watch one attention step end-to-end.</p>
<section id="setup" class="level3">
<h3 class="anchored" data-anchor-id="setup">Setup</h3>
<ul>
<li>Vocabulary size <img src="https://latex.codecogs.com/png.latex?V%20=%2020">. Say the vocabulary is <code>{the, cat, dog, sat, ran, on, mat, floor, big, small, red, blue, fast, slow, a, and, ., is, was, .EOS}</code> — 20 tokens indexed 0–19.</li>
<li>Embedding dimension <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20=%205">.</li>
<li>One attention head, so <img src="https://latex.codecogs.com/png.latex?d_k%20=%20d_v%20=%205">.</li>
<li>One layer (we’ll ignore MLPs for clarity).</li>
<li>Context: 3 tokens. We’re going to compute attention for the sequence <code>the cat sat</code>, token IDs <img src="https://latex.codecogs.com/png.latex?%5B0,%201,%203%5D">.</li>
</ul>
</section>
<section id="step-1-token-embedding" class="level3">
<h3 class="anchored" data-anchor-id="step-1-token-embedding">Step 1: token embedding</h3>
<p>The embedding matrix <img src="https://latex.codecogs.com/png.latex?E"> is <img src="https://latex.codecogs.com/png.latex?20%20%5Ctimes%205">. After training it might look like (showing only the three rows we need):</p>
<pre><code>E[0]  = [ 0.10, -0.20,  0.05,  0.40,  0.15]   "the"
E[1]  = [ 0.30,  0.50, -0.10,  0.20,  0.00]   "cat"
E[3]  = [-0.40,  0.10,  0.60, -0.30,  0.20]   "sat"</code></pre>
<p>After looking up these three rows, the input to layer 1 is a <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%205"> matrix — three tokens, each as a 5-dimensional embedding:</p>
<pre><code>X = [[ 0.10, -0.20,  0.05,  0.40,  0.15],
     [ 0.30,  0.50, -0.10,  0.20,  0.00],
     [-0.40,  0.10,  0.60, -0.30,  0.20]]</code></pre>
<p>This is the initial residual stream <img src="https://latex.codecogs.com/png.latex?h%5E%7B(0)%7D">, shape <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D%20=%203%20%5Ctimes%205">.</p>
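<p>In code, Step 1 is nothing more than row selection. In this sketch only the three rows of <code>E</code> shown above are stored (in a dictionary keyed by token ID, a shortcut of mine), and only the first four vocabulary entries are listed.</p>

```python
# Rows of E needed for "the cat sat"; the other 17 rows are omitted here.
E = {
    0: [0.10, -0.20, 0.05, 0.40, 0.15],    # "the"
    1: [0.30, 0.50, -0.10, 0.20, 0.00],    # "cat"
    3: [-0.40, 0.10, 0.60, -0.30, 0.20],   # "sat"
}
vocab = ["the", "cat", "dog", "sat"]        # first entries of the toy vocabulary

# Tokenise by whitespace (enough for this toy case), then look up each row.
token_ids = [vocab.index(word) for word in "the cat sat".split()]
X = [E[t] for t in token_ids]               # initial residual stream, T x d_model

print(token_ids)
print(len(X), "x", len(X[0]))
```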
</section>
<section id="step-2-compute-q-k-v" class="level3">
<h3 class="anchored" data-anchor-id="step-2-compute-q-k-v">Step 2: compute Q, K, V</h3>
<p>The weight matrices <img src="https://latex.codecogs.com/png.latex?W_Q">, <img src="https://latex.codecogs.com/png.latex?W_K">, <img src="https://latex.codecogs.com/png.latex?W_V"> are each <img src="https://latex.codecogs.com/png.latex?5%20%5Ctimes%205"> here.</p>
<p>General shape: <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20(h%20%5Ccdot%20d_k)"> for <img src="https://latex.codecogs.com/png.latex?W_Q,%20W_K"> and <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20(h%20%5Ccdot%20d_v)"> for <img src="https://latex.codecogs.com/png.latex?W_V">. In this toy example we have one head with <img src="https://latex.codecogs.com/png.latex?d_k%20=%20d_v%20=%20d_%7B%5Ctext%7Bmodel%7D%7D%20=%205">, so the shape collapses to <img src="https://latex.codecogs.com/png.latex?5%20%5Ctimes%205">. <strong>In a real multi-head model the per-head <img src="https://latex.codecogs.com/png.latex?V"> slice has shape <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_v">, not <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"> — the equality holds only after concatenating all <img src="https://latex.codecogs.com/png.latex?h"> heads.</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?W_V"> is also called the IN matrix.</p>
<p>Let’s say:</p>
<pre><code>W_Q = [[ 1.0,  0.0,  0.0,  0.5,  0.0],
       [ 0.0,  1.0,  0.0,  0.0,  0.5],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.5,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.5,  0.0,  0.0,  1.0]]

W_K = [[ 0.8,  0.2,  0.0,  0.0,  0.0],
       [ 0.2,  0.8,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.9,  0.1,  0.0],
       [ 0.0,  0.0,  0.1,  0.9,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]

W_V = [[ 0.5,  0.5,  0.0,  0.0,  0.0],
       [ 0.5, -0.5,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]</code></pre>
<p>Compute <img src="https://latex.codecogs.com/png.latex?Q%20=%20X%20%5Ccdot%20W_Q">, <img src="https://latex.codecogs.com/png.latex?K%20=%20X%20%5Ccdot%20W_K">, <img src="https://latex.codecogs.com/png.latex?V%20=%20X%20%5Ccdot%20W_V">. Each is <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%205%20=%20T%20%5Ctimes%20d_k"> in this single-head example. In general, per head, the shapes are <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_k"> for <img src="https://latex.codecogs.com/png.latex?Q"> and <img src="https://latex.codecogs.com/png.latex?K">, and <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_v"> for <img src="https://latex.codecogs.com/png.latex?V">. In the three matrices below, each row is a token and each column indexes one coordinate of the per-head Q/K/V vector.</p>
<pre><code>Q = [[ 0.300, -0.125,  0.050,  0.450,  0.050],   # for "the"
     [ 0.400,  0.500, -0.100,  0.350,  0.250],   # for "cat"
     [-0.550,  0.200,  0.600, -0.500,  0.250]]   # for "sat"

K = [[ 0.040, -0.140,  0.085,  0.365,  0.150],   # for "the"
     [ 0.340,  0.460, -0.070,  0.170,  0.000],   # for "cat"
     [-0.300,  0.000,  0.510, -0.210,  0.200]]   # for "sat"

V = [[-0.050,  0.150,  0.050,  0.400,  0.150],   # for "the"
     [ 0.400, -0.100, -0.100,  0.200,  0.000],   # for "cat"
     [-0.150, -0.250,  0.600, -0.300,  0.200]]   # for "sat"</code></pre>
<p>(Values are shown to three decimals; the principle is what matters.)</p>
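<p>For readers who want to recompute the three products themselves, here is a plain-Python sketch that multiplies <code>X</code> by the weight matrices given above. The helper <code>matmul</code> is mine; nothing else is assumed.</p>

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[0.10, -0.20, 0.05, 0.40, 0.15],    # "the"
     [0.30, 0.50, -0.10, 0.20, 0.00],    # "cat"
     [-0.40, 0.10, 0.60, -0.30, 0.20]]   # "sat"

W_Q = [[1.0, 0.0, 0.0, 0.5, 0.0],
       [0.0, 1.0, 0.0, 0.0, 0.5],
       [0.0, 0.0, 1.0, 0.0, 0.0],
       [0.5, 0.0, 0.0, 1.0, 0.0],
       [0.0, 0.5, 0.0, 0.0, 1.0]]

W_K = [[0.8, 0.2, 0.0, 0.0, 0.0],
       [0.2, 0.8, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.9, 0.1, 0.0],
       [0.0, 0.0, 0.1, 0.9, 0.0],
       [0.0, 0.0, 0.0, 0.0, 1.0]]

W_V = [[0.5, 0.5, 0.0, 0.0, 0.0],
       [0.5, -0.5, 0.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 0.0, 1.0]]

# Each product is T x d_k (3 x 5): one row per token.
Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
for name, M in (("Q", Q), ("K", K), ("V", V)):
    print(name)
    for row in M:
        print(["%.3f" % v for v in row])
```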
<p>A <strong>fourth weight matrix</strong>, <img src="https://latex.codecogs.com/png.latex?W_O"> (the OUT matrix), is also a learned parameter of the attention block. Its job is to project the concatenated per-head attention outputs back into the residual stream’s dimension and (in multi-head attention) to mix information across heads.</p>
<p>General shape: <img src="https://latex.codecogs.com/png.latex?(h%20%5Ccdot%20d_v)%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D">. Here, with one head and <img src="https://latex.codecogs.com/png.latex?d_v%20=%20d_%7B%5Ctext%7Bmodel%7D%7D%20=%205">, it collapses to <img src="https://latex.codecogs.com/png.latex?5%20%5Ctimes%205">. Note: <img src="https://latex.codecogs.com/png.latex?h%20%5Ccdot%20d_v"> is the <em>concatenated</em> head-output dimension, which in most architectures equals <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D"> by design — that’s why <img src="https://latex.codecogs.com/png.latex?W_O"> usually looks like a <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"> square in implementations.</p>
<pre><code>W_O = [[ 1.0,  0.0,  0.0,  0.0,  0.0],
       [ 0.0,  1.0,  0.0,  0.0,  0.0],
       [ 0.0,  0.0,  1.0,  0.0,  0.0],
       [ 0.0,  0.0,  0.0,  1.0,  0.0],
       [ 0.0,  0.0,  0.0,  0.0,  1.0]]</code></pre>
<p>(For simplicity we’ve set <img src="https://latex.codecogs.com/png.latex?W_O"> to the identity matrix here, meaning the attention output passes through unchanged. In a trained model <img src="https://latex.codecogs.com/png.latex?W_O"> would be a learned dense matrix that performs the head-mixing and projection described above.)</p>
</section>
<section id="step-3-attention-scores-qktop" class="level3">
<h3 class="anchored" data-anchor-id="step-3-attention-scores-qktop">Step 3: attention scores <img src="https://latex.codecogs.com/png.latex?QK%5E%5Ctop"></h3>
<p>This is a <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> matrix where entry <img src="https://latex.codecogs.com/png.latex?(i,%20j)"> is the dot product of token <img src="https://latex.codecogs.com/png.latex?i">’s query with token <img src="https://latex.codecogs.com/png.latex?j">’s key — a similarity score <em>between tokens</em> (not between heads), measuring how strongly token <img src="https://latex.codecogs.com/png.latex?i"> wants to attend to token <img src="https://latex.codecogs.com/png.latex?j">.</p>
<pre><code>QK^T = [[ 0.20,  0.07, -0.21],
        [-0.01,  0.40, -0.23],
        [-0.16,  0.06,  0.59]]</code></pre>
<p>Now divide by <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bd_k%7D%20=%20%5Csqrt%7B5%7D%20%5Capprox%202.24">:</p>
<pre><code>QK^T / sqrt(5) = [[ 0.089,  0.031, -0.094],
                  [-0.004,  0.179, -0.103],
                  [-0.071,  0.027,  0.264]]</code></pre>
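<p>As a quick check on the scaling, the division can be reproduced in a few lines of plain Python. This is only a sketch: the names <code>scores</code> and <code>scaled</code> are ours, and the input is the rounded matrix displayed above, so the last digit can differ slightly from a full-precision run.</p>

```python
import math

d_k = 5  # per-head key dimension in the toy example

# Raw dot products QK^T, copied from the rounded matrix shown above
scores = [
    [0.20, 0.07, -0.21],
    [-0.01, 0.40, -0.23],
    [-0.16, 0.06, 0.59],
]

# Divide every entry by sqrt(d_k) so the scores keep a moderate size
scale = math.sqrt(d_k)
scaled = [[s / scale for s in row] for row in scores]

for row in scaled:
    print([round(s, 3) for s in row])
```

<p>The scaling changes only the size of the scores, not their ordering within a row, so it softens the softmax that follows without changing which token ranks highest.</p>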
</section>
<section id="step-4-causal-mask-softmax" class="level3">
<h3 class="anchored" data-anchor-id="step-4-causal-mask-softmax">Step 4: causal mask + softmax</h3>
<p>Since we’re doing autoregressive language modeling, each token can only attend to itself and earlier tokens. We mask the upper triangle to <img src="https://latex.codecogs.com/png.latex?-%5Cinfty"> (so softmax gives them weight 0):</p>
<pre><code>masked = [[ 0.089,  -inf,  -inf],
          [-0.004,  0.179,  -inf],
          [-0.071,  0.027,  0.264]]</code></pre>
<p>Apply softmax row by row:</p>
<pre><code>attention_weights = [[1.000, 0.000, 0.000],
                     [0.454, 0.546, 0.000],
                     [0.250, 0.276, 0.474]]</code></pre>
<p>Read this carefully: <strong>the third row says that when computing the output for “sat”, the model attends 25% to “the”, 28% to “cat”, and 47% to itself.</strong> This is the attention pattern, the “who attends to whom” that QK produced.</p>
</section>
<section id="step-5-weighted-sum-of-v" class="level3">
<h3 class="anchored" data-anchor-id="step-5-weighted-sum-of-v">Step 5: weighted sum of V</h3>
<p>Multiply <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Battention%5C_weights%7D%20%5Ccdot%20V">. Per head this is <img src="https://latex.codecogs.com/png.latex?(T%20%5Ctimes%20T)%20%5Ccdot%20(T%20%5Ctimes%20d_v)%20=%20T%20%5Ctimes%20d_v">. With one head and <img src="https://latex.codecogs.com/png.latex?d_v%20=%205">, the result is <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%205">:</p>
<pre><code>attention_output = [[-0.050,  0.150,  0.050,  0.400,  0.150],   # "the" attends only to itself
                    [ 0.196,  0.014, -0.027,  0.291,  0.068],   # "cat" blends "the" + "cat"
                    [-0.022, -0.094,  0.262, -0.061,  0.132]]   # "sat" blends all three</code></pre>
<p>The third row is the interesting one: it’s a weighted blend (25%/28%/47%) of the three V vectors. <strong>This is the V “payload” being routed along the attention edges that QK set up.</strong> The output for “sat” now incorporates information drawn from “the” and “cat”. This routing is the mechanism by which information moves between positions, and hence how a Transformer can capture dependencies between distant tokens.</p>
<p>(With multiple heads, you’d have <img src="https://latex.codecogs.com/png.latex?h"> such <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_v"> blocks, one per head, computed in parallel from independent <img src="https://latex.codecogs.com/png.latex?Q%5E%7B(j)%7D,%20K%5E%7B(j)%7D,%20V%5E%7B(j)%7D"> slices. They get concatenated along the last axis into a single <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20(h%20%5Ccdot%20d_v)"> tensor before the next step. <strong>Section 6 carries out exactly this multi-head case numerically.</strong>)</p>
</section>
<section id="step-6-through-w_o-back-into-the-residual-stream" class="level3">
<h3 class="anchored" data-anchor-id="step-6-through-w_o-back-into-the-residual-stream">Step 6: through <img src="https://latex.codecogs.com/png.latex?W_O">, back into the residual stream</h3>
<p>The concatenated attention output (here just one head, so “concatenation” is a no-op) is projected by <img src="https://latex.codecogs.com/png.latex?W_O">:</p>
<p><img src="https://latex.codecogs.com/png.latex?h%5E%7B(1)%7D%20=%20h%5E%7B(0)%7D%20+%20%5Ctext%7Battention%5C_output%7D%20%5Ccdot%20W_O"></p>
<p>Shape arithmetic: <img src="https://latex.codecogs.com/png.latex?(T%20%5Ctimes%20(h%20%5Ccdot%20d_v))%20%5Ccdot%20((h%20%5Ccdot%20d_v)%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D)%20=%20T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D">. In our single-head toy, that’s <img src="https://latex.codecogs.com/png.latex?(3%20%5Ctimes%205)%20%5Ccdot%20(5%20%5Ctimes%205)%20=%203%20%5Ctimes%205">.</p>
<p>This is the <strong>residual addition</strong> — the attention output is <em>added</em> to the previous residual stream, not replacing it. The hidden state <img src="https://latex.codecogs.com/png.latex?h%5E%7B(1)%7D"> is still shape <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D%20=%203%20%5Ctimes%205">.</p>
<p>In a real model, MLP layers would now run on top, also reading from and writing to the residual stream. We’d repeat the whole thing <img src="https://latex.codecogs.com/png.latex?L"> times. Here we have one layer, so we go straight to the output. (Section 6 adds the MLP and a second layer explicitly.)</p>
</section>
<section id="step-7-final-hidden-state-logits-next-token-distribution" class="level3">
<h3 class="anchored" data-anchor-id="step-7-final-hidden-state-logits-next-token-distribution">Step 7: final hidden state → logits → next-token distribution</h3>
<p>We only care about the next token, so we take the last row of <img src="https://latex.codecogs.com/png.latex?h%5E%7B(1)%7D">: a single 5-vector, the final hidden state for “sat”. Suppose after <img src="https://latex.codecogs.com/png.latex?W_O"> and residual it’s:</p>
<pre><code>h_last = [-0.351,  0.030,  0.802, -0.398,  0.247]</code></pre>
<p>Multiply by the unembedding matrix <img src="https://latex.codecogs.com/png.latex?U">, shape <img src="https://latex.codecogs.com/png.latex?5%20%5Ctimes%2020"> (one column per vocabulary token):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20h_%7B%5Ctext%7Blast%7D%7D%20%5Ccdot%20U%20%5Cin%20%5Cmathbb%7BR%7D%5E%7B20%7D"></p>
<p>Result might look like:</p>
<pre><code>logits = [-0.8, -0.2,  0.1,  0.5,  1.2,  2.4,  3.1,  2.8,  0.4, -0.1,
          -0.3,  0.0, -0.5, -0.4,  0.6,  0.2,  1.0, -0.2,  0.3, -1.5]
            the  cat  dog  sat  ran   on   mat floor  big small  red ...</code></pre>
<p>The model thinks “on” (logit 2.4), “mat” (3.1), “floor” (2.8) are the most likely continuations of “the cat sat”. Applying softmax at temperature 1:</p>
<pre><code>P("mat"   | "the cat sat") ≈ 0.31
P("floor" | "the cat sat") ≈ 0.23
P("on"    | "the cat sat") ≈ 0.16
P(others)                  ≈ remainder</code></pre>
<p>This is the conditional distribution <img src="https://latex.codecogs.com/png.latex?p(%5Ctext%7Bnext%7D%20%5Cmid%20%5Ctext%7Bcontext%7D)"> that the language model has been trained to approximate. The sampler picks a token from this — greedy would pick “mat” — appends it to the sequence, and the next forward pass begins.</p>
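<p>The read-out above can be replayed in plain Python. The sketch below takes the 20 example logits as given, applies a softmax at temperature 1, and then makes the greedy choice; the function name <code>softmax</code> and the variable names are ours, not part of any library.</p>

```python
import math

# The 20 example logits shown above (index 5 = "on", 6 = "mat", 7 = "floor")
logits = [-0.8, -0.2, 0.1, 0.5, 1.2, 2.4, 3.1, 2.8, 0.4, -0.1,
          -0.3, 0.0, -0.5, -0.4, 0.6, 0.2, 1.0, -0.2, 0.3, -1.5]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(values)  # subtracting the max avoids overflow, result unchanged
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Greedy decoding: take the index with the highest probability
greedy_index = max(range(len(probs)), key=lambda i: probs[i])

print(round(probs[6], 3), round(probs[7], 3), round(probs[5], 3))
print("greedy pick: index", greedy_index)
```

<p>Every one of the 20 tokens receives some probability; the three leaders simply take the largest shares.</p>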
</section>
<section id="final-step-calculations-in-detail." class="level3">
<h3 class="anchored" data-anchor-id="final-step-calculations-in-detail.">Final step calculations in detail</h3>
<section id="the-setup" class="level4">
<h4 class="anchored" data-anchor-id="the-setup">The setup</h4>
<p>After the last transformer layer (<img src="https://latex.codecogs.com/png.latex?L">), the residual stream is a <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"> tensor — one <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D">-dimensional vector per token position. Call it <img src="https://latex.codecogs.com/png.latex?h%5E%7B(L)%7D">.</p>
<p>At inference time, when you want to predict the <em>next</em> token, you only care about the last row: <img src="https://latex.codecogs.com/png.latex?h%5E%7B(L)%7D_T%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%7B%5Ctext%7Bmodel%7D%7D%7D">. This vector is the model’s final, fully-processed representation of “what comes next” given the entire context so far.</p>
</section>
<section id="the-two-operations" class="level4">
<h4 class="anchored" data-anchor-id="the-two-operations">The two operations</h4>
<p><strong>1. Final LayerNorm (or RMSNorm).</strong> Before unembedding, virtually all modern transformers apply one last normalization to the residual stream. It rescales the vector so its components have controlled magnitude, then applies a learned per-dimension scale (and sometimes shift):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctilde%7Bh%7D%20=%20%5Ctext%7BLayerNorm%7D(h%5E%7B(L)%7D_T)"></p>
<p>The shape stays <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D">. This step is easy to forget but it matters — without it, the magnitudes coming out of the residual stream would be wild, since every layer has been adding to it.</p>
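<p>A minimal sketch of the two normalizations named above, written for a single vector. The input values, and the choice of a scale of 1 and a shift of 0, are hypothetical and picked only to make the effect visible; real models learn the scale and shift, and their exact formulas may differ in details such as the value of <code>eps</code>.</p>

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Center and rescale one vector, then apply a learned scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm variant: rescale by the root mean square, no centering, no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

# Hypothetical last-row vector with large, uneven magnitudes
h_last = [12.0, -3.0, 40.0, -25.0, 6.0]
gamma = [1.0] * 5  # learned scale, set to 1 for illustration
beta = [0.0] * 5   # learned shift, set to 0 for illustration

h_tilde = layer_norm(h_last, gamma, beta)
h_rms = rms_norm(h_last, gamma)
print([round(v, 3) for v in h_tilde])
print([round(v, 3) for v in h_rms])
```

<p>Either way the output keeps the same five components in the same order; only their overall size is brought under control.</p>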
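<p><strong>2. Unembedding (the linear projection to vocabulary).</strong> The normalized vector is multiplied by the unembedding matrix <img src="https://latex.codecogs.com/png.latex?U%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20V%7D">, where <img src="https://latex.codecogs.com/png.latex?V"> now denotes the vocabulary size (not the value matrix of the attention steps):</p>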
<p><img src="https://latex.codecogs.com/png.latex?%5Cell%20=%20%5Ctilde%7Bh%7D%20%5Ccdot%20U%20%5Cin%20%5Cmathbb%7BR%7D%5E%7BV%7D"></p>
<p>Each column of <img src="https://latex.codecogs.com/png.latex?U"> is a <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D">-dimensional vector — one per vocabulary token. The matrix multiplication computes, for every vocabulary token <img src="https://latex.codecogs.com/png.latex?i">, the dot product between the final hidden state and that token’s column:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell_i%20=%20%5Ctilde%7Bh%7D%20%5Ccdot%20U_%7B:,i%7D"></p>
<p>So each logit <img src="https://latex.codecogs.com/png.latex?%5Cell_i"> is literally <strong>a similarity score between the final hidden state and the <img src="https://latex.codecogs.com/png.latex?i">-th vocabulary token’s representation in <img src="https://latex.codecogs.com/png.latex?U"></strong>. Tokens whose column points in the same direction as <img src="https://latex.codecogs.com/png.latex?%5Ctilde%7Bh%7D"> get high logits; tokens whose column points elsewhere get low logits.</p>
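<p>To make the “one dot product per vocabulary token” reading concrete, here is a small sketch with a made-up three-dimensional hidden state and a made-up four-token vocabulary. The matrix <code>U</code>, the labels in <code>vocab</code>, and the helper names are all hypothetical.</p>

```python
# Hypothetical normalized hidden state, d_model = 3
h_tilde = [1.0, 0.5, -1.0]

# Hypothetical unembedding matrix U, shape d_model x V with V = 4:
# each COLUMN is the representation of one vocabulary token
U = [
    [0.9, -0.8, 0.1, 0.0],
    [0.4, -0.1, 0.2, 0.5],
    [-0.7, 0.6, 0.0, 0.3],
]
vocab = ["mat", "sky", "on", "red"]  # made-up labels for the columns

def column(matrix, i):
    """Extract column i of a row-major matrix."""
    return [row[i] for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One logit per vocabulary token: dot product of h_tilde with that token's column
logits = [dot(h_tilde, column(U, i)) for i in range(len(vocab))]

for token, logit in zip(vocab, logits):
    print(token, round(logit, 2))
```

<p>The column that points most nearly along <code>h_tilde</code> gets the highest logit, and the column that points the opposite way gets the lowest.</p>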
</section>
<section id="interpreting-the-similarity-score" class="level4">
<h4 class="anchored" data-anchor-id="interpreting-the-similarity-score">Interpreting the similarity score</h4>
<p>It is tempting to read this one step further and say: the model is trained to produce a last-token vector that <em>points in the same direction</em> as the embedding vectors of the likely next tokens. That intuition is sound, and it is exactly the geometry the third note develops in full — so here we only state the result and move on.</p>
<p>Training (gradient descent on the cross-entropy loss) shapes <img src="https://latex.codecogs.com/png.latex?%5Ctilde%7Bh%7D"> to have <strong>high dot product with the columns of likely next tokens</strong> and <strong>low dot product with the rest</strong>. Under weight tying (<img src="https://latex.codecogs.com/png.latex?U%20=%20E%5E%5Ctop">, below), those columns <em>are</em> the input embedding vectors, so the picture is almost literal. Two refinements keep it honest, both elaborated in <a href="../../posts/the-final-step/index.html">Note 3</a>: the vector does not point at <em>one</em> token but positions itself among <em>many</em> plausible continuations at once (which is how the model expresses uncertainty), and its <strong>magnitude</strong> — not just its direction — matters, because a longer vector sharpens the softmax (more confident) and a shorter one flattens it (less sure). The compact statement:</p>
<blockquote class="blockquote">
<p>The model learns to produce a final residual-stream vector whose <strong>direction</strong> encodes <em>which</em> next tokens are likely and whose <strong>magnitude</strong> encodes <em>how confident</em> the prediction is.</p>
</blockquote>
<p>This is the foundation of the <strong>logit lens</strong>, which applies the unembedding <img src="https://latex.codecogs.com/png.latex?U"> to <em>intermediate</em> residual streams to ask what the model would predict if forced to commit early; it works because the residual stream lives, throughout the network, in the same space the final read-out uses. Note 3 turns this whole picture — the last step as a similarity search over the vocabulary — into its central theme.</p>
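<p>The direction-versus-magnitude statement can be tested on a toy case. The sketch below keeps the direction of a hypothetical hidden state fixed and changes only its length; the matrix <code>U</code> and the vector <code>h</code> are invented for the illustration.</p>

```python
import math

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(h, U):
    """Logits = h . U, with U stored as d_model rows of length V."""
    return [sum(h[k] * U[k][i] for k in range(len(h))) for i in range(len(U[0]))]

# Hypothetical 2-dimensional toy with three vocabulary columns
U = [
    [1.0, 0.8, -1.0],
    [0.0, 0.6, 0.0],
]
h = [1.0, 0.2]  # hypothetical final hidden state

for scale in (0.5, 1.0, 4.0):
    scaled_h = [scale * v for v in h]
    probs = softmax(logits_for(scaled_h, U))
    # Same direction, so the ranking of tokens never changes;
    # a longer vector only makes the distribution more peaked
    print(scale, [round(p, 3) for p in probs])
```

<p>Because the read-out is linear, multiplying the vector by a factor multiplies every logit by that factor, which acts on the softmax like dividing the temperature by the same factor.</p>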
</section>
<section id="weight-tying" class="level4">
<h4 class="anchored" data-anchor-id="weight-tying">Weight tying</h4>
<p>A detail worth knowing: in many models (GPT-2 and Gemma, among others), <img src="https://latex.codecogs.com/png.latex?U"> is <strong>the same matrix as the token embedding <img src="https://latex.codecogs.com/png.latex?E"></strong>; specifically, <img src="https://latex.codecogs.com/png.latex?U%20=%20E%5E%5Ctop">. This is called <strong>weight tying</strong>. The intuition is elegant: the embedding matrix maps <em>token ID → vector</em> (each row is a token’s representation); the unembedding maps <em>vector → token logits</em> (each column is a token’s representation). It’s the same dictionary, used in two directions.</p>
<p>Weight tying cuts parameter count noticeably (that matrix is <img src="https://latex.codecogs.com/png.latex?V%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D">, which can be hundreds of millions of parameters for large vocabularies), and empirically it often improves quality.</p>
<p>Not all models tie weights (several recent ones, including the larger Llama models, keep the two matrices separate, which gives the unembedding more flexibility), but it’s a very common default.</p>
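<p>The “same dictionary, two directions” idea fits in a few lines. The tiny matrix <code>E</code> below is hypothetical; the parameter count at the end uses GPT-2 small’s published sizes (a vocabulary of 50,257 tokens and a model width of 768) purely as a scale reference.</p>

```python
# Hypothetical tiny embedding matrix E, shape V x d_model (V = 3, d_model = 2):
# each ROW is one token's vector
E = [
    [0.2, -0.1],
    [0.7, 0.3],
    [-0.4, 0.9],
]

# Weight tying: the unembedding is the transpose, shape d_model x V,
# so each COLUMN of U is the same vector as the matching row of E
U = [list(col) for col in zip(*E)]

token_id = 2
embedding_vector = E[token_id]                     # token ID -> vector
unembedding_column = [row[token_id] for row in U]  # column used for that token's logit
print(embedding_vector == unembedding_column)

# Parameters saved by not storing a second matrix, using GPT-2 small's
# published sizes (vocabulary 50,257, d_model 768)
vocab_size, d_model = 50257, 768
print(vocab_size * d_model)
```

<p>In a real framework the two layers would share one underlying tensor rather than hold a transposed copy; the copy here is only for readability.</p>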
</section>
<section id="from-logits-to-a-distribution" class="level4">
<h4 class="anchored" data-anchor-id="from-logits-to-a-distribution">From logits to a distribution</h4>
<p>Logits are just real numbers, unnormalized. They can be negative, can be huge — they don’t sum to anything meaningful on their own. To turn them into a probability distribution over the vocabulary, apply softmax:</p>
<p><img src="https://latex.codecogs.com/png.latex?p_i%20=%20%5Cfrac%7B%5Cexp(%5Cell_i%20/%20%5Ctau)%7D%7B%5Csum_%7Bj=1%7D%5E%7BV%7D%20%5Cexp(%5Cell_j%20/%20%5Ctau)%7D"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is the sampling temperature. The sampler then draws the next token ID from this distribution (or picks <img src="https://latex.codecogs.com/png.latex?%5Carg%5Cmax"> for greedy decoding), and the next forward pass begins.</p>
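<p>A minimal sketch of temperature sampling, assuming a hypothetical four-token vocabulary. The helper names <code>softmax_with_temperature</code> and <code>sample_next</code> are ours; production samplers usually add further filters (top-k, top-p) that are left out here.</p>

```python
import math
import random

def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau; smaller tau sharpens, larger tau flattens."""
    scaled = [x / tau for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, tau, rng):
    """Draw one token ID from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, tau)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.4, 3.1, 2.8, 0.4]  # hypothetical logits for a 4-token vocabulary
rng = random.Random(0)         # fixed seed so runs are repeatable

for tau in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs], "sampled:", sample_next(logits, tau, rng))

# Greedy decoding ignores tau and always returns the arg max
greedy = max(range(len(logits)), key=lambda i: logits[i])
print("greedy:", greedy)
```

<p>As the temperature approaches zero the sampled token coincides with the greedy one almost surely, which is why greedy decoding is often described as the zero-temperature limit.</p>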
</section>
<section id="the-compact-picture" class="level4">
<h4 class="anchored" data-anchor-id="the-compact-picture">The compact picture</h4>
<p><img src="https://latex.codecogs.com/png.latex?%5Cunderbrace%7Bh%5E%7B(L)%7D_T%7D_%7Bd_%7B%5Ctext%7Bmodel%7D%7D%7D%20%5Cxrightarrow%7B%5Ctext%7BLayerNorm%7D%7D%20%5Cunderbrace%7B%5Ctilde%7Bh%7D%7D_%7Bd_%7B%5Ctext%7Bmodel%7D%7D%7D%20%5Cxrightarrow%7B%5Ctimes%20U%7D%20%5Cunderbrace%7B%5Cell%7D_%7BV%7D%20%5Cxrightarrow%7B%5Ctext%7Bsoftmax%7D%7D%20%5Cunderbrace%7Bp%7D_%7BV%7D"></p>
<p>One normalization, one matrix multiplication, and one softmax away from a probability distribution over the entire vocabulary. The residual stream did the heavy lifting; the unembedding just reads out the answer.</p>
</section>
<section id="two-things-to-keep-in-mind" class="level4">
<h4 class="anchored" data-anchor-id="two-things-to-keep-in-mind">Two things to keep in mind</h4>
<p><strong>The residual stream “decides” everything before the unembedding.</strong> The unembedding is a fixed linear readout — it has no capacity to think, only to project. By the time you reach <img src="https://latex.codecogs.com/png.latex?U">, all the work of conditioning on the context has already been done by the <img src="https://latex.codecogs.com/png.latex?L"> transformer layers writing into <img src="https://latex.codecogs.com/png.latex?h%5E%7B(L)%7D_T">. The unembedding is a translator from “internal representation space” to “vocabulary space.”</p>
<p><strong>During training, you compute logits for every position.</strong> At inference you only need the last row, but training computes <img src="https://latex.codecogs.com/png.latex?%5Cell"> for all <img src="https://latex.codecogs.com/png.latex?T"> positions in parallel — each row predicts its successor — so the loss can be evaluated everywhere at once (teacher forcing). That’s why the full logits tensor in training has shape <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20V">, while at inference you typically only materialize the last row.</p>
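<p>The “each row predicts its successor” bookkeeping can be sketched as follows. The token IDs and logits are invented; the assumption is a training sequence of <code>T + 1</code> tokens, where the first <code>T</code> are inputs and the last <code>T</code> are the targets.</p>

```python
import math

def log_softmax(row):
    top = max(row)
    log_total = top + math.log(sum(math.exp(v - top) for v in row))
    return [v - log_total for v in row]

def next_token_loss(logits, targets):
    """Average cross-entropy, scoring row t against the token that follows it."""
    losses = [-log_softmax(row)[target] for row, target in zip(logits, targets)]
    return sum(losses) / len(losses)

# Hypothetical training sequence of T + 1 = 4 token IDs
token_ids = [0, 2, 1, 3]
targets = token_ids[1:]  # successors of the T = 3 input positions

# Hypothetical T x V logits (T = 3 positions, V = 4 vocabulary entries)
logits = [
    [0.1, 0.2, 2.0, -1.0],  # position 1 should predict token 2
    [0.0, 1.5, 0.3, 0.2],   # position 2 should predict token 1
    [0.4, -0.2, 0.1, 1.1],  # position 3 should predict token 3
]

loss = next_token_loss(logits, targets)
print(round(loss, 4))
```

<p>All three rows contribute to the loss in a single pass, whereas at inference only the last row would be read.</p>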
</section>
</section>
</section>
<section id="a-second-worked-example-two-heads-two-layers-with-an-mlp" class="level2">
<h2 class="anchored" data-anchor-id="a-second-worked-example-two-heads-two-layers-with-an-mlp">6. A second worked example: two heads, two layers, with an MLP</h2>
<p>The first example deliberately stripped the model down to a single head and a single layer, and skipped the MLP entirely. That was the right move for seeing attention clearly, but it hides three things a real Transformer does on every forward pass: it <strong>splits</strong> the work across several heads, it <strong>refines</strong> each token with an MLP, and it <strong>stacks</strong> layers so the residual stream is processed again and again. This example puts all three back, kept just small enough to do by hand.</p>
<p>The point of this section is <strong>dimensionality</strong>: watching the shape of the data at every step as it flows through two complete layers. The numbers below are real (computed and checked), but don’t memorize them — the weights are random, not trained, so the specific final token is meaningless. Follow the <strong>shapes</strong>. (Every matrix is shown rounded to two decimals, so re-adding the displayed intermediates by hand may differ from a shown result by <img src="https://latex.codecogs.com/png.latex?%5Cpm%200.01">; the full-precision computation is consistent.)</p>
<section id="setup-1" class="level3">
<h3 class="anchored" data-anchor-id="setup-1">Setup</h3>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20=%206">.</li>
<li><img src="https://latex.codecogs.com/png.latex?h%20=%202"> heads, so the per-head dimension is <img src="https://latex.codecogs.com/png.latex?d_k%20=%20d_v%20=%20d_%7B%5Ctext%7Bmodel%7D%7D%20/%20h%20=%206%20/%202%20=%203">.</li>
<li><img src="https://latex.codecogs.com/png.latex?L%20=%202"> layers, each one a full <strong>attention block + MLP block</strong>.</li>
<li>MLP hidden size <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bff%7D%7D%20=%208">. (Real models typically use <img src="https://latex.codecogs.com/png.latex?%5Csim%204%20%5Ccdot%20d_%7B%5Ctext%7Bmodel%7D%7D">, which would be 24 here; we shrink it so the matrices stay on the page.)</li>
<li>ReLU nonlinearity in the MLP: <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BReLU%7D(z)%20=%20%5Cmax(0,%20z)">, applied element-wise — it simply zeros out negatives.</li>
<li>Context: the same 3 tokens, <code>the cat sat</code>, so <img src="https://latex.codecogs.com/png.latex?T%20=%203">.</li>
</ul>
<p>Here is the whole journey as a <strong>shape table</strong>. Everything below just fills in the numbers for these rows.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 29%">
<col style="width: 29%">
<col style="width: 40%">
</colgroup>
<thead>
<tr class="header">
<th>Step</th>
<th>Operation</th>
<th>Output shape</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>token IDs</td>
<td>lookup</td>
<td><img src="https://latex.codecogs.com/png.latex?%5BT%5D%20=%20%5B3%5D"></td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?h%5E%7B(0)%7D"></td>
<td>embedding + positional</td>
<td><img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D%20=%203%20%5Ctimes%206"></td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?Q,%20K,%20V"> (all heads)</td>
<td><img src="https://latex.codecogs.com/png.latex?h%5E%7B(0)%7D%20W_Q">, etc.</td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206"> each</td>
</tr>
<tr class="even">
<td>per-head <img src="https://latex.codecogs.com/png.latex?Q,%20K,%20V"></td>
<td>slice into <img src="https://latex.codecogs.com/png.latex?h=2"> blocks</td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> each, ×2</td>
</tr>
<tr class="odd">
<td>scores <img src="https://latex.codecogs.com/png.latex?QK%5E%5Ctop%20/%20%5Csqrt%7Bd_k%7D"></td>
<td>per head</td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> per head</td>
</tr>
<tr class="even">
<td>weights</td>
<td>mask + softmax</td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> per head</td>
</tr>
<tr class="odd">
<td>head output</td>
<td>weights <img src="https://latex.codecogs.com/png.latex?%5Ccdot%20V"></td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> per head</td>
</tr>
<tr class="even">
<td>concat</td>
<td>join 2 heads</td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%20(h%20%5Ccdot%20d_v)%20=%203%20%5Ctimes%206"></td>
</tr>
<tr class="odd">
<td>attn write-back</td>
<td>concat <img src="https://latex.codecogs.com/png.latex?%5Ccdot%20W_O"></td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206"></td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctext%7Bmid%7D%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?h%5E%7B(0)%7D%20+%20%5Ctext%7Battn%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206"></td>
</tr>
<tr class="odd">
<td>MLP pre-activation</td>
<td><img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctext%7Bmid%7D%7D%20W_1"></td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%20d_%7B%5Ctext%7Bff%7D%7D%20=%203%20%5Ctimes%208"></td>
</tr>
<tr class="even">
<td>MLP activation</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BReLU%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%208"></td>
</tr>
<tr class="odd">
<td>MLP output</td>
<td><img src="https://latex.codecogs.com/png.latex?%5Ccdot%20W_2"></td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206"></td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?h%5E%7B(1)%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctext%7Bmid%7D%7D%20+%20%5Ctext%7BMLP%7D"></td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206"></td>
</tr>
<tr class="odd">
<td>… repeat for layer 2 …</td>
<td></td>
<td></td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?h%5E%7B(2)%7D"></td>
<td>final hidden state</td>
<td><img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206"></td>
</tr>
<tr class="odd">
<td>last row <img src="https://latex.codecogs.com/png.latex?%5Ccdot%20U"></td>
<td>unembedding</td>
<td><img src="https://latex.codecogs.com/png.latex?%5BV%5D"></td>
</tr>
<tr class="even">
<td>softmax</td>
<td>distribution</td>
<td><img src="https://latex.codecogs.com/png.latex?%5BV%5D"></td>
</tr>
</tbody>
</table>
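<p>The shape table can also be checked mechanically. The sketch below tracks only shapes, never values, through one layer; <code>matmul_shape</code> is a hypothetical helper written for this note, and the variable names mirror the rows of the table.</p>

```python
def matmul_shape(a, b):
    """Shape of A @ B, or an error if the inner dimensions disagree."""
    if a[1] != b[0]:
        raise ValueError(f"cannot multiply {a} by {b}")
    return (a[0], b[1])

T, d_model, n_heads, d_ff = 3, 6, 2, 8
d_head = d_model // n_heads  # d_k = d_v = 3

h0 = (T, d_model)
q = matmul_shape(h0, (d_model, n_heads * d_head))         # Q for all heads: 3 x 6
q_head = (T, d_head)                                      # one head's slice: 3 x 3
scores = matmul_shape(q_head, (d_head, T))                # Q K^T per head: 3 x 3
head_out = matmul_shape(scores, (T, d_head))              # weights . V: 3 x 3
concat = (T, n_heads * d_head)                            # heads side by side: 3 x 6
attn = matmul_shape(concat, (n_heads * d_head, d_model))  # concat . W_O: 3 x 6
h_mid = attn                                              # residual add keeps the shape
pre = matmul_shape(h_mid, (d_model, d_ff))                # MLP expansion: 3 x 8
mlp = matmul_shape(pre, (d_ff, d_model))                  # MLP contraction: 3 x 6

for name, shape in [("Q", q), ("scores", scores), ("head_out", head_out),
                    ("attn", attn), ("pre", pre), ("mlp", mlp)]:
    print(name, shape)
```

<p>Changing <code>T</code> changes the first number of every shape and nothing else, which is a compact way to see that the weights do not depend on sequence length.</p>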
</section>
<section id="step-1-embedding-positional-h0" class="level3">
<h3 class="anchored" data-anchor-id="step-1-embedding-positional-h0">Step 1: embedding + positional → <img src="https://latex.codecogs.com/png.latex?h%5E%7B(0)%7D"></h3>
<p>Each token’s 6-dimensional embedding, plus a positional embedding for its slot, gives the initial residual stream <img src="https://latex.codecogs.com/png.latex?h%5E%7B(0)%7D">, shape <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206">:</p>
<pre><code>h⁰ = [[ 0.10, -0.10,  0.05,  0.30,  0.15, -0.05],   # the  (pos 1)
      [ 0.40,  0.50, -0.20,  0.20,  0.10,  0.25],   # cat  (pos 2)
      [-0.50,  0.15,  0.65, -0.20,  0.15,  0.05]]   # sat  (pos 3)</code></pre>
</section>
<section id="step-2-project-to-q-k-v-then-split-into-heads" class="level3">
<h3 class="anchored" data-anchor-id="step-2-project-to-q-k-v-then-split-into-heads">Step 2: project to Q, K, V, then split into heads</h3>
<p><img src="https://latex.codecogs.com/png.latex?W_Q"> is now <img src="https://latex.codecogs.com/png.latex?6%20%5Ctimes%206"> (general shape <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20(h%20%5Ccdot%20d_k)%20=%206%20%5Ctimes%206">), and likewise <img src="https://latex.codecogs.com/png.latex?W_K,%20W_V">. Multiplying <img src="https://latex.codecogs.com/png.latex?Q%20=%20h%5E%7B(0)%7D%20%5Ccdot%20W_Q"> gives a <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206"> matrix. The crucial new idea: those 6 columns are <strong>two heads’ worth of Q stacked side by side</strong> — columns 1–3 are head 1’s query, columns 4–6 are head 2’s. The vertical bar marks the split:</p>
<pre><code>Q = [[-0.24, -0.27, -0.43 | -0.05, -0.30, -0.01],   # the
     [-0.37,  0.29, -0.61 | -0.56,  0.17, -0.13],   # cat
     [ 0.20, -0.18, -0.00 |  0.37, -0.14,  0.09]]   # sat
       └──── head 1 ────┘   └──── head 2 ────┘</code></pre>
<p><img src="https://latex.codecogs.com/png.latex?K"> and <img src="https://latex.codecogs.com/png.latex?V"> are computed and split the same way. Each head now has its own <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> query, key, and value. <strong>This is what “multi-head” means: one matrix multiply, then carve the result into <img src="https://latex.codecogs.com/png.latex?h"> independent lanes</strong>, each running the attention formula on its own slice.</p>
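<p>The “carve into lanes” step is plain column slicing. The sketch below applies it to the rounded <code>Q</code> shown above; <code>split_heads</code> is a hypothetical helper, and real implementations typically do the same thing with a tensor reshape.</p>

```python
# Q for all heads, copied from the matrix shown above (T = 3, h * d_k = 6)
Q = [
    [-0.24, -0.27, -0.43, -0.05, -0.30, -0.01],  # the
    [-0.37, 0.29, -0.61, -0.56, 0.17, -0.13],    # cat
    [0.20, -0.18, 0.00, 0.37, -0.14, 0.09],      # sat
]

def split_heads(matrix, n_heads):
    """Cut each row into n_heads equal column blocks, one block per head."""
    d_head = len(matrix[0]) // n_heads
    return [
        [row[j * d_head:(j + 1) * d_head] for row in matrix]
        for j in range(n_heads)
    ]

Q1, Q2 = split_heads(Q, n_heads=2)
print(Q1)  # head 1: columns 1-3
print(Q2)  # head 2: columns 4-6
```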
</section>
<section id="step-3-attention-inside-each-head" class="level3">
<h3 class="anchored" data-anchor-id="step-3-attention-inside-each-head">Step 3: attention inside each head</h3>
<p>Run Steps 3–5 of the previous example <em>separately</em> in each lane.</p>
<p><strong>Head 1.</strong> Scaled scores <img src="https://latex.codecogs.com/png.latex?Q_1%20K_1%5E%5Ctop%20/%20%5Csqrt%7B3%7D">, then causal mask and softmax:</p>
<pre><code>scores₁ = [[-0.01,  -inf,  -inf],          weights₁ = [[1.00, 0.00, 0.00],
           [ 0.04,  0.07,  -inf],     →               [0.49, 0.51, 0.00],
           [-0.01, -0.04, -0.07]]                     [0.34, 0.33, 0.32]]</code></pre>
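<p>Head 1’s mask and softmax can be reproduced from the displayed scores. In this sketch the entries above the diagonal are filled with placeholder zeros, an assumption that is harmless because the mask overwrites them; <code>causal_softmax</code> is our own helper name.</p>

```python
import math

NEG_INF = float("-inf")

def causal_softmax(scores):
    """Mask entries above the diagonal, then apply softmax to each row."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if i >= j else NEG_INF for j, s in enumerate(row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Head 1 scaled scores from above; the entries above the diagonal are
# placeholders (0.0) because the mask discards them anyway
scores_1 = [
    [-0.01, 0.0, 0.0],
    [0.04, 0.07, 0.0],
    [-0.01, -0.04, -0.07],
]

weights_1 = causal_softmax(scores_1)
for row in weights_1:
    print([round(w, 2) for w in row])
```

<p>Because these scores are all close to zero, the weights come out nearly uniform over the visible positions, as one would expect from untrained random weights.</p>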
<p>Weighted sum of head 1’s <img src="https://latex.codecogs.com/png.latex?V"> gives head 1’s output, shape <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203">:</p>
<pre><code>head₁_out = [[-0.12, -0.04, -0.07],
             [-0.15,  0.40, -0.07],
             [-0.12,  0.20, -0.20]]</code></pre>
<p><strong>Head 2.</strong> Same procedure on head 2’s slice, giving its own pattern and output:</p>
<pre><code>weights₂ = [[1.00, 0.00, 0.00],         head₂_out = [[ 0.06, -0.03,  0.01],
            [0.45, 0.55, 0.00],     →                [ 0.04, -0.06, -0.05],
            [0.36, 0.32, 0.32]]                      [-0.16,  0.11,  0.03]]</code></pre>
<p>Notice the two heads produce <em>different</em> attention patterns from the same input — head 1 weights “cat” slightly more on row 2 (0.51), head 2 weights it more strongly (0.55). Each head is free to specialize. This is the structural fact the interpretability experiments hinge on: <strong>distinct heads can do distinct jobs</strong>, and you can study them one at a time.</p>
</section>
<section id="step-4-concatenate-the-heads" class="level3">
<h3 class="anchored" data-anchor-id="step-4-concatenate-the-heads">Step 4: concatenate the heads</h3>
<p>Glue the two <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> head outputs back together, side by side, into one <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%20(h%20%5Ccdot%20d_v)%20=%203%20%5Ctimes%206"> matrix:</p>
<pre><code>concat = [[-0.12, -0.04, -0.07 |  0.06, -0.03,  0.01],
          [-0.15,  0.40, -0.07 |  0.04, -0.06, -0.05],
          [-0.12,  0.20, -0.20 | -0.16,  0.11,  0.03]]
            └─ head 1 out ──┘    └─ head 2 out ──┘</code></pre>
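<p>Concatenation is the mirror image of the earlier split. A small sketch using the two head outputs displayed above; <code>concat_heads</code> is a hypothetical helper name.</p>

```python
# Head outputs copied from the two matrices shown above (each 3 x 3)
head1_out = [
    [-0.12, -0.04, -0.07],
    [-0.15, 0.40, -0.07],
    [-0.12, 0.20, -0.20],
]
head2_out = [
    [0.06, -0.03, 0.01],
    [0.04, -0.06, -0.05],
    [-0.16, 0.11, 0.03],
]

def concat_heads(heads):
    """Join per-head outputs side by side, one token (row) at a time."""
    return [sum(rows, []) for rows in zip(*heads)]

concat = concat_heads([head1_out, head2_out])
for row in concat:
    print(row)
```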
</section>
<section id="step-5-w_o-and-the-residual-addition" class="level3">
<h3 class="anchored" data-anchor-id="step-5-w_o-and-the-residual-addition">Step 5: <img src="https://latex.codecogs.com/png.latex?W_O"> and the residual addition</h3>
<p><img src="https://latex.codecogs.com/png.latex?W_O"> has shape <img src="https://latex.codecogs.com/png.latex?(h%20%5Ccdot%20d_v)%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D%20=%206%20%5Ctimes%206">. It mixes the two heads’ information together and maps it back to the residual width. The product <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bconcat%7D%20%5Ccdot%20W_O"> is the attention block’s <strong>write-back</strong>, shape <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206">:</p>
<pre><code>attn = [[-0.06,  0.10, -0.11,  0.05,  0.06,  0.01],
        [-0.03, -0.09, -0.10, -0.15,  0.30,  0.08],
        [ 0.19, -0.20,  0.23,  0.09,  0.34, -0.10]]</code></pre>
<p>Add it to the residual stream — <img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctext%7Bmid%7D%7D%20=%20h%5E%7B(0)%7D%20+%20%5Ctext%7Battn%7D">, still <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206">:</p>
<pre><code>h_mid = [[ 0.04, -0.00, -0.06,  0.35,  0.21, -0.04],
         [ 0.37,  0.41, -0.30,  0.05,  0.40,  0.33],
         [-0.31, -0.05,  0.88, -0.11,  0.49, -0.05]]</code></pre>
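<p>The residual addition is an element-wise sum, which the following sketch checks against the displayed matrices. It uses the rounded values of the initial residual stream and of the write-back, so agreement is to two decimals.</p>

```python
# Initial residual stream and attention write-back, as displayed above
h0 = [
    [0.10, -0.10, 0.05, 0.30, 0.15, -0.05],
    [0.40, 0.50, -0.20, 0.20, 0.10, 0.25],
    [-0.50, 0.15, 0.65, -0.20, 0.15, 0.05],
]
attn = [
    [-0.06, 0.10, -0.11, 0.05, 0.06, 0.01],
    [-0.03, -0.09, -0.10, -0.15, 0.30, 0.08],
    [0.19, -0.20, 0.23, 0.09, 0.34, -0.10],
]

def add_residual(stream, update):
    """Element-wise sum: the block's output is added to the stream, not swapped in."""
    return [[s + u for s, u in zip(s_row, u_row)]
            for s_row, u_row in zip(stream, update)]

h_mid = add_residual(h0, attn)
for row in h_mid:
    print([round(v, 2) for v in row])
```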
</section>
<section id="step-6-the-mlp-block" class="level3">
<h3 class="anchored" data-anchor-id="step-6-the-mlp-block">Step 6: the MLP block</h3>
<p>Now the piece the first example skipped. The MLP acts on the residual stream <strong>one token at a time</strong> — the same <img src="https://latex.codecogs.com/png.latex?W_1,%20W_2"> applied to every row independently, with no interaction between positions. (Hold onto that fact; the Appendix turns on it.)</p>
<p>First, expand from width 6 to width <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bff%7D%7D%20=%208"> via <img src="https://latex.codecogs.com/png.latex?W_1"> (shape <img src="https://latex.codecogs.com/png.latex?6%20%5Ctimes%208">): <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bpre%7D%20=%20h_%7B%5Ctext%7Bmid%7D%7D%20%5Ccdot%20W_1">, shape <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%208">:</p>
<pre><code>pre = [[ 0.16, -0.07, -0.14,  0.10, -0.12, -0.05, -0.06, -0.29],
       [ 0.17, -0.38,  0.18,  0.24, -0.11, -0.26,  0.35, -0.24],
       [-0.02,  0.22,  0.51, -0.66, -0.29, -0.36, -0.45,  0.36]]</code></pre>
<p>Apply ReLU: every negative entry becomes 0. This is the MLP’s only nonlinearity (the softmax inside attention is the layer’s other nonlinear step), and it’s what lets the MLP do more than a plain matrix multiply:</p>
<pre><code>act = [[0.16, 0.00, 0.00, 0.10, 0.00, 0.00, 0.00, 0.00],
       [0.17, 0.00, 0.18, 0.24, 0.00, 0.00, 0.35, 0.00],
       [0.00, 0.22, 0.51, 0.00, 0.00, 0.00, 0.00, 0.36]]</code></pre>
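<p>ReLU is simple enough to verify directly on the displayed pre-activations. The sketch also counts how many of the eight hidden units stay active for each token, a quantity the text does not discuss but that falls out of the same matrix.</p>

```python
# MLP pre-activations copied from the matrix shown above (3 x 8)
pre = [
    [0.16, -0.07, -0.14, 0.10, -0.12, -0.05, -0.06, -0.29],
    [0.17, -0.38, 0.18, 0.24, -0.11, -0.26, 0.35, -0.24],
    [-0.02, 0.22, 0.51, -0.66, -0.29, -0.36, -0.45, 0.36],
]

def relu(matrix):
    """Element-wise max(0, z): negatives become 0, positives pass through."""
    return [[max(0.0, z) for z in row] for row in matrix]

act = relu(pre)
for row in act:
    print(row)

# Number of hidden units each token leaves switched on
active_counts = [sum(1 for z in row if z > 0) for row in act]
print(active_counts)
```

<p>Each token switches on a different subset of hidden units, which is one way to picture the MLP treating tokens differently even though it applies the same weights to every row.</p>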
<p>Then contract back from width 8 to width 6 via <img src="https://latex.codecogs.com/png.latex?W_2"> (shape <img src="https://latex.codecogs.com/png.latex?8%20%5Ctimes%206">): <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bmlp%7D%20=%20%5Ctext%7Bact%7D%20%5Ccdot%20W_2">, shape <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206">:</p>
<pre><code>mlp = [[ 0.07,  0.03, -0.04, -0.05,  0.08,  0.07],
       [ 0.08,  0.07, -0.22, -0.48,  0.24,  0.05],
       [ 0.79,  0.29, -0.27, -0.34,  0.26, -0.42]]</code></pre>
<p>Add it back to the residual stream — <img src="https://latex.codecogs.com/png.latex?h%5E%7B(1)%7D%20=%20h_%7B%5Ctext%7Bmid%7D%7D%20+%20%5Ctext%7Bmlp%7D">, shape <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206">. <strong>Layer 1 is now complete:</strong></p>
<pre><code>h¹ = [[ 0.12,  0.03, -0.10,  0.31,  0.29,  0.04],
      [ 0.45,  0.48, -0.52, -0.43,  0.64,  0.38],
      [ 0.47,  0.24,  0.61, -0.45,  0.75, -0.47]]</code></pre>
<p>Note the shape went <img src="https://latex.codecogs.com/png.latex?6%20%5Cto%208%20%5Cto%206"> inside the MLP: the residual stream stays width 6 everywhere; the width-8 expansion happens <em>only inside</em> the block and is contracted away before the write-back. The residual stream’s width <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D"> is invariant — that constancy is what lets every block read and write the same <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"> tape.</p>
</section>
<section id="step-7-layer-2-same-shapes-new-weights" class="level3">
<h3 class="anchored" data-anchor-id="step-7-layer-2-same-shapes-new-weights">Step 7: layer 2 (same shapes, new weights)</h3>
<p>Layer 2 has its own <img src="https://latex.codecogs.com/png.latex?W_Q,%20W_K,%20W_V,%20W_O,%20W_1,%20W_2">, but the shapes and the procedure are <strong>identical</strong> to layer 1. It reads <img src="https://latex.codecogs.com/png.latex?h%5E%7B(1)%7D">, runs two-head attention, writes back, runs its MLP, writes back, and produces <img src="https://latex.codecogs.com/png.latex?h%5E%7B(2)%7D">. We show only the write-backs and the result:</p>
<pre><code>attn² (write-back)   h_mid² = h¹ + attn²        mlp²                 h² = h_mid² + mlp²
[[-0.09,-0.27, 0.01,  [[ 0.02,-0.24,-0.09,       [[-0.04,-0.27,-0.37,  [[-0.02,-0.51,-0.46,
  -0.10,-0.02,-0.08],    0.21, 0.27,-0.05],         -0.66, 0.06, 0.18],   -0.46, 0.33, 0.14],
 [-0.20,-0.48, 0.08,   [ 0.25, 0.01,-0.44,        [-0.17,-0.40,-0.58,   [ 0.08,-0.39,-1.01,
   0.12,-0.06,-0.39],    -0.31, 0.58,-0.02],        -0.93, 0.17, 0.33],   -1.24, 0.75, 0.32],
 [-0.46,-0.16, 0.31,   [ 0.02, 0.08, 0.92,        [-0.22,-0.53,-1.20,   [-0.21,-0.45,-0.28,
   0.21,-0.28,-0.47]]    -0.24, 0.47,-0.94]]        -2.21, 0.90, 0.26]]   -2.45, 1.37,-0.67]]</code></pre>
<p>Each of these is <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206">. After <img src="https://latex.codecogs.com/png.latex?L%20=%202"> layers the residual stream is still <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%206">, the same shape it started as. Every layer changed the <em>contents</em>, never the <em>shape</em>.</p>
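<p>The two write-backs in layer 2 are plain additions, so they can be replayed from the printed numbers. This is a minimal sketch that re-adds the rounded matrices from the table above; the helper <code>add</code> is a hypothetical name. Because every displayed entry is rounded to two decimals, the recomputed sum is expected to agree with the printed result only to within about 0.01 per entry.</p>

```python
# Layer-2 matrices copied from the table above (all 3 x 6, two-decimal rounding).
h1 = [[0.12, 0.03, -0.10, 0.31, 0.29, 0.04],
      [0.45, 0.48, -0.52, -0.43, 0.64, 0.38],
      [0.47, 0.24, 0.61, -0.45, 0.75, -0.47]]
attn2 = [[-0.09, -0.27, 0.01, -0.10, -0.02, -0.08],
         [-0.20, -0.48, 0.08, 0.12, -0.06, -0.39],
         [-0.46, -0.16, 0.31, 0.21, -0.28, -0.47]]
mlp2 = [[-0.04, -0.27, -0.37, -0.66, 0.06, 0.18],
        [-0.17, -0.40, -0.58, -0.93, 0.17, 0.33],
        [-0.22, -0.53, -1.20, -2.21, 0.90, 0.26]]
h2_shown = [[-0.02, -0.51, -0.46, -0.46, 0.33, 0.14],
            [0.08, -0.39, -1.01, -1.24, 0.75, 0.32],
            [-0.21, -0.45, -0.28, -2.45, 1.37, -0.67]]

def add(a, b):
    """Entry-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

h_mid2 = add(h1, attn2)   # attention write-back
h2 = add(h_mid2, mlp2)    # MLP write-back

# Largest disagreement between the recomputed and the printed result.
max_gap = max(abs(x - y) for rc, rs in zip(h2, h2_shown) for x, y in zip(rc, rs))
print("shape:", (len(h2), len(h2[0])), "| max gap vs printed result:", round(max_gap, 3))
```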
</section>
<section id="step-8-read-out-the-last-token" class="level3">
<h3 class="anchored" data-anchor-id="step-8-read-out-the-last-token">Step 8: read out the last token</h3>
<p>Exactly as in Section 5. Take the last row of <img src="https://latex.codecogs.com/png.latex?h%5E%7B(2)%7D"> (the representation for “sat”, now informed by two full layers of attention + MLP):</p>
<pre><code>h²_last = [-0.21, -0.45, -0.28, -2.45,  1.37, -0.67]</code></pre>
<p>Multiply by the unembedding <img src="https://latex.codecogs.com/png.latex?U"> (shape <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Ctimes%20V%20=%206%20%5Ctimes%20V">) to get logits, then softmax. Showing the first 10 vocabulary columns:</p>
<pre><code>logits = [-0.48,  0.27,  2.01, -0.91,  0.35, -1.04, -1.80,  1.03,  0.60,  0.36]
probs  = [ 0.04,  0.07,  0.42,  0.02,  0.08,  0.02,  0.01,  0.16,  0.10,  0.08]</code></pre>
<p>(With random, untrained weights the specific winner carries no meaning — what matters is that the pipeline produced a clean <img src="https://latex.codecogs.com/png.latex?%5BV%5D">-shaped distribution that sums to 1.)</p>
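<p>As a check on the last step, the sketch below applies softmax to the ten logits printed above. It normalises over those ten values only, which is an assumption of this sketch (the full vocabulary may be wider than the columns displayed). The function name <code>softmax</code> is illustrative.</p>

```python
import math

# The ten logits displayed above.
logits = [-0.48, 0.27, 2.01, -0.91, 0.35, -1.04, -1.80, 1.03, 0.60, 0.36]

def softmax(scores):
    """Exponentiate (after subtracting the max for stability), then normalise."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
winner = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 2) for p in probs])
print("sum =", round(sum(probs), 6), "| argmax column =", winner)
```

<p>The rounded output should track the printed <code>probs</code> row closely; a difference in the last digit is possible because the logits themselves are shown rounded.</p>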
</section>
<section id="what-this-example-added" class="level3">
<h3 class="anchored" data-anchor-id="what-this-example-added">What this example added</h3>
<p>Reading the shape table top to bottom, three things are now visible that the single-head example could not show:</p>
<ol type="1">
<li><strong>Heads are lanes.</strong> One <img src="https://latex.codecogs.com/png.latex?W_Q"> multiply produces all heads at once; the result is <em>sliced</em> into <img src="https://latex.codecogs.com/png.latex?h"> independent <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_k"> blocks, each running attention alone, then concatenated and remixed by <img src="https://latex.codecogs.com/png.latex?W_O">. The width bookkeeping is <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Cto%20(h%20%5Ccdot%20d_k)%20%5Cto%20d_%7B%5Ctext%7Bmodel%7D%7D">, and here <img src="https://latex.codecogs.com/png.latex?h%20%5Ccdot%20d_k%20=%202%20%5Ccdot%203%20=%206%20=%20d_%7B%5Ctext%7Bmodel%7D%7D"> exactly.</li>
<li><strong>The MLP is a per-token refinery.</strong> It expands each token’s vector to <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bff%7D%7D">, applies ReLU, contracts back, and adds the result to the residual — touching each position in isolation. The width bookkeeping is <img src="https://latex.codecogs.com/png.latex?d_%7B%5Ctext%7Bmodel%7D%7D%20%5Cto%20d_%7B%5Ctext%7Bff%7D%7D%20%5Cto%20d_%7B%5Ctext%7Bmodel%7D%7D">.</li>
<li><strong>Layers stack without changing shape.</strong> The residual stream is a <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"> tape that every block reads from and writes to additively. Stacking <img src="https://latex.codecogs.com/png.latex?L"> layers just repeats the read–transform–write cycle; the shape is invariant from <img src="https://latex.codecogs.com/png.latex?h%5E%7B(0)%7D"> to <img src="https://latex.codecogs.com/png.latex?h%5E%7B(L)%7D">.</li>
</ol>
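<p>The “heads are lanes” bookkeeping can be sketched with plain list slicing. The matrix below is a placeholder for a <code>T x d_model</code> block of queries (its values mean nothing), and <code>n_heads</code> stands in for the number of heads to avoid clashing with the hidden state’s name; both are assumptions of this illustration.</p>

```python
T, d_model, n_heads = 3, 6, 2
d_k = d_model // n_heads  # 3 columns per head

# A stand-in T x d_model matrix of queries; the values are placeholders.
Q = [[10 * t + c for c in range(d_model)] for t in range(T)]

# Slice the columns into n_heads independent T x d_k lanes.
heads = [[row[i * d_k:(i + 1) * d_k] for row in Q] for i in range(n_heads)]

# Concatenating the lanes side by side restores the full T x d_model width.
concat = [sum((heads[i][t] for i in range(n_heads)), []) for t in range(T)]

for i, lane in enumerate(heads):
    print("head", i, "shape:", (len(lane), len(lane[0])))
print("concat shape:", (len(concat), len(concat[0])), "| equals Q:", concat == Q)
```

<p>Slicing and concatenation are exact inverses here because <code>n_heads * d_k</code> equals <code>d_model</code>; the mixing across lanes is left to the write-back matrix, which this sketch omits.</p>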
</section>
</section>
<section id="recap-of-the-data-flow" class="level2">
<h2 class="anchored" data-anchor-id="recap-of-the-data-flow">7. Recap of the data flow</h2>
<p>Shapes shown per head where the per-head structure matters; <img src="https://latex.codecogs.com/png.latex?h%20%5Ccdot%20d_k%20=%20h%20%5Ccdot%20d_v%20=%20d_%7B%5Ctext%7Bmodel%7D%7D"> in most architectures, but they are conceptually distinct.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TB
    A["token IDs &amp;nbsp; [T]"] --&gt;|"embedding lookup&lt;br/&gt;E: V × d_model"| B["hidden state h⁰ &amp;nbsp; [T × d_model]&lt;br/&gt;initial residual stream"]
    B --&gt;|"project via W_Q, W_K: d_model × (h·d_k),&lt;br/&gt;W_V: d_model × (h·d_v)"| C["Q, K [T × d_k]; V [T × d_v]&lt;br/&gt;per head"]
    C --&gt;|"per-head QKᵀ / √d_k"| D["attention scores &amp;nbsp; [T × T] per head"]
    D --&gt;|"causal mask + softmax"| E["attention weights &amp;nbsp; [T × T] per head&lt;br/&gt;rows sum to 1"]
    E --&gt;|"multiply by V, per head"| F["per-head output &amp;nbsp; [T × d_v]"]
    F --&gt;|"concatenate the h heads"| G["concat output &amp;nbsp; [T × (h·d_v)]"]
    G --&gt;|"W_O: (h·d_v) × d_model,&lt;br/&gt;add to residual"| H["hidden state after attention &amp;nbsp; [T × d_model]"]
    H --&gt;|"MLP: W₁, ReLU, W₂,&lt;br/&gt;add to residual"| I["hidden state h¹ &amp;nbsp; [T × d_model]&lt;br/&gt;after MLP block"]
    I --&gt;|"more layers ... eventually layer L"| J["final hidden state &amp;nbsp; [T × d_model]"]
    J --&gt;|"take last row × U: d_model × V"| K["logits (last token) &amp;nbsp; [V]"]
    K --&gt;|"softmax"| L["next-token distribution &amp;nbsp; [V]&lt;br/&gt;sums to 1"]
    L --&gt;|"sample"| M["next token ID &amp;nbsp; scalar in [0, V)"]
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
<p><strong>Things to keep clear in your head:</strong></p>
<ol type="1">
<li><strong>Q is one-shot, K and V persist.</strong> That’s why the cache is “KV” — Q for past tokens is never needed again.</li>
<li><strong>The residual stream is the spine.</strong> Every attention and MLP block reads from it and writes back to it (additively). All the “thinking” passes through this <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"> tensor.</li>
<li><strong>Per-head V has shape <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_v">, not <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D">.</strong> The full-width <img src="https://latex.codecogs.com/png.latex?T%20%5Ctimes%20d_%7B%5Ctext%7Bmodel%7D%7D"> shape only appears <em>after</em> concatenating the <img src="https://latex.codecogs.com/png.latex?h"> heads (and only if <img src="https://latex.codecogs.com/png.latex?h%20%5Ccdot%20d_v%20=%20d_%7B%5Ctext%7Bmodel%7D%7D">, which is the usual but not mandatory choice). <img src="https://latex.codecogs.com/png.latex?W_O"> is the operator that maps that concatenated block back into the residual.</li>
<li><strong>Only the last token’s logits matter for next-token prediction at inference time.</strong> During <em>training</em> you compute logits for every position (to predict each next token in parallel), but at generation time you only need the last.</li>
<li><strong>Logits are unnormalized; softmax produces the actual distribution.</strong> Temperature, top-k, top-p sampling all operate on logits or the resulting distribution to control output diversity.</li>
<li><strong>Attention is the only cross-token channel; the MLP is per-position.</strong> Attention blocks are the <em>only</em> place where one token’s representation can be influenced by another’s. MLP blocks refine each token in isolation. So every <em>relational</em> thing a model does — agreement, coreference, copying, “don’t repeat the previous speaker” — must be carried by attention. Appendix A makes this precise by switching attention off; Appendix B works a small CNN by hand to sharpen the filter-vs-head contrast from Section 1.</li>
</ol>
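<p>Point 5 can be made concrete with a few lines. The sketch below shows temperature scaling and top-k filtering on a made-up four-token logit vector; the function names and the numbers are illustrative assumptions, not values from the worked example, and top-p is left out for brevity.</p>

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def with_temperature(logits, tau):
    """Divide logits by tau before softmax: small tau sharpens, large tau flattens."""
    return softmax([z / tau for z in logits])

def top_k(probs, k):
    """Keep the k most probable tokens, zero the rest, and renormalise."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

logits = [2.0, 1.0, 0.0, -1.0]  # illustrative values, not from the model
sharp = with_temperature(logits, 0.5)
plain = with_temperature(logits, 1.0)
flat = with_temperature(logits, 2.0)
top2 = top_k(plain, 2)

for name, dist in [("tau=0.5", sharp), ("tau=1.0", plain), ("tau=2.0", flat), ("top-2", top2)]:
    print(name, [round(p, 3) for p in dist])
```

<p>Temperature acts on the logits, while top-k acts on the resulting distribution; in both cases the model’s forward pass is untouched and only the final read-out changes.</p>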
</section>
<section id="appendix-a-what-if-attention-heads-were-disabled" class="level2">
<h2 class="anchored" data-anchor-id="appendix-a-what-if-attention-heads-were-disabled">Appendix A: What if attention heads were disabled?</h2>
<p>A Transformer, stripped to its skeleton, is a stack of <strong>MLP blocks</strong> with <strong>attention blocks</strong> inserted between them, all communicating through one residual stream. A classic feed-forward neural network is essentially just the MLP part. So a natural question — and a useful one for understanding what attention <em>buys</em> you — is: <strong>what happens if we switch the attention off?</strong> There are two clean ways to do it, and they arrive at the same destination.</p>
<section id="the-baseline-a-network-with-only-mlp-blocks" class="level3">
<h3 class="anchored" data-anchor-id="the-baseline-a-network-with-only-mlp-blocks">The baseline: a network with only MLP blocks</h3>
<p>Recall from Section 6 that an MLP block processes the residual stream <strong>one token at a time</strong>: the same <img src="https://latex.codecogs.com/png.latex?W_1">, ReLU, <img src="https://latex.codecogs.com/png.latex?W_2"> applied to each row independently, with no reference to any other position. So a network built <em>only</em> from MLP blocks processes every token in complete isolation. It can learn a fixed mapping “this token (at this position) → that output,” i.e.&nbsp;per-token and position-conditioned statistics — but it has <strong>no mechanism for one token to influence another’s representation.</strong> Whatever “sat” becomes, it becomes without ever consulting “the” or “cat.”</p>
</section>
<section id="method-1-remove-the-attention-layers-entirely" class="level3">
<h3 class="anchored" data-anchor-id="method-1-remove-the-attention-layers-entirely">Method 1: remove the attention layers entirely</h3>
<p>Delete the attention sub-blocks and wire the embeddings straight into the MLP stack. The update rule becomes simply</p>
<p><img src="https://latex.codecogs.com/png.latex?h%5E%7B(%5Cell+1)%7D%20=%20h%5E%7B(%5Cell)%7D%20+%20%5Ctext%7BMLP%7D(h%5E%7B(%5Cell)%7D),%20%5Cquad%20%5Ctext%7Bper%20token,%20no%20mixing.%7D"></p>
<p>This <em>is</em> the pure-MLP baseline. In the running example, the representation of “sat” can now never absorb anything from “the” or “cat”: each row of the residual stream (one token) is processed down its own private pipe. A <strong>relational</strong> rule is therefore impossible. Anything that requires <em>comparing two positions</em>, for instance “the next item must differ from the current one,” cannot be expressed, because the two positions never meet.</p>
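<p>The isolation can be demonstrated directly. The sketch below builds a two-layer MLP-only stack with random, hypothetical weights (widths 6 and 8 as in the worked example), changes only the first token’s input vector, and checks that the last token’s output does not move. All names are illustrative.</p>

```python
import random

random.seed(0)
d_model, d_ff, n_layers = 6, 8, 2

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical random weights, one (W1, W2) pair per layer.
layers = [(rand_matrix(d_model, d_ff), rand_matrix(d_ff, d_model)) for _ in range(n_layers)]

def vec_mat(v, W):
    """Row vector times matrix."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def mlp_only_forward(h):
    """Apply h = h + MLP(h) layer by layer; each row (token) is handled on its own."""
    out = []
    for row in h:
        for W1, W2 in layers:
            hidden = [max(0.0, x) for x in vec_mat(row, W1)]  # expand, then ReLU
            update = vec_mat(hidden, W2)                       # contract back
            row = [r + u for r, u in zip(row, update)]         # residual add
        out.append(row)
    return out

h0 = rand_matrix(3, d_model)               # stand-ins for "the", "cat", "sat"
h0_changed = [list(r) for r in h0]
h0_changed[0] = [x + 1.0 for x in h0[0]]   # perturb only the first token

before = mlp_only_forward(h0)
after = mlp_only_forward(h0_changed)

print("last token unchanged:", before[2] == after[2])
print("first token changed: ", before[0] != after[0])
```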
</section>
<section id="method-2-keep-the-architecture-make-attention-transparent" class="level3">
<h3 class="anchored" data-anchor-id="method-2-keep-the-architecture-make-attention-transparent">Method 2: keep the architecture, make attention transparent</h3>
<p>The second way disables attention <em>without deleting anything</em>. Keep the full Transformer architecture — all the <img src="https://latex.codecogs.com/png.latex?W_Q,%20W_K,%20W_V,%20W_O"> machinery — but fabricate the weights so that the attention block <strong>writes nothing into the residual stream.</strong></p>
<p>Recall the write-back from Section 6: the attention block’s contribution is <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Battn%7D%20=%20%5Ctext%7Bconcat%7D(%5Ctext%7Bhead%20outputs%7D)%20%5Ccdot%20W_O">, and it is <em>added</em> to the residual. So set</p>
<p><img src="https://latex.codecogs.com/png.latex?W_O%20=%200%20%5Cquad%20(%5Ctext%7Bequivalently%20%7D%20W_V%20=%200)."></p>
<p>Now, no matter what attention pattern <img src="https://latex.codecogs.com/png.latex?QK%5E%5Ctop"> computes, the value added to the residual is the zero vector:</p>
<p><img src="https://latex.codecogs.com/png.latex?h_%7B%5Ctext%7Bmid%7D%7D%20=%20h%5E%7B(%5Cell)%7D%20+%200%20=%20h%5E%7B(%5Cell)%7D."></p>
<p>The residual stream passes through the attention block <strong>untouched</strong>. The Q/K/V machinery still runs, still computes attention patterns — but those patterns are <em>transparent</em>: they have no effect on anything downstream. Functionally, the model is once again the pure-MLP network of Method 1.</p>
<p>(You might ask whether there’s a subtler “identity” fabrication: make each token attend only to itself, so attention copies each value through. That doesn’t give transparency. Even in the best case, where <img src="https://latex.codecogs.com/png.latex?W_V%20W_O"> is the identity, the block adds a copy of each token’s own vector to the stream and so <em>doubles</em> it rather than leaving it unchanged. True transparency means the block adds <strong>zero</strong>, which is what zeroing the write-back achieves.)</p>
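<p>A small numerical check of Method 2, under simplifying assumptions: one causal attention head whose width equals the model width, random hypothetical weights, and no biases or normalisation. With the write-back matrix set to zero, the attention pattern is still computed but nothing reaches the residual stream.</p>

```python
import math
import random

random.seed(1)
T, d_model = 3, 6

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def causal_attention(h, W_Q, W_K, W_V, W_O):
    """One attention head with a causal mask; returns the write-back (T x d_model)."""
    Q, K, V = matmul(h, W_Q), matmul(h, W_K), matmul(h, W_V)
    d_k = len(Q[0])
    mixed = []
    for t in range(T):
        # Scores against positions 0..t only (the causal mask drops the future).
        scores = [sum(q * k for q, k in zip(Q[t], K[s])) / math.sqrt(d_k) for s in range(t + 1)]
        top = max(scores)
        exps = [math.exp(x - top) for x in scores]
        weights = [e / sum(exps) for e in exps]
        mixed.append([sum(w * V[s][j] for s, w in enumerate(weights)) for j in range(len(V[0]))])
    return matmul(mixed, W_O)

h = rand_matrix(T, d_model)
W_Q, W_K, W_V = (rand_matrix(d_model, d_model) for _ in range(3))
W_O_zero = [[0.0] * d_model for _ in range(d_model)]

attn = causal_attention(h, W_Q, W_K, W_V, W_O_zero)
h_mid = [[x + a for x, a in zip(hr, ar)] for hr, ar in zip(h, attn)]

print("write-back is all zeros:", all(a == 0.0 for row in attn for a in row))
print("residual untouched:     ", h_mid == h)
```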
</section>
<section id="both-roads-lead-to-the-same-place" class="level3">
<h3 class="anchored" data-anchor-id="both-roads-lead-to-the-same-place">Both roads lead to the same place</h3>
<p>Whether you disable attention by <strong>deletion</strong> (Method 1) or by <strong>making it transparent</strong> (Method 2), the Transformer collapses to a <strong>per-position MLP network</strong> — a stack that refines each token in isolation and can never move information between positions. This is the precise sense in which:</p>
<blockquote class="blockquote">
<p><strong>Attention is the only channel through which tokens communicate.</strong> Everything <em>relational</em> a language model does must be carried by attention, because it is the only operation that moves information across positions.</p>
</blockquote>
<p>That is also why the tiny-model story these notes accompany is about attention <em>heads</em> specifically: if a rule relates one token to another, the circuit that enforces it has to live in the attention machinery — there is nowhere else for it to be.</p>
</section>
<section id="why-method-2-is-the-important-one-ablation" class="level3">
<h3 class="anchored" data-anchor-id="why-method-2-is-the-important-one-ablation">Why Method 2 is the important one: ablation</h3>
<p>Method 2 has a feature deletion lacks: it is <strong>selective</strong>. The write-back matrix <img src="https://latex.codecogs.com/png.latex?W_O"> is organized in blocks — one slice of rows per head (Section 6, Step 5). Zero out <strong>just one head’s slice</strong>, and you make <em>exactly that head</em> transparent while leaving every other head working. Re-run the model, measure what changes, and you have a causal probe: <em>what does this one head actually do?</em></p>
<p>This per-head version of Method 2 is called <strong>ablation</strong>, and it is the workhorse of mechanistic interpretability. Two findings from tiny, fully-understood models show why it matters:</p>
<ul>
<li><strong>A rule can rest on a single head.</strong> Ablate that one head and a behavior the model performed perfectly collapses; ablate any other head and nothing changes. The behavior was carried, causally, by one specific lane in one specific layer.</li>
<li><strong>Attention patterns can mislead.</strong> A head whose attention <em>looks</em> like it implements a rule — say it stares almost entirely at the relevant earlier token — may, when ablated, turn out to change nothing: it was not load-bearing. Conversely a head with a messy, unremarkable-looking pattern may be the one holding the rule. The attention pattern tells you what a head <strong>looks at</strong>; only ablation tells you what it <strong>does</strong>.</li>
</ul>
<p>That second lesson is worth underlining, because it is exactly where intuition goes wrong: you cannot read a head’s <em>function</em> off its attention picture. You have to switch the head off — Method 2, one head at a time — and watch what breaks. Disabling attention, far from being a destructive curiosity, is therefore the single most useful tool for finding out <strong>where in a network a behavior lives</strong> — which is the question the accompanying tiny-language-model experiments are built to answer.</p>
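<p>The selectivity of Method 2 rests on a piece of linear algebra that is easy to verify: zeroing one head’s rows of the write-back matrix removes exactly that head’s contribution and nothing else. The sketch below checks this on random stand-in numbers for a two-head layer. It covers only the write-back; a real ablation study would also re-run the model and measure the change in behaviour. All names and values are hypothetical.</p>

```python
import random

random.seed(2)
T, d_model, n_heads, d_v = 3, 6, 2, 3

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Stand-ins: concatenated head outputs (T x n_heads*d_v) and the write-back matrix.
concat = rand_matrix(T, n_heads * d_v)
W_O = rand_matrix(n_heads * d_v, d_model)

def ablate_head(W, head):
    """Return a copy of W with the rows belonging to one head set to zero."""
    rows = range(head * d_v, (head + 1) * d_v)
    return [[0.0] * len(r) if i in rows else list(r) for i, r in enumerate(W)]

full = matmul(concat, W_O)
without_head0 = matmul(concat, ablate_head(W_O, 0))

# What head 1 writes on its own: its output columns times its slice of W_O.
head1_only = matmul([row[d_v:] for row in concat], W_O[d_v:])

gap = max(abs(a - b) for ra, rb in zip(without_head0, head1_only) for a, b in zip(ra, rb))
print("max gap between ablated write-back and head 1 alone:", gap)
print("write-back changed by the ablation:", full != without_head0)
```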
</section>
</section>
<section id="appendix-b-a-convolutional-ocr-pass-by-hand" class="level2">
<h2 class="anchored" data-anchor-id="appendix-b-a-convolutional-ocr-pass-by-hand">Appendix B: A convolutional OCR pass, by hand</h2>
<p>Section 1 sketched the OCR convolutional network in words. Here we run one through, with numbers small enough to check by hand, so the <strong>filter</strong> is as concrete as the <strong>head</strong>. Then we lay the two side by side.</p>
<section id="the-image" class="level3">
<h3 class="anchored" data-anchor-id="the-image">The image</h3>
<p>Take a tiny <img src="https://latex.codecogs.com/png.latex?5%20%5Ctimes%205"> grayscale image. Each pixel is <img src="https://latex.codecogs.com/png.latex?0"> (blank) or <img src="https://latex.codecogs.com/png.latex?1"> (ink). This one shows a <strong>vertical stroke</strong> down the middle column — the kind of mark that distinguishes, say, a <code>1</code> or the spine of a <code>T</code>:</p>
<pre><code>       col: 0 1 2 3 4
row 0:      0 0 1 0 0
row 1:      0 0 1 0 0
row 2:      0 0 1 0 0
row 3:      0 0 1 0 0
row 4:      0 0 1 0 0</code></pre>
</section>
<section id="one-filter-sliding" class="level3">
<h3 class="anchored" data-anchor-id="one-filter-sliding">One filter, sliding</h3>
<p>A filter is a small fixed grid of weights. Here is a <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> <strong>vertical-stroke detector</strong>: it rewards ink in its center column and punishes ink on either side, so it responds most strongly to a vertical line.</p>
<pre><code>F_vert = [[-1,  2, -1],
          [-1,  2, -1],
          [-1,  2, -1]]</code></pre>
<p>To convolve, we slide this filter over every <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> window of the image; at each stop we multiply overlapping cells and sum to a single number. A <img src="https://latex.codecogs.com/png.latex?5%20%5Ctimes%205"> image with a <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> filter has <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203%20=%209"> valid stops, so the output (the <strong>feature map</strong>) is <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203">.</p>
<p>Look at two stops to see the mechanism:</p>
<p><strong>Top-left window</strong> (rows 0–2, cols 0–2) — the stroke is off to the right, so the filter’s center column sits on blanks:</p>
<pre><code>window      = [[0,0,1],     elementwise·F_vert, summed:
               [0,0,1],     each row: 0·(-1) + 0·(2) + 1·(-1) = -1
               [0,0,1]]     three rows → -3</code></pre>
<p><strong>Top-center window</strong> (rows 0–2, cols 1–3) — now the stroke lines up under the filter’s center column:</p>
<pre><code>window      = [[0,1,0],     each row: 0·(-1) + 1·(2) + 0·(-1) = +2
               [0,1,0],     three rows → +6
               [0,1,0]]</code></pre>
<p>Do this at all nine stops and you get the feature map. The center column lights up (+6); the flanks are suppressed (−3):</p>
<pre><code>conv = [[-3,  6, -3],
        [-3,  6, -3],
        [-3,  6, -3]]</code></pre>
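<p>The nine stops can be computed in one go. The sketch below slides the filter over every fully overlapping window with stride 1 and no padding, without flipping the filter (the usual convention in neural-network “convolution”). The function name <code>conv2d_valid</code> is an illustrative choice.</p>

```python
image = [[0, 0, 1, 0, 0] for _ in range(5)]  # vertical stroke in column 2
F_vert = [[-1, 2, -1],
          [-1, 2, -1],
          [-1, 2, -1]]

def conv2d_valid(img, filt):
    """Slide filt over every fully overlapping window of img (stride 1, no padding)."""
    k = len(filt)
    out_rows = len(img) - k + 1
    out_cols = len(img[0]) - k + 1
    return [
        [
            sum(img[r + i][c + j] * filt[i][j] for i in range(k) for j in range(k))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

conv = conv2d_valid(image, F_vert)
for row in conv:
    print(row)
```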
</section>
<section id="relu-then-pooling" class="level3">
<h3 class="anchored" data-anchor-id="relu-then-pooling">ReLU, then pooling</h3>
<p>Apply <strong>ReLU</strong> (zero out negatives) — the map keeps only the positive evidence “a vertical stroke is here”:</p>
<pre><code>relu(conv) = [[0, 6, 0],
              [0, 6, 0],
              [0, 6, 0]]</code></pre>
<p>Then <strong>max-pool</strong> with a <img src="https://latex.codecogs.com/png.latex?2%20%5Ctimes%202"> window, moved one cell at a time (stride 1), to shrink the grid and add a little position tolerance (each output cell is the max of a <img src="https://latex.codecogs.com/png.latex?2%20%5Ctimes%202"> patch). The <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> map becomes <img src="https://latex.codecogs.com/png.latex?2%20%5Ctimes%202">:</p>
<pre><code>pool = [[6, 6],
        [6, 6]]</code></pre>
<p>The pooled map is uniformly high: this filter is shouting “vertical stroke, present.” Pooling means it would keep shouting even if the stroke shifted a pixel — the detector is now slightly <em>position-invariant</em>, exactly the property Section 1 said convolution buys you.</p>
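<p>ReLU and pooling can be sketched the same way, starting from the feature map computed earlier. The pooling helper below takes an explicit <code>stride</code> argument; the value 1 is assumed here because it is the stride that turns a 3 x 3 map into a 2 x 2 one. Function names are illustrative.</p>

```python
conv = [[-3, 6, -3],
        [-3, 6, -3],
        [-3, 6, -3]]

def relu(grid):
    """Zero out negative entries."""
    return [[max(0, x) for x in row] for row in grid]

def max_pool(grid, size=2, stride=1):
    """Take the max of each size x size patch, moving stride cells at a time."""
    rows = range(0, len(grid) - size + 1, stride)
    cols = range(0, len(grid[0]) - size + 1, stride)
    return [
        [max(grid[r + i][c + j] for i in range(size) for j in range(size)) for c in cols]
        for r in rows
    ]

activated = relu(conv)
pool = max_pool(activated)
print(activated)
print(pool)
```

<p>Every 2 x 2 patch of the activated map includes a cell from the middle column, which is why all four pooled values come out equal.</p>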
</section>
<section id="a-layer-has-many-filters" class="level3">
<h3 class="anchored" data-anchor-id="a-layer-has-many-filters">A layer has many filters</h3>
<p>A real layer applies <em>many</em> filters in parallel, each producing its own feature map. Add a second filter, a <strong>horizontal-stroke detector</strong> (ink rewarded in the center <em>row</em>):</p>
<pre><code>F_horiz = [[-1, -1, -1],
           [ 2,  2,  2],
           [-1, -1, -1]]</code></pre>
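<p>Run it on the <em>same</em> vertical-stroke image and every window cancels to zero. Each window contains exactly one full column of ink, and every column of <code>F_horiz</code> sums to −1 + 2 − 1 = 0, so there is no horizontal ink to reward:</p>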
<pre><code>conv = relu = pool = all zeros</code></pre>
<p>So the two filters disagree, informatively: the vertical detector fires, the horizontal detector is silent. That contrast <em>is</em> the feature the classifier wants.</p>
</section>
<section id="flatten-and-classify" class="level3">
<h3 class="anchored" data-anchor-id="flatten-and-classify">Flatten and classify</h3>
<p>Flatten the two pooled maps into one feature vector (four numbers per filter, eight in all):</p>
<pre><code>features = [6, 6, 6, 6,   0, 0, 0, 0]
            └ vertical ┘   └ horizontal ┘</code></pre>
<p>A final linear layer reads these features into one score per class — let the “vertical” class sum the vertical-filter features and the “horizontal” class sum the horizontal ones — and softmax turns the scores into probabilities:</p>
<pre><code>logits = [24,  0]            # vertical, horizontal
softmax ≈ [1.00, 0.00]       # "this is a vertical stroke"</code></pre>
<p>The network has converted a grid of pixels, step by step, into a point in a tiny two-class solution space, and read out a crisp answer. This is the same arc as the language model, just over two stroke classes instead of a vocabulary of next tokens. (The probability is emphatic because this is a noise-free toy; on real handwriting the distribution would be softer.)</p>
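<p>The read-out can be replayed from the two pooled maps. In the sketch below the classifier weights <code>W_cls</code> are a hypothetical hand-set matrix that implements the rule described above (each class sums the features of its own filter); a trained network would learn these weights instead.</p>

```python
import math

pool_vert = [[6, 6], [6, 6]]    # pooled map from the vertical filter
pool_horiz = [[0, 0], [0, 0]]   # pooled map from the horizontal filter

def flatten(grid):
    return [x for row in grid for x in row]

features = flatten(pool_vert) + flatten(pool_horiz)

# Hypothetical classifier weights: one row per class, 1 where that class "listens".
W_cls = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # "vertical" sums the vertical-filter features
    [0, 0, 0, 0, 1, 1, 1, 1],  # "horizontal" sums the horizontal-filter features
]
logits = [sum(w * f for w, f in zip(row, features)) for row in W_cls]

# Softmax over the two class scores (max subtracted for numerical stability).
top = max(logits)
exps = [math.exp(z - top) for z in logits]
probs = [e / sum(exps) for e in exps]

print(features, logits, [round(p, 2) for p in probs])
```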
</section>
<section id="filter-vs-head-side-by-side" class="level3">
<h3 class="anchored" data-anchor-id="filter-vs-head-side-by-side">Filter vs head, side by side</h3>
<p>Now the comparison that motivated this appendix. A filter and a head are both <em>feature detectors that get reused across positions</em>, but they differ in the one respect that matters for language:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 16%">
<col style="width: 37%">
<col style="width: 45%">
</colgroup>
<thead>
<tr class="header">
<th></th>
<th><strong>Convolutional filter</strong> (CNN)</th>
<th><strong>Attention head</strong> (Transformer)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>What it stores</td>
<td>a fixed <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> pattern of weights</td>
<td>three projections <img src="https://latex.codecogs.com/png.latex?W_Q,%20W_K,%20W_V"> (and shares <img src="https://latex.codecogs.com/png.latex?W_O">)</td>
</tr>
<tr class="even">
<td>Receptive field</td>
<td><strong>fixed and local</strong> — always the same small window</td>
<td><strong>dynamic and global</strong> — chosen at runtime, can reach any earlier token</td>
</tr>
<tr class="odd">
<td>How it matches</td>
<td>slides the <em>same</em> pattern over every position; fires where the pixels match it</td>
<td>computes, from the content, a query–key similarity to decide <em>which</em> positions to read</td>
</tr>
<tr class="even">
<td>What “reuse across positions” means</td>
<td>weight sharing: identical weights at every location</td>
<td>the same <img src="https://latex.codecogs.com/png.latex?W_Q,%20W_K,%20W_V"> at every position, but the <em>attention pattern</em> is recomputed per input</td>
</tr>
<tr class="odd">
<td>Output at a position</td>
<td>one activation = how well the local patch matches the pattern</td>
<td>a weighted blend of other positions’ <strong>V</strong> payloads</td>
</tr>
<tr class="even">
<td>Set by the data…</td>
<td>at <strong>training</strong> time (the filter weights are learned, then frozen)</td>
<td>at training time <em>and</em> <strong>at run time</strong> (the pattern depends on the actual tokens)</td>
</tr>
</tbody>
</table>
<p>The crucial row is <em>receptive field</em>. The vertical filter above can only ever see a <img src="https://latex.codecogs.com/png.latex?3%20%5Ctimes%203"> patch; to relate ink in the top-left corner to ink in the bottom-right, a CNN must stack many layers until their windows overlap. That is fine for images, where the clues are local. A head pays no such toll: the query for one token can match the key of a token hundreds of positions away in a <em>single</em> step, and <em>which</em> token it matches is decided by the content, not fixed by the architecture. That is the whole reason attention displaced convolution for language — and, looping back to Appendix A, it is also why a relational rule in a language model lives in a <em>head</em>: the head is the only component whose reach is wide enough, and content-driven enough, to relate one token to another.</p>
</section>
</section>
<section id="appendix-c-glossary" class="level2">
<h2 class="anchored" data-anchor-id="appendix-c-glossary">Appendix C: Glossary</h2>
<p>For readers who would like the basics or a refresher. Terms are grouped roughly by where they first appear; Notes 2 and 3 carry their own glossaries for the terms specific to them.</p>
<section id="token-vocabulary" class="level3">
<h3 class="anchored" data-anchor-id="token-vocabulary">Token, vocabulary</h3>
<p>A <strong>token</strong> is the unit of text the model reads — a whole word in our toy examples, usually a sub-word piece in production models. The <strong>vocabulary</strong> (<img src="https://latex.codecogs.com/png.latex?V">) is the fixed set of all possible tokens (20 in Section 5’s toy, 50,257 in GPT-2).</p>
</section>
<section id="embedding-embedding-matrix-e" class="level3">
<h3 class="anchored" data-anchor-id="embedding-embedding-matrix-e">Embedding, embedding matrix <img src="https://latex.codecogs.com/png.latex?E"></h3>
<p>The vector of real numbers that represents a token — its coordinates in the model’s high-dimensional “language space.” The <strong>embedding matrix</strong> <img src="https://latex.codecogs.com/png.latex?E"> has one row per vocabulary token; “looking up” a token means reading its row.</p>
</section>
<section id="residual-stream" class="level3">
<h3 class="anchored" data-anchor-id="residual-stream">Residual stream</h3>
<p>The running vector the Transformer maintains for each position and updates layer by layer; every attention and MLP block <em>adds</em> its output to it. It is the only place the model’s “thinking” lives, and the prediction is read from its final state.</p>
</section>
<section id="attention-head-q-k-v" class="level3">
<h3 class="anchored" data-anchor-id="attention-head-q-k-v">Attention head, Q / K / V</h3>
<p>A sub-mechanism that, for each token, decides how much to read from every earlier token. It does so with three learned projections of the residual: the <strong>query</strong> (“what am I looking for?”), the <strong>key</strong> (“what do I offer to match against?”), and the <strong>value</strong> (“what payload do I carry if matched?”). Query–key similarity sets the attention pattern; the values are what flow along it.</p>
</section>
<section id="multi-head-attention" class="level3">
<h3 class="anchored" data-anchor-id="multi-head-attention">Multi-head attention</h3>
<p>Running <img src="https://latex.codecogs.com/png.latex?h"> attention heads in parallel on independent slices of Q/K/V, then concatenating their outputs and mixing them with <img src="https://latex.codecogs.com/png.latex?W_O">. Each head can specialise in a different relation.</p>
</section>
<section id="w_q-w_k-w_v-w_o" class="level3">
<h3 class="anchored" data-anchor-id="w_q-w_k-w_v-w_o"><img src="https://latex.codecogs.com/png.latex?W_Q,%20W_K,%20W_V,%20W_O"></h3>
<p>The four learned weight matrices of an attention block: three that project the residual into queries, keys and values, and <img src="https://latex.codecogs.com/png.latex?W_O"> that maps the concatenated head outputs back into the residual stream.</p>
</section>
<section id="mlp-block-w_1-w_2-relu" class="level3">
<h3 class="anchored" data-anchor-id="mlp-block-w_1-w_2-relu">MLP block (<img src="https://latex.codecogs.com/png.latex?W_1,%20W_2">, ReLU)</h3>
<p>The fully-connected “feed-forward” sub-layer that refines each token’s vector <em>in place</em>. It expands the vector to a wider hidden size via <img src="https://latex.codecogs.com/png.latex?W_1">, applies a non-linearity (here <strong>ReLU</strong>, which zeroes negatives), and contracts back via <img src="https://latex.codecogs.com/png.latex?W_2">. It never mixes information across positions.</p>
</section>
<section id="logits-softmax-temperature" class="level3">
<h3 class="anchored" data-anchor-id="logits-softmax-temperature">Logits, softmax, temperature</h3>
<p>The model’s raw, unnormalised scores over the vocabulary are <strong>logits</strong>. <strong>Softmax</strong> turns them into a probability distribution (exponentiate, then normalise to sum to one). <strong>Temperature</strong> <img src="https://latex.codecogs.com/png.latex?%5Ctau"> rescales the logits before softmax: low <img src="https://latex.codecogs.com/png.latex?%5Ctau"> sharpens the distribution (greedy at <img src="https://latex.codecogs.com/png.latex?%5Ctau%5Cto%200">), high <img src="https://latex.codecogs.com/png.latex?%5Ctau"> flattens it.</p>
</section>
<section id="unembedding-u-weight-tying" class="level3">
<h3 class="anchored" data-anchor-id="unembedding-u-weight-tying">Unembedding <img src="https://latex.codecogs.com/png.latex?U">, weight tying</h3>
<p>The matrix <img src="https://latex.codecogs.com/png.latex?U"> that maps the final residual vector to one logit per vocabulary token; each column is a token’s representation. <strong>Weight tying</strong> sets <img src="https://latex.codecogs.com/png.latex?U%20=%20E%5E%5Ctop">, so the same dictionary serves for input lookup and output scoring (as in GPT-2 and our toys; many larger models, including most Llama releases, instead learn a separate <img src="https://latex.codecogs.com/png.latex?U">).</p>
</section>
<section id="layernorm-rmsnorm" class="level3">
<h3 class="anchored" data-anchor-id="layernorm-rmsnorm">LayerNorm / RMSNorm</h3>
<p>A normalisation applied to the residual vector (notably just before the unembedding) that rescales its components to a controlled magnitude and applies a learned per-dimension scale.</p>
</section>
<section id="causal-mask" class="level3">
<h3 class="anchored" data-anchor-id="causal-mask">Causal mask</h3>
<p>The rule that each token may attend only to itself and earlier tokens. Implemented by setting the attention scores above the diagonal (the strict upper triangle) to <img src="https://latex.codecogs.com/png.latex?-%5Cinfty"> before softmax, so future positions get weight zero.</p>
</section>
<section id="kv-cache" class="level3">
<h3 class="anchored" data-anchor-id="kv-cache">KV cache</h3>
<p>The stored keys and values of all past tokens, reused at each generation step so the new token’s query can attend back to the whole history. Queries are not cached — hence “KV”, not “QKV”.</p>
</section>
<section id="prefill-vs.-generation" class="level3">
<h3 class="anchored" data-anchor-id="prefill-vs.-generation">Prefill vs.&nbsp;generation</h3>
<p><strong>Prefill</strong> processes the whole prompt at once, writing every token’s K and V into the cache. <strong>Generation</strong> then adds one token at a time, growing the cache by one row of K and V per step.</p>
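<p>The two phases and the cache can be sketched with a single toy attention head. Everything here is hypothetical: the <code>project_q</code>, <code>project_k</code> and <code>project_v</code> functions stand in for the learned projections, the vectors are invented, and a real model fills the cache for the whole prompt in one parallel pass per layer rather than in a Python loop.</p>

```python
import math

def attend(q, keys, values):
    """One query attends over all cached keys and returns a weighted sum of values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]

class KVCache:
    """Stores one key and one value per past token; queries are never stored."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

# Hypothetical stand-ins for the learned projections.
def project_q(x):
    return [2.0 * v for v in x]

def project_k(x):
    return list(x)

def project_v(x):
    return [v + 1.0 for v in x]

cache = KVCache()
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up token vectors

# Prefill: write K and V for every prompt token into the cache.
for x in prompt:
    cache.append(project_k(x), project_v(x))

# Generation: each new token adds one K row and one V row, then its query
# attends over the whole cache. Earlier keys and values are not recomputed.
new_token = [0.5, 0.5]
cache.append(project_k(new_token), project_v(new_token))
output = attend(project_q(new_token), cache.keys, cache.values)
print(len(cache.keys), [round(v, 3) for v in output])
```

<p>The point of the design is that each generation step computes projections for one token only; the cost of looking back grows with the length of the cache, not with repeated work on the history.</p>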
</section>
<section id="ablation" class="level3">
<h3 class="anchored" data-anchor-id="ablation">Ablation</h3>
<p>Switching off one component — e.g.&nbsp;zeroing one head’s slice of <img src="https://latex.codecogs.com/png.latex?W_O"> — and re-measuring behaviour, to test what that component <em>causally</em> does (Appendix A).</p>
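<p>One way to picture the zeroing is sketched below. It assumes a row-vector convention in which the heads' outputs are concatenated and multiplied by an output matrix with one block of rows per head; the helper names, shapes, and numbers are illustrative and may differ from the layout used in Appendix A.</p>

```python
import copy

def apply_w_o(head_outputs, W_O):
    """Concatenate the heads' outputs and multiply by W_O (rows index the concatenation)."""
    concat = [v for head in head_outputs for v in head]
    d_model = len(W_O[0])
    return [sum(c * W_O[r][j] for r, c in enumerate(concat)) for j in range(d_model)]

def ablate_head(W_O, head, d_head):
    """Return a copy of W_O with the rows belonging to one head set to zero."""
    ablated = copy.deepcopy(W_O)
    for r in range(head * d_head, (head + 1) * d_head):
        ablated[r] = [0.0] * len(ablated[r])
    return ablated

# Two hypothetical heads of width 2 writing into a model of width 2.
head_outputs = [[1.0, 2.0], [3.0, 4.0]]
W_O = [
    [1.0, 0.0],   # head 0, dimension 0
    [0.0, 1.0],   # head 0, dimension 1
    [1.0, 1.0],   # head 1, dimension 0
    [0.5, -0.5],  # head 1, dimension 1
]

full = apply_w_o(head_outputs, W_O)
without_head_1 = apply_w_o(head_outputs, ablate_head(W_O, head=1, d_head=2))
contribution = [a - b for a, b in zip(full, without_head_1)]
print(full, without_head_1, contribution)
```

<p>The difference between the two outputs is exactly what head 1 wrote into the residual vector, which is the quantity an ablation study then traces through to the model's behaviour.</p>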
</section>
<section id="convolution-filter-feature-map-cnn-terms" class="level3">
<h3 class="anchored" data-anchor-id="convolution-filter-feature-map-cnn-terms">Convolution, filter, feature map (CNN terms)</h3>
<p>A <strong>convolution</strong> slides a small fixed <strong>filter</strong> of weights over an image, computing a local weighted sum at each position; the resulting grid is a <strong>feature map</strong>. The contrast with an attention head — fixed local window vs.&nbsp;dynamic content-driven reach — motivates Section 1 and Appendix B.</p>
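<p>For readers who have not met a convolution before, a small sketch follows. It uses a tiny made-up image and a two-weight filter that responds to a left-to-right increase in brightness; as in most CNN libraries, the filter is slid without being flipped, and the function name <code>conv2d</code> is ours.</p>

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a local weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A 3x4 image that is dark on the left and bright on the right.
image = [[0, 0, 1, 1] for _ in range(3)]
kernel = [[-1, 1]]  # fixed filter: right pixel minus left pixel

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)  # each row is [0, 1, 0]: the filter fires only at the edge
```

<p>The same two weights are applied at every position and never look beyond their own small window, which is the fixed, local behaviour that an attention head relaxes.</p>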
</section>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<p>Cunningham, H., Ewart, A., Riggs, L., Huben, R., &amp; Sharkey, L. (2023). <em>Sparse autoencoders find highly interpretable features in language models.</em> arXiv. https://arxiv.org/abs/2309.08600</p>
<p>DeepSeek-AI. (2024). <em>DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.</em> arXiv. https://arxiv.org/abs/2405.04434</p>
<p>Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., … Olah, C. (2022). <em>Toy models of superposition.</em> Transformer Circuits Thread. https://transformer-circuits.pub/2022/toy_model/index.html</p>
<p>Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). <em>Scaling laws for neural language models.</em> arXiv. https://arxiv.org/abs/2001.08361</p>
<p>nostalgebraist. (2020, August 31). <em>Interpreting GPT: The logit lens.</em> LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens</p>
<p>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. <em>Advances in Neural Information Processing Systems, 30.</em> https://arxiv.org/abs/1706.03762</p>


</section>

 ]]></description>
  <category>LLMs</category>
  <category>Transformers</category>
  <category>Attention</category>
  <category>Teaching</category>
  <guid>https://theworkbench.lerzegov.org/posts/transformers-qkv-attention/</guid>
  <pubDate>Sat, 30 May 2026 22:00:00 GMT</pubDate>
</item>
</channel>
</rss>
