How it works
Each token is projected into a query vector $q_i$ and a key vector $k_j$. For the active query $q_i$ and a candidate key $k_j$, a head computes a compatibility score, the dot product $q_i \cdot k_j$, scales it down by $\sqrt{d}$, then normalises across all keys with a softmax so the weights sum to one:
$$\alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d}\right)}{\sum_{j'} \exp\left(q_i \cdot k_{j'} / \sqrt{d}\right)}$$
The factor $1/\sqrt{d}$, where $d$ is the dimension of the key vectors, keeps the scores from blowing up as the dimension grows, which would otherwise push the softmax into a saturated, gradient-starved corner. The arcs above draw $\alpha_{ij}$ directly: brightness and particle density are the weight. The new representation of the query is then $\sum_j \alpha_{ij} v_j$, the average of the value vectors $v_j$, weighted by exactly these arcs.
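The computation described above can be sketched for a single query in a few lines of plain Python. This is a minimal illustration, not the code behind the visualisation: the function names `softmax` and `attend` and the toy vectors are assumptions made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys."""
    d = len(query)
    # Compatibility score per key: dot product, scaled down by sqrt(d).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)  # non-negative, sums to one
    # New representation: the weighted average of the value vectors.
    dim_v = len(values[0])
    output = [
        sum(w * v[c] for w, v in zip(weights, values))
        for c in range(dim_v)
    ]
    return weights, output

# Toy data, invented for illustration only.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The key most aligned with the query receives the largest weight, and the output is pulled toward that key's value vector.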
Where it shows up
The whole picture shows the contents of a single attention layer inside one transformer block, the building block under most modern language and vision models. Two things are worth watching:
- Heads specialise. Each head has a slowly drifting preferred relative offset: lower heads look nearby, higher heads reach across the sequence. Stack several and you see distinct patterns overlaid in different hues, much as the heads of multi-head attention learn different patterns in a real model.
- Temperature is the knife edge. Drag it low and each query collapses onto one near-certain key (the greedy regime where gradients vanish); drag it high and the distribution flattens until every token is weighted near-equally and the signal washes out. Real models live in the useful middle.
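The temperature behaviour described above can be checked numerically. The sketch below assumes temperature enters by dividing the scores before the softmax, which is a common convention but is not confirmed by this page; the function names and the example scores are made up for illustration. Entropy is used as a rough measure of how flat the distribution is.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Softmax over scores / temperature (assumed placement of the knob)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy in nats; 0 for one-hot, log(n) for uniform."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Hypothetical compatibility scores for one query over four keys.
scores = [2.0, 1.0, 0.5, -1.0]

sharp = softmax_with_temperature(scores, 0.05)  # low: collapses onto one key
mid = softmax_with_temperature(scores, 1.0)     # the useful middle
flat = softmax_with_temperature(scores, 50.0)   # high: near-uniform

for name, w in [("sharp", sharp), ("mid", mid), ("flat", flat)]:
    print(name, [round(x, 3) for x in w], round(entropy(w), 3))
```

At low temperature almost all the weight lands on the highest-scoring key; at high temperature the four weights approach one quarter each.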
The knobs
- Tokens — length of the sequence; more tokens, more keys to attend over.
- Heads — number of independent attention patterns stacked, one hue each.
- Temperature — sharpens (low) or flattens (high) the softmax over keys.
- Speed — pace of the query cycle and particle flow.
- Color — base hue the per-head palette is derived from.