
Self-Attention

Name: Self-Attention
Author: Ben Ebsworth

Multi-head self-attention as a live particle network — query tokens cycle, heads drift, weights flow.

How it works

Each token is projected into a query vector $q$, a key vector $k$, and a value vector $v$. For the active query and each candidate key, a head computes a compatibility score (the dot product $q \cdot k$), divides it by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, then normalises across all keys with a softmax so the weights sum to one:

$$\alpha_{q,k} = \operatorname{softmax}_k\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right)$$

The $1/\sqrt{d_k}$ factor keeps the scores from blowing up as the dimension grows: if the components of $q$ and $k$ are independent with zero mean and unit variance, the dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ brings it back to order one. Without it, large scores would push the softmax into a saturated, gradient-starved corner. The arcs above draw $\alpha$ directly: brightness and particle density encode the weight. The new representation of the query is then $\sum_k \alpha_{q,k}\,v_k$, the average of the value vectors, weighted by exactly these arcs.
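The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration for one query and one head, not the code behind the visualisation; the function names `softmax` and `attend` and the toy vectors are assumptions made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query (hypothetical helper).

    Returns the weights alpha_{q,k} and the weighted average of the values.
    """
    d_k = len(q)
    # Compatibility score per key: dot product scaled by 1/sqrt(d_k).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    alpha = softmax(scores)
    # New representation of the query: sum_k alpha_{q,k} * v_k.
    dim_v = len(values[0])
    out = [sum(a * v[j] for a, v in zip(alpha, values)) for j in range(dim_v)]
    return alpha, out

# Toy data (illustrative only): three keys, d_k = 4, two-dimensional values.
q = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

alpha, out = attend(q, keys, values)
print(alpha)  # weights over the three keys; they sum to one
print(out)    # weighted average of the value vectors
```

In this toy case the first and third keys have the same dot product with the query, so they receive equal weight, while the second key, which is orthogonal to the query, receives the least.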

Where it shows up

The whole picture shows the attention layer of a single transformer block, the building block under most modern language and vision models. Two things are worth watching:

Heads specialise. Each head has a slowly drifting preferred relative offset: lower heads look nearby, higher heads reach across the sequence. Stack several and you see distinct patterns overlaid in different hues, much as the heads of a real model learn to attend to different relationships.
Temperature is the knife edge. Drag it low and each query collapses onto one near-certain key (the greedy regime where gradients vanish); drag it high and the distribution flattens until every token is weighted near-equally and the signal washes out. Real models live in the useful middle.
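The temperature effect can be checked numerically. The sketch below assumes the temperature $T$ divides the scores before the softmax, which is the common convention; the page does not state how the demo applies it. The function names and the example scores are hypothetical. Entropy serves as a rough measure of how spread out the weights are.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Softmax over scores / temperature (assumed convention, see lead-in)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(weights):
    """Shannon entropy in nats: 0 for a one-hot distribution, log(n) for uniform."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

# Illustrative compatibility scores for four keys.
scores = [2.0, 1.0, 0.5, 0.0]

sharp = softmax_with_temperature(scores, 0.05)   # low T: collapses onto one key
middle = softmax_with_temperature(scores, 1.0)   # plain softmax
flat = softmax_with_temperature(scores, 50.0)    # high T: near-uniform weights

for name, w in [("sharp", sharp), ("middle", middle), ("flat", flat)]:
    print(name, [round(x, 4) for x in w], "entropy:", round(entropy(w), 4))
```

At low temperature nearly all the weight lands on the highest-scoring key and the entropy approaches zero; at high temperature the weights approach $1/n$ and the entropy approaches $\log n$.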

The knobs

Tokens — length of the sequence; more tokens, more keys to attend over.
Heads — number of independent attention patterns stacked, one hue each.
Temperature — sharpens (low) or flattens (high) the softmax over keys.
Speed — pace of the query cycle and particle flow.
Color — base hue the per-head palette is derived from.
