Scaled Dot-Product Attention
The core operation
From Intuition to Math
In the previous chapter, we built intuition for attention: queries looking for relevant keys, values being retrieved and combined. Now we formalize this with matrices. The math is beautiful once you see it.
Every token in our sequence has an embedding — a vector representation. We stack these into a matrix X, where each row is one token's embedding. From this single input, we create three different views through learned linear projections:

Q = XW_Q    K = XW_K    V = XW_V

These weight matrices W_Q, W_K, and W_V are learned during training. They transform the same input into three different roles:
- Queries (Q): What each position is looking for
- Keys (K): What each position offers to be found
- Values (V): What each position contributes when matched
The genius is that these projections are learned. The network discovers what makes a good query, what makes positions findable, and what information should flow when attention happens.
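The projections are easy to sketch in a few lines of NumPy. The dimensions below (4 tokens, model dimension 8) are toy values chosen for illustration, and the weight matrices are random stand-ins for what training would learn:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8          # 4 tokens, embedding dim 8 (toy sizes)

X = rng.normal(size=(n, d_model))  # input: one row per token embedding

# Learned projection matrices (random placeholders here;
# training would set their values)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Three views of the same input
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)   # each is (4, 8)
```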
Interactive: Step Through Attention
We start with our input embeddings — one row per token.
Walk through each step to see how a simple 4-token sentence gets transformed. Notice how the input gets projected into Q, K, V, then combined to produce contextualized outputs.
The Dot Product as Similarity
How do we measure which keys match which queries? The dot product.
When two vectors point in similar directions, their dot product is large and positive. When they are orthogonal, it is zero. When they point opposite ways, it is negative. This makes the dot product a natural measure of alignment — exactly what we need for attention.
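A tiny numeric check of those three cases (the vectors are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 0.0])

similar  = np.array([0.9, 0.1])   # roughly the same direction
ortho    = np.array([0.0, 1.0])   # orthogonal
opposite = np.array([-1.0, 0.0])  # opposite direction

print(a @ similar)    # 0.9  -> large and positive
print(a @ ortho)      # 0.0  -> no alignment
print(a @ opposite)   # -1.0 -> negative
```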
For a single query vector q and key vector k, the score is simply q · k. But we need all pairwise scores between every query and every key. Matrix multiplication gives us this in one shot:

scores = QKᵀ

The result is an n × n matrix where entry (i, j) tells us how well query i matches key j. Position i can now see which other positions are relevant to it.
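Under the same toy assumptions (4 tokens, d_k = 8, random Q and K), one matrix multiplication produces every score at once:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

scores = Q @ K.T       # (n, n); scores[i, j] = Q[i] · K[j]
print(scores.shape)    # (4, 4)

# Row i holds query i's raw score against every key
```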
Interactive: Attention Weights Heatmap
Click on any cell to see the underlying computation. Notice how some words attend strongly to related words — "cat" attends to "curious" because adjectives modify nouns, "watched" attends to "cat" because verbs need their subjects.
Why Scale by √d_k?
Here is where many explanations gloss over something crucial. The original paper includes a seemingly arbitrary √d_k in the denominator. Why?
"We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients."
— Attention Is All You Need, Vaswani et al.
Let's unpack this. When d_k is large, dot products between random vectors have larger variance. If the components of q and k are independent random variables with mean 0 and variance 1, their dot product q · k is a sum of d_k such component products, so it has mean 0 and variance d_k. Large dot products push softmax toward extreme values — nearly 1 for the maximum, nearly 0 for everything else.
This is a problem. When softmax saturates:
- Gradients vanish. The derivative of softmax approaches zero in saturated regions. No gradient means no learning.
- Attention becomes one-hot. Instead of blending information from multiple sources, each position fixates on exactly one other position.
Dividing by √d_k normalizes the variance of dot products back to roughly 1, keeping softmax in a regime where it can still learn.
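A quick way to see the effect: sample a random query and random keys at growing d_k, and compare the peak softmax probability with and without scaling. This is a sketch of the saturation argument, not a benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())    # subtract max for numerical stability
    return e / e.sum()

for d_k in (4, 64, 1024):
    q = rng.normal(size=d_k)          # components ~ N(0, 1)
    K = rng.normal(size=(10, d_k))    # 10 random keys
    raw = K @ q                       # variance grows like d_k
    scaled = raw / np.sqrt(d_k)       # variance back to roughly 1
    print(d_k, softmax(raw).max(), softmax(scaled).max())
```

As d_k grows, the unscaled column typically approaches 1.0 (a near-one-hot distribution) while the scaled column stays moderate.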
Interactive: Softmax Saturation Demo
[Interactive panels: softmax output distribution, gradient magnitude, max probability, and distribution spread, updating as d_k and scaling change.]
Try it: Increase d_k with scaling OFF and watch the distribution collapse to nearly one-hot. The gradient magnitude drops to nearly zero, making learning impossible. Turn scaling ON to see how dividing by √d_k keeps gradients healthy regardless of dimension.
The Complete Formula
Putting it all together, scaled dot-product attention is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Let's trace through what happens:
- Compute scores: QKᵀ gives raw attention scores — how much each query matches each key.
- Scale: Divide by √d_k to stabilize gradients.
- Softmax: Apply softmax row-wise. Each row now sums to 1, giving us a probability distribution over keys.
- Weighted sum: Multiply by V. Each output row is a weighted combination of value vectors, where weights come from attention.
The output has the same shape as the input. But now each position's representation has been enriched with information from other positions, weighted by relevance.
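The whole operation fits in a short function. A minimal NumPy sketch, run here on toy random inputs:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QKᵀ / √d_k) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scaled pairwise scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # (4, 8) — same shape as the input
```

When all scores are equal, the softmax rows are uniform and each output is just the average of the value vectors — a useful sanity check that the weighted sum behaves as described.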
This is what makes attention powerful. A word's meaning can now depend on its context. "Bank" in "river bank" attends to "river" and gets a different representation than "bank" in "savings bank" attending to "savings."
Key Takeaways
- Queries, Keys, and Values are learned projections from the same input: Q = XW_Q, K = XW_K, V = XW_V
- The dot product measures pairwise similarity between all queries and keys
- Scaling by √d_k prevents softmax saturation and keeps gradients healthy
- The complete formula produces context-aware representations
- Each output position is a weighted sum of values, where attention weights reflect query-key similarity