Self-Attention vs Cross-Attention
Different attention patterns
Self-Attention Recap
In self-attention, a sequence talks to itself. Every word generates a Query, Key, and Value, and every word attends to every other word—including itself.
Consider the sentence "The cat sat on the mat." When processing the word "sat," self-attention allows it to look at "cat" (who is sitting), "mat" (where the sitting happens), and even "sat" itself. All the information needed to understand this word comes from the same sequence.
The key insight: Q, K, and V all come from the same input sequence. If our sequence has length n, we compute an n × n attention matrix where each position can attend to all n positions.
Self-attention appears in both the encoder and decoder of a transformer. In the encoder, it allows each word to gather context from the entire input. In the decoder, it allows generated words to look at previously generated words (with some restrictions we will discuss shortly).
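To make this concrete, here is a minimal NumPy sketch of self-attention. The weight matrices and inputs are random, illustrative values (not a trained model), and the helper names (`softmax`, `self_attention`, `Wq`, `Wk`, `Wv`) are our own choices:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Q, K, and V all derive from X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every position vs. every position
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8                              # 6 tokens, model dimension 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                         # (6, 8): one output row per input position
```

Note that the output has the same length as the input: each position's output is a context-weighted mixture of every position's Value.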
Cross-Attention
Cross-attention is different. Here, the Query comes from one sequence, while the Keys and Values come from another.
This is how the decoder "reads" the encoder's output. When generating French text from English input, the French words need to query the English representations. The decoder generates Queries asking "what English information do I need right now?" and the encoder's Keys and Values provide the answers.
Think of it as a conversation between two sequences. The decoder sequence asks questions; the encoder sequence provides answers.
Self-Attention vs Cross-Attention
In self-attention, Q, K, and V all come from the same sequence. Each word attends to every word.
In self-attention, every position attends to every position within the same sequence. In cross-attention, the decoder sequence queries the encoder sequence—a one-way information flow from source to target.
The key point: cross-attention is the bridge between encoder and decoder. Without it, the decoder would generate text with no knowledge of the input. Cross-attention is what makes translation, summarization, and question-answering possible.
Consider translating "Hello" to French. When generating "Bonjour," the decoder's Query for this position asks: "What English word am I translating?" The encoder's Keys from "Hello" respond with high attention scores because they match semantically. The decoder then retrieves the encoder's Value for "Hello" to inform its output.
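The only mechanical change from self-attention is where Q, K, and V come from. A minimal NumPy sketch, again with random illustrative weights and our own helper names:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_dec, X_enc, Wq, Wk, Wv):
    """Q from the decoder sequence; K and V from the encoder output."""
    Q = X_dec @ Wq                            # (t, d): the decoder asks the questions
    K = X_enc @ Wk                            # (s, d): the encoder provides the answers
    V = X_enc @ Wv                            # (s, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (t, s): target positions vs. source positions
    return softmax(scores, axis=-1) @ V       # (t, d): one row per decoder position

rng = np.random.default_rng(0)
s, t, d = 3, 5, 8                             # 3 source tokens, 5 target tokens
X_enc = rng.normal(size=(s, d))
X_dec = rng.normal(size=(t, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(X_dec, X_enc, Wq, Wk, Wv)
print(out.shape)                              # (5, 8)
```

The attention matrix is now rectangular, t × s, rather than square: each of the t decoder positions distributes its attention over the s encoder positions.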
Masked (Causal) Self-Attention
There is a subtle problem with decoder self-attention during training.
When training a language model, we know the complete target sequence—"The cat sat on the mat." We feed this entire sequence into the decoder at once for efficiency. But here is the issue: when the decoder processes "sat," it should not be able to see "on the mat" because during generation, those words have not been produced yet.
If we let the decoder peek at future words during training, it would learn to cheat. It would not learn to predict the next word; it would learn to copy it.
The solution is masking. We modify the attention scores so that each position can only attend to earlier positions (and itself). This is called causal masking because it respects the causal order of generation—you can only use information from the past.
Mechanically, we set attention scores for future positions to -∞ before applying softmax. Since e^(-∞) = 0, these positions contribute nothing to the weighted sum.
Causal Mask Pattern
The causal mask ensures each position only attends to past and present positions. Future positions are masked with -∞, becoming 0 after softmax.
The triangular pattern emerges naturally. Position 1 can only see position 1. Position 2 can see positions 1 and 2. Position 3 can see positions 1, 2, and 3. And so on.
The diagonal and lower triangle are visible (allowed), while the upper triangle is masked (forbidden).
During inference, causal masking is implicit—we simply have not generated the future tokens yet. During training, we enforce it explicitly with the mask.
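The masking step above can be sketched in a few lines of NumPy. The scores here are random stand-ins for real attention scores; the point is the mask and what softmax does with it:

```python
import numpy as np

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))

# Upper-triangular mask: True wherever column j lies in row i's future
future = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Softmax over each row: exp(-inf) = 0, so future positions get zero weight
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
# Row i has nonzero weights only in columns 0..i — the triangular pattern
```

Row 1 (position 1) ends up as [1, 0, 0, 0]: with every other position masked, all attention collapses onto the only visible token.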
The Three Attention Types in Transformers
The original transformer architecture uses all three types of attention in different places:
1. Encoder Self-Attention (Bidirectional)
Used in: Encoder layers
What it does: Each position attends to all positions in the input sequence
Why: The encoder's job is to build rich representations of the input. It benefits from seeing the full context in both directions. When encoding "The bank by the river," the word "bank" should see "river" to resolve its meaning.
2. Decoder Masked Self-Attention (Causal)
Used in: First attention layer of each decoder block
What it does: Each position attends only to earlier positions in the output sequence
Why: Maintains autoregressive property during training. When generating "Le chat," the word "chat" should not see words that come after it in the training sequence.
3. Decoder Cross-Attention
Used in: Second attention layer of each decoder block
What it does: Decoder positions query encoder positions
Why: Allows the decoder to incorporate information from the input. When generating "chat" (cat), it needs to know that "cat" appeared in the English input.
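The two decoder attention types can be composed in one short sketch. This is a simplified illustration (random inputs, no projection matrices, residuals, or feed-forward sublayers), not a full decoder block:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention with an optional causal mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d, s, t = 8, 3, 5
enc_out = rng.normal(size=(s, d))   # encoder output for the source sentence
dec_in = rng.normal(size=(t, d))    # decoder states for the target so far

# 1. Masked self-attention: each decoder position sees only its own past
h = attention(dec_in, dec_in, dec_in, causal=True)
# 2. Cross-attention: decoder positions query the encoder output (no mask)
h = attention(h, enc_out, enc_out)
print(h.shape)                      # (5, 8)
```

The ordering mirrors the decoder block: the causal self-attention sublayer runs first, and its output supplies the Queries for the cross-attention sublayer.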
Attention in Translation
Source (English): "Hello world !" → Target (French): "Bonjour monde !"

- Encoder Self-Attention (bidirectional: sees all)
- Decoder Masked Self-Attention (causal: sees past only)
- Cross-Attention (decoder queries encoder)
Watch how all three attention types work together during translation. The encoder processes the full source sentence bidirectionally. The decoder uses causal self-attention to attend to previously generated words, and cross-attention to query the encoder's representations.
| Type | Query Source | Key/Value Source | Masking | Purpose |
|---|---|---|---|---|
| Encoder Self-Attention | Input sequence | Input sequence | None | Build contextual input representations |
| Decoder Masked Self-Attention | Output sequence | Output sequence | Causal | Maintain autoregressive generation |
| Decoder Cross-Attention | Output sequence | Encoder output | None | Connect decoder to encoder |
Key Takeaways
- Self-attention has Q, K, V all from the same sequence—the sequence "talks to itself"
- Cross-attention has Q from one sequence, K and V from another—one sequence "queries" the other
- Causal (masked) self-attention prevents positions from attending to future positions
- The mask uses -∞ values that become zero after softmax
- Encoder uses bidirectional self-attention to build rich input representations
- Decoder uses causal self-attention (to maintain autoregressive property) and cross-attention (to read the encoder)
- Cross-attention is the bridge that connects encoder and decoder in sequence-to-sequence models