Self-Attention vs Cross-Attention
Different attention patterns
Self-Attention Recap
In self-attention, a sequence talks to itself. Every word generates a Query, Key, and Value, and every word attends to every other word—including itself.
Consider the sentence "The cat sat on the mat." When processing the word "sat," self-attention allows it to look at "cat" (who is sitting), "mat" (where the sitting happens), and even "sat" itself. All the information needed to understand this word comes from the same sequence.
The key insight: Q, K, and V all come from the same input sequence. If our sequence has length n, we compute an n × n attention matrix where each position can attend to all n positions.
Self-attention appears in both the encoder and decoder of a transformer. In the encoder, it allows each word to gather context from the entire input. In the decoder, it allows generated words to look at previously generated words (with some restrictions we will discuss shortly).
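To make this concrete, here is a minimal NumPy sketch of self-attention. The weight matrices and inputs are random, illustrative values (not a trained model), and the helper names (`softmax`, `self_attention`, `Wq`, `Wk`, `Wv`) are our own choices:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Q, K, and V all derive from X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every position vs. every position
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8                              # 6 tokens, model dimension 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                         # (6, 8): one output row per input position
```

Note that the output has the same length as the input: each position's output is a context-weighted mixture of every position's Value.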
Cross-Attention
Cross-attention is different. Here, the Query comes from one sequence, while the Keys and Values come from another.
This is how the decoder "reads" the encoder's output. When generating French text from English input, the French words need to query the English representations. The decoder generates Queries asking "what English information do I need right now?" and the encoder's Keys and Values provide the answers.
Think of it as a conversation between two sequences. The decoder sequence asks questions; the encoder sequence provides answers.
Self-Attention vs Cross-Attention
In self-attention, Q, K, and V all come from the same sequence. Each word attends to every word.
In self-attention, every position attends to every position within the same sequence. In cross-attention, the decoder sequence queries the encoder sequence—a one-way information flow from source to target.
The key point: cross-attention is the bridge between encoder and decoder. Without it, the decoder would generate text with no knowledge of the input. Cross-attention is what makes translation, summarization, and question-answering possible.
Consider translating "Hello" to French. When generating "Bonjour," the decoder's Query for this position asks: "What English word am I translating?" The encoder's Keys from "Hello" respond with high attention scores because they match semantically. The decoder then retrieves the encoder's Value for "Hello" to inform its output.
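The only mechanical change from self-attention is where Q, K, and V come from. A minimal NumPy sketch, again with random illustrative weights and our own helper names:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_dec, X_enc, Wq, Wk, Wv):
    """Q from the decoder sequence; K and V from the encoder output."""
    Q = X_dec @ Wq                            # (t, d): the decoder asks the questions
    K = X_enc @ Wk                            # (s, d): the encoder provides the answers
    V = X_enc @ Wv                            # (s, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (t, s): target positions vs. source positions
    return softmax(scores, axis=-1) @ V       # (t, d): one row per decoder position

rng = np.random.default_rng(0)
s, t, d = 3, 5, 8                             # 3 source tokens, 5 target tokens
X_enc = rng.normal(size=(s, d))
X_dec = rng.normal(size=(t, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(X_dec, X_enc, Wq, Wk, Wv)
print(out.shape)                              # (5, 8)
```

The attention matrix is now rectangular, t × s, rather than square: each of the t decoder positions distributes its attention over the s encoder positions.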
Masked (Causal) Self-Attention
There is a subtle problem with decoder self-attention during training.
When training a language model, we know the complete target sequence—"The cat sat on the mat." We feed this entire sequence into the decoder at once for efficiency. But here is the issue: when the decoder processes "sat," it should not be able to see "on the mat" because during generation, those words have not been produced yet.
If we let the decoder peek at future words during training, it would learn to cheat. It would not learn to predict the next word; it would learn to copy it.
The solution is masking. We modify the attention scores so that each position can only attend to earlier positions (and itself). This is called causal masking because it respects the causal order of generation—you can only use information from the past.
Mechanically, we set attention scores for future positions to -∞ before applying softmax. Since e^(-∞) = 0, these positions contribute nothing to the weighted sum.
Causal Mask Pattern
The causal mask ensures each position only attends to past and present positions. Future positions are masked with -∞, becoming 0 after softmax.
The triangular pattern emerges naturally. Position 1 can only see position 1. Position 2 can see positions 1 and 2. Position 3 can see positions 1, 2, and 3. And so on.
The diagonal and lower triangle are visible (allowed), while the upper triangle is masked (forbidden).
During inference, causal masking is implicit—we simply have not generated the future tokens yet. During training, we enforce it explicitly with the mask.
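The masking step above can be sketched in a few lines of NumPy. The scores here are random stand-ins for real attention scores; the point is the mask and what softmax does with it:

```python
import numpy as np

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))

# Upper-triangular mask: True wherever column j lies in row i's future
future = np.triu(np.ones((n, n), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Softmax over each row: exp(-inf) = 0, so future positions get zero weight
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
# Row i has nonzero weights only in columns 0..i — the triangular pattern
```

Row 1 (position 1) ends up as [1, 0, 0, 0]: with every other position masked, all attention collapses onto the only visible token.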
The Three Attention Types in Transformers
The original transformer architecture uses all three types of attention in different places:
1. Encoder Self-Attention (Bidirectional)
Used in: Encoder layers
What it does: Each position attends to all positions in the input sequence
Why: The encoder's job is to build rich representations of the input. It benefits from seeing the full context in both directions. When encoding "The bank by the river," the word "bank" should see "river" to resolve its meaning.
2. Decoder Masked Self-Attention (Causal)
Used in: First attention layer of each decoder block
What it does: Each position attends only to earlier positions in the output sequence
Why: Maintains autoregressive property during training. When generating "Le chat," the word "chat" should not see words that come after it in the training sequence.
3. Decoder Cross-Attention
Used in: Second attention layer of each decoder block
What it does: Decoder positions query encoder positions
Why: Allows the decoder to incorporate information from the input. When generating "chat" (cat), it needs to know that "cat" appeared in the English input.
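The two decoder attention types can be composed in one short sketch. This is a simplified illustration (random inputs, no projection matrices, residuals, or feed-forward sublayers), not a full decoder block:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention with an optional causal mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d, s, t = 8, 3, 5
enc_out = rng.normal(size=(s, d))   # encoder output for the source sentence
dec_in = rng.normal(size=(t, d))    # decoder states for the target so far

# 1. Masked self-attention: each decoder position sees only its own past
h = attention(dec_in, dec_in, dec_in, causal=True)
# 2. Cross-attention: decoder positions query the encoder output (no mask)
h = attention(h, enc_out, enc_out)
print(h.shape)                      # (5, 8)
```

The ordering mirrors the decoder block: the causal self-attention sublayer runs first, and its output supplies the Queries for the cross-attention sublayer.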
Attention in Translation
Source (English): "Hello world !" → Target (French): "Bonjour monde !"

- Encoder Self-Attention (bidirectional: sees all)
- Decoder Masked Self-Attention (causal: sees past only)
- Cross-Attention (decoder queries encoder)
Watch how all three attention types work together during translation. The encoder processes the full source sentence bidirectionally. The decoder uses causal self-attention to attend to previously generated words, and cross-attention to query the encoder's representations.
| Type | Query Source | Key/Value Source | Masking | Purpose |
|---|---|---|---|---|
| Encoder Self-Attention | Input sequence | Input sequence | None | Build contextual input representations |
| Decoder Masked Self-Attention | Output sequence | Output sequence | Causal | Maintain autoregressive generation |
| Decoder Cross-Attention | Output sequence | Encoder output | None | Connect decoder to encoder |
Key Takeaways
- Self-attention has Q, K, V all from the same sequence—the sequence "talks to itself"
- Cross-attention has Q from one sequence, K and V from another—one sequence "queries" the other
- Causal (masked) self-attention prevents positions from attending to future positions
- The mask uses -∞ values that become zero after softmax
- Encoder uses bidirectional self-attention to build rich input representations
- Decoder uses causal self-attention (to maintain autoregressive property) and cross-attention (to read the encoder)
- Cross-attention is the bridge that connects encoder and decoder in sequence-to-sequence models