Attention Patterns in Practice

What models actually learn

Emergence of Interpretable Patterns

When we designed the attention mechanism, we specified how attention should be computed—dot products, softmax, weighted sums. But we did not specify what the model should attend to. That emerges from training.
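The mechanics we did specify can be sketched in a few lines. Below is a minimal NumPy version of scaled dot-product attention for a single head, with no masking; the names are illustrative, not taken from any particular codebase:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot products between queries and keys
    weights = softmax(scores, axis=-1)   # softmax over key positions
    return weights @ V, weights          # weighted sum of values

# Toy example: 3 tokens with 4-dimensional vectors
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, weights = attention(Q, K, V)
```

Everything interesting in this chapter lives in `weights`: the formula is fixed, but what the learned Q and K projections make those weights look like is up to training.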

Here is what makes this remarkable: across different model sizes, training runs, and even architectures, transformers consistently develop similar types of attention heads. Nobody told the model to create "heads that attend to the previous token" or "heads that track syntactic structure." These patterns emerge spontaneously because they are useful for the training objective.

What stands out: the model discovers linguistic structure without being taught linguistics. A model trained only to predict the next word ends up learning grammar, semantics, and even some world knowledge—all encoded in attention patterns.

This is both exciting and humbling. We can observe these patterns after training, but we did not design them. The model found them on its own.

Common Head Types

Researchers have catalogued several recurring attention head types that appear across transformer models. Think of these as "roles" that individual heads learn to play.

Previous Token Heads

The simplest pattern: always attend to position i-1. When processing "cat," look at "the." When processing "sat," look at "cat."

Why is this useful? Language has strong local dependencies. The previous word often constrains what comes next. These heads provide a reliable baseline signal about immediate context.
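An idealized previous-token head can be written down as an explicit attention matrix. This is a toy construction for illustration, not weights taken from a real model:

```python
import numpy as np

def previous_token_pattern(n, weight=0.9):
    """Idealized previous-token head over n positions: each position
    puts most of its attention mass on the position before it.
    Position 0 has no predecessor, so it attends to itself."""
    A = np.zeros((n, n))
    A[0, 0] = 1.0
    for i in range(1, n):
        A[i, i - 1] = weight        # bulk of the mass on the previous token
        A[i, i] = 1.0 - weight      # remainder on the current token
    return A

A = previous_token_pattern(6)
# For "The cat sat on the mat": row 2 ("sat") puts weight 0.9 on column 1 ("cat")
```

Real heads of this type are learned and noisier, but their attention matrices look strikingly like this banded structure.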

Induction Heads

These are more sophisticated. An induction head performs pattern completion. If the token pair "A B" appeared earlier in the sequence, and now "A" appears again, the head attends to "B."

Consider: "The cat sat on the mat. The cat..."

When the model sees "The cat" the second time, induction heads attend back to what followed "The cat" previously—"sat." This enables in-context learning, where models can pick up patterns from examples in their input.

Induction heads are a key discovery in mechanistic interpretability. They explain how transformers can learn from few-shot examples.
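The matching rule can be sketched at the token level. This toy function (names are my own, and it operates on token strings rather than learned representations) finds, for each position, where an idealized induction head would attend:

```python
def induction_targets(tokens):
    """Toy induction rule: for the token at position i, find earlier
    occurrences of the same token and return the position of the token
    that followed the most recent one ("A B ... A -> attend to B").
    Returns None where no earlier match exists."""
    targets = []
    for i, tok in enumerate(tokens):
        matches = [j + 1 for j in range(i) if tokens[j] == tok and j + 1 < i]
        targets.append(matches[-1] if matches else None)
    return targets

tokens = "The cat sat on the mat . The cat".split()
targets = induction_targets(tokens)
# At the second "cat" (position 8), the head attends to "sat" (position 2)
```

Real induction heads implement this match-then-shift behavior with two cooperating attention heads across layers; the lookup above only captures the resulting pattern, not the mechanism.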

Syntactic Heads

Some heads learn to track grammatical structure. A head might attend from a verb to its subject, from an adjective to the noun it modifies, or from a closing bracket to its opening bracket.

These heads reveal that transformers implicitly learn syntax. The model was never given parse trees—it discovered syntactic relationships because they help predict the next token.

Positional Heads

Some heads have fixed attention patterns based purely on relative position. They might always attend to tokens 2 positions back, or to the first token in the sequence.

These provide positional anchoring that complements the content-based attention of other heads.
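A hypothetical fixed-offset head is easy to write down explicitly. This is a one-hot toy construction; real positional heads are soft rather than exact:

```python
import numpy as np

def fixed_offset_pattern(n, offset=2):
    """Idealized positional head: each position attends exactly `offset`
    tokens back. Early positions with no token that far back fall back
    to attending the first token in the sequence."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, max(i - offset, 0)] = 1.0
    return A

A2 = fixed_offset_pattern(5, offset=2)
# Rows 0 and 1 attend to position 0; row 4 attends to position 2
```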

Attention Pattern Gallery

Previous Token Head

Each word primarily attends to the word immediately before it

[Heatmap: "The cat sat on the mat" on both axes; each token attends to its immediate predecessor with weight 0.8–0.9]

Previous token heads provide a simple but powerful signal: what word came immediately before? This captures strong local dependencies in language.

Explore different head types by switching tabs. Notice how each pattern serves a distinct purpose: previous token heads provide local context, induction heads enable pattern matching, and syntactic heads track grammatical relationships.

Layer-wise Patterns

Attention patterns also vary systematically across layers. This makes sense—early layers process raw input, while later layers work with increasingly abstract representations.

Early layers (1-3): Tend to focus on local patterns. Many heads attend to nearby positions, capturing surface-level features like adjacent words or punctuation.

Middle layers (4-8): Begin to show more structured patterns. Syntactic heads emerge here, tracking relationships across longer distances. The model starts building phrase-level and clause-level representations.

Late layers (9-12+): Become more diffuse and task-specific. Attention patterns are harder to interpret because they operate on highly abstract representations. These layers make final decisions about what to output.

Think of it as a processing pipeline: raw text → local features → syntactic structure → semantic meaning → task output.
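One way to quantify this progression is to measure how local a layer's attention is. The metric below is my own sketch, not a standard tool: it computes the fraction of attention mass that falls within a small window of each query position:

```python
import numpy as np

def locality(weights, window=1):
    """Fraction of total attention mass that each query places within
    `window` positions of itself. Near 1.0 for sharply local heads;
    roughly (2*window + 1) / n for a uniform pattern over n tokens."""
    n = weights.shape[0]
    idx = np.arange(n)
    near = np.abs(idx[:, None] - idx[None, :]) <= window  # banded mask
    return float((weights * near).sum() / weights.sum())

uniform = np.full((10, 10), 0.1)   # completely diffuse attention
local = np.eye(10)                 # each token attends only to itself
```

Applied to real attention maps, a score like the "87% on adjacent tokens" reported for early layers would come from exactly this kind of calculation.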

Layer Progression


Layer 1

Early: Local patterns, attending to nearby tokens

[Heatmap: "The quick brown fox jumps" on both axes; weights 0.3–0.7 concentrated near the diagonal, with 87% of attention on adjacent tokens]

Early layers focus on local context (adjacent words). As we go deeper, attention becomes more distributed, capturing longer-range semantic relationships.

Watch how attention evolves across layers. Early layers focus locally; later layers develop longer-range patterns. The same input is processed differently at each level of abstraction.

The Mystery Remains

Despite these discoveries, much about attention patterns remains unexplained.

Redundancy: Many heads seem to compute similar things. Why does the model need multiple previous-token heads? Pruning experiments show that removing some heads barely hurts performance, while removing others is catastrophic. We cannot always predict which is which.
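Head-pruning experiments of this kind typically zero out a head's output before it is combined with the others. A minimal sketch of that masking step, with illustrative shapes and names:

```python
import numpy as np

def prune_heads(head_outputs, keep_mask):
    """Zero out the outputs of pruned heads before they are projected
    back into the residual stream, as in head-ablation experiments.
    head_outputs: (num_heads, seq_len, head_dim); keep_mask: 0/1 per head."""
    mask = np.asarray(keep_mask, dtype=float)[:, None, None]
    return head_outputs * mask

outs = np.ones((8, 5, 16))                        # 8 heads, 5 tokens, 16 dims/head
pruned = prune_heads(outs, [1, 1, 0, 1, 0, 1, 1, 1])
# Heads 2 and 4 now contribute nothing downstream
```

Running the forward pass with such a mask and measuring the change in loss is how "barely hurts" versus "catastrophic" is established for each head.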

Polysemanticity: Some heads serve multiple roles depending on the input. A single head might track syntax in some contexts and semantics in others. This makes interpretation difficult.

Emergent behaviors: Large models exhibit attention patterns not seen in smaller models. These emergent capabilities appear suddenly at certain scales, suggesting the model is discovering qualitatively new computational strategies.

The limits of interpretation: We can describe what heads do, but explaining why they do it—in terms of the optimization landscape and training dynamics—remains an open research frontier.

Mechanistic interpretability is an active field trying to reverse-engineer these learned computations. The goal is to understand transformers not just as black boxes that work, but as comprehensible algorithms we can verify and trust.

Interactive Attention Explorer

[Interactive widget: hover over a word in "The cat sat on the mat" to see where it attends]

Simulated Attention Pattern

Based on distance and simple linguistic heuristics

[Heatmap: simulated attention for "The cat sat on the mat", with weights around 0.3–0.4]

This is a simplified simulation showing characteristic patterns. Real transformer attention is learned from data and captures more subtle relationships. Try different sentences to see how patterns change!

Type your own sentence and see how the simulated patterns change. For attention from real trained models, use visualization tools like BertViz or TransformerLens.

Key Takeaways

  • Transformers discover linguistic structure without explicit supervision—patterns emerge from training
  • Common head types include previous-token heads, induction heads, syntactic heads, and positional heads
  • Induction heads enable in-context learning by completing patterns seen earlier in the sequence
  • Attention patterns evolve across layers: local in early layers, syntactic in middle layers, semantic in late layers
  • Many aspects of attention patterns remain mysterious—some heads seem redundant but are actually critical
  • Mechanistic interpretability aims to reverse-engineer how transformers compute, not just what they output
  • Understanding these patterns is key to building more interpretable and trustworthy AI systems