The Attention Intuition
What attention really means
Human Attention
Read this sentence:
"The cat sat on the mat because it was tired."
What does "it" refer to?
Your brain instantly knows the answer is "the cat." You did not consciously compare every word. You did not run through a checklist. Your attention simply jumped to the relevant word.
This is the essence of the attention mechanism. When processing "it," a transformer learns to look back at the sentence and attend to the word that matters most—in this case, "cat."
Pronoun Resolution
Click on "it" to see where it attends. The attention flows back to "cat."
Notice how "it" needs context from earlier in the sentence. Without that connection, the word is meaningless. The attention mechanism gives neural networks this ability to reach back and gather relevant context.
Query, Key, Value — Before the Math
The attention mechanism uses three concepts that might sound abstract at first: Query, Key, and Value. Before we touch any equations, let us build intuition for what these actually mean.
Think of it like searching for a book in a library:
- Query: "I'm looking for something about neural networks" — this is what you need
- Key: "This shelf contains computer science books" — this is what each location advertises
- Value: The actual books on the shelf — this is the content you retrieve
In a transformer:
- Each word generates a Query: "What information do I need?"
- Each word generates a Key: "Here's what kind of information I can provide"
- Each word has a Value: "Here's my actual content"
Query-Key Matching
Each key advertises what information it offers. The query searches for the best match.
The key insight: attention is like a soft dictionary lookup. A regular dictionary gives you exactly one result for each lookup. Attention gives you a weighted blend of all entries, where the weights depend on how well each key matches your query.
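The "soft dictionary" idea can be sketched in a few lines of NumPy. This is a toy illustration with hand-picked numbers, not learned parameters: the query is scored against every key, the scores are turned into weights that sum to 1 (a softmax), and the result is a weighted blend of all values rather than a single entry.

```python
import numpy as np

def soft_lookup(query, keys, values):
    """Blend all values, weighted by how well each key matches the query."""
    scores = keys @ query                 # similarity of the query with every key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax: positive weights summing to 1
    return weights @ values               # weighted blend of ALL values

# Toy 2-D key vectors (hypothetical numbers, just for illustration)
keys = np.array([[1.0, 0.0],   # "cat"
                 [0.0, 1.0],   # "mat"
                 [0.5, 0.5]])  # "sat"
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([2.0, 0.0])   # points toward the "cat" key

blend = soft_lookup(query, keys, values)
```

Unlike a regular dictionary, no entry is ever fully ignored: the blend leans toward the best-matching value ("cat") but still mixes in a little of the others.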
When the pronoun "it" generates its query, it is essentially asking: "Who or what am I referring to?" The keys from "cat," "mat," and other words offer their identities. The word "cat" has a key that matches well with the query from "it" — both relate to an animate subject capable of being tired. So "it" attends strongly to "cat."
Soft vs Hard Attention
There are two ways attention could work:
Hard attention picks exactly one word to attend to. When processing "it," hard attention would say: "I'm looking at 'cat' and nothing else."
Soft attention looks at all words, but with different weights. When processing "it," soft attention might say: "I'm 80% looking at 'cat,' 10% at 'sat,' 5% at 'mat,' and so on."
Why does soft attention work better?
The key reason is that soft attention is differentiable. During training, we need to compute gradients to update the model's parameters. Hard attention makes a discrete choice, and you cannot compute gradients through discrete choices cleanly. Soft attention uses continuous weights, which allows gradients to flow smoothly.
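The contrast can be made concrete with a few toy scores (hypothetical numbers, not from a real model). Hard attention is an argmax, which produces an all-or-nothing weight vector; soft attention is a softmax, which produces smooth weights. A tiny change in the scores shifts the soft weights slightly but can make the hard choice jump discontinuously, which is exactly why gradients cannot flow through it.

```python
import numpy as np

scores = np.array([2.0, 0.1, -0.5, 0.0])  # toy attention scores for "it"

# Hard attention: a discrete pick -- one word gets weight 1, the rest get 0.
hard = np.zeros_like(scores)
hard[scores.argmax()] = 1.0

# Soft attention: softmax turns the same scores into smooth, continuous weights.
soft = np.exp(scores - scores.max())
soft /= soft.sum()
```

Both vectors sum to 1 and favor the same word, but only `soft` varies continuously with the scores, so training can nudge it by gradient descent.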
Attention Heatmap
Hover over cells to see attention weights. Click a word on the left to highlight its row.
In this heatmap, each row shows where one word is looking. Brighter cells mean stronger attention. Notice how patterns emerge — words tend to attend to related words, forming clusters of semantic relationships.
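The grid behind such a heatmap can be computed in one shot: stack every word's query and key into matrices, score all pairs at once, and apply the softmax row by row. The sketch below uses random vectors (an assumption, since real queries and keys come from a trained model); the resulting matrix has one row of weights per word, each row summing to 1.

```python
import numpy as np

def attention_matrix(Q, K):
    """Row i holds word i's attention weights over every word (one heatmap row)."""
    scores = Q @ K.T                          # all query-key similarities at once
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)   # row-wise softmax

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))   # 5 words, 4-dim queries (toy values)
K = rng.normal(size=(5, 4))   # 5 words, 4-dim keys
A = attention_matrix(Q, K)    # 5x5 grid: exactly what the heatmap visualizes
```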
Building Intuition
Let us consolidate the key ideas:
- Every word asks a question (its query): "What context do I need?"
- Every word offers an answer (its key): "Here's what I represent"
- Attention scores determine how much each word listens to every other word
- Values carry the actual information that gets passed along
When we process a word like "it," the mechanism computes attention scores with every previous word. These scores act as weights that determine how much of each word's value to incorporate. The result is a context-aware representation that understands "it" refers to "cat."
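This whole pipeline fits in a short sketch. The key and value vectors below are hand-picked toy numbers chosen so that "cat" matches the query from "it"; in a real transformer they would be learned. The query is scored against every earlier word, the scores become softmax weights, and the weighted blend of values is the context-aware representation of "it."

```python
import numpy as np

# Hand-picked toy vectors (assumptions, not learned parameters)
keys = {"cat": np.array([1.0, 0.2]),
        "sat": np.array([0.3, 0.9]),
        "mat": np.array([0.2, 0.1])}
values = {"cat": np.array([1.0, 0.0]),
          "sat": np.array([0.0, 1.0]),
          "mat": np.array([0.5, 0.5])}
q_it = np.array([1.2, 0.1])  # query from "it": "who am I referring to?"

names = list(keys)
scores = np.array([keys[w] @ q_it for w in names])  # score against every word
w = np.exp(scores - scores.max())
w /= w.sum()                                        # attention weights
context = sum(wi * values[n] for wi, n in zip(w, names))  # new vector for "it"
```

The heaviest weight lands on "cat," so the blended vector `context` is dominated by cat's value: "it" now carries the information that it refers to the cat.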
This is the foundation of everything transformers do. Whether generating text, translating languages, or understanding images, the same principle applies: learn which parts of the input matter most for each part of the output.
In the next chapter, we will add the mathematics. You will see how queries, keys, and values are computed from embeddings, how attention scores are calculated using dot products, and why we need scaling. But the core intuition remains: attention is about learning where to look.
Key Takeaways
- Attention allows neural networks to dynamically focus on relevant parts of the input
- Query, Key, and Value form a soft dictionary lookup system
- Queries represent what information is needed; keys advertise what information is available
- Soft attention blends all inputs with learned weights, enabling gradient-based training
- Attention patterns reveal which words the model considers related