What Comes Next

The frontier of research

The Attention Bottleneck

The transformer's power comes from attention—letting every token interact with every other token. But this power has a cost.

Attention complexity is O(N²) in sequence length. If you double the context length, you quadruple the computation. The numbers become staggering at scale:

  • 4K context: ~16 million attention operations per layer
  • 32K context: ~1 billion attention operations per layer
  • 100K context: ~10 billion attention operations per layer
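The quadratic cost is visible directly in code. Here is a minimal numpy sketch of full (naive) attention, where the score matrix holds one entry per token pair; the sequence length of 1024 is just an illustrative choice:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Full pairwise attention: the score matrix is N x N, so compute
    and memory grow quadratically with sequence length."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (N, N): the O(N^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (N, d)

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)   # (1024, 64)
print(N * N)       # 1048576 pairwise scores -- already ~1M at a 1K context
```

Doubling N to 2048 quadruples the score count to ~4M, exactly the scaling the bullet list above describes.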

[Interactive figure: Attention Complexity Comparison. At a sequence length of 32 tokens, standard attention, O(N²), needs ~1.0K operations; linear attention, O(N), ~320; and Mamba (SSM), O(N), ~256. Standard attention uses roughly 10x more operations than the linear alternatives.]

Standard attention scales quadratically with sequence length. Linear alternatives sacrifice some expressivity for dramatically better scaling.

This quadratic scaling creates a fundamental tension. We want models that understand entire codebases, books, and conversation histories. But naive attention cannot scale to millions of tokens.

The research community has attacked this problem from multiple angles: sparse attention patterns, linear attention approximations, sliding window approaches. Each trades some expressivity for efficiency. The quest for efficient long-context models remains one of the most active areas of transformer research.
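One of those angles, the sliding-window approach, is easy to sketch: restrict each position to a fixed-width causal band, so cost drops from O(N²) to O(N·window). A toy mask construction, for illustration only:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Each position may attend only to itself and the `window - 1`
    positions before it, giving O(N * window) cost instead of O(N^2)."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return (j <= i) & (j > i - window)   # causal band of fixed width

mask = sliding_window_mask(8, 3)
print(mask.astype(int))
# Row 5 attends to positions 3, 4, 5 only.
print(int(mask.sum()))   # 21 allowed pairs, versus 64 for full attention on 8 tokens
```

Real systems (e.g. sliding-window attention in some open models) combine such a mask with stacked layers, so information can still travel far by hopping window-to-window across layers.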

Mixture of Experts (MoE)

Here is a counterintuitive idea: make the model bigger, but only use a small part of it at any time.

Mixture of Experts (MoE) divides the feed-forward network into multiple "experts"—specialized subnetworks that each handle different types of inputs. A learned routing mechanism decides which experts process each token.

[Interactive figure: Mixture of Experts Routing. Input tokens ("The quantum computer solved the equation beautifully") pass through a router network to four experts: Expert 0, common words; Expert 1, verbs & actions; Expert 2, technical terms; Expert 3, descriptive language.]

Each token is routed to the most appropriate expert. Only 1-2 experts activate per token, keeping compute manageable even with many experts.

The key insight: not every token needs every parameter. A token about mathematics might route to experts specialized in logical reasoning. A token about poetry might route to different experts tuned for creative language.
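A toy top-2 router makes the mechanism concrete. The expert and router weights below are random stand-ins, not a trained model; the point is only that each token runs through just two of the four experts:

```python
import numpy as np

rng = np.random.default_rng(0)

def top2_route(x, W_router, experts):
    """Send each token to its 2 highest-scoring experts and mix their
    outputs, weighted by the renormalized router probabilities."""
    logits = x @ W_router                         # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)         # softmax router scores
    top2 = np.argsort(probs, axis=-1)[:, -2:]     # indices of the best 2 experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        e1, e2 = top2[t]
        p1, p2 = probs[t, e1], probs[t, e2]
        # Only these 2 experts run for this token; the others stay idle.
        out[t] = (p1 * experts[e1](x[t]) + p2 * experts[e2](x[t])) / (p1 + p2)
    return out, top2

num_experts, d = 4, 8
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((d, d)))
           for _ in range(num_experts)]           # each expert: a small linear map
W_router = rng.standard_normal((d, num_experts))
x = rng.standard_normal((5, d))                   # 5 tokens
out, chosen = top2_route(x, W_router, experts)
print(out.shape, chosen.shape)                    # (5, 8) (5, 2)
```

With 4 experts but only 2 active per token, each token pays half the expert compute while the model as a whole holds all four experts' parameters.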

GPT-4 is widely believed to use an MoE architecture—rumored to have 8 experts with only 2 active per token. If so, this means:

  • Total parameters: massive (enabling broad capabilities)
  • Active parameters per token: manageable (enabling reasonable inference cost)
  • Result: a model that acts big but runs small

MoE introduces new challenges. Load balancing across experts is tricky—you want even utilization, not all tokens routing to the same expert. And training MoE models requires careful attention to numerical stability. But the efficiency gains are compelling enough that MoE has become standard at the frontier.

State Space Models

What if we abandoned attention entirely?

State Space Models (SSMs), particularly Mamba, offer an alternative. Instead of computing all-pairs attention, they maintain a compressed state that evolves as tokens flow through. This gives O(N) complexity—linear in sequence length.

The mathematics draw from control theory and signal processing: continuous-time systems discretized for sequence modeling. The key innovation is making state transitions input-dependent, allowing the model to selectively remember or forget information based on what it sees.
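A toy recurrence illustrates the idea. This is a deliberately simplified sketch, not Mamba's actual parameterization (which discretizes continuous-time dynamics); the point is the input-dependent gate and the single O(N) pass over the sequence:

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_gate):
    """Minimal selective state-space recurrence: the state update is
    gated by the current input, so the model can choose to remember
    or forget. One pass over the sequence -> O(N) in sequence length."""
    n, d = x.shape
    state = np.zeros(A.shape[0])
    outputs = np.zeros((n, d))
    for t in range(n):
        gate = 1 / (1 + np.exp(-(x[t] @ W_gate)))  # input-dependent forgetting
        state = gate * (A * state) + x[t] @ W_B    # compressed running state
        outputs[t] = state @ W_C                   # read out from the state
    return outputs

n, d, s = 16, 8, 4                                 # 16 tokens, 8-dim, 4 state channels
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
A = rng.uniform(0.5, 0.99, size=s)                 # per-channel decay
W_B, W_C, W_gate = (rng.standard_normal(sh) for sh in [(d, s), (s, d), (d, s)])
y = selective_ssm(x, A, W_B, W_C, W_gate)
print(y.shape)   # (16, 8)
```

Note that the state has a fixed size regardless of sequence length: everything the model remembers must fit through that compressed bottleneck, which is exactly the trade-off discussed below.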


Early results are promising. Mamba matches transformers on many language modeling benchmarks while being significantly faster on long sequences. But attention has a key advantage: it can directly compare any two positions in a sequence. SSMs must propagate information through state, which may limit certain capabilities.

The field is too young to declare winners. The most capable models may end up combining attention and SSM components, using each where it excels.

Multimodal Transformers

Here is the insight that defines modern AI: the transformer is a general sequence processor.

Text is a sequence of tokens. Images are sequences of patches. Audio is a sequence of frames. Video is a sequence of images. The transformer does not care what the tokens represent—it finds patterns in sequences.
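For images, the conversion step is patchification, as in Vision Transformers: cut the image into fixed-size patches and flatten each into a vector. A minimal sketch with standard ViT dimensions:

```python
import numpy as np

def patchify(image, patch):
    """Split an image into non-overlapping patches and flatten each
    into a vector: the image becomes a sequence of 'tokens'."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]       # drop any ragged edge
            .reshape(rows, patch, cols, patch, C)     # carve out the grid
            .transpose(0, 2, 1, 3, 4)                 # group by patch
            .reshape(rows * cols, patch * patch * C)) # one row per patch

image = np.zeros((224, 224, 3))   # a standard ViT-sized input
tokens = patchify(image, 16)
print(tokens.shape)               # (196, 768): 14x14 patches, each a 768-dim vector
```

From here the patch vectors are linearly projected to the model dimension and fed to the transformer exactly as word embeddings would be.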

[Interactive figure: Multimodal Inputs. Text (words & sentences) is embedded into a sequence of 12 token-embedding vectors and fed to the transformer: the same architecture, attention over sequences.]

The transformer does not know what modality it is processing. It sees sequences of vectors—and finds patterns. This generality is why the same architecture works for text, images, audio, and video.

This insight unlocked multimodal models:

  • GPT-4V and Gemini process text and images in the same model
  • Whisper handles audio with the same architecture
  • DALL-E and Stable Diffusion generate images from text

The pattern is always the same: convert the input modality into a sequence of embeddings, process with transformers, decode back to the target modality. The architecture remains remarkably consistent across wildly different tasks.

This generality matters. We did not need to invent separate architectures for vision, language, and audio. The same pattern—attention over sequences—handles them all. Future multimodal models will likely unify even more modalities: video, 3D, music, code execution traces. The transformer has become a universal interface for intelligence.

Open Questions

For all the progress, fundamental questions remain unanswered:

Why do transformers generalize so well? The training objective is simple: predict the next token. Yet models develop emergent capabilities far beyond this objective—reasoning, planning, even theory of mind. Why does next-token prediction on internet text create systems that can write poetry, debug code, and explain quantum physics?

What are the limits of scaling? The scaling laws suggest consistent improvement with more compute. But these are empirical observations, not physical laws. Where do the curves bend? What capabilities require qualitatively different approaches?

How do we make models more reliable? Current models confidently state falsehoods, struggle with multi-step reasoning, and can be manipulated by adversarial inputs. Solving these problems may require fundamentally new techniques—or may fall out naturally from scale. We do not know.

What is actually happening inside these models? Interpretability research has revealed fascinating structures: circuits that perform specific tasks, attention heads that track syntax, neurons that activate for concepts. But we are far from understanding how billions of parameters compose to produce intelligent behavior.

The field moves fast. Answers that seem settled get overturned. Capabilities that seem impossible become routine. This chapter will be outdated by the time you read it. That velocity is itself a defining feature of the current moment.

Course Summary

We have traveled from the sequence problem to the frontier of AI research. Let us revisit the key insights:

[Interactive figure: Course Journey. Foundations—sequence problem, word order matters, tokenization, embeddings—through Attention, Architecture, Variants, and Real-World Systems.]

You now understand the architecture behind modern AI. These fundamentals—attention, embeddings, next-token prediction—will remain relevant as the field continues to evolve.

Foundations: We started with the challenges of processing sequences—how word order matters, how context disambiguates meaning, why simple approaches fail on long-range dependencies.

Attention: The core innovation. Instead of processing sequences left-to-right, let every position attend to every other. Attention patterns emerge from learning, not from architectural constraints. The model discovers what relationships matter.

Architecture: The transformer block combines attention with feed-forward networks, using residual connections and layer normalization for stable training. Stacking these blocks creates depth without gradient degradation.

Variants: BERT showed bidirectional understanding. GPT showed autoregressive generation. The same core architecture, applied differently, solved fundamentally different problems.

Real-World Systems: We saw how transformers train, scale, and deploy. We learned about fine-tuning, alignment, RAG, and tool calling—the full stack of modern AI systems.

The fundamentals remain essential even as the field evolves. Attention is still attention. Embeddings are still embeddings. The training objective—predict tokens—has not changed since GPT-1. Understanding these fundamentals prepares you for whatever comes next.

Key Takeaways

  • Attention's O(N²) complexity motivates research into efficient alternatives
  • Mixture of Experts allows massive models with manageable inference costs
  • State Space Models offer O(N) complexity as an alternative to attention
  • Transformers are general sequence processors—the same architecture handles text, images, and audio
  • Fundamental questions about generalization, scaling limits, and reliability remain open
  • The field moves fast, but the fundamentals—attention, embeddings, next-token prediction—remain constant
  • Understanding these foundations prepares you for the innovations still to come